1. What are the measures of central tendency, and how do they differ from each other?

Measures of central tendency describe the typical or central value of a dataset. The most common measures of central tendency are the mean, median and mode.

Mean: The mean is the arithmetic average of all the values in a dataset. It uses every data point, so it is a comprehensive measure of central tendency, but it is strongly influenced by outliers and skewed data.

Median: The median is the middle value when the dataset is arranged in order from least to greatest. If there is an even number of values, the median is the average of the two middle values. The advantage of the median is that it is robust to outliers and skewed data. However, it depends only on the middle of the ordered data rather than on the magnitude of every value, so it carries less information than the mean.

Mode: The mode is the value that occurs most frequently in the dataset. If no value occurs more than once, there is no mode. This is useful for categorical data to know the most common category. This may not be useful if all values occur with the same frequency or if the data is continuous without repeated values.

Mode = Value that occurs most frequently
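As a quick illustration of how the three measures react to an outlier, here is a minimal sketch using Python's standard `statistics` module. The sample values are made up for this example.

```python
import statistics

# Hypothetical sample: mostly small values plus one extreme outlier (100).
values = [2, 3, 3, 4, 5, 100]

mean_value = statistics.mean(values)      # uses every value, so the outlier pulls it up
median_value = statistics.median(values)  # average of the two middle values (3 and 4)
mode_value = statistics.mode(values)      # most frequent value

print("mean:", mean_value)
print("median:", median_value)
print("mode:", mode_value)
```

The mean lands far above most of the values, while the median and mode stay near the bulk of the data.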

2. How do you interpret the standard deviation in the context of data variability?

The standard deviation ($\sigma$) is the square root of the variance and is a measure of how spread out the values in a dataset are. It expresses dispersion in the same unit as the data, but it can be influenced by outliers.

A smaller standard deviation indicates lower variability: the data points cluster closely around the mean. A larger standard deviation indicates higher variability: the data points are spread further from the mean.

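How the standard deviation should be read also depends on the shape of the distribution. Common types of data distributions include Normal, Skewed, Uniform, Bimodal, Exponential, Log-normal, Poisson and Binomial distributions. For an approximately normal distribution, about 68% of the values fall within one standard deviation of the mean and about 95% within two.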
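To make the interpretation concrete, the sketch below compares two made-up datasets that share the same mean but differ in spread. It uses `statistics.pstdev` (population standard deviation); the data and variable names are assumptions for illustration only.

```python
import statistics

tight = [48, 49, 50, 51, 52]   # values close to the mean
wide = [10, 30, 50, 70, 90]    # values far from the mean

tight_sd = statistics.pstdev(tight)
wide_sd = statistics.pstdev(wide)

print("means:", statistics.mean(tight), statistics.mean(wide))
print("standard deviations:", tight_sd, wide_sd)
```

Both datasets have a mean of 50, but the second has a much larger standard deviation, reflecting its higher variability.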

3. What is a box plot, and what information can you extract from it?

A box plot, also called a box-and-whisker plot, is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It is a powerful graphical representation that helps in understanding the spread, central tendency, and variability of a dataset. Box plots are particularly useful for comparing distributions between several groups or datasets.

From a box plot one can read off the central tendency (the median), the spread and variability (the box and whisker lengths), the symmetry or skewness of the distribution, and any outliers in the dataset.
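The five-number summary behind a box plot can be computed directly. The following is a minimal sketch using `statistics.quantiles` with the "inclusive" method; quartile conventions differ between tools, so other software may report slightly different Q1 and Q3 values. The function name and data are illustrative.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return min(data), q1, median, q3, max(data)

summary = five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6])
print(summary)
```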

4. Explain the significance of the interquartile range (IQR) and how it is used to detect outliers

The Interquartile Range (IQR) is a measure of statistical dispersion, which represents the range within which the central 50% of the data lies. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1).

The most common method is to define outliers as data points that lie beyond 1.5 times the IQR above the third quartile or below the first quartile.

Lower Boundary: Q1 − 1.5 × IQR

Upper Boundary: Q3 + 1.5 × IQR

Any data point below the lower boundary or above the upper boundary is considered an outlier.
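The rule above translates into a few lines of code. This is a sketch under the assumption that quartiles are computed with the "inclusive" method of `statistics.quantiles`; the helper name `iqr_outliers` and the sample data are made up.

```python
import statistics

def iqr_outliers(data):
    """Return the outliers and the (lower, upper) boundaries of the 1.5 x IQR rule."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in data if x < lower or x > upper]
    return outliers, (lower, upper)

data = [10, 12, 13, 14, 15, 16, 18, 19, 60]
outliers, bounds = iqr_outliers(data)
print("boundaries:", bounds)
print("outliers:", outliers)
```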

5. How Do Maximum Likelihood Estimators (MLE) Work?

Maximum Likelihood Estimation (MLE) is a method used in statistics to estimate the parameters of a statistical model. The likelihood function gives the probability (or probability density) of the observed data as a function of the parameter values, so it measures how well each candidate parameter setting explains the data. MLE chooses the parameter values that maximize this function; in practice the log-likelihood is usually maximized instead, because it has the same maximizer and is easier to work with.
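As a small illustration, consider estimating the probability $p$ of heads from coin flips. The sketch below evaluates the Bernoulli log-likelihood on a grid of candidate values and picks the best one; the data, grid, and function name are assumptions chosen for the example. For this model the maximizer is known to equal the sample proportion, which gives a way to check the result.

```python
import math

def bernoulli_log_likelihood(p, flips):
    """Log-likelihood of the 0/1 outcomes in `flips` when P(outcome = 1) = p."""
    return sum(math.log(p) if x == 1 else math.log(1 - p) for x in flips)

flips = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]  # hypothetical data: 7 ones out of 10

# Candidate parameter values strictly between 0 and 1.
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: bernoulli_log_likelihood(p, flips))

print("MLE from grid search:", p_hat)
print("sample proportion:", sum(flips) / len(flips))
```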

Assignment 2:

a. What does linear regression try to optimize?

Linear regression tries to optimize the parameters of the model so as to minimize the cost function, which is typically the Mean Squared Error (MSE) between the predicted values and the actual values.

Mathematically, linear regression aims to minimize the cost function $J(\theta)$, which is defined as:
$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
where:
$m$ is the number of training examples.
$h_\theta(x^{(i)})$ is the hypothesis (predicted value) for the $i$ -th training example, given by $h_{\theta}(x^{(i)}) = \theta_{0} + \theta_{1} x_{1}^{(i)} + \theta_{2} x_{2}^{(i)} + \dots + \theta_{n} x_{n}^{(i)}$
$y^{(i)}$ is the actual value for the $i$ -th training example.
$\theta$ represents the parameters of the model.
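The cost function $J(\theta)$ is half the average of the squared differences between the predicted values and the actual values. The factor $\frac{1}{2}$ only simplifies the derivative and does not change which $\theta$ minimizes the cost.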
The goal is to find the parameter values $\theta$ that result in the smallest possible value of $J(\theta)$ . This is achieved by using optimization algorithms such as gradient descent.
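The cost function can be written out directly from its definition. The sketch below is plain Python with illustrative function names and a tiny made-up dataset generated from $y = 2x$.

```python
def predict(theta, x):
    """h_theta(x) = theta_0 + theta_1 * x_1 + ... + theta_n * x_n."""
    return theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))

def cost(theta, X, y):
    """J(theta) = 1 / (2m) * sum of squared prediction errors."""
    m = len(y)
    return sum((predict(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * m)

X = [[1], [2], [3]]   # one feature per example
y = [2, 4, 6]         # generated from y = 2x

print(cost([0, 2], X, y))  # parameters that fit the data exactly
print(cost([0, 1], X, y))  # worse parameters give a larger cost
```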

b. Is it possible to use linear regression to represent quadratic equations? Explain with an example.

Yes, it is possible to use linear regression to represent quadratic equations by introducing polynomial features. Although linear regression is inherently linear, you can extend it to fit non-linear data by creating new features that are polynomial terms of the original features.

For example, to fit a quadratic equation of the form $y = ax^2 + bx + c$, we can create new features based on the original feature $x$ as below:

$x_1 = x, \quad x_2 = x^2$

Now, you can use these new features in a linear regression model:

$y = \theta_0 + \theta_1 x_1 + \theta_2 x_2$

This is still a linear regression, because the model is linear in the parameters $\theta$, with $\theta_0 = c$, $\theta_1 = b$ and $\theta_2 = a$.
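To see this in action, the sketch below generates points from a known quadratic, builds the features $x_1 = x$ and $x_2 = x^2$, and fits an ordinary least squares model by solving the normal equations. The small Gaussian elimination solver, the function names, and the chosen quadratic are all assumptions for illustration; in practice a numerical library would normally be used.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_linear(features, y):
    """Least squares via the normal equations (X^T X) theta = X^T y."""
    X = [[1.0] + list(row) for row in features]  # prepend the intercept column
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

xs = [-2, -1, 0, 1, 2, 3]
ys = [2 * x**2 + 3 * x + 1 for x in xs]  # data from y = 2x^2 + 3x + 1
features = [[x, x**2] for x in xs]       # x_1 = x, x_2 = x^2

theta = fit_linear(features, ys)
print(theta)  # approximately [c, b, a] = [1, 3, 2]
```

The fitted parameters recover the coefficients of the quadratic, even though the fitting procedure is purely linear.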

c. Why is it crucial to detect and remove outliers?

Detecting and removing outliers is crucial for several reasons:
1. Model Accuracy

Distortion of Model Parameters: Outliers can significantly influence the estimated coefficients in linear regression, leading to biased results and a model that does not generalize well to new data. For example, if you have an outlier with a very large value, it can pull the regression line towards it, distorting the true relationship between the variables.

2. Statistical Validity

Assumptions of Statistical Methods: Many statistical methods, including linear regression, make assumptions about the data (e.g., normally distributed residuals, homoscedasticity). Outliers can violate these assumptions, leading to invalid statistical inferences and unreliable p-values and confidence intervals.

3. Improved Performance

Algorithm Efficiency: Outliers can slow down the convergence of gradient-based optimization algorithms (e.g., gradient descent) used in training machine learning models. Removing outliers can help the algorithm converge faster and find a better solution.

4. Better Insights

Data Interpretation: Outliers can obscure the true nature of the data. By removing them, you can get a clearer understanding of the underlying patterns and relationships in the data, leading to more accurate and actionable insights.

5. Robustness

Model Robustness: Outliers can make the model overly sensitive to a few extreme values, reducing its robustness. A model that performs well without outliers is likely to be more stable and reliable.

Examples
Linear Regression Example: Suppose you have a dataset of house prices based on their size. Most houses fall within a certain price range, but there is one mansion with an exceptionally high price. Including this mansion in your dataset can skew the regression line upwards, leading to inaccurate predictions for more typical houses.
Optimization Example: In gradient descent, the presence of outliers can cause large gradients that lead to significant jumps in parameter updates, potentially overshooting the minimum of the cost function. Removing outliers helps in stabilizing the learning process and achieving better convergence.
Techniques for Handling Outliers
Statistical Methods:

Z-Score: Identify and remove data points with z-scores beyond a certain threshold.

IQR Method: Identify and remove data points that lie beyond 1.5 times the interquartile range (IQR) from the first and third quartiles.

Model-Based Methods:

Robust Regression: Use regression techniques that are less sensitive to outliers, such as RANSAC or Huber Regression.

Transformation: Apply transformations (e.g., log transformation) to reduce the impact of outliers.

Detecting and removing outliers ensures that the models are accurate, efficient, and provide reliable predictions and insights.
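The house-price example can be sketched numerically. The code below fits a simple least squares line with and without one extreme point; the sizes, prices, and the `fit_line` helper are made up for illustration and are not taken from any real dataset.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical houses: size in square metres, price in thousands.
sizes = [50, 60, 70, 80, 90]
prices = [100, 120, 140, 160, 180]  # exactly price = 2 * size

_, slope_clean = fit_line(sizes, prices)

# Add one "mansion" priced far above what its size suggests.
_, slope_with_outlier = fit_line(sizes + [100], prices + [1000])

print("slope without outlier:", slope_clean)
print("slope with outlier:", slope_with_outlier)
```

A single extreme point changes the slope several-fold, which is the distortion described above.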

d. What is feature scaling? When is it required?

Feature scaling is the process of normalizing or standardizing the range of independent variables or features in your dataset. It ensures that all features contribute equally to the model and helps improve the performance and convergence speed of various machine learning algorithms.
Common methods for feature scaling include:
Min-Max Scaling (Normalization):

$x' = \frac{x - \min(x)}{\max(x) - \min(x)}$

This scales the features to a fixed range, typically [0, 1].
Standardization (Z-score Normalization):

$x' = \frac{x - \mu}{\sigma}$

This transforms the data to have a mean of 0 and a standard deviation of 1.
Robust Scaling:

$x' = \frac{x - \text{median}(x)}{\text{IQR}(x)}$

This method uses the median and the interquartile range (IQR) and is less sensitive to outliers.
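The three scaling methods can be sketched in a few lines of plain Python. The function names and sample heights are illustrative; the sketch assumes the feature is not constant (otherwise the denominators would be zero) and uses the "inclusive" quartile convention of `statistics.quantiles`.

```python
import statistics

def min_max_scale(xs):
    """Scale values to the range [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    """Shift to mean 0 and scale to (population) standard deviation 1."""
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]

def robust_scale(xs):
    """Centre on the median and scale by the interquartile range."""
    q1, median, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return [(x - median) / (q3 - q1) for x in xs]

heights = [150, 160, 170, 180, 200]  # hypothetical heights in cm

print(min_max_scale(heights))
print(standardize(heights))
print(robust_scale(heights))
```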
When is Feature Scaling Required?
Feature scaling is particularly important in the following scenarios:
Distance-Based Algorithms:

K-Nearest Neighbors (KNN): KNN uses distance metrics to find the nearest neighbors. If features are on different scales, the distance calculations will be dominated by the features with larger scales.

Support Vector Machines (SVM): SVMs maximize the margin between classes, which involves distance calculations that can be skewed by features on different scales.

Gradient-Based Optimization Algorithms:

Gradient Descent: Algorithms like gradient descent used in linear regression, logistic regression, and neural networks converge faster with scaled features. Without scaling, some weights might update too slowly (due to small feature values), and others might update too quickly (due to large feature values).

Principal Component Analysis (PCA):

PCA is sensitive to the variances of the features. Features with larger variances will dominate the principal components. Scaling ensures that each feature contributes equally to the analysis.

Neural Networks:

Neural networks perform better when input features are scaled. It helps in faster convergence during training as it prevents weights from oscillating during updates.

Example
Consider a dataset with two features: height (in centimeters) and weight (in kilograms). Suppose the ranges of these features are:

Height: 150-200 cm

Weight: 50-100 kg

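Without scaling, algorithms that rely on distance metrics or gradient-based optimization can be dominated by whichever feature has the larger numeric magnitude or spread. In this example both features span 50 units, but height takes larger raw values (150-200 versus 50-100), which mainly affects gradient-based optimization through larger gradient contributions.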
After scaling (e.g., using standardization), both features would be transformed to have a mean of 0 and a standard deviation of 1, making them comparable.

e. State two differences between linear regression and logistic regression.

Here are two key differences between linear regression and logistic regression:

1. Nature of the Output

Linear Regression:
Output Type: Predicts a continuous numerical value.
Use Case: Used for regression tasks where the goal is to predict a quantity, such as predicting house prices, temperatures, or sales figures.
Example: Given the features of a house (e.g., size, number of bedrooms), linear regression might predict the price of the house.
Logistic Regression:
Output Type: Predicts a probability that a given input belongs to a certain class, which is then used to classify the input into one of two (binary classification) or more (multi-class classification) discrete categories.
Use Case: Used for classification tasks where the goal is to predict a categorical outcome, such as whether an email is spam or not, whether a customer will churn, or whether a patient has a certain disease.
Example: Given features of an email (e.g., presence of certain keywords, length), logistic regression might predict the probability that the email is spam.

2. Cost Function

Linear Regression:
Cost Function: Uses Mean Squared Error (MSE) as the cost function. $J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$
Optimization Goal: Minimizes the sum of the squared differences between the predicted values and the actual values.
Logistic Regression:
Cost Function: Uses Log Loss (also called Cross-Entropy Loss) as the cost function. $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
Optimization Goal: Maximizes the likelihood of the observed data by minimizing the log loss, ensuring that the predicted probabilities are as close as possible to the actual class labels.
These differences highlight the distinct purposes and methodologies of linear and logistic regression, with linear regression focused on predicting continuous outcomes and logistic regression on predicting categorical probabilities.

f. Why is the Mean Square Error cost function unsuitable for logistic regression?

The Mean Squared Error (MSE) cost function is unsuitable for logistic regression for several reasons:

1. Non-linearity of the Hypothesis Function

Logistic Regression Hypothesis: The hypothesis function in logistic regression is a sigmoid function, which is non-linear: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$ This function outputs probabilities that lie between 0 and 1.
Gradient Descent with MSE: When using MSE with a non-linear hypothesis like the sigmoid function, the cost function becomes non-convex, leading to multiple local minima. This makes it difficult for gradient descent to find the global minimum, resulting in poor optimization.

2. Interpretation of Errors

Error Interpretation: In logistic regression, we are dealing with probabilities and binary outcomes. The errors are not simply the difference between predicted values and actual values but involve the likelihood of the observed data.
Log Loss: The cross-entropy loss (log loss) used in logistic regression is designed to handle probabilities and provides a more appropriate measure of the error. It penalizes incorrect classifications more heavily, especially when the model is confident but wrong.

3. Proper Gradient for Optimization

Gradient Calculation: The gradient of the log loss function aligns better with the goal of logistic regression, which is to maximize the likelihood of the observed data. The log loss function's gradient is more informative for updating the model parameters in the right direction.
Log Loss Function: $J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$ The log loss function is convex, meaning it has a single global minimum, which simplifies the optimization process and ensures that gradient descent converges reliably to the optimal solution.

Summary

In summary, the MSE cost function is unsuitable for logistic regression because it is not well-suited to the non-linear nature of the sigmoid hypothesis function, does not properly handle probability-based errors, and does not provide the appropriate gradients for effective optimization. Instead, logistic regression uses the log loss function, which is convex and better suited to the probabilistic framework of classification tasks.
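The difference in how the two losses treat a confident but wrong prediction can be seen numerically. The sketch below compares squared error and log loss for a positive example ($y = 1$) at several made-up scores $z = \theta^T x$; the function names are illustrative.

```python
import math

def sigmoid(z):
    """Logistic function mapping any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, p):
    """Cross-entropy loss for one example with label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error(y, p):
    """Squared difference between label and predicted probability."""
    return (y - p) ** 2

y_true = 1
for z in (2.0, 0.0, -2.0, -6.0):
    p = sigmoid(z)
    print(f"z={z:+.1f}  p={p:.4f}  "
          f"squared={squared_error(y_true, p):.4f}  log_loss={log_loss(y_true, p):.4f}")
```

The squared error can never exceed 1 because the prediction is a probability, whereas the log loss keeps growing as the model becomes more confidently wrong.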

Aspect	Linear Regression	Logistic Regression
Output Type	Continuous numerical value	Probability of class membership (0 to 1)
Use Case	Regression tasks (e.g., predicting house prices)	Classification tasks (e.g., spam detection)
Example	Predicting house prices based on features	Predicting if an email is spam based on features
Cost Function	Mean Squared Error (MSE)	Log Loss (Cross-Entropy Loss)
Cost Function Formula	$J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)})^2$	$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$
Optimization Goal	Minimize the sum of squared differences	Maximize the likelihood of the observed data

This table concisely captures the key differences between linear regression and logistic regression.

g. What can be inferred if the cost function initially decreases but then increases or gets stuck at a high value?

If the cost function initially decreases but then increases or gets stuck at a high value, several issues might be inferred:

1. Learning Rate Issues

Too High Learning Rate: If the learning rate is too high, the gradient descent algorithm may overshoot the minimum, causing the cost function to increase after initially decreasing. This can lead to the cost function oscillating or even diverging.
Too Low Learning Rate: A very low learning rate might cause the algorithm to get stuck in a plateau or a region where the cost function decreases very slowly, making it seem like the cost function is stuck at a high value.

2. Local Minima or Saddle Points

Local Minima: The optimization process might get stuck in a local minimum, especially if the cost function is non-convex. This local minimum might have a higher cost value compared to the global minimum.
Saddle Points: The algorithm might get stuck at a saddle point, where the gradient is zero but it is not a minimum, causing the optimization to halt prematurely.

3. Overfitting

Training vs. Validation Performance: If the cost function decreases on the training set but increases on the validation set, it indicates overfitting. The model might be learning the noise in the training data rather than the underlying pattern, leading to poor generalization.

4. Data Quality Issues

Noisy Data: High variance in the data or noisy data can cause the cost function to behave unpredictably. The model may fit to the noise initially, causing a decrease in the cost function, but as training progresses, the noise interferes, and the cost function increases.
Outliers: Outliers can have a significant impact on the cost function. Initially, the model may fit well to the majority of the data points, but as training continues, the influence of outliers can cause the cost function to increase or become unstable.

5. Inappropriate Model Complexity

Too Complex: A model that is too complex (e.g., too many parameters) might overfit the training data, causing the cost function to decrease initially but then increase as it fails to generalize to new data.
Too Simple: A model that is too simple might not be able to capture the underlying patterns in the data, causing the cost function to get stuck at a high value.

6. Optimization Algorithm Issues

Suboptimal Algorithm: The choice of optimization algorithm might be suboptimal for the given problem. Some algorithms might get stuck in local minima or saddle points more easily than others.

Example

Imagine training a neural network with a high learning rate. Initially, the cost function decreases rapidly, indicating that the model is learning. However, due to the high learning rate, the updates are too large, causing the model to overshoot the optimal parameters, and the cost function starts increasing.

Mitigation Strategies

Adjust the Learning Rate: Use a learning rate scheduler to adapt the learning rate during training or try different learning rates.
Regularization: Apply regularization techniques (e.g., L2 regularization, dropout) to prevent overfitting.
Data Preprocessing: Remove or handle outliers and normalize/standardize the data.
Use Advanced Optimizers: Use more sophisticated optimization algorithms like Adam, RMSprop, or others that handle local minima and saddle points better.
Cross-Validation: Monitor performance on a validation set to detect overfitting and adjust model complexity accordingly.
By addressing these issues, you can ensure more stable and reliable convergence of the cost function.
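The effect of the learning rate can be reproduced on a toy problem. The sketch below runs gradient descent on $J(w) = w^2$ with three made-up learning rates. It is a simplified illustration: on this one-dimensional cost an overly large rate makes the cost grow from the first step, whereas on real models the growth often starts only after an initial decrease.

```python
def run_gradient_descent(learning_rate, steps=20, w=1.0):
    """Minimise J(w) = w^2 (gradient 2w) and return the cost after each step."""
    costs = [w ** 2]
    for _ in range(steps):
        w -= learning_rate * 2 * w
        costs.append(w ** 2)
    return costs

good = run_gradient_descent(0.1)     # steady decrease towards 0
slow = run_gradient_descent(0.0001)  # barely moves, so the cost looks stuck
bad = run_gradient_descent(1.1)      # overshoots the minimum and diverges

print("good:", good[0], "->", good[-1])
print("slow:", slow[0], "->", slow[-1])
print("bad: ", bad[0], "->", bad[-1])
```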

h. Describe two ways to perform multi-class classification using logistic regression

Multi-class classification using logistic regression can be performed using two common techniques: One-vs-Rest (OvR) and Softmax Regression (also known as Multinomial Logistic Regression). Here’s a detailed description of each method:

1. One-vs-Rest (OvR)

Concept: In the One-vs-Rest approach, also known as One-vs-All (OvA), the multi-class classification problem is broken down into multiple binary classification problems. For each class, a separate logistic regression model is trained to distinguish that class from all other classes.

How it Works:

Step 1: For a dataset with $K$ classes, train $K$ binary classifiers. Each classifier $k$ is trained to predict whether an instance belongs to class $k$ or not.
- For class $k$ , create a binary label $y_k$ where $y_k = 1$ if the instance belongs to class $k$ and $y_k = 0$ otherwise.
- Train a logistic regression model on this binary classification problem.
Step 2: For a new instance, each of the $K$ classifiers outputs a probability that the instance belongs to its respective class.
Step 3: The class with the highest probability is chosen as the predicted class.

Example: Suppose we have a dataset with three classes: A, B, and C.

Train classifier $C_A$ to distinguish class A from classes B and C.
Train classifier $C_B$ to distinguish class B from classes A and C.
Train classifier $C_C$ to distinguish class C from classes A and B.
For a new instance, get the probabilities $P_A$ , $P_B$ , and $P_C$ from the respective classifiers and choose the class with the highest probability.
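The prediction step of One-vs-Rest can be sketched as follows. The three classifiers are assumed to be already trained; their biases and weights are hypothetical numbers chosen only to make the example run.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical trained binary classifiers: class label -> (bias, weights).
classifiers = {
    "A": (-1.0, [2.0, 0.0]),
    "B": (0.0, [-1.0, 1.0]),
    "C": (-0.5, [0.0, -2.0]),
}

def predict_ovr(x):
    """Score x with every binary classifier and return the most probable class."""
    probs = {
        label: sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
        for label, (bias, weights) in classifiers.items()
    }
    return max(probs, key=probs.get), probs

label, probs = predict_ovr([1.0, 0.5])
print(label, probs)
```

Because each classifier is trained independently, the $K$ probabilities do not have to sum to 1; only their ranking is used.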

2. Softmax Regression (Multinomial Logistic Regression)

Concept: Softmax regression extends logistic regression to handle multiple classes directly. Instead of breaking the problem into multiple binary classifications, it models the probability distribution over all classes.

How it Works:

Step 1: The model keeps one weight vector (including a bias term) per class, and all of them are trained jointly as a single model.
Step 2: For a given instance $x$ , compute the linear scores for each class $k$ : $z_k = \theta_k^T x$ where $\theta_k$ is the weight vector for class $k$ .
Step 3: Apply the softmax function to these scores to obtain the probabilities for each class: $P(y = k \mid x) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$ This ensures that the probabilities sum up to 1 and each probability is between 0 and 1.
Step 4: During training, use the cross-entropy loss function to optimize the model parameters. The cross-entropy loss for a single instance is: $J(\theta) = -\sum_{k=1}^{K} y_k \log(P(y = k \mid x))$ where $y_k$ is a binary indicator (0 or 1) if class $k$ is the correct classification for instance $x$ .

Example: Suppose we have a dataset with three classes: A, B, and C.

Compute the scores $z_A$ , $z_B$ , and $z_C$ for a given instance.
Apply the softmax function to get the probabilities $P_A$ , $P_B$ , and $P_C$ .
Predict the class with the highest probability.
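These steps can be sketched in code. The scores below are made-up values for $z_A$, $z_B$ and $z_C$; subtracting the maximum score before exponentiating is a common numerical-stability choice and does not change the resulting probabilities.

```python
import math

def softmax(scores):
    """Convert a list of real scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # shift by the max for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, true_index):
    """Loss for one instance: negative log-probability of the correct class."""
    return -math.log(probs[true_index])

classes = ["A", "B", "C"]
scores = [2.0, 1.0, 0.1]  # hypothetical z_A, z_B, z_C

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]

print("probabilities:", probs)
print("predicted class:", predicted)
print("loss if the true class is A:", cross_entropy(probs, 0))
```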

Summary

Method	One-vs-Rest (OvR)	Softmax Regression
Approach	Breaks down the problem into multiple binary classifications	Models the probability distribution over all classes
Training	Train $K$ binary classifiers, one for each class	Train a single model with a softmax output layer
Prediction	Choose the class with the highest probability from $K$ classifiers	Choose the class with the highest softmax probability
Complexity	Involves training multiple models	Involves a single model but with more complex computation
Loss Function	Binary cross-entropy for each classifier	Multinomial cross-entropy (softmax loss)

Both methods are widely used and have their own advantages depending on the specific problem and dataset characteristics.

Question 2:

Consider a linear regression model with two variables:

h(x) = w0 + w1.x1 + w2.x2;

which has been initialized with the following weights: w0 = 0; w1 = 1; w2 = 1. Consider the learning rate alpha (𝞪) = 0.0002. You are given the following data:

Write the values of the weights after performing the gradient descent algorithm for 2 iterations. Calculate the initial mean squared error before any iterations, and the final error after updating the weights for 2 iterations. Provide the values in tables like the following:

Answer: w0 w1 w2 After iteration 1 After Iteration 2 Initial Mean squared error Final Mean squared error

To solve this, we'll follow these steps:

Calculate the initial mean squared error.
Perform 2 iterations of gradient descent.
Update weights in each iteration.
Calculate the mean squared error after each iteration.

Given:

Initial weights: $w0 = 0$ , $w1 = 1$ , $w2 = 1$
Learning rate: $α = 0.0002$
Data: $\begin{array}{ccc} x1 & x2 & y \\ 60 & 22 & 140 \\ 67 & 24 & 159 \\ 71 & 15 & 192 \\ 75 & 20 & 200 \\ 78 & 16 & 212 \\ \end{array}$

Step 1: Calculate the initial mean squared error (MSE)

The hypothesis function: $h(x) = w0 + w1 \cdot x1 + w2 \cdot x2$

Given initial weights: $w0 = 0, \, w1 = 1, \, w2 = 1$

For each data point, calculate $h(x)$ and the squared error:

$\begin{align*} h(x_1) &= 60 + 22 = 82 & \text{(error)}^2 &= (140 - 82)^2 = 3364 \\ h(x_2) &= 67 + 24 = 91 & \text{(error)}^2 &= (159 - 91)^2 = 4624 \\ h(x_3) &= 71 + 15 = 86 & \text{(error)}^2 &= (192 - 86)^2 = 11236 \\ h(x_4) &= 75 + 20 = 95 & \text{(error)}^2 &= (200 - 95)^2 = 11025 \\ h(x_5) &= 78 + 16 = 94 & \text{(error)}^2 &= (212 - 94)^2 = 13924 \\ \end{align*}$

MSE:

$\text{MSE} = \frac{1}{5} (3364 + 4624 + 11236 + 11025 + 13924) = \frac{44173}{5} = 8834.6$

Step 2: Perform 2 iterations of gradient descent

Update rules for gradient descent:

$w_j := w_j - \alpha \cdot \frac{\partial J}{\partial w_j}$

where $J$ is the cost function:

$J = \frac{1}{2m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})^2$

Partial derivatives:

$\frac{\partial J}{\partial w0} = \frac{1}{m} \sum_{i=1}^{m} (h(x^{(i)}) - y^{(i)})$ $\frac{\partial J}{\partial w1} = \frac{1}{m} \sum_{i=1}^{m} ((h(x^{(i)}) - y^{(i)}) \cdot x1^{(i)})$ $\frac{\partial J}{\partial w2} = \frac{1}{m} \sum_{i=1}^{m} ((h(x^{(i)}) - y^{(i)}) \cdot x2^{(i)})$

Iteration 1:

Calculate partial derivatives:

$\begin{align*} \frac{\partial J}{\partial w0} &= \frac{1}{5} \left( (82 - 140) + (91 - 159) + (86 - 192) + (95 - 200) + (94 - 212) \right) \\ &= \frac{1}{5} \left( -58 - 68 - 106 - 105 - 118 \right) = \frac{1}{5} (-455) = -91 \\ \frac{\partial J}{\partial w1} &= \frac{1}{5} \left( (82 - 140) \cdot 60 + (91 - 159) \cdot 67 + (86 - 192) \cdot 71 + (95 - 200) \cdot 75 + (94 - 212) \cdot 78 \right) \\ &= \frac{1}{5} \left( -3480 - 4556 - 7526 - 7875 - 9204 \right) = \frac{1}{5} (-32641) = -6528.2 \\ \frac{\partial J}{\partial w2} &= \frac{1}{5} \left( (82 - 140) \cdot 22 + (91 - 159) \cdot 24 + (86 - 192) \cdot 15 + (95 - 200) \cdot 20 + (94 - 212) \cdot 16 \right) \\ &= \frac{1}{5} \left( -1276 - 1632 - 1590 - 2100 - 1888 \right) = \frac{1}{5} (-8486) = -1697.2 \\ \end{align*}$

Update weights:

$\begin{align*} w0 &:= 0 - 0.0002 \cdot (-91) = 0.0182 \\ w1 &:= 1 - 0.0002 \cdot (-6528.2) = 2.30564 \\ w2 &:= 1 - 0.0002 \cdot (-1697.2) = 1.33944 \\ \end{align*}$

Iteration 2:

Calculate partial derivatives with updated weights:

$\begin{align*} h(x_1) &= 0.0182 + 2.30564 \cdot 60 + 1.33944 \cdot 22 = 167.82428 \\ h(x_2) &= 0.0182 + 2.30564 \cdot 67 + 1.33944 \cdot 24 = 186.64264 \\ h(x_3) &= 0.0182 + 2.30564 \cdot 71 + 1.33944 \cdot 15 = 183.81024 \\ h(x_4) &= 0.0182 + 2.30564 \cdot 75 + 1.33944 \cdot 20 = 199.73 \\ h(x_5) &= 0.0182 + 2.30564 \cdot 78 + 1.33944 \cdot 16 = 201.28916 \\ \end{align*}$

$\begin{align*} \frac{\partial J}{\partial w0} &= \frac{1}{5} \left( (167.82428 - 140) + (186.64264 - 159) + (183.81024 - 192) + (199.73 - 200) + (201.28916 - 212) \right) \\ &= \frac{1}{5} \left( 27.82428 + 27.64264 - 8.18976 - 0.27 - 10.71084 \right) = \frac{1}{5} (36.29632) = 7.259264 \\ \frac{\partial J}{\partial w1} &= \frac{1}{5} \left( 27.82428 \cdot 60 + 27.64264 \cdot 67 - 8.18976 \cdot 71 - 0.27 \cdot 75 - 10.71084 \cdot 78 \right) \\ &= \frac{1}{5} \left( 1669.4568 + 1852.05688 - 581.47296 - 20.25 - 835.44552 \right) = \frac{1}{5} (2084.3452) = 416.86904 \\ \frac{\partial J}{\partial w2} &= \frac{1}{5} \left( 27.82428 \cdot 22 + 27.64264 \cdot 24 - 8.18976 \cdot 15 - 0.27 \cdot 20 - 10.71084 \cdot 16 \right) \\ &= \frac{1}{5} \left( 612.13416 + 663.42336 - 122.8464 - 5.4 - 171.37344 \right) = \frac{1}{5} (975.93768) = 195.187536 \\ \end{align*}$

Update weights:

$\begin{align*} w0 &:= 0.0182 - 0.0002 \cdot 7.259264 \approx 0.01674815 \\ w1 &:= 2.30564 - 0.0002 \cdot 416.86904 \approx 2.22226619 \\ w2 &:= 1.33944 - 0.0002 \cdot 195.187536 \approx 1.30040249 \\ \end{align*}$

Final Calculations

Initial MSE: 8834.6
Final MSE: To calculate this, we evaluate $h(x)$ with the weights obtained after the second update and then compute the MSE.

Final values:

$\begin{align*} h(x_1) &= 0.01674815 + 2.22226619 \cdot 60 + 1.30040249 \cdot 22 \approx 161.9616 \\ h(x_2) &= 0.01674815 + 2.22226619 \cdot 67 + 1.30040249 \cdot 24 \approx 180.1182 \\ h(x_3) &= 0.01674815 + 2.22226619 \cdot 71 + 1.30040249 \cdot 15 \approx 177.3037 \\ h(x_4) &= 0.01674815 + 2.22226619 \cdot 75 + 1.30040249 \cdot 20 \approx 192.6948 \\ h(x_5) &= 0.01674815 + 2.22226619 \cdot 78 + 1.30040249 \cdot 16 \approx 194.1600 \\ \end{align*}$

Final MSE:

$\begin{align*} \text{MSE} &= \frac{1}{5} \left( (140 - 161.9616)^2 + (159 - 180.1182)^2 + (192 - 177.3037)^2 + (200 - 192.6948)^2 + (212 - 194.1600)^2 \right) \\ &\approx \frac{1}{5} \left( 482.3108 + 445.9802 + 215.9817 + 53.3665 + 318.2673 \right) \\ &= \frac{1}{5} (1515.9065) = 303.1813 \end{align*}$

Summary:

Weights	w0	w1	w2
Iteration 1	0.0182	2.30564	1.33944
Iteration 2	0.016748	2.222266	1.300402

Error	Value
Initial MSE	8834.6
Final MSE	303.1813
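The hand calculation can be cross-checked with a short script. This is a minimal sketch in plain Python that applies the update rule and partial derivatives stated earlier to the five data points; the function names are illustrative, and the script simply prints what it computes so the values can be compared with the tables.

```python
# Data from the question: (x1, x2) pairs and targets y.
X = [(60, 22), (67, 24), (71, 15), (75, 20), (78, 16)]
y = [140, 159, 192, 200, 212]

def predict(w, x):
    """h(x) = w0 + w1 * x1 + w2 * x2."""
    return w[0] + w[1] * x[0] + w[2] * x[1]

def mse(w):
    """Mean squared error over the five data points."""
    return sum((predict(w, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)

def gradient_step(w, alpha):
    """One batch gradient descent update for J = 1/(2m) * sum of squared errors."""
    m = len(y)
    errors = [predict(w, xi) - yi for xi, yi in zip(X, y)]
    grads = [
        sum(errors) / m,
        sum(e * xi[0] for e, xi in zip(errors, X)) / m,
        sum(e * xi[1] for e, xi in zip(errors, X)) / m,
    ]
    return [wj - alpha * gj for wj, gj in zip(w, grads)]

w = [0.0, 1.0, 1.0]
alpha = 0.0002
print("initial MSE:", mse(w))
for iteration in (1, 2):
    w = gradient_step(w, alpha)
    print(f"after iteration {iteration}: w = {w}, MSE = {mse(w)}")
```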

Decision Tree:

1. What is a Decision Tree, and how does it make decisions during test time?

A Decision Tree is a non-parametric supervised learning algorithm that is used for both classification and regression tasks. It models decisions and their possible consequences as a tree-like structure, with nodes representing decisions based on feature values, branches representing the outcomes of those decisions, and leaf nodes representing the final prediction (class label or value).

Decision tree during test time:

· The input data point starts at the root node.

· The tree evaluates the feature value associated with the root node and moves down the branch corresponding to the outcome of that decision.

· This process continues down the tree, traversing nodes based on the feature values of the input data until a leaf node is reached.

· The prediction of the model is the value or class label stored in that leaf node.
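The traversal described above can be sketched with a tree stored as nested dictionaries. The tree, its feature names, thresholds, and labels are hypothetical and stand in for a tree that has already been trained.

```python
# A hypothetical, already-trained tree. Internal nodes test one feature
# against a threshold; leaf nodes store the predicted label.
tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf, following the branch chosen by each test."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

print(predict(tree, {"petal_length": 1.4, "petal_width": 0.2}))
print(predict(tree, {"petal_length": 4.5, "petal_width": 1.4}))
print(predict(tree, {"petal_length": 5.8, "petal_width": 2.2}))
```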

2. How does Bagging improve the performance of a Decision Tree?

Bagging or Bootstrap Aggregating enhances the performance of a Decision Tree by reducing its variance and increasing robustness.

· Bootstrap Sampling: Multiple subsets of the training data are created by sampling with replacement. Each subset may have duplicates of some data points and may miss others.

· Independent Trees: A Decision Tree is trained on each of these subsets independently. Because of the variance in training sets, the trees may differ from one another.

· Aggregation: For classification, the final prediction is made by majority voting across the trees; for regression, it’s the average of the predictions.

This process helps in reducing the variance of the model (which is high for single Decision Trees) by averaging out the errors of individual trees, thus making the final model more stable and less prone to overfitting.
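The three steps of bagging can be sketched on a toy problem. To keep the example short, the base learner is a one-split "stump" on a single feature rather than a full Decision Tree; the data, the stump rule, the number of models, and the random seed are all assumptions made for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Toy base learner: split halfway between the two class means."""
    zeros = [x for x, label in sample if label == 0]
    ones = [x for x, label in sample if label == 1]
    if not zeros or not ones:  # degenerate sample containing a single class
        only = 0 if zeros else 1
        return lambda x: only
    threshold = (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2
    return lambda x: 1 if x > threshold else 0

def majority_vote(predictions):
    """Return the most common prediction."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical 1-D training data: (feature value, class label).
data = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

rng = random.Random(0)
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

def bagged_predict(x):
    """Aggregate the individual predictions by majority vote."""
    return majority_vote([model(x) for model in models])

print(bagged_predict(1.5), bagged_predict(8.5))
```

For regression, the majority vote would be replaced by the average of the individual predictions.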

3. In what situations might a Decision Tree overfit the training data, and how can this be mitigated?

Overfitting occurs when a Decision Tree learns the noise and details in the training data to such an extent that it negatively impacts its performance on new, unseen data.

Situations leading to overfitting:

· Complex Trees: If the tree is allowed to grow without constraints, it may become very deep, modeling the noise in the training data.

· Small Training Data: When the dataset is small, the tree might learn the idiosyncrasies of the specific dataset rather than general patterns.

· High Dimensionality: When there are many features relative to the number of observations, the tree may overfit by splitting on irrelevant features.

Mitigation strategies:

· Pruning: Reduce the size of the tree by cutting off branches that have little importance. This can be achieved by post-pruning (after the tree is fully grown) or pre-pruning (by setting constraints like maximum depth or minimum samples per leaf).

· Cross-Validation: Use cross-validation to fine-tune the tree's hyperparameters, ensuring it generalizes well to unseen data.

· Use of Ensemble Methods: Employ methods like Bagging (e.g., Random Forest) or Boosting (e.g., AdaBoost) to aggregate multiple trees and reduce the likelihood of overfitting.
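To make pre-pruning concrete, here is a small sketch of a tree builder with `max_depth` and `min_samples_leaf` limits. It is an illustration under simplifying assumptions: a single numeric feature, misclassification count as the split criterion (real libraries typically use Gini impurity or entropy), and toy data with two deliberately flipped labels standing in for noise.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def n_errors(labels):
    """Misclassifications if a node predicts its majority label."""
    return len(labels) - Counter(labels).most_common(1)[0][1]

def build(data, depth=0, max_depth=None, min_samples_leaf=1):
    """Grow a tree on (x, label) pairs, stopping early at the pre-pruning limits."""
    labels = [y for _, y in data]
    if len(set(labels)) == 1 or (max_depth is not None and depth >= max_depth):
        return {"leaf": majority(labels)}
    best = None
    for t in sorted({x for x, _ in data})[:-1]:       # candidate thresholds
        left = [(x, y) for x, y in data if x <= t]
        right = [(x, y) for x, y in data if x > t]
        if min(len(left), len(right)) < min_samples_leaf:
            continue                                   # split would create a too-small leaf
        err = n_errors([y for _, y in left]) + n_errors([y for _, y in right])
        if best is None or err < best[0]:
            best = (err, t, left, right)
    if best is None:
        return {"leaf": majority(labels)}
    _, t, left, right = best
    return {"threshold": t,
            "left": build(left, depth + 1, max_depth, min_samples_leaf),
            "right": build(right, depth + 1, max_depth, min_samples_leaf)}

def count_leaves(node):
    if "leaf" in node:
        return 1
    return count_leaves(node["left"]) + count_leaves(node["right"])

def predict(node, x):
    while "leaf" not in node:
        node = node["left"] if x <= node["threshold"] else node["right"]
    return node["leaf"]

# Toy data: label is 0 for x < 10 and 1 otherwise, with two noisy labels flipped.
data = [(x, 0 if x < 10 else 1) for x in range(20)]
data[3] = (3, 1)
data[15] = (15, 0)

full = build(data)                   # unconstrained: grows until every leaf is pure
shallow = build(data, max_depth=1)   # pre-pruned: a single split

print(count_leaves(full), count_leaves(shallow))
print(predict(full, 3), predict(shallow, 3))
```

The unconstrained tree memorizes the flipped label at x = 3, while the depth-limited tree predicts the label of the surrounding region. In practice, the value of `max_depth` or `min_samples_leaf` would be chosen by cross-validation rather than fixed by hand.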

4. How does Random Forest differ from a single Decision Tree?

Random Forest is an ensemble method that builds multiple Decision Trees and combines their predictions to improve accuracy and robustness.

Key Differences:

· Multiple Trees vs. Single Tree: Random Forest consists of a large number of Decision Trees, while a single Decision Tree is a stand-alone model.

· Feature Subset Randomness: In Random Forest, each tree is trained on a bootstrap sample of the data and, crucially, considers only a random subset of the features at each split. This feature randomness makes the trees less correlated, which reduces the variance of the aggregated prediction.

· Aggregation of Predictions: Random Forest aggregates the predictions of all trees to make the final decision (majority voting for classification, averaging for regression), while a single Decision Tree provides one direct prediction.

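Random Forest generally provides better accuracy and robustness than a single Decision Tree because averaging many decorrelated trees reduces variance and, with it, the tendency to overfit. The trade-off is that the model is slower to train and harder to interpret than one tree.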
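The two sources of randomness in a Random Forest can be sketched as follows. The helper names are hypothetical, and the default of roughly the square root of the number of features per split is a common heuristic for classification, assumed here rather than taken from the text above.

```python
import random

def bootstrap_rows(n_rows, rng):
    """Row indices for one tree: drawn with replacement, so repeats are expected."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]

def candidate_features(n_features, rng, k=None):
    """Feature indices considered at one split: drawn without replacement."""
    if k is None:
        k = max(1, int(n_features ** 0.5))  # assumed heuristic: about sqrt(n_features)
    return rng.sample(range(n_features), k)

rng = random.Random(0)
rows = bootstrap_rows(10, rng)       # drawn once per tree
feats = candidate_features(9, rng)   # drawn again at every split of that tree
print(rows)
print(feats)
```

A single Decision Tree, by contrast, uses every row once and evaluates every feature at every split.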

5. What is the main idea behind Boosting in ensemble methods?

Boosting is an ensemble technique that aims to convert a set of weak learners into a strong learner by sequentially building models that focus on correcting the errors of the previous ones.

Core Concepts:

· Sequential Model Training: Boosting trains models one after another, and each new model tries to correct the errors made by the previous ones.

· Weighted Data Points: In AdaBoost-style boosting, all data points initially have equal weights. As models are added, the weights of misclassified points are increased so that subsequent models focus more on these difficult cases. (Gradient boosting achieves a similar effect by fitting each new model to the residual errors of the current ensemble.)

· Model Aggregation: The final prediction is a weighted combination of all the models' predictions, where more accurate models have more influence.

Effect: Boosting can significantly improve model accuracy, especially when the base models are simple and have high bias. However, because each round concentrates on the hardest examples, it is sensitive to noisy labels and outliers, and the risk of overfitting grows if the number of models (iterations) is too high.
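The reweighting idea can be illustrated with a compact AdaBoost sketch. It assumes labels in {-1, +1}, a single numeric feature, and one-split stumps as the weak learners; the function names and the toy data are chosen for this example only.

```python
import math

def stump_output(x, t, sign):
    """A stump predicts `sign` when x <= t and `-sign` otherwise."""
    return sign if x <= t else -sign

def fit_weighted_stump(xs, ys, w):
    """Return (weighted error, threshold, sign) of the stump with the lowest weighted error."""
    best = None
    for t in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(wi for x, y, wi in zip(xs, ys, w) if stump_output(x, t, sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1 / n] * n                                   # start with equal weights
    ensemble = []
    for _ in range(rounds):
        err, t, sign = fit_weighted_stump(xs, ys, w)
        err = min(max(err, 1e-10), 1 - 1e-10)         # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)       # more accurate stump, larger say
        ensemble.append((alpha, t, sign))
        # Raise the weight of misclassified points, lower it for correct ones.
        w = [wi * math.exp(-alpha * y * stump_output(x, t, sign))
             for x, y, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(alpha * stump_output(x, t, sign) for alpha, t, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy data that no single stump can classify perfectly.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    model = adaboost(xs, ys, rounds)
    mistakes = sum(predict(model, x) != y for x, y in zip(xs, ys))
    print(rounds, "round(s):", mistakes, "training errors")
```

On this toy data a single stump leaves three points misclassified, while the weighted vote of three stumps fits all ten training points, which shows how weak, high-bias learners combine into a stronger one.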

Interview Questions

2. How do you interpret the standard deviation in the context of data variability?

3. What is a box plot, and what information can you extract from it?

4. Explain the significance of the interquartile range (IQR) and how it is used to detect outliers

5. How Do Maximum Likelihood Estimators (MLE) Work?

Assignment 2:

What does linear regression try to optimize?

Is it possible to use linear regression to represent quadratic equations? Explain with an example.

d. What is feature scaling? When is it required?

e. State two differences between linear regression and logistic regression.

Here are two key differences between linear regression and logistic regression:

1. Nature of the Output

2. Cost Function

f. Why is the Mean Square Error cost function unsuitable for logistic regression?

The Mean Squared Error (MSE) cost function is unsuitable for logistic regression for several reasons:

1. Non-linearity of the Hypothesis Function

2. Interpretation of Errors

3. Proper Gradient for Optimization

Summary

g. What can be inferred if the cost function initially decreases but then increases or gets stuck at a high value?

If the cost function initially decreases but then increases or gets stuck at a high value, several issues might be inferred:

1. Learning Rate Issues

2. Local Minima or Saddle Points

3. Overfitting

Training vs. Validation Performance: If the cost function decreases on the training set but increases on the validation set, it indicates overfitting. The model might be learning the noise in the training data rather than the underlying pattern, leading to poor generalization.

4. Data Quality Issues

5. Inappropriate Model Complexity

6. Optimization Algorithm Issues

Suboptimal Algorithm: The choice of optimization algorithm might be suboptimal for the given problem. Some algorithms might get stuck in local minima or saddle points more easily than others.

Example

Mitigation Strategies

h. Describe two ways to perform multi-class classification using logistic regression

1. One-vs-Rest (OvR)

2. Softmax Regression (Multinomial Logistic Regression)

Summary

Step 1: Calculate the initial mean squared error (MSE)

Step 2: Perform 2 iterations of gradient descent

Iteration 1:

Iteration 2:

Final Calculations

Summary:

Decision Tree:
