4.2. Simple Linear Regression (Statistical View)
🪄 Step 1: Intuition & Motivation
Core Idea: Simple linear regression is the bridge between correlation (how two variables move together) and prediction (how one can forecast the other).
It answers the question:
“If $X$ changes by one unit, how much does $Y$ typically change?”
In plain words — it’s the statistical version of drawing the best-fitting straight line through your data.
Simple Analogy: Imagine you’re a weather reporter trying to predict tomorrow’s temperature ($Y$) from today’s temperature ($X$). You plot the past data points — they look roughly linear. Now, you draw a line that captures the trend as best as possible — not passing through every point, but minimizing the “miss” overall. That line is your regression line.
🌱 Step 2: Core Concept
What’s Happening Under the Hood?
Linear regression models the relationship between a dependent variable $Y$ and an independent variable $X$:
$$ Y = \beta_0 + \beta_1 X + \epsilon $$
- $\beta_0$ → Intercept (value of $Y$ when $X = 0$)
- $\beta_1$ → Slope (expected change in $Y$ per unit change in $X$)
- $\epsilon$ → Random error (the “noise” we can’t explain)
We find $\beta_0$ and $\beta_1$ such that the sum of squared residuals (differences between predicted and actual $Y$ values) is minimized. That’s why it’s called the Least Squares Method.
Why It Works This Way
By minimizing squared errors, the regression line gives the “average best guess” of $Y$ for any $X$.
Squaring errors ensures:
- Positive and negative errors don’t cancel out.
- Larger deviations are penalized more (so the line stays close to most points).
This approach naturally leads to a closed-form (algebraic) solution — elegant, simple, and efficient.
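As a quick illustration, here is a minimal sketch (NumPy, with made-up toy numbers) showing that the best-fitting line can be obtained directly in one step, with no iterative optimization:

```python
import numpy as np

# Toy data (hypothetical values, for illustration only)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# np.polyfit with deg=1 solves the least-squares problem in closed form
slope, intercept = np.polyfit(x, y, deg=1)
print(f"fitted line: y = {intercept:.2f} + {slope:.2f} x")
```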
How It Fits in ML Thinking
Linear regression is both a statistical tool and a machine learning model:
- In statistics → the goal is interpretation (what does $\beta_1$ mean?).
- In ML → the goal is prediction (how well does it generalize?).
Understanding its assumptions, derivation, and limitations prepares you for modern models like logistic regression and neural networks, which are all built on similar optimization ideas.
📐 Step 3: Mathematical Foundation
⚙️ 1. Deriving Regression Coefficients via Least Squares
Derivation Step-by-Step
The regression equation:
$$ \hat{Y_i} = \beta_0 + \beta_1 X_i $$
We minimize the Sum of Squared Errors (SSE):
$$ SSE = \sum_{i=1}^{n} (Y_i - \hat{Y_i})^2 = \sum (Y_i - \beta_0 - \beta_1 X_i)^2 $$
Differentiate with respect to $\beta_0$ and $\beta_1$, and set each derivative to zero:
$$ \frac{\partial SSE}{\partial \beta_0} = -2 \sum (Y_i - \beta_0 - \beta_1 X_i) = 0 $$
$$ \frac{\partial SSE}{\partial \beta_1} = -2 \sum X_i (Y_i - \beta_0 - \beta_1 X_i) = 0 $$
Solving these gives:
$$ \hat{\beta_1} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} $$
$$ \hat{\beta_0} = \bar{Y} - \hat{\beta_1}\bar{X} $$
Interpretation:
- $\hat{\beta_1}$ equals the sample covariance of $X$ and $Y$ divided by the sample variance of $X$.
- If $X$ and $Y$ move together strongly, the slope is large.
- The line “leans” more when the scatterplot looks diagonal, and less when it is flat or noisy.
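To make the estimators concrete, here is a minimal sketch (hypothetical temperature numbers, NumPy assumed) that applies the two formulas above and cross-checks them against NumPy's built-in least-squares fit:

```python
import numpy as np

# Hypothetical temperatures: today's (x) vs. tomorrow's (y)
x = np.array([12.0, 15.0, 14.0, 18.0, 20.0, 17.0])
y = np.array([13.5, 15.2, 14.1, 19.0, 21.3, 16.8])

x_bar, y_bar = x.mean(), y.mean()

# beta1_hat = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
beta1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# beta0_hat = y_bar - beta1_hat * x_bar
beta0_hat = y_bar - beta1_hat * x_bar

print(beta1_hat, beta0_hat)
print(np.polyfit(x, y, deg=1))  # should agree: [slope, intercept]
```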
📉 2. Residuals and Model Fit
Definition & Role
Residuals ($e_i$):
$$ e_i = Y_i - \hat{Y_i} $$
They represent the unexplained part of the data — how far each observation is from the regression line.
Good Fit: Small, randomly scattered residuals around zero. Bad Fit: Large or patterned residuals (systematic bias).
Residual analysis helps check model assumptions and diagnose issues (e.g., heteroscedasticity, autocorrelation).
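A minimal residual check might look like the sketch below (same kind of toy data, NumPy assumed). With a fitted intercept the residuals always average to zero, so the diagnostic value lies in their pattern, not their mean:

```python
import numpy as np

x = np.array([12.0, 15.0, 14.0, 18.0, 20.0, 17.0])
y = np.array([13.5, 15.2, 14.1, 19.0, 21.3, 16.8])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x
residuals = y - y_hat          # e_i = Y_i - Y_hat_i

print(residuals)
print(residuals.mean())        # ~0 by construction when an intercept is included
# A plot of residuals vs. x (or vs. y_hat) should show no visible pattern
```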
💡 3. Coefficient of Determination ($R^2$)
Definition & Interpretation
$R^2$ measures how much of the variation in $Y$ is explained by $X$:
$$ R^2 = 1 - \frac{SSE}{SST} $$
where:
- $SSE = \sum (Y_i - \hat{Y_i})^2$ (unexplained variation)
- $SST = \sum (Y_i - \bar{Y})^2$ (total variation)
$R^2$ ranges from 0 to 1.
- $R^2 = 0$: model explains nothing.
- $R^2 = 1$: model perfectly fits the data.
Caution: A high $R^2$ doesn’t mean the model is correct — only that it fits this data well.
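Computing $R^2$ directly from its definition is straightforward; a short sketch (toy data, NumPy assumed):

```python
import numpy as np

x = np.array([12.0, 15.0, 14.0, 18.0, 20.0, 17.0])
y = np.array([13.5, 15.2, 14.1, 19.0, 21.3, 16.8])

slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

sse = np.sum((y - y_hat) ** 2)       # unexplained variation
sst = np.sum((y - y.mean()) ** 2)    # total variation
r_squared = 1 - sse / sst
print(r_squared)
```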
⚖️ 4. Key Statistical Assumptions
The Big Four
Linear regression relies on four main assumptions:
- Linearity: Relationship between $X$ and $Y$ is linear.
- Independence: Residuals are independent (no autocorrelation).
- Homoscedasticity: Constant variance of residuals across all values of $X$.
- Normality: Residuals are normally distributed.
Violation Example — Heteroscedasticity: If residuals’ spread increases with $X$, confidence intervals and p-values become unreliable — your model’s certainty is overstated.
Fixes:
- Transform data (e.g., log-scale).
- Use robust regression or weighted least squares (see the sketch below).
When variance grows with $X$, the model’s “voice” shakes — it’s confident where data is dense, uncertain where it’s sparse.
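One of those fixes, weighted least squares, can be sketched as follows (a toy simulation, assuming the statsmodels package; the $1/X^2$ weights are just one illustrative choice for noise whose spread grows with $X$):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 200)
# Simulated heteroscedastic data: noise spread grows with x
y = 2.0 + 0.5 * x + rng.normal(scale=0.3 * x)

X = sm.add_constant(x)                            # adds the intercept column
ols_fit = sm.OLS(y, X).fit()                      # ordinary least squares
wls_fit = sm.WLS(y, X, weights=1.0 / x**2).fit()  # down-weights the noisier points

print("OLS coefficients:", ols_fit.params, "std errors:", ols_fit.bse)
print("WLS coefficients:", wls_fit.params, "std errors:", wls_fit.bse)
```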
🔄 5. Statistical vs. Machine Learning Regression
Two Philosophies of Regression
| Aspect | Statistical Regression | Machine Learning Regression |
|---|---|---|
| Goal | Understand relationships | Predict new outcomes |
| Focus | Inference (significance, confidence) | Performance (error minimization) |
| Assumptions | Strong (linearity, normality) | Flexible (can be nonlinear) |
| Techniques | OLS, t-tests, ANOVA | Regularization, Gradient Descent |
| Output | Coefficients + interpretation | Predictions + metrics (RMSE, MAE) |
Example:
- Statistician: “Is $X$ a significant predictor of $Y$?”
- ML Engineer: “Does adding $X$ improve my test set RMSE?”
Both use regression — but their intent differs.
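To see the two philosophies side by side, here is a hedged sketch on synthetic data (assuming statsmodels for the inferential fit and scikit-learn for the predictive one):

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=200)
y = 3.0 + 1.2 * x + rng.normal(scale=1.0, size=200)

# Statistical view: fit on all the data, inspect coefficients, std errors, p-values
ols = sm.OLS(y, sm.add_constant(x)).fit()
print(ols.params, ols.bse, ols.pvalues)

# ML view: hold out a test set and score out-of-sample predictive error
X = x.reshape(-1, 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print("test RMSE:", rmse)
```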
🧠 Step 4: Assumptions or Key Ideas
- Residuals must be independent and normally distributed.
- Relationship must be roughly linear.
- Variance of errors must be constant (no heteroscedasticity).
- Outliers can heavily distort slope and intercept.
⚖️ Step 5: Strengths, Limitations & Trade-offs
- Simple, interpretable, and mathematically elegant.
- Foundation for many ML models.
- Coefficients have direct, intuitive meaning.
- Sensitive to outliers and assumption violations.
- Struggles with nonlinear or complex relationships.
- Can mislead if residuals show patterns or non-constant variance.
🚧 Step 6: Common Misunderstandings
- “High R² means a good model.” → Not always — overfitting or omitted variables can fool you.
- “Regression implies causation.” → Regression shows association, not direction or cause.
- “Residuals must all be zero.” → No — they must just be random and centered around zero.
🧩 Step 7: Mini Summary
🧠 What You Learned: Linear regression models relationships between variables through a best-fitting line that minimizes squared errors.
⚙️ How It Works: Coefficients are estimated via least squares; model validity depends on assumptions about residuals and linearity.
🎯 Why It Matters: It’s the statistical backbone of prediction and inference — simple, powerful, but easily misunderstood when assumptions break.