Back to Basics - Linear Regression Overview


A data scientist's first love (or hate) is ultimately linear regression. It's generally the first "complex" model learned in a statistics course, and the first "simple" model learned in a data science course. Let's go over some of the basics.

 

What is linear regression?

Linear regression aims to study the relationship between two sets of variables. The first set is made up of one variable: y. y is referred to as the dependent variable, the endogenous variable or the "response". The second set is made up of one or more variables: X. X is referred to as the independent variable, the exogenous variable or the "predictor". The output is a best fit line that passes through the data in order to approximate the linear relationship between the dependent and independent variables.
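
As a quick illustration, here is a minimal sketch of finding a best fit line for a single predictor. It uses NumPy and entirely made-up numbers, so treat it as an illustration rather than a real analysis:

```python
import numpy as np

# Made-up example data: one independent variable (x) and one dependent variable (y).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8, 12.1])

# np.polyfit with degree 1 returns the slope and intercept of the best fit line.
slope, intercept = np.polyfit(x, y, deg=1)
print(f"best fit line: y = {intercept:.2f} + {slope:.2f} * x")
```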

Linear regression can be used in the context of inferential and predictive statistics.

What does that mean? In this context, inferential statistics means understanding how the independent variables relate to and affect the dependent variable, while predictive statistics means predicting the dependent variable given the independent variables.

Inference example: How much does a one unit increase in stress (x) affect the amount of chocolate I eat (y)?

Prediction example: If I eat one bar of chocolate a day and drink one glass of wine (maybe three) a day, how many pounds will I gain?
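
To make that distinction concrete, here is a hedged sketch using statsmodels on a made-up stress-and-chocolate dataset (the numbers are invented purely for illustration, and statsmodels is just one common choice of tool):

```python
import numpy as np
import statsmodels.api as sm

# Made-up data: daily stress level (x) and chocolate bars eaten (y).
stress = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
chocolate = np.array([0.5, 1.1, 1.4, 2.2, 2.4, 3.1, 3.3, 4.0])

X = sm.add_constant(stress)          # add an intercept column
model = sm.OLS(chocolate, X).fit()   # fit by ordinary least squares

# Inference: the slope estimates how a one unit increase in stress
# changes the expected amount of chocolate eaten.
print("estimated effect of one unit of stress:", model.params[1])

# Prediction: expected chocolate consumption at a new stress level of 10
# (first column is the intercept, second is stress).
print("predicted bars at stress = 10:", model.predict([[1.0, 10.0]])[0])
```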

The most common method of fitting a linear regression is OLS (ordinary least squares). This method minimizes the sum of squared residuals (the loss function), where a residual is the vertical distance from the best fit line to an individual data point. However, there are more robust alternatives to the least squares method.
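
Minimizing the sum of squared residuals has a well-known closed-form solution, the normal equations: β̂ = (XᵀX)⁻¹Xᵀy. Here is a small NumPy sketch of that computation, again on invented data:

```python
import numpy as np

# Invented data: design matrix with an intercept column and two predictors.
rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta + rng.normal(scale=0.3, size=n)

# Normal equations: beta_hat = (X'X)^(-1) X'y minimizes the sum of squared residuals.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

residuals = y - X @ beta_hat
print("estimated coefficients:", beta_hat)
print("sum of squared residuals:", np.sum(residuals ** 2))
```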

 

The math.

Linear regression assumes that the relationship between the dependent variable, y, and the transpose of a vector of independent variables, x, is linear in the parameters:

yᵢ = xᵢᵀβ + εᵢ = β₀ + β₁xᵢ₁ + ... + βₚxᵢₚ + εᵢ

Epsilon (ε) represents the error term, the part of y that the predictors do not explain. (The residuals are the estimated errors once the line has been fit.)
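
Stacking all n observations, the same model is usually written compactly in matrix form (a standard formulation, shown here in LaTeX):

```latex
% Matrix form of the linear model: n observations stacked into vectors/matrices
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon},
\qquad
\mathbf{y},\ \boldsymbol{\varepsilon} \in \mathbb{R}^{n}, \quad
\mathbf{X} \in \mathbb{R}^{n \times (p+1)}, \quad
\boldsymbol{\beta} \in \mathbb{R}^{p+1}
```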

 

The Gauss-Markov Assumptions.

There are 6 assumptions that ensure that our OLS estimator is BLUE (the Best Linear Unbiased Estimator). If you studied anything statistics-heavy and don't have these memorized - shame on you! (just kidding - I had to google them)

  1. Model is correctly specified: the endogenous variable is on the left and the exogenous variables are on the right, the model is linear in parameters (the coefficients enter linearly, even if the variables themselves are transformed) and has an additive error term.
  2. Random sampling: the data are a random sample from the population, and there is variation in the independent variables.
  3. No perfect multicollinearity: no exact linear relationship between the independent variables (no variable can be written as an exact linear combination of the others).
  4. Zero conditional mean: the expected error given the predictors is zero, E(ε|X) = 0.
  5. No heteroskedasticity: the variance of the error terms is constant, Var(ε|X) = σ².
  6. No autocorrelation: the errors are uncorrelated with each other, Cov(εᵢ, εⱼ|X) = 0 for i ≠ j. (Normality of the errors, ε ~ N(0, σ²), is often assumed on top of this so that the usual t-tests and confidence intervals are exact, but it is not required for the estimator to be BLUE.)

If assumptions 1-6 are not violated, then we can safely say that our OLS estimator is BLUE. However, if assumptions 5 and/or 6 are violated, the estimator is still unbiased and consistent - it is just no longer the most efficient linear unbiased estimator, and the usual standard errors can be misleading.
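
In practice, violations of assumptions 5 and 6 are usually checked with residual diagnostics. Here is a hedged sketch using statsmodels on simulated data; the tests shown (Breusch-Pagan for heteroskedasticity, Durbin-Watson for autocorrelation) are one common choice, not the only one:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson

# Simulated data for illustration only.
rng = np.random.default_rng(42)
n = 200
x = rng.normal(size=n)
X = sm.add_constant(x)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)

results = sm.OLS(y, X).fit()

# Breusch-Pagan: tests whether the residual variance depends on the predictors
# (a small p-value suggests heteroskedasticity, i.e. assumption 5 is violated).
bp_stat, bp_pvalue, _, _ = het_breuschpagan(results.resid, results.model.exog)
print("Breusch-Pagan statistic:", bp_stat, "p-value:", bp_pvalue)

# Durbin-Watson: values near 2 suggest no first-order autocorrelation
# in the residuals (relevant to assumption 6).
print("Durbin-Watson statistic:", durbin_watson(results.resid))
```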

 

To be continued.

In my next post about linear regression, I will discuss evaluation methods, dealing with outliers, variations, and use cases.