Linear Regression Explained

An algorithm is a set of rules that a computer can follow to solve a particular problem. In the context of modeling, each algorithm will use a different approach to obtain a function which best represents the target variable. Recall that regression algorithms explore the relationship between independent variable(s) and a continuous dependent variable.

Linear regression models a linear relationship between the dependent variable and at least one independent / explanatory variable. A linear regression model with one explanatory variable is known as simple linear regression while one with multiple variables is known as multiple linear regression.

When fitted to a training set, the linear regression model estimates unknown parameters that quantify how the dependent variable changes, on average, with movements in the independent variable(s).

Given observed targets \(y_1, \ldots, y_n\) and explanatory variables \(x_{i1}, \ldots, x_{ij}\) for each observation \(i\) (there are \(n\) observations and \(j\) explanatory variables), the linear model assumes that each observation can be modelled as:

$$y_i = \beta_0 + \beta_1 x_{i1} + ... + \beta_j x_{ij} + \epsilon_i$$

The coefficients \(\hat{\boldsymbol{\beta}} = [\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_j]\) are the estimated parameters. Once the coefficients are estimated, we can predict the target for future observations so long as we know their explanatory variables.

The \(\epsilon\) (pronounced epsilon) is the error term; its estimated counterparts are known as residuals. The error is the difference between the true target value and the predicted target value.

The target is predicted as \(\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \ldots + \hat{\beta}_j x_{ij}\), which gives us the residual \(\epsilon_i = y_i - \hat{y}_i\).
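
To make the notation concrete, here is a minimal sketch of the prediction and residual step in NumPy, using made-up coefficients and data (none of these numbers come from a fitted model):

```python
import numpy as np

# Hypothetical coefficients [beta_0, beta_1, beta_2] -- arbitrary, for illustration only
beta_hat = np.array([2.0, 0.5, -1.2])

# Three observations with two explanatory variables each, plus observed targets
X = np.array([[1.0, 3.0],
              [2.0, 1.0],
              [4.0, 2.0]])
y = np.array([-1.0, 1.5, 2.0])

# Prepend a column of ones so beta_0 acts as the intercept
X_design = np.column_stack([np.ones(len(X)), X])

y_hat = X_design @ beta_hat   # predicted targets
residuals = y - y_hat         # epsilon_i = y_i - y_hat_i
print(y_hat)
print(residuals)
```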

How are the coefficients calculated?

In a nutshell, we choose the "optimal" coefficients to minimize the error terms.

There are several different methods of doing this. One of the standard approaches is the Ordinary Least Squares (OLS) estimation method. The OLS method makes several assumptions:

  • the data can be modelled using a linear relationship, similar to the form shown above
  • the error terms \(\epsilon_i\) are independent and identically distributed with a population mean of zero
  • the variance of the error term must be constant (homoscedasticity); it cannot be a function of the target
  • the independent variables are uncorrelated with the error term and with each other
  • ideally, the error term should be normally distributed; this allows us to reliably calculate confidence intervals

Under these assumptions, we can construct a closed-form formula for the coefficients that minimize the sum of squared errors.

$$ \hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$

If you're not interested in the math behind this derivation, feel free to skip to the next section. You need some prior linear algebra knowledge to understand it.

The derivation begins with a goal: to minimize the sum of squared residuals.

Recall that we can measure the error as \(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\).

The sum of squared errors can be defined as a function of the coefficients. In matrix notation:

$$\begin{aligned} S(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X} \boldsymbol{\beta})^\top (\mathbf{y} - \mathbf{X} \boldsymbol{\beta}) \\ &= \mathbf{y}^\top \mathbf{y} - \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\beta}^\top \mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} \end{aligned}$$

This is a convex quadratic function of \(\boldsymbol{\beta}\), so its global minimum occurs where the derivative with respect to \(\boldsymbol{\beta}\) is zero. Taking the derivative:

$$\begin{aligned} \frac{dS(\boldsymbol{\beta})}{d\boldsymbol{\beta}} &= 0 - 2\mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{X} \boldsymbol{\beta} \end{aligned}$$

Setting this derivative to zero, and using the OLS assumptions to treat \(\mathbf{X}\) as having full column rank (which makes \(\mathbf{X}^\top \mathbf{X}\) invertible), we can solve for \(\hat{\boldsymbol{\beta}}\):

$$\begin{aligned} 0 &= - 2\mathbf{X}^\top \mathbf{y} + 2\mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} \\ \mathbf{X}^\top \mathbf{X} \hat{\boldsymbol{\beta}} &= \mathbf{X}^\top \mathbf{y} \\ \hat{\boldsymbol{\beta}} &= (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y} \end{aligned}$$
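
If you prefer to see this in code, here is a minimal sketch in NumPy on synthetic data (the seed, numbers, and variable names are made up for illustration); it compares the explicit closed-form solution with np.linalg.lstsq, which is the more numerically stable route in practice:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 100 observations, 2 explanatory variables
n = 100
X = rng.normal(size=(n, 2))
true_beta = np.array([3.0, 1.5, -2.0])        # [intercept, beta_1, beta_2]
X_design = np.column_stack([np.ones(n), X])    # prepend intercept column
y = X_design @ true_beta + rng.normal(scale=0.5, size=n)

# Closed-form OLS: beta_hat = (X'X)^-1 X'y
beta_hat = np.linalg.inv(X_design.T @ X_design) @ X_design.T @ y

# np.linalg.lstsq solves the same least-squares problem without an explicit inverse
beta_lstsq, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(beta_hat)     # should be close to [3.0, 1.5, -2.0]
print(beta_lstsq)
```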

Model Interpretation

The simplicity of linear regression is a double-edged sword. The linear model is incredibly easy to interpret and understand, but it is also a simplification of real-world complexity. Most real-world datasets will also violate some of the assumptions made by the OLS method.

The importance of model interpretation should not be underestimated. Being able to interpret the coefficients allows us to use the model to drive business strategy, quantify the effects of our efforts, and give stakeholders trust in our algorithms.

Let's go back to the example of Jim and his model for predicting house prices. His model predicts house price as a function of square footage and age. Let \(y\) represent the price of a home, \(x_1\) the square footage, and \(x_2\) the age in years.

$$\hat{y} = 1000 + 500 x_1 - 1.2 x_2$$

Jim's model tells us a lot about house prices:

  • For every unit increase in square footage, the average home price increases by $500
  • For every unit increase in house age, the average home price drops by $1.2

This information can be leveraged to increase a home's valuation. The reality is not so simple, but if adding a square foot of extra space only costs $200, wouldn't you want to increase the size of your home to sell it for a higher value?
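
As a quick sanity check on these interpretations, the sketch below plugs two hypothetical houses into Jim's formula (the house attributes are made up; the coefficients come from the equation above):

```python
def jims_price(sqft, age):
    """Jim's model: price = 1000 + 500 * sqft - 1.2 * age."""
    return 1000 + 500 * sqft - 1.2 * age

base = jims_price(sqft=1500, age=20)      # hypothetical 1,500 sq ft, 20-year-old house
bigger = jims_price(sqft=1501, age=20)    # same house with one extra square foot

print(bigger - base)   # 500.0 -- matches the square-footage coefficient
```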

Tacking on Interaction Effects & Categorical Variables

Jim realizes that the interaction effect between size and age is very important. He adds the product \(x_1 x_2\) as a predictor and retrains his model. His new formula becomes:

$$\hat{y} = 1000 + 400 x_1 - 1.1 x_2 + 0.1 x_1 x_2$$
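
In code, an interaction term is simply an extra column equal to the product of the two original features. A minimal sketch with made-up data:

```python
import numpy as np

# Made-up design matrix: columns are [square footage, age]
X = np.array([[1500.0, 20.0],
              [2000.0,  5.0],
              [1200.0, 40.0]])

sqft, age = X[:, 0], X[:, 1]

# Add an intercept column and the interaction column sqft * age
X_with_interaction = np.column_stack([np.ones(len(X)), sqft, age, sqft * age])
print(X_with_interaction)
```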

Jim decides that he wants to incorporate the type of home as a predictor in his model. Every house is either a townhouse, a semi-detached home, or a detached home. He decides to treat home type as a categorical variable and encode it with indicator (dummy) variables.

The new model will leverage these additional predictors:

  • \(x_3\): 1 if the house is a townhouse, otherwise 0
  • \(x_4\): 1 if the house is semi-detached, otherwise 0
  • \(x_5\): 1 if the house is detached, otherwise 0
This augments his formula again:

$$\hat{y} = 1000 + 450 x_1 - 1.1 x_2 + 0.2 x_1 x_2 - 6000 x_3 + 1000 x_4 + 5000 x_5$$

This tells us that, all else being equal, a detached home will on average be $4,000 more expensive than a semi-detached home and $11,000 more expensive than a townhouse.
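
One common way to build indicator variables like \(x_3\), \(x_4\), and \(x_5\) is one-hot encoding. The sketch below uses pandas.get_dummies on a made-up table of homes (the column names and values are illustrative):

```python
import pandas as pd

# Made-up data: one categorical column of home types
homes = pd.DataFrame({
    "sqft": [1500, 2000, 1200],
    "age": [20, 5, 40],
    "home_type": ["townhouse", "semi-detached", "detached"],
})

# One indicator column per category (x_3, x_4, x_5 in Jim's formula)
dummies = pd.get_dummies(homes["home_type"], prefix="type")
homes_encoded = pd.concat([homes.drop(columns="home_type"), dummies], axis=1)
print(homes_encoded)
```

In practice, one of the indicator columns is often dropped (get_dummies has a drop_first option) so that the indicators are not perfectly collinear with the intercept.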

Common Pitfalls

  • multicollinearity: the independent variables can be highly correlated, which violates the OLS assumptions (maybe they built bigger houses back in the day, allowing us to predict square footage from home age; a quick check for this is sketched after this list)
  • heteroscedasticity: non-constant variance also violates the OLS assumptions (perhaps the older the home, the more variance in the price)

Some other tidbits to think about:

  • does our linear model need an intercept? (should a house with no square footage at zero years old be worth $1000? Is that still a house?)
  • what is the range of plausible values for our target variable? House prices certainly cannot be negative ...
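
As a quick, informal check for the multicollinearity pitfall above, we can look at the correlation between predictors. The sketch below simulates the "bigger houses back in the day" scenario with made-up data:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up scenario: older homes tend to have more square footage
age = rng.uniform(0, 100, size=200)
sqft = 1200 + 10 * age + rng.normal(scale=100, size=200)

# A correlation near -1 or 1 is a warning sign of multicollinearity
corr = np.corrcoef(sqft, age)[0, 1]
print(f"correlation between sqft and age: {corr:.2f}")
```

With more than two predictors, variance inflation factors (for example, statsmodels' variance_inflation_factor) give a fuller picture.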


