Ordinary Least Squares Regression in Python from Scratch

This article introduces how to calculate the coefficients for an Ordinary Least Squares regression in Python using only the NumPy package. NumPy is the fundamental package for scientific computing with Python. It performs in some way similar to R. First, let us import the NumPy package.

# Import NumPy
import numpy as np

Then, let's generate some toy data to play with.

# Generate some random data
n, k = 100, 2  # set the dimensions of the design matrix
beta = np.array([1, 1, 10])  # set the true coefficients
x = np.concatenate([np.ones((n, 1)), np.random.randn(n, k)], axis=1)  # generate random x
y = np.matmul(x, beta) + np.random.randn(n)  # generate random y

OLS Estimator

Now we have an \(n \times k\) matrix \(x\) as the design matrix and an \(n \times 1\) vector \(y\) as the vector of the endogenous variable. The OLS estimator can be calculated using the formula below:

\begin{equation*} \hat{\beta} = (x'x)^{-1}x'y. \end{equation*}

Next, we calculate the value of the OLS estimator with the formula above using NumPy.

# Calculate OLS regression coefficients
beta_hat = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.array(x).transpose(), np.array(x))), x.transpose()), y)
beta_hat

array([ 1.05552416,  1.0684565 , 10.08328924])

The values of the calculated OLS coefficients using the simulated toy data are shown above. We can further develop it into a function and add other functionalities like calculating the variance of the OLS estimate.

Variance of Estimator

Next let us calculate the variance of the OLS estimate. From the formula of the OLS estimator above, we can derive the formula of its variance to be

\begin{equation*} Var(\hat{\beta})=\sigma^2(X'X)^{-1}, \end{equation*}

where \(\sigma^2\) is the variance of the error term. We can use the sample variance of the error term

\begin{equation*} \hat{\sigma}^2 = \frac{\sum\hat{u}_i^2}{n-k-1}, \end{equation*}

where \(\hat{u_i}\) are the residuals from the OLS regression as an estimate of its true variance.

# Calculate Variance of OLS estimate
residual = y - np.matmul(x, beta_hat)  # calculate the residual
sigma_hat = sum(residual ** 2) / (n - k - 1)  # estimate of error term variance
variance_beta_hat = sigma_hat * np.linalg.inv(np.matmul(x.transpose(), x))  # Calculate variance of OLS estimate
variance_beta_hat

array([[ 0.01182717,  0.00117423, -0.00012874],
       [ 0.00117423,  0.01178575,  0.00108286],
       [-0.00012874,  0.00108286,  0.01194932]])