In a previous article, we calculated the OLS estimate and its variance using only the Numpy package in Python. In most cases, we would also need the t-statistic and its p-value to see if the coefficients we got are statistically significant. To achieve this, we will use SciPy, a powerful scientific computing library in Python.
First, we import the class of student's t-distributed random variables from SciPy. In order to use the OLS estimate and variance estimate we calculated using NumPy. We also need to import the NumPy package.
# Import SciPy and NumPy from scipy.stats import t # We only need the t class from scipy.stats import numpy as np
Next, let's generate some toy data and calculate the OLS estimate and variance estimate as before.
# Generate some random data n, k = 100, 2 # set the dimensions of the design matrix beta = np.array([1, 1, 10]) # set the true coefficients x = np.concatenate([np.ones((n, 1)), np.random.randn(n, k)], axis=1) # generate random x y = np.matmul(x, beta) + np.random.randn(n) # generate random y # Calculate OLS regression coefficients beta_hat = np.matmul(np.matmul(np.linalg.inv(np.matmul(np.array(x).transpose(), np.array(x))), x.transpose()), y) # Calculate Variance of OLS estimate residual = y - np.matmul(x, beta_hat) # calculate the residual sigma_hat = sum(residual ** 2) / (n - k - 1) # estimate of error term variance variance_beta_hat = sigma_hat * np.linalg.inv(np.matmul(x.transpose(), x)) # Calculate variance of OLS estimate
We can get the t-statistic by dividing an OLS estimate by its standard error like below.
# Calculate the t-statistic for each coefficient t_stat = beta_hat / np.sqrt(variance_beta_hat.diagonal()) t_stat
array([ 10.94146498, 12.18736475, 107.65469311])
Now, let us calculate the p-value for a two-sided test for each t-statistic using the cumulative distribution function (CDF) of the student's t-distribution.
# Get the p-value for each t-statistic p_value = 1 - 2 * np.abs(0.5 - np.vectorize(t.cdf)(t_stat, n - k - 1)) p_value
array([0., 0., 0.])
Since all the parameters are indeed in the data generating process (DGP), the coefficients we got are super significant.