**LINEAR REGRESSION**

**What is linear regression?**

In this notebook we learn how to perform linear regression on financial data using Python. Linear regression is a linear model, i.e. a model that assumes a linear relationship between the input variables (x) and the single output variable (y). More specifically, it assumes that y can be calculated as a linear combination of the input variables (x). The equation for the one-dimensional case is:

*y* = *β*_{0} + *β*_{1} * *x*

Let’s suppose we have two variables x, y with the following values:

| x   | y   |
|-----|-----|
| 5.3 | 5.8 |
| 1.1 | 1.5 |
| …   | …   |
| 7.8 | 8.3 |

If we plot these points we get:

The straight line shows the relationship between both variables using the following equation:

*y*_{i} = *β*_{0} + *β*_{1} * *x*_{i} + *error*

In this case one could say that Y depends on X (or vice versa) and that there is a linear relationship between the two. That statement should be proven, though. It is important to note that two variables can be closely related and still not have a linear relationship. In those cases, linear regression is not a good modeling choice at first glance. However, one can transform those variables and perhaps then they will have a linear relationship. Transformations that are frequently used include the logarithm and power laws. What we are saying here is that the term «linear regression» refers to models that are linear in the parameters

*β*_{0} and *β*_{1}.
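For example, a model such as *y* = *β*_{0} + *β*_{1} * log(*x*) is still «linear in the parameters», even though y and x are not related by a straight line. A minimal sketch with synthetic data (the coefficients 2 and 3 are invented for illustration):

```python
import numpy as np

#Synthetic data that follows y = 2 + 3*log(x) exactly (illustrative values)
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = 2 + 3 * np.log(x)

#Regressing y on log(x) is an ordinary linear regression in the transformed variable
slope, intercept = np.polyfit(np.log(x), y, 1)
print(intercept, slope)  #recovers 2 and 3
```

The model is nonlinear in x but linear in the parameters, so ordinary linear regression still applies after the transformation.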

How do you get that straight line and its parameters? The most commonly used method is called «Ordinary Least Squares» (OLS) and is illustrated in the chart below:

The method calculates the difference between each observed point and the one suggested by the line. We then square those differences and add them all up. The parameters that minimize that sum are the OLS parameters of the straight line.
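As a sketch of what the method does, we can compute the OLS parameters for the three sample points from the table above directly from the closed-form formulas of the one-dimensional case:

```python
import numpy as np

#The sample points from the table above
x = np.array([5.3, 1.1, 7.8])
y = np.array([5.8, 1.5, 8.3])

#Closed-form OLS estimates for the one-dimensional case
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
beta0 = y.mean() - beta1 * x.mean()
print(beta0, beta1)  #roughly 0.39 and 1.02

#The fitted line minimizes the sum of squared differences
residuals = y - (beta0 + beta1 * x)
print(np.sum(residuals**2))
```

np.polyfit(x, y, 1) would return the same two parameters.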

We import the following libraries to perform the linear regression analysis:

```
import warnings
warnings.filterwarnings('ignore')
#Import libraries
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats  #used later for the normal density and probability plots
from pandas_datareader import wb
from datetime import datetime
import matplotlib.pyplot as plt
```

**What kind of data should we use to do linear regression?**

We normally use cross-sectional data. That means that all the data have been collected at a single moment in time. Unfortunately, in finance most data come as time series. That doesn’t mean we can’t use linear regression for time series, but we need to be careful when applying the technique under such circumstances. As we are only getting familiar with linear regression, we will start with cross-sectional data.

We can use the World Bank database, which contains a lot of information on many different variables. We explained how to download data from this API in great detail **here**. You may have heard that education is one of the most important determinants of economic growth. Thus, a reasonable question to explore is whether the amount of money a government spends on education has any relationship with its GDP per capita.

- X: Education spending as % of GDP
- Y: GDP per capita

Now we will learn how to get the data using the pandas_datareader library in Python:

```
#Parameters to get the data from World Bank Database
start = datetime(1990, 1, 1)
end = datetime(2021, 12, 31)
indicator_id1 = 'NY.GDP.PCAP.KD' #GDP per capita
# indicator_id1 = 'NY.GDP.PCAP.PP.KD' #GDP per capita PPP
indicator_id2 = 'SE.XPD.TOTL.GD.ZS' #Government expenditure on education, total (% of GDP)
#Download data
gdp_per_capita = wb.download(indicator=indicator_id1, start=start, end=end, country='all')
gdp_per_capita.reset_index(inplace=True)
education_spending = wb.download(indicator=indicator_id2, start=start, end=end, country = 'all')
education_spending.reset_index(inplace=True)
```

```
#Filter the data for only one year - so that we get cross sectional data
gdp_per_capita_2018 = gdp_per_capita[gdp_per_capita['year']=='2018']
education_spending_2018 = education_spending[education_spending['year']=='2018']
```

```
gdp_per_capita_2018
```

|      | country                        | year | NY.GDP.PCAP.KD |
|------|--------------------------------|------|----------------|
| 3    | Africa Eastern and Southern    | 2018 | 1542.238527    |
| 35   | Africa Western and Central     | 2018 | 1830.664340    |
| 67   | Arab World                     | 2018 | 6353.722456    |
| 99   | Caribbean small states         | 2018 | 10151.041175   |
| 131  | Central Europe and the Baltics | 2018 | 14295.752003   |
| …    | …                              | …    | …              |
| 8355 | Virgin Islands (U.S.)          | 2018 | 35183.235752   |
| 8387 | West Bank and Gaza             | 2018 | 3417.794408    |
| 8419 | Yemen, Rep.                    | 2018 | 1194.820561    |
| 8451 | Zambia                         | 2018 | 1331.449244    |
| 8483 | Zimbabwe                       | 2018 | 1462.590279    |

266 rows × 3 columns

```
education_spending_2018
```

|      | country                        | year | SE.XPD.TOTL.GD.ZS |
|------|--------------------------------|------|-------------------|
| 3    | Africa Eastern and Southern    | 2018 | 4.739750          |
| 35   | Africa Western and Central     | 2018 | 3.071543          |
| 67   | Arab World                     | 2018 | NaN               |
| 99   | Caribbean small states         | 2018 | 4.315300          |
| 131  | Central Europe and the Baltics | 2018 | 4.243570          |
| …    | …                              | …    | …                 |
| 8355 | Virgin Islands (U.S.)          | 2018 | NaN               |
| 8387 | West Bank and Gaza             | 2018 | NaN               |
| 8419 | Yemen, Rep.                    | 2018 | NaN               |
| 8451 | Zambia                         | 2018 | 4.739750          |
| 8483 | Zimbabwe                       | 2018 | 2.050490          |

266 rows × 3 columns

```
#Merge the data into a single DataFrame
df_TOTAL = pd.merge(education_spending_2018,gdp_per_capita_2018[['country',indicator_id1]], left_on='country',right_on='country', how='left')
df_TOTAL.dropna(inplace=True)
df_TOTAL
```

|     | country                        | year | SE.XPD.TOTL.GD.ZS | NY.GDP.PCAP.KD |
|-----|--------------------------------|------|-------------------|----------------|
| 0   | Africa Eastern and Southern    | 2018 | 4.739750          | 1542.238527    |
| 1   | Africa Western and Central     | 2018 | 3.071543          | 1830.664340    |
| 3   | Caribbean small states         | 2018 | 4.315300          | 10151.041175   |
| 4   | Central Europe and the Baltics | 2018 | 4.243570          | 14295.752003   |
| 5   | Early-demographic dividend     | 2018 | 3.751430          | 3516.704413    |
| …   | …                              | …    | …                 | …              |
| 256 | Uruguay                        | 2018 | 4.613600          | 17440.381939   |
| 257 | Uzbekistan                     | 2018 | 4.835893          | 3062.486222    |
| 260 | Viet Nam                       | 2018 | 3.295370          | 3090.772595    |
| 264 | Zambia                         | 2018 | 4.739750          | 1331.449244    |
| 265 | Zimbabwe                       | 2018 | 2.050490          | 1462.590279    |

206 rows × 4 columns

```
#Keep only individual countries - the aggregate regions come first, countries start around row 50
df_countries = df_TOTAL.iloc[49:200].copy()
```

```
df_countries
```

|     | country                  | year | SE.XPD.TOTL.GD.ZS | NY.GDP.PCAP.KD |
|-----|--------------------------|------|-------------------|----------------|
| 56  | Argentina                | 2018 | 4.877740          | 13105.397163   |
| 57  | Armenia                  | 2018 | 2.255868          | 4215.899394    |
| 59  | Australia                | 2018 | 5.124310          | 58600.790356   |
| 60  | Austria                  | 2018 | 5.226550          | 46154.625096   |
| 61  | Azerbaijan               | 2018 | 2.455430          | 5262.368705    |
| …   | …                        | …    | …                 | …              |
| 247 | Turkiye                  | 2018 | 4.300040          | 12148.294250   |
| 249 | Turks and Caicos Islands | 2018 | 2.743770          | 25080.164489   |
| 251 | Uganda                   | 2018 | 2.130520          | 896.162532     |
| 252 | Ukraine                  | 2018 | 5.319920          | 2336.976562   |
| 254 | United Kingdom           | 2018 | 5.198070          | 46740.553593   |

151 rows × 4 columns

```
#Rename the columns
df_countries.rename(columns={indicator_id2: 'Educatspend', indicator_id1: 'Gdppercapita'}, inplace=True)
```

```
#Plot the data
fig = plt.figure(figsize=(10,5))
ax1 = fig.add_subplot(1, 1, 1)
ax1.scatter(df_countries['Educatspend'],df_countries['Gdppercapita'])
plt.box(on=None)
plt.show()
```

**Do we have a linear relationship here?**

A linear relationship is not easily observed here. We can use the Python library statsmodels to run the linear regression. That library provides the most useful statistics for analyzing this modelling technique. Let’s see:

```
#Import the function formula.api
import statsmodels.formula.api as sm
#Create a linear model - model1
model1 = 'Gdppercapita ~ Educatspend'
lm1 = sm.ols(formula = model1, data = df_countries).fit()
print('LINEAR REGRESSION - MODEL1')
print(lm1.summary())
print('-------------------------------------------------------------')
```

```
LINEAR REGRESSION - MODEL1
                            OLS Regression Results
==============================================================================
Dep. Variable:           Gdppercapita   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                 -0.003
Method:                 Least Squares   F-statistic:                    0.5567
Date:                Mon, 20 May 2024   Prob (F-statistic):              0.457
Time:                        15:39:15   Log-Likelihood:                -1721.6
No. Observations:                 151   AIC:                             3447.
Df Residuals:                     149   BIC:                             3453.
Df Model:                           1
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    1.285e+04   4655.644      2.761      0.006    3654.698    2.21e+04
Educatspend   734.8575    984.879      0.746      0.457   -1211.277    2680.992
==============================================================================
Omnibus:                       77.558   Durbin-Watson:                   1.799
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              240.836
Skew:                           2.117   Prob(JB):                     5.05e-53
Kurtosis:                       7.511   Cond. No.                         12.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
-------------------------------------------------------------
```

**What does all that output mean?**

We got a lot of information when we ran our linear regression. Let’s focus only on the most important parts: the coefficients (the estimates of $\beta _0$ and $\beta _1$), as well as their $p$-values and the $R$-squared.

**Coefficients**

The intercept is about 12,850 USD. This can be thought of as the baseline GDP per capita. (Frequently, the intercept does not have a meaningful interpretation – that is okay.) The slope (the coefficient for Education Spending) is 734.86. The interpretation of this coefficient is: if a country spends one percentage point more of its GDP on education, its GDP per capita is expected to be about 735 USD higher on average.

**p-values**

The null hypothesis we are testing here is

*H*_{0} : *β*_{1} = 0

and the alternative is

*H*_{1} : *β*_{1} ≠ 0

The $p$-value of $ \beta _1$ (given under the column $P > \vert t \vert $) is 0.457. Thus, it is not statistically significant at the 0.05 level, and we cannot reject the null hypothesis. This implies that education spending as a percentage of GDP does not explain the GDP per capita of a country.

There is no perfect or exact way of setting a statistically significant $p$-value threshold. It really depends on your needs as a data professional. $p$-values are used to assess how much confidence we can place in a specific coefficient. How certain do you need your interpretation of your coefficients to be? Generally, 0.05 is a common heuristic, but it is not a hard-line number.
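If you prefer to pull these numbers out programmatically instead of reading the printed summary, the fitted statsmodels results object exposes them as attributes. A small self-contained sketch with synthetic data (the variable names here are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

#Synthetic cross-sectional data: y = 1 + 2x plus noise (illustrative only)
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': np.linspace(0, 10, 50)})
df['y'] = 1.0 + 2.0 * df['x'] + rng.normal(0, 1, 50)

res = sm.ols(formula='y ~ x', data=df).fit()
print(res.params['x'])   #estimate of the slope
print(res.pvalues['x'])  #its p-value
print(res.rsquared)      #R-squared
print(res.conf_int())    #95% confidence intervals for the coefficients
```

The same attributes work on the models fitted above, e.g. lm1.pvalues['Educatspend'].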

**R-squared**

One of the key quantities to pay attention to when interpreting a regression table is $R$-squared. Note that the table shows both $R$-squared and adjusted $R$-squared; we will focus on $R$-squared. This quantity is always between 0 and 1. For the GDP per Capita vs. Education Spending model, it is 0.004 = 0.4%. In the world, there is a large variation in GDP per Capita: different countries have very different GDP per Capita.

An $R$-squared of 0.4% in this linear model means that almost none of the observed variation in GDP per Capita is explained by the model; we should probably look for another variable with higher explanatory power. The higher the $R$-squared, the higher the percentage of observed variation that can be explained by the model. Since model1 only explains about 0.4% of the variation, this motivates us to investigate whether factors other than Education Spending can explain the GDP per Capita differences, or whether a transformation of one of the variables yields a better fit. Let’s try a logarithmic transformation of GDP per capita:

```
#Log transformation
df_countries['ln_Gdppercapita'] = np.log(df_countries['Gdppercapita'])
```

```
#Plot the data
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
ax1.scatter(df_countries['Educatspend'],df_countries['ln_Gdppercapita'])
plt.box(on=None)
plt.show()
```

```
#Create a linear model - model2
model2 = 'ln_Gdppercapita ~ Educatspend'
lm2 = sm.ols(formula = model2, data = df_countries).fit()
print('LINEAR REGRESSION - MODEL2')
print(lm2.summary())
print('-------------------------------------------------------------')
```

```
LINEAR REGRESSION - MODEL2
                            OLS Regression Results
==============================================================================
Dep. Variable:        ln_Gdppercapita   R-squared:                       0.025
Model:                            OLS   Adj. R-squared:                  0.018
Method:                 Least Squares   F-statistic:                     3.817
Date:                Mon, 20 May 2024   Prob (F-statistic):             0.0526
Time:                        15:39:15   Log-Likelihood:                -267.85
No. Observations:                 151   AIC:                             539.7
Df Residuals:                     149   BIC:                             545.7
Df Model:                           1
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept       8.2358      0.307     26.839      0.000       7.629       8.842
Educatspend     0.1268      0.065      1.954      0.053      -0.001       0.255
==============================================================================
Omnibus:                       12.204   Durbin-Watson:                   1.901
Prob(Omnibus):                  0.002   Jarque-Bera (JB):                4.547
Skew:                          -0.032   Prob(JB):                        0.103
Kurtosis:                       2.152   Cond. No.                         12.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
-------------------------------------------------------------
```

That was a very small improvement.

We should look for other variables or explore another problem. Let’s see if GDP per capita has any relationship with life expectancy. You might think that the richer the country, the longer its people live. Let’s explore that problem:

```
#Another indicator
indicator_id3 = 'SP.DYN.LE00.IN' #Life expectancy
#Download the data
life_expectancy = wb.download(indicator=indicator_id3, start=start, end=end, country='all')
life_expectancy.reset_index(inplace=True)
life_expectancy_2018 = life_expectancy[life_expectancy['year']=='2018']
df_TOTAL2 = pd.merge(life_expectancy_2018,gdp_per_capita_2018[['country','NY.GDP.PCAP.KD']], left_on='country',right_on='country', how='left')
df_TOTAL2.dropna(inplace=True)
df_countries2 = df_TOTAL2.iloc[49:200].copy()
df_countries2.rename(columns={'SP.DYN.LE00.IN': 'LifeExp', indicator_id1: 'Gdppercapita'}, inplace=True)
df_countries2['ln_Gdppercapita'] = np.log(df_countries2['Gdppercapita'])
#Plot the points
fig = plt.figure()
ax1 = fig.add_subplot(1, 1, 1)
ax1.scatter(df_countries2['LifeExp'],df_countries2['ln_Gdppercapita'])
plt.box(on=None)
plt.show()
```

Now the relationship looks much more linear than before. This looks promising!

Let’s jump into modelling these variables immediately using linear regression.

```
#Create a linear model - model3
model3 = 'LifeExp ~ ln_Gdppercapita'
lm3 = sm.ols(formula = model3, data = df_countries2).fit()
print('LINEAR REGRESSION - MODEL3')
print(lm3.summary())
print('-------------------------------------------------------------')
```

```
LINEAR REGRESSION - MODEL3
                            OLS Regression Results
==============================================================================
Dep. Variable:                LifeExp   R-squared:                       0.710
Model:                            OLS   Adj. R-squared:                  0.708
Method:                 Least Squares   F-statistic:                     364.2
Date:                Mon, 20 May 2024   Prob (F-statistic):           7.54e-42
Time:                        15:39:16   Log-Likelihood:                -429.38
No. Observations:                 151   AIC:                             862.8
Df Residuals:                     149   BIC:                             868.8
Df Model:                           1
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          32.2774      2.148     15.026      0.000      28.033      36.522
ln_Gdppercapita     4.6034      0.241     19.083      0.000       4.127       5.080
==============================================================================
Omnibus:                       24.020   Durbin-Watson:                   2.184
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               31.468
Skew:                          -0.927   Prob(JB):                     1.47e-07
Kurtosis:                       4.250   Cond. No.                         56.9
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
-------------------------------------------------------------
```

$R$-squared is much higher now, around 71.0%.

The $p$-value of $ \beta _1$ is 0.00. Thus, it is statistically significant at the 0.05 level, and we can reject the null hypothesis. This implies that Log of GDP per Capita does explain the Life Expectancy of the inhabitants of a country.

Let’s dig a little deeper into the meaning of $R$-squared. Graphically, $R$-squared shows how much better the linear regression model explains the data compared to a model that simply predicts the average.

Recall that the formula for $R$-squared is the following:

$R^2 = 1 - \frac{RSS}{TSS}$

Here RSS and TSS denote the Residual and Total Sum of Squares respectively. To understand the above formula carefully, look at the chart below where we plotted Life Expectancy vs. Log of GDP per Capita again.

If you consider a naive model given by:

*LifeExpectancy* = *β*_{0} + *error*

The best guess for $\beta _0$ is just the average Life Expectancy, shown by the red line. The sum of the squared residuals of this naive model is called the Total Sum of Squares (TSS). $R$-squared measures how much better the regression line of model3 (Life Expectancy vs. Log of GDP per Capita), shown by the blue line, explains the observed variation compared to the naive model. The sum of squared residuals of the regression model is the RSS.
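The decomposition can be checked numerically: with the naive model the residuals are deviations from the mean (giving TSS), while with the regression line they are the usual residuals (giving RSS). A numpy sketch on synthetic data (the trend 30 + 4x and the noise level are invented):

```python
import numpy as np

#Synthetic data with a linear trend plus noise
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
y = 30 + 4 * x + rng.normal(0, 3, 100)

slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

rss = np.sum((y - fitted)**2)    #residuals of the regression line
tss = np.sum((y - y.mean())**2)  #residuals of the naive (mean) model
r_squared = 1 - rss / tss
print(r_squared)

#For simple linear regression this equals the squared correlation of x and y
print(np.corrcoef(x, y)[0, 1]**2)
```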

```
#Plot the regression and the naive model
sns.lmplot(x = 'ln_Gdppercapita', y = 'LifeExp', data = df_countries2, scatter_kws = {'color': (0/255,0/255,0/255)},line_kws = {'color': "blue"})
plt.title("Log GDP per CAP vs. LIFE EXP", fontsize=20, verticalalignment='bottom')
plt.axhline(df_countries2['LifeExp'].mean(), ls='--',color = 'r')
plt.xlabel("ln_Gdppercapita")
plt.ylabel("LifeExp")
plt.box(on=None);
```

So far we have followed an example of linear regression without asking if there are some assumptions we need to check. Indeed there are a few. Some of the most important assumptions are:

- The mean value of the residuals is zero.
- The variance of the residuals is constant.
- The residuals are not auto-correlated.

In classical linear regression, one assumption is that the residuals are normally and independently distributed. If that holds, conditions 1, 2 and 3 are satisfied.
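Besides visual inspection, assumptions 1 and 3 can also be checked numerically: the residual mean should be near zero, and the lag-1 autocorrelation of the residuals should be small (the Durbin-Watson statistic printed in the regression summary, here around 2, tells the same story). A sketch on synthetic residuals (the distribution is invented for illustration):

```python
import numpy as np

#Synthetic residuals from a well-behaved model
rng = np.random.default_rng(7)
resid = rng.normal(0, 2, 200)

#1) the mean of the residuals should be close to zero
print(resid.mean())

#3) the lag-1 autocorrelation should be close to zero
autocorr = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(autocorr)

#A Durbin-Watson statistic near 2 also indicates no autocorrelation
dw = np.sum(np.diff(resid)**2) / np.sum(resid**2)
print(dw)
```

The same checks can be run on lm3.resid.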

In the following chart, we can check if the mean of the residuals is zero from visual inspection.

```
#Plot the residuals
plt.plot(lm3.resid,marker='o',linestyle='None',color='dimgrey');
```

The mean seems to be around zero, but there is some skew in the data. Let’s plot a histogram and check whether the residuals look roughly normally distributed.

```
#Plot the residuals in a histogram
plt.hist(lm3.resid,
         density=True,  # the histogram integrates to 1
                        # (so it can be compared to the normal distribution)
         bins=100,      # draw a histogram with 100 bins of equal width
         label="residuals")  # label for legend
# now plot the normal distribution for comparison
xx = np.linspace(lm3.resid.min(), lm3.resid.max(), num=1000)
plt.plot(xx, scipy.stats.norm.pdf(xx, loc=0.0, scale=np.sqrt(lm3.scale)),
         label="normal distribution")
outliers = np.abs(lm3.resid) > 4*np.sqrt(lm3.scale)
sns.rugplot(lm3.resid[outliers],
            color="red",  # otherwise the rugplot has the same color as the histogram
            label="outliers")
plt.legend(loc="upper right");
```

One can also plot the residuals using a probability plot. If the residuals are normally distributed, they will fall along the red line in the following chart.

```
#Probability Plot
from scipy import stats
fig, ax = plt.subplots(nrows=1, ncols=1,figsize=(10,5))
stats.probplot(x=lm3.resid, dist = "norm", plot = ax)
plt.show()
```

To finish this topic we will show a way to identify which variables might be transformed, as we did with the logarithm of GDP per capita in this example. You can use the Box-Cox statistic (the fitted lambda). If this number is close to 1, you don’t need to do anything. When it is close to 0.5, you can use the square root transformation. When it is close to zero or negative, the logarithmic transformation is recommended.

```
#Calculate the Box-Cox statistic (fitted lambda)
transformed, fitted_lambda = stats.boxcox(df_countries2['Gdppercapita'])
round(fitted_lambda,2)
```

0.02

**Please remember:** Here the Box-Cox statistic is close to zero, which suggests a logarithmic transformation. We can corroborate that transformation using the probability plot we introduced before.

```
from scipy import stats
fig, ax = plt.subplots(nrows=1, ncols=2,figsize=(10,5))
stats.probplot(x=df_countries2['Gdppercapita'], dist = "norm", plot = ax[0])
stats.probplot(x=df_countries2['ln_Gdppercapita'], dist = "norm", plot = ax[1])
plt.show()
```

**ONE LAST THING:** To finish this topic about linear regression, we would like to show another library that handles these calculations: scikit-learn (sklearn). Below we show an example of how to do linear regression using this package.

```
from sklearn.linear_model import LinearRegression
X = df_countries2['ln_Gdppercapita'].values
y = df_countries2['LifeExp'].values
X_train = X[0:120].reshape(-1, 1)
X_test = X[121:150].reshape(-1, 1)  #note: this split leaves rows 120 and 150 unused
y_train = y[0:120].reshape(-1, 1)
y_test = y[121:150].reshape(-1, 1)
linear = LinearRegression()  #the normalize argument was removed in recent scikit-learn versions
linear.fit(X_train,y_train)
print('R2 =',linear.score(X_train,y_train))
print('R2 =',linear.score(X_test,y_test))
```

R2 = 0.7291076097475874
R2 = 0.6202315969346572

```
print('w1 =',linear.coef_)
print('w0 =',linear.intercept_)
```

w1 = [[4.67749164]]
w0 = [31.59751511]

```
y_pred = linear.predict(X_test)
plt.scatter(X_test, y_test, color='black',label=r'test data')
plt.scatter(X_train, y_train, color='red',alpha=0.4,label=r'train data')
plt.plot(X_test, y_pred, color='blue',label=r'prediction')
plt.legend(loc='lower right')
plt.xlabel(r'log GDP per Capita')
plt.ylabel(r'Life Expectancy')
plt.show()
```
