Modeling Madrid House Prices with Linear Regression#

Requirements#

import polars as pl
import pandas as pd
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from IPython.display import Image
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import ast
from math import factorial
import itertools
from itertools import combinations
from sklearn.model_selection import KFold
import scipy
from EDA import dtypes_df, count_col_nulls, corr_matrix, outliers_table, scatter_matrix, boxplot_2D_matrix, histogram_2D_matrix, ecdf_2D_matrix, quant_to_cat
from EDA import histogram, freq_table, outlier_filter, summary, boxplot, ecdfplot, stripplot_matrix, transform_to_dummies, cross_quant_cat_summary

Conceptual description of the data#

The data set used in this project contains information about Madrid houses, extracted from real estate web portals.

The data has been obtained from Kaggle: https://www.kaggle.com/datasets/makofe/housesclean

The following table conceptually summarizes the variables in the data set that we will work with throughout this project.

| Variable Name | Description | Type |
|---|---|---|
| sq_mt_built | Size of the house in square meters built | Quantitative |
| n_rooms | Number of rooms of the house | Quantitative |
| n_bathrooms | Number of bathrooms of the house | Quantitative |
| n_floors | Number of floors of the house | Quantitative |
| sq_mt_allotment | For detached houses, full size including house, pool, garden, etc. | Quantitative |
| floor | Indicates the house height | Multiclass |
| is_renewal_needed | Indicates whether the house needs renewal or not | Binary |
| has_lift | Indicates whether the house has a lift or not | Binary |
| is_exterior | Indicates whether the house is exterior or not | Binary |
| energy_certificate | Higher values mean a more efficient energy system | Multiclass |
| has_parking | Indicates whether the house has parking or not | Binary |
| neighborhood | Madrid’s neighborhoods | Multiclass |
| district | Madrid’s districts | Multiclass |
| house_type | Indicates the house type: flat (0), chalet (1), study (2), duplex (3), top floor (4) | Multiclass |
| buy_price | The buy price of the house | Quantitative |

Reading the data#

First of all, we read the data.

madrid_houses_df = pl.read_csv('madrid_houses.csv')
madrid_houses_df.head()
shape: (5, 17)
| | id | sq_mt_built | n_rooms | n_bathrooms | n_floors | sq_mt_allotment | floor | buy_price | is_renewal_needed | has_lift | is_exterior | energy_certificate | has_parking | neighborhood | district | house_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21742 | 64.0 | 2 | 1 | 1 | 0.0 | 3 | 85000 | false | false | true | 4 | false | 135 | 21 | 1 |
| 1 | 21741 | 70.0 | 3 | 1 | 1 | 0.0 | 4 | 129900 | true | true | true | 0 | false | 132 | 21 | 1 |
| 2 | 21740 | 94.0 | 2 | 2 | 1 | 0.0 | 1 | 144247 | false | true | true | 0 | false | 134 | 21 | 1 |
| 3 | 21739 | 64.0 | 2 | 1 | 1 | 0.0 | -1 | 109900 | false | true | true | 0 | false | 134 | 21 | 1 |
| 4 | 21738 | 108.0 | 2 | 2 | 1 | 0.0 | 4 | 260000 | false | true | true | 0 | true | 133 | 21 | 1 |
madrid_houses_df.shape
(21739, 17)

Data preprocessing#

In this section we are going to carry out some preprocessing tasks on the data.

Removing some variables#

We will remove some variables that will be not used in the project.

variables_to_remove = ['', 'id', 'district', 'neighborhood']
variables_to_keep = [x for x in madrid_houses_df.columns if x not in variables_to_remove]
madrid_houses_df = madrid_houses_df[variables_to_keep]
madrid_houses_df.head()
shape: (5, 13)
| sq_mt_built | n_rooms | n_bathrooms | n_floors | sq_mt_allotment | floor | buy_price | is_renewal_needed | has_lift | is_exterior | energy_certificate | has_parking | house_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64.0 | 2 | 1 | 1 | 0.0 | 3 | 85000 | false | false | true | 4 | false | 1 |
| 70.0 | 3 | 1 | 1 | 0.0 | 4 | 129900 | true | true | true | 0 | false | 1 |
| 94.0 | 2 | 2 | 1 | 0.0 | 1 | 144247 | false | true | true | 0 | false | 1 |
| 64.0 | 2 | 1 | 1 | 0.0 | -1 | 109900 | false | true | true | 0 | false | 1 |
| 108.0 | 2 | 2 | 1 | 0.0 | 4 | 260000 | false | true | true | 0 | true | 1 |

Data types#

We check the variables data types according to polars structure.

dtypes_df(df=madrid_houses_df)
shape: (13, 2)
| Columns | Python_type |
|---|---|
| sq_mt_built | Float64 |
| n_rooms | Int64 |
| n_bathrooms | Int64 |
| n_floors | Int64 |
| sq_mt_allotment | Float64 |
| floor | Int64 |
| buy_price | Int64 |
| is_renewal_needed | Boolean |
| has_lift | Boolean |
| is_exterior | Boolean |
| energy_certificate | Int64 |
| has_parking | Boolean |
| house_type | Int64 |

Unique values#

We compute the unique values of each variable, which is useful to get a more realistic idea of their nature.

for col in madrid_houses_df.columns :
    display(madrid_houses_df[col].unique())
shape: (741,)
sq_mt_built
f64
13.0
15.0
16.0
18.0
19.0
20.0
21.0
22.0
23.0
24.0
25.0
26.0
1500.0
1600.224593
1619.87041
1700.477327
1800.379027
1930.501931
2000.0
2035.278155
2080.0
2081.332053
2199.874293
2400.0
shape: (19,)
n_rooms
i64
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
18
24
shape: (16,)
n_bathrooms
i64
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
shape: (6,)
n_floors
i64
1
2
3
4
5
7
shape: (356,)
sq_mt_allotment
f64
0.0
10.0
20.0
22.0
25.0
27.0
28.0
29.0
30.0
36.0
40.0
50.0
991.0
994.0
997.0
1000.0
2000.0
3000.0
4000.0
5000.0
6000.0
10000.0
11000.0
21000.0
shape: (15,)
floor
i64
-5
-3
-2
-1
0
1
2
3
4
5
6
7
8
9
10
shape: (2_403,)
buy_price
i64
36000
39000
42000
49000
49500
52000
52990
53000
54000
55000
56000
57000
6500000
6750000
6800000
6900000
7000000
7525000
7800000
7900000
8000000
8500000
8700000
8800000
shape: (2,)
is_renewal_needed
bool
false
true
shape: (2,)
has_lift
bool
false
true
shape: (2,)
is_exterior
bool
true
false
shape: (8,)
energy_certificate
i64
0
1
2
3
4
5
6
7
shape: (2,)
has_parking
bool
false
true
shape: (5,)
house_type
i64
1
2
3
4
5

Standard encoding#

We will apply standard encoding on the categorical variables.

The variables to apply standard encoding are the following:

cols_to_std_encoding = ['house_type', 'has_parking', 'is_exterior', 'is_renewal_needed', 'has_lift']
for col in cols_to_std_encoding :

    X_std_encoding = OrdinalEncoder(dtype=int).fit_transform(madrid_houses_df[[col]]).flatten()
    madrid_houses_df = madrid_houses_df.with_columns(pl.Series(X_std_encoding).alias(col))
madrid_houses_df.head()
shape: (5, 13)
| sq_mt_built | n_rooms | n_bathrooms | n_floors | sq_mt_allotment | floor | buy_price | is_renewal_needed | has_lift | is_exterior | energy_certificate | has_parking | house_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 64.0 | 2 | 1 | 1 | 0.0 | 3 | 85000 | 0 | 0 | 1 | 4 | 0 | 0 |
| 70.0 | 3 | 1 | 1 | 0.0 | 4 | 129900 | 1 | 1 | 1 | 0 | 0 | 0 |
| 94.0 | 2 | 2 | 1 | 0.0 | 1 | 144247 | 0 | 1 | 1 | 0 | 0 | 0 |
| 64.0 | 2 | 1 | 1 | 0.0 | -1 | 109900 | 0 | 1 | 1 | 0 | 0 | 0 |
| 108.0 | 2 | 2 | 1 | 0.0 | 4 | 260000 | 0 | 1 | 1 | 0 | 1 | 0 |
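As a quick illustration of what OrdinalEncoder is doing here: it maps the sorted unique values of each column to consecutive integers, so the booleans false/true become 0/1. A minimal sketch on a hypothetical toy column (not project data):

toy_df = pd.DataFrame({'has_lift': [False, True, False]})
OrdinalEncoder(dtype=int).fit_transform(toy_df).flatten()
array([0, 1, 0])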

Missing values#

We check if there are missing values (null/NaN) in our data.

count_col_nulls(madrid_houses_df)
shape: (1, 13)
| sq_mt_built | n_rooms | n_bathrooms | n_floors | sq_mt_allotment | floor | buy_price | is_renewal_needed | has_lift | is_exterior | energy_certificate | has_parking | house_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

There are no missing values in this case.

Train-Test Split#

We perform a train-test split. Throughout most of the project we will only use the training partition, since it plays the role of the data we would have in reality; the test partition should only be used to estimate the predictive performance of the model on “new” data (an estimate of future performance).

response = 'buy_price'
predictors = [x for x in madrid_houses_df.columns if x != response]
X = madrid_houses_df[predictors]
Y = madrid_houses_df[response]

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=0.70, random_state=123)
X_Y_train_df = pl.concat((X_train, pl.DataFrame(Y_train)), how='horizontal')
X_Y_test_df = pl.concat((X_test, pl.DataFrame(Y_test)), how='horizontal')

EDA#

In this section we complement this predictive modeling project with an exploratory data analysis (EDA).

For the reasons given above, we carry out the EDA only on the training data.

Descriptive summary#

quant_columns = ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors', 'sq_mt_allotment', 'buy_price']
cat_columns = [x for x in madrid_houses_df.columns if x not in quant_columns]
quant_summary, cat_summary = summary(df=X_Y_train_df, auto_col=False,
                                     quant_col_names=quant_columns,
                                     cat_col_names=cat_columns)
quant_summary
sq_mt_built n_rooms n_bathrooms n_floors sq_mt_allotment buy_price
n_unique 689 17 15 6 304 2034
prop_nan 0.0 0.0 0.0 0.0 0.0 0.0
mean 154.019155 3.012683 2.098311 1.235263 59.190708 651756.495038
std 164.515279 1.509707 1.407854 0.710341 398.791114 776142.000602
min 13.0 0.0 1.0 1.0 0.0 39000.0
Q10 55.0 1.0 1.0 1.0 0.0 135500.0
Q25 70.0 2.0 1.0 1.0 0.0 199000.0
median 100.0 3.0 2.0 1.0 0.0 375000.0
Q75 165.0 4.0 2.0 1.0 0.0 755000.0
Q90 314.0 5.0 4.0 2.0 0.0 1550000.0
max 2400.0 24.0 16.0 7.0 21000.0 8800000.0
kurtosis 35.977168 11.227589 8.851221 11.921627 681.703793 19.434455
skew 4.454875 1.390361 2.028374 3.140461 19.343328 3.29838
n_outliers 1574 151 1964 1811 1003 1487
n_not_outliers 13643 15066 13253 13406 14214 13730
prop_outliers 0.103437 0.009923 0.129066 0.119012 0.065913 0.09772
prop_not_outliers 0.896563 0.990077 0.870934 0.880988 0.934087 0.90228
cat_summary
floor is_renewal_needed has_lift is_exterior energy_certificate has_parking house_type
n_unique 15 2 2 2 8 2 5
prop_nan 0.0 0.0 0.0 0.0 0.0 0.0 0.0
mode 1 0 1 1 0 0 0

Outliers#

outliers_table(df=X_Y_train_df, auto=False, col_names=quant_columns)
shape: (6, 7)
| quant_variables | lower_bound | upper_bound | n_outliers | n_not_outliers | prop_outliers | prop_not_outliers |
|---|---|---|---|---|---|---|
| sq_mt_built | -72.5 | 307.5 | 1574 | 13643 | 0.103437 | 0.896563 |
| n_rooms | -1.0 | 7.0 | 151 | 15066 | 0.009923 | 0.990077 |
| n_bathrooms | -0.5 | 3.5 | 1964 | 13253 | 0.129066 | 0.870934 |
| n_floors | 1.0 | 1.0 | 1811 | 13406 | 0.119012 | 0.880988 |
| sq_mt_allotment | 0.0 | 0.0 | 1003 | 14214 | 0.065913 | 0.934087 |
| buy_price | -635000.0 | 1.589e6 | 1487 | 13730 | 0.09772 | 0.90228 |
X_Y_train_df_trimmed = X_Y_train_df
for col in quant_columns:
    X_Y_train_df_trimmed = outlier_filter(X_Y_train_df_trimmed, col=col, h=1.5)

Response analysis#

Histogram#

histogram(X=X_Y_train_df[response], bins=25, color='purple', figsize=(7,5), random=True, n=500, seed=123, x_rotation=90) 
[Figure: histogram of buy_price (training data)]

Frequencies#

response_cat = quant_to_cat(X_Y_train_df[response], rule='quartiles', n_intervals=4, random_seed=123)
freq_table(X=response_cat)
shape: (5, 5)
| buy_price: unique values | abs_freq | rel_freq | cum_abs_freq | cum_rel_freq |
|---|---|---|---|---|
| (-inf, 39000] | 1 | 0.0001 | 1 | 0.000066 |
| (39000, 199000] | 3829 | 0.2516 | 3830 | 0.251692 |
| (199000, 375000] | 3827 | 0.2515 | 7657 | 0.503187 |
| (375000, 755000] | 3756 | 0.2468 | 11413 | 0.750016 |
| (755000, 8800000] | 3804 | 0.25 | 15217 | 1.0 |

Boxplot#

boxplot(X=X_Y_train_df[response], color='orange', figsize=(9,5), n_xticks=15, x_rotation=0, statistics=False)
[Figure: boxplot of buy_price (training data)]

ECDF-plot#

ecdfplot(X=X_Y_train_df[response], color='blue', figsize=(9,6))
[Figure: ECDF plot of buy_price (training data)]

Relation between response and predictors#

quant_predictors = [x for x in quant_columns if x != response]
cat_predictors = [x for x in cat_columns if x != response]
corr_matrix_response = corr_matrix(df=X_Y_train_df, response=response, predictors=quant_predictors, method='pearson')
fig, ax = plt.subplots(figsize=(5,4))
ax = sns.heatmap(corr_matrix_response, cmap="Blues", annot=True)
ax.set(xlabel="", ylabel="")
ax.xaxis.tick_top()
ax.set_xticklabels(ax.get_xticklabels(), rotation=0)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
plt.show()
[Figure: Pearson correlation heatmap of buy_price vs. the quantitative predictors]
scatter_matrix(df=X_Y_train_df, n_cols=2, tittle='Buy price vs Quantitative predictors', 
               figsize=(15,15), response=[response], predictors=quant_predictors,
               n_yticks=5, n_xticks=5, title_height=0.93, hspace=0.5, fontsize=20)
[Figure: scatter plots of buy_price vs. the quantitative predictors]
boxplot_2D_matrix(df=X_Y_train_df, n_cols=2, tittle='Buy prices vs Categorical predictors', figsize=(15,15), 
                 response=[response], predictors=cat_predictors,
                 n_yticks=5, title_height=0.93, hspace=0.5, fontsize=20,
                 statistics=['mean'], lines_width=0.8, bbox_to_anchor=(0.53,0.1), 
                 legend_size=12, color_stats=['red'], showfliers=True)
[Figure: boxplots of buy_price by categorical predictor, with means and outliers shown]
boxplot_2D_matrix(df=X_Y_train_df, n_cols=2, tittle='Buy prices vs Categorical predictors', figsize=(15,15), 
                 response=[response], predictors=cat_predictors,
                 n_yticks=5, title_height=0.93, hspace=0.5, fontsize=20,
                 showfliers=False)
[Figure: boxplots of buy_price by categorical predictor, outliers hidden]
stripplot_matrix(df=X_Y_train_df, n_cols=2, tittle='', figsize=(15,15), jitter=0.2,
                 response=[response], predictors=cat_predictors,
                 n_yticks=5, title_height=0.93, hspace=0.5, fontsize=20,
                 statistics=['mean'], lines_width=0.8, bbox_to_anchor=(0.53, 0.1), 
                 legend_size=8, color_stats=['red'])
[Figure: strip plots of buy_price by categorical predictor, with means marked]
histogram_2D_matrix(df=X_Y_train_df, bins=10, n_cols=2, tittle='', figsize=(15,20), 
                 response=[response], predictors=cat_predictors,
                 n_yticks=5, n_xticks=10, title_height=0.93, fontsize=20,
                 bbox_to_anchor=(1.15, 1), wspace=0.5, hspace=0.5,
                 legend_size=8, transparency=0.7)
[Figure: histograms of buy_price by categorical predictor]
ecdf_2D_matrix(df=X_Y_train_df, n_cols=2, tittle='', figsize=(15,20), 
                 response=[response], predictors=cat_predictors,
                 n_xticks=10, n_yticks=5, title_height=0.93, fontsize=20,
                 bbox_to_anchor=(1.15, 1), wspace=0.5, hspace=0.5,
                 legend_size=8, transparency=0.8)
[Figure: ECDF plots of buy_price by categorical predictor]
for col in cat_predictors:
    display(cross_quant_cat_summary(df=X_Y_train_df, quant_col=response, cat_col=col))
(In the tables below, all statistics refer to buy_price within each category of the corresponding predictor.)

shape: (15, 14)

| floor | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -5.0 | 1.7851e6 | 1.2945e6 | 39000.0 | 475200.0 | 890000.0 | 1.49e6 | 2.3e6 | 3.495e6 | 8.7e6 | 4.512 | 1.745 | 0.091837 | 0.0 |
| -3.0 | 237733.333 | 102940.106 | 115000.0 | 160000.0 | 175000.0 | 200000.0 | 250000.0 | 385000.0 | 494900.0 | 1.078 | 1.423 | 0.155172 | 0.0 |
| -2.0 | 224156.014 | 154217.86 | 74000.0 | 93900.0 | 130000.0 | 175000.0 | 259000.0 | 415000.0 | 950000.0 | 7.028 | 2.398 | 0.067879 | 0.0 |
| -1.0 | 323802.642 | 343219.37 | 42000.0 | 109000.0 | 140000.0 | 210000.0 | 359000.0 | 630000.0 | 3.2e6 | 16.575 | 3.51 | 0.105939 | 0.0 |
| 0.0 | 386954.342 | 415719.735 | 80000.0 | 124000.0 | 155000.0 | 230000.0 | 440000.0 | 790000.0 | 2.2e6 | 6.573 | 2.577 | 0.094352 | 0.0 |
| 1.0 | 515647.077 | 513707.292 | 61000.0 | 140000.0 | 199900.0 | 360000.0 | 630000.0 | 999000.0 | 8e6 | 29.949 | 3.92 | 0.083282 | 0.0 |
| 2.0 | 533162.318 | 610772.407 | 70000.0 | 130000.0 | 180000.0 | 315000.0 | 625000.0 | 1.2e6 | 5e6 | 14.534 | 3.273 | 0.100877 | 0.0 |
| 3.0 | 534463.084 | 606692.743 | 70000.0 | 130000.0 | 178000.0 | 315000.0 | 600000.0 | 1.2e6 | 5.5e6 | 10.309 | 2.878 | 0.148148 | 0.0 |
| 4.0 | 611404.858 | 714715.602 | 55000.0 | 130000.0 | 188400.0 | 365000.0 | 720000.0 | 1.375e6 | 6.5e6 | 11.715 | 3.009 | 0.091986 | 0.0 |
| 5.0 | 669036.809 | 630926.299 | 49500.0 | 175000.0 | 270000.0 | 450000.0 | 805000.0 | 1.46e6 | 4.5e6 | 6.863 | 2.344 | 0.097368 | 0.0 |
| 6.0 | 769124.274 | 809872.066 | 66000.0 | 234000.0 | 319000.0 | 550000.0 | 850000.0 | 1.74e6 | 8.8e6 | 27.77 | 4.123 | 0.046818 | 0.0 |
| 7.0 | 652867.224 | 572276.153 | 91000.0 | 215000.0 | 288000.0 | 482500.0 | 759000.0 | 1.425e6 | 4.1e6 | 8.176 | 2.485 | 0.09607 | 0.0 |
| 8.0 | 695522.421 | 552615.08 | 120000.0 | 211000.0 | 320000.0 | 540000.0 | 795000.0 | 1.5e6 | 2.8e6 | 3.154 | 1.828 | 0.007353 | 0.0 |
| 9.0 | 703932.059 | 528026.495 | 121000.0 | 203000.0 | 330000.0 | 536920.0 | 1.079e6 | 1.48e6 | 3.3e6 | 3.346 | 1.527 | 0.068493 | 0.0 |
| 10.0 | 935160.879 | 1.1675e6 | 174750.0 | 254760.0 | 321074.0 | 488250.0 | 939000.0 | 2.2797e6 | 7.525e6 | 15.856 | 3.534 | 0.082974 | 0.0 |

shape: (2, 14)

| is_renewal_needed | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 635567.578 | 759885.127 | 49000.0 | 139000.0 | 200000.0 | 365000.0 | 735000.0 | 1.495e6 | 8.8e6 | 17.864 | 3.422 | 0.095637 | 0.0 |
| 1.0 | 723705.952 | 840998.958 | 39000.0 | 126900.0 | 187000.0 | 430000.0 | 875000.0 | 1.75e6 | 8e6 | 11.796 | 2.851 | 0.087657 | 0.0 |

shape: (2, 14)

| has_lift | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 658804.012 | 1.0056e6 | 39000.0 | 109990.0 | 135000.0 | 198000.0 | 750000.0 | 1.85e6 | 8.7e6 | 11.736 | 2.986 | 0.123038 | 0.0 |
| 1.0 | 648529.815 | 644379.972 | 49000.0 | 180000.0 | 265000.0 | 445000.0 | 758000.0 | 1.4e6 | 8.8e6 | 16.46 | 3.186 | 0.088427 | 0.0 |

shape: (2, 14)

| is_exterior | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 335710.161 | 242012.879 | 52000.0 | 135000.0 | 185000.0 | 275000.0 | 399000.0 | 600000.0 | 2.7e6 | 16.453 | 3.052 | 0.06371 | 0.0 |
| 1.0 | 679795.234 | 800626.236 | 39000.0 | 135988.0 | 200000.0 | 395000.0 | 800000.0 | 1.599e6 | 8.8e6 | 15.223 | 3.174 | 0.08793 | 0.0 |

shape: (8, 14)

| energy_certificate | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 642185.792 | 753992.242 | 39000.0 | 137500.0 | 199000.0 | 369000.0 | 750000.0 | 1.5e6 | 8.7e6 | 15.746 | 3.209 | 0.102439 | 0.0 |
| 1.0 | 346668.228 | 428497.12 | 53000.0 | 113400.0 | 135000.0 | 190000.0 | 370000.0 | 740000.0 | 3.595e6 | 18.518 | 3.869 | 0.087467 | 0.0 |
| 2.0 | 475198.415 | 662838.173 | 66000.0 | 112000.0 | 148000.0 | 259000.0 | 515000.0 | 910000.0 | 5e6 | 17.032 | 3.785 | 0.109049 | 0.0 |
| 3.0 | 622401.725 | 710730.549 | 57000.0 | 139000.0 | 209000.0 | 385000.0 | 720000.0 | 1.45e6 | 8e6 | 17.811 | 3.419 | 0.096362 | 0.0 |
| 4.0 | 805999.855 | 881449.77 | 80000.0 | 169000.0 | 266000.0 | 495000.0 | 950000.0 | 1.85e6 | 7.8e6 | 11.196 | 2.823 | 0.093899 | 0.0 |
| 5.0 | 908499.322 | 1079620.2 | 70000.0 | 184000.0 | 290000.0 | 590000.0 | 1.079e6 | 1.8e6 | 8.8e6 | 19.892 | 3.787 | 0.07855 | 0.0 |
| 6.0 | 805822.84 | 815363.793 | 70000.0 | 185000.0 | 268000.0 | 482000.0 | 998000.0 | 1.85e6 | 4.5e6 | 4.193 | 2.007 | 0.081319 | 0.0 |
| 7.0 | 1.0018e6 | 1.1524e6 | 80000.0 | 175000.0 | 315000.0 | 570620.0 | 1.15e6 | 2.5e6 | 7.525e6 | 7.264 | 2.501 | 0.060302 | 0.0 |

shape: (2, 14)

| has_parking | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 461199.273 | 555863.208 | 39000.0 | 125000.0 | 161000.0 | 275000.0 | 540000.0 | 945000.0 | 8e6 | 26.332 | 4.16 | 0.076561 | 0.0 |
| 1.0 | 993549.052 | 973058.005 | 61000.0 | 235000.0 | 350000.0 | 650000.0 | 1.3e6 | 2.2e6 | 8.8e6 | 9.997 | 2.546 | 0.060584 | 0.0 |

shape: (5, 14)

| house_type | mean | std | min | Q10 | Q25 | median | Q75 | Q90 | max | kurtosis | skew | prop_outliers | prop_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 518594.384 | 569854.249 | 42000.0 | 130000.0 | 185000.0 | 330000.0 | 605000.0 | 1.06e6 | 6.5e6 | 15.006 | 3.269 | 0.08299 | 0.0 |
| 1.0 | 1.7851e6 | 1.2945e6 | 39000.0 | 475200.0 | 890000.0 | 1.49e6 | 2.3e6 | 3.495e6 | 8.7e6 | 4.512 | 1.745 | 0.052045 | 0.0 |
| 2.0 | 210529.651 | 143737.952 | 52990.0 | 99900.0 | 135000.0 | 185000.0 | 246500.0 | 330000.0 | 1.6e6 | 35.615 | 4.682 | 0.052851 | 0.0 |
| 3.0 | 810144.684 | 780032.508 | 92260.0 | 169000.0 | 270000.0 | 495000.0 | 1.175e6 | 1.75e6 | 5e6 | 5.527 | 2.054 | 0.046818 | 0.0 |
| 4.0 | 846746.026 | 826488.552 | 120000.0 | 247000.0 | 359000.0 | 598000.0 | 1.095e6 | 1.69e6 | 8.8e6 | 32.417 | 4.422 | 0.045175 | 0.0 |

Linear Regression Model#

Why Linear regression?#

The main use of the linear regression model is to predict the values of a quantitative variable, called the response variable, from the values of other quantitative or categorical variables, called predictors.

The model's other important use is inference, in other words, analyzing the relation between the response variable and the predictors.

Assumptions#

We have a response variable \(\mathcal{Y}\) and \(p\) predictors \(\mathcal{X} = (\mathcal{X}_1,\dots, \mathcal{X}_p).\)

We also have a sample of size \(n\) of these variables: \(\hspace{0.1cm} Y\in \mathbb{R}^n, \hspace{0.2cm} X=(X_1,\dots , X_p)\in \mathbb{R}^{n\times p}\)

  • The model is defined as follows:

    \[y_i \hspace{0.1cm} = \hspace{0.1cm} \beta_0 + x_i^\prime \cdot \beta \hspace{0.05cm}+\hspace{0.05cm} \varepsilon_i \hspace{0.1cm} = \hspace{0.1cm} \beta_0 + \beta_1 \cdot x_{i1} + \dots + \beta_p \cdot x_{ip} + \varepsilon_i\]

    Where:

    • \(x_i = (x_{i1},\dots, x_{ip})^\prime \in \mathbb{R}^p\)

    • \(\beta = (\beta_1, \dots, \beta_p)^\prime \in \mathbb{R}^p\)

    • \(\varepsilon_i\) is the residual term.

  • The residual term has to fulfill the following properties:

    • \(E[\varepsilon_i]=0\)

    • \(Var(\varepsilon_i)\hspace{0.05cm} = \hspace{0.05cm}\sigma^2\)

    • \(\varepsilon_i \sim N(0,\sigma^2)\)

    • \(cov(\varepsilon_i , \varepsilon_j)\hspace{0.05cm} = \hspace{0.05cm}0 \hspace{0.1cm} , \hspace{0.1cm} \forall i\neq j\)

  • Additional assumptions:

    • \(n \hspace{0.1cm}> \hspace{0.1cm} p+1\hspace{0.3cm}\) \((\text{nº observations} \hspace{0.05cm}> \hspace{0.05cm} \text{nº of beta coefficients})\)

    • \(Rg(X) \hspace{0.05cm}= \hspace{0.05cm}p+1\)

    These additional assumptions are important because of the dimensionality problem.

Assumptions consequences#

  • \(y_i\) is a random variable because \(\varepsilon_i\) is a random variable

  • \(E[y_i] \hspace{0.05cm} = \hspace{0.05cm} \beta_0 + \textbf{x}_i^T \cdot \boldsymbol{\beta}\)

  • \(Var(y_i) \hspace{0.05cm} = \hspace{0.05cm} \sigma^2\)

  • \(y_i \sim N(\hspace{0.02cm} \beta_0 + \mathbf{x}_i^T \cdot \beta \hspace{0.03cm} , \hspace{0.1cm} \sigma^2 \hspace{0.02cm} )\)

  • \(cov(y_i , y_j)\hspace{0.05cm} = \hspace{0.05cm}0 \hspace{0.1cm},\hspace{0.15cm} \forall i\neq j\)

Matrix representation of the basic assumptions#

  • \(Y \hspace{0.05cm} = \hspace{0.05cm} \beta_0 + X\cdot \beta \hspace{0.05cm} + \hspace{0.05cm}\varepsilon\)

  • \(\varepsilon_i \sim N(0,\sigma^2) \hspace{0.05cm} , \hspace{0.1cm} \forall \hspace{0.1cm} i=1,...,n \)

  • \(cov(\varepsilon_i , \varepsilon_j)\hspace{0.05cm} = \hspace{0.05cm}0 \hspace{0.05cm} , \hspace{0.1cm} \forall \hspace{0.1cm} i\neq j =1,...,n \)

    Where:

    • \(\varepsilon =(\varepsilon_{1}, \varepsilon_{2}, ..., \varepsilon_{n})^\prime\)

Coefficient estimation#

Parameters \((\beta_0, \beta)\) are estimated by Maximum Likelihood or Ordinary Least Squares (OLS). Both are equivalent under the previous assumptions.

\[\widehat{\beta}_0, \widehat{\beta} \hspace{0.1cm} = \hspace{0.1cm} \text{arg} \hspace{0.1cm}\underset{\beta_0, \beta}{\text{Min}} \hspace{0.1cm} \sum_{i=1}^{n} \hspace{0.07cm}(y_i - \beta_0 - x_i^\prime \cdot \beta)^2\]

The solution to this problem is:

\[\left(\widehat{\beta}_0, \widehat{\beta}^\prime\right)^\prime \hspace{0.1cm} = \hspace{0.1cm} (X^\prime \cdot X)^{-1} \cdot X^\prime \cdot Y\]

where here \(X\) denotes the design matrix augmented with a leading column of ones, which accounts for the intercept \(\beta_0\).

Interpretation:

  • \(\widehat{y} = \widehat{\beta}_0 + x^\prime \cdot \widehat{\beta} \hspace{0.05cm},\hspace{0.1cm} x\in \mathbb{R}^p\hspace{0.05cm}\) is the hyperplane that minimizes the sum of squared differences between the values of the response variable \((y_i)\) and their projections onto that hyperplane \(\hspace{0.04cm}\widehat{y}_i = \widehat{\beta}_0 + x_i^\prime \cdot \widehat{\beta} \hspace{0.04cm}.\)

[Figure: geometric interpretation of the least squares fit]
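To make the closed-form solution concrete, here is a minimal NumPy sketch on simulated toy data (separate from the project data; X_toy, Y_toy and beta_true are invented for illustration). The column of ones in the design matrix plays the role of the intercept \(\beta_0\):

# Simulate data satisfying the model assumptions
rng = np.random.default_rng(123)
n_toy, p_toy = 100, 2
X_toy = np.column_stack((np.ones(n_toy), rng.normal(size=(n_toy, p_toy))))  # design matrix with intercept column
beta_true = np.array([1.0, 2.0, -3.0])
Y_toy = X_toy @ beta_true + rng.normal(scale=0.5, size=n_toy)               # y_i = beta_0 + x_i' * beta + eps_i
# Closed-form OLS solution (X'X)^{-1} X'Y, solved without forming an explicit inverse
beta_hat_toy = np.linalg.solve(X_toy.T @ X_toy, X_toy.T @ Y_toy)

The resulting beta_hat_toy should be close to beta_true, and it gets closer as the sample size grows; the following sections reuse this toy example.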

We can estimate a Linear regression model in Python using statsmodels, a fantastic library to implement statistical models.

The following code defines a class that performs linear regression with the statsmodels implementation, with the option of encoding categorical predictors as dummy variables and adding interactions between the predictors.

class linear_model :

    def __init__(self, X, Y):

        if isinstance(X, pl.DataFrame):
            X = X.to_pandas()
        if isinstance(Y, pl.DataFrame):
            Y = Y.to_pandas()

        X = X.reset_index(drop=True)   
        Y = Y.reset_index(drop=True)    

        self.X = X
        self.Y = Y

    def dummies(self, X, pred_to_dummies):

        X = transform_to_dummies(X, cols_to_dummies=pred_to_dummies, drop_first=True)
        X = X.to_pandas()
        self.pred_to_dummies = pred_to_dummies
        return X

    def interactions(self, X, interactions_to_add):

        X_inter = pd.DataFrame()
        x = interactions_to_add
        if self.pred_to_dummies is not None:
            X_initial = self.X # X without any changes
            interactions_to_add = [(x[i][0], x[i][1]+f'_{j}') 
                                   for i in range(0, len(x)) 
                                   for j in np.sort(X_initial[x[i][1]].unique())[1:]]

        for quant, cat in interactions_to_add:
            X_inter[f'{quant}:{cat}'] = X[quant]*X[cat]
        X = pd.concat((X, X_inter), axis=1)
        return X

    def fit(self, pred_to_dummies=None, interactions_to_add=None):

        X = self.X
        self.interactions_to_add = interactions_to_add
        self.pred_to_dummies = pred_to_dummies

        if pred_to_dummies is not None:
            X = self.dummies(X, pred_to_dummies=pred_to_dummies)
        if interactions_to_add is not None:
            X = self.interactions(X, interactions_to_add=interactions_to_add)
        self.X_modified = X

        # Note: no constant column is added here, so the model is fitted without an intercept
        ols = sm.OLS(endog=self.Y, exog=X)
        self.linear_fit = ols.fit()
    
    def predict(self, X):

        pred_to_dummies = self.pred_to_dummies
        interactions_to_add = self.interactions_to_add

        if pred_to_dummies is not None:
            X = self.dummies(X, pred_to_dummies=pred_to_dummies)
        if interactions_to_add is not None:
            X = self.interactions(X, interactions_to_add=interactions_to_add)
        
        Y_hat = self.linear_fit.predict(X)
        return Y_hat

Just to see how to use the class and its summary output, we are going to train four full linear models. Full in the sense that they will include all the available predictors, but we will distinguish four cases:

  1. Full model with dummies for categorical predictors and some interactions.

  2. Full model without dummies but with interactions.

  3. Full model with dummies but without interactions.

  4. Full model without dummies or interactions.

# Statsmodels only reads pandas DataFrames, for now.
X_train = X_train.to_pandas()
Y_train = Y_train.to_pandas()
# Full model - interactions - dummies
full_model_inter_dummies = linear_model(Y=Y_train, X=X_train)
interactions = [('n_rooms', 'is_exterior'), ('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift'), ('sq_mt_built', 'has_parking')]
full_model_inter_dummies.fit(pred_to_dummies=cat_predictors, interactions_to_add=interactions)
results_full_model_inter_dummies = full_model_inter_dummies.linear_fit.summary()
# Full model - interactions 
full_model_inter = linear_model(Y=Y_train, X=X_train)
interactions = [('n_rooms', 'is_exterior'), ('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift'), ('sq_mt_built', 'has_parking')]
full_model_inter.fit(pred_to_dummies=None, interactions_to_add=interactions)
results_full_model_inter = full_model_inter.linear_fit.summary()
# Full model - dummies
full_model_dummies = linear_model(Y=Y_train, X=X_train)
full_model_dummies.fit(pred_to_dummies=cat_predictors, interactions_to_add=None)
results_full_model_dummies = full_model_dummies.linear_fit.summary()
# Full model - basic - not interactions - not dummies
full_model_basic = linear_model(Y=Y_train, X=X_train)
full_model_basic.fit(pred_to_dummies=None, interactions_to_add=None)
results_full_model_basic = full_model_basic.linear_fit.summary()

We are going to print the summaries of the respective models just to see what they look like.

print(results_full_model_inter_dummies)
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              buy_price   R-squared (uncentered):                   0.858
Model:                            OLS   Adj. R-squared (uncentered):              0.858
Method:                 Least Squares   F-statistic:                              2478.
Date:                Fri, 29 Dec 2023   Prob (F-statistic):                        0.00
Time:                        10:07:47   Log-Likelihood:                     -2.1718e+05
No. Observations:               15217   AIC:                                  4.344e+05
Df Residuals:                   15180   BIC:                                  4.347e+05
Df Model:                          37                                                  
Covariance Type:            nonrobust                                                  
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
sq_mt_built                1532.8474    342.962      4.469      0.000     860.601    2205.094
n_rooms                   -2.334e+04   1.27e+04     -1.834      0.067   -4.83e+04    1604.965
n_bathrooms                1.161e+05   4883.095     23.768      0.000    1.06e+05    1.26e+05
n_floors                   2954.8637   1.15e+04      0.257      0.797   -1.96e+04    2.55e+04
sq_mt_allotment              17.1423     11.454      1.497      0.135      -5.309      39.594
floor_-2                    2.16e+04   4.57e+04      0.473      0.636   -6.79e+04    1.11e+05
floor_-3                   2.852e+04   7.43e+04      0.384      0.701   -1.17e+05    1.74e+05
floor_-5                   1.798e+05    1.6e+04     11.269      0.000    1.49e+05    2.11e+05
floor_0                   -2.145e+04    2.9e+04     -0.741      0.459   -7.82e+04    3.53e+04
floor_1                    4.552e+04   1.19e+04      3.838      0.000    2.23e+04    6.88e+04
floor_10                   1.921e+05   5.37e+04      3.580      0.000    8.69e+04    2.97e+05
floor_2                    6.145e+04    1.2e+04      5.137      0.000     3.8e+04    8.49e+04
floor_3                    7.209e+04   1.28e+04      5.646      0.000    4.71e+04    9.71e+04
floor_4                    8.572e+04   1.36e+04      6.319      0.000    5.91e+04    1.12e+05
floor_5                    7.016e+04   1.61e+04      4.356      0.000    3.86e+04    1.02e+05
floor_6                    1.201e+05   1.87e+04      6.424      0.000    8.35e+04    1.57e+05
floor_7                    5.762e+04   2.24e+04      2.574      0.010    1.37e+04    1.01e+05
floor_8                    3.871e+04   2.75e+04      1.407      0.159   -1.52e+04    9.26e+04
floor_9                   -3.907e+04   3.46e+04     -1.130      0.259   -1.07e+05    2.87e+04
is_renewal_needed_1       -1446.7651   8443.816     -0.171      0.864    -1.8e+04    1.51e+04
has_lift_1                -2.001e+05   1.03e+04    -19.477      0.000    -2.2e+05    -1.8e+05
is_exterior_1             -6.444e+04   1.56e+04     -4.139      0.000    -9.5e+04   -3.39e+04
energy_certificate_1      -2.954e+04   1.61e+04     -1.840      0.066    -6.1e+04    1921.923
energy_certificate_2       1.601e+04   1.84e+04      0.871      0.384      -2e+04     5.2e+04
energy_certificate_3      -1.083e+04   9581.489     -1.130      0.258   -2.96e+04    7952.322
energy_certificate_4       2.634e+04   1.44e+04      1.833      0.067   -1820.563    5.45e+04
energy_certificate_5       8.849e+04   1.96e+04      4.511      0.000       5e+04    1.27e+05
energy_certificate_6       4.643e+04   2.15e+04      2.162      0.031    4330.549    8.85e+04
energy_certificate_7       5.894e+04    1.9e+04      3.098      0.002    2.17e+04    9.62e+04
has_parking_1             -6.756e+04   1.07e+04     -6.285      0.000   -8.86e+04   -4.65e+04
house_type_1               1.798e+05    1.6e+04     11.269      0.000    1.49e+05    2.11e+05
house_type_2                914.7454   2.48e+04      0.037      0.971   -4.77e+04    4.95e+04
house_type_3              -9.499e+04   2.15e+04     -4.427      0.000   -1.37e+05   -5.29e+04
house_type_4               2.223e+04   1.61e+04      1.382      0.167   -9307.052    5.38e+04
n_rooms:is_exterior_1     -2.243e+04    1.3e+04     -1.732      0.083   -4.78e+04    2961.653
sq_mt_built:is_exterior_1   911.2754    341.102      2.672      0.008     242.674    1579.877
sq_mt_built:has_lift_1     3242.8230     65.614     49.423      0.000    3114.211    3371.435
sq_mt_built:has_parking_1    -8.3343     47.758     -0.175      0.861    -101.947      85.278
==============================================================================
Omnibus:                    11702.338   Durbin-Watson:                   2.014
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1216497.472
Skew:                           2.995   Prob(JB):                         0.00
Kurtosis:                      46.391   Cond. No.                     1.33e+16
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The smallest eigenvalue is  2e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
print(results_full_model_inter)
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              buy_price   R-squared (uncentered):                   0.855
Model:                            OLS   Adj. R-squared (uncentered):              0.855
Method:                 Least Squares   F-statistic:                              5595.
Date:                Fri, 29 Dec 2023   Prob (F-statistic):                        0.00
Time:                        10:07:47   Log-Likelihood:                     -2.1734e+05
No. Observations:               15217   AIC:                                  4.347e+05
Df Residuals:                   15201   BIC:                                  4.348e+05
Df Model:                          16                                                  
Covariance Type:            nonrobust                                                  
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
sq_mt_built              1480.0192    341.947      4.328      0.000     809.762    2150.276
n_rooms                 -2.711e+04   1.25e+04     -2.161      0.031   -5.17e+04   -2516.047
n_bathrooms              1.239e+05   4894.994     25.311      0.000    1.14e+05    1.33e+05
n_floors                 6.808e+04   8106.271      8.398      0.000    5.22e+04     8.4e+04
sq_mt_allotment            20.4074     11.486      1.777      0.076      -2.106      42.920
floor                    1806.0012   1445.093      1.250      0.211   -1026.555    4638.558
is_renewal_needed         448.5235   8482.591      0.053      0.958   -1.62e+04    1.71e+04
has_lift                -1.997e+05   1.02e+04    -19.554      0.000    -2.2e+05    -1.8e+05
is_exterior             -1.057e+05   1.24e+04     -8.512      0.000    -1.3e+05   -8.13e+04
energy_certificate       6299.9469   1657.801      3.800      0.000    3050.459    9549.435
has_parking             -4.571e+04   1.07e+04     -4.263      0.000   -6.67e+04   -2.47e+04
house_type              -2961.0812   3440.179     -0.861      0.389   -9704.246    3782.083
n_rooms:is_exterior      -1.23e+04   1.29e+04     -0.957      0.339   -3.75e+04    1.29e+04
sq_mt_built:is_exterior  1120.7517    339.798      3.298      0.001     454.707    1786.797
sq_mt_built:has_lift     2962.2037     63.507     46.643      0.000    2837.722    3086.686
sq_mt_built:has_parking   -63.0704     48.004     -1.314      0.189    -157.163      31.023
==============================================================================
Omnibus:                    11375.312   Durbin-Watson:                   2.012
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1245548.674
Skew:                           2.840   Prob(JB):                         0.00
Kurtosis:                      46.957   Cond. No.                     2.81e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 2.81e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
print(results_full_model_dummies)
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              buy_price   R-squared (uncentered):                   0.834
Model:                            OLS   Adj. R-squared (uncentered):              0.834
Method:                 Least Squares   F-statistic:                              2317.
Date:                Fri, 29 Dec 2023   Prob (F-statistic):                        0.00
Time:                        10:07:47   Log-Likelihood:                     -2.1835e+05
No. Observations:               15217   AIC:                                  4.368e+05
Df Residuals:                   15184   BIC:                                  4.370e+05
Df Model:                          33                                                  
Covariance Type:            nonrobust                                                  
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
sq_mt_built           3335.6205     46.524     71.697      0.000    3244.428    3426.812
n_rooms              -1.104e+04   3651.373     -3.024      0.003   -1.82e+04   -3883.486
n_bathrooms           1.745e+05   5109.825     34.159      0.000    1.65e+05    1.85e+05
n_floors             -1.873e+05   1.05e+04    -17.917      0.000   -2.08e+05   -1.67e+05
sq_mt_allotment       -203.9127     11.375    -17.927      0.000    -226.208    -181.617
floor_-2             -1.672e+04   4.92e+04     -0.340      0.734   -1.13e+05    7.97e+04
floor_-3             -2.159e+04      8e+04     -0.270      0.787   -1.78e+05    1.35e+05
floor_-5              9.356e+04   1.64e+04      5.716      0.000    6.15e+04    1.26e+05
floor_0              -4.151e+04    3.1e+04     -1.338      0.181   -1.02e+05    1.93e+04
floor_1               1.507e+04   1.24e+04      1.219      0.223   -9170.640    3.93e+04
floor_10              1.517e+05   5.78e+04      2.622      0.009    3.83e+04    2.65e+05
floor_2               3.915e+04   1.24e+04      3.145      0.002    1.48e+04    6.36e+04
floor_3               5.149e+04   1.33e+04      3.858      0.000    2.53e+04    7.77e+04
floor_4               7.532e+04   1.43e+04      5.280      0.000    4.74e+04    1.03e+05
floor_5               5.628e+04   1.71e+04      3.296      0.001    2.28e+04    8.97e+04
floor_6               1.078e+05   1.99e+04      5.414      0.000    6.87e+04    1.47e+05
floor_7               3.238e+04    2.4e+04      1.350      0.177   -1.46e+04    7.94e+04
floor_8               1.446e+04   2.96e+04      0.488      0.625   -4.36e+04    7.25e+04
floor_9              -3.121e+04   3.73e+04     -0.838      0.402   -1.04e+05    4.18e+04
is_renewal_needed_1   2.865e+04   9052.496      3.165      0.002    1.09e+04    4.64e+04
has_lift_1            8.137e+04   8818.001      9.228      0.000    6.41e+04    9.87e+04
is_exterior_1        -6.605e+04   1.11e+04     -5.964      0.000   -8.78e+04   -4.43e+04
energy_certificate_1 -4.099e+04   1.73e+04     -2.369      0.018   -7.49e+04   -7078.883
energy_certificate_2 -9636.5864   1.98e+04     -0.486      0.627   -4.85e+04    2.92e+04
energy_certificate_3  -1.93e+04   1.03e+04     -1.867      0.062   -3.95e+04     957.103
energy_certificate_4  2.805e+04   1.55e+04      1.809      0.070   -2343.800    5.85e+04
energy_certificate_5   1.01e+05   2.12e+04      4.767      0.000    5.95e+04    1.42e+05
energy_certificate_6   2.95e+04   2.32e+04      1.273      0.203   -1.59e+04    7.49e+04
energy_certificate_7   5.37e+04   2.05e+04      2.619      0.009    1.35e+04    9.39e+04
has_parking_1        -2.764e+04   8066.556     -3.427      0.001   -4.35e+04   -1.18e+04
house_type_1          9.356e+04   1.64e+04      5.716      0.000    6.15e+04    1.26e+05
house_type_2          5826.3125   2.66e+04      0.219      0.827   -4.63e+04    5.79e+04
house_type_3          1.437e+05   2.22e+04      6.465      0.000       1e+05    1.87e+05
house_type_4          7.542e+04   1.73e+04      4.351      0.000    4.14e+04    1.09e+05
==============================================================================
Omnibus:                     8988.079   Durbin-Watson:                   2.022
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1125984.333
Skew:                           1.880   Prob(JB):                         0.00
Kurtosis:                      44.973   Cond. No.                     1.27e+16
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The smallest eigenvalue is 1.72e-23. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
print(results_full_model_basic)
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              buy_price   R-squared (uncentered):                   0.833
Model:                            OLS   Adj. R-squared (uncentered):              0.833
Method:                 Least Squares   F-statistic:                              6323.
Date:                Fri, 29 Dec 2023   Prob (F-statistic):                        0.00
Time:                        10:07:47   Log-Likelihood:                     -2.1841e+05
No. Observations:               15217   AIC:                                  4.368e+05
Df Residuals:                   15205   BIC:                                  4.369e+05
Df Model:                          12                                                  
Covariance Type:            nonrobust                                                  
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
sq_mt_built         3361.2021     46.165     72.808      0.000    3270.713    3451.691
n_rooms            -1.082e+04   3522.682     -3.073      0.002   -1.77e+04   -3919.647
n_bathrooms         1.762e+05   5103.800     34.515      0.000    1.66e+05    1.86e+05
n_floors           -1.394e+05   6846.789    -20.359      0.000   -1.53e+05   -1.26e+05
sq_mt_allotment     -195.4226     11.286    -17.316      0.000    -217.544    -173.301
floor               5199.6852   1494.271      3.480      0.001    2270.735    8128.636
is_renewal_needed   2.364e+04   9034.689      2.616      0.009    5928.776    4.13e+04
has_lift            6.459e+04   8148.929      7.926      0.000    4.86e+04    8.06e+04
is_exterior        -8.815e+04   9927.824     -8.879      0.000   -1.08e+05   -6.87e+04
energy_certificate  5251.0132   1771.320      2.964      0.003    1779.014    8723.012
has_parking        -1.978e+04   7990.224     -2.476      0.013   -3.54e+04   -4118.021
house_type          2.391e+04   3626.919      6.592      0.000    1.68e+04     3.1e+04
==============================================================================
Omnibus:                     8971.290   Durbin-Watson:                   2.020
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1170728.967
Skew:                           1.863   Prob(JB):                         0.00
Kurtosis:                      45.809   Cond. No.                     1.42e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 1.42e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Standard deviation of coefficients#

The precision of the beta coefficient estimates is given by the standard deviation of their estimators:

\[s_d(\widehat{\beta}_j)=\sqrt{\sigma^2 \cdot q_{jj}}\]

where:

  • \(q_{jj}\hspace{0.05cm} = diag\left( (X^T \cdot X)^{-1} \right)[j+1]\hspace{0.05cm}\) is the \((j+1)\)-th element of the principal diagonal of the matrix \((X^T \cdot X)^{-1}\), for \(\small{j=0,1,\dots,p}\).

To estimate \(sd(\widehat{\beta}_j)\) we need to estimate \(\sigma^2\), i.e., the variance of the error term \(\varepsilon_i\).

We can estimate it as follows:

\[\widehat{\sigma}^2 \hspace{0.1cm}=\hspace{0.1cm} \dfrac{1}{n-p-1} \cdot \sum_{i=1}^{n} \hspace{0.05cm}\hat{\varepsilon}_i^2 \]

where:

  • \(RSS = \sum_{i=1}^{n} \hspace{0.05cm}\widehat{\varepsilon}_i^2\) is the residual sum of squares.

  • \(\widehat{\varepsilon}_i = y_i - \widehat{y}_i\) is the estimation of the \(i\)-th residual term.

Then, we have the following estimation for the standard deviation of the coefficients:

\[\widehat{s_d}(\widehat{\beta}_j)=\sqrt{\widehat{\sigma}^2 \cdot q_{jj}}\]

Why is the variance of the coefficient estimators important?

The standard deviation of a coefficient estimator indicates how much the estimate of that coefficient varies, on average, when the model is trained on different samples.

Suppose many samples are obtained, and a linear regression model is trained with each of them. Then we get many estimates \(\widehat{\beta}_j\) of the model coefficient, one per sample.

Then, \(s_d(\widehat{\beta}_j)\) indicates how much \(\widehat{\beta}_j\) varies from one sample to another, on average.

Interpretation:

  • If the standard deviation is high, large differences will be obtained when \(\beta_j\) is estimated by \(\widehat{\beta}_j\), depending on the sample used to estimate it; the estimator \(\widehat{\beta}_j\) is imprecise, because its values will be very dispersed around their mean.

  • On the contrary, if the standard deviation is low, small differences will be obtained when \(\beta_j\) is estimated by \(\widehat{\beta}_j\), depending on the sample used to estimate it; the estimator \(\widehat{\beta}_j\) is precise, because its values will show little dispersion around their mean.

  • In addition, \(\widehat{s_d}(\widehat{\beta}_j)\) allows us to define a confidence interval for the estimator \(\widehat{\beta}_j\).
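Continuing the NumPy toy example from the coefficient estimation section, these standard deviations can be computed by hand as a sketch (statsmodels exposes the equivalent computation through the bse attribute used below):

eps_hat_toy = Y_toy - X_toy @ beta_hat_toy                   # estimated residuals
sigma2_hat_toy = (eps_hat_toy @ eps_hat_toy) / (n_toy - p_toy - 1)  # unbiased estimate of sigma^2
q_toy = np.diag(np.linalg.inv(X_toy.T @ X_toy))              # q_jj terms
sd_beta_hat_toy = np.sqrt(sigma2_hat_toy * q_toy)            # estimated sd of each coefficient estimator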

We can obtain the estimate of the standard deviation of the coefficient estimators with our class as follows:

full_model_inter_dummies.linear_fit.bse
sq_mt_built                    342.961695
n_rooms                      12727.596343
n_bathrooms                   4883.094974
n_floors                     11493.166017
sq_mt_allotment                 11.454317
floor_-2                     45685.141817
floor_-3                     74314.873388
floor_-5                     15959.232517
floor_0                      28964.054054
floor_1                      11861.824556
floor_10                     53666.317778
floor_2                      11961.359039
floor_3                      12769.649023
floor_4                      13566.003037
floor_5                      16106.923393
floor_6                      18701.565813
floor_7                      22382.217113
floor_8                      27512.954519
floor_9                      34577.619792
is_renewal_needed_1           8443.816143
has_lift_1                   10273.791241
is_exterior_1                15569.325847
energy_certificate_1         16050.032312
energy_certificate_2         18377.709130
energy_certificate_3          9581.488722
energy_certificate_4         14368.369556
energy_certificate_5         19618.537772
energy_certificate_6         21476.579454
energy_certificate_7         19023.827742
has_parking_1                10748.365169
house_type_1                 15959.232517
house_type_2                 24788.864438
house_type_3                 21458.713817
house_type_4                 16091.495675
n_rooms:is_exterior_1        12956.445872
sq_mt_built:is_exterior_1      341.102462
sq_mt_built:has_lift_1          65.614214
sq_mt_built:has_parking_1       47.758490
dtype: float64

Confidence intervals for coefficients#

Confidence intervals for beta coefficients at a level \(1-\alpha\) can be defined as follows:

\[CI(\beta_j)_{1-\alpha} = \left[\widehat{\beta}_j \hspace{0.1cm} \pm \hspace{0.1cm} t^{n-p-1}_{\alpha/2}\cdot \widehat{s_d}(\widehat{\beta}_j)\right]\]

Observation:

The smaller \(\widehat{s_d}(\widehat{\beta}_j)\), the smaller \(CI(\beta_j)\).
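Continuing the toy example, the interval follows directly from the formula, using the t quantile from scipy.stats (imported as stats at the top); this is essentially what statsmodels' conf_int computes:

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha/2, df=n_toy - p_toy - 1)      # t_{alpha/2}^{n-p-1}
CI_low = beta_hat_toy - t_crit * sd_beta_hat_toy
CI_up = beta_hat_toy + t_crit * sd_beta_hat_toy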

Using our class these confidence intervals can be obtained as follows:

betas_CI = full_model_inter_dummies.linear_fit.conf_int(alpha=0.05)
betas_CI.columns = ['2.5%', '97.5%']
betas_CI
2.5% 97.5%
sq_mt_built 860.601179 2205.093522
n_rooms -48290.274169 1604.965066
n_bathrooms 106488.712368 125631.619276
n_floors -19573.123999 25482.851420
sq_mt_allotment -5.309497 39.594179
floor_-2 -67947.157168 111149.588124
floor_-3 -117142.811317 174189.368494
floor_-5 148561.896433 211125.926824
floor_0 -78223.279453 35322.779618
floor_1 22269.505846 68770.711419
floor_10 86907.704353 297292.579248
floor_2 38001.357637 84892.762324
floor_3 47061.751438 97121.847295
floor_4 59132.930780 112314.925930
floor_5 38586.221564 101729.235716
floor_6 83479.099833 156793.736404
floor_7 13747.950497 101491.625530
floor_8 -15220.609435 92636.790406
floor_9 -106845.010601 28707.576483
is_renewal_needed_1 -17997.660287 15104.130116
has_lift_1 -220240.977098 -179965.244118
is_exterior_1 -94959.788264 -33924.285808
energy_certificate_1 -60998.064103 1921.923328
energy_certificate_2 -20011.903303 52033.137167
energy_certificate_3 -29609.418365 7952.322211
energy_certificate_4 -1820.562564 54506.902347
energy_certificate_5 50037.593213 126946.980438
energy_certificate_6 4330.549406 88523.906965
energy_certificate_7 21652.184603 96230.165451
has_parking_1 -88623.480438 -46487.303504
house_type_1 148561.896433 211125.926824
house_type_2 -47674.410344 49503.901108
house_type_3 -137055.965536 -52932.645570
house_type_4 -9307.052383 53775.481406
n_rooms:is_exterior_1 -47830.731598 2961.652840
sq_mt_built:is_exterior_1 242.673582 1579.877285
sq_mt_built:has_lift_1 3114.211238 3371.434739
sq_mt_built:has_parking_1 -101.946727 85.278040

Goodness of fit#

R-square#

The R-square is a measure of the goodness of fit of the linear regression model to the training data.

\[R^2 \hspace{0.05cm}=\hspace{0.05cm} \dfrac{\hspace{0.07cm}RegSS\hspace{0.07cm}}{TSS} \hspace{0.05cm}=\hspace{0.05cm} \dfrac{\hspace{0.07cm}TSS-RSS\hspace{0.07cm}}{TSS} \hspace{0.05cm}=\hspace{0.05cm} 1 - \dfrac{RSS}{TSS}\]

Where:

  • Total Sum of Squares

\[ TSS \hspace{0.1cm} = \hspace{0.1cm} \sum_{i=1}^n ( y_i - \overline{y})^2 \]
  • Residual Sum of Squares

\[ RSS= \sum_{i=1}^n \widehat{\varepsilon}_i^2 = \sum_{i=1}^n ( y_i - \hat{y}_i)^2 \]
  • Regression Sum of Squares

\[ RegSS = \sum_{i=1}^n ( \hat{y}_i - \overline{y} )^2 \]

It can be proved that:

\[ TSS=RSS+RegSS \]

where:

  • \(TSS \hspace{0.05cm}\) is the total variance of the response variable \(Y\)

  • \(RegSS \hspace{0.05cm}\) is the variance of the response variable \(Y\) explained by the model using \(X\)

  • \(RSS \hspace{0.05cm}\) is the variance of the response variable \(Y\) not explained by the model using \(X\)

Properties:

  • \(R^2\) is the proportion of the total variance of the response explained by the linear regression model by means of the predictors.

  • \(R^2 \in \left[ 0 , 1 \right]\)

For this reason, \(\hspace{0.01cm}R^2\hspace{0.01cm}\) is used as a measure of how well the model fits the response variable.

Interpretation:

The \(\hspace{0.01cm}R^2\hspace{0.01cm}\) interpretation is the following:

  • If \(\hspace{0.01cm}R^2\hspace{0.01cm}\) is close to \(1\), the model fits the response data well.

  • If \(\hspace{0.01cm}R^2\hspace{0.01cm}\) is close to \(0\), the model fits the response data poorly.
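As a sanity check, \(R^2\) can be computed directly from its definition on the toy example (note that for our Madrid models statsmodels reports the uncentered version, since they are fitted without a constant):

Y_hat_toy = X_toy @ beta_hat_toy
RSS_toy = np.sum((Y_toy - Y_hat_toy)**2)                     # variance not explained by the model
TSS_toy = np.sum((Y_toy - Y_toy.mean())**2)                  # total variance of the response
R2_toy = 1 - RSS_toy / TSS_toy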

We can obtain it using our class as follows:

full_model_inter_dummies.linear_fit.rsquared
0.8579358772571637

Adjusted R-square#

The \(R^2\) has some problems:

  • \(\hspace{0.1cm} R^2\) always increases when the number of predictors increases, even if they are not significant.

  • It is possible to estimate two models with the same predictive power but with different \(\hspace{0.1cm}R^2\).

To avoid these disadvantages of \(\hspace{0.1cm}R^2\hspace{0.1cm}\), the adjusted \(\hspace{0.1cm}R^2\hspace{0.1cm}\), denoted \(R^2_{adj}\), was created:

\[ R^2_{adj} = 1 - \dfrac{RSS/(n-p-1)}{TSS/(n-1)} = 1 - \left( 1- R^2 \right) \cdot \dfrac{n-1}{n-p-1} \]

This metric doesn’t grow when irrelevant predictors are included: if \(\hspace{0.1cm} RSS\hspace{0.1cm}\) is small only because \(\hspace{0.1cm}p\hspace{0.1cm}\) is large, then \(\hspace{0.1cm}1/(n-p-1)\hspace{0.1cm}\) will be large, compensating for the small \(\hspace{0.01cm}RSS\hspace{0.01cm}\) value.
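On the toy example, the adjustment is a one-liner given the R2_toy, RSS_toy and TSS_toy computed above (here p_toy = 2 predictors):

R2_adj_toy = 1 - (1 - R2_toy) * (n_toy - 1) / (n_toy - p_toy - 1)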

We can obtain it using our class as follows:

full_model_inter_dummies.linear_fit.rsquared_adj
0.8575896076562753

Global Significance Test (ANOVA)#

The test of the model's global significance is also called the ANOVA test, and is defined by the following hypotheses:

\[\begin{split} \hspace{-0.7 cm} H_0: \hspace{0.1cm} \beta_1=\dots =\beta_p=0 \\ H_1: \hspace{0.1cm} \beta_j \neq 0 \hspace{0.05cm},\hspace{0.15cm} \exists \ j=1,\dots ,p \end{split}\]

Statistic:

The test statistic for the ANOVA test is the following:

\[ F_{exp|H_0}= \dfrac{(TSS-RSS)/p}{RSS/(n-p-1)} = \dfrac{RegSS/p}{RSS/(n-p-1)} \hspace{0.05cm} \sim \hspace{0.05cm} F_{\hspace{0.05cm}p,\hspace{0.05cm} n-p-1} \]

Decision Rule:

  • Based on test statistic:

    \[ \text{Reject} \hspace{0.05cm} H_0 \hspace{0.1cm} \Leftrightarrow \hspace{0.1cm} F_{exp|H_0} > F_{\alpha}^{\hspace{0.1cm}p,\hspace{0.05cm} n-p-1} \]

    where:

    \[ P(F_{\hspace{0.05cm}p,\hspace{0.05cm} n-p-1} > F_{\alpha}^{\hspace{0.1cm}p,\hspace{0.05cm} n-p-1} ) = \alpha \]
  • Based on p-value:

    \[ \text{Reject} \hspace{0.05cm} H_0 \hspace{0.1cm} \Leftrightarrow \hspace{0.1cm} p\text{-value} < \alpha \]

    where:

    \[ p\text{-value} = P(F_{\hspace{0.05cm}p,\hspace{0.05cm} n-p-1} \geq F_{exp|H_0} ) \]
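Continuing the toy example, the observed statistic and its p-value can be computed with the F distribution from scipy.stats:

F_exp_toy = ((TSS_toy - RSS_toy) / p_toy) / (RSS_toy / (n_toy - p_toy - 1))  # observed value of the statistic
p_value_toy = stats.f.sf(F_exp_toy, p_toy, n_toy - p_toy - 1)                # P(F_{p, n-p-1} >= F_exp)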

We can obtain the observed value of the statistic and the p-value of the test from our class as follows:

full_model_inter_dummies.linear_fit.fvalue
2477.6528897026333
full_model_inter_dummies.linear_fit.f_pvalue
0.0

Prediction#

Once we have estimated the model parameters, we can make predictions of the response using data on the predictors.

The prediction of the response for an individual with predictor values \(x_i \in \mathbb{R}^p\) is:

\[\widehat{y}_i \hspace{0.05cm}=\hspace{0.05cm} \widehat{\beta}_0 + x_i^\prime \cdot \widehat{\beta} \]
# Full model - interactions - dummies
full_model_inter_dummies = linear_model(Y=Y_train, X=X_train)
interactions = [('n_rooms', 'is_exterior'), ('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift'), ('sq_mt_built', 'has_parking')]
full_model_inter_dummies.fit(pred_to_dummies=cat_predictors, interactions_to_add=interactions)
# Computing predictions
Y_train_hat = full_model_inter_dummies.predict(X_train)
# Computing RMSE
np.sqrt(np.mean((Y_train_hat - Y_train.to_numpy())**2))
381995.050363493

Predictors Selection#

In this section we are going to explore several methods to select predictors.

Given a data set with \(p\) predictors, the number of possible linear regression models that can be trained with a non-empty subset of them is:

\[\sum_{k=1}^p C(p,k) = \sum_{k=1}^p \dfrac{p!}{k! (p-k)!} = 2^p - 1\]

Where:

\[C(p,k) = \dfrac{p!}{k! (p-k)!}\]

is the binomial coefficient: the number of possible combinations of \(k\) objects selected from a set of \(p\) objects, without repetition and where the order does not matter.

def binom(p,k) :
    return factorial(p) / (factorial(k)*factorial(p-k))
p = X_train.shape[1]
np.sum([binom(p,k) for k in range(1,p+1)])
4095.0
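Equivalently, Python’s built-in math.comb computes the binomial coefficient directly, and the whole sum collapses to \(2^p - 1\):

from math import comb
sum(comb(p, k) for k in range(1, p+1))  # 4095, i.e. 2**p - 1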

In this case we can train 4095 different linear regression models with our data set, which has \(p=12\) predictors.

So, we are interested in a way to select the “best” combination of predictors. There is no definitive method, so we will explore several options.

Significance Test#

The (individual) significance test (ST) for the predictor \(\mathcal{X}_j\) consists in testing the following hypotheses:

\[\begin{cases} H_0 : \beta_j = 0 \\ H_1 : \beta_j \neq 0 \end{cases}\]

Statistic:

\[t_{exp} \hspace{0.1cm}=\hspace{0.1cm} \dfrac{\widehat{\beta}_j - \beta_j}{\widehat{s_d}(\widehat{\beta}_j)} \hspace{0.1cm}\underset{H_0}{=}\hspace{0.1cm} \dfrac{\widehat{\beta}_j }{\widehat{s_d}(\widehat{\beta}_j)} \hspace{0.1cm}\sim\hspace{0.1cm} t_{n-p-1}\]

p-value:

\[\text{p-value} = 2 \cdot P\left( t_{n-p-1} \hspace{0.1cm}\geq\hspace{0.1cm} \left| \dfrac{\widehat{\beta}_j }{\widehat{s_d}(\widehat{\beta}_j)} \right| \right)\]

Decision rule based on the p-value:

  • If \(\hspace{0.1cm}\text{p-value} < \alpha \hspace{0.1cm} \Rightarrow\hspace{0.1cm}\) Reject \(H_0 \hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) predictor \(\mathcal{X}_j\) is significant to explain the response.

  • If \(\hspace{0.1cm}\text{p-value} \geq \alpha \hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) Do not reject \(H_0 \hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) predictor \(\mathcal{X}_j\) is not significant to explain the response.

Decision rule based on confidence intervals:

This test can also be performed with the confidence intervals of the coefficients (a short statsmodels sketch is given after this list):

  • If \(\hspace{0.1cm}0\in CI(\beta_j)\hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) Do not reject \(H_0\)

  • If \(\hspace{0.1cm}0\notin CI(\beta_j)\hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) Reject \(H_0\)

    Where:

    • \(P\left(t_{n-p-1} > t^{n-p-1}_{\alpha/2} \right) = \alpha/2\)
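A minimal sketch of this CI-based rule with statsmodels (assuming full_model_inter_dummies.linear_fit as before):

fit = full_model_inter_dummies.linear_fit
ci = fit.conf_int(alpha=0.05)  # column 0 = lower bound, column 1 = upper bound
# Coefficients whose CI contains 0 -> H0 not rejected at level alpha.
list(ci.index[(ci[0] < 0) & (ci[1] > 0)])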

The following function allows us to extract the significant predictors of a given linear regression model according to the above test.

def significant_test(model, alpha=0.05, figsize=(5, 6)):

    # Drop the last entry (the intercept term, which the linear_model class appends last).
    pvalues = model.linear_fit.pvalues.values[:-1]
    predictors = model.linear_fit.pvalues.index.to_numpy()[:-1]
    significant_results = pvalues < alpha

    significant_predictors = predictors[significant_results]
    not_significant_predictors = predictors[~significant_results]
    significant_results_str = np.array([str(x) for x in significant_results])

    # Red points: all predictors; blue points overwrite the significant ones.
    plt.figure(figsize=figsize)
    ax = sns.scatterplot(x=significant_results_str, y=predictors, color='red', s=100)
    ax = sns.scatterplot(x=significant_results_str[significant_results],
                         y=predictors[significant_results], color='blue', s=100)
    plt.title('Significance Test', size=15)
    ax.set_ylabel('Predictors', size=12)
    ax.set_xlabel('Significant', size=12)
    plt.xticks(fontsize=13)
    plt.yticks(fontsize=11)
    ax.set_xlim(-0.5, 1.5)
    plt.show()

    return significant_predictors, not_significant_predictors

The idea is to apply this method to the full model (the one that includes all the predictors) to identify which of them are not significant (according to the above test), and thus select one combination of predictors: the significant ones.

Since we have four full models, as we explained before, we will apply the test to all of them, and we will obtain a “final” set of predictors in each case.

# Full model - interactions - dummies
significant_pred_model_inter_dummies_ST, not_significant_pred_model_inter_dummies_ST = significant_test(model=full_model_inter_dummies, alpha=0.05, figsize=(6,10))
[Figure: Significance Test, full model with interactions and dummies]

Given a categorical predictor, if at least one of its dummies is significant, we consider the predictor significant. The same applies to the interactions.

def recognition(initial_list):
    # Map dummy names like 'house_type_2' back to the original predictor
    # name ('house_type'); names without a trailing number are kept as they are.
    output_list = []
    for x in initial_list:
        try:
            int(x.split('_')[-1])  # last token is a category number -> dummy name
            output_list.append('_'.join(x.split('_')[:-1]))
        except ValueError:
            output_list.append(x)
    return np.array(list(set(output_list)))
predictors_selected, interactions_selected, pred_to_dummies = {}, {}, {}
significant_pred_model_inter_dummies_ST = recognition(significant_pred_model_inter_dummies_ST)
not_significant = recognition(not_significant_pred_model_inter_dummies_ST)
not_significant_pred_model_inter_dummies_ST = np.array([x for x in not_significant if x not in significant_pred_model_inter_dummies_ST]) 

predictors_selected['ST_inter_dummies'] = [x for x in significant_pred_model_inter_dummies_ST if ':' not in x]
interactions_selected['ST_inter_dummies'] = [(x.split(':')[0], x.split(':')[1]) for x in significant_pred_model_inter_dummies_ST  if ':' in x]
pred_to_dummies['ST_inter_dummies'] = [x for x in cat_predictors if x in predictors_selected['ST_inter_dummies']]
predictors_selected['ST_inter_dummies'] 
['sq_mt_built',
 'is_exterior',
 'floor',
 'n_bathrooms',
 'house_type',
 'energy_certificate',
 'has_parking',
 'has_lift']
interactions_selected['ST_inter_dummies'] 
[('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift')]
# Full model - interactions 
significant_pred_model_inter_ST, not_significant_pred_model_inter_ST = significant_test(model=full_model_inter, alpha=0.05, figsize=(6,7))
[Figure: Significance Test, full model with interactions]
predictors_selected['ST_inter'] = [x for x in significant_pred_model_inter_ST if ':' not in x]
interactions_selected['ST_inter'] = [(x.split(':')[0], x.split(':')[1]) for x in significant_pred_model_inter_ST if ':' in x]
pred_to_dummies['ST_inter'] = None
predictors_selected['ST_inter'] 
['sq_mt_built',
 'is_exterior',
 'floor',
 'n_bathrooms',
 'house_type',
 'energy_certificate',
 'has_parking',
 'has_lift']
interactions_selected['ST_inter']
[('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift')]
# Full model - dummies
significant_pred_model_dummies_ST, not_significant_pred_model_dummies_ST = significant_test(model=full_model_dummies, alpha=0.05, figsize=(6,10))
[Figure: Significance Test, full model with dummies]
significant_pred_model_dummies_ST = recognition(significant_pred_model_dummies_ST)
not_significant = recognition(not_significant_pred_model_dummies_ST)
not_significant_pred_model_dummies_ST = np.array([x for x in not_significant if x not in significant_pred_model_dummies_ST]) 

predictors_selected['ST_dummies'] = significant_pred_model_dummies_ST
interactions_selected['ST_dummies'] = None
pred_to_dummies['ST_dummies'] = [x for x in cat_predictors if x in predictors_selected['ST_dummies']]
predictors_selected['ST_dummies']
array(['sq_mt_allotment', 'sq_mt_built', 'n_floors', 'n_bathrooms',
       'floor', 'is_renewal_needed', 'is_exterior', 'energy_certificate',
       'house_type', 'has_parking', 'has_lift', 'n_rooms'], dtype='<U18')
interactions_selected['ST_dummies']
# Full model - not interactions - not dummies
significant_pred_model_basic_ST, not_significant_pred_model_basic_ST = significant_test(model=full_model_basic, alpha=0.05, figsize=(6,6))
[Figure: Significance Test, basic full model]
predictors_selected['ST_basic'] = significant_pred_model_basic_ST
interactions_selected['ST_basic'] = None
pred_to_dummies['ST_basic'] = None
predictors_selected['ST_basic']
array(['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors',
       'sq_mt_allotment', 'floor', 'is_renewal_needed', 'has_lift',
       'is_exterior', 'energy_certificate', 'has_parking'], dtype=object)

Likelihood Ratio Test#

Deviance

Given a statistical regression model \(M\), such as a linear or Poisson regression model, its deviance is defined as:

\[D(M) = -2 \cdot \log \left( \mathcal{L}(\widehat{\beta}_0, \widehat{\beta}) \right)\]

where \(\mathcal{L}(\widehat{\beta}_0, \widehat{\beta})\) is the likelihood function of the model evaluated at the estimated parameters (a one-line statsmodels sketch is given after the properties list below).

The deviance measures how well the model fits the training data; that is, it is a goodness-of-fit measure.

Properties:

  • \(D(M)\in [0, \infty)\)

  • The larger \(D(M)\), the worse \(M\) fits the training data.

  • The smaller \(D(M)\), the better \(M\) fits the training data.

  • The null model \(M_0\) (the one that doesn’t include predictors) has the maximum deviance.

  • The classic coefficient of determination \(R^2\) can be obtained through the deviance as \(\hspace{0.05cm}R^2(M) = 1 - \dfrac{D(M)}{D(M_0)}\)
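A one-line sketch with statsmodels, which exposes the fitted log-likelihood as llf (model name as before):

-2 * full_model_inter_dummies.linear_fit.llf  # deviance D(M) of the fitted full model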

Likelihood Ratio Test as significance test

Originally, the likelihood ratio test is a test to compare the goodness of fit of nested models, but in this section we adapt it to act as a significance test for the predictors.

Let \(M_p\) and \(M_{p-1}\) be two nested models such that \(M_p\) is the full model (it includes the \(p\) available predictors) and \(M_{p-1}\) is the same model but without the predictor \(X_j\).

The significance test for predictor \(X_j\) can then be performed as a likelihood ratio test:

\[\begin{cases} H_0 : M_{p-1} \\ H_1 : M_p \end{cases}\]

Statistic:

\[LRT = D(M_{p-1}) - D(M_p) \hspace{0.1cm}\sim\hspace{0.1cm} \chi_{q}^2\]

where \(q\) is the number of coefficients removed from \(M_p\) to obtain \(M_{p-1}\) (\(q=1\) for a single quantitative predictor, \(q>1\) when the predictor enters through several dummies).

p-value:

\[p\text{-value} = P(\chi_{q}^2 \geq LRT)\]

Decision rule:

  • If \(p\text{-value} < \alpha \hspace{0.1cm} \Rightarrow\hspace{0.1cm}\) Reject \(H_0 \hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) \(M_p\) fits the training data better than \(M_{p-1}\) \(\hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) predictor \(\mathcal{X}_j\) is significant to explain the response.

  • If \(p\text{-value} \geq \alpha \hspace{0.1cm} \Rightarrow\hspace{0.1cm}\) Do not reject \(H_0 \hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) \(M_{p-1}\) fits the training data as well as \(M_{p}\) \(\hspace{0.1cm}\Rightarrow\hspace{0.1cm}\) predictor \(\mathcal{X}_j\) is not significant to explain the response.

Let’s implement the likelihood ratio test as a significance test in Python:

def LRT(full_model, predictor_to_test) :

    X_full = full_model.X
    Y = full_model.Y
    interactions = full_model.interactions_to_add
    pred_to_dummies = full_model.pred_to_dummies
    predictors = X_full.columns.tolist().copy()

    if predictor_to_test in predictors:
        predictors_reduced = [x for x in predictors if x != predictor_to_test]
        interactions_reduced = [(quant, cat) for (quant, cat) in interactions if cat != predictor_to_test and quant != predictor_to_test] if interactions is not None else None
        pred_to_dummies_reduced = [x for x in pred_to_dummies if x != predictor_to_test] if pred_to_dummies is not None else None

    elif predictor_to_test in interactions:
        predictors_reduced = predictors
        interactions_reduced = [(quant, cat) for (quant, cat) in interactions if (quant, cat) != predictor_to_test]
        pred_to_dummies_reduced = pred_to_dummies

    reduced_model = linear_model(Y=Y, X=X_full[predictors_reduced])
    reduced_model.fit(pred_to_dummies=pred_to_dummies_reduced, interactions_to_add=interactions_reduced)

    loglike_full = full_model.linear_fit.llf
    loglike_reduced = reduced_model.linear_fit.llf
    deviance_full = -2*loglike_full
    deviance_reduced = -2*loglike_reduced
    LRT_ = deviance_reduced - deviance_full
    df_full = full_model.linear_fit.df_model # model degrees of freedom of the full model (regressors, excluding the intercept)
    df_reduced = reduced_model.linear_fit.df_model # model degrees of freedom of the reduced model
    pvalue = stats.chi2.sf(LRT_, df_full-df_reduced) # chi2 survival function with q = df_full - df_reduced degrees of freedom

    return pvalue
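For example, a one-off use of this function on a single predictor of the full model (an illustrative call):

LRT(full_model_inter_dummies, predictor_to_test='n_rooms')  # p-value of the LRT for dropping n_rooms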
def LRT_significant_test(full_model, alpha=0.05, figsize=(7,7)):

    if full_model.interactions_to_add is not None :
        predictors_to_test =  full_model.X.columns.tolist() + full_model.interactions_to_add 
    else:
        predictors_to_test =  full_model.X.columns.tolist()
    
    pvalue = dict()

    for predictor in predictors_to_test :
        pvalue[predictor] = LRT(full_model, predictor_to_test=predictor)
    
    predictors_tested = np.array([str(x) for x in pvalue.keys()])
    pvalues = np.array([x for x in pvalue.values()])
    significance_results = pvalues < alpha
    significant_predictors = predictors_tested[significance_results==True]
    not_significant_predictors = predictors_tested[significance_results==False]
    significance_test_results_str = np.array([str(x) for x in significance_results])

    plt.figure(figsize=figsize)
    ax = sns.scatterplot(x=significance_test_results_str, y=predictors_tested, color='red', s=100)
    ax = sns.scatterplot(x=significance_test_results_str[significance_results==True], 
                         y=predictors_tested[significance_results==True], color='blue', s=100)

    plt.title('Significance Test - LRT', size=15)
    ax.set_ylabel('Predictors', size=12)
    ax.set_xlabel('Significant', size=12)
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)
    ax.set_xlim(-0.5, 1.5) 
    plt.show()

    return significant_predictors, not_significant_predictors
# Full model LRT - interactions - dummies
significant_pred_model_inter_dummies_LRT, not_significant_pred_model_inter_dummies_LRT = LRT_significant_test(full_model=full_model_inter_dummies, 
                                                                                                              alpha=0.05, figsize=(7,8))
[Figure: Significance Test (LRT), full model with interactions and dummies]
significant_pred_model_inter_dummies_LRT
array(['sq_mt_built', 'n_rooms', 'n_bathrooms', 'floor', 'has_lift',
       'is_exterior', 'energy_certificate', 'has_parking', 'house_type',
       "('sq_mt_built', 'is_exterior')", "('sq_mt_built', 'has_lift')"],
      dtype='<U30')
not_significant_pred_model_inter_dummies_LRT
array(['n_floors', 'sq_mt_allotment', 'is_renewal_needed',
       "('n_rooms', 'is_exterior')", "('sq_mt_built', 'has_parking')"],
      dtype='<U30')
predictors_selected['LRT_inter_dummies'] = [x for x in significant_pred_model_inter_dummies_LRT if not x[0] == '(']
interactions_selected['LRT_inter_dummies'] = [ast.literal_eval(x) for x in significant_pred_model_inter_dummies_LRT if x[0] == '(']
pred_to_dummies['LRT_inter_dummies'] = [x for x in cat_predictors if x in predictors_selected['LRT_inter_dummies']]
predictors_selected['LRT_inter_dummies'] 
['sq_mt_built',
 'n_rooms',
 'n_bathrooms',
 'floor',
 'has_lift',
 'is_exterior',
 'energy_certificate',
 'has_parking',
 'house_type']
interactions_selected['LRT_inter_dummies']
[('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift')]
# Full model - interactions 
significant_pred_model_inter_LRT, not_significant_pred_model_inter_LRT = LRT_significant_test(full_model=full_model_inter, alpha=0.05, figsize=(7,8))
[Figure: Significance Test (LRT), full model with interactions]
predictors_selected['LRT_inter'] = [x for x in significant_pred_model_inter_LRT if not x[0] == '(']
interactions_selected['LRT_inter'] = [ast.literal_eval(x) for x in significant_pred_model_inter_LRT if x[0] == '(']
pred_to_dummies['LRT_inter'] = None
predictors_selected['LRT_inter']
['sq_mt_built',
 'n_rooms',
 'n_bathrooms',
 'n_floors',
 'has_lift',
 'is_exterior',
 'energy_certificate',
 'has_parking']
interactions_selected['LRT_inter']
[('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift')]
# Full model - dummies
significant_pred_model_dummies_LRT, not_significant_pred_model_dummies_LRT = LRT_significant_test(full_model=full_model_dummies, 
                                                                                                  alpha=0.05, figsize=(7,6))
[Figure: Significance Test (LRT), full model with dummies]
predictors_selected['LRT_dummies'] = significant_pred_model_dummies_LRT
interactions_selected['LRT_dummies'] = None
pred_to_dummies['LRT_dummies'] = None
predictors_selected['LRT_dummies']
array(['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors',
       'sq_mt_allotment', 'floor', 'is_renewal_needed', 'has_lift',
       'is_exterior', 'energy_certificate', 'has_parking', 'house_type'],
      dtype='<U18')
# Full model - basic (not interactions - not dummies)
significant_pred_model_basic_LRT, not_significant_pred_model_basic_LRT = LRT_significant_test(full_model=full_model_basic, 
                                                                                              alpha=0.05, figsize=(7,5))
[Figure: Significance Test (LRT), basic full model]
predictors_selected['LRT_basic'] = significant_pred_model_basic_LRT
interactions_selected['LRT_basic'] = None
pred_to_dummies['LRT_basic'] = None
predictors_selected['LRT_basic']
array(['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors',
       'sq_mt_allotment', 'floor', 'is_renewal_needed', 'has_lift',
       'is_exterior', 'energy_certificate', 'has_parking', 'house_type'],
      dtype='<U18')

Predictors selection based on predictive performance#

The idea is to treat the predictors as hyper-parameters of the model and optimize them using algorithms like grid search combined with validation schemes such as simple validation and cross-validation.

Train-Train - Train-Validate split#

Since we need to test the predictive performance of the models, we split the train partition into train-train (train2), used to train the models, and train-validate (val), used to test them, as sketched below.

We don’t use the test partition for this task because it involves making decisions about the predictive modeling process, and those decisions should be made with the training data; otherwise we risk data leakage.

The test partition will only be used to estimate the future performance of our final model.
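A minimal sketch of this nested split (the names train2/val follow the text; the 0.75 proportion and the seed are illustrative):

# Split the training partition again: train2 to fit the models, val to evaluate them.
X_train2, X_val, Y_train2, Y_val = train_test_split(X_train, Y_train, train_size=0.75, random_state=123)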

Best subset selection algorithm#

The best-subset selection algorithm works as follows:

We have \(p\) predictors.

  • We train all the possible linear models with \(1\) predictor, and select the one with the lowest test prediction error computed by simple validation \(\Rightarrow M_1^*\)

  • We train all the possible linear models with \(2\) predictors, and select the one with the lowest test prediction error \(\Rightarrow M_2^*\)
    \(\dots\)

  • We train all the possible linear models with \(p-1\) predictors, and select the one with the lowest test prediction error \(\Rightarrow M_{p-1}^*\)

  • We train the full linear model \(\Rightarrow M_{p}\)

Among \(M_1^*, M_2^*,\dots, M_{p-1}^*, M_p\) we select the one with the lowest test prediction error, and it is considered the best linear model according to the best-subset selection algorithm.

This method is feasible in this case because, as we saw before, the number of possible linear regression models that can be trained with all the possible subsets of the \(p=12\) predictors is \(4095\), which is computationally acceptable with the statsmodels implementation of the linear regression model.

p = len(predictors)
binom_values = [binom(p, k) for k in np.arange(1, p+1)]

plt.figure(figsize=(9, 5))
ax = sns.lineplot(y=binom_values, x=np.arange(1,p+1), color='blue')
ax.set_ylabel('Num. Possible combinations', size=12)
ax.set_xlabel('Num. predictors (k)', size=12)
plt.xticks(np.arange(1,p+1), fontsize=11)
plt.yticks(np.arange(0,1000, 300))
plt.title('Binomial coefficient in Best-subset selection', size=15)
plt.show()
[Figure: Binomial coefficient in best-subset selection]
def simple_validation(X, Y, train_size, seed, pred_to_dummies, interactions_to_add):

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=train_size, random_state=seed)

    model = linear_model(Y=Y_train, X=X_train)
    model.fit(pred_to_dummies=pred_to_dummies, interactions_to_add=interactions_to_add)
    Y_test_hat = model.predict(X_test)
    RMSE = np.sqrt(np.mean((Y_test - Y_test_hat)**2))

    return RMSE
def best_subset_selection(X, Y, pred_to_dummies=None, interactions=None, train_size=0.75, seed=123) :
    
    results = dict()
    RMSE = dict()
    predictors = X.columns.tolist()
    n_iter = len(predictors)

    for iter in range(1, n_iter+1) :

        predictors_combinations = []
        predictors_combinations = list(combinations(predictors, iter))

        for predictors_combi in predictors_combinations:
 
            pred_to_dummies_reduced = [x for x in pred_to_dummies if x in list(predictors_combi)] if pred_to_dummies is not None else None
            interactions_reduced = [(quant, cat) for (quant, cat) in interactions if cat in list(predictors_combi) and quant in list(predictors_combi)] if interactions is not None else None
            RMSE[predictors_combi] = simple_validation(X=X[list(predictors_combi)], Y=Y, train_size=train_size, seed=seed,
                                                       pred_to_dummies=pred_to_dummies_reduced, interactions_to_add=interactions_reduced)

        RMSE_predictors = [x for x in RMSE.keys()]
        RMSE_values = np.array([x for x in RMSE.values()])
        best_predictors = RMSE_predictors[np.argsort(RMSE_values)[0]]
        best_interactions = tuple([(quant,cat) for (quant,cat) in interactions if quant in list(best_predictors) and cat in list(best_predictors)]) if interactions is not None else None
        if best_interactions is not None:
            results[best_predictors + best_interactions] = np.min(RMSE_values) 
        else:
            results[best_predictors] = np.min(RMSE_values) 
        # Clean the dictionary
        RMSE = dict()

    return results
# Models with: dummies - interactions
results = best_subset_selection(X=X_train, Y=Y_train, train_size=0.75, seed=123, 
                                pred_to_dummies=cat_predictors, 
                                interactions=[('n_rooms', 'is_exterior'), ('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift'), ('sq_mt_built', 'has_parking')]) 
result_predictors = [x for x in results.keys()]
result_predictors_str = np.array([str(x) for x in results.keys()])
result_score = np.array([x for x in results.values()])
best_pred_bestsubset_inter_dummies_str = result_predictors_str[np.argsort(result_score)[0]]
best_pred_bestsubset_inter_dummies = list(result_predictors[np.argsort(result_score)[0]])
best_RMSE_bestsubset_inter_dummies = np.min(result_score)

predictors_selected['BestSubset_inter_dummies'] = [x for x in best_pred_bestsubset_inter_dummies if not isinstance(x,tuple)]
interactions_selected['BestSubset_inter_dummies'] = [x for x in best_pred_bestsubset_inter_dummies if isinstance(x, tuple)]
pred_to_dummies['BestSubset_inter_dummies'] = [x for x in cat_predictors if x in predictors_selected['BestSubset_inter_dummies']]
predictors_selected['BestSubset_inter_dummies'] 
['is_renewal_needed', 'has_lift', 'is_exterior']
interactions_selected['BestSubset_inter_dummies']
[]
plt.figure(figsize=(3, 7))
ax = sns.scatterplot(x=result_score, y=result_predictors_str, color='blue', s=75)
ax = sns.scatterplot(x=best_RMSE_bestsubset_inter_dummies, y=[best_pred_bestsubset_inter_dummies_str], color='red', s=75)
plt.title('Best-subset Selection', size=15)
ax.set_ylabel('Predictors', size=12)
ax.set_xlabel('RMSE', size=12)
x_min = np.min(result_score)
x_max = np.max(result_score)
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 2), fontsize=11)
plt.yticks(fontsize=11)
plt.show()
[Figure: Best-subset selection RMSE, models with dummies and interactions]
# Models with: dummies
results = best_subset_selection(X=X_train, Y=Y_train, train_size=0.75, seed=123, 
                                pred_to_dummies=cat_predictors, 
                                interactions=None) 
result_predictors = [x for x in results.keys()]
result_predictors_str = np.array([str(x) for x in results.keys()])
result_score = np.array([x for x in results.values()])
best_pred_bestsubset_dummies_str = result_predictors_str[np.argsort(result_score)[0]]
best_pred_bestsubset_dummies = list(result_predictors[np.argsort(result_score)[0]])
best_RMSE_bestsubset_dummies = np.min(result_score)

predictors_selected['BestSubset_dummies'] = best_pred_bestsubset_dummies
interactions_selected['BestSubset_dummies'] = None
pred_to_dummies['BestSubset_dummies'] = [x for x in cat_predictors if x in predictors_selected['BestSubset_dummies']]
predictors_selected['BestSubset_dummies']
['is_renewal_needed', 'has_lift', 'is_exterior']
plt.figure(figsize=(3, 7))
ax = sns.scatterplot(x=result_score, y=result_predictors_str, color='blue', s=75)
ax = sns.scatterplot(x=best_RMSE_bestsubset_dummies, y=[best_pred_bestsubset_dummies_str], color='red', s=75)
plt.title('Best-subset Selection', size=15)
ax.set_ylabel('Predictors', size=12)
ax.set_xlabel('RMSE', size=12)
x_min = np.min(result_score)
x_max = np.max(result_score)
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 2), fontsize=11)
plt.yticks(fontsize=11)
plt.show()
[Figure: Best-subset selection RMSE, models with dummies]
# Models with: interactions
results = best_subset_selection(X=X_train, Y=Y_train, train_size=0.75, seed=123, 
                      pred_to_dummies=None, 
                      interactions=[('n_rooms', 'is_exterior'), ('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift'), ('sq_mt_built', 'has_parking')]) 
result_predictors = [x for x in results.keys()]
result_predictors_str = np.array([str(x) for x in results.keys()])
result_score = np.array([x for x in results.values()])
best_pred_bestsubset_inter_str = result_predictors_str[np.argsort(result_score)[0]]
best_pred_bestsubset_inter = list(result_predictors[np.argsort(result_score)[0]])
best_RMSE_bestsubset_inter = np.min(result_score)

predictors_selected['BestSubset_inter'] = [x for x in best_pred_bestsubset_inter if not isinstance(x, tuple)]
interactions_selected['BestSubset_inter'] = [x for x in best_pred_bestsubset_inter if isinstance(x, tuple)]
pred_to_dummies['BestSubset_inter'] = None
predictors_selected['BestSubset_inter']
['sq_mt_built',
 'n_rooms',
 'n_bathrooms',
 'n_floors',
 'has_lift',
 'is_exterior',
 'energy_certificate',
 'house_type']
interactions_selected['BestSubset_inter']
[('n_rooms', 'is_exterior'),
 ('sq_mt_built', 'is_exterior'),
 ('sq_mt_built', 'has_lift')]
plt.figure(figsize=(3, 7))
ax = sns.scatterplot(x=result_score, y=result_predictors_str, color='blue', s=75)
ax = sns.scatterplot(x=best_RMSE_bestsubset_inter, y=[best_pred_bestsubset_inter_str], color='red', s=75)
plt.title('Best-subset Selection', size=15)
ax.set_ylabel('Predictors', size=12)
ax.set_xlabel('RMSE', size=12)
x_min = np.min(result_score)
x_max = np.max(result_score)
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 2), fontsize=11)
plt.yticks(fontsize=11)
plt.show()
[Figure: Best-subset selection RMSE, models with interactions]
# basic (not inter - not dummies)

results = best_subset_selection(X=X_train, Y=Y_train, train_size=0.75, seed=123, 
                      pred_to_dummies=None, 
                      interactions=None) 
result_predictors = [x for x in results.keys()]
result_predictors_str = np.array([str(x) for x in results.keys()])
result_score = np.array([x for x in results.values()])
best_pred_bestsubset_basic_str = result_predictors_str[np.argsort(result_score)[0]]
best_pred_bestsubset_basic = list(result_predictors[np.argsort(result_score)[0]])
best_RMSE_bestsubset_basic = np.min(result_score)

predictors_selected['BestSubset_basic'] = best_pred_bestsubset_basic 
interactions_selected['BestSubset_basic'] = None
pred_to_dummies['BestSubset_basic'] = None
predictors_selected['BestSubset_basic'] 
['sq_mt_built',
 'n_rooms',
 'n_bathrooms',
 'n_floors',
 'sq_mt_allotment',
 'floor',
 'has_lift',
 'is_exterior',
 'energy_certificate',
 'has_parking']
interactions_selected['BestSubset_basic']
plt.figure(figsize=(3, 7))
ax = sns.scatterplot(x=result_score, y=result_predictors_str, color='blue', s=75)
ax = sns.scatterplot(x=best_RMSE_bestsubset_basic, y=[best_pred_bestsubset_basic_str], color='red', s=75)
plt.title('Best-subset Selection', size=15)
ax.set_ylabel('Predictors', size=12)
ax.set_xlabel('RMSE', size=12)
x_min = np.min(result_score)
x_max = np.max(result_score)
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 2), fontsize=11)
plt.yticks(fontsize=11)
plt.show()
[Figure: Best-subset selection RMSE, basic models]

Backward and forward selection are not necessary in this case, since we have been able to perform best-subset selection, which explores all the possible combinations of predictors (including the ones explored by backward and forward).

Model selection#

Since we want to compare different models based on their predictive performance, we use train-train (train2) to train them and train-validate (val) to test them, so that we can compare their “real” predictive performance and choose the best one.

We don’t use the test partition in model selection because model selection is part of training: it involves making decisions about the predictive modeling process, and making those decisions with the test data can lead to data leakage.

The test partition will only be used to estimate the future performance of our final model.

def repeated_simple_validation(X, Y, train_size, n_repetitions, seed, pred_to_dummies, interactions_to_add):

    np.random.seed(seed)
    seed_arr = np.random.randint(1, 9999999, n_repetitions)
    RMSE_list = []

    for iter in range(0, n_repetitions) :
        
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=train_size, random_state=seed_arr[iter])
        model = linear_model(Y=Y_train, X=X_train)
        model.fit(pred_to_dummies=pred_to_dummies, interactions_to_add=interactions_to_add)
        try: # If X_test contains categorical variable without some categories that are present in X_train, an error will occur.
            Y_test_hat = model.predict(X_test)
            RMSE_list.append(np.sqrt(np.mean((Y_test - Y_test_hat)**2)))
        except: 
            continue

    RMSE = np.mean(RMSE_list)
    return RMSE
def kfold_cv(X, Y, n_folds, shuffle, seed, pred_to_dummies, interactions_to_add):

    kfold = KFold(n_splits=n_folds, shuffle=shuffle, random_state=seed)
    folds = list(kfold.split(X=X, y=Y))
    RMSE_list = []

    for iter in range(0, n_folds) :

        X_train = X.iloc[folds[iter][0], :] ; X_test = X.iloc[folds[iter][1], :]
        Y_train = Y.iloc[folds[iter][0]] ; Y_test = Y.iloc[folds[iter][1]]
        model = linear_model(Y=Y_train, X=X_train)
        model.fit(pred_to_dummies=pred_to_dummies, interactions_to_add=interactions_to_add)
        try: # If X_test contains categorical variable without some categories that are present in X_train, an error will occur.
            Y_test_hat = model.predict(X_test)
            RMSE_list.append(np.sqrt(np.mean((Y_test - Y_test_hat)**2)))
        except: 
            continue

    RMSE = np.mean(RMSE_list)
    return RMSE
RMSE_rep_sv, RMSE_kfold = {}, {}

for x in predictors_selected.keys() :

    print(x)

    RMSE_rep_sv[x] = repeated_simple_validation(X=X_train[predictors_selected[x]], Y=Y_train, n_repetitions=100,
                                                                 train_size=0.75, seed=123, 
                                                                 pred_to_dummies=pred_to_dummies[x], 
                                                                 interactions_to_add=interactions_selected[x])

    RMSE_kfold[x] = kfold_cv(X=X_train[predictors_selected[x]], Y=Y_train, n_folds=5,
                                              shuffle=True, seed=123, 
                                              pred_to_dummies=pred_to_dummies[x], 
                                              interactions_to_add=interactions_selected[x])
ST_inter_dummies
ST_inter
ST_dummies
ST_basic
LRT_inter_dummies
LRT_inter
LRT_dummies
LRT_basic
BestSubset_inter_dummies
BestSubset_dummies
BestSubset_inter
BestSubset_basic
RMSE_rep_sv_arr = np.array([x for x in RMSE_rep_sv.values()])
RMSE_kfold_arr = np.array([x for x in RMSE_kfold.values()])
Models_arr = np.array(['Linear_' + x for x in RMSE_rep_sv.keys()])

Best_Model_rep_sv = Models_arr[np.argsort(RMSE_rep_sv_arr)[0]]
RMSE_Best_Model_rep_sv = np.min(RMSE_rep_sv_arr)

Best_Model_kfold = Models_arr[np.argsort(RMSE_kfold_arr)[0]]
RMSE_Best_Model_kfold = np.min(RMSE_kfold_arr)
plt.figure(figsize=(4, 4))
ax = sns.scatterplot(x=RMSE_rep_sv_arr, y=Models_arr, color='blue', s=75)
ax = sns.scatterplot(x=RMSE_Best_Model_rep_sv, y=[Best_Model_rep_sv], color='red', s=75)
plt.title('Model Selection - Rep. Simple Validation', size=13)
ax.set_ylabel('Models', size=12)
ax.set_xlabel('RMSE (Repeated simple validation)', size=10)
x_min = np.min(RMSE_rep_sv_arr)
x_max = np.max(RMSE_rep_sv_arr)
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 3), fontsize=11)
plt.yticks(fontsize=10)
plt.show()
[Figure: Model selection, repeated simple validation RMSE]
plt.figure(figsize=(4, 4))
ax = sns.scatterplot(x=RMSE_kfold_arr, y=Models_arr, color='blue', s=75)
ax = sns.scatterplot(x=RMSE_Best_Model_kfold, y=[Best_Model_kfold], color='red', s=75)
plt.title('Model Selection - kfold CV', size=13)
ax.set_ylabel('Models', size=12)
ax.set_xlabel('RMSE (kfold CV)', size=10)
x_min = np.min(RMSE_kfold_arr)
x_max = np.max(RMSE_kfold_arr)
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 3), fontsize=11)
plt.yticks(fontsize=10)
plt.show()
[Figure: Model selection, k-fold CV RMSE]
plt.figure(figsize=(4, 4))
ax = sns.scatterplot(x=RMSE_kfold_arr[RMSE_kfold_arr < 400000], y=Models_arr[RMSE_kfold_arr < 400000], color='blue', s=75)
ax = sns.scatterplot(x=RMSE_Best_Model_kfold, y=[Best_Model_kfold], color='red', s=75)
plt.title('Model Selection - kfold CV', size=13)
ax.set_ylabel('Models', size=12)
ax.set_xlabel('RMSE (kfold CV)', size=10)
x_min = np.min(RMSE_kfold_arr[RMSE_kfold_arr < 400000])
x_max = np.max(RMSE_kfold_arr[RMSE_kfold_arr < 400000])
plt.xticks(np.round(np.linspace(x_min, x_max, 4), 3), fontsize=11)
plt.yticks(fontsize=10)
plt.show()
[Figure: Model selection, k-fold CV RMSE (zoom: RMSE < 400000)]
Best_Model_rep_sv
'Linear_LRT_inter'
RMSE_Best_Model_rep_sv
386539.9511010434
Best_Model_kfold
'Linear_LRT_inter'
RMSE_Best_Model_kfold
388569.913637558
predictors_selected['LRT_inter']
['sq_mt_built',
 'n_rooms',
 'n_bathrooms',
 'n_floors',
 'has_lift',
 'is_exterior',
 'energy_certificate',
 'has_parking']
interactions_selected['LRT_inter']
[('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift')]

Estimation of future performance#

In this section we are going to estimate the future predictive performance of the best model (the one selected in the previous section). To do so, we use the training partition to train the model and the test partition to make predictions and estimate the error, i.e., its future performance.

# statsmodels works with pandas, not Polars.
X_test = X_test.to_pandas()
best_model = linear_model(Y=Y_train, X=X_train[predictors_selected['LRT_inter']])
best_model.fit(pred_to_dummies=None, interactions_to_add=interactions_selected['LRT_inter'])
Y_test_hat = best_model.predict(X_test[predictors_selected['LRT_inter']])
RMSE_future_performance = np.sqrt(np.mean((Y_test.to_numpy() - Y_test_hat)**2))
RMSE_future_performance
423290.67658042104

Prediction intervals#

We have the following property:

\[ y_i \hspace{0.1cm} \sim \hspace{0.1cm} N \left( \beta_0 + x_i^\prime \cdot \beta \hspace{0.05cm},\hspace{0.1cm} \sigma^2 \right) \]
\[ \widehat{y}_i = \widehat{\beta}_0 + x_i^\prime \cdot \widehat{\beta} \hspace{0.1cm} \sim \hspace{0.1cm} N \left( \beta_0 + x_i^\prime \cdot \beta \hspace{0.05cm},\hspace{0.1cm} \sigma^2 \cdot v_{i} \right) \]

where:

  • \(v_{i} \hspace{0.05cm}=\hspace{0.05cm} x_i^\prime \cdot (X^\prime \cdot X)^{-1} \cdot x_i\)

Therefore, we have the following:

\[ y_i - \hat{y}_i \hspace{0.1cm} \sim \hspace{0.1cm} N(0 \hspace{0.05cm},\hspace{0.12cm} \sigma^2 \cdot (1 + v_{i})) \]

where:

  • \(E[y_i - \hat{y}_i] = 0\)

  • \(Var(y_i - \hat{y}_i) = \sigma^2 \cdot (1 + v_{i})\)

Then, we have the following pivotal quantity, with which we are going to build the prediction interval at confidence level \(1-\alpha\):

\[ \dfrac{ ( y_i - \hat{y}_i ) - E[y_i - \hat{y}_i]}{\widehat{s_d}(y_i - \hat{y}_i ) } \hspace{0.1cm} =\hspace{0.1cm} \dfrac{y_i - \hat{y}_i }{\sqrt{\widehat{\sigma}^2 \cdot (1 + v_{i} ) }} \hspace{0.1cm} \sim \hspace{0.1cm} t_{n-p-1} \]

Finally, we get the prediction (probability) interval:

\[ PI\left( \hspace{0.02cm} y_i \hspace{0.02cm} \right)_{1-\alpha}\hspace{0.1cm} =\hspace{0.1cm} \left[\hspace{0.15cm} \hat{y}_i \hspace{0.15cm}\pm\hspace{0.15cm} t_{\alpha/2}^{n-p-1} \cdot \sqrt{ \widehat{\sigma}^2 \cdot (1 + v_{i} ) } \hspace{0.15cm} \right] \]

Observation:

In the last interval, the \(y_i\) term is a random variable and not a parameter, so we cannot talk about a confidence interval for \(y_i\). This is why we call it a probability interval: since \(y_i\) is a random variable, it makes sense to say that the probability that \(y_i\) belongs to the \(PI\left( \hspace{0.02cm} y_i \hspace{0.02cm} \right)_{1-\alpha}\) interval is \(1-\alpha\).

def prediction_intervals(X, Y, pred_to_dummies, interactions_to_add, alpha=0.05, train_size=0.70, seed=123, subset_len=70, figsize=(12,7)):

    if isinstance(X, pl.DataFrame):
        X = X.to_pandas()
    if isinstance(Y, (pl.DataFrame, pl.Series)):
        Y = Y.to_pandas()

    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, train_size=train_size, random_state=seed)

    n = len(X_test)
    p = X_test.shape[1]
    t_alpha_2 = scipy.stats.t.isf(q=alpha/2, df=n-p-1, loc=0, scale=1)

    model = linear_model(Y=Y_train, X=X_train)
    model.fit(pred_to_dummies=pred_to_dummies, interactions_to_add=interactions_to_add)
    Y_test_hat = model.predict(X_test)

    sigma_2_est = np.sum((Y_test - Y_test_hat)**2) / (n-p-1)
    if isinstance(X_test, pd.DataFrame): # convert to NumPy for the matrix algebra below
        X_test = X_test.to_numpy()
    v = np.array([X_test[i,:].T @ np.linalg.inv(X_test.T @ X_test) @ X_test[i,:] for i in range(0,n)])
    L1 = Y_test_hat - t_alpha_2 * np.sqrt(sigma_2_est*(1+v))
    L2 = Y_test_hat + t_alpha_2 * np.sqrt(sigma_2_est*(1+v))

    prop_Y_test_out_interval = np.mean((Y_test < L1) | (Y_test > L2)) 

    plt.figure(figsize=figsize)
    subset_len = subset_len + 1
    ax = sns.lineplot(y=Y_test_hat[0:subset_len], x=np.arange(0,len(Y_test_hat[0:subset_len])), 
                      marker='o', linewidth=1.5, color='blue', markersize=6, label='Response Predictions - Test')
    ax2 = sns.lineplot(y=L1[0:subset_len], x=np.arange(0,len(L1[0:subset_len])), 
                       linestyle='--', linewidth=1, color='green', label='Lower bound - Prediction Interval')
    ax3 = sns.lineplot(y=L2[0:subset_len], x=np.arange(0,len(L2[0:subset_len])), 
                       linestyle='--', linewidth=1, color='fuchsia', label='Upper bound - Prediction Interval')
    ax4 = sns.scatterplot(y=Y_test[0:subset_len], x=np.arange(0,len(Y_test[0:subset_len])), color='red', s=30, label='Response Real Values - Test')
    ax.set_ylabel('Y_test_hat', size=12)
    ax.set_xlabel(f'X_test - Index [0:{subset_len-1}]', size=12)
    plt.xticks(fontsize=11)
    plt.yticks(fontsize=11)
    plt.title(f'Prediction Intervals - Test Set', size=15)
    ax.legend(loc='upper right', bbox_to_anchor=(1.32, 0.95), ncol=1, fontsize=11)
    plt.show()

    print('The percentage of real values of the response (all the red points) that are out of the prediction intervals is', np.round(prop_Y_test_out_interval*100, 2), '%')
prediction_intervals(X[predictors_selected['LRT_inter']], Y, 
                     pred_to_dummies=None, 
                     interactions_to_add=interactions_selected['LRT_inter'], 
                     alpha=0.05, train_size=0.70, seed=123, subset_len=100, figsize=(12,8))
[Figure: Prediction intervals on the test set]
The percentage of real values of the response (all the red points) that are out of the prediction intervals is 3.99 %

Coefficient Interpretation: theory#

Without interactions#

Interpretation of quantitative predictors#

Suppose that \(\mathcal{X}_j\) is a quantitative variable.

We have the following estimation:

\[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_{1}\cdot x_{i1} + \dots + \widehat{\beta}_{j} \cdot x_{ij} + \dots + \widehat{\beta}_{p}\cdot x_{ip}\]
  • If \(\mathcal{X}_j\) increases by \(h>0\) units, in sample terms the new value of the variable \(X_j\) is \(X_j + h\), which we can express as \(X'_j =X_j +h\); then:

    \[\widehat{y}_i | x_{ij}^\prime = x_{ij} + h \hspace{0.12cm} - \hspace{0.12cm} \widehat{y}_i | x_{ij} \hspace{0.12cm} = \hspace{0.12cm} h \cdot \widehat{\beta}_j\]

    Interpretation:

    • If \(\widehat{\beta}_j > 0 \hspace{0.02cm}\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) increases by \(\hspace{0.05cm}h \cdot \widehat{\beta}_j\hspace{0.05cm}\) \(y\)-units if \(x_{ij}\) increases by \(h>0\) units, for \(\small{i=1,\dots,n}\).

    • If \(\widehat{\beta}_j < 0 \hspace{0.2cm}\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) decreases by \(\hspace{0.05cm}h \cdot |\widehat{\beta}_j|\hspace{0.05cm}\) \(y\)-units if \(x_{ij}\) increases by \(h>0\) units, for \(\small{i=1,\dots,n}\).

    Observation: the reasoning is analogous if the predictor decreases rather than increases, but in the opposite direction.

Interpretation of categorical predictors#

Suppose that \(\mathcal{X}_j\) is a categorical variable with \(g\) categories \((Range(\mathcal{X}_j) = \lbrace 0,1,\dots, g-1\rbrace)\).

\(\mathcal{X}_j\) enters the model through \(g-1\) dummy predictors. Considering \(0\) as the reference category, the dummy predictors that enter the model to represent \(\mathcal{X}_j\) are the following:

\[X_{j1},X_{j2},\dots , X_{j(g-1)}\]

with their associated beta parameters:

\[\beta_{j1}, \beta_{j2}, \dots , \beta_{j(g-1)}\]

Where:

\[\begin{split}x_{ijr} = \begin{cases} 1, \quad x_{ij} = r \\ 0, \quad x_{ij} \neq r \end{cases}\end{split}\]

Note that \(x_{ijr} = 1\) is equivalent to \(x_{ij} = r\), and \(x_{ij}=0\) is equivalent to \(x_{ijr}=0\) for all \(\small{r=1,2, \dots, g-1}\).
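As a quick illustration of this \(g-1\) dummy encoding, a minimal pandas sketch (the toy Series below is hypothetical; the project uses its own transform_to_dummies helper):

import pandas as pd

# Toy categorical predictor with g = 3 categories {0, 1, 2}.
s = pd.Series([0, 1, 2, 1, 0], name='house_type')
# drop_first=True drops the reference category 0, leaving g-1 = 2 dummies.
pd.get_dummies(s, prefix='house_type', drop_first=True)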

The estimation would be:

\[\widehat{y}_i = \widehat{\beta}_0 + \widehat{\beta}_{1}\cdot x_{i1} + \dots + \widehat{\beta}_{j1} \cdot x_{i j1} + \widehat{\beta}_{j2} \cdot x_{i j2} + \dots + \widehat{\beta}_{j(g-1)} \cdot x_{i j(g-1)} + \dots + \widehat{\beta}_{p}\cdot x_{ip}\]
  • Comparison between category \(r\) and \(0\) (the reference category):

    \[\widehat{y}_i | x_{ij} = r \hspace{0.12cm} - \hspace{0.12cm} \widehat{y}_i | x_{ij} =0 \hspace{0.12cm} = \hspace{0.12cm} \widehat{\beta}_{jr}\]

    Interpretation:

    • If \(\widehat{\beta}_{jr} > 0 \hspace{0.02cm}\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}\widehat{\beta}_{jr}\hspace{0.1cm}\) \(y\)-units greater if \(x_{ij}=r\) than if \(x_{ij}=0\), for \(\small{i=1,\dots,n}\).

    • If \(\widehat{\beta}_{jr} < 0 \hspace{0.02cm}\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}|\widehat{\beta}_{jr}|\hspace{0.1cm}\) \(y\)-units lower if \(x_{ij}=r\) than if \(x_{ij}=0\), for \(\small{i=1,\dots,n}\).

  • Comparison between categories \(r_1\) and \(r_2\):

    \[\widehat{y}_i | x_{ij} = r_1 \hspace{0.12cm} - \hspace{0.12cm} \widehat{y}_i | x_{ij} = r_2 \hspace{0.12cm} = \hspace{0.12cm} \widehat{\beta}_{jr_1} - \widehat{\beta}_{jr_2}\]

    Interpretation:

    • If \(\widehat{\beta}_{jr_1} -\widehat{\beta}_{jr_2} > 0 \hspace{0.02cm}\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}\widehat{\beta}_{jr_1} -\widehat{\beta}_{jr_2} \hspace{0.1cm}\) \(y\)-units greater if \(x_{ij}=r_1\) than if \(x_{ij}=r_2\), for \(\small{i=1,\dots,n}\).

    • If \(\widehat{\beta}_{jr_1} -\widehat{\beta}_{jr_2} < 0 \hspace{0.02cm}\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}|\widehat{\beta}_{jr_1} -\widehat{\beta}_{jr_2}| \hspace{0.1cm}\) \(y\)-units lower if \(x_{ij}=r_1\) than if \(x_{ij}=r_2\), for \(\small{i=1,\dots,n}\).

With interactions#

Suppose that \(X_{j}\) is quantitative and \(X_h\) is categorical with \(g\) categories \(\lbrace 0, 1, \dots, g-1\rbrace\) and the reference category is \(0\).

And also suppose that the model includes an interaction between \(X_j\) and \(X_h\), then:

\[\begin{split}\widehat{y}_i = \widehat{\beta}_0 + \dots + \widehat{\beta}_j\cdot x_{ij} + \widehat{\beta}_{h1} \cdot x_{ih1} + \dots + \widehat{\beta}_{h(g-1)} \cdot x_{ih(g-1)} + \\[0.15cm] + \widehat{\beta}_{jh1}\cdot x_{ij}\cdot x_{ih1} + \dots + \widehat{\beta}_{jh(g-1)}\cdot x_{ij}\cdot x_{ih(g-1)}\end{split}\]

Where:

  • \(\widehat{\beta}_{jh}\) denotes generically the coefficients of the interaction between \(X_j\) and \(X_h\).

  • \(\widehat{\beta}_{jhr}\) is the coefficient for the interaction between \(X_j\) and \(X_{hr}\), i.e. the dummy variable for category \(r\) of \(X_h\), for \(\small{r=1,2,\dots,g-1}.\)

Quantitative predictors:#
  • If \(\mathcal{X}_j\) increases in \(w>0\) units, in sample terms, the new value of the variable \(X_j\) is \(X_j + w\), and we can express this as \(X'_j =X_j +w\), then:

    \[\begin{split}\widehat{y}_i |x'_{ij} = x_{ij} + w \hspace{0.1cm} - \hspace{0.1cm} \widehat{y}_i |x_{ij} \hspace{0.1cm}=\hspace{0.1cm} w\cdot \widehat{\beta}_j + w\cdot x_{ih1} \cdot \widehat{\beta}_{jh1}+ \dots + w\cdot x_{ih(g-1)} \cdot \widehat{\beta}_{jh(g-1)} \\[0.3cm] = w\cdot \left(\widehat{\beta}_j + x_{ih1} \cdot \widehat{\beta}_{jh1}+ \dots + x_{ih(g-1)} \cdot \widehat{\beta}_{jh(g-1)} \right) = \phi(\cdot)\end{split}\]

    Interpretation:

    Now the interpretation depends not only on the coefficient of the variable, but also on the coefficients and the value of the categorical predictor with which the interaction has been set.

    For an individual with \(x_{ih} = r\), i.e. \(x_{ihr}=1\), then \(\phi(\cdot) = w\cdot (\widehat{\beta}_j + \widehat{\beta}_{jhr})\).

    • If \(\widehat{\beta}_j + \widehat{\beta}_{jhr} > 0\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) increases by \(\hspace{0.05cm}w \cdot (\widehat{\beta}_j + \widehat{\beta}_{jhr})\hspace{0.05cm}\) \(y\)-units if \(x_{ij}\) increases by \(w>0\) units, for \(\small{i=1,\dots,n}\).

    • If \(\widehat{\beta}_j + \widehat{\beta}_{jhr} < 0\), then:

      • \(\widehat{y}_i\) decreases by \(w \cdot |\widehat{\beta}_j + \widehat{\beta}_{jhr}|\hspace{0.05cm}\) \(y\)-units if \(x_{ij}\) increases by \(w>0\) units, for \(\small{i=1,\dots,n}\).

    Observation: the reasoning is analogous if the predictor decreases rather than increases, but in the opposite direction.

Categorical predictors:#
  • Comparison between category \(r\) and \(0\) (the reference category) in \(X_h\):

    \[\widehat{y}_i |x_{ih}=r \hspace{0.1cm} - \hspace{0.1cm} \widehat{y}_i |x_{ih}=0 \hspace{0.1cm}=\hspace{0.1cm} \widehat{\beta}_{hr} + x_{ij} \cdot \widehat{\beta}_{jhr}\]

    Interpretation:

    Now the interpretation depends not only on the coefficient of the variable, but also on the coefficients and the value of the quantitative predictor with which the interaction has been set.

    • If \(\widehat{\beta}_{hr} + x_{ij} \cdot \widehat{\beta}_{jhr} > 0\) , then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}\widehat{\beta}_{hr} + x_{ij} \cdot \widehat{\beta}_{jhr}\hspace{0.05cm}\) \(y\)-units greater if \(x_{ih}=r\) than if \(x_{ih}=0\), for \(\small{i=1,\dots,n}\).

    • If \(\widehat{\beta}_{hr} + x_{ij} \cdot \widehat{\beta}_{jhr} < 0\), then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}|\widehat{\beta}_{hr} + x_{ij} \cdot \widehat{\beta}_{jhr}|\hspace{0.05cm}\) \(y\)-units lower if \(x_{ih}=r\) than if \(x_{ih}=0\), for \(\small{i=1,\dots,n}\).

  • Comparison between categories \(r_1\) and \(r_2\) in \(X_h\):

    \[\widehat{y}_i |x_{ih}=r_1 - \widehat{y}_i |x_{ih}=r_2 = (\widehat{\beta}_{hr_1}- \widehat{\beta}_{hr_2})+ x_{ij} \cdot (\widehat{\beta}_{jhr_1}-\widehat{\beta}_{jhr_2})\]

    Interpretation:

    Now the interpretation depends not only on the coefficient of the variable, but also on the coefficients and the value of the quantitative predictor with which the interaction has been set.

    • If \((\widehat{\beta}_{hr_1}- \widehat{\beta}_{hr_2})+ x_{ij} \cdot (\widehat{\beta}_{jhr_1}-\widehat{\beta}_{jhr_2}) > 0\) , then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}(\widehat{\beta}_{hr_1}- \widehat{\beta}_{hr_2})+ x_{ij} \cdot (\widehat{\beta}_{jhr_1}-\widehat{\beta}_{jhr_2})\hspace{0.05cm}\) \(y\)-units higher if \(x_{ih}=r_1\) than if \(x_{ih}=r_2\), for \(\small{i=1,\dots,n}\).

    • If \((\widehat{\beta}_{hr_1}- \widehat{\beta}_{hr_2})+ x_{ij} \cdot (\widehat{\beta}_{jhr_1}-\widehat{\beta}_{jhr_2}) < 0\) , then:

      • \(\widehat{y}_i\hspace{0.05cm}\) is \(\hspace{0.05cm}|(\widehat{\beta}_{hr_1}- \widehat{\beta}_{hr_2})+ x_{ij} \cdot (\widehat{\beta}_{jhr_1}-\widehat{\beta}_{jhr_2})|\hspace{0.05cm}\) \(y\)-units lower if \(x_{ih}=r_1\) than if \(x_{ih}=r_2\), for \(\small{i=1,\dots,n}\).

Coefficient Interpretation: application#

The best model for prediction may not be the best for interpretation.

In order to make this section more complete, we will use a model with interactions and dummies.

predictors_interpretation = ['sq_mt_built', 'n_rooms', 'n_bathrooms', 'n_floors', 
                              'has_lift', 'is_exterior', 'energy_certificate', 'has_parking', 'house_type']

interactions_interpretation = [('n_bathrooms', 'is_exterior'), ('sq_mt_built', 'is_exterior'), ('sq_mt_built', 'has_lift')]

pred_to_dummies_interpretation = [x for x in cat_predictors if x in predictors_interpretation]
interpretation_model = linear_model(Y=Y_train, X=X_train[predictors_interpretation])
interpretation_model.fit(pred_to_dummies=pred_to_dummies_interpretation, 
                        interactions_to_add=interactions_interpretation)
print(interpretation_model.linear_fit.summary())
                                 OLS Regression Results                                
=======================================================================================
Dep. Variable:              buy_price   R-squared (uncentered):                   0.857
Model:                            OLS   Adj. R-squared (uncentered):              0.857
Method:                 Least Squares   F-statistic:                              4338.
Date:                Fri, 29 Dec 2023   Prob (F-statistic):                        0.00
Time:                        10:12:38   Log-Likelihood:                     -2.1723e+05
No. Observations:               15217   AIC:                                  4.345e+05
Df Residuals:                   15196   BIC:                                  4.347e+05
Df Model:                          21                                                  
Covariance Type:            nonrobust                                                  
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
sq_mt_built                1891.5886    386.822      4.890      0.000    1133.372    2649.806
n_rooms                   -4.214e+04   3423.603    -12.309      0.000   -4.89e+04   -3.54e+04
n_bathrooms                1.486e+05   2.29e+04      6.489      0.000    1.04e+05    1.94e+05
n_floors                   7617.8433   1.13e+04      0.676      0.499   -1.45e+04    2.97e+04
has_lift_1                -1.875e+05   9914.230    -18.914      0.000   -2.07e+05   -1.68e+05
is_exterior_1             -2.974e+04    1.4e+04     -2.120      0.034   -5.72e+04   -2238.597
energy_certificate_1      -2.643e+04   1.61e+04     -1.646      0.100   -5.79e+04    5037.501
energy_certificate_2       1.673e+04   1.84e+04      0.909      0.364   -1.94e+04    5.28e+04
energy_certificate_3      -9561.8350   9600.364     -0.996      0.319   -2.84e+04    9256.031
energy_certificate_4       2.722e+04   1.44e+04      1.891      0.059    -990.865    5.54e+04
energy_certificate_5       8.633e+04   1.96e+04      4.394      0.000    4.78e+04    1.25e+05
energy_certificate_6        4.58e+04   2.15e+04      2.130      0.033    3653.324    8.79e+04
energy_certificate_7       6.181e+04    1.9e+04      3.260      0.001    2.46e+04     9.9e+04
has_parking_1             -6.759e+04   7511.460     -8.998      0.000   -8.23e+04   -5.29e+04
house_type_1               2.941e+05   2.98e+04      9.869      0.000    2.36e+05    3.53e+05
house_type_2              -1.205e+04   2.48e+04     -0.486      0.627   -6.06e+04    3.65e+04
house_type_3              -1.071e+05   2.12e+04     -5.046      0.000   -1.49e+05   -6.55e+04
house_type_4               4.747e+04    1.5e+04      3.167      0.002    1.81e+04    7.69e+04
n_bathrooms:is_exterior_1 -3.459e+04   2.32e+04     -1.492      0.136      -8e+04    1.09e+04
sq_mt_built:is_exterior_1   579.6226    387.280      1.497      0.135    -179.493    1338.738
sq_mt_built:has_lift_1     3204.0080     60.319     53.118      0.000    3085.775    3322.241
==============================================================================
Omnibus:                    11583.203   Durbin-Watson:                   2.015
Prob(Omnibus):                  0.000   Jarque-Bera (JB):          1181206.545
Skew:                           2.950   Prob(JB):                         0.00
Kurtosis:                      45.757   Cond. No.                     3.89e+03
==============================================================================

Notes:
[1] R² is computed without centering (uncentered) since the model does not contain a constant.
[2] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[3] The condition number is large, 3.89e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Interpreting coefficient of sq_mt_built

The model has interactions between sq_mt_built \((X_j)\) and is_exterior \((X_h)\), and between sq_mt_built and has_lift \((X_k)\):

\[\begin{split}\widehat{y}_i | x_{ij}^\prime = x_{ij} + w \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ij} \hspace{0.2cm} = \hspace{0.2cm} w\cdot (\widehat{\beta}_j + x_{ih1}\cdot \widehat{\beta}_{jh1} + x_{ik1}\cdot \widehat{\beta}_{jk1}) \\[0.2cm] \hspace{5.3cm} = \hspace{0.2cm} w\cdot (1891.6 + x_{ih1}\cdot 579.6 + x_{ik1}\cdot 3204)\end{split}\]

Observation: is_exterior and has_lift are binary \(\lbrace 0, 1\rbrace\).

  • If the number of sq meters built of a house increases by \(w=50\) \(m^2\), then:

    • If the house is exterior \((x_{ih1}=1)\) and has a lift \((x_{ik1}=1)\), its estimated price increases by \(50\cdot(1891.6+579.6+3204) \approx 283760\) €

    • If the house is not exterior \((x_{ih1}=0)\) but has a lift \((x_{ik1}=1)\), its estimated price increases by \(50\cdot(1891.6+3204) \approx 254780\) €

    • If the house is exterior \((x_{ih1}=1)\) but doesn’t have a lift \((x_{ik1}=0)\), its estimated price increases by \(50\cdot(1891.6+579.6) \approx 123560\) €

    • If the house is not exterior \((x_{ih1}=0)\) and doesn’t have a lift \((x_{ik1}=0)\), its estimated price increases by \(50\cdot 1891.6 \approx 94580\) €

Summarizing:

When the sq_mt_built of a house increases, the estimated price increases, but the intensity (quantity) of this increase depends on whether the house is exterior and whether it has a lift, because we have established interactions between these predictors.

We can make an intensity ranking, from more to less intensity, depending on the values of is_exterior \((x_{ih1})\) and has_lift \((x_{ik1})\):

\[x_{ih1} = x_{ik1} = 1\]
\[x_{ih1} = 0 \hspace{0.1cm};\hspace{0.1cm} x_{ik1} = 1\]
\[x_{ih1} =1 \hspace{0.1cm};\hspace{0.1cm} x_{ik1} = 0\]
\[x_{ih1} = x_{ik1} = 0\]
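These numbers can be reproduced directly from the fitted coefficients. A minimal sketch, assuming the parameter names match the rows of the summary above:

params = interpretation_model.linear_fit.params

def sq_mt_built_effect(w, is_exterior, has_lift):
    # w * (beta_j + x_ih1 * beta_jh1 + x_ik1 * beta_jk1), per the formula above
    return w * (params['sq_mt_built']
                + is_exterior * params['sq_mt_built:is_exterior_1']
                + has_lift * params['sq_mt_built:has_lift_1'])

sq_mt_built_effect(50, is_exterior=1, has_lift=1)  # ~ 283761 €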

Interpreting coefficient of n_bathrooms

The model has an interaction between n_bathrooms \((X_j)\) and is_exterior \((X_h)\):

\[\begin{split}\widehat{y}_i | x_{ij}^\prime = x_{ij} + w \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ij} \hspace{0.2cm} = \hspace{0.2cm} w\cdot (\widehat{\beta}_j + x_{ih1}\cdot \widehat{\beta}_{jh1} ) \\[0.2cm] \hspace{5.3cm} = \hspace{0.2cm} w\cdot (148600 - x_{ih1}\cdot 34590 )\end{split}\]

Observation: is_exterior is binary \(\lbrace 0, 1\rbrace\).

  • If the number of bathrooms of a house increases by \(w=1\) bathroom, then:

    • If the house is exterior \((x_{ih1}=1)\), its estimated price increases by 114010 €

    • If the house is not exterior \((x_{ih1}=0)\), its estimated price increases by 148600 €

In conclusion:

When the n_bathrooms of a house increases, the estimated price increases as well, but the intensity (quantity) of this increase depends on whether the house is exterior or not, because we have established an interaction between these predictors.

If the house is exterior \((x_{ih1}=1)\), the intensity of the increase is lower than if it is not exterior \((x_{ih1}=0)\).

Interpreting coefficient of n_rooms

Suppose that n_rooms is \(X_j\):

\[\begin{split}\widehat{y}_i | x_{ij}^\prime = x_{ij} + w \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ij} \hspace{0.2cm} = \hspace{0.2cm} w\cdot \widehat{\beta}_j \hspace{0.2cm} \\[0.2cm] \hspace{4.9 cm} = \hspace{0.2cm} -w\cdot 42140\end{split}\]
  • If the number of rooms increases by \(w=1\) room, the estimated house price decreases by 42140 € (holding the remaining predictors, in particular sq_mt_built, fixed).

Interpreting coefficient of n_floors

Let us suppose that n_floors is \(X_j\):

\[\begin{split}\widehat{y}_i | x_{ij}^\prime = x_{ij} + w \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ij} \hspace{0.2cm} = \hspace{0.2cm} w\cdot \widehat{\beta}_j \hspace{0.2cm} \\[0.2cm] \hspace{4.9 cm} = \hspace{0.2cm} w\cdot 7617.84\end{split}\]
  • If the number of floors increases by \(w=1\) floor, the estimated house price increases by 7617.84 €.

Interpreting coefficient of has_lift

The model has an interaction between has_lift \((X_h)\) and sq_mt_built \((X_j)\):

\[\begin{split}\widehat{y}_i | x_{ih} = 1 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih} = 0\hspace{0.2cm} = \hspace{0.2cm} \widehat{\beta}_{h1} + x_{ij}\cdot \widehat{\beta}_{jh1} \\[0.2cm] \hspace{5.3cm} = \hspace{0.2cm} -187500 + x_{ij}\cdot 1891.6\end{split}\]

Observation: has_lift is binary \(\lbrace 0, 1\rbrace\).

  • For houses with 150 sq meters built, the price is 96150 € greater if they have a lift than if not.

  • For houses with 90 sq meters built, the price is 17256 € lower if they have a lift than if not.

In conclusion:

  • For big houses, having a lift increases the estimated price.

  • For small houses, having a lift decreases the estimated price (the threshold size is computed in the sketch below).
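To make the threshold explicit, we can solve \(-187500 + x_{ij}\cdot 1891.6 = 0\) for the break-even built area. A minimal sketch using the rounded coefficients quoted above:

beta_h1 = -187500   # has_lift main effect (rounded)
beta_jh1 = 1891.6   # sq_mt_built x has_lift interaction (rounded)

# Built area at which having a lift stops decreasing the estimated price
x_break = -beta_h1 / beta_jh1
print(f'Break-even size: about {x_break:.1f} square meters built')  # about 99.1 m²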

Interpreting coefficient of is_exterior

The model has an interaction between is_exterior \((X_h)\), n_bathrooms \((X_j)\) and sq_mt_built \((X_k)\):

\[\begin{split}\widehat{y}_i | x_{ih}=1 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih}=0 \hspace{0.2cm} = \hspace{0.2cm}\widehat{\beta}_{h1} + x_{ij}\cdot \widehat{\beta}_{jh1} + x_{ik}\cdot \widehat{\beta}_{kh1} \\[0.2cm] \hspace{5.1cm} = \hspace{0.2cm} -29740-x_{ij}\cdot 34590 + x_{ik}\cdot 579.6\end{split}\]

Observation: is_exterior is binary \((0,1)\).

  • For houses with 2 bathrooms and 100 sq meters built, the price is 40960 € lower if they are exterior.

  • For houses with 2 bathrooms and 200 sq meters built, the price is 17000 € greater if they are exterior.

  • For houses with 3 bathrooms and 200 sq meters built, the price is 17590 € lower if they are exterior.

In conclusion:

  • Houses with few enough bathrooms and a large enough built area are more expensive if they are exterior.

  • Houses with too many bathrooms or too small a built area are more expensive if they are not exterior (the thresholds are computed in the sketch below).
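Again we can make the thresholds explicit by solving \(-29740 - x_{ij}\cdot 34590 + x_{ik}\cdot 579.6 = 0\) for the built area, given the number of bathrooms. A minimal sketch using the rounded coefficients quoted above:

beta_h1, beta_jh1, beta_kh1 = -29740, -34590, 579.6  # rounded coefficients

for n_bath in (1, 2, 3):
    # Built area above which an exterior house becomes more expensive than a non-exterior one
    x_k = -(beta_h1 + n_bath * beta_jh1) / beta_kh1
    print(f'{n_bath} bathroom(s): exterior adds value above about {x_k:.0f} square meters built')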

Interpreting coefficient of energy_certificate

energy_certificate \((X_h)\) is multiclass with 8 categories \(\lbrace 0,1,\dots , 7\rbrace\).

\[\widehat{y}_i | x_{ih}=7 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih}=0 \hspace{0.2cm} = \hspace{0.2cm} \widehat{\beta}_{h7} = 61810\]
\[\widehat{y}_i | x_{ih}=7 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih}=6 \hspace{0.2cm} = \hspace{0.2cm} \widehat{\beta}_{h7} - \widehat{\beta}_{h6} = 16010\]
  • The price of houses with energy certificate 7 is 61810 € greater than that of houses with certificate 0.

  • The price of houses with energy certificate 7 is 16010 € greater than that of houses with certificate 6.

Interpreting coefficient of has_parking

has_parking \((X_h)\) is binary \(\lbrace 0, 1\rbrace\).

\[\widehat{y}_i | x_{ih}=1 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih}=0 \hspace{0.2cm} = \hspace{0.2cm} \widehat{\beta}_{h1} = -67590\]
  • The price of houses with parking is 67590 € lower than that of houses without parking (holding the other predictors fixed).

Interpreting coefficient of house_type

house_type \((X_h)\) is multiclass with 5 categories \(\lbrace 0,1,2,3,4\rbrace\).

\[\widehat{y}_i | x_{ih}=1 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih}=0 \hspace{0.2cm} = \hspace{0.2cm} \widehat{\beta}_{h1} = 294100\]
\[\widehat{y}_i | x_{ih}=1 \hspace{0.2cm} - \hspace{0.2cm} \widehat{y}_i | x_{ih}=4 \hspace{0.2cm} = \hspace{0.2cm} \widehat{\beta}_{h1} - \widehat{\beta}_{h4} = 246630\]
  • The price of houses of type 1 (chalet) is 294100 € greater than that of houses of type 0 (flat).

  • The price of houses of type 1 (chalet) is 246630 € greater than that of houses of type 4 (top floor).
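If we wanted to read such contrasts directly off the fitted results, a sketch could look like the following; note that the dummy column names ('house_type_1', 'house_type_4') are assumptions, since they depend on how the dummies were created for this project:

params = best_model.params
chalet_vs_flat = params['house_type_1']                           # beta_h1 (hypothetical name)
chalet_vs_top = params['house_type_1'] - params['house_type_4']   # beta_h1 - beta_h4 (hypothetical names)
print(chalet_vs_flat, chalet_vs_top)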

Model Problems#

Violation of model assumptions#

In this section we will check if the basic model assumptions are fulfilled.

First, since the assumptions involve the residuals, we estimate them for the best model.

# Predictions of the best model on the training set
Y_train_hat = best_model.predict(X_train[predictors_selected['LRT_inter']])
# Estimated residuals
residuals_est = Y_train - Y_train_hat
Violation of null mean of the residuals#

To check the null mean assumption we can perform a t-test on the residuals, along with a plot like the next one (although in this case it is not very informative due to the large scale of the response variable).

plt.subplots(figsize=(7,5))
ax = sns.scatterplot(y=residuals_est, x=np.arange(0,len(residuals_est)), color='blue')
plt.axhline(y=np.mean(residuals_est), color='r', linestyle='-')
ax.set_ylabel('Estimated Residuals', size=11)
ax.set_xlabel('Index', size=11)
print(f'Residuals mean = {np.mean(residuals_est)}')
plt.show()
Residuals mean = 558.8809228895557
[Figure: estimated residuals vs. observation index, with the residual mean drawn as a horizontal line]

We compute the p-value of the t-test for the null mean of the residuals.

t_statistic, p_value = stats.ttest_1samp(residuals_est, 0)
p_value
0.8583305763037111

For the typical significance level \((\alpha = 0.05)\), we cannot reject that \(E[\varepsilon_i]=0\). Therefore, the null mean assumption of the residuals can be considered fulfilled.

Violation of constant variance of the residuals#

It is not possible to check the constant variance assumption just by inspecting the residuals on their own: some will be large and others small, but that by itself proves nothing.

It is necessary to check whether the residual variance is related to some other quantity, so that for certain values of that quantity the residual variance is larger than for others, which would confirm that it is not constant.

The most useful diagnostic is the scatter plot of \(\widehat{\varepsilon}\) vs \(\widehat{Y}\).

plt.subplots(figsize=(7,5))
ax = sns.scatterplot(x=Y_train_hat, y=residuals_est, marker='o', color='blue', s=20) 
ax.set_ylabel('Estimated Residuals', size=11)
ax.set_xlabel('Y_train_hat', size=11)
plt.show() 
[Figure: estimated residuals vs. fitted values Y_train_hat]
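In this plot, a funnel shape (residual spread growing or shrinking with \(\widehat{Y}\)) would point to non-constant variance. As a complementary formal check we could run a Breusch-Pagan test, which regresses the squared residuals on the predictors. The following is a minimal sketch, assuming the model's design matrix is X_train[predictors_selected['LRT_inter']] with a constant added:

from statsmodels.stats.diagnostic import het_breuschpagan

# Small p-value -> evidence against constant variance (heteroscedasticity)
exog = sm.add_constant(X_train[predictors_selected['LRT_inter']])
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(residuals_est, exog)
print(f'Breusch-Pagan LM p-value = {lm_pvalue}')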

Violation of Normality of the residuals#

To check the normality of the residuals we will compare their histogram with the one they would have if they were perfectly normally distributed.

# Simulated residuals drawn from a normal distribution with the same mean and std as the estimated ones
normal_residuals = np.random.normal(np.mean(residuals_est), np.std(residuals_est), len(residuals_est))
plt.figure(figsize=(8, 5))
sns.histplot(residuals_est, bins=50, color='blue', label='Estimated Residuals')
sns.histplot(normal_residuals, bins=20, color='orange', label='Normal Residuals', alpha=0.5)
plt.legend() 
plt.show()
[Figure: histogram of the estimated residuals overlaid with the simulated normal residuals]

We can also use a qq-plot to test normality graphically.

fig, ax = plt.subplots(figsize=(8, 6))
sm.qqplot(residuals_est, line='q', loc=residuals_est.mean(), scale=residuals_est.std(), ax=ax)
plt.show()
[Figure: normal qq-plot of the estimated residuals]

In this case, hypothesis tests are not very useful due to the large number of observations in our data: with so many observations, even negligible deviations from normality lead to rejection. So our analysis can only rely on this kind of visual tool, which is imprecise.

Violation of null auto-correlation of the residuals#

The null auto-correlation assumption of the residuals can be checked by means of the Durbin-Watson test.

The Durbin-Watson test hypotheses are the following:

\[H_0: \hspace{0.05cm} Corr\left(\varepsilon_{i}\hspace{0.05cm},\hspace{0.05cm} \varepsilon_{i+1}\right) = 0 \hspace{0.05cm} ,\hspace{0.13cm} \forall \hspace{0.1cm}i =1,...,n-1\]
\[\begin{split}H_1: \hspace{0.05cm} Corr\left(\varepsilon_{i}\hspace{0.05cm},\hspace{0.05cm} \varepsilon_{i+1}\right) \neq 0 \hspace{0.05cm} ,\hspace{0.13cm} \exists \hspace{0.1cm}i =1,...,n-1 \\[0.5cm]\end{split}\]

Test statistic:

\[ DW \hspace{0.05cm}=\hspace{0.05cm} \dfrac{\sum_{i=1}^{n-1} (\hat{\varepsilon}_i - \hat{\varepsilon}_{i+1})^2 }{\sum_{i=1}^{n} \hat{\varepsilon}_{i}^2 } \]
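The statistic is straightforward to implement directly from this definition; a minimal sketch:

def durbin_watson_manual(residuals):
    # DW = sum_{i=1}^{n-1} (e_i - e_{i+1})^2 / sum_{i=1}^{n} e_i^2
    residuals = np.asarray(residuals)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# This should match sm.stats.stattools.durbin_watson(residuals_est) computed below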

Interpretation:

The following is fulfilled:

\[ DW \hspace{0.1cm}\simeq \hspace{0.1cm} 2\cdot (1- r_{12}) \]

where:

\(r_{12}\hspace{0.02cm}\) is the linear correlation between \(\hspace{0.02cm}\hat{\varepsilon}_{(1)} = (\hat{\varepsilon}_1 , \hat{\varepsilon}_2 ,..., \hat{\varepsilon}_{n-1})\hspace{0.05cm}\) and \(\hspace{0.05cm}\hat{\varepsilon}_{(2)} = (\hat{\varepsilon}_2 ,...,\hat{\varepsilon}_{n-1}, \hat{\varepsilon}_{n})\)

  • The test statistic lies between \(0\) and \(4\), because \(r_{12}\) is between \(-1\) and \(1\).

  • A statistic of \(DW \simeq 2\) indicates that the residuals are not autocorrelated, i.e., evidence in favor of \(H_0\).

  • A statistic of \(DW\) close to \(0\) indicates that the residuals are positively autocorrelated, i.e., evidence in favor of \(H_1\).

  • A statistic of \(DW\) close to \(4\) indicates that the residuals are negatively autocorrelated, i.e., evidence in favor of \(H_1\).

# Lagged residual vectors: (e_1, ..., e_{n-1}) and (e_2, ..., e_n)
residuals_est = residuals_est.to_numpy()
residuals_est_0 = residuals_est[0:len(residuals_est)-1]
residuals_est_1 = residuals_est[1:len(residuals_est)]
plt.figure(figsize=(8, 5))
sns.scatterplot(x=residuals_est_0, y=residuals_est_1, color='blue', marker='o', s=20)
plt.xlabel('residuals (i)')
plt.ylabel('residuals (i+1)')
plt.show()
[Figure: scatter plot of consecutive residuals, residuals (i) vs residuals (i+1)]
np.corrcoef(residuals_est_0, residuals_est_1)[0,1]
-0.006553447280479784
sm.stats.stattools.durbin_watson(residuals_est, axis=0)
2.012363440433914

The Durbin-Watson statistic is close to \(2\), which indicates no evidence of auto-correlation in the residuals.

Multicollinearity#

The multicollinearity problem occurs when some of the predictors are linearly dependent.

Why is multicollinearity a problem?

  • In the worst case, multicollinearity makes the estimation of the model impossible.

  • In the best case, multicollinearity makes the estimators of the beta coefficients very imprecise, that is, of high variance. This implies that some significant predictors could be flagged as not significant, because the results of the significance tests are affected by the variance of those estimators: the larger the variance of a beta estimator, the smaller the significance test statistic, which makes it easier to fail to reject the null hypothesis that the predictor is not significant.

Perfect Multicollinearity#

At least one of the predictors is a linear combination of the rest, so \(\hspace{0.02cm}Rg(X) \hspace{0.03cm}<\hspace{0.03cm} p+1 \hspace{0.02cm}\); in other words, the \(\mathbf{X}\) matrix does not have full rank.

By the rank-nullity theorem we have the following:

\[Rg\left({X}^\prime \cdot {X}\right) \hspace{0.03cm}<\hspace{0.03cm} p+1\]

Therefore, \(\hspace{0.01cm}\mathbf{X}^\prime \cdot {X}\hspace{0.02cm}\) does not have full rank, so \(\hspace{0.02cm}\left({X}^\prime\cdot {X}\right)^{-1}\hspace{0.02cm}\) does not exist. As a consequence, \(\hspace{0.02cm}(\beta_0,{\beta})\hspace{0.02cm}\) cannot be estimated with the ordinary least squares (OLS) method.
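A minimal numeric illustration on synthetic data: appending a column that is a linear combination of the others makes \(X^\prime X\) rank-deficient, so the normal equations cannot be solved.

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
# Append a 4th column that is a linear combination of the first two
X = np.column_stack([X, X[:, 0] + X[:, 1]])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 3 < 4: X'X does not have full rank
# np.linalg.inv(XtX) is therefore meaningless (singular or numerically near-singular)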

High Multicollinearity#

There are pairs of predictors with high linear correlation between them.

Even in this case the estimation of the beta coefficients is still possible, but the variance of the estimators will be very high, so the estimates will be very imprecise, and significant predictors could appear as not significant in the significance tests.

Multicollinearity Identification#

Multicollinearity identification can be carried out in different ways.

  • If the linear regression model has only quantitative predictors:

    • With the Pearson correlation matrix of the predictors.

    • With the Variance Inflation Factor \((VIF)\).

    • With the condition number of the correlation matrix.

  • If the linear regression model has both quantitative and categorical predictors:

    • With the Generalized Variance Inflation Factor \((GVIF)\).

In this project we will not go beyond the first option, so as not to make the project even more extensive (although a VIF sketch is included below for reference).
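For reference, a VIF computation could look like the following sketch; we assume X_Y_train_df can be converted to pandas and that quant_columns lists the quantitative variables (the response buy_price is excluded before computing VIFs):

from statsmodels.stats.outliers_influence import variance_inflation_factor

X_quant = X_Y_train_df[quant_columns].to_pandas().drop(columns=['buy_price'], errors='ignore')
X_quant = sm.add_constant(X_quant)
# VIF_j measures how much the variance of beta_j is inflated by the other predictors;
# a common rule of thumb flags VIF > 10 as problematic
vifs = {col: variance_inflation_factor(X_quant.values, i)
        for i, col in enumerate(X_quant.columns) if col != 'const'}
print(vifs)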

Multicollinearity identification with the linear correlation matrix#

Assume that \(\hspace{0.02cm}X=(X_1 ,..., X_p)\hspace{0.02cm}\) is a quantitative data matrix.

The linear correlation matrix of \(\hspace{0.02cm}X\hspace{0.02cm}\) is defined as follows:

\[\begin{split} \\[0.5cm] R_{X}= \begin{pmatrix} r_{11} & r_{12}&...&r_{1p}\\ r_{21} & r_{22}&...&r_{2p}\\ ...&...&...&...\\ r_{p1}& r_{p2}&...&r_{pp} \end{pmatrix} = [r_{jh} ]_{\hspace{0.04cm} j,h \hspace{0.02cm} = \hspace{0.02cm} 1,..,p} \end{split}\]

where:

\[ r_{j h} = \frac{s_{jh}}{\hspace{0.07cm} s_j \cdot s_h\hspace{0.07cm} } \]

is the linear correlation coefficient between \(X_j\) and \(X_h\), where \(\hspace{0.02cm} s_j = \sigma(X_j) \hspace{0.02cm}\) is the standard deviation of \(X_j\) and \(\hspace{0.02cm}s_{jh}\hspace{0.02cm}\) is the covariance between \(X_j\) and \(X_h\), i.e.:

\[ s_{jh} \hspace{0.05cm}=\hspace{0.05cm} \frac{\hspace{0.1cm} 1\hspace{0.1cm} }{n} \cdot \sum_{i=1}^{n} \left(\hspace{0.05cm} x_{ij} - \overline{X}_j \hspace{0.05cm}\right)\cdot \left(\hspace{0.05cm} x_{ih} - \overline{X}_h \hspace{0.05cm}\right) \]
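As a quick numeric check of these formulas on synthetic data:

rng = np.random.default_rng(1)
x_j = rng.normal(size=1000)
x_h = 0.8 * x_j + rng.normal(size=1000)

# Covariance and correlation computed exactly as defined above
s_jh = np.mean((x_j - x_j.mean()) * (x_h - x_h.mean()))
r_jh = s_jh / (x_j.std() * x_h.std())
print(np.isclose(r_jh, np.corrcoef(x_j, x_h)[0, 1]))  # True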

Criteria:

We have an estimated linear regression model \(\widehat{Y}=\widehat{\beta}_0 + X\cdot \widehat{\beta} \hspace{0.04cm}\) with only quantitative predictors.

  • If there is any high correlation, namely \(\hspace{0.02cm} | r_{jh} |\hspace{0.02cm} > \hspace{0.02cm} 0.75 \hspace{0.07cm}\) \(\Rightarrow\hspace{0.07cm}\) multicollinearity problems.

corr_matrix_ = corr_matrix(df=X_Y_train_df, quant_col_names=quant_columns, method='pearson')
fig = plt.subplots(figsize=(9,7))
ax = sns.heatmap(corr_matrix_, cmap="Blues", annot=True)
ax.set(xlabel="", ylabel="")
ax.xaxis.tick_top()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.set_yticklabels(ax.get_yticklabels(), rotation=0)
plt.show()
[Figure: heatmap of the Pearson correlation matrix of the quantitative variables]

As we can see, there are pairs of quantitative predictors with high correlation, such as (sq_mt_built, n_bathrooms). Thus, a model that includes both of these predictors may have multicollinearity problems, leading to imprecise estimates of the beta coefficients.