Kaggle: House Prices Predictions Using Advanced Regression Techniques
In this post we will attempt to predict house prices in Ames, Iowa, using 79 explanatory variables describing (almost) every aspect of residential homes.
Overview
1) Loading Data
2) Understand Data
3) Feature Engineering and Data Preprocessing
4) Model Testing and Building
5) Model Optimisation
6) Results
7) Conclusion
Importing Packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.set_option('display.max_rows', None)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
import seaborn as sns
1) Loading Data
df_train = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/train.csv') # importing training set
df_test = pd.read_csv('/kaggle/input/house-prices-advanced-regression-techniques/test.csv') # importing testing set
2) Understand Data
In order to use and accurately predict the data, we must first understand the data and be able to extract information from it properly. Understanding the data allows us to plan ahead and clean the data properly in the following stages.
df_train.head()
df_test.head()
The above outputs briefly show the contents of the data files (test and train). Just by looking at the first 5 observations, we can assume that there are many missing values to be dealt with. Knowing this, we can count how many missing values there are in total for each feature and attempt to solve this problem.
null_values_train = df_train.isnull().sum()
print(null_values_train)
null_values_test = df_test.isnull().sum()
print(null_values_test)
It is clear that our initial assumption was correct and there are many missing values. It would be unwise to use features with too many missing values so we will drop features with more than 50% missing values, and we will attempt to fill in the rest of the values.
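The 50% rule can be made explicit by computing the fraction of missing values per feature. The snippet below is a minimal sketch on a small made-up frame (the toy data and the threshold name are illustrative assumptions, not part of the original notebook); the same two steps would apply to df_train or df_test.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train (values are made up for illustration)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, 60.0],
    'Alley': [np.nan, np.nan, np.nan, 'Pave'],
    'MSZoning': ['RL', 'RL', 'RM', 'RL'],
})

threshold = 0.5  # assumed cut-off: drop features with more than 50% missing

# Fraction of missing values per feature, largest first
missing_fraction = toy.isnull().mean().sort_values(ascending=False)
print(missing_fraction)

# Features whose missing fraction exceeds the threshold
to_drop = missing_fraction[missing_fraction > threshold].index.tolist()
print(to_drop)
```

Working with fractions rather than raw counts makes the same threshold usable on the train and test sets even though they have different numbers of rows.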
3) Feature Engineering and Data Preprocessing
The training and testing data sets that we are using in this report have many null values, and the data needs to be cleaned in order to give acceptable predictions. If a feature has too many null values and/or is not relevant to the prediction, we will drop the feature entirely. If a feature is important and has missing/null values, we will fill those values ourselves. For floats/integers, we will use the mean to fill in the null values, and for objects we will use the mode or another suitable value.
We will be observing the training and testing data simultaneously and will be applying changes on both data sets where needed. This is often easier to do on two separate files, but in this case, we can handle both tasks together in one file.
The test set is missing a column called “SalePrice”, which is the target: we will train on it in the training set and predict it for the test set. Otherwise, both datasets have the same features.
We will go through the features systematically and decide whether each one is worth keeping. In this report, we will only mention the features that are relevant or worth looking at, otherwise the report would be too long!
We can create a few simple functions that will help remove null values with ease. One function is for the training set and one is for the testing set. Though these could have been put together, we can use these separately if/when needed.
# function for filling null values in the training set features using mean or mode.
def fill_null_train(feature, method):
    if method == 'mean':
        df_train[feature] = df_train[feature].fillna(df_train[feature].mean())
        return df_train[feature].isnull().sum()  # remaining nulls in the training set
    elif method == 'mode':
        df_train[feature] = df_train[feature].fillna(df_train[feature].mode()[0])
        return df_train[feature].isnull().sum()  # remaining nulls in the training set
    else:
        return 'Method Error: Choose mean or mode in second input.'
# function for filling null values in the testing set features using mean or mode.
def fill_null_test(feature, method):
    if method == 'mean':
        df_test[feature] = df_test[feature].fillna(df_test[feature].mean())
        return df_test[feature].isnull().sum()
    elif method == 'mode':
        df_test[feature] = df_test[feature].fillna(df_test[feature].mode()[0])
        return df_test[feature].isnull().sum()
    else:
        return 'Method Error: Choose mean or mode in second input.'
# function for dropping a feature in the training set.
def drop_train(feature):
    df_train.drop([feature], axis = 1, inplace = True)
# function for dropping a feature in the testing set.
def drop_test(feature):
    df_test.drop([feature], axis = 1, inplace = True)
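As noted above, the train and test fill helpers could have been put together. One possible combined version is sketched below; fill_null and its df argument are hypothetical names introduced here, and the toy frame is made up. It raises an error for an unknown method instead of returning a message string, which is a design choice of this sketch rather than something the original notebook does.

```python
import numpy as np
import pandas as pd

def fill_null(df, feature, method):
    """Fill nulls in df[feature] with its mean or mode; return remaining null count."""
    if method == 'mean':
        fill_value = df[feature].mean()
    elif method == 'mode':
        fill_value = df[feature].mode()[0]
    else:
        raise ValueError("method must be 'mean' or 'mode'")
    df[feature] = df[feature].fillna(fill_value)
    return df[feature].isnull().sum()

# Toy data (made up) to show both methods
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0],
    'MSZoning': ['RL', None, 'RL'],
})
print(fill_null(toy, 'LotFrontage', 'mean'))  # nulls left in LotFrontage
print(fill_null(toy, 'MSZoning', 'mode'))     # nulls left in MSZoning
```

With this shape, a call such as fill_null(df_test, 'MSZoning', 'mode') would play the role of fill_null_test('MSZoning', 'mode').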
We can now use these functions for our feature engineering. First we fill the features which only have missing values in the test set.
# fill missing values using the mode (since it is an object which can only take a few values)
# do this only for test set since train set already has zero null values in MSZoning
fill_null_test('MSZoning', 'mode')
Now we can fill in the values using the mean when necessary. We only do this if the feature is a float64 value.
# "LotFrontage" is a float64 value so we can fill the null values using the mean.
fill_null_test('LotFrontage', 'mean')
fill_null_train('LotFrontage', 'mean')
Next we can drop the features that are either not useful or have too many missing values to be considered in the prediction.
# we remove these features entirely due to the large amount of null values
drop_train('Alley')
drop_test('Alley')
drop_train('GarageYrBlt')
drop_test('GarageYrBlt')
drop_train('PoolQC')
drop_test('PoolQC')
drop_train('Fence')
drop_test('Fence')
drop_train('MiscFeature')
drop_test('MiscFeature')
Now we can fill in the missing values for all the features that have any, using the mode.
# fill missing values using the mode for both train and test
fill_null_train('BsmtCond', 'mode')
fill_null_test('BsmtCond', 'mode')
fill_null_train('BsmtQual', 'mode')
fill_null_test('BsmtQual', 'mode')
fill_null_train('FireplaceQu', 'mode')
fill_null_test('FireplaceQu', 'mode')
fill_null_train('GarageType', 'mode')
fill_null_test('GarageType', 'mode')
fill_null_train('GarageFinish', 'mode')
fill_null_test('GarageFinish', 'mode')
fill_null_train('GarageQual', 'mode')
fill_null_test('GarageQual', 'mode')
fill_null_train('GarageCond', 'mode')
fill_null_test('GarageCond', 'mode')
fill_null_train('MasVnrType', 'mode')
fill_null_test('MasVnrType', 'mode')
fill_null_train('MasVnrArea', 'mode')
fill_null_test('MasVnrArea', 'mode')
fill_null_train('BsmtExposure', 'mode')
fill_null_test('BsmtExposure', 'mode')
fill_null_train('BsmtFinType1', 'mode')
fill_null_test('BsmtFinType1', 'mode')
fill_null_train('BsmtFinType2', 'mode')
fill_null_test('BsmtFinType2', 'mode')
Here, we fill missing values for features that have nulls only in the training set.
# fill missing values using the mode for only train
fill_null_train('Electrical', 'mode')
Here, we fill missing values for features that have nulls only in the test set.
# fill missing values using the mode for only test
fill_null_test('Utilities', 'mode')
fill_null_test('Exterior1st', 'mode')
fill_null_test('Exterior2nd', 'mode')
fill_null_test('BsmtFullBath', 'mode')
fill_null_test('BsmtHalfBath', 'mode')
fill_null_test('KitchenQual', 'mode')
fill_null_test('Functional', 'mode')
fill_null_test('SaleType', 'mode')
# fill missing values using the mean for only test
fill_null_test('BsmtFinSF1', 'mean')
fill_null_test('BsmtFinSF2', 'mean')
fill_null_test('BsmtUnfSF', 'mean')
fill_null_test('TotalBsmtSF', 'mean')
fill_null_test('GarageCars', 'mean')
fill_null_test('GarageArea', 'mean')
Now that we have removed or filled all the null values, we can check if there are any remaining null values in any of the data.
df_train.isnull().sum()
df_test.isnull().sum()
We can now handle the categorical features by creating a list of their names that will be used later.
# Creating feature set to be used later
columns = ['MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood',
'Condition2', 'BldgType', 'Condition1', 'HouseStyle', 'SaleType', 'SaleCondition', 'ExterCond', 'ExterQual',
'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'RoofStyle', 'RoofMatl',
'Exterior1st', 'Exterior2nd', 'MasVnrType', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive'
]
len(columns)
39
We now create a function that one-hot encodes the categorical features, replacing each one with indicator (dummy) columns. We use the above feature set here.
# function that one-hot encodes each listed feature of final_df
def category_onehot_multcols(multcolumns):
    df_final=final_df
    i=0
    for fields in multcolumns:
        print(fields)
        # indicator columns for this feature; the first category is dropped
        df1=pd.get_dummies(final_df[fields],drop_first=True)
        # remove the original categorical column
        final_df.drop([fields],axis=1,inplace=True)
        if i==0:
            df_final=df1.copy()
        else:
            df_final=pd.concat([df_final,df1],axis=1)
        i=i+1
    # append all indicator columns to the remaining features
    df_final=pd.concat([final_df,df_final],axis=1)
    return df_final
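To see what a single step of this function does, here is a minimal sketch of pd.get_dummies with drop_first=True on a made-up column (the toy values are assumptions for illustration, not rows from the dataset).

```python
import pandas as pd

# Toy categorical column (made-up values) standing in for a feature such as MSZoning
toy = pd.DataFrame({'MSZoning': ['RL', 'RM', 'RL', 'FV']})

# One indicator column per category; drop_first=True drops the first
# category (alphabetically 'FV' here), which becomes the implicit baseline
dummies = pd.get_dummies(toy['MSZoning'], drop_first=True)
print(dummies)
```

A row with neither indicator set therefore represents the dropped category, so no information is lost by removing that column.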
We make a copy of the original dataframe before we concatenate the train and test data into a single dataset.
# copy of the original dataframe
main_df = df_train.copy()
# dataframe containing both datasets
final_df = pd.concat([df_train, df_test], axis = 0)
final_df.shape
(2919, 76)
Now we can use the function created earlier to one-hot encode the categorical features in this new dataframe.
final_df = category_onehot_multcols(columns)
final_df.shape
(2919, 237)
We can now remove any columns with duplicated names and view the new dataframe.
final_df = final_df.loc[:,~final_df.columns.duplicated()]
final_df
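Duplicate column names arise at this point because pd.get_dummies, when applied to a single column, names each indicator after the category value alone, and several features share the same codes. Keeping only the first column for each name means the later ones are discarded. The sketch below shows the effect on made-up data, together with the prefix argument of pd.get_dummies as one possible way to keep the columns distinct (an alternative, not what the notebook does).

```python
import pandas as pd

# Two made-up features that share the same category codes
toy = pd.DataFrame({
    'ExterQual': ['Gd', 'TA', 'TA'],
    'KitchenQual': ['TA', 'Gd', 'TA'],
})

# Without a prefix, both features produce a column called 'TA'
a = pd.get_dummies(toy['ExterQual'], drop_first=True)
b = pd.get_dummies(toy['KitchenQual'], drop_first=True)
combined = pd.concat([a, b], axis=1)
print(list(combined.columns))

# Keeping only the first occurrence of each name drops KitchenQual's 'TA'
deduped = combined.loc[:, ~combined.columns.duplicated()]
print(list(deduped.columns))

# With a prefix, the column names stay unique and nothing is discarded
prefixed = pd.concat(
    [pd.get_dummies(toy[col], prefix=col, drop_first=True) for col in toy.columns],
    axis=1,
)
print(list(prefixed.columns))
```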
We can now split this dataframe into the train and test datasets.
df_Train = final_df.iloc[:1460,:]
df_Test = final_df.iloc[1460:,:]
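We now separate the “SalePrice” column from the training set, since it is the target we want to predict and must not be used as an input feature. In the test portion this column contains only null values, so it is dropped there before predicting.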
X_train = df_Train.drop(['SalePrice'], axis = 1).values
y_train = df_Train['SalePrice'].values
By doing this we have separated the training features and target from the test features, which we can now use to build a model and generate predictions.
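The split relies on the training rows coming first in the concatenated frame. Below is a small sketch of the same idea on made-up data, using the length of the training frame instead of a hard-coded row count (the toy_* and n_train names are illustrative assumptions).

```python
import pandas as pd

# Made-up stand-ins for the train and test sets; the test set has no SalePrice
toy_train = pd.DataFrame({'Id': [1, 2, 3], 'LotArea': [8450, 9600, 11250],
                          'SalePrice': [208500, 181500, 223500]})
toy_test = pd.DataFrame({'Id': [4, 5], 'LotArea': [11622, 14267]})

n_train = len(toy_train)  # number of training rows, used for the split below

# Stack train on top of test; SalePrice is NaN for the test rows
combined = pd.concat([toy_train, toy_test], axis=0)

# Split back by position: the first n_train rows are the training set
part_train = combined.iloc[:n_train, :]
part_test = combined.iloc[n_train:, :]

# Features and target for training; the test part only provides features
X_train = part_train.drop(['SalePrice'], axis=1).values
y_train = part_train['SalePrice'].values
X_test = part_test.drop(['SalePrice'], axis=1).values
print(X_train.shape, y_train.shape, X_test.shape)
```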
4) Model Testing and Building
Here we build the model and experiment with different models, mostly using default parameters.
from sklearn.model_selection import cross_val_score
from sklearn import linear_model
from sklearn.linear_model import RidgeCV
import xgboost as xgb
from sklearn import tree
from sklearn.linear_model import Lasso
from xgboost import XGBClassifier
from xgboost import XGBRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
We will test a few of the models using 5 fold cross validation to get a baseline model to see which one performs the best and delivers the most accurate results.
# Decision Tree Regressor
dt = tree.DecisionTreeRegressor()
cv = cross_val_score(dt, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())
mean_dt = round(cv.mean(), 4)
# Bayesian Ridge Regression
brr = linear_model.BayesianRidge()
cv = cross_val_score(brr, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())
mean_brr = round(cv.mean(), 4)
# Ridge CV
rdg = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10))
cv = cross_val_score(rdg, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())
mean_rdg = round(cv.mean(), 4)
# Extreme Gradient Boosting
# use a separate name so the imported xgb module is not overwritten
xgb_model = XGBRegressor(n_estimators=340, max_depth=2, learning_rate=0.2)
cv = cross_val_score(xgb_model, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())
mean_xgb = round(cv.mean(), 4)
# Lasso Regression
lso = Lasso(alpha=0.1,random_state=0)
cv = cross_val_score(lso, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())
mean_lso = round(cv.mean(), 4)
The table above shows the average cross-validation score for each model (for regressors, cross_val_score reports the R² score by default, not accuracy), which gives us an idea of which model would be optimal for our case.
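When no scoring argument is given, cross_val_score uses the estimator's own score method, which for scikit-learn regressors is R². The following is a minimal sketch on synthetic data showing the default next to an explicit RMSE-based scoring; the data from make_regression and the Ridge settings are assumptions for illustration only, not the house price data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression problem (not the house price data)
X, y = make_regression(n_samples=200, n_features=5, noise=1.0, random_state=0)
model = Ridge(alpha=1.0)

# Default scoring for a regressor: R^2, higher is better
r2_scores = cross_val_score(model, X, y, cv=5)
print(r2_scores.mean())

# Explicit scoring: negated RMSE, so values are <= 0 and closer to 0 is better
neg_rmse_scores = cross_val_score(model, X, y, cv=5,
                                  scoring='neg_root_mean_squared_error')
print(-neg_rmse_scores.mean())
```

An error-based scoring of this kind is closer in spirit to the leaderboard metric than R², although it is still computed on raw prices here.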
5) Model Optimisation
From the baseline models, it is clear that the Extreme Gradient Boosting model performed the best out of the group in the 5-fold cross validation test. We can now optimise this model to try and predict the data.
Attempt 1: XGBoost
On the first attempt, I have used the XGBoost model with the parameters shown below.
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
# Hyperparameters
params = {
'objective': 'reg:squarederror', # MSE as loss function
'eval_metric': 'rmse', # RMSE as metric
'eta': 0.3, # Learning rate
}
# model
model = XGBRegressor(**params)
# Fit the model to the data
model.fit(X_train, y_train)
# Predictions
y_test = model.predict(df_Test.drop(['SalePrice'], axis = 1).values)
Attempt 2: RidgeCV
For the second attempt, I have used the RidgeCV model with the parameters shown below.
rdg = RidgeCV(alphas=(0.01, 0.05, 0.1, 0.2, 0.3, 0.35, 0.39, 0.4, 0.41, 0.45, 0.5, 1, 3, 5, 10), normalize=True)
rdg_model = rdg.fit(X_train, y_train)
rdg_model.alpha_
0.39
# Predictions
y_test = rdg_model.predict(df_Test.drop(['SalePrice'], axis = 1).values)
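In scikit-learn 1.2 and later, RidgeCV no longer accepts the normalize argument. One possible substitute is to standardise the features in a pipeline first. This is a sketch on synthetic data; StandardScaler is not numerically identical to the old normalize=True behaviour, so the selected alpha may differ from the 0.39 reported above.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X_train / y_train
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

alphas = (0.01, 0.05, 0.1, 0.3, 1, 3, 5, 10)

# Scale the features, then let RidgeCV pick alpha by cross-validation
pipe = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas))
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name
chosen_alpha = pipe.named_steps['ridgecv'].alpha_
print(chosen_alpha)
```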
Attempt 3: XGBoost (Tuned)
Here, I have used the XGBoost model again with some better hyperparameter tuning.
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
# Hyperparameters
params = {
'objective': 'reg:squarederror', # MSE as loss function
'eval_metric': 'rmse', # RMSE as metric
'eta': 0.09, # Learning rate
}
# model
model = XGBRegressor(**params)
# Fit the model to the data
model.fit(X_train, y_train)
# Predictions
y_test = model.predict(df_Test.drop(['SalePrice'], axis = 1).values)
Attempt 4: XGBoost (Tuned -Version2)
Finally I have used the XGBoost model as my submitted model with more complex hyperparameter tuning that fits the data much better.
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
# Hyperparameters
params = {
'objective': 'reg:squarederror', # MSE as loss function
'eval_metric': 'rmse', # RMSE as metric
'eta': 0.09, # Learning rate
'max_depth': 5,
'min_child_weight':4,
'colsample_bytree':0.9,
}
# model
model = XGBRegressor(**params)
# Fit the model to the data
model.fit(X_train, y_train)
# Predictions
y_test = model.predict(df_Test.drop(['SalePrice'], axis = 1).values)
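The hyperparameters above were set by hand. GridSearchCV, which is imported earlier but not used, could automate such a search. The sketch below runs on synthetic data and uses scikit-learn's GradientBoostingRegressor as a stand-in so that it does not depend on xgboost; the grid values are arbitrary assumptions, not the values tried in the notebook. XGBRegressor follows the scikit-learn estimator interface, so it could in principle be passed in the same way.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train
X, y = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

# Small, arbitrary grid over two parameters
param_grid = {
    'learning_rate': [0.05, 0.1],
    'max_depth': [2, 3],
}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring='neg_root_mean_squared_error',  # negated RMSE: closer to 0 is better
)
search.fit(X, y)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best combination
```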
Submitting
submission = pd.DataFrame({'Id':df_test.Id,'SalePrice':y_test})
submission.to_csv('submission.csv', index=False)
print('Submitted!')
Submitted!
6) Results
These are the final results for all the different attempts and their respective scores.
Attempt 1: XGBoost
This is the result of the first attempt:
print(y_test)
The score for this attempt was: 0.14349.
The lower the score, the better the accuracy of the model.
This score placed me at rank 2580 which is roughly in the middle (5158 total at the time). There is room for improvement.
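For context on why a lower score is better: the competition evaluates a submission by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. Below is a minimal sketch of that metric on made-up numbers (the function name rmse_log and the prices are illustrative assumptions).

```python
import numpy as np

def rmse_log(y_true, y_pred):
    """RMSE between log(predicted) and log(observed) prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2))

# Made-up prices: one prediction 10% too high, one 10% too low
observed = [100000, 200000]
predicted = [110000, 180000]
print(rmse_log(observed, predicted))

# A perfect prediction gives a score of 0
print(rmse_log(observed, observed))
```

Because the error is taken on the log scale, being off by a given percentage costs about the same for a cheap house as for an expensive one.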
Attempt 2: RidgeCV
This is the result of the second attempt:
print(y_test)
The score for this attempt was: 0.15328.
This attempt was clearly inferior to the first attempt, which means that the Extreme Gradient Boosting model is still the best one so far.
Attempt 3: XGBoost (Tuned)
This is the result of the third attempt:
print(y_test)
The score for this attempt was: 0.13986.
This attempt has been the best one so far and places me at rank 2338 (up by 758 positions).
Attempt 4: XGBoost (Tuned -Version2)
This is the result of the fourth and final attempt:
print(y_test)
The score for this attempt was: 0.13411.
This attempt has given the best score and places me at rank 1833 (up by 505 positions).
Table of the model attempts and their respective scores:
Model | Score |
---|---|
XGB | 0.14349 |
RidgeCV | 0.15328 |
XGB (Tuned) | 0.13986 |
XGB (Tuned -v2) | 0.13411 |
The final score of 0.13411 places me at rank 1833 of 5339 (at the time of submission), which is approximately top 34% of all submissions.
7) Conclusion
On the first attempt, XGBoost was used, as it was the best performing model from the baseline testing results. The model was adjusted from the one used in testing by setting the evaluation metric and the learning rate. The score (0.14349) obtained from this was acceptable, but it was clear that there was room for improvement.
On the second attempt, other models that gave good baseline testing results, similar to the XGBoost model, were optimised and used, but none of them returned a better score than the first attempt. Of these models, RidgeCV gave the best score (0.15328), which was still clearly worse than the first attempt (a lower score is better) and was therefore not acceptable. This meant that XGBoost was the best model from the ones tested (as the baseline model testing had suggested), and it needed to be properly optimised.
On attempt 3, the parameters associated with XGBoost were tweaked, particularly the learning rate parameter (eta). This resulted in a better score (0.13986) and confirmed that parameter optimisation was the key to higher accuracy and ultimately a better score. With this new-found knowledge, the optimisation continued.
On attempt 4, the final attempt, the XGBoost model was refined further and several new parameters were tuned, particularly the tree-specific parameters. These include maximum depth, minimum child weight and column sample by tree. This returned a much better score (0.13411), which gave an increase of 1263 total positions from the first attempt, and was in the top 34% of all submissions.
After attempt 4, parameter optimisation was attempted with new and different parameters, but the score did not improve, so the score from attempt 4 was used as the final score. To make any improvements to the score, the model would need to be changed to a better performing model (reiterate from model building step onwards) which would be too time consuming. Deep learning models or better machine learning models could be used to improve the predictions, but research would be required to figure this out.