Kaggle: Titanic Survivor Predictions

8 minute read

In this post we will attempt to solve the infamous Kaggle Titanic Machine Learning problem in an organised fashion.

Overview

1) Loading Data

2) Understanding Data

3) Visualising Data

4) Feature Engineering

5) Data Preprocessing

6) Data Scaling

7) Model Testing and Building

8) Model Optimisation

9) Results

10) Conclusion

Importing Packages

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
from itertools import cycle
cycol = cycle('bgrcmk')

1) Loading Data

df_train = pd.read_csv('/kaggle/input/titanic/train.csv') # importing training set
df_test = pd.read_csv('/kaggle/input/titanic/test.csv') # importing test set
df_train['train_test'] = 1
df_test['train_test'] = 0
df_test['Survived'] = np.NaN
df_all = pd.concat([df_train,df_test])

%matplotlib inline
df_all.columns

2) Understanding Data

Before we can build anything useful, we must first understand the data and be able to extract information from it properly.

df_train.head()

df_train.describe()


The output of df_train.describe() shows summary statistics for each of the numerical attributes of the dataset. There are 891 passengers in the train.csv file, and the mean of the Survived column is 0.3838. Since a survivor is coded as 1 and a casualty as 0, this mean is simply the number of survivors divided by the total number of passengers, so 38.38% of passengers survived.
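As a quick check on that reasoning, the same arithmetic can be done directly (a minimal sketch):

# number of survivors divided by the total number of passengers
print(df_train['Survived'].sum() / len(df_train)) # approximately 0.3838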
We can now analyse individual features. The gender_submission.csv baseline assumes that all female passengers survived (and that no male passengers did). We can check how reasonable this assumption is in the train.csv dataset.

women = df_train[df_train.Sex == 'female']['Survived']

perc_women = sum(women)/len(women) * 100

print(perc_women)

74.20382165605095

This shows that 74.2% of women survived, which indicates that sex is a relevant factor for survival. It also means that the gender_submission.csv baseline is not a bad assumption to make, and it performs reasonably well.
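The same comparison can be made for both sexes at once with a groupby (a minimal sketch using the same dataframe):

# mean survival rate per sex; 'female' should come out around 0.742
print(df_train.groupby('Sex')['Survived'].mean())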

3) Visualising Data

By visualising the data, we can see more clearly how it behaves and what to look out for. We can do this with graphs such as histograms, bar charts, and scatter plots; different kinds of data call for different kinds of graphs.

# Look at numeric and categorical values separately 
df_num = df_train[['Age', 'SibSp', 'Parch', 'Fare']] # histograms
df_cat = df_train[['Survived', 'Pclass', 'Sex', 'Ticket', 'Cabin', 'Embarked']] # value counts
for col in df_num.columns:
    plt.hist(df_num[col])
    plt.title(col)
    plt.show()


# Compare survival rate across Age, SibSp, Parch, and Fare
pd.pivot_table(df_train, index = 'Survived', values = ['Age', 'SibSp', 'Parch', 'Fare'])


The pivot table above shows how the average value of each of these attributes differs between survivors and casualties.

for col in df_cat.columns:
    plt.bar(df_cat[col].value_counts().index, df_cat[col].value_counts())
    plt.title(col)
    plt.xticks(rotation=90)
    plt.show()


# Compare survival counts across Pclass, Sex, and Embarked (Ticket is only used as the column to count)
print(pd.pivot_table(df_train, index = 'Survived', columns = 'Pclass', values = 'Ticket', aggfunc = 'count'))
print()
print(pd.pivot_table(df_train, index = 'Survived', columns = 'Sex', values = 'Ticket', aggfunc = 'count'))
print()
print(pd.pivot_table(df_train, index = 'Survived', columns = 'Embarked', values = 'Ticket', aggfunc = 'count'))


4) Feature Engineering

Engineering and simplifying features helps represent the data more effectively for analysis and model development. Feature engineering can include splitting features, aggregating them, or combining them to create new ones.
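For example, one common combined feature on this dataset, shown here purely as an illustration and not used in the rest of the post, is family size built from SibSp and Parch:

# hypothetical combined feature: family members aboard, including the passenger themselves
family_size = df_train['SibSp'] + df_train['Parch'] + 1
print(family_size.value_counts())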

# create categories based on how many cabins each passenger had

df_cat.Cabin

df_train['cabin_multiple'] = df_train.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
                                                                                          
df_train['cabin_multiple'].value_counts()

pd.pivot_table(df_train, index = 'Survived', columns = 'cabin_multiple', values = 'Ticket', aggfunc = 'count')


It is clear that most people did not have multiple cabins.

# creates categories based on the cabin letter (n = null)

df_train['cabin_letter'] = df_train.Cabin.apply(lambda x: str(x)[0])
pd.pivot_table(df_train, index = 'Survived', columns = 'cabin_letter', values = 'Name', aggfunc = 'count')


Most of the people in the null column ('n') did not survive, while the survival rate is noticeably higher in cabins B to E. Given how consistent this is, we can comfortably use the cabin letter as a categorical variable. This gives better insight while reducing the number of distinct cabin values.
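To quantify this, the survival rate per cabin letter can be computed directly (a sketch reusing the cabin_letter column created above):

# mean survival rate per cabin letter; 'n' is the group with no recorded cabin
print(df_train.groupby('cabin_letter')['Survived'].mean())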

# understand ticket values better 
# numeric vs non numeric

df_train['numeric_ticket'] = df_train.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
# extract the letter prefix of the ticket (0 if the ticket is purely numeric)
df_train['ticket_letters'] = df_train.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) > 0 else 0)
df_train['numeric_ticket'].value_counts()


# optionally show all rows of the dataframe instead of a truncated view (for convenience)
# pd.set_option('display.max_rows', None)
df_train['ticket_letters'].value_counts()


# difference between numeric and non-numeric survival rate
pd.pivot_table(df_train, index = "Survived", columns = 'numeric_ticket', values = 'Ticket', aggfunc = 'count')


Nothing of relevance or importance here.

# survival rate across different ticket types
pd.pivot_table(df_train, index = 'Survived', columns = 'ticket_letters', values = 'Ticket', aggfunc = 'count')


Nothing of relevance or importance here either.

# feature engineering on person's title
df_train.Name.head(50)
df_train['name_title'] = df_train.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())
df_train['name_title'].value_counts()


pd.pivot_table(df_train, index = 'Survived', columns = 'name_title', values = 'Name', aggfunc = 'count')


5) Data Preprocessing

Data preprocessing transforms the raw data to make it more usable and useful. Here, the data is cleaned and missing values are filled in or otherwise handled.
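Before filling anything in, it helps to count how many values are actually missing in each column (a minimal sketch on the combined dataframe):

# missing values per column (Survived is NaN for the test rows by construction)
print(df_all.isnull().sum())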

# create all the categorical variables from the previous steps for both the train and test sets
# most of this is taken from the code above (cabin_letter is named cabin_adv here)
df_all['cabin_multiple'] = df_all.Cabin.apply(lambda x: 0 if pd.isna(x) else len(x.split(' ')))
df_all['cabin_adv'] = df_all.Cabin.apply(lambda x: str(x)[0])
df_all['numeric_ticket'] = df_all.Ticket.apply(lambda x: 1 if x.isnumeric() else 0)
df_all['ticket_letters'] = df_all.Ticket.apply(lambda x: ''.join(x.split(' ')[:-1]).replace('.','').replace('/','').lower() if len(x.split(' ')[:-1]) > 0 else 0)
df_all['name_title'] = df_all.Name.apply(lambda x: x.split(',')[1].split('.')[0].strip())

# fill nulls in the continuous columns
df_all.Age = df_all.Age.fillna(df_train.Age.median()) # take median of age to fill because it is not normally distributed
df_all.Fare = df_all.Fare.fillna(df_train.Fare.median()) # take median of fare to fill because it is not normally distributed
# drop rows with a null 'Embarked' value. This only happens twice in the training set and never in the test set.
df_all.dropna(subset=['Embarked'], inplace = True)
# log transform of fare (log(x + 1)) to reduce skew
df_all['norm_fare'] = np.log(df_all.Fare+1)
df_all['norm_fare'].hist()


# convert Pclass to a string so it is treated as a categorical variable
df_all.Pclass = df_all.Pclass.astype(str)
# create dummy variables from the categorical columns
df_dummies = pd.get_dummies(df_all[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'norm_fare', 'Embarked', 'cabin_adv', 'cabin_multiple', 'numeric_ticket', 'name_title', 'train_test']])
# split to train test again
X_train = df_dummies[df_all.train_test == 1].drop(['train_test'], axis = 1)
X_test = df_dummies[df_all.train_test == 0].drop(['train_test'], axis = 1)

y_train = df_all[df_all.train_test == 1].Survived

6) Data Scaling

Data scaling adjusts the range of the numerical features so that no single feature dominates simply because of its units. Here we use StandardScaler, which transforms each selected column to have zero mean and unit variance.
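Concretely, StandardScaler standardises each column as z = (x - mean) / std. A minimal sketch of the equivalent manual computation on a single column:

# manual standardisation of Age; ddof=0 matches the population std used by StandardScaler
age = df_dummies['Age']
print(((age - age.mean()) / age.std(ddof=0)).describe())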

from sklearn.preprocessing import StandardScaler
df_scaler = StandardScaler()
df_dummies_scaled = df_dummies.copy()
df_dummies_scaled[['Age', 'SibSp', 'Parch', 'norm_fare']] = df_scaler.fit_transform(df_dummies_scaled[['Age', 'SibSp', 'Parch', 'norm_fare']])
df_dummies_scaled


X_train_scaled = df_dummies_scaled[df_dummies_scaled.train_test == 1].drop(['train_test'], axis = 1)
X_test_scaled = df_dummies_scaled[df_dummies_scaled.train_test == 0].drop(['train_test'], axis = 1)

y_train = df_all[df_all.train_test == 1].Survived

7) Model Testing and Building

Here we build and test several different models using their default parameters.

# import all necessary packages from sklearn
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

We will try each model with 5-fold cross-validation to get baseline scores and see which one performs best.

# We use Naive Bayes as a baseline for the classification tasks
gnb = GaussianNB()
cv = cross_val_score(gnb, X_train_scaled, y_train, cv = 5)
print(cv)
print(cv.mean())


# Logistic Regression test
lr = LogisticRegression(max_iter = 1000)
cv = cross_val_score(lr, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())


# Decision Tree test
dt = tree.DecisionTreeClassifier(random_state = 1)
cv = cross_val_score(dt, X_train_scaled, y_train, cv = 5)
print(cv)
print(cv.mean())


# KNN test
knn = KNeighborsClassifier()
cv = cross_val_score(knn, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())


# Random Forest test
rf = RandomForestClassifier(random_state = 1)
cv = cross_val_score(rf, X_train, y_train, cv = 5)
print(cv)
print(cv.mean())


# Support Vector Machine
svc = SVC(probability = True)
cv = cross_val_score(svc, X_train_scaled, y_train, cv = 5)
print(cv)
print(cv.mean())


# Extreme Gradient Boosting
xgb = XGBClassifier(random_state = 1)
cv = cross_val_score(xgb, X_train_scaled, y_train, cv = 5)
print(cv)
print(cv.mean())


Comparing the mean cross-validation scores above, it is clear that the Support Vector Machine (SVC) performed the best of the group in the 5-fold cross-validation test.
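A comparison table of these baselines can be assembled with a short loop over the estimators defined above (a sketch; it pairs each model with the same feature set used in its test above):

# mean 5-fold cross-validation accuracy per model
models = {'gnb': (gnb, X_train_scaled), 'lr': (lr, X_train), 'dt': (dt, X_train_scaled),
          'knn': (knn, X_train), 'rf': (rf, X_train), 'svc': (svc, X_train_scaled),
          'xgb': (xgb, X_train_scaled)}
results = {name: cross_val_score(model, X, y_train, cv = 5).mean() for name, (model, X) in models.items()}
print(pd.Series(results).sort_values(ascending = False))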

8) Model Optimisation

The VotingClassifier combines the predictions of several models. With “hard” voting, each classifier gets one vote (“survived” or “not survived”) and the result is simply the majority vote.

With “soft” voting, the predicted probabilities of the models are averaged. If the average probability of class 1 is greater than 50%, the passenger is predicted to have survived.

from sklearn.ensemble import VotingClassifier
voting_cl = VotingClassifier(estimators = [('lr', lr), ('knn', knn), ('rf', rf), ('dt', dt), ('gnb', gnb), ('svc', svc)], voting = 'soft')

cv = cross_val_score(voting_cl, X_train_scaled, y_train, cv = 5)
print(cv)
print(cv.mean())
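For comparison, the hard-voting variant described above can be evaluated the same way (a sketch; same estimators, only the voting argument changes):

# hard voting: each classifier casts one vote and the majority class wins
voting_hard = VotingClassifier(estimators = [('lr', lr), ('knn', knn), ('rf', rf), ('dt', dt), ('gnb', gnb), ('svc', svc)], voting = 'hard')
cv_hard = cross_val_score(voting_hard, X_train_scaled, y_train, cv = 5)
print(cv_hard.mean())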


voting_cl.fit(X_train_scaled, y_train)

# predict on the test set and keep the predictions as an integer Series
y_pred = pd.Series(voting_cl.predict(X_test_scaled).astype(int))

9) Results

These are the final results for the predictions.

print(y_pred)


print('The survival rate of the predicted passengers is: ', "%.2f" % (sum(y_pred) / len(y_pred) * 100), '%')


submission = {'PassengerId': df_test.PassengerId, 'Survived': y_pred}

submission = pd.DataFrame(data = submission)

submission.to_csv('submission.csv', index=False)

print('Submitted!')


10) Conclusion

On the first attempt, using the Logistic Regression classifier, we were able to predict survival with an accuracy of 0.75837 (75.84%) when submitted to Kaggle. This was a good first prediction, but it was clear that it could be improved. Tweaking parameters was not helpful and did not increase the accuracy at all, so the only other option was to try more classifiers that might fit the data better. The Extreme Gradient Boosting classifier (XGBoost) was then used and increased the accuracy to 0.77272 (77.27%), a slight improvement. From this point onwards it was very difficult to improve the accuracy with the model we had built, so the final accuracy of the model was 0.77272.

In future versions, we could look at other machine learning techniques, such as deep neural networks, to try to increase this accuracy and build a model more suited to our data. We could also look into better ways to optimise parameters and observe how the model behaves as each parameter is changed.
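One standard way to explore parameter tuning in a future iteration is a grid search with cross-validation. A minimal sketch for the SVC model follows; the parameter grid here is illustrative only:

from sklearn.model_selection import GridSearchCV

# illustrative grid over two SVC parameters, scored with the same 5-fold cross-validation as before
param_grid = {'C': [0.1, 1, 10], 'gamma': ['scale', 0.1, 0.01]}
grid = GridSearchCV(SVC(probability = True), param_grid, cv = 5)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_, grid.best_score_)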
