Titanic - Machine Learning from Disaster

The Titanic: Machine Learning from Disaster project on Kaggle is a popular data science competition. The objective is to build a predictive model that determines whether a given passenger survived the Titanic disaster, based on a dataset that includes variables such as the passenger's age, sex, fare, cabin, and port of embarkation.

Since this is my first project using Artificial Intelligence, I wanted to see how far I could go without following a specific tutorial or relying on external help. Along the way, I learned about statistics and various AI models that could assist me with this task. Here is how it went:

This competition presents two datasets containing passenger information from the historic RMS Titanic:

train.csv: Contains details for 891 passengers with survival outcomes:

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
... ... ... ... ... ... ... ... ... ... ... ...
889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

test.csv: Contains data for 418 additional passengers, without survival outcomes:

PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
... ... ... ... ... ... ... ... ... ... ...
1307 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
1308 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
1309 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C

Before diving into data analysis, it's crucial to first ensure the quality of our data, since higher-quality data leads to more reliable and insightful conclusions. Our dataset contains some missing data points, whose values may have been lost or never properly collected.

First, let's import all the libraries we'll be using in this demonstration (CatBoost is not part of the standard scientific Python stack, so you may need to install it first with pip install catboost):

        
import pandas as pd #For data manipulation
import numpy as np #For numerical operations
from sklearn.model_selection import train_test_split #To split our dataset for training and validation purposes
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error #To evaluate our model
from catboost import CatBoostRegressor #To create our machine learning model
      


Now, let's check which values are missing in our dataset by using pandas with the following Python code:

        
#Reading CSV file
df = pd.read_csv('train.csv')

print(df.isnull().sum())
      


This produces the following output:


        PassengerId      0 
        Survived         0
        Pclass           0
        Name             0
        Sex              0
        Age            177
        SibSp            0
        Parch            0
        Ticket           0
        Fare             0
        Cabin          687
        Embarked         2
        dtype: int64
      


We can see that the Cabin, Age, and Embarked columns contain missing data. Of the 891 rows in our dataset, 687 are missing a value in the Cabin column (about 77%). A column that is mostly empty offers few useful patterns for the model to learn from, so we can go ahead and drop it.

        
#Dropping the Cabin column; setting inplace=True applies the change directly to the existing DataFrame
df.drop(columns=['Cabin'], inplace=True)
      


For the Embarked column, we'll fill the two missing entries with the mode. Since it's just a couple of values, using the most frequent category should work just fine.

        
#Calculating mode for Embarked Column and replacing missing values

print(df['Embarked'].value_counts()) #This shows us the value 'S' is the most repeated one
df['Embarked'] = df['Embarked'].fillna('S')
      
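
If you'd rather not hard-code the value, the mode can be computed programmatically. A minimal sketch, equivalent to the hard-coded version above:

#Fill missing Embarked values with the most frequent category, computed instead of hard-coded
mostFrequent = df['Embarked'].mode()[0] #mode() returns a Series; take its first entry
df['Embarked'] = df['Embarked'].fillna(mostFrequent)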


Now, let’s take a look at how our dataset is shaping up in terms of missing values.

        
        PassengerId      0
        Survived         0
        Pclass           0
        Name             0
        Sex              0
        Age            177
        SibSp            0
        Parch            0
        Ticket           0
        Fare             0
        Embarked         0
        dtype: int64
      

The Age column contains 177 missing values. To address this, we'll build a predictive model to estimate the missing ages. Why go to that trouble? Because the other columns may hold patterns that predict a passenger's age far better than a single constant could. However, before modeling, it's important to carry out feature selection to determine which variables are most relevant.
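
For context, the simplest alternative would be to fill every missing age with a single constant, such as the median. A minimal sketch of that baseline, which the model-based approach is meant to beat:

#Baseline alternative: fill every missing age with the overall median
#Simple, but it ignores how age relates to features like title, class, or fare
dfBaseline = df.copy()
dfBaseline['Age'] = dfBaseline['Age'].fillna(dfBaseline['Age'].median())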

Let’s review the dataset's columns:
- PassengerId: This serves merely as an index and isn't necessary for our analysis at this stage.
- Survived: Our target variable for classification. Fortunately, it’s already well-formatted thanks to Kaggle’s preprocessing.
- Pclass: Represents ticket class, and it’s properly structured and ready for use.
- Name: While this field may not seem useful initially, it includes each passenger's title (e.g., Mr., Mrs., Miss). We can extract these titles into a new feature column called Title, which may offer valuable insights. Let's see how to extract this information:

  
#Adding new column 'Title'

pd.set_option('display.max_rows', None) #Used to display all the rows in our dataframe, this way we can explore it better

df['Title'] = None #Starting with an empty column will help us track our progress as we filter each title we find:

df['Title'] = df['Name'].str.extract(r'(Mr\.)')
print(df['Title'].value_counts()) #Let's see how many values we were able to filter so far
print(df['Title'].isnull().value_counts()) #Let's see how many values in our new Title column are still empty

#This process was repeated multiple times until we got the following filter:
df['Title'] = df['Name'].str.extract(
    r'(Mr\.|Mrs\.|Miss|Dr\.|Master\.|Rev\.|Col\.|Major\.|Mlle\.|Jonkheer\.|Countess\. of|Mme\.|Don\.|Ms\.|Lady\.|Sir\.|Capt\.)'
)
print(df['Title'].value_counts())
print(df['Title'].isnull().value_counts()) #We should not have any null values left

df['Title'] = df['Title'].str.replace('.', '', regex=False) #Taking out the '.' to make the column look cleaner (regex=False treats it as a literal dot)
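
As a side note, a more general regex can capture whatever word precedes a period in the name, which avoids enumerating every title by hand. Here's a quick sketch of that alternative (not applied here):

#Alternative sketch: extract any word that immediately precedes a period in the name
titlesAlt = df['Name'].str.extract(r' ([A-Za-z]+)\.')[0] #One capture group, so we take column 0
print(titlesAlt.value_counts()) #Lists every title found, including rare ones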

- Sex: This feature is already properly formatted and ready for use.
- Age: This will serve as the target variable for our first predictive model.
- SibSp: Already clean and formatted correctly.
- Parch: No formatting issues, this column is good to go.
- Ticket: This column poses some challenges. Many of the entries appear inconsistent or unstructured, making it difficult to extract meaningful patterns. We'll attempt to clean and standardize this data by applying the same filtering techniques used earlier. (It would have been easier to create a function that filters the data in stages, but I realized that halfway through writing the code. Lesson learned; there's a sketch of that idea right after this code block.)



#Let's see whether the ticket data can be grouped, and how
print('Initial Ticket Groups: \n', df['Ticket'].value_counts())


# Some entries contain purely numeric values, while others mix letters and numbers.
# Let's create a filter to isolate the entries with numeric-only values.
booleanMaskForNumbers = pd.to_numeric(df['Ticket'],errors='coerce').notna()

#This will show us how many tickets are purely numeric
print('Numeric Tickets: \n', df.loc[booleanMaskForNumbers,['Ticket']].count())


#Let's begin filtering the data by creating a new column called 'NewTicket' with the information for the numeric tickets we just found

df['NewTicket'] = None
df.loc[booleanMaskForNumbers,['NewTicket']] = 'NUMERIC-TICKETS'

print('NewTicket Value counts: \n',df['NewTicket'].count()) #Let's see how many of the values on our new column are not empty


# Now we will create a new filter for null values (values that have not been filtered yet). To avoid redefining
# the variable each time the dataframe changes, we will create a function that builds a fresh boolean mask each
# time it is called. The function has a reverse argument which returns the opposite mask.

def get_unfiltered_values(reverse=False):
    if not reverse:
        return df['NewTicket'].isnull()
    return df['NewTicket'].notnull()


# Now we can begin to filter the rest of the data in our Ticket column. Let's create another function that prints
# the remaining values that still need to be filtered; its reverse argument prints the already-filtered ones instead.

def show_unfiltered_values(reverse=False):
    if not reverse:
        print('Unfiltered values: \n', df.loc[get_unfiltered_values(), ['Ticket', 'NewTicket']])
        #We are printing 'NewTicket' just to verify those values are null
    else:
        print('Filtered values: \n', df.loc[get_unfiltered_values(reverse=True), ['Ticket', 'NewTicket']])
    
show_unfiltered_values()

#tempBooleanMask will serve as a temporary mask for the letters and sequence of letters that need to be filtered

# 'PC' Categorical Value:

tempBooleanMask = df['Ticket'].str.contains('PC',case=False)

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'PC'

show_unfiltered_values(reverse=False)


# 'SC/PARIS' Categorical Value: 

tempBooleanMask = df['Ticket'].str.contains('PARIS', case=False )
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SC/PARIS'


# 'CA' Categorical Value:

tempBooleanMask = (
    df['Ticket'].str.contains('C', case=False)
    & df['Ticket'].str.contains('A', case=False)
    & ~df['Ticket'].str.contains('S', case=False)
)

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'CA'


# 'STON/02' Categorical Value:

tempBooleanMask = (
    df['Ticket'].str.contains('SOTON', case=False) & ~df['Ticket'].str.contains('C', case=False)
) | df['Ticket'].str.contains('STON', case=False)


df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'STON/02'



# 'SOC' Categorical Value:

tempBooleanMask = (
    df['Ticket'].str.contains('S', case=False)
    & df['Ticket'].str.contains('O', case=False)
    & df['Ticket'].str.contains('C', case=False)
    & ~df['Ticket'].str.contains('A|W', case=False)
)
 
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SOC'



# 'SOPP' Categorical Value:


tempBooleanMask = (
    df['Ticket'].str.contains('S', case=False)
    & df['Ticket'].str.contains('O', case=False)
    & df['Ticket'].str.contains('P', case=False)
)


df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SOPP'



# 'A5/A4' Categorical Value:

tempBooleanMask = df['Ticket'].str.contains('A') & df['Ticket'].str.contains('5') 

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'A5/A4'


# 'FCC' Categorical Value: 

tempBooleanMask = df['Ticket'].str.contains('F',case=False) & ~df['Ticket'].str.contains('A',case=False)

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'FCC'


# 'WC' Categorical Value:

tempBooleanMask = (
    df['Ticket'].str.contains('W', case=False)
    & df['Ticket'].str.contains('C', case=False)
    & ~df['Ticket'].str.contains('SCO', case=False)
)

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'WC'


# 'PP' Categorical Value: 

tempBooleanMask = (
    df['Ticket'].str.contains('P', case=False)
    & ~df['Ticket'].str.contains('S', case=False)
    & ~df['Ticket'].str.contains('W', case=False)
)


df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'PP'


# 'A5/A4' Categorical Value:

tempBooleanMask = (
    df['Ticket'].str.contains('A', case=False)
    & ~df['Ticket'].str.contains('F', case=False)
    & ~df['Ticket'].str.contains('SOTON')
    & ~df['Ticket'].str.contains('C')
)

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'A5/A4'


# 'LINE' Categorical Value:

tempBooleanMask = df['Ticket'].str.contains('LINE')

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'LINE'




# 'C' Categorical Value:


tempBooleanMask = (
    df['Ticket'].str.contains('C', case=False) & ~df['Ticket'].str.contains('S', case=False)
)


df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'C'



# 'WEP' Categorical Value:


tempBooleanMask =  (
df['Ticket'].str.contains('W', case=False) & df['Ticket'].str.contains('E', case=False) & df['Ticket'].str.contains('P', case=False)
)

df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'WEP'



# 'SW/PP' Categorical Value:

tempBooleanMask = (
    df['Ticket'].str.contains('S', case=False)
    & df['Ticket'].str.contains('W', case=False)
    & df['Ticket'].str.contains('P', case=False)
)


df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SW/PP'



# 'OTHER/STRINGS' Categorical Value: 

df.loc[get_unfiltered_values(),'NewTicket'] = 'OTHER/STRINGS'


show_unfiltered_values()
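
As mentioned when we started on this column, a staged helper function would have made this far less repetitive. Here's a minimal sketch of that idea, assuming a single ordered list of rules, each mapping a category to a 'must contain' pattern and an optional 'must not contain' pattern (the order mirrors the manual filters above):

#Hypothetical helper: apply the staged ticket filters from one ordered rule list
def apply_ticket_filters(df, rules):
    for category, mustMatch, mustNotMatch in rules:
        mask = df['Ticket'].str.contains(mustMatch, case=False)
        if mustNotMatch is not None:
            mask &= ~df['Ticket'].str.contains(mustNotMatch, case=False)
        #Only touch rows that have not been categorized yet
        df.loc[df['NewTicket'].isnull() & mask, 'NewTicket'] = category

ticketRules = [
    ('PC', 'PC', None),
    ('SC/PARIS', 'PARIS', None),
    ('CA', 'C.*A|A.*C', 'S'), #Contains C and A in either order, but no S
    #...the remaining categories would follow in the same order as the manual filters above
]
#apply_ticket_filters(df, ticketRules) #Commented out; the manual version above already ran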


Now that we've cleaned our data, it's time to build a machine learning model to estimate the missing ages:


#Let's save our original DataFrame for later; .copy() ensures later changes can't affect the saved frame
originalDF = df.copy()


#Let's select all of the relevant data for our model: 

print(df)

df = df[['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Title','NewTicket']]


#Let's drop the missing age values
booleanMask = df['Age'].notnull()

df = df[booleanMask]


#Let's assign our target variable and our independent variables


x = df[['Survived','Pclass','Sex','SibSp','Parch','Fare','Embarked','Title','NewTicket']]
y = df['Age']


# We'll use the CatBoost Regressor, which excels at handling 
# categorical features like the ones below:

catColumns = ['Survived','Pclass','Sex','SibSp','Parch','Embarked','Title','NewTicket']


#Let's split our data to train and then test it: 

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=42)




#Creating the model: 

model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.03,
    depth=10,
    random_state=42,
    verbose=0,
)

model.fit(xTrain,yTrain,cat_features=catColumns)


yPred = model.predict(xTest)


# Let's calculate the evaluation metrics
mae = mean_absolute_error(yTest, yPred)
rmse = np.sqrt(mean_squared_error(yTest, yPred)) # RMSE is the sqrt of MSE
r2 = r2_score(yTest, yPred)

print("\nRegression Model Evaluation\n")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"On average, our age prediction is off by {mae:.2f} years\n")

print(f"Root Mean Squared Error (RMSE): {rmse:.2f}\n")

print(f"R-squared (R²): {r2:.2f}")
print(f"Our model explains {r2:.2%} of the variance in the age data\n")

# Let's look at the predicted values vs the actual values using Tableau:

results = pd.DataFrame({'Actual Age': yTest, 'Predicted Age': yPred})
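
#Export the comparison so Tableau can read it (the filename below is just an example)
results.to_csv('age_predictions_vs_actuals.csv', index=False)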




# Our model is good, but it can be better. Let's select only the most important columns for it:

print(model.get_feature_importance(prettified=True))
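
#Optional sketch: pick features programmatically instead of by eye;
#the importance threshold of 5 is an arbitrary example value
#importances = model.get_feature_importance(prettified=True)
#topFeatures = importances.loc[importances['Importances'] > 5, 'Feature Id'].tolist()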


#Let's select all of the relevant data for our model: 


newDF = originalDF[['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Title','NewTicket']]


#Let's drop the missing age values
booleanMask = newDF['Age'].notnull()

newDF = newDF[booleanMask]


#Let's assign our target variable and our independent variables


x = newDF[['Title','Pclass','Parch','Embarked','SibSp','Fare','NewTicket']]
y = newDF['Age']


# Let's indicate which are categorical values: 

catColumns = ['Title','Pclass','Parch','Embarked','SibSp','NewTicket']

#Let's split our data again to train and then test it: 

xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=42)


#Creating the model: 

model = CatBoostRegressor(
    iterations=500,
    learning_rate=0.03,
    depth=10,
    random_state=42,
    verbose=0,
)

model.fit(xTrain,yTrain,cat_features=catColumns)


yPred = model.predict(xTest)


# Let's calculate the evaluation metrics
mae = mean_absolute_error(yTest, yPred)
rmse = np.sqrt(mean_squared_error(yTest, yPred)) # RMSE is the sqrt of MSE
r2 = r2_score(yTest, yPred)

print("\nRegression Model Evaluation\n")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"On average, our age prediction is off by {mae:.2f} years\n")

print(f"Root Mean Squared Error (RMSE): {rmse:.2f}\n")

print(f"R-squared (R²): {r2:.2f}")
print(f"Our model explains {r2:.2%} of the variance in the age data\n")


This is what we got:

  
  
        Regression Model Evaluation

        Mean Absolute Error (MAE): 7.72
        On average, our age prediction is off by 7.72 years

        Root Mean Squared Error (RMSE): 9.85

        R-squared (R²): 0.48
        Our model explains 47.69% of the variance in the age data
  
  


While there's room to improve model accuracy, this configuration should be fine for our current objectives. Let's retrain the model using the complete dataset and save it:

  
#Let's retrain our model with the full data set: 

model.fit(x,y,cat_features=catColumns)
print('Final model has been trained')


#Let's isolate the null values that we will be predicting: 

print(originalDF.isnull().sum())

nullAges = originalDF.loc[originalDF['Age'].isnull(),:]

print(nullAges)

x = nullAges[['Title','Pclass','Parch','Embarked','SibSp','Fare','NewTicket']]
#No target variable here; these are exactly the ages we don't know and are about to predict

yPred = model.predict(x)


#On the original DF, let's assign the null values to the ones we just predicted

originalDF.loc[x.index,'Age'] = yPred

print(originalDF)


#Let's save our model to use it on the test.csv file
model.save_model('AgeRegressor.cbm')


#Let's save our results to use them to train our final model
originalDF.to_csv('Data Frame Cleaned.csv', index=False) #index=False keeps the row index out of the file
  
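
For reference, once we move on to test.csv, the saved regressor can be loaded back and applied the same way. A minimal sketch, assuming testDF is test.csv after the same Title and NewTicket cleaning performed above:

from catboost import CatBoostRegressor

#Load the saved age regressor from disk
ageModel = CatBoostRegressor()
ageModel.load_model('AgeRegressor.cbm')

#testDF is assumed to have gone through the same Title/NewTicket preprocessing as train.csv
#missingAges = testDF['Age'].isnull()
#features = testDF.loc[missingAges, ['Title','Pclass','Parch','Embarked','SibSp','Fare','NewTicket']]
#testDF.loc[missingAges, 'Age'] = ageModel.predict(features)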

The DataFrame has been cleaned, and we're ready to train our final model to predict survival outcomes. To keep this notebook concise, the remaining code is available on our Python Page. Now that our data is prepared, let's visualize it using Tableau: