The Titanic: Machine Learning from Disaster project on Kaggle
is a popular data science competition.
The objective is to develop a predictive model that determines whether
a given passenger survived the Titanic disaster. This is achieved by analyzing
a dataset that includes variables such as each passenger's age, sex, fare, cabin,
and port of embarkation.
Since this is my first project using Artificial Intelligence,
I wanted to see how far I could go without following a specific
tutorial or relying on any external help. Throughout the process, I learned about
statistics and the various AI models that could assist me with this task.
Here is how it went:
This competition presents two datasets containing passenger information from the historic RMS Titanic:
train.csv: Contains details for 891 passengers with survival outcomes:
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.4500 | NaN | S |
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.0000 | C148 | C |
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.7500 | NaN | Q |
test.csv: Contains data for 418 additional passengers, without survival outcomes:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
1307 | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
1308 | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
1309 | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
Before diving into data analysis, it's crucial to first ensure the quality of our data: higher-quality data
leads to more reliable and insightful conclusions.
Our dataset contains some missing data points, leaving the values for those entries unknown.
This could be because the information was lost or was never adequately collected.
First, let's import all the libraries we'll be using in this demonstration:
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
from sklearn.model_selection import train_test_split  # To split our dataset for training and validation purposes
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error  # To evaluate our model
from catboost import CatBoostRegressor  # To create our machine learning model
Now, let's check which values are missing in our dataset using pandas with the following Python code:
#Reading CSV file
df = pd.read_csv('train.csv')
print(df.isnull().sum())
This produces the following output:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
We can see that the Cabin, Age, and Embarked columns contain missing data. Of the 891 rows in our
dataset, 687 are missing a value in the Cabin column. Since this column contributes little valuable
information and its missing values offer no useful patterns for the model to learn from, we can go ahead and drop it.
#Dropping the Cabin column; setting inplace=True applies the change directly to the existing DataFrame
df.drop(columns=['Cabin'], inplace=True)
For the Embarked column, we'll fill the two missing entries with the mode; since it's just a couple of values,
using the most frequent category should work just fine.
#Calculating mode for Embarked Column and replacing missing values
print(df['Embarked'].value_counts()) #This shows us the value 'S' is the most repeated one
df['Embarked'] = df['Embarked'].fillna('S')
Now, let’s take a look at how our dataset is shaping up in terms of missing values.
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
dtype: int64
The Age column contains 177 missing values. To address this, we'll create a predictive model to estimate the
missing ages.
Why are we doing this? Age likely correlates with other variables in the data (such as title and passenger class), so a model can produce more plausible estimates than filling every gap with a single constant.
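For comparison, the simplest alternative would be a group-based median fill. This is just a hedged sketch of that baseline, not part of the pipeline used here; it assumes grouping by Pclass and Sex is a reasonable proxy:
#A baseline sketch (assumption: Pclass and Sex are reasonable grouping keys):
#fill each missing Age with the median age of passengers in the same group
baseline = df.copy()
baseline['Age'] = baseline.groupby(['Pclass', 'Sex'])['Age'].transform(
    lambda ages: ages.fillna(ages.median())
)
print(baseline['Age'].isnull().sum())  # Should print 0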
However, before modeling, it's important to carry out feature selection to determine which variables are most
relevant.
Let’s review the dataset's columns:
- PassengerId: This serves merely as an index and isn't necessary for our analysis at this stage.
- Survived: Our target variable for classification. Fortunately, it’s already well-formatted thanks to Kaggle’s
preprocessing.
- Pclass: Represents ticket class, and it’s properly structured and ready for use.
- Name: While this field may not seem useful initially, it includes each passenger's title (e.g., Mr., Mrs.,
Miss).
We can extract these titles into a new feature column called Title, which may offer valuable insights. Let's see
how to extract this information:
#Adding new column 'Title'
pd.set_option('display.max_rows', None) #Used to display all the rows in our dataframe, this way we can explore it better
df['Title'] = None #Setting this column to None will help us filter the data by doing the following for each title we find:
df['Title'] = df['Name'].str.extract(r'(Mr\.)') #Raw strings avoid invalid-escape warnings in regex patterns
print(df['Title'].value_counts()) #Let's see how many values we were able to filter so far
print(df['Title'].isnull().value_counts()) #Let's see how many values on our new Title column are empty
#This process was repeated multiple times until we got the following filter:
df['Title'] = df['Name'].str.extract(
    r'(Mr\.|Mrs\.|Miss|Dr\.|Master\.|Rev\.|Col\.|Major\.|Mlle\.|Jonkheer\.|Countess\. of|Mme\.|Don\.|Ms\.|Lady\.|Sir\.|Capt\.)'
)
print(df['Title'].value_counts())
print(df['Title'].isnull().value_counts()) #We should not have any null values left
df['Title'] = df['Title'].str.replace('.', '', regex=False) #Taking out the '.' to tidy the column; regex=False treats '.' as a literal character
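As an aside, a more general pattern could capture every title in a single pass, since each name follows the 'Surname, Title. Given names' format. This alternative is only a sketch, not the filter used above:
#Assumption: every Name has the form 'Surname, Title. Given names', which
#holds in this dataset. This grabs whatever sits between the comma and the
#next period, so it also catches rare titles automatically.
titles = df['Name'].str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.value_counts())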
- Sex: This feature is already properly formatted and ready for use.
- Age: This will serve as the target variable for our first predictive model.
- SibSp: Already clean and formatted correctly.
- Parch: No formatting issues; this column is good to go.
- Ticket: This column poses some challenges. Many of the entries appear inconsistent or unstructured,
making it difficult to extract meaningful patterns. We'll attempt to clean and standardize this data by applying
the same filtering techniques used earlier. (It would have been easier to create a function to apply the filters
in stages, but I realized that halfway through writing the code. Lesson learned; see the sketch after this code block.)
#Let's see if the data can be grouped and how it's grouped
print('Initial Ticket Groups: \n',df['Ticket'].value_counts())
# Some entries contain purely numeric values, while others mix letters and numbers.
# Let's create a filter to isolate the entries with numeric-only values.
booleanMaskForNumbers = pd.to_numeric(df['Ticket'],errors='coerce').notna()
#This prints how many tickets consist of numbers only
print('Numeric Tickets: \n', df.loc[booleanMaskForNumbers,['Ticket']].count())
#Let's begin filtering the data by creating a new column called 'NewTicket' with the information for the numeric tickets we just found
df['NewTicket'] = None
df.loc[booleanMaskForNumbers,['NewTicket']] = 'NUMERIC-TICKETS'
print('NewTicket Value counts: \n',df['NewTicket'].count()) #Let's see how many of the values on our new column are not empty
# Now, we will create a new filter for null values (values that have not been filtered yet). To avoid redefining the variable
# each time the dataframe changes, we will create a function that builds a fresh boolean mask each time it is called.
# The reverse argument inverts the mask.
def get_unfiltered_values(reverse=False):
    if not reverse:
        return df['NewTicket'].isnull()
    return ~df['NewTicket'].isnull()
# Now we can begin to filter the rest of the data in our Ticket column. Let's create another function that prints the
# remaining values that still need to be filtered.
# The reverse argument shows the already-filtered values instead.
def show_unfiltered_values(reverse=False):
    if not reverse:
        print('Unfiltered values: \n', df.loc[get_unfiltered_values(), ['Ticket','NewTicket']])
        #We are printing 'NewTicket' just to verify the values are null
    else:
        print('Filtered values: \n', df.loc[get_unfiltered_values(reverse=True), ['Ticket','NewTicket']])
show_unfiltered_values()
#tempBooleanMask will serve as a temporary mask for the letters and sequence of letters that need to be filtered
# 'PC' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('PC',case=False)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'PC'
show_unfiltered_values(reverse=False)
# 'SC/PARIS' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('PARIS', case=False )
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SC/PARIS'
# 'CA' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('C', case=False) & (
    df['Ticket'].str.contains('A', case=False)
) & (
    ~df['Ticket'].str.contains('S', case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'CA'
# 'STON/02' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('SOTON', case=False )& (
~df['Ticket'].str.contains('C',case=False)
) | (
df['Ticket'].str.contains('STON', case= False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'STON/02'
# 'SOC' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('S', case= False) & (
df['Ticket'].str.contains('O',case=False)) & (
df['Ticket'].str.contains('C',case=False)) & (
~df['Ticket'].str.contains('A|W',case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SOC'
# 'SOPP' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('S',case= False) & (
df['Ticket'].str.contains('O',case= False)
) & (
df['Ticket'].str.contains('P',case= False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SOPP'
# 'A5/A4' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('A') & df['Ticket'].str.contains('5')
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'A5/A4'
# 'FCC' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('F',case=False) & ~df['Ticket'].str.contains('A',case=False)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'FCC'
# 'WC' Categorical Value:
tempBooleanMask = ( df['Ticket'].str.contains('W', case=False)) & (
df['Ticket'].str.contains('C', case=False)
) & (
~df['Ticket'].str.contains('SCO',case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'WC'
# 'PP' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('P',case=False) & (
~df['Ticket'].str.contains('S', case=False)
) & (
~df['Ticket'].str.contains('W', case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'PP'
# 'A5/A4' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('A', case=False) & (
~df['Ticket'].str.contains('F',case=False)
) & (
~df['Ticket'].str.contains('SOTON') & ~df['Ticket'].str.contains('C')
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'A5/A4'
# 'LINE' Categorical Value:
tempBooleanMask = df['Ticket'].str.contains('LINE')
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'LINE'
# 'C' Categorical Value:
tempBooleanMask = (
df['Ticket'].str.contains('C', case=False) & ~df['Ticket'].str.contains('S', case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'C'
# 'WEP' Categorical Value:
tempBooleanMask = (
df['Ticket'].str.contains('W', case=False) & df['Ticket'].str.contains('E', case=False) & df['Ticket'].str.contains('P', case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'WEP'
# 'SW/PP' Categorical Value:
tempBooleanMask = ( df['Ticket'].str.contains('S', case=False) ) & (
df['Ticket'].str.contains('W', case=False)
) & (
df['Ticket'].str.contains('P', case=False)
)
df.loc[get_unfiltered_values() & tempBooleanMask, 'NewTicket'] = 'SW/PP'
# 'OTHER/STRINGS' Categorical Value:
df.loc[get_unfiltered_values(),'NewTicket'] = 'OTHER/STRINGS'
show_unfiltered_values()
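As promised above, here's a rough sketch of the staged approach I wish I had started with. This is an assumption on my part rather than the code used in this project: an ordered list of (label, mask-builder) rules, each applied only to rows that haven't been labeled yet.
#A hedged sketch of the staged approach; only the first few rules are shown
def apply_ticket_rules(frame):
    rules = [
        ('NUMERIC-TICKETS', lambda t: pd.to_numeric(t, errors='coerce').notna()),
        ('PC', lambda t: t.str.contains('PC', case=False)),
        ('SC/PARIS', lambda t: t.str.contains('PARIS', case=False)),
        #...the remaining categories would follow the same pattern...
    ]
    frame['NewTicket'] = None
    for label, build_mask in rules:
        unlabeled = frame['NewTicket'].isnull()
        frame.loc[unlabeled & build_mask(frame['Ticket']), 'NewTicket'] = label
    frame.loc[frame['NewTicket'].isnull(), 'NewTicket'] = 'OTHER/STRINGS'
    return frame
Because each rule only touches still-unlabeled rows, the order of the list encodes the same precedence as the manual passes above.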
Now that we've cleaned our data, it's time to build a machine learning model to estimate the missing ages:
#Let's save a copy of our original Data Frame for later
originalDF = df.copy() #.copy() ensures the saved frame can't be affected by later changes to df
#Let's select all of the relevant data for our model:
print(df)
df = df[['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Title','NewTicket']]
#Let's drop the missing age values
booleanMask = df['Age'].notnull()
df = df[booleanMask]
#Let's assign our target variable and our independent variables
x = df[['Survived','Pclass','Sex','SibSp','Parch','Fare','Embarked','Title','NewTicket']]
y = df['Age']
# We'll use the CatBoost Regressor, which excels at handling
# categorical features like the ones below:
catColumns = ['Survived','Pclass','Sex','SibSp','Parch','Embarked','Title','NewTicket']
#Let's split our data to train and then test it:
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=42)
#Creating the model:
model = CatBoostRegressor(
iterations=500,
learning_rate=0.03,
depth=10,
random_state=42,
verbose=0,
)
model.fit(xTrain,yTrain,cat_features=catColumns)
yPred = model.predict(xTest)
# Let's calculate the evaluation metrics
mae = mean_absolute_error(yTest, yPred)
rmse = np.sqrt(mean_squared_error(yTest, yPred)) # RMSE is the sqrt of MSE
r2 = r2_score(yTest, yPred)
print("\nRegression Model Evaluation\n")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"On average, our age prediction is off by {mae:.2f} years\n")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}\n")
print(f"R-squared (R²): {r2:.2f}")
print(f"Our model explains {r2:.2%} of the variance in the age data\n")
# Let's look at the predicted values vs the actual values using Tableau:
results = pd.DataFrame({'Actual Age': yTest, 'Predicted Age': yPred})
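#One simple way to hand this comparison off to Tableau is to export it as a
#CSV file; the filename below is just an illustrative example:
results.to_csv('age_predictions_vs_actual.csv', index=False)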
# Our model is decent, but it can be better; let's keep only the most informative columns
print(model.get_feature_importance(prettified=True))
#Let's select all of the relevant data for our model:
newDF = originalDF[['Survived','Pclass','Sex','Age','SibSp','Parch','Fare','Embarked','Title','NewTicket']]
#Let's drop the missing age values
booleanMask = newDF['Age'].notnull()
newDF = newDF[booleanMask]
#Let's assign our target variable and our independent variables
x = newDF[['Title','Pclass','Parch','Embarked','SibSp','Fare','NewTicket']]
y = newDF['Age']
# Let's indicate which are categorical values:
catColumns = ['Title','Pclass','Parch','Embarked','SibSp','NewTicket']
#Let's split our data again to train and then test it:
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.2, random_state=42)
#Creating the model:
model = CatBoostRegressor(
iterations=500,
learning_rate=0.03,
depth=10,
random_state=42,
verbose=0,
)
model.fit(xTrain,yTrain,cat_features=catColumns)
yPred = model.predict(xTest)
# Let's calculate the evaluation metrics
mae = mean_absolute_error(yTest, yPred)
rmse = np.sqrt(mean_squared_error(yTest, yPred)) # RMSE is the sqrt of MSE
r2 = r2_score(yTest, yPred)
print("\nRegression Model Evaluation\n")
print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"On average, our age prediction is off by {mae:.2f} years\n")
print(f"Root Mean Squared Error (RMSE): {rmse:.2f}\n")
print(f"R-squared (R²): {r2:.2f}")
print(f"Our model explains {r2:.2%} of the variance in the age data\n")
This is what we got:
Regression Model Evaluation
Mean Absolute Error (MAE): 7.72
On average, our age prediction is off by 7.72 years
Root Mean Squared Error (RMSE): 9.85
R-squared (R²): 0.48
Our model explains 47.69% of the variance in the age data
While there's room to improve model accuracy, this configuration should be fine for our current objectives.
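If we ever wanted to push accuracy further, one straightforward next step would be a small hyperparameter sweep. Here's a minimal sketch, assuming the xTrain/xTest split and catColumns from above are still in scope (this wasn't part of the original run):
#Try a few tree depths and compare validation MAE; everything else is held fixed
for depth in [6, 8, 10]:
    candidate = CatBoostRegressor(
        iterations=500,
        learning_rate=0.03,
        depth=depth,
        random_state=42,
        verbose=0,
    )
    candidate.fit(xTrain, yTrain, cat_features=catColumns)
    candidateMae = mean_absolute_error(yTest, candidate.predict(xTest))
    print(f'depth={depth}: MAE={candidateMae:.2f}')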
Let's retrain the model using the complete dataset and save it:
#Let's retrain our model with the full data set:
model.fit(x,y,cat_features=catColumns)
print('Final model has been trained')
#Let's isolate the null values that we will be predicting:
print(originalDF.isnull().sum())
nullAges = originalDF.loc[originalDF['Age'].isnull(),:]
print(nullAges)
x = nullAges[['Title','Pclass','Parch','Embarked','SibSp','Fare','NewTicket']]
y = nullAges['Age'] #All NaN; not actually used, we only need x for prediction
yPred = model.predict(x)
#On the original DF, let's assign the null values to the ones we just predicted
originalDF.loc[x.index,'Age'] = yPred
print(originalDF)
#Let's save our model to use it on the test.csv file
model.save_model('AgeRegressor.cbm')
#Let's save our results to use them to train our final model
originalDF.to_csv('Data Frame Cleaned.csv', index=False) #index=False keeps the row index out of the saved file
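Later, when we tackle test.csv, the saved regressor can be loaded back and applied after the same preprocessing. A minimal sketch, under the assumption that the test frame has already been given the same Title and NewTicket columns:
#Sketch of reusing the saved model on test.csv (assumes the same
#Title/NewTicket preprocessing has already been applied to testDF)
testDF = pd.read_csv('test.csv')
#...repeat the Title extraction and ticket filtering on testDF here...
ageModel = CatBoostRegressor()
ageModel.load_model('AgeRegressor.cbm')
missing = testDF['Age'].isnull()
features = ['Title','Pclass','Parch','Embarked','SibSp','Fare','NewTicket']
testDF.loc[missing, 'Age'] = ageModel.predict(testDF.loc[missing, features])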
The dataframe has been cleaned and we're ready to train our final model to predict survival outcomes. To keep this notebook concise, the remaining code is available on our Python Page. And now that our data is prepared, let's visualize it using Tableau: