🛳️ The Titanic: Predicting Survival
I have been working A LOT with Fast.ai lately, and I am loving it. I am learning so many new approaches that I feel get overlooked by many in the field when discussions arise. But I have to admit: I was a little nervous when it came to working with tabular data, i.e. this dataset, with Fast.ai. Most of my work with the framework has been with image recognition. But I figured that if it is so revolutionary in imaging, surely I should learn to apply it to ye ol' tabular data. So I did. And once again, I was impressed!
The Titanic dataset is not huge. It is actually dauntingly small, with only 891 rows of data to work with. But I have been trying to really embrace the Fast.ai approach, knowing that I do not need loads of data to accurately predict an outcome. In addition to the small number of rows, a relatively large number of columns were not very useful. Columns like passenger ID, passenger name, and the raw sibling and parent counts did not have much effect at all on the predictions. And I knew from the beginning that passenger class and gender were going to be the leading factors. That is just logical.
While I investigated what kind of feature engineering could be done here, I came to the conclusion that it was not going to help very much in predicting the survival of a passenger. In fact, I compared my accuracy to that of projects that had added a significant number of features, and there was really no difference. So the only feature engineering I did was to combine the siblings/spouses count and the parents/children count into one column that tells how many family members total a passenger was traveling with. I also added a column for whether or not a passenger was traveling alone, because it does seem that traveling alone could affect decisions that would ultimately lead to survival or not.
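That bit of feature engineering can be sketched in pandas. The `SibSp` and `Parch` column names follow the Kaggle Titanic schema; `FamilySize` and `IsAlone` are just my names for the derived columns, and counting the passenger themselves in the family size is one common convention, not the only one:

```python
import pandas as pd

# Tiny stand-in for the Kaggle Titanic training frame
# (SibSp = siblings/spouses aboard, Parch = parents/children aboard)
df = pd.DataFrame({
    "SibSp": [1, 0, 0, 3],
    "Parch": [0, 0, 2, 1],
})

# Total traveling-party size, counting the passenger themselves
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Flag passengers with no family aboard
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

print(df)
```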
So I completed this project with the Fast.ai framework first, and then I followed up with a little dance with XGBoost. Of course, XGBoost could not beat out Fast.ai, and I was not surprised. But I figured it was worth a try. Most of my time with XGBoost was spent determining the best hyperparameter settings to go with.
So moral of the story: Fast.ai is just as easy to work with on tabular data as it is with images. AND it still beats out XGBoost! So that's cool!
Interactive Jupyter Notebook | PDF Version | 👇 Scroll through the notebook