🚕 NYC Taxi Prediction: 1.5 million+ rows of data 🫣
This was the most intense project I have completed thus far. And I guess that is appropriate, since it is the final project for my current course, Machine Learning: Zero to GBMs at Jovian.ml.
This project was INTENSE! I did not initially realize that the 1.5 million+ rows in this NYC Taxi Trip Duration Prediction dataset would pose the challenge they did, not until I was well into the project. Numerous runtime crashes and RAM overloads later, here I am! And I must say, I am pleased with the outcome, although I will still be working on boosting my accuracy even more. And while I know you can only do so much with the data provided, I look forward to perfecting my use of the intricacies of my machine learning models' parameters and to working more deep learning models into my predictions.
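For anyone else wrangling a dataframe this size, one trick that helps keep the RAM under control is downcasting numeric columns and parsing timestamps once, up front. This is just a minimal sketch of the idea (the column names come from the Kaggle dataset; it is illustrative, not lifted straight from my notebook):

```python
import pandas as pd

# Leaner dtypes for the NYC Taxi Trip Duration training data (Kaggle columns).
dtypes = {
    "vendor_id": "int8",
    "passenger_count": "int8",
    "pickup_longitude": "float32",
    "pickup_latitude": "float32",
    "dropoff_longitude": "float32",
    "dropoff_latitude": "float32",
    "trip_duration": "int32",
}

df = pd.read_csv(
    "train.csv",
    dtype=dtypes,
    parse_dates=["pickup_datetime"],  # parse timestamps once, at load time
)

# Check how much memory the frame actually takes after downcasting.
print(f"{df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
```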
The two models I chose to work with on this project were XGBoost's XGBRegressor and scikit-learn's RandomForestRegressor. One of my very favorite aspects of working with this data was how limited it was coming in; not in number of rows (not by a long shot!), but in features. So I got to do a LOT of feature engineering. And it is always so fun to wrap some Python code around the data and see what comes out. I think at one point I had almost tripled the columns/features of the dataset just to get a better look at what was hidden within it. I learned how to extract street addresses from longitude and latitude points so that I could geoplot the data complete with the location each passenger was picked up from. I created SO MANY categories with which to analyze and visualize the data. I REALLY got to know this data. It was exciting and stressful all at the same time!
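To give a flavor of that feature engineering, here is a minimal sketch of two of the steps I describe above: expanding the pickup timestamp into calendar features, and reverse-geocoding coordinates into street addresses with geopy's Nominatim. The helper names are illustrative, not copied from my notebook:

```python
from geopy.geocoders import Nominatim

# df is the taxi dataframe loaded in the earlier sketch,
# with pickup_datetime already parsed as a datetime column.

# Expand the pickup timestamp into simple calendar features.
df["pickup_hour"] = df["pickup_datetime"].dt.hour
df["pickup_weekday"] = df["pickup_datetime"].dt.dayofweek
df["pickup_month"] = df["pickup_datetime"].dt.month
df["is_weekend"] = (df["pickup_weekday"] >= 5).astype("int8")

# Reverse-geocode a pickup point into a human-readable street address.
# Nominatim is rate-limited, so only do this for a small sample of rows,
# never for all 1.5 million at once.
geolocator = Nominatim(user_agent="nyc-taxi-eda")  # any descriptive user_agent works

def pickup_address(row):
    location = geolocator.reverse((row["pickup_latitude"], row["pickup_longitude"]))
    return location.address if location else None

sample = df.sample(100, random_state=42)
sample["pickup_address"] = sample.apply(pickup_address, axis=1)
```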
Overall, I am very content with how this project turned out, although I do wish I could have gotten the accuracy numbers higher on my validation sets. This is something I definitely plan to keep learning about and working on, with more hyperparameter tuning and so on.
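For the kind of hyperparameter tuning I want to keep working on, a randomized search over the XGBRegressor parameters is one approach; this is a sketch of the idea under assumed feature names, not the exact grid or feature set from my notebook:

```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor

# Assumes df and the engineered calendar features from the earlier sketches.
features = [
    "pickup_hour", "pickup_weekday", "passenger_count",
    "pickup_longitude", "pickup_latitude",
    "dropoff_longitude", "dropoff_latitude",
]

param_distributions = {
    "n_estimators": [200, 400, 800],
    "max_depth": [6, 8, 10, 12],
    "learning_rate": [0.05, 0.1, 0.2],
    "subsample": [0.7, 0.9, 1.0],
}

search = RandomizedSearchCV(
    XGBRegressor(tree_method="hist", random_state=42),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="neg_root_mean_squared_error",
    cv=3,
    n_jobs=-1,
)
search.fit(df[features], df["trip_duration"])
print(search.best_params_, search.best_score_)
```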
Since one of my favorite parts of this work is data visualization, I will share some of my favorite plots from the project. I especially enjoyed working with GeoPandas and creating the maps of taxi trips. But my absolute favorite was creating the heat maps. One aspect I find most intriguing about data visualization is how good plots can help you find the right questions to ask and inspire new avenues to explore in data science.
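If you want to try something similar, here is a rough sketch of the two kinds of plots I mention: pickups scattered on a GeoPandas map and a seaborn heat map of trip counts by weekday and hour. It is illustrative rather than lifted verbatim from the notebook:

```python
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns

# df and the calendar features come from the earlier sketches.

# Turn pickup coordinates into a GeoDataFrame for mapping.
gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["pickup_longitude"], df["pickup_latitude"]),
    crs="EPSG:4326",
)
gdf.plot(markersize=0.1, alpha=0.3, figsize=(8, 8))
plt.title("Pickup locations")

# Heat map of trip counts by weekday and hour of pickup.
counts = df.pivot_table(index="pickup_weekday", columns="pickup_hour",
                        values="trip_duration", aggfunc="count")
plt.figure(figsize=(10, 4))
sns.heatmap(counts, cmap="viridis")
plt.title("Trips by weekday and hour")
plt.show()
```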
My entire notebook can be found at the bottom of this page, complete with documentation and a thorough table of contents that is easily navigable when run on Google Colab. Unfortunately, the embedding does not hide the output where I specified, or the looooooong code blocks that I have collapsed in the interactive notebook. So if you are interested, please do follow this link and choose "Run on Google Colab". There you can see my project in all of its beauty, with a detailed, emoji-coordinated table of contents. Hey! I like to make things graphically pleasing! What can I say?!
Data Visualization: It is a passion!
For the full interactive notebook, click here, or scroll through this embedded version 👇.