Apocryphally, 80% of data science is data munging, i.e. cleaning the data into a usable format. If you want to use Pandas on the NYC MTA Subway turnstile data, here’s some steps to get you started. Don’t miss the iPython/jupyter notebook I’ve included at bottom of the post.
As part of the NYC Open Data initiative, the Metropolitan Transit Authority makes available the raw turnstile data on a weekly basis from mid-2010 to Present. You can download these direct from the MTA website. Open up a recent file. How can we transform real-world data into something more user-friendly?
Tools used: Python 2.7; pandas; jupyter notebook
Doing so in this way means that it will append any number of files from the data/ folder, instead of hard-coding the filename directly.
This, like dropping duplicates, is generally a good idea when you are getting external data and don’t know how messy it’s going to be. In our case, many of the MTA CSVs have trailing strings after ‘EXIT’ in the columns headings.
Instead of two columns with date and time separately stored as strings, now you have a single datetime object.
By sorting the line name first, you avoid the problem of creating two or more unique identifiers per station in the case that the order of lines is changed for stations serving multiple lines.
Since the numbers are in the millions and generally keep climbing, we can be fairly confident in making this assumption. We need to take a row-by-row difference in order to get entry counts for the intervals.
Turnstile entry counts are taken approximately at 4hr intervals, although these are not the same for each station. Simplify this issue by binning your data. You could also use a simple linear model to interpolate the entries/exits for comparability.
Here’s a link to a jupyter notebook to take you through these steps and a few more to get you started. In Part Two (coming soon!) I’ll suggest some techniques for exploratory data visualization and treatment of outliers.
Until then… happy cleaning, everyone!