I should have never asked for this data. I have plenty of projects to work on. Still, I couldn't resist the opportunity to have a look at an entire year's worth of NYC taxi data. It became a challenge to see if I could actually manage to work with 50GB of data and over 170 million records. So after spending an inordinate amount of time over a few weekends on this, here are my results.
First, I have to thank/blame Chris Whong for this adventure. Please check out his blog. http://chriswhong.com/open-data/foil_nyc_taxi/
All my work can be found on GitHub - https://github.com/tswanson/TaxiNYC2013
Initially, I tried to normalize and compress the data down enough to make it searchable in some form. For about 4 hours I had a 5GB zipped file of the data on my Dropbox. I tweeted that the data was available to download, and soon I got an email saying my Dropbox account was suspended. The lesson learned is that the free Dropbox plan only allows for 20GB of downloads per day.
I next decided to try and create a table of total counts by neighborhoods for total origin/pickup and destination/drop-off. Long story short, the output ended up looking like this:
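For anyone curious how a table like that can be built, here is a minimal sketch in Python. It is not the code from my repo, just an illustration of the idea: test each pickup and drop-off point against each neighborhood polygon and keep two running tallies. The function names, the toy square "neighborhoods", and the trip tuples are all made up for the example.

```python
from collections import Counter

def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) is inside polygon, a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def neighborhood_of(x, y, neighborhoods):
    """Return the name of the first polygon containing the point, or None."""
    for name, polygon in neighborhoods.items():
        if point_in_polygon(x, y, polygon):
            return name
    return None

def count_trips(trips, neighborhoods):
    """Tally pickups and drop-offs per neighborhood for (px, py, dx, dy) trips."""
    pickups, dropoffs = Counter(), Counter()
    for px, py, dx, dy in trips:
        pickups[neighborhood_of(px, py, neighborhoods)] += 1
        dropoffs[neighborhood_of(dx, dy, neighborhoods)] += 1
    return pickups, dropoffs

# Hypothetical toy data: two unit squares standing in for neighborhoods.
neighborhoods = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
trips = [
    (0.5, 0.5, 1.5, 0.5),  # picked up in A, dropped off in B
    (0.2, 0.8, 0.7, 0.3),  # stays inside A
    (1.5, 0.2, 5.0, 5.0),  # picked up in B, dropped off outside both
]
pickups, dropoffs = count_trips(trips, neighborhoods)
```

Checking every point against every polygon is far too slow for 170 million records, so at full scale a spatial index or a GIS tool would probably be the better choice.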
Get your own copy here.
The corresponding shapefile.
A map version looks like this:
I've also started experimenting with making some maps with the data. Here are the first two:
A sample of the processed data for one day, April 30th, 2013, can be found here. If you are interested in more of the data or a specific slice of it, message me @mrsp105 and I will try to assist.
Over the last 2-3 months I have been working with the NYC Parking Violations Issued data released on NYC Open Data. Basically, I wanted to geocode the records, enabling spatial visualization and analysis. I'm still thinking about and working on ways to map this data, but in the meantime, I thought I would share the data.
There were a number of challenges working with this data.
The first challenge was to see if the data was really valid. A histogram of total records by date reveals that most dates have only what appears to be a small sample of records, or that the records are encoded with the wrong date (there are a number of records dated in the future). I was able to settle on a date range (07/29/2013 - 10/28/2013) that seemed to be consistent and reasonable in terms of having complete data.
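A rough sketch of that kind of sanity check in Python looks something like this. It counts records per day, lists dates that fall in the future, and keeps only the days with enough records to look complete. The MM/DD/YYYY format, the function names, the threshold, and the sample values are assumptions for illustration, not the actual data or the code I ran.

```python
from collections import Counter
from datetime import date, datetime

def records_per_day(issue_dates, fmt="%m/%d/%Y"):
    """Count records for each issue date (date strings assumed to be MM/DD/YYYY)."""
    counts = Counter()
    for raw in issue_dates:
        counts[datetime.strptime(raw, fmt).date()] += 1
    return counts

def future_dates(counts, today):
    """Dates later than 'today', which cannot be valid issue dates."""
    return sorted(d for d in counts if d > today)

def dense_days(counts, min_records):
    """Days with at least min_records records, i.e. days that look complete."""
    return sorted(d for d, n in counts.items() if n >= min_records)

# Hypothetical sample values, not real ticket records.
sample = ["07/29/2013", "07/29/2013", "07/30/2013",
          "07/30/2013", "01/05/2031", "03/14/2009"]
counts = records_per_day(sample)
suspicious = future_dates(counts, today=date(2014, 1, 1))
complete = dense_days(counts, min_records=2)
```

With the real file, the threshold would have to be picked by eye from the histogram, which is essentially how I settled on the date range above.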
Second, I needed to geocode the records before I could map them. For this I decided to make use of the recently released NYC GeoClient API. Here is the code I used - https://github.com/tswanson/NYCParkingGeocode. It ran at about 1,500 records per minute on an Amazon EC2 server. The code is quite sloppy. I just kept adding more code as I found ways to geocode more addresses. Some records were geocoded as intersections, others as street addresses. I also had to determine borough codes from the precincts.
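To give a flavor of the pre-processing involved, here is a simplified sketch in Python. It is not the code from the repo. It assumes the usual NYPD precinct numbering (1-34 Manhattan, 40-52 Bronx, 60-94 Brooklyn, 100-115 Queens, 120-123 Staten Island) and the standard borough codes 1-5, and the function and argument names are hypothetical. The actual call to the GeoClient API is left out on purpose.

```python
# Assumed precinct ranges mapped to borough codes
# (1 = Manhattan, 2 = Bronx, 3 = Brooklyn, 4 = Queens, 5 = Staten Island).
PRECINCT_RANGES = [
    (1, 34, 1),
    (40, 52, 2),
    (60, 94, 3),
    (100, 115, 4),
    (120, 123, 5),
]

def borough_from_precinct(precinct):
    """Return the borough code for a precinct number, or None if it is out of range."""
    for low, high, code in PRECINCT_RANGES:
        if low <= precinct <= high:
            return code
    return None

def geocode_kind(house_number, street, cross_street):
    """Decide which kind of lookup a record supports.

    A house number plus a street can be geocoded as an address; two street
    names can be geocoded as an intersection; anything else is unresolved.
    """
    if street and house_number:
        return "address"
    if street and cross_street:
        return "intersection"
    return "unresolved"
```

For example, `borough_from_precinct(75)` gives 3 (Brooklyn), and a record with only a street and a cross street would be routed to an intersection lookup.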
Here is a simple visualization of all tickets in a heat map.
Full map
A quick update to the bike collision charts to reflect the October data. Big thanks again to John Krauss and OpenScrape for processing the reports into an easy format to work with.
There was a fair amount of concern when CitiBike was launched (May 27th, 2013), that there would be significant public safety issues with thousands of extra bikes riding around in the most congested areas of Manhattan and Brooklyn. Now that the program has been in place for a few months, I thought it would be worth looking at the NYPD's collision data to see what the impacts are.
Over the last year or so, I've been working with the NYPD Motor Collision Data Reports. I thought that they would be very useful for mapping, provided they could be formatted properly. The problem is that the reports are released as PDF documents, which cannot be easily edited or analyzed. The data is aggregated by month to the closest intersection. This limits what can be done with the data because one does not know where the actual incident occurred.
For this analysis I looked at the "Bicycle" count in the Vehicle Type column. This way, I was looking at total bicycles involved in collisions (as reported) instead of just cyclist injuries.
The NYPD Crash Band-Aid project has been working hard on developing Python scripts that extract the PDF documents into a usable and open format - .csv. I was able to contribute to the project by providing the initial seeding of latitude and longitude coordinates for most of the intersections. The table of intersections was created using the Department of Planning LION street file. Additional intersections were geocoded using the Dept. of Planning GeoSupport desktop geocoder. Currently, just over 99% of the intersections have coordinates.
Now that the data was in a 'mappable' format, I went to work making a number of maps of bike collisions, as well as other vehicle types. In the absence of actual traffic counts by vehicle types, this helps create a picture of traffic patterns by various vehicles. This assumes that there is a strong correlation between collisions and miles driven in those areas, which may not always be the case.
Livery cab collision density map. August 2011 - September 2013
View Larger Map
Comparison of bike collision density with total vehicle collision density.
I was curious enough to go see where the areas with the highest density of collisions were, so I hopped on a CitiBike. Three hours and five different CitiBikes later, I had collected photo documentation of those locations.
And finally, I created a 3D heat map of the bike collision density. Warning: you will need the Firefox, Chrome, or Safari browser. The file will first have to download and unpack locally. Look for a future blog post on how this was created.
Ok ok... For the CitiBike analysis.
One of the first things I tried was to compare the density of bike collisions for June - Sept. 2013 to the same months in 2012. I wanted to see if there were any visible patterns when comparing the months in 2013 when CitiBike was available vs the previous year. It really didn't show much in terms of patterns. The data is also sparse for this type of use, and the validity of any conclusion drawn from it would be questionable. Still, I spent a lot of time creating these maps, so I've left them in here.
Lastly, I created a polygon area that outlines the CitiBike docking stations (shown in blue below). I used this boundary to compare total bike collisions inside of this area to the total outside.
View Larger Map
Below are two charts that graph the NYC bike collision data by month.
The first chart shows two bands of data. The top dark blue columns are the total number of bicycles involved in collisions outside of the CitiBike Area, and the lower light blue columns are the number of bicycles involved in collisions within the CitiBike Area. I added the orange colors on the bottom columns so that the summer months of 2013 can be easily compared to 2012.
This second chart shows the percentage of bike collisions that occurred in just the CitiBike Area.
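The percentage itself is simple arithmetic: bicycles involved in collisions inside the CitiBike Area divided by the citywide total for the same month. Here is a minimal Python sketch. The function name is hypothetical, and the monthly counts are placeholder numbers chosen only to show the calculation. They are not the NYPD figures.

```python
def citibike_share(inside, outside):
    """Percent of a month's bike collisions that fell inside the CitiBike Area."""
    total = inside + outside
    if total == 0:
        return None  # no collisions reported, so the share is undefined
    return 100.0 * inside / total

# Placeholder counts as (inside, outside) per month. NOT real NYPD data.
monthly = {
    "2012-06": (40, 60),
    "2013-06": (45, 55),
}
shares = {month: citibike_share(i, o) for month, (i, o) in monthly.items()}
```

Looking at the share rather than the raw count matters because citywide bike collisions rise and fall with the seasons, so a raw increase inside the area by itself does not say much.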
There has been an increase in the total bike collisions for the CitiBike area since the bike share program started at the end of May. However, the increase is small, and total bike collisions citywide have also increased. More importantly, the percentage of bike collisions occurring in the CitiBike area in those four months (June - Sept 2013) has stayed consistent with the historical percentages.
Based on the collision data published by the NYPD, and more importantly on the last 4 months of data since the launch of the bike share program by CitiBike, my analysis shows that there is no significant increase in bike collisions in the areas with CitiBike docking stations compared to all bike collisions in the city.
Link to spreadsheet data by month.
Please stay tuned as I update the results when new data becomes available. I will also be posting more details on how some of the maps in this blog were created.
Welcome to my new blog! This is where I will be posting GIS and mapping projects that I think are interesting and useful. Here is a teaser...