Thursday, August 28, 2014

First Line Address Parser and Standardizer

My recent move from NYC to Philly has led me to rethink which projects to work on.

It didn't take long for me to see the need for an open address parser and standardizer. I often have to work with data that has addresses or street information, and it seems that every dataset is formatted, parsed, and standardized differently. That makes any analytics or comparison difficult. I did some searching and didn't find anything that met what I thought was needed, so I set out to develop my own using Python.

If you just want to skip ahead, the code is here:


It will work on any US-based first line address, but the standardizations are tuned specifically for Philadelphia addresses.

It does not handle city, state, or ZIP Code.

The application currently runs self-contained in one directory. See phlAddrParse.cfg for input and output settings.
The program attempts to adhere to USPS Pub. 28 address standards.
Six files determine most of the standardizations. You can modify them to meet your needs:
  • apt.csv - postal unit designators/apt, expects a number/unit identifier to follow
  • apte.csv - postal unit designator/apt, no following number/unit identifier required.
  • directional.csv - pre and post directionals for addresses
  • suffix.csv - USPS street suffixes
  • saint.csv - list of Saints to help with standardizations of 'ST'
  • std.csv - street name standardizations.
Example input: 123R-27 north ben Franklin blv apt 2b and s pine av

Field        Value              Description
oeb          O                  address number/range: Odd, Even, Both
alow         123                address number low
ahigh        127                address number high
astrlow      123R               address number low as a string
astrhigh     127R               address number high as a string
predir       N                  pre directional
streetname   BENJAMIN FRANKLIN  street name
postdir      (blank)            post directional
unit         APT 2B             unit designator/apt
predir2      S                  pre directional for intersections, blank otherwise
streetname2  PINE               street name for intersections, blank otherwise
suffix2      AVE                suffix for intersections, blank otherwise
postdir2     (blank)            post directional for intersections, blank otherwise
unit2        (blank)            unit designator for intersections, blank otherwise
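
To make the address number fields concrete, here is a minimal Python sketch of how a token like 123R-27 could be expanded into oeb, alow, ahigh, astrlow, and astrhigh. It is not the parser's actual code. The function name parse_address_number is hypothetical, and the expansion rule (a short high end replaces the trailing digits of the low end, and the letter suffix carries over) is an assumption inferred from the example values above.

```python
import re

def parse_address_number(token):
    """Split an address-number token such as '123R-27' into range fields.

    Assumed rule (inferred from the example, not taken from the real parser):
    an abbreviated high end replaces the trailing digits of the low end.
    """
    match = re.fullmatch(r"(\d+)([A-Z]?)(?:-(\d+)([A-Z]?))?", token.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized address number: {token!r}")
    low_digits, low_suffix, high_digits, high_suffix = match.groups()

    alow = int(low_digits)
    if high_digits is None:
        ahigh = alow                      # single number, not a range
    elif len(high_digits) < len(low_digits):
        # '27' after '123' is read as 127: keep the leading digits of the low end
        ahigh = int(low_digits[:-len(high_digits)] + high_digits)
    else:
        ahigh = int(high_digits)

    # Carry the low-end letter suffix to the high end when none is given
    suffix = high_suffix or low_suffix

    if alow % 2 == ahigh % 2:
        oeb = "O" if alow % 2 else "E"    # both ends odd, or both even
    else:
        oeb = "B"                         # range covers both sides of the street

    return {
        "oeb": oeb,
        "alow": alow,
        "ahigh": ahigh,
        "astrlow": f"{alow}{low_suffix}",
        "astrhigh": f"{ahigh}{suffix}",
    }
```

For the example address, parse_address_number("123R-27") gives oeb O, alow 123, ahigh 127, astrlow 123R, and astrhigh 127R, matching the values listed above.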

Friday, May 30, 2014

LaGuardia Airport Taxi Origin and Destinations, 2013

I've always been curious what the taxi travel patterns for LaGuardia Airport look like, since taxis are the only practical way to get to and from the airport for many travelers. Using the data from my previous post on processing all taxi trips for 2013, I did some analysis of just the LGA taxi trips.

I think this side by side map viewer shows the patterns best.  Click on the map to open.

Of course there is significant activity near locations you would expect, such as Penn Station and Grand Central Terminal. The next level of features is dominated by large hotels. If you zoom in to lower Manhattan, you will see a number of hotel locations I added to highlight this pattern.

I would have liked to do a JFK comparison as well, but most JFK trips are to and from Manhattan, and those trips are flat rate. This leads to the meters being started and stopped during the trip rather than when the trip actually starts and finishes. As a result, there are thousands of recorded locations along the Van Wyck.

Passenger data for JFK and LGA, 2013

Taxi Trip totals for Airports, 2013

Time of Day, Day of Week Heatmap for LGA Drop-offs, 2013

Avg. number of drop-offs per hr

credit -

Thursday, May 8, 2014

NYPD Motor Vehicle Collisions

After a year of working with NYPD Collision data that was aggregated to the intersection by month, it was exciting to see this data release by individual collision with a date and time.

Congratulations and kudos to all that worked to make this happen!

This is also personally very satisfying. I became involved in helping geocode the crash data, which at the time was in PDF format, as a direct result of this article.

The highlight was -

Council Member Jessica Lappin got into an animated discussion with Petito over traffic crash data. When Lappin asked why NYPD is releasing data in PDF form — and only after the council adopted legislation forcing the department to do so — Petito replied that the department is “concerned with the integrity of the data itself.” Petito said NYPD believes data released on a spreadsheet could be manipulated by people who want “to make a point of some sort.” An incredulous Lappin assured Petito that the public only wants to analyze the data to improve safety, not use it for “evil.”

I think everyone involved showed that while not every representation of the data is going to be perfect, providing this data in a publicly available usable format is valuable for everyone.

That said, some caution must be used in working with this new data. Over 13% of the records do not contain coordinates, borough, or ZIP Code.

The large majority of the intersections missing coordinates are easily geocoded. The caveat is that you need to make sure the intersection name does not exist in more than one borough.

Here are the top intersections by count in the NYPD Motor Vehicle Collisions data that do not have coordinates.

From my previous work getting this data geocoded, I have a list of NYC intersections and their coordinates. It would be great if the data could be augmented with this and made available -

Street,street2,Zipcode,boro code,PolicePrecinct,Lon,Lat
1 AVENUE,39 STREET,11232,3,72,-74.01273692,40.65662545
1 AVENUE,40 STREET,11232,3,72,-74.01327021,40.65605997
1 AVENUE,44 STREET,11232,3,72,-74.01563035,40.6537815
1 AVENUE,47 STREET,11232,3,72,-74.01738143,40.65210968
1 AVENUE,48 STREET,11232,3,72,-74.01796511,40.65154416
1 AVENUE,53 STREET,11220,3,72,-74.02087613,40.64874399
1 AVENUE,53 STREET,11232,3,72,-74.02087613,40.64874399
1 AVENUE,54 STREET,11220,3,72,-74.02145974,40.6481812
1 AVENUE,56 STREET,11220,3,72,-74.02262334,40.64706385
1 AVENUE,ALLEN STREET,10003,1,9,-73.98863937,40.72293374
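
As a rough illustration of how a list like this could fill in missing coordinates while respecting the multi-borough caveat, here is a Python sketch. The column names come from the header above. The function names build_lookup and geocode are hypothetical, and the rule of refusing to answer when a street pair appears in more than one borough is one possible policy, not necessarily how the original geocoding was done.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the list above
SAMPLE = """Street,street2,Zipcode,boro code,PolicePrecinct,Lon,Lat
1 AVENUE,39 STREET,11232,3,72,-74.01273692,40.65662545
1 AVENUE,53 STREET,11220,3,72,-74.02087613,40.64874399
1 AVENUE,53 STREET,11232,3,72,-74.02087613,40.64874399
1 AVENUE,ALLEN STREET,10003,1,9,-73.98863937,40.72293374
"""

def build_lookup(csv_text):
    """Map an order-independent street pair to {boro code: (lon, lat)}."""
    lookup = defaultdict(dict)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = frozenset((row["Street"], row["street2"]))
        # Rows that differ only by ZIP Code collapse into one borough entry
        lookup[key][row["boro code"]] = (float(row["Lon"]), float(row["Lat"]))
    return lookup

def geocode(lookup, street_a, street_b):
    """Return (lon, lat) only when the pair exists in exactly one borough."""
    key = frozenset((street_a.strip().upper(), street_b.strip().upper()))
    matches = lookup.get(key, {})
    if len(matches) != 1:
        return None  # unknown, or ambiguous across boroughs
    return next(iter(matches.values()))

lookup = build_lookup(SAMPLE)
```

With this, geocode(lookup, "53 street", "1 avenue") finds the Brooklyn intersection regardless of street order, while a pair that shows up under two borough codes returns None and would need the borough from another source.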

Otherwise the data looks pretty good and consistent for the dates available. I'm very curious to see if it actually updates every day. So far today, it hasn't. Still, it is awesome to see this data released this way.

The spikes seem to correlate with snow storms.  The 'down' spikes seem to be holidays.

Lastly, the older NYPD data documents the vehicles involved.  This allows for some interesting insights on some specific crashes.  I was very surprised at how many fire trucks and ambulances are involved in collisions.

Tuesday, April 29, 2014

NYC Taxi Data - 2013

I never should have asked for this data. I have plenty of projects to work on. Still, I couldn't resist the opportunity to have a look at an entire year's worth of NYC taxi data. It became a challenge to see if I could actually manage to work with 50GB of data and over 170 million records. So after spending an inordinate amount of time over a few weekends on this, here are my results.

First, I have to thank/blame Chris Whong for this adventure.  Please check out his blog.

All my work can be found on GitHub -

Initially, I tried to normalize and compress the data down enough to make it searchable in some form. For about 4 hours I had a 5GB zipped file of the data on my Dropbox. I tweeted that the data was available to download, and soon I got an email saying my Dropbox account was suspended. The lesson learned is that the free Dropbox only allows for 20GB of downloads per day.

Next, I decided to create a table of total origin/pickup and destination/drop-off counts by neighborhood. Long story short, the output ended up looking like this:

Get your own copy here.
The corresponding shapefile.
A map version looks like this:


I've also started experimenting with making some maps with the data.  Here are the first two:

A sample of the processed data for one day, April 30th, 2013, can be found here.  If you are interested in more of the data or a specific slice of it, message me @mrsp105 and I will try to assist.

Sunday, February 23, 2014

NYC Bike Lane Violation Parking Citations

Here's a quick post for a map of bike lane violation parking citations using the ArcGIS Online Storytelling Text and Legend web application template.

Date ranges for the citations - 7/30/2013 - 10/29/2013.

For the smaller scales, the citations are represented as heat/density maps.  My preference was to not use a heat map but it was the only way to represent the citations and the bike lanes at the same time.

Zoom in to see the actual citation locations as blue circles.


Monday, January 20, 2014

Mapping NYC Parking Tickets

Over the last 2-3 months I have been working with the NYC Parking Violations Issued data released on NYC Open Data. Basically, I wanted to geocode the records to enable spatial visualization and analysis. I'm still thinking about and working on ways to map this data, but in the meantime I thought I would share the data.

There were a number of challenges working with this data.

The first challenge was to see whether the data was really valid. A histogram of the record totals by date reveals that most dates have only what appears to be a small sample of records, or that the records are incorrectly encoded with the date (a number of records are dated in the future). I was able to settle on a date range (07/29/2013 - 10/28/2013) that seemed consistent and reasonable in terms of having complete data.
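
One way to make that kind of date-range check repeatable is to count records per day and look for the longest run of consecutive days that each clear a minimum count. The Python sketch below illustrates the idea. It is not the analysis actually used here. The function names, the MM/DD/YYYY date format, and the min_per_day threshold are all assumptions.

```python
from collections import Counter
from datetime import datetime, timedelta

def daily_counts(date_strings, fmt="%m/%d/%Y"):
    """Count records per calendar day (the date format is an assumption)."""
    return Counter(datetime.strptime(s, fmt).date() for s in date_strings)

def longest_dense_run(counts, min_per_day):
    """Return (start, end) of the longest run of consecutive days that each
    have at least min_per_day records, or (None, None) if no day qualifies."""
    best = (None, None)
    best_len = 0
    run_start = None
    prev = None
    for day in sorted(d for d, n in counts.items() if n >= min_per_day):
        if prev is None or day - prev != timedelta(days=1):
            run_start = day               # gap found: start a new run
        prev = day
        length = (day - run_start).days + 1
        if length > best_len:
            best_len = length
            best = (run_start, day)
    return best
```

Sparse dates and stray future dates fall below the threshold and drop out, so the run that survives is the span with consistently full days.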

Second, I needed to geocode the records before I could map them. For this I decided to make use of the recently released NYC GeoClient API. Here is the code I used -  It ran at 1,500 records per minute on an Amazon EC2 server. The code is quite sloppy; I just kept adding more code as I found ways to geocode more addresses. Some records were intersections, others street addresses. I also had to determine borough codes from the precincts.
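
For the precinct-to-borough step, a small lookup is enough. The Python sketch below is an illustration, not the code referenced above. The precinct ranges reflect the usual NYPD numbering as I understand it and should be checked against a current precinct list before being relied on. The function name boro_code_for_precinct is hypothetical.

```python
# (lowest precinct, highest precinct, borough code) - ranges are an assumption
# based on the usual NYPD precinct numbering; verify before relying on them.
PRECINCT_RANGES = [
    (1, 34, 1),     # Manhattan
    (40, 52, 2),    # Bronx
    (60, 94, 3),    # Brooklyn
    (100, 115, 4),  # Queens
    (120, 123, 5),  # Staten Island
]

def boro_code_for_precinct(precinct):
    """Return the borough code (1-5) for a precinct number, or None if unknown."""
    for low, high, boro in PRECINCT_RANGES:
        if low <= precinct <= high:
            return boro
    return None
```

Records with a precinct outside these ranges return None and would need the borough from some other field.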

Here is a simple visualization of all tickets in a heat map.

Full map

Monday, November 25, 2013

October update to bike collision data

A quick update to the bike collision charts to reflect the October data.  Big thanks again to John Krauss and OpenScrape for processing the reports into an easy format to work with.