Thursday, August 28, 2014

First Line Address Parser and Standardizer

Since my recent move from NYC to Philly, I have been rethinking which projects to work on.

It didn't take long for me to see the need for an open address parser and standardizer. I often have to work with data that has addresses or street information, and it seems that every data set is formatted, parsed, and standardized differently. That makes any analytics or comparisons difficult. I did some searching and didn't find anything that met what I thought was needed, so I set out to develop my own using Python.

If you just want to skip ahead, the code is here:


It will work on any US-based first line address, but the standardizations are tweaked specifically for Philadelphia addresses.

It does not handle City, State, or ZIP Code.

The application currently runs self-contained in one directory. See phlAddrParse.cfg for input and output settings.
The program attempts to adhere to USPS Pub. 28 address standards.
There are 6 files that determine most of the standardizations. You can modify them to meet your needs:
  • apt.csv - postal unit designators/apt, expects a number/unit identifier to follow
  • apte.csv - postal unit designator/apt, no following number/unit identifier required.
  • directional.csv - pre and post directionals for addresses
  • suffix.csv - USPS suffixes
  • saint.csv - list of Saints to help with standardizations of 'ST'
  • std.csv - street name standardizations.
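
To illustrate how lookup files like these can drive standardization, here is a minimal sketch. It is not the project's actual code: the CSV layouts are not shown in this post, so a two-column "variant,standard" layout is assumed, the sample rows are made up, and load_lookup and standardize_street are hypothetical helper names.

```python
import csv
import io

# Hypothetical sample rows. The real CSV layouts are not shown in the post,
# so a two-column "variant,standard" layout is assumed here.
DIRECTIONAL_CSV = "NORTH,N\nSOUTH,S\nEAST,E\nWEST,W\n"
SUFFIX_CSV = "AV,AVE\nAVENUE,AVE\nBLV,BLVD\nBOULEVARD,BLVD\n"
STD_CSV = "BEN FRANKLIN,BENJAMIN FRANKLIN\n"


def load_lookup(text):
    """Read 'variant,standard' rows into a dict keyed by the variant."""
    return {row[0]: row[1] for row in csv.reader(io.StringIO(text)) if row}


def standardize_street(street, directionals, suffixes, names):
    """Split a street string (no house number or unit) into standardized parts."""
    tokens = street.upper().split()
    predir = suffix = ""
    # Leading directional, taken only if something is left over for the name.
    if len(tokens) > 1 and (
        tokens[0] in directionals or tokens[0] in directionals.values()
    ):
        predir = directionals.get(tokens[0], tokens[0])
        tokens = tokens[1:]
    # Trailing suffix, with the same guard.
    if len(tokens) > 1 and (
        tokens[-1] in suffixes or tokens[-1] in suffixes.values()
    ):
        suffix = suffixes.get(tokens[-1], tokens[-1])
        tokens = tokens[:-1]
    name = " ".join(tokens)
    # Whole-name standardization, in the spirit of std.csv.
    return {"predir": predir, "streetname": names.get(name, name), "suffix": suffix}


directionals = load_lookup(DIRECTIONAL_CSV)
suffixes = load_lookup(SUFFIX_CSV)
names = load_lookup(STD_CSV)

print(standardize_street("s pine av", directionals, suffixes, names))
print(standardize_street("north ben Franklin blv", directionals, suffixes, names))
```

Keeping the rules in data files rather than in code means a city-specific tweak is just an extra row in a CSV.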
Example input:

123R-27 north ben Franklin blv apt 2b and s pine av

Parsed output:

Field        Value              Description
oeb          O                  address number/range: Odd, Even, Both
alow         123                address number low
ahigh        127                address number high
astrlow      123R               address number low as a string
astrhigh     127R               address number high as a string
predir       N                  pre directional
streetname   BENJAMIN FRANKLIN  street name
postdir      (blank)            post directional
unit         APT 2B             unit designator/apt
predir2      S                  pre directional for intersections, blank otherwise
streetname2  PINE               street name for intersections, blank otherwise
suffix2      AVE                suffix for intersections, blank otherwise
postdir2     (blank)            post directional for intersections, blank otherwise
unit2        (blank)            unit designator for intersections, blank otherwise
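
The address number fields in the example (oeb, alow, ahigh, astrlow, astrhigh) can be derived from "123R-27" along these lines. This is only a sketch of one way to get those values, not the parser's actual logic. The regular expression, the rule for expanding an abbreviated high number, the parity rule for oeb, and the name parse_number are all assumptions.

```python
import re

# Assumed shape: digits, an optional letter, then optionally a hyphen and a
# second (possibly abbreviated) number with its own optional letter.
RANGE_RE = re.compile(r"^(\d+)([A-Z]?)(?:-(\d+)([A-Z]?))?$")


def parse_number(raw):
    """Return the address number fields for a string such as '123R-27'."""
    m = RANGE_RE.match(raw.strip().upper())
    if m is None:
        return None
    low_digits, low_letter, high_digits, high_letter = m.groups()
    if high_digits is None:
        # Single address number: low and high are the same.
        high_digits = low_digits
    elif len(high_digits) < len(low_digits):
        # Abbreviated range such as 123-27: borrow the leading digits of low.
        high_digits = low_digits[: len(low_digits) - len(high_digits)] + high_digits
    # Assumption: the high number inherits the low number's letter if it has none.
    letter = high_letter or low_letter
    alow, ahigh = int(low_digits), int(high_digits)
    if alow % 2 != ahigh % 2:
        oeb = "B"  # mixed parity: both sides of the street
    elif alow % 2:
        oeb = "O"  # odd
    else:
        oeb = "E"  # even
    return {
        "oeb": oeb,
        "alow": alow,
        "ahigh": ahigh,
        "astrlow": low_digits + low_letter,
        "astrhigh": high_digits + letter,
    }


print(parse_number("123R-27"))
```

For "123R-27" this produces the same five values shown in the example output.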