On Geocoding (Part 2: Building an Evaluation DataSet)

In essence, an ideal address geocoder parses, normalizes, and geocodes an address, which can arrive in a very flexible format. The normalized and geocoded result should follow some standard (the USPS standard for address normalization is well developed and, I think, the de facto standard for address geocoding). The expected output should include not only lat/long but also precision. There could be additional spatial information, such as the county or MSA the address belongs to, that is useful for certain applications. The geocoder might also fill in a missing zipcode or fix misspelled street, city, or state names, depending on how lenient/tolerant you want it to be.
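To make the expected output concrete, here is a minimal sketch of a result record. The class name, the field names, and the precision labels are my own assumptions for illustration, not part of any standard, and the sample address and coordinates are made-up placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GeocodeResult:
    """Hypothetical container for the output of geocoding one address."""
    normalized_address: str        # address rewritten to follow the chosen standard
    lat: float
    lon: float
    precision: str                 # assumed labels, e.g. "rooftop", "street", "zip"
    county: Optional[str] = None   # optional extra spatial information
    msa: Optional[str] = None      # optional extra spatial information


# Made-up example; the coordinates are placeholders, not a real location.
result = GeocodeResult(
    normalized_address="123 MAIN ST SPRINGFIELD IL 62701",
    lat=0.0,
    lon=0.0,
    precision="street",
)
print(result)
```

Keeping precision as its own field means an evaluation can later separate "wrong location" from "right location, coarse precision".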

 

With all of these complicated software development demands/goals, where do you start?

 

You start with building an address list

An Evaluation DataSet for the geocoder, to be precise.

 

The reason to build a test set first is that you can define as many types of address (or address malformation) as you want, but you can never be comprehensive (or it’s very difficult to be). Addresses, among other things, have features or characteristics that differ across regions. How do you treat apts, units, bldgs? A radical example: road names in PR are in Spanish, hence your carefully designed backward-matching RegExs using a street type list would not work (what would be “Washington Ave” in the continental US is “Ave Washington” in PR, as in “2 Ave Washington, San Juan, PR 00907”). Software developers are usually limited by their own language and their exposure to different types of address.
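The PR example can be reproduced with a tiny sketch. The street type list and the pattern below are deliberately simplistic assumptions of mine, not the regexes of any real geocoder. They only show how a parser that expects the street type at the end silently fails on “Ave Washington”.

```python
import re

# Tiny illustrative street type list (an assumption, far from complete).
STREET_TYPES = ("AVE", "ST", "RD", "BLVD")

# "Backward matching": house number, street name, then the street type last.
SUFFIX_PATTERN = re.compile(
    r"^(?P<number>\d+)\s+(?P<name>.+?)\s+(?P<type>"
    + "|".join(STREET_TYPES)
    + r")$",
    re.IGNORECASE,
)


def parse_street(line):
    """Return number/name/type parts, or None if the line does not fit."""
    match = SUFFIX_PATTERN.match(line.strip())
    return match.groupdict() if match else None


us_style = parse_street("2 Washington Ave")
pr_style = parse_street("2 Ave Washington")
print(us_style)  # {'number': '2', 'name': 'Washington', 'type': 'Ave'}
print(pr_style)  # None
```

A test set containing real PR addresses would surface this failure immediately, which is the whole argument for building the test set before the parser.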

==========What should the TestSet look like?==============

Well, first of all, it should be of good size, because “More data is better”. No, seriously, Google has a white paper on this (citation follows).

With more data, the test set is more likely to cover an extensive spatial range, a more comprehensive list of address types, and even more flexibility in the input (messy addresses).

However, we need something as a target, whether the correct lat/lon or the correct normalized address; otherwise it cannot be called a test, can it?
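When the target is a lat/lon, one simple way to score a geocoder is to measure the great-circle distance between its answer and the target and count the answers that fall within a tolerance. The sketch below uses the haversine formula with a spherical Earth of radius 6371 km. The function names and the tolerance are my own assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def hit_rate(predicted, targets, tolerance_km):
    """Fraction of predicted (lat, lon) points within tolerance of the target."""
    if not targets:
        return 0.0
    hits = sum(
        haversine_km(*p, *t) <= tolerance_km
        for p, t in zip(predicted, targets)
    )
    return hits / len(targets)


# Placeholder points: the first prediction is exact, the second is one
# degree of longitude away along the equator.
rate = hit_rate([(0.0, 0.0), (0.0, 1.0)], [(0.0, 0.0), (0.0, 0.0)], 1.0)
print(rate)  # 0.5
```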

Also, the addresses should be real.

Sounds like a tedious job that requires time? It has already been done. USPS has a CASS (Coding Accuracy Support System, wiki) test for testing address normalization software (remember the pop-up when you enter your address as payment information while buying something online, suggesting a correctly formatted version of the address you entered? That’s CASS-certified software doing its job). Unfortunately, the test is designed only for address normalization, not geocoding (there is no correct lat/lon with each address). It contains only pairs of a messy address and its correct address. Size? 150,000 pairs.
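With pairs of messy and correct addresses, scoring a normalizer is straightforward. Here is a minimal sketch, assuming the pairs are already loaded as Python tuples (the actual CASS file layout is not shown here). The two pairs and the toy normalizer are made up for illustration.

```python
def canonical(text):
    """Collapse case and whitespace so trivial differences are not errors."""
    return " ".join(text.upper().split())


def normalization_accuracy(pairs, normalize_fn):
    """Fraction of (messy, correct) pairs where normalize_fn(messy) matches."""
    if not pairs:
        return 0.0
    hits = sum(
        canonical(normalize_fn(messy)) == canonical(correct)
        for messy, correct in pairs
    )
    return hits / len(pairs)


# Made-up pairs, NOT taken from the CASS test set.
pairs = [
    ("123 main street  springfield il", "123 MAIN ST SPRINGFIELD IL"),
    ("45 oak avenue, springfield il", "45 OAK AVE SPRINGFIELD IL"),
]


def toy_normalizer(messy):
    """Hypothetical normalizer that knows a single abbreviation."""
    return messy.replace("street", "ST")


score = normalization_accuracy(pairs, toy_normalizer)
print(score)  # 0.5
```

Exact match after canonicalization is a strict metric. A softer one (for example, scoring each address component separately) may be more informative, but that is a design choice left open here.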

I’ve already done the job of fetching the lat/lon of every single address in the CASS test set from the Google/Yahoo/Bing geocoding APIs, which took quite some time (damn the daily limits imposed by the big names). The results and the geocoding API call project are hosted on GitHub: https://github.com/MetalMASK/CASS_Geocoding_Eval
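For anyone repeating this, the daily limit mostly means the job has to be resumable. The sketch below is not the code in the repository above. It is a minimal illustration with a hypothetical `geocode_fn` callback standing in for whichever API client you use, and a JSON file as an assumed progress store.

```python
import json
from pathlib import Path


def geocode_batch(addresses, geocode_fn, done_path, daily_limit):
    """Geocode at most daily_limit new addresses and persist all results.

    geocode_fn is a hypothetical callable: address -> JSON-serializable value.
    Run this once per day until every address is present in the output file.
    """
    path = Path(done_path)
    done = json.loads(path.read_text()) if path.exists() else {}
    calls = 0
    for address in addresses:
        if address in done:
            continue              # already fetched on a previous day
        if calls >= daily_limit:
            break                 # quota used up; resume on the next run
        done[address] = geocode_fn(address)
        calls += 1
    path.write_text(json.dumps(done))
    return done
```

Calling it again the next day with the same `done_path` skips everything already fetched, so no quota is wasted on repeats.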

 

Surprisingly, the performance of Google/Yahoo/Bing is quite mediocre on the CASS test set. I’ll write about the test results next time.

 

 
