On Geocoding (part 1 – definition)

=What is Geocoding=

More than once, a discussion about geocoding has been met with the same response: “Hold on a sec, what is geocoding again?” I respond with the same definition you can find on Wikipedia: “geocoding is the procedure of translating addresses into lat/lons (wiki), along with an accuracy/precision/confidence, so that you can pin them to a map and know how reliable each pin is.” A number of web services make this job very easy for you, including Google, Bing, and Yahoo, and if you want to develop a geocoder yourself, a list of resources might be helpful.

However, a more thorough and modern definition of geocoding requires further discussion. In my opinion, geocoding has been too narrowly defined as pinning addresses. Once geocoding is explained as “putting addresses on a map”, people tend to think, “Oh, that’s useful; shouldn’t that be a simple task?” There are at least three points worth noting when talking about geocoding:

1. There is a normalization process before you can use a human-written address to search a spatial database. There are many ways to write an address: “500 108th NE Ave, Bellevue, WA 98004” could also be written as “500 108 Avenue, Bellevue”. With such noisy input, parsing/formatting/normalization is an important pre-processing step before the parsed address can be used for a spatial database lookup.

Do NOT underestimate the complexity of road names: simple regular expressions will only get you unusable results once you try to scale up. A geocoder designer needs hands-on experience (reading, evaluating, debugging) with at least a few hundred addresses, spread over a wide range of spatial coverage, to ensure acceptable performance. Road names with numbers, a POSTDIR or PREDIR (a directional after or before the street name) in the road name, and road names in Puerto Rico can easily f**k you over. Even current commercial geocoding services cannot handle all the complexities (an example will follow).
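To make the normalization step concrete, here is a minimal sketch of a token-based normalizer. It is only an illustration under my own assumptions: the function name `normalize_street_line` and the tiny lookup tables are made up for this post, and a real geocoder needs far larger tables plus actual parsing logic.

```python
import re

# Hypothetical, deliberately tiny lookup tables for illustration only.
SUFFIXES = {"AVENUE": "AVE", "AV": "AVE", "AVE": "AVE",
            "STREET": "ST", "STR": "ST", "ST": "ST",
            "ROAD": "RD", "RD": "RD"}
DIRECTIONS = {"NORTH": "N", "SOUTH": "S", "EAST": "E", "WEST": "W",
              "NORTHEAST": "NE", "NORTHWEST": "NW",
              "SOUTHEAST": "SE", "SOUTHWEST": "SW"}
ORDINAL = re.compile(r"^(\d+)(ST|ND|RD|TH)$")


def normalize_street_line(line):
    """Uppercase, strip punctuation, and canonicalize each token."""
    tokens = re.sub(r"[^\w\s]", " ", line.upper()).split()
    out = []
    for tok in tokens:
        ordinal = ORDINAL.match(tok)
        if ordinal:
            out.append(ordinal.group(1))   # "108TH" becomes "108"
        elif tok in DIRECTIONS:
            out.append(DIRECTIONS[tok])    # "NORTHEAST" becomes "NE"
        elif tok in SUFFIXES:
            out.append(SUFFIXES[tok])      # "AVENUE" becomes "AVE"
        else:
            out.append(tok)
    return " ".join(out)


print(normalize_street_line("500 108th Avenue Northeast"))
print(normalize_street_line("500 108 Ave. NE"))
print(normalize_street_line("North St"))
```

The first two inputs collapse to the same string, which is the whole point of normalizing. The third shows why this kind of token replacement does not scale: a street that is actually named “North” gets rewritten as if it were a directional, and nothing here can tell “St” the street type apart from “St” as in “Saint”.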

2. Geocoding an address returns a lat/lon as well as a precision, and the precision is as important as the lat/lon itself. That’s because the spatial database behind a geocoder mostly stores address ranges rather than every single household, so the majority of addresses are imputed. (You think storing every household is possible, do you? Consider how city planning and street layouts change every year, and how many old houses go down and new ones go up. Even address range data is incomplete for the US, let alone more precise single-household lat/lons.) In other words, the spatial database stores a multi-line segment where one end is “100 Main St” and the other end is “199 Main St”, which means any address in between, such as “150 Main St”, is imputed (by using the middle point of the line segment). In these cases, a precision of “imputed” tells you how the address was geocoded, so users can apply their own judgement and take the pin on the map with a grain of salt. In other cases, a precision of “ROOFTOP” tells you that the geocoded lat/lon is the most precise one possible, because it matches a single-household lat/lon (there are a few of these in the spatial DB).

Note: different geocoding services use different terminology to denote precision. Look at the Yahoo Address Quality Explanation, Google Location Type, and Bing’s CalculationMethod, Confidence and MatchType. These different terms mean pretty much the same thing.
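The address range imputation from point 2 can be sketched in a few lines. This version uses linear interpolation by house number, which is one common way to impute a position (the midpoint is what you get for a house number halfway through the range). It is not a description of what any particular service does. `AddressRange`, `impute` and the coordinates are all made up for illustration, and the segment is treated as a single straight line.

```python
from dataclasses import dataclass


@dataclass
class AddressRange:
    street: str
    low: int        # house number at the start of the segment
    high: int       # house number at the end of the segment
    start: tuple    # (lat, lon) of the low end, made-up values below
    end: tuple      # (lat, lon) of the high end


def impute(rng, house_number):
    """Return (lat, lon, precision), or None if the number is out of range."""
    if not rng.low <= house_number <= rng.high:
        return None
    if rng.high == rng.low:
        t = 0.5  # degenerate range: fall back to the midpoint
    else:
        t = (house_number - rng.low) / (rng.high - rng.low)
    lat = rng.start[0] + t * (rng.end[0] - rng.start[0])
    lon = rng.start[1] + t * (rng.end[1] - rng.start[1])
    # Real data usually splits odd and even numbers by side of the street;
    # that is ignored here.
    return (lat, lon, "imputed")


main_st = AddressRange("MAIN ST", 100, 199, (47.61, -122.20), (47.61, -122.19))
print(impute(main_st, 150))
print(impute(main_st, 250))
```

The precision label travels with the coordinates, so whoever draws the pin knows it is an estimate along the block rather than a surveyed rooftop.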

3. Geocoding is closely related to NLP (natural language processing) and GIR (geographic information retrieval). Most current geocoding applications only accept strictly formatted spatial entities, such as addresses and zip codes. In everyday life, people also describe spatial entities in flexible language: street intersections (“the accident happened at the intersection of Main Street and 100th St”) or landmarks (“the robbery happened close to the McDonald’s on Main St”). Extracting such spatial entities from plain text usually falls under GIR. It involves much more than a spatial database lookup, for example geo-disambiguation (there are over 1000 “Main Street”s in the US; which one do you choose?). Extending geocoding to this broader field would greatly enhance how people think about spatial information and how text can be used.
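As a toy illustration of geo-disambiguation, the sketch below scores candidate “Main St” records by whether their city or state shows up in the surrounding text, and refuses to answer when nothing matches or when the top candidates tie. The gazetteer, the function name `disambiguate` and the scoring weights are all hypothetical; real GIR systems use much richer context than this.

```python
import re

# Hypothetical mini-gazetteer: every record is a "Main St" somewhere.
GAZETTEER = [
    {"street": "MAIN ST", "city": "Bellevue", "state": "WA"},
    {"street": "MAIN ST", "city": "Bellevue", "state": "NE"},
    {"street": "MAIN ST", "city": "Buffalo", "state": "NY"},
]


def disambiguate(text, candidates):
    """Pick the candidate best supported by the text, or None if unclear."""
    words = set(re.findall(r"[A-Za-z]+", text))
    lowered = {w.lower() for w in words}

    def score(cand):
        points = 0
        if cand["city"].lower() in lowered:  # single-word city names only
            points += 2
        if cand["state"] in words:           # state code must be uppercase
            points += 1
        return points

    ranked = sorted(candidates, key=score, reverse=True)
    if not ranked or score(ranked[0]) == 0:
        return None  # no supporting context at all
    if len(ranked) > 1 and score(ranked[1]) == score(ranked[0]):
        return None  # tie between candidates: still ambiguous
    return ranked[0]


print(disambiguate(
    "The accident happened at the intersection of Main Street "
    "and 100th St in Bellevue, WA", GAZETTEER))
print(disambiguate("The robbery happened on Main St", GAZETTEER))
```

Returning nothing for an ambiguous mention is a deliberate choice here: a confident pin on the wrong Main Street is worse than no pin.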

(to be continued)
