When dealing with Strings…

Strings are general, and strings are tricky. Regular expressions are a useful tool for string manipulation, but they also create problems of their own. Writing your own rules requires extreme care. Below are some notes I gathered from dealing with strings (specifically people data: names, addresses, etc.) at large scale.

1. Have you considered punctuation?

Apostrophes appear in people’s last names, street names, and many other proper names. So do hyphens, forward slashes, and backslashes. Does your RegEx match these? Should your output normalize them?
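As a rough illustration of both questions, here is a minimal Python sketch. The pattern `NAME_TOKEN` and the helpers `normalize_punctuation` and `find_name_tokens` are hypothetical names made up for this example, and the set of punctuation marks they handle is an assumption; real data will likely need a longer list.

```python
import re

# Hypothetical pattern: runs of letters, optionally joined by an apostrophe
# (straight or curly), a hyphen, a forward slash, or a backslash.
# [^\W\d_] means "a word character that is not a digit or underscore",
# which in Python 3 covers Unicode letters.
NAME_TOKEN = re.compile(r"[^\W\d_]+(?:['’\-/\\][^\W\d_]+)*")

# Assumed normalization table: typographic quotes and the en dash are mapped
# to their ASCII counterparts. Extend it to fit your own data.
PUNCTUATION_MAP = str.maketrans({
    "\u2019": "'",  # right single quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2013": "-",  # en dash
})

def normalize_punctuation(text: str) -> str:
    """Replace typographic punctuation with ASCII equivalents."""
    return text.translate(PUNCTUATION_MAP)

def find_name_tokens(text: str) -> list[str]:
    """Return letter-based tokens, keeping internal punctuation intact."""
    return NAME_TOKEN.findall(normalize_punctuation(text))

print(find_name_tokens("O\u2019Brien lives at 12 Smith-Jones Rd"))
```

Whether to normalize before matching, or to match both forms and keep the original, depends on what the output is used for; the sketch normalizes first only to keep the example short.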

2. How about casing?

The unusual casing in “McDonald” is meaningful; do you choose to maintain it? Always use case-insensitive comparison when comparing strings.
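One way to get both, sketched in Python: keep the original string for display and compare on a case-folded copy. The function name `same_name` is hypothetical; `str.casefold()` is the standard-library method intended for caseless matching, and it is more aggressive than `lower()` (for example, it maps "ß" to "ss").

```python
def same_name(a: str, b: str) -> bool:
    """Compare two strings without regard to case.

    The inputs are not modified, so the original casing
    (e.g. "McDonald") is still available for output.
    """
    return a.casefold() == b.casefold()

original = "McDonald"
print(same_name(original, "MCDONALD"))  # compares folded copies only
print(original)                         # original casing is untouched
```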

3. Is your input strictly English only?

Accented and other non-ASCII characters (e.g., äöüß in German, àâæçéèêëïîô in French, áéíñóúü¿¡ in Spanish) can cause problems in your pattern-matching mechanism. You need to either handle these characters in matching (the preferred solution) or normalize them to their closest English counterparts before processing (which loses the original character). Be careful about which encoding you use (UTF-8, UTF-16, etc.); note that Unicode itself is a character set, not an encoding.
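Both options can be sketched with Python's standard library. The helper names `canonical`, `strip_accents`, and `is_word` are hypothetical. The accent-stripping approach shown here (decompose, then drop combining marks) is one common technique, not the only one, and it only handles characters that decompose: "ß" and "æ" pass through unchanged.

```python
import re
import unicodedata

# Unicode-aware letter run: word characters minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")

def canonical(text: str) -> str:
    """Option 1: normalize to NFC so that 'é' typed as one code point and
    'e' + combining accent compare equal, then match on the result."""
    return unicodedata.normalize("NFC", text)

def strip_accents(text: str) -> str:
    """Option 2 (lossy): decompose to NFD and drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def is_word(text: str) -> bool:
    """True if the whole string is a single run of letters, accented or not."""
    return LETTERS.fullmatch(canonical(text)) is not None

print(is_word("Müller"))
print(strip_accents("José Müller"))
```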

4. Rule of thumb for writing RegEx: no more than three words long

If your RegEx can match phrases longer than three words, it is going to suffer on speed. Use a hierarchical data structure for more complicated string pattern matching. Instead of writing super-long RegExes, divide and conquer with smaller ones. Your code will run faster and be more robust.
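To make the divide-and-conquer idea concrete, here is a small Python sketch that splits a street address into tokens and then checks each part with its own short pattern. Everything in it is an assumption for illustration: the function `parse_street_address`, the three patterns, and especially the very short list of street suffixes. I have not benchmarked it; whether it is actually faster than one long pattern depends on the regex engine and the input.

```python
import re

# Three small patterns, each responsible for a single piece.
TOKEN = re.compile(r"\S+")
HOUSE_NUMBER = re.compile(r"\d+[A-Za-z]?")
STREET_SUFFIX = re.compile(
    r"(?:st|street|ave|avenue|rd|road|blvd)\.?", re.IGNORECASE
)

def parse_street_address(text: str):
    """Split into tokens, then validate the first and last token separately.

    Returns a dict with 'number', 'name', and 'suffix', or None if the
    text does not look like "<number> <name...> <suffix>".
    """
    tokens = TOKEN.findall(text)
    if len(tokens) < 3:
        return None
    number, *name, suffix = tokens
    if not HOUSE_NUMBER.fullmatch(number):
        return None
    if not STREET_SUFFIX.fullmatch(suffix):
        return None
    # The street name can be any number of words; no regex needs to span it.
    return {"number": number, "name": " ".join(name), "suffix": suffix}

print(parse_street_address("12 Martin Luther King Blvd."))
```

Each pattern here matches a single token, so a failure points at one specific piece, which makes the rules easier to test and change one at a time.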

 
