I wrote this article in an unusual way: I started from the source code of a Python module, commented it thoroughly, made some minor formatting changes, and fed it through the markdown HTML expander. The result is what I'd like to think is pretty good example of a literate program. You can see the original or formatted source code too, or the pre-markdown text file, if you like.
phoneNormalizer.py
The phoneNormalizer
module demonstrates how a formatted string (a phone number is used
as an example) can be divided into useful atomic parts without loss of
information and without introducing dependencies for storage
in a relational database. In other words, it addresses issues of 3NF storage of
data that is not trivially atomic. The module is self testing, and it has been
thoroughly documented in an attempt at literate programming.
Avoiding Dependent Columns
Here's the problem: Storing the raw string phone number into a single text column in a database makes it un-searchable and unusable. Storing only atomic fields (ZIP code, extension, etc.) throws away important information and makes handling edge cases like international numbers unnecessarily challenging. However, storing both the raw string and the atomic fields introduces a dependency between columns (since the atomic fields can be computed from the raw string) which introduces the possibility of a corrupt database.
Now, I can't imagine anyone would lose any sleep over these theoretical
problems just for the case of phone numbers. The problem is more general,
and this module demonstrates a general technique. The lesson here is to
remember that it is ALWAYS possible to factor the atomic parts out of a
complicated data type.
Testing Strategy
The following snippet (copied from an interactive session) is included in the header
string of the module. But even though it's not code, it will be identified
and parsed by doctest
. doctest
will run the 'commands' and validate the
'output':
>>> normalize('work 123 (456)7890 x12 ')
('123', '456', '7890', '12', 'work %s (%s)%s x%s ')
>>> reconstruct(_)
'work 123 (456)7890 x12 '
>>> normalize('home 345-6789')
('', '345', '6789', '', 'home %s%s-%s%s')
>>> reconstruct(_)
'home 345-6789'
Using doctest
has two advantages: we can quickly create simple unit tests,
and we provide documentation that is guaranteed to be accurate. However,
the testing is too simple to give us any real confidence in the functions.
For that, we supply a 'luggage test' function that generates random
phone numbers in an attempt to cover a large portion of the input space.
The luggage test actually did catch a major bug: I forgot to escape the % characters when I first wrote the function. That's great, but the unit testing actually performed a more important service. I had to tweak the regular expressions several times to catch all the cases. Each time I changed the regular expression, I couldn't be sure that the previous cases would continue to work. However, I was able to tweak and massage the regular expression safely and easily, with the unit tests providing the safety net.
Unit tests enable painless refactoring.
There are two kinds of functions: those that are amenable to unit testing and those that are not. I feel this distinction is important because it reminds us to factor the unit testable portion out and test it thoroughly. That way we can "rely on reliable things." As the non-unit-testable portion of our code shrinks and simplifies, human testing becomes more productive.
Initialization
First, we import the regular expression module, which we'll rely heavily on:
import re
We'll need two regular expressions. The first matches a phone number, with area code, including an optional extension:
_phonePattern = re.compile(
r'^(.*)(\d{3})(\D*)(\d{3})(\D*)(\d{4})(\D*)(\d*)(\D*)$'
)
The second matches a local phone number (no area code) and optional extension:
_localPhonePattern = re.compile(
r'^(.*)()()(\d{3})(\D*)(\d{4})(\D*)(\d*)(\D*)$'
)
These two regular expressions could be combined, but only at a tremendous cost is readability. Note that the patterns are compiled just once, when the module loads. The leading '_' is a Python idiom indicating these are semi- private variables, and should not normally be used.
Seperate Atomic Fields from Formatting
normalize()
is the essence of this module. Notice that I chose to return
a simple tuple, instead of defining a 'phone number' class. It can be
argued that every abstraction should be represented as a class, but
I feel that a class should provide a significant service, and in this
case it wouldn't do much more than group the data together.
Using a tuple has the practical advantage of keeping the code
simple. I wanted the to take advantage of Python's built-in templating
operator %. The % operator can take either a tuple or a dictionary
as the RHS. I then chose the tuple over the dictionary because the
components of a phone number come in a specific order, making it more
tuple-like. Because the tuple really is the right logical choice, a
lot of little things tended to work out; for example, the repr()
of
the tuple is quite readable.
def normalize(phoneNumber):
'Maps a phone number in string format to (areaCode,trunk,branch,ext,form).'
'assert: form % (areaCode,trunk,branch,ext) == phoneNumber'
form
is the original string with the identified atomic substrings
factored out as '%s'.
The assert statement is not validated by doctest
--it's just a compact
way to express a property that the return value will always have, even
if the argument passed in isn't a recognized phone number.
We'll try to match each pattern in turn:
match = _phonePattern.match(phoneNumber)
if not match: match = _localPhonePattern.match(phoneNumber)
If none of the patterns match, we'll return empty strings '' for
each atomic field and append the four placeholders to the end of the raw pattern field
so it may safely be %
'ed with the 4-tuple of atomic fields:
if not match: return ('','','','',phoneNumber.replace('%','%%') +('%s'*4))
Note that replacing '%'s with '%%'s escapes them from the % operator.
Now that we have a regular expression match, we'll seperate out the atomic fields
from the format. We'll construct a a string containing all the non-phone number
characters from the string with %s
place-holders for the atomic fields:
form = '%s'.join(tween.replace('%','%%') for tween in match.groups()[::2])
For example, "work (%s)%s-%s x%s". Note that here and below I use
the [::2]
slice operator. This says take every other field. So, I'm
construct the form out the even groups, and the atomic fields are
the odd groups:
return match.groups()[1::2] + (form,)
So we see that normalize returns a 5-tuple: the four atomic fields and the form field at the end.
Now that we're done with normalize()
, it might be a good time to comment
on Python's docstrings. The first line of the docstring is considered
special; it's what you'll see pop-up in an IDE while typing the function.
the rest of the docstring is available in __doc__
, but the first line
should be self-sufficient for maximum clarity. #
comments (such as this paragraph)
are even lower on the totem pole; they're 'private' comments that
will only been seen by someone reading the code.
Reconstructing the Original String
The entire point of this exercise is to separate the atomic information in a string from its formatting without duplication and without destroying information. Therefore, it's important that we can retrieve the original input. Note that if someone ran a conversion on the database, perhaps updating the ZIP code column, the reconstructed string would reflect the updated value. This is not appropriate for all cases, such as when there's a legal reason to record the input as-is.
def reconstruct(phoneTuple):
'(areaCode,trunk,branch,ext,form) -> original formatted phone number'
return phoneTuple[4] % phoneTuple[0:4]
reconstruct()
turns out to be a trivial function; our simple design
is now paying dividends!
The Testing Framework
We'll start by creating a single entry point for all unit testing of the module:
def _test():
'test the module quietly and complain loudly on failure.'
'the tests may take a few seconds to run.'
import doctest
doctest.testmod()
I think doctest
is a very nice Python module. Here, I'm simply
validating the few basic tests in the module header above.
_luggageTest(1000)
Again, notice the leading _
, indicating semi-private functions.
The Old Luggage Test
luggageTest()
builds strings composed of random digits and
non-digits. It handles the different formats that normalize
accepts as separate cases. It does not achieve complete
coverage. It merely tries to exercise a large number of
variants on the known cases.
def _luggageTest(count):
'tests against randomly generated phone numbers.'
To construct the strings, we need to be clear about what we mean by a digit and a non-digit. The string module provides a string of digits, and we construct a string of non-digits from that. We have to exclude the newline character because the regular expressions were not defined to be multiline, although doing so would not be difficult.
from string import digits, printable
nonDigits = ''.join(c for c in printable if c not in digits and c != '\n')
Next, we'll define helper functions for producing random strings. these are considered 'notation' and are therefore given short names; while it's true that you have to look up a few lines to figure out what each means, the string constructions below fit neatly onto a single line.
import random
def rD(count): # r for random, D for regex non-digit
return ''.join(random.choice(nonDigits) for i in range(count))
def rd(count): #r for random, D for regex digit
return ''.join(random.choice(digits) for i in range(count))
def rInt(count=3): #small random integer
return random.randint(0,count)
def rDdD(count=2): #string containing digits in the middle
return rD(rInt(count)+1) + rd(rInt(count)) + rD(rInt(count)+1)
Now that we have all the tools we need, we'll exercise each known accepted format using the random string generators:
for i in xrange(count):
#with area code
phoneNumber = rD(rInt()) + \
rd(3) + \
rD(rInt()) + \
rd(3) + \
rD(rInt()) + \
rd(4) + \
rD(rInt()) + \
rd(rInt()) + \
rD(rInt())
_check(phoneNumber)
#local
phoneNumber = rD(rInt()) + \
rd(3) + \
rD(rInt()) + \
rd(4) + \
rD(rInt()) + \
rd(rInt()) + \
rD(rInt())
_check(phoneNumber)
# area code, together
phoneNumber = rD(rInt()) + rd(10) + rD(rInt())
_check(phoneNumber)
#local, together
phoneNumber = rD(rInt()) + rd(7) + rD(rInt())
_check(phoneNumber)
# area code, scatter in some digits at the front.
phoneNumber = rDdD() + \
rd(3) + \
rD(rInt()) + \
rd(3) + \
rD(rInt()) + \
rd(4) + \
rD(rInt()) + \
rd(rInt()) + \
rD(rInt())
_check(phoneNumber)
# local, scatter in some digits at the front.
phoneNumber = rDdD() + \
rd(3) + \
rD(rInt()) + \
rd(4) + \
rD(rInt()) + \
rd(rInt()) + \
rD(rInt())
_check(phoneNumber)
We'll also test some cases where we know it is NOT a phone number. we will be able to reconstruct the original, but all the atomic fields will be empty strings:
for i in xrange(count): # not a phone number
string = rD(rInt()+5)
norm = normalize(string)
recon = reconstruct(norm)
if ( recon != string ) or \
( norm[1] !='' ):
print string
print norm
print
That's the end of _luggageTest()
.
Checking Each Phone Number
We still have to define this helper function:
def _check(phoneNumber):
'Verifies that normalize() detects the atomic fields in the phone number.'
norm = normalize(phoneNumber)
recon = reconstruct(norm)
if (recon != phoneNumber) or \
(norm[1] == '') or \
(norm[2] == ''):
print phoneNumber
print norm
print recon
print
And we're done with the testing framework. All we need now is a convenient way to kick off the test.
It's another Python idiom to check the __name__
to see if
we're being run from the command line. For modules that
are meant to be imported into other python scripts, (as
opposed to Python command line scripts) it's common to
use this check to self-test the module. Here I've isolated the test
code in _test(), which is my personal preference.
if __name__ == '__main__':
_test()
Closing Thoughts
I've explored three concepts in this program: the 3NF problem that the module is designed to explore, the unit testing that made it possible to trust code depending heavily on a couple unreadable regular expressions, and the literate programming style of explaining the first two.
Suppose I were to say the string was a regular expression. And suppose I were to say the string was unreadable. But I repeat myself. - Mark Twain.
Python comes with a unit test framework that I did not use here, because
it would not have helped me write the brute force _luggageTest()
and I prefer
the elegance of doctest
.
Afterword
I originally wrote this article as comments embedded in the source itself. I did have to apply a few transformations to get it into neatly formatted HTML, but I that only took a few minutes. Proof-reading took longer, because comments tend to be terse labels, not perfect English. I would say the results were reasonable. If you compare the proof-reading changes in this document to the source code, you'll see that I significantly changed the section structure to better fit the program. It seems precise section and boundary formatting is needed to do proper literate programming.
- Oran Looney May 8th 2007
Thanks for reading. This blog is in "archive" mode and comments and RSS feed are disabled. We appologize for the inconvenience.