A Literate Program on the 3rd Normal Form

I wrote this article in an unusual way: I started from the source code of a Python module, commented it thoroughly, made some minor formatting changes, and fed it through the markdown HTML expander. The result is what I'd like to think is a pretty good example of a literate program. You can see the original or formatted source code, or the pre-markdown text file, if you like.

phoneNormalizer.py

The phoneNormalizer module demonstrates how a formatted string (a phone number is used as an example) can be divided into useful atomic parts without loss of information and without introducing dependencies for storage in a relational database. In other words, it addresses issues of 3NF storage of data that is not trivially atomic. The module is self-testing, and it has been thoroughly documented in an attempt at literate programming.

Avoiding Dependent Columns

Here's the problem: storing the raw phone number string in a single text column in a database makes it unsearchable and hard to work with. Storing only atomic fields (area code, extension, etc.) throws away important information and makes handling edge cases like international numbers unnecessarily challenging. However, storing both the raw string and the atomic fields introduces a dependency between columns (since the atomic fields can be computed from the raw string), which introduces the possibility of a corrupt database.

Now, I can't imagine anyone would lose any sleep over these theoretical problems just for the case of phone numbers. The problem is more general, and this module demonstrates a general technique. The lesson here is to remember that it is ALWAYS possible to factor the atomic parts out of a complicated data type.

Testing Strategy

The following snippet (copied from an interactive session) is included in the header docstring of the module. Even though it's just text in a string rather than executable code, it will be identified and parsed by doctest, which will run the 'commands' and validate the 'output':

>>> normalize('work 123 (456)7890 x12 ')
('123', '456', '7890', '12', 'work %s (%s)%s x%s ')
>>> reconstruct(_)
'work 123 (456)7890 x12 '
>>> normalize('home 345-6789')
('', '345', '6789', '', 'home %s%s-%s%s')
>>> reconstruct(_)
'home 345-6789'

Using doctest has two advantages: we can quickly create simple unit tests, and we provide documentation that is guaranteed to be accurate. However, the testing is too simple to give us any real confidence in the functions. For that, we supply a 'luggage test' function that generates random phone numbers in an attempt to cover a large portion of the input space.

The luggage test actually did catch a major bug: I forgot to escape the % characters when I first wrote the function. That's great, but the unit testing actually performed a more important service. I had to tweak the regular expressions several times to catch all the cases. Each time I changed the regular expression, I couldn't be sure that the previous cases would continue to work. However, I was able to tweak and massage the regular expression safely and easily, with the unit tests providing the safety net.

Unit tests enable painless refactoring.

There are two kinds of functions: those that are amenable to unit testing and those that are not. I feel this distinction is important because it reminds us to factor the unit-testable portion out and test it thoroughly. That way we can "rely on reliable things." As the non-unit-testable portion of our code shrinks and simplifies, human testing becomes more productive.

Initialization

First, we import the regular expression module, which we'll rely heavily on:

    import re

We'll need two regular expressions. The first matches a phone number, with area code, including an optional extension:

    _phonePattern = re.compile(
        r'^(.*)(\d{3})(\D*)(\d{3})(\D*)(\d{4})(\D*)(\d*)(\D*)$'
    )

The second matches a local phone number (no area code) and optional extension:

    _localPhonePattern = re.compile(
        r'^(.*)()()(\d{3})(\D*)(\d{4})(\D*)(\d*)(\D*)$'
    )

These two regular expressions could be combined, but only at a tremendous cost in readability. Note that the patterns are compiled just once, when the module loads. The leading '_' is a Python idiom indicating these are semi-private variables that should not normally be used outside the module.
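
For instance, matching the first doctest example against the first pattern yields nine groups, with the runs of digits alternating with the separator text around them:

>>> _phonePattern.match('work 123 (456)7890 x12 ').groups()
('work ', '123', ' (', '456', ')', '7890', ' x', '12', ' ')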

Separate Atomic Fields from Formatting

normalize() is the essence of this module. Notice that I chose to return a simple tuple, instead of defining a 'phone number' class. It can be argued that every abstraction should be represented as a class, but I feel that a class should provide a significant service, and in this case it wouldn't do much more than group the data together.

Using a tuple has the practical advantage of keeping the code simple. I wanted to take advantage of Python's built-in string formatting operator, %. The % operator can take either a tuple or a dictionary as the RHS. I chose the tuple over the dictionary because the components of a phone number come in a specific order, making them more tuple-like. Because the tuple really is the right logical choice, a lot of little things tended to work out; for example, the repr() of the tuple is quite readable.
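
For instance, the % operator fills in '%s' place-holders from a tuple positionally, which is exactly the shape of a phone number's parts:

>>> '(%s) %s-%s' % ('123', '456', '7890')
'(123) 456-7890'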

def normalize(phoneNumber):
    'Maps a phone number in string format to (areaCode,trunk,branch,ext,form).'
    'assert: form % (areaCode,trunk,branch,ext) == phoneNumber'

form is the original string with the identified atomic substrings factored out as '%s'. The 'assert:' line is only a string, not a real assert statement, so it is not validated by doctest; it's just a compact way to express a property that the return value will always have, even if the argument passed in isn't a recognized phone number.
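
Here is that property in action, both for a recognized phone number and for an arbitrary string:

>>> norm = normalize('work 123 (456)7890 x12 ')
>>> norm[4] % norm[:4]
'work 123 (456)7890 x12 '
>>> norm = normalize('no digits here')
>>> norm[4] % norm[:4]
'no digits here'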

We'll try to match each pattern in turn:

    match = _phonePattern.match(phoneNumber)
    if not match: match = _localPhonePattern.match(phoneNumber)

If neither pattern matches, we'll return empty strings '' for each atomic field and use the raw string, with the four place-holders appended, as the form field, so it may safely be %'ed with the 4-tuple of atomic fields:

    if not match: return ('','','','',phoneNumber.replace('%','%%') +('%s'*4))

Note that replacing '%'s with '%%'s escapes them from the % operator.
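
For example, an unrecognized string containing a literal % comes back escaped in the form field, and still reconstructs exactly:

>>> normalize('50% off')
('', '', '', '', '50%% off%s%s%s%s')
>>> reconstruct(_)
'50% off'
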
Now that we have a regular expression match, we'll separate the atomic fields from the formatting. We'll construct a string containing all the non-phone-number characters from the original, with %s place-holders where the atomic fields were:

    form = '%s'.join(tween.replace('%','%%') for tween in match.groups()[::2])

For example, "work (%s)%s-%s x%s". Note that here and below I use the [::2] slice operator. This says take every other field. So, I'm construct the form out the even groups, and the atomic fields are the odd groups:

    return match.groups()[1::2] + (form,)

So we see that normalize returns a 5-tuple: the four atomic fields and the form field at the end.
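
Putting the two slices side by side for the earlier example makes the split clear (ignoring, for the moment, the % escaping that the real code applies to the separators):

>>> m = _phonePattern.match('work 123 (456)7890 x12 ')
>>> m.groups()[1::2]
('123', '456', '7890', '12')
>>> '%s'.join(m.groups()[::2])
'work %s (%s)%s x%s '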

Now that we're done with normalize(), it might be a good time to comment on Python's docstrings. The first line of the docstring is considered special; it's what you'll see pop up in an IDE while typing the function. The rest of the docstring is available in __doc__, but the first line should be self-sufficient for maximum clarity. # comments (such as this paragraph) are even lower on the totem pole; they're 'private' comments that will only be seen by someone reading the code.

Reconstructing the Original String

The entire point of this exercise is to separate the atomic information in a string from its formatting without duplication and without destroying information. Therefore, it's important that we can retrieve the original input. Note that if someone ran a conversion on the database, perhaps updating the area code column, the reconstructed string would reflect the updated value. This is not appropriate for all cases, such as when there's a legal reason to record the input as-is.

def reconstruct(phoneTuple):
    '(areaCode,trunk,branch,ext,form) -> original formatted phone number'
    return phoneTuple[4] % phoneTuple[0:4]

reconstruct() turns out to be a trivial function; our simple design is now paying dividends!
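
To make the database angle concrete, here is a small round-trip sketch that is not part of the module: it stores the normalized pieces in an in-memory sqlite3 table (the sqlite3 module ships with Python 2.5 and later) and recovers the original string on the way out. The table and column names are only illustrative.

    import sqlite3
    from phoneNormalizer import normalize, reconstruct

    db = sqlite3.connect(':memory:')
    db.execute('create table phone '
               '(areaCode text, trunk text, branch text, ext text, form text)')

    # store only the atomic fields plus the form, never the raw string
    original = 'work 123 (456)7890 x12 '
    db.execute('insert into phone values (?, ?, ?, ?, ?)', normalize(original))

    # the original formatting comes back out intact
    row = db.execute('select areaCode, trunk, branch, ext, form from phone').fetchone()
    print reconstruct(row) == original    # prints True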

The Testing Framework

We'll start by creating a single entry point for all unit testing of the module:

def _test():
    'test the module quietly and complain loudly on failure.'
    'the tests may take a few seconds to run.'

    import doctest
    doctest.testmod()

I think doctest is a very nice Python module. Here, I'm simply validating the few basic tests in the module header above.

    _luggageTest(1000)

Again, notice the leading _, indicating semi-private functions.

The Old Luggage Test

_luggageTest() builds strings composed of random digits and non-digits. It handles the different formats that normalize() accepts as separate cases. It does not achieve complete coverage; it merely tries to exercise a large number of variants on the known cases.

def _luggageTest(count):
    'tests against randomly generated phone numbers.'

To construct the strings, we need to be clear about what we mean by a digit and a non-digit. The string module provides a string of digits, and we construct a string of non-digits from that. We have to exclude the newline character because the regular expressions were not defined to be multiline, although doing so would not be difficult.

    from string import digits, printable
    nonDigits = ''.join(c for c in printable if c not in digits and c != '\n')

Next, we'll define helper functions for producing random strings. These are considered 'notation' and are therefore given short names; while it's true that you have to look up a few lines to figure out what each means, the string constructions below fit neatly onto a single line.

    import random
    def rD(count):  # r for random, D for regex non-digit
        return ''.join(random.choice(nonDigits) for i in range(count))

    def rd(count):  # r for random, d for regex digit
        return ''.join(random.choice(digits) for i in range(count))

    def rInt(count=3):  # small random integer in [0, count]
        return random.randint(0,count)

    def rDdD(count=2):  # string containing digits in the middle
        return rD(rInt(count)+1) + rd(rInt(count)) + rD(rInt(count)+1)


Now that we have all the tools we need, we'll exercise each known accepted format using the random string generators:

    for i in xrange(count):
        #with area code
        phoneNumber = rD(rInt()) + \
             rd(3) + \
             rD(rInt()) + \
             rd(3) + \
             rD(rInt()) + \
             rd(4) + \
             rD(rInt()) + \
             rd(rInt()) + \
             rD(rInt())            
        _check(phoneNumber)

        #local
        phoneNumber = rD(rInt()) + \
             rd(3) + \
             rD(rInt()) + \
             rd(4) + \
             rD(rInt()) + \
             rd(rInt()) + \
             rD(rInt())            
        _check(phoneNumber)

        # area code, together
        phoneNumber = rD(rInt()) + rd(10) + rD(rInt())
        _check(phoneNumber)

        #local, together
        phoneNumber = rD(rInt()) + rd(7) + rD(rInt())
        _check(phoneNumber)

        # area code, scatter in some digits at the front.
        phoneNumber = rDdD() + \
             rd(3) + \
             rD(rInt()) + \
             rd(3) + \
             rD(rInt()) + \
             rd(4) + \
             rD(rInt()) + \
             rd(rInt()) + \
             rD(rInt())            
        _check(phoneNumber)

        # local, scatter in some digits at the front.
        phoneNumber = rDdD() + \
             rd(3) + \
             rD(rInt()) + \
             rd(4) + \
             rD(rInt()) + \
             rd(rInt()) + \
             rD(rInt()) 

        _check(phoneNumber)

We'll also test some cases where we know the input is NOT a phone number. We will still be able to reconstruct the original, but all the atomic fields will be empty strings:

    for i in xrange(count): # not a phone number
        string = rD(rInt()+5)
        norm = normalize(string)
        recon = reconstruct(norm)
        if ( recon != string ) or \
           ( norm[1] !='' ):
            print string
            print norm
            print

That's the end of _luggageTest().

Checking Each Phone Number

We still have to define this helper function:

def _check(phoneNumber):
    'Verifies that normalize() detects the atomic fields in the phone number.'
    norm = normalize(phoneNumber)
    recon = reconstruct(norm)
    if (recon != phoneNumber) or \
       (norm[1] == '') or \
       (norm[2] == ''):
        print phoneNumber
        print norm
        print recon
        print

And we're done with the testing framework. All we need now is a convenient way to kick off the test. It's another Python idiom to check __name__ to see if we're being run from the command line. For modules that are meant to be imported into other Python scripts (as opposed to Python command-line scripts), it's common to use this check to self-test the module. Here I've isolated the test code in _test(), which is my personal preference.

if __name__ == '__main__':
    _test()

Closing Thoughts

I've explored three concepts in this program: the 3NF problem that the module is designed to address, the unit testing that made it possible to trust code that depends heavily on a couple of unreadable regular expressions, and the literate programming style used to explain the first two.

Suppose I were to say the string was a regular expression. And suppose I were to say the string was unreadable. But I repeat myself. - Mark Twain.

Python comes with a unit test framework, unittest, that I did not use here: it would not have helped me write the brute-force _luggageTest(), and I prefer the elegance of doctest.
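
For comparison, here is a rough sketch, not part of the module, of what just the first doctest case would look like as a unittest test case:

    import unittest
    from phoneNormalizer import normalize

    class NormalizeTest(unittest.TestCase):
        def testWorkNumber(self):
            self.assertEqual(normalize('work 123 (456)7890 x12 '),
                             ('123', '456', '7890', '12', 'work %s (%s)%s x%s '))

    if __name__ == '__main__':
        unittest.main()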

Afterword

I originally wrote this article as comments embedded in the source itself. I did have to apply a few transformations to get it into neatly formatted HTML, but that only took a few minutes. Proof-reading took longer, because comments tend to be terse labels, not polished English. I would say the results were reasonable. If you compare the proof-reading changes in this document to the source code, you'll see that I significantly changed the section structure to better fit the program. It seems that choosing section boundaries carefully is necessary to do literate programming properly.

- Oran Looney, May 8th, 2007
