Expanding Keywords to Links colorful lead photo

Certain keywords and phrases on this site, like Code Complete , always appear as hyperlinks. That's because I use regular expressions (and Python) to do it automatically.

I'm sure there are many ways to do this. Mine's pretty simple. I maintain a table of keyword/URL pairs that I always want expanded on my site. Every time I write an article, I add a few more. (The advantage over just embedding them in the article is that those keywords then become hyperlinks on every article I've ever written or ever will write.)

In the site's settings.py file, I compile this list into a function that makes all the replacements. The advantage here is that a lot of the work takes place ahead of time. When I tried it out on some lorem ipsum and even with dozens of keywords and hundreds of paragraphs of text, timeit measured it doing the replacements in about 1/20th of a second. For the size of articles I write, it's pretty much instantaneous.

I don't have to make the expansions at run-time, of course; I could use a utility to find and expand the links ahead of time. But if Java has taught us nothing else, it's that things are cleaner when you do them at run-time, as late as possible. If you're willing to pay the run-time cost, you can achieve cleaner, more maintainable code.

Anyway. Here's the code I use:

def makeLinkExpander(linkDictionary):
    ''' Returns a function that wraps each key with an HTML link to its value.
    The keys of linkDictionary should be all uppercase.
    The replacement is case-insensitive and spans lines.
    '''

    import re
    # regex will match any key in the dictionary
    regex = re.compile( r'\s(%s)\s' % '|'.join(linkDictionary.keys()), \
        re.IGNORECASE | re.MULTILINE )

    def replaceMatch(match):
        word = match.group(1)
        return ' <a href="%s">%s</a> ' % (linkDictionary[word.upper()], word)

    def linkExpander(string):
        return regex.sub(replaceMatch, string)
    return linkExpander

So this is kind of interesting; to match the keywords, I just compile a long regular expression

(keyword|expandMe|AnotherMatch|...)

I don't escape the keywords, but that doesn't matter because I'm not going to choose anything other than (\w| )+ for keywords. I then take advantage of the fact that Python's re module lets you supply a function as a substitution. That gives me the flexibility I need to pass the match through the dictionary in a case-insensitive way.

The other cool thing is that I define linkExpander(), the function that does the work, inside of makeLinkExpander() and then return linkExpander. That's called a closure; it's a little bit like the Factory pattern, but it's a concept from functional programming. We're constructing the function at run-time. The advantage here is that a lot of the work (most notably, loading from the database) is done ahead of time.

One thing that I considered but didn't implement was using pickle to store the pre-built link expander as a BLOB or in a file. settings.py was a better solution in this scenario. I never tried this with pickle so I'm not 100% certain that it would work with the compiled regular expressions, but the basic idea is that since Python functions are just objects, we can serialize and save them off.

That's something to keep in mind if you're building a complex function from a simple table each time your application loads: serialize the function and load it directly. I suppose load times would have get pretty noticeable before it's worth the (minor) trouble, but I've seen plenty of apps that struggle at load time.

One last comment about my link expander: it could be a lot smarter. You see how I match on any surrounding whitespace, but then pad the link with literal spaces? That should keep it from matching inside of words, and all whitespace is the same in HTML. (Version 0.1 had some trouble expanding names inside of image filenames, which of course broke the images.) But clearly it doesn't cover <PRE> blocks and other edge cases. I run it on the plain text version of the articles, before using markdown to expand to HTML, so I don't I haven't had too much trouble with it yet.

But I mean, come on! It's a dozen lines of code that does something pretty cool. Python and regex rule!

Update:

I've now changed the regex to absorb the a punctuation character after the end:

\s(?:(%s)([\)\-\'".,;:\s]))

And simply preserve the second group as-is in the expansion:

def replaceMatch(match):
    word = match.group(1)
    punct = match.group(2)
    return ' <a href="%s">%s</a>%s' % \
        (linkDictionary[word.upper()], word, punct)

It was simply too common to put a period or comma after a keyword, so I decided to accept the slightly increased risk of breaking a link or or a tag to handle those cases.

- Oran Looney April 26th 2007

Thanks for reading. This blog is in "archive" mode and comments and RSS feed are disabled. We appologize for the inconvenience.