2004-12-7
Word Labels -> Address Book
Word documents can be like data motels: data goes in, but you can't get it out in anything but printed form.
For years, Christine (my wife) has maintained the family address book in a set of Microsoft Word documents configured to print labels. This worked across multiple operating systems and multiple versions of Word.
It is also a horribly inflexible solution that made it impossible to intelligently manage groups of addresses or tie additional information to people. But it worked back in the days prior to Mac OS X, when our computing environment was a hodgepodge of PCs and NeXT machines running cobbled-together non-production operating systems.
Since then, Mac OS X has been released and Address Book has matured quite nicely. In particular, Address Book has decent group management capabilities, is nicely integrated with other applications on the system, and can quite effectively print mailing labels.
So, we wanted to move the data out of Microsoft Word and into Address Book. Address Book can import vCard-formatted address data, so the problem is really one of exporting from Word labels to vCard.
Not so easy. The first problem is that Microsoft Word's labels are not a contact manager. Labels don't have any real structure or fields. To compound the problem, there isn't any kind of useful export format for Word's labels. The text-only formats have relatively randomized sets of newlines.
I chose to export to HTML because that, at least, yields structured documents that are very easy to parse.
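To give a feel for why the HTML export is workable, here is a minimal sketch using Python's standard `html.parser`. It assumes each label ends up in its own table cell (`<td>`) with one paragraph per line; that layout is my assumption for illustration, and `LabelCellParser` and `extract_labels` are hypothetical names, not the ones in the actual script.

```python
from html.parser import HTMLParser


class LabelCellParser(HTMLParser):
    """Collect the text lines of each <td>, assuming one label per table cell."""

    def __init__(self):
        super().__init__()
        self.labels = []    # one list of text lines per non-empty cell
        self._lines = None  # lines of the cell being read; None outside a cell
        self._buf = []      # text fragments of the line being read

    def _end_line(self):
        # Collapse runs of whitespace (including non-breaking spaces) to one space.
        text = " ".join("".join(self._buf).split())
        self._buf = []
        if text and self._lines is not None:
            self._lines.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._lines = []
            self._buf = []
        elif tag in ("p", "br"):
            self._end_line()

    def handle_endtag(self, tag):
        if tag == "p":
            self._end_line()
        elif tag == "td":
            self._end_line()
            if self._lines:  # skip blank labels
                self.labels.append(self._lines)
            self._lines = None

    def handle_data(self, data):
        if self._lines is not None:
            self._buf.append(data)


def extract_labels(html):
    """Return a list of labels, each a list of text lines."""
    parser = LabelCellParser()
    parser.feed(html)
    parser.close()
    return parser.labels
```

Feeding it the exported document yields one list of lines per label, with empty cells dropped, which is a far friendlier starting point than the text-only exports.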
To make matters more complex, there is a lot of noise in the original label data. Things like "The Smith Family", multiline names, some international addresses, and other weirdness.
The end result is a Python script that can parse Word HTML and turn it into vCards. It is terribly specific to the data found in our label documents, containing a number of special cases and the like for cleaning up the data on the fly.
Fortunately, Address Book is really good about consuming poorly deconstructed address data. If you just put the full name into the "FN:" (full name) field and the entire address into the "ADR:" field without decomposing anything, Address Book will happily format labels correctly.
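A rough sketch of that lazy rendering, under a few assumptions of mine: the card is vCard 3.0, the undivided address goes into the street component of ADR, and `escape_text` and `simple_vcard` are hypothetical helper names. Strictly speaking, vCard 3.0 also wants an "N:" field; it is left out here to mirror the FN-plus-ADR approach described above.

```python
def escape_text(value):
    """Escape backslash, semicolon, comma and newline for a vCard 3.0 text value."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def simple_vcard(full_name, address_lines):
    """Build a vCard with only a full name and one undecomposed address."""
    # Assumption: the whole address is stuffed into the street slot,
    # the third of ADR's seven semicolon-separated components.
    street = escape_text("\n".join(address_lines))
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        "FN:" + escape_text(full_name),
        "ADR;TYPE=HOME:;;" + street + ";;;;",
        "END:VCARD",
    ]
    return "\r\n".join(lines) + "\r\n"
```

Calling `simple_vcard("Jane Doe", ["12 Elm St", "Springfield, IL 62704"])` gives a card whose address is a single blob of text: good enough for labels, useless for sorting.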
But the result is also something that isn't really complete. Names don't sort correctly and you can't sort by zip or state, for example.
The script goes a bit further and uses a series of regular expressions and some really nasty hardwired logic to try to detect the city/state/zip and break it out into the right vCard fields. The expressions are written such that anything that fails to match exactly will fall back to the less specific rendering of the vCard.
I shoved the whole thing into my random hacques Subversion repository, and slightly more technical details can be found in a README.
Let me re-emphasize the last point from the README: the script is very poorly written. It is a one-off and, now that the data is freed from the confines of a Word document, I'll hopefully never have to run it again.
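The match-or-fall-back idea looks roughly like the sketch below. The pattern only recognizes a US-style "City, ST 12345" last line, and the names `CITY_STATE_ZIP`, `escape_text` and `adr_value` are hypothetical; the real script's expressions and special cases are messier and are not reproduced here.

```python
import re

# Assumed pattern: "City, ST 12345" or "City, ST 12345-6789" on the last line.
CITY_STATE_ZIP = re.compile(
    r"^(?P<city>[^,]+),\s*(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$"
)


def escape_text(value):
    """Escape backslash, semicolon, comma and newline for a vCard text value."""
    return (value.replace("\\", "\\\\")
                 .replace(";", "\\;")
                 .replace(",", "\\,")
                 .replace("\n", "\\n"))


def adr_value(address_lines):
    """Return the value of an ADR field, decomposed only on an exact match."""
    if len(address_lines) >= 2:
        match = CITY_STATE_ZIP.match(address_lines[-1].strip())
        if match:
            street = escape_text("\n".join(address_lines[:-1]))
            city = escape_text(match.group("city").strip())
            # Components: PO box; extended; street; city; state; zip; country
            return ";;%s;%s;%s;%s;" % (
                street, city, match.group("state"), match.group("zip"))
    # Anything that does not match exactly stays as one undivided blob.
    return ";;" + escape_text("\n".join(address_lines)) + ";;;;"
```

A domestic address such as `["12 Elm St", "Springfield, IL 62704"]` comes out with city, state and zip in their own slots, while something like `["10 Downing Street", "London SW1A 2AA"]` fails the pattern and is passed through whole.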