On a recent Saturday morning, I was called upon to extract data from
my family's web site so that a new site could be constructed using fancy
web standards of which I know nothing. The old HTML files were in good
order, so it was easy to pull the data we needed out of them.
The pages make up an online photo album with captions. As the captions
and images were in the same order, I was able to write this data-scraping
class fairly quickly to get all the unique bits out of the pages:
import sgmllib
from cgi import escape

class MSParser(sgmllib.SGMLParser):
    def __init__(self):
        self.dates = {}        # date -> {'captions': [...], 'images': [...]}
        self.datelist = []     # dates in the order they appear
        self.curr_date = ''
        self.in_caption = False
        self.in_datebox = False
        self.data = []
        sgmllib.SGMLParser.__init__(self)

    def unknown_starttag(self, tag, attrs):
        """We want anything in captions, including other tags."""
        if self.in_caption:
            s = '<%s' % tag
            for a, v in attrs:
                s = '%s %s="%s"' % (s, a, escape(v, quote=True))
            self.data.append(s)
            self.data.append('>')

    def start_p(self, attrs):
        """Defining this keeps unknown_starttag from recording p tags
        inside captions."""
        pass

    def end_p(self):
        """Likewise, ignore closing p tags in captions."""
        pass

    def start_td(self, attrs):
        for attr, value in attrs:
            if value == 'DateBox':
                self.in_datebox = True
            elif value == 'CaptionBox':
                self.in_caption = True

    def start_img(self, attrs):
        # Initialize so a tag missing one of these attributes
        # can't raise a NameError below.
        img = width = height = None
        for attr, value in attrs:
            if attr == 'src':
                img = value
            elif attr == 'width':
                width = value
            elif attr == 'height':
                height = value
        img_data = (img, width, height)
        self.dates[self.curr_date]['images'].append(img_data)

    def end_td(self):
        if self.in_caption:
            caption = ''.join(self.data)
            self.dates[self.curr_date]['captions'].append(caption)
            self.data = []
            self.in_caption = False

    def unknown_endtag(self, tag):
        if self.in_caption:
            self.data.append('</%s>' % tag)

    def handle_data(self, text):
        if self.in_caption:
            self.data.append(escape(text, quote=True))
        if self.in_datebox:
            date = text.strip()
            if date:
                self.curr_date = date
                self.datelist.append(date)
                self.dates[date] = {'captions': [], 'images': []}
            self.in_datebox = False

    def error(self, message):
        pass    # swallow parse errors in the creaky old pages
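For anyone reading this on Python 3, where sgmllib no longer exists, the same state-machine technique carries over to html.parser. Here is a rough equivalent; the class name and the test markup are my own invention, not taken from the family site:

```python
from html import escape
from html.parser import HTMLParser

class AlbumParser(HTMLParser):
    """Same idea as MSParser: flip state flags on td tags,
    collect caption text and image attributes per date."""

    def __init__(self):
        super().__init__()
        self.dates = {}        # date -> {'captions': [...], 'images': [...]}
        self.datelist = []     # dates in page order
        self.curr_date = None
        self.in_caption = False
        self.in_datebox = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        if self.in_caption:
            # Keep markup inside captions, re-serialized.
            bits = ''.join(' %s="%s"' % (k, escape(v or '', quote=True))
                           for k, v in attrs)
            self.data.append('<%s%s>' % (tag, bits))
            return
        attrs = dict(attrs)
        if tag == 'td':
            if attrs.get('class') == 'DateBox':
                self.in_datebox = True
            elif attrs.get('class') == 'CaptionBox':
                self.in_caption = True
        elif tag == 'img' and self.curr_date:
            img = (attrs.get('src'), attrs.get('width'), attrs.get('height'))
            self.dates[self.curr_date]['images'].append(img)

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_caption:
            self.dates[self.curr_date]['captions'].append(''.join(self.data))
            self.data = []
            self.in_caption = False
        elif self.in_caption:
            self.data.append('</%s>' % tag)

    def handle_data(self, text):
        if self.in_caption:
            self.data.append(escape(text, quote=True))
        elif self.in_datebox:
            date = text.strip()
            if date:
                self.curr_date = date
                self.datelist.append(date)
                self.dates[date] = {'captions': [], 'images': []}
                self.in_datebox = False
```

Feeding it a fragment like `<td class="DateBox">June 2004</td><td class="CaptionBox">First <b>steps</b>!</td><img src="a.jpg" width="640" height="480">` fills in `datelist`, `captions`, and `images` the same way the sgmllib version does.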
After that, much more work went into automatically generating the new
set of pages based on the dates, captions, and image links I had obtained
using that MSParser class. I think the resulting pages look fantastic.
I may be a tad biased, however.
Now we have to think about a new way of keeping that site up to date. It
doesn't make any sense to write new pages by hand when you can add new
content to a database and have a program generate the resulting pages
automatically. Baby steps. Literally.
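If we do go that route, it might look something like the sketch below: a single SQLite table of photos, and a function that renders one page per date. The schema, table name, and sample rows are all made up for illustration:

```python
import sqlite3

# Hypothetical schema: one row per photo, grouped by date.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE photos (date TEXT, src TEXT, caption TEXT)')
conn.executemany('INSERT INTO photos VALUES (?, ?, ?)', [
    ('June 2004', 'a.jpg', 'First steps!'),
    ('June 2004', 'b.jpg', 'Nap time.'),
])

def render_page(date):
    """Build one album page for a date from the database rows."""
    rows = conn.execute(
        'SELECT src, caption FROM photos WHERE date = ?', (date,))
    items = '\n'.join(
        '<div><img src="%s"><p>%s</p></div>' % (src, cap)
        for src, cap in rows)
    return '<html><body><h1>%s</h1>\n%s\n</body></html>' % (date, items)

page = render_page('June 2004')
```

Adding a new photo would then be one INSERT plus a re-run of the generator, instead of hand-editing HTML.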