On a recent Saturday morning, I was called upon to extract data from
my family's web site so that a new site could be constructed using fancy
web standards of which I know nothing. The old HTML files were in good
order, so it was easy to pull the data we needed out of them.
The pages make up an online photo album with captions. As the captions
and images were in the same order, I was able to write this data-scraping
class fairly quickly to get all the unique bits out of the pages:
import sgmllib
from cgi import escape

class MSParser(sgmllib.SGMLParser):
    def __init__(self):
        self.dates = {}        # date -> {'captions': [...], 'images': [...]}
        self.datelist = []     # dates in the order they appear
        self.curr_date = ''
        self.in_caption = False
        self.in_datebox = False
        self.data = []
        sgmllib.SGMLParser.__init__(self)

    def unknown_starttag(self, tag, attrs):
        """We want anything in captions, including other tags."""
        if self.in_caption:
            s = '<%s' % tag
            for a, v in attrs:
                s = '%s %s="%s"' % (s, a, escape(v, quote=True))
            self.data.append(s)
            self.data.append('>')

    def start_p(self, attrs):
        """Defining this keeps unknown_starttag from recording p tags
        inside captions."""
        pass

    def end_p(self):
        """Likewise, ignore closing p tags in captions."""
        pass

    def start_td(self, attrs):
        for attr, value in attrs:
            if value == 'DateBox':
                self.in_datebox = True
            elif value == 'CaptionBox':
                self.in_caption = True

    def start_img(self, attrs):
        # Initialize so a tag missing one of these attributes
        # can't raise a NameError below.
        img = width = height = None
        for attr, value in attrs:
            if attr == 'src':
                img = value
            elif attr == 'width':
                width = value
            elif attr == 'height':
                height = value
        img_data = (img, width, height)
        self.dates[self.curr_date]['images'].append(img_data)

    def end_td(self):
        if self.in_caption:
            caption = ''.join(self.data)
            self.dates[self.curr_date]['captions'].append(caption)
            self.data = []
            self.in_caption = False

    def unknown_endtag(self, tag):
        if self.in_caption:
            self.data.append('</%s>' % tag)

    def handle_data(self, text):
        if self.in_caption:
            self.data.append(escape(text, quote=True))
        if self.in_datebox:
            date = text.strip()
            if date:
                self.curr_date = date
                self.datelist.append(date)
                self.dates[date] = {'captions': [], 'images': []}
            self.in_datebox = False

    def error(self, message):
        pass    # swallow parse errors in the creaky old pages
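For anyone reading this on Python 3, where sgmllib no longer exists, the same state-machine technique carries over to html.parser. Here is a rough equivalent; the class name and the test markup are my own invention, not taken from the family site:

```python
from html import escape
from html.parser import HTMLParser

class AlbumParser(HTMLParser):
    """Same idea as MSParser: flip state flags on td tags,
    collect caption text and image attributes per date."""

    def __init__(self):
        super().__init__()
        self.dates = {}        # date -> {'captions': [...], 'images': [...]}
        self.datelist = []     # dates in page order
        self.curr_date = None
        self.in_caption = False
        self.in_datebox = False
        self.data = []

    def handle_starttag(self, tag, attrs):
        if self.in_caption:
            # Keep markup inside captions, re-serialized.
            bits = ''.join(' %s="%s"' % (k, escape(v or '', quote=True))
                           for k, v in attrs)
            self.data.append('<%s%s>' % (tag, bits))
            return
        attrs = dict(attrs)
        if tag == 'td':
            if attrs.get('class') == 'DateBox':
                self.in_datebox = True
            elif attrs.get('class') == 'CaptionBox':
                self.in_caption = True
        elif tag == 'img' and self.curr_date:
            img = (attrs.get('src'), attrs.get('width'), attrs.get('height'))
            self.dates[self.curr_date]['images'].append(img)

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_caption:
            self.dates[self.curr_date]['captions'].append(''.join(self.data))
            self.data = []
            self.in_caption = False
        elif self.in_caption:
            self.data.append('</%s>' % tag)

    def handle_data(self, text):
        if self.in_caption:
            self.data.append(escape(text, quote=True))
        elif self.in_datebox:
            date = text.strip()
            if date:
                self.curr_date = date
                self.datelist.append(date)
                self.dates[date] = {'captions': [], 'images': []}
                self.in_datebox = False
```

Feeding it a fragment like `<td class="DateBox">June 2004</td><td class="CaptionBox">First <b>steps</b>!</td><img src="a.jpg" width="640" height="480">` fills in `datelist`, `captions`, and `images` the same way the sgmllib version does.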
After that, much more work went into automatically generating the new
set of pages based on the dates, captions, and image links I had obtained
using that MSParser class. I think the resulting pages look fantastic.
I may be a tad biased, however.
Now we have to think about a new way of keeping that site up to date. It
doesn't make any sense to write new pages by hand when you can add new
content to a database and have a program generate the resulting pages
automatically. Baby steps. Literally.
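If we do go that route, it might look something like the sketch below: a single SQLite table of photos, and a function that renders one page per date. The schema, table name, and sample rows are all made up for illustration:

```python
import sqlite3

# Hypothetical schema: one row per photo, grouped by date.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE photos (date TEXT, src TEXT, caption TEXT)')
conn.executemany('INSERT INTO photos VALUES (?, ?, ?)', [
    ('June 2004', 'a.jpg', 'First steps!'),
    ('June 2004', 'b.jpg', 'Nap time.'),
])

def render_page(date):
    """Build one album page for a date from the database rows."""
    rows = conn.execute(
        'SELECT src, caption FROM photos WHERE date = ?', (date,))
    items = '\n'.join(
        '<div><img src="%s"><p>%s</p></div>' % (src, cap)
        for src, cap in rows)
    return '<html><body><h1>%s</h1>\n%s\n</body></html>' % (date, items)

page = render_page('June 2004')
```

Adding a new photo would then be one INSERT plus a re-run of the generator, instead of hand-editing HTML.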