Fixing Simple XHTML Errors with Python

June 11th, 2008 · Programming

For a long time, one of my secret shames as a programmer was that there were parts of this website held together by some rather shoddy code. Writing this blog has been a great learning experience, but as a consequence I’ve often accidentally done things the “not quite right” way. Armed now with a better understanding of XHTML and CSS, I thought it might be a good idea to fix some of that spaghetti code and bring my blog up to at least XHTML 1.0 Transitional.

As a long overdue followup to my previous experimentation with Python, I decided to kill two birds with one stone by writing a script to clean up some common errors in my posts. Blame it on my sloppy HTML education, but I make these mistakes all the time. For instance:

Using <i> instead of <em>
Using <b> instead of <strong>
Not closing my IMG and BR tags
Forgetting to give my images “alt” text
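
A minimal sketch of how fixes like these might look with Python's `re` module. The function name `fix_xhtml` and the exact patterns are assumptions for illustration, not the script from this post, and regular expressions like these will miss edge cases such as a `>` inside an attribute value.

```python
import re

def fix_xhtml(html):
    """Apply a few regex-based XHTML cleanups (illustrative, not exhaustive)."""
    # <i>...</i> -> <em>...</em> and <b>...</b> -> <strong>...</strong>
    html = re.sub(r'<(/?)i>', r'<\1em>', html, flags=re.IGNORECASE)
    html = re.sub(r'<(/?)b>', r'<\1strong>', html, flags=re.IGNORECASE)

    # Normalize every <br>, <br/> or <br /> to the self-closed form.
    html = re.sub(r'<br\s*/?>', '<br />', html, flags=re.IGNORECASE)

    def close_img(match):
        # Strip any existing trailing slash so it is not doubled.
        attrs = match.group(1).rstrip().rstrip('/').rstrip()
        # Add an empty alt attribute when none is present.
        if not re.search(r'\balt\s*=', attrs, flags=re.IGNORECASE):
            attrs += ' alt=""'
        return '<img' + attrs + ' />'

    # Self-close every <img ...> tag, adding alt text where it is missing.
    html = re.sub(r'<img\b([^>]*)>', close_img, html, flags=re.IGNORECASE)
    return html
```

An empty `alt=""` only satisfies the validator; meaningful alt text still has to be written by hand.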

Furthermore, due mostly to my Musical Box series, I’ve embedded a rather large number of YouTube videos in my posts. For some crazy reason, the default code uses the deprecated <embed> tag and is therefore not XHTML 1.0 compliant! The mind-blowing part is that it’s relatively simple to embed videos in an XHTML friendly way. Why isn’t this code available by default?
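
The replacement markup is not shown in this post. One pattern commonly used for XHTML-valid Flash embedding is an `<object>` element carrying `type` and `data` attributes plus a `movie` parameter. The sketch below generates that markup under those assumptions. The function name `xhtml_flash_object`, the default dimensions, and the example URL are all hypothetical.

```python
from html import escape

def xhtml_flash_object(movie_url, width=425, height=344):
    """Build an <object>-only Flash embed (assumed pattern, no <embed> tag)."""
    # Escape the URL so a bare ampersand in the query string stays valid XHTML.
    url = escape(movie_url, quote=True)
    return (
        '<object type="application/x-shockwave-flash" '
        'data="{0}" width="{1}" height="{2}">'
        '<param name="movie" value="{0}" />'
        '</object>'
    ).format(url, width, height)

# Placeholder URL for illustration only.
print(xhtml_flash_object('http://www.youtube.com/v/VIDEO_ID&hl=en'))
```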

Armed with my copy of Dive Into Python and basic experience with regular expressions from Perl, I went about writing a script to parse my blog posts and fix the sloppy HTML. Since trying to embed Python code in a blog post would be inelegant at best, I present the script in three different formats (including some neat JavaScript syntax highlighting):

I’m not sure that my code is especially pythonic; sadly I have too many bad habits from Perl and Java. While I’m told that Python is frequently used as a scripting language, I get the impression that its real strength lies in larger applications. Furthermore, using regular expressions to parse HTML is a suboptimal solution; in retrospect, I should have used an XML parser module. Any constructive criticism of my code is more than welcome. I’m here to learn.
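
As a rough sketch of the parser-based alternative, the standard library's `html.parser.HTMLParser` can tokenize a post and re-emit it with the same fixes applied. The names `XhtmlFixer`, `fix_fragment`, `RENAME` and `VOID` are hypothetical and chosen for illustration; this is not the script from this post. It uses a forgiving HTML tokenizer rather than a strict XML parser, on the assumption that the input is not yet well-formed XML.

```python
from html import escape
from html.parser import HTMLParser

RENAME = {'i': 'em', 'b': 'strong'}   # presentational -> semantic tags
VOID = {'br', 'img', 'hr'}            # elements that must be self-closed

class XhtmlFixer(HTMLParser):
    def __init__(self):
        # Keep entity and character references as written in the source.
        super().__init__(convert_charrefs=False)
        self.out = []

    def _open(self, tag, attrs, self_close):
        tag = RENAME.get(tag, tag)
        # Valueless attributes (e.g. "checked") become checked="checked".
        attrs = [(k, v if v is not None else k) for k, v in attrs]
        if tag == 'img' and 'alt' not in dict(attrs):
            attrs.append(('alt', ''))
        text = ''.join(' {}="{}"'.format(k, escape(v, quote=True))
                       for k, v in attrs)
        self.out.append('<{}{}{}>'.format(tag, text, ' /' if self_close else ''))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, tag in VOID)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, True)

    def handle_endtag(self, tag):
        if tag not in VOID:  # drop stray </br> or </img>
            self.out.append('</{}>'.format(RENAME.get(tag, tag)))

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append('&{};'.format(name))

    def handle_charref(self, name):
        self.out.append('&#{};'.format(name))

    def handle_comment(self, data):
        self.out.append('<!--{}-->'.format(data))

def fix_fragment(html):
    parser = XhtmlFixer()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return ''.join(parser.out)
```

Compared with the regular-expression approach, the tokenizer handles attribute quoting and tag case itself, so the fix-up logic only deals with tag names and attribute lists.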

Code qualms aside, I can now proudly say that The Quixotic Engineer is written in valid XHTML 1.0 Transitional (the front page at least; I’m still working on some of the archive posts). Coming up next: XHTML 1.0 Strict, and perhaps some CSS experimentation to spruce the place up a bit.

Tags: HTML · Python

4 Responses to “Fixing Simple XHTML Errors with Python”

Greg Says:
June 11th, 2008 at 5:25 pm
Thanks for that link on how to embed YouTube videos as valid XHTML. I wrote my own blog too, both the front end and the backend code. My most frequent offender for invalid syntax is the damn ampersand, but I’ve also avoided embedding YouTube videos because of that embed tag.

Matthew Gallant Says:
June 11th, 2008 at 6:16 pm
You’re welcome! That link was actually sent to me by my friend Renaud, who I probably should have thanked here. Ditto Malini for looking over my Python.

Tom Clancy Says:
June 13th, 2008 at 12:04 pm
You may already know this, but no Python + XHTML developer should be without Beautiful Soup.

Matthew Gallant Says:
June 13th, 2008 at 6:43 pm
So it parses the XML and generates a tree? That’s pretty handy, thanks for the tip, Tom.
