fhwang.net

Justifying Dauxite

May 6, 2004

<< Hey man, don't bogart that content | XSLT considered harmful >>

Dauxite is an XML-centric program: All the content filters take XML as input, and they pass on XML to the next content filter. Early on I decided to think of the site as a collection of documents as opposed to the result of logical code, so Dauxite turns the pages into documents as soon as possible and then runs them through various transformations to reach their final state. (The name Dauxite is a pun on "Doc Site", though Google says it's also some sort of metal.)

There are a lot of Ruby programmers who avoid XML whenever possible. Many of us use it anyway, and we certainly have a good standard XML library in REXML, but XML's philosophy goes against the tendencies of most Ruby programmers. Mostly we prefer the local and the concise, where XML is universal and verbose. It's telling that Ruby is (to my knowledge) the only even marginally mainstream language to include YAML processing in its standard distribution.

Sometimes I like XML, because sometimes I like standards. XML is good for emitting data to be read by a wide range of clients, since there are lots of tools for defining standards and then validating documents against those standards. For example, Dauxite validates every XHTML page with xmllint before uploading it. Life's too short to spend time chasing missing </p> tags.
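For the curious, a pre-upload check along these lines might look roughly like the sketch below. This is not Dauxite's actual code: the method name valid_xhtml? is made up for illustration, and it assumes xmllint is on the PATH and that the DTD named in each page's DOCTYPE can be resolved.

```ruby
# Hypothetical helper, not taken from Dauxite itself.
# Returns true only if xmllint runs and exits successfully.
def valid_xhtml?(path)
  # --noout suppresses the re-serialized document.
  # --valid validates against the DTD named in the page's DOCTYPE.
  # system returns nil when xmllint is not installed, so compare to true.
  system("xmllint", "--noout", "--valid", path,
         out: File::NULL, err: File::NULL) == true
end
```

The upload step can then simply skip, or complain about, any page for which valid_xhtml? returns false.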

And since XML is the focal point of many big standardization efforts, data encoded in it has a certain longevity, which is appealing to somebody with my track record. If, ten years from now, I rewrite my publishing engine in Java 14 or SmallRubyThon or C+++++, the odds are good that whatever language I use will have a good XML parser.

For some pages, I tried to keep data in intermediate formats like RSS and Docbook for as long as possible, saving the XHTML transformations for the end. But life isn't perfect, so neither are these solutions. For example, I should be able to simply use the RSS element <pubDate> to get the date string for index.html, but using XSLT to turn "Tue, 21 Oct 2003 00:00:00 GMT" into "October 21, 2003" is painful, so instead I pass along an extra element called <dauxite:date_string>. This will validate as RSS, but it's still sort of a cheat. Obviously there are limits to how useful it is to leverage other people's formats.
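To show why the extra element is tempting, here is a minimal sketch of the same conversion in Ruby, where it is a one-liner. The method name date_string_for is hypothetical, and this only illustrates how a date string could be produced before it gets passed along; it isn't a description of how Dauxite actually does it.

```ruby
require "time"

# Hypothetical helper: turn an RSS <pubDate> value (RFC 822 / RFC 2822 style)
# into a human-friendly date string.
def date_string_for(pub_date)
  # Time.rfc2822 parses strings like "Tue, 21 Oct 2003 00:00:00 GMT".
  # %B is the full month name, %-d the day without zero padding.
  Time.rfc2822(pub_date).strftime("%B %-d, %Y")
end

puts date_string_for("Tue, 21 Oct 2003 00:00:00 GMT")  # prints "October 21, 2003"
```

The date is formatted in whatever time zone the feed entry itself declares, so an entry stamped just after midnight GMT stays on that day.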

Dauxite uses four well-established XML syntaxes: XHTML, RSS, Docbook, and XSLT. (If I could've found a good standard for describing a site, I would've used that too.) XHTML is obviously useful and simple, though I imagine that its focus on being a simple presentation-centric markup will make it expire faster than other formats. RSS is supported by a million programs, but eventually the stench of the <description> will be enough to make me switch to Atom. Docbook is sort of its own universe, and I find its massive tag-set both daunting and intriguing. I do feel good about the content that is coded in Docbook; with any luck I won't have to touch that stuff 'til I'm killing time in the nursing home.
