fhwang.net

Justifying Dauxite

May 6, 2004

Introduction

So I did it again: I rewrote my website's publishing engine. This decision warrants a little justification, because even though Ruby—my language of choice these days—is still a relatively small language, its community is big and creative enough to have spawned more than one widely used web publishing library. Still, I had some thoughts on the problem that don't seem to be reflected in the existing libraries, so I wrote Dauxite to see if those thoughts made any sense.

Seeing both sides of the problem

fhwang.net has been up since 2000, and it's been software-generated from the start. Many of its pages lend themselves well to automation. One large part of the site contains pointers to more than 100 articles that I've written for magazines and newspapers, and these pointers are organized by category, date, and my recommendations for readers. As the site continued to grow I added a blog on the front page, and although initially it was only updated to reflect changes in the site or my professional life, I have plans to start using it more, which will hopefully cut down on the number of times I end up ranting to my friends about gender politics or urban planning at a bar at two in the morning.

The first program was written in Perl, because that was my favorite language at the time. The second was in Java, because I realized that I am not enough of a l33t h4x0r to maintain my own Perl code. In addition, I was starting to read the various XP bibles and I wanted to start using unit tests. The third, named simply Publisher, was in Ruby because I realized (along with a lot of other people) that once you use lots of unit testing, Java's static typing causes more pain than relief. (In my case, that's both figuratively and literally: My RSI is bad enough that I notice it more when using something verbose like Java. Though I can't explain why my prose is still so damned long.)

Many of the pages on fhwang.net are fairly static, and in Publisher, handling those was easy enough. But for pages that had any sort of dynamic behavior, I used eRuby. Since then, I've gained a lot more experience writing dynamic HTML pages both in my freelance work and at Rhizome. Rhizome is at times a densely dynamic site, with certain pages pulling from seven or eight database tables at once, and with the page contents differing drastically depending on what resource you're looking at, who you are, and (literally) what day of the week it is.

From that experience I gradually came to the conclusion that eRuby—along with other templating languages—is an abysmal way to generate HTML if that generation is at all complex. Of course, eRuby is better than PHP or ASP, since the language you use between those <% %> isn't entirely moronic, but all such implementations share some common drawbacks. When you use eRuby you're treating markup as if it were raw text, which is like doing arithmetic with bitmasks. You keep telling yourself you'll keep your presentation separate from your logic, but it's too easy to add logic to the eRuby and too hard to move it out. And given that eRuby resists the one-responsibility organizational principle of OO code; that it will spew its contents all over stdout unless you strap it up tighter than a girl in a bondage video; and that when you finally do retrieve those results, the six lines you care about are buried in miles of HTML—given all that, does anybody unit test the stuff? I sure don't.
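
To make that complaint concrete, here's a made-up eRuby fragment of the sort I'm describing. The data is hard-coded so the fragment stands alone, but you can see how naturally the logic and the markup tangle together, and how little encourages you to untangle them:

<%
  # In real life these would come out of a database; hard-coded here so the
  # fragment stands alone.
  entries = [ { 'title' => 'First post',  'body' => 'Hello there.' },
              { 'title' => 'Second post', 'body' => 'Still here.' } ]
%>
<div id="blog">
<% entries.each do |entry| %>
  <h3><%= entry['title'] %></h3>
  <% if entry['body'].size > 100 %>
    <p><%= entry['body'][0..99] %> ...</p>
  <% else %>
    <p><%= entry['body'] %></p>
  <% end %>
<% end %>
</div>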

Over on the smart and friendly ruby-talk list, a thread on HTML generation pops up at least once a month, and there are always a few people who will heartily endorse Amrita. I had looked into Amrita before, but we didn't get off to a good start. For starters, I couldn't even get it to compile on one of the machines I was trying to use it on—though that's mostly my fault since when it comes to compiling and installing I'm a little incompetent and a lot impatient. But even when Amrita did compile, it had some default behaviors that I found surprising and off-putting: Not working well with XHTML, for one, and then hijacking the "id" attribute for its own uses even though that attribute comes up a lot in a number of XML syntaxes, including XHTML itself.

Maybe that stuff isn't important, and can simply be overridden once you know what settings to tweak. But these roadblocks were enough to give me pause, and when I started seriously browsing through the docs I started to think Amrita wasn't what I wanted. It seemed to be geared towards geek-centric sites with minimal design. (For one thing, I can't figure out how you're supposed to handle highly heterogeneous mixed content.) Now, I spend a lot of time reading those minimally designed sites, but part of my rent every month is paid by work I do for online business publications or cosmetics companies, and they don't care one bit about the semantic structure of HTML. They just want the page to look good in Internet Explorer, and they reasonably expect that if they email you a new design without changing what data the page depends on, you can implement it quickly. The closer that design implementation is tied to rich data structures, the slower you'll be in responding to design changes.

Templating code like eRuby and procedural code like Amrita each offer lopsided views of the same problem. By focusing on the text output first, templating code appeals to the designer who thinks that appearance is important and that functionality is trivial. By focusing on data-driven behavior first, procedural code appeals to the programmer who thinks that functionality is important and that design is trivial.

But what if I think both are important? Sometimes I want to be a design-patterns-agile-methodologies geek, and when I'm in that mode I want to write in Ruby, which is a great way to express relations among entities. Sometimes I want to be a messy-haired design snob, and when I'm in that mode I want to write in something that looks like XHTML, so I can focus on the surface. Why can't I have both?

Detailing a site structure

So I wrote Dauxite. When I started I had a couple of design principles in mind:

  1. Logic and presentation should be separate in a way that pleases both my inner OO zealot and my inner metrosexual.
  2. When processing page content, Dauxite should spend most of its energy on creating and transforming XML, as opposed to manipulating in-memory data structures that get turned into XHTML at the last minute.

Since Dauxite is mostly intended for my own use, I allowed myself some major limitations:

  1. The program will work by pre-publishing static pages, so it can be slow.
  2. The site has no CGI, so there's no need to worry about responding to user input.

(Currently I have no plans to release Dauxite, 'cause it's still sort of messy, I have my hands full maintaining Lafcadio, and I'm not crazy about the idea of releasing unmaintained code.)

As with previous attempts, Dauxite starts by loading a single XML file which describes every resource on the site. The actual site contains about 100 individual pages, so what's below is just a sample of the file, which is called site.xml.

<site>
  <directory name="">
    <index_html name="index">
      <rss_retriever domain_class="BlogEntry" count="10" />
      <renderer />
    </index_html>
    <directory name="rss">
      <rss name="latest">
        <rss_retriever domain_class="BlogEntry" count="10" />
        <renderer name="rss_strip" />
      </rss>
    </directory>
    <file_html name="bio" content_parent="index" />
    <directory name="art">
      <file_html name="art" content_parent="index" file_name="index" />
      <file_html name="analog" content_parent="art" />
      <file_html name="a1" content_parent="analog" />
      <file_html name="a2" content_parent="analog" />
      <file_html name="a3" content_parent="analog" />
      <plain_text name="mokovaDump"><file_input /></plain_text>
    </directory>
  </directory>
</site>

You can add new pages to the site by adding elements like <rss> or <file_html> to site.xml. These elements are represented in Ruby as Nodes, which are responsible for knowing things like their file's path on the web server, and what their content parents are. (Taken together, all the content_parent attributes describe a hierarchy which is used when generating each page's breadcrumb.)
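
In rough terms, a Node has to know two things: where its file ends up on the server, and where it sits in the content hierarchy. Here's a sketch, with names that are illustrative rather than Dauxite's actual code:

class Node
  attr_reader :name, :content_parent

  def initialize( name, dir, content_parent = nil )
    @name = name
    @dir = dir                        # the enclosing <directory>, e.g. "/art"
    @content_parent = content_parent  # another Node, or nil for the top page
  end

  # Where the generated file ends up on the web server.
  def server_path
    "#{ @dir }/#{ @name }.html"
  end

  # Climb the content_parent attributes to build the breadcrumb trail,
  # top page first.
  def breadcrumb
    trail = []
    node = self
    while node
      trail.unshift( node )
      node = node.content_parent
    end
    trail
  end
end

index = Node.new( 'index', '' )
bio   = Node.new( 'bio', '', index )
puts bio.breadcrumb.collect { |n| n.name }.join( ' > ' )   # => index > bio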

Some of the node elements contain extra elements like <rss_retriever> or <file_input>. These are represented in Ruby as ContentFilters, which are the building blocks of HTML generation in Dauxite. To render its content, a node chains its content filters end-to-end. Each content filter receives XML input from the previous content filter (except for the first, which has to start from scratch), manipulates the XML, and then passes it along.
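
A minimal sketch of the chaining idea, with toy filters standing in for the real ones: each filter takes the previous filter's XML (nil for the first) and returns new XML.

require 'rexml/document'

class StartFromScratch
  # The first filter in a chain gets nil and has to generate its own XML.
  def filter( xml_in )
    '<page_content><title>hello</title></page_content>'
  end
end

class UpcaseTitles
  # A later filter manipulates whatever it's handed.
  def filter( xml_in )
    doc = REXML::Document.new( xml_in )
    doc.elements.each( '//title' ) { |title| title.text = title.text.upcase }
    doc.to_s
  end
end

def render( content_filters )
  content_filters.inject( nil ) { |xml, filter| filter.filter( xml ) }
end

puts render( [ StartFromScratch.new, UpcaseTitles.new ] )
# => <page_content><title>HELLO</title></page_content>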

site.xml actually hides a lot of its content filter data, because many of Node's subclasses pre-define common combinations of nodes and content filters. For example,

<file_html name="a3" content_parent="analog" />

is the same as

<html name="a3" content_parent="analog"><file_input /></html>
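
In other words, a subclass can supply the filters that every page of its type needs. A sketch of what that might look like; the class names follow the article, but the method is a guess:

class ContentFilter; end
class FileInput   < ContentFilter; end
class PageWrapper < ContentFilter; end
class SiteWrapper < ContentFilter; end

class Html
  # Every HTML page gets the breadcrumb and the site template.
  def default_filters; [ PageWrapper.new, SiteWrapper.new ]; end
end

class FileHtml < Html
  # <file_html/> additionally reads its content off the disk, which is why
  # the element doesn't need an explicit <file_input/> child in site.xml.
  def default_filters; [ FileInput.new ] + super; end
end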

site.xml is a little messy, because it isn't screamingly obvious which XML elements are Nodes and which are ContentFilters. (Content filter elements can't contain other elements, but directories can, and directories are nodes.) If this format were designed by a standards committee it might look like

<node type="directory" name="rss">
  <node type="rss" name="latest">
    <content_filter type="rss_retriever" domain_class="BlogEntry" count="10" />
    <content_filter type="renderer" name="rss_strip" />
  </node>
</node>

but I figure, hey, we're all adults here.

Hey man, don't bogart that content

The content filter is an application of the Decorator pattern in the Gang of Four book, also called Pipelines in a recent Pragmatic Programmers article. It's a powerful pattern, but I didn't realize just how powerful until I implemented Dauxite.

Let's look at an extended example, using the page http://fhwang.net/bio.html, which contains a short biographical description. (And a photo that's really out-of-date. Software can't do everything.) In site.xml the relevant XML is:

<site>
  <directory name="">
    <index_html name="index">
      <rss_retriever domain_class="BlogEntry" count="10" />
      <renderer />
    </index_html>
    <file_html name="bio" content_parent="index" />
  </directory>
</site>

In Ruby, bio.html is an instance of FileHtml. Each instance of FileHtml contains these content filters:

1. FileInput —> 2. PageWrapper —> 3. SiteWrapper

  1. FileInput reads a file containing XHTML and passes that file's contents downstream. This is implemented in Ruby with a simple file reading procedure.
  2. PageWrapper slaps an XHTML breadcrumb on top. It handles the work of climbing up the content hierarchy using the content_parent attribute of each node; in the case of bio.html, its parent is index.html, which has no parent. This is implemented in Ruby.
  3. SiteWrapper wraps its input in a mostly static template which contains site-wide information such as the headers, graphic at the top of the page, and nav section on the right side. This simple step is implemented in XSLT using xsltproc.
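
A sketch of the first and last links in that chain, following the descriptions above: FileInput just reads a file off the disk, and SiteWrapper shells out to xsltproc. The class names come from the article; the file and stylesheet names, and the #filter method itself, are invented.

class FileInput
  def initialize( source_path )
    @source_path = source_path
  end

  # Ignores its (nonexistent) input and reads the page's XHTML fragment off
  # the disk.
  def filter( xml_in )
    File.read( @source_path )
  end
end

class SiteWrapper
  # Wraps the page in the site-wide template by running it through xsltproc.
  def filter( xml_in )
    File.open( 'tmp_page.xml', 'w' ) { |f| f.write( xml_in ) }
    `xsltproc site_wrapper.xsl tmp_page.xml`
  end
end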

Looking up the content hierarchy, we see that bio's parent is http://fhwang.net/index.html, which contains the 10 most recent blog entries. When this Node is instantiated, it contains a different sequence of content filters:

1. RssRetriever —> 2. Renderer —> 3. SiteWrapper

  1. RssRetriever pulls rows out of a given MySQL table and uses them to generate an RSS file. In this case, it will pull out the 10 most recent rows in the BlogEntries table. RssRetriever is implemented in Ruby.
  2. Renderer runs the input through an XSLT file, which in this case is index.xsl. index.xsl loops through the RSS and creates one chunk of XHTML for each <item>.
  3. As described above, SiteWrapper wraps up the whole thing in site-wide information.
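
Renderer and SiteWrapper are just XSLT runs, but RssRetriever builds its RSS by hand. Here's a sketch of that half of its job using REXML; the rows are hard-coded to keep the example self-contained (in Dauxite they come out of MySQL), and the field names are made up.

require 'rexml/document'

rows = [ { 'title' => 'Justifying Dauxite',      'createdAt' => Time.gm( 2004, 5, 6 ) },
         { 'title' => 'Some earlier blog entry', 'createdAt' => Time.gm( 2004, 4, 1 ) } ]

doc = REXML::Document.new
doc << REXML::XMLDecl.new
channel = doc.add_element( 'rss', 'version' => '2.0' ).add_element( 'channel' )
channel.add_element( 'title' ).text = 'fhwang.net'
rows.each do |row|
  item = channel.add_element( 'item' )
  item.add_element( 'title' ).text = row['title']
  item.add_element( 'pubDate' ).text =
      row['createdAt'].strftime( '%a, %d %b %Y %H:%M:%S GMT' )
end
puts doc.to_s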

This approach has some intriguing consequences:

Decomposing a problem along a new axis can shed light on the domain in ways that lead to other solutions. Halfway through programming Dauxite, I decided I needed a caching mechanism because generating the entire tree was becoming time-consuming. The solution was to have each Node ask its content filters whether any of them relies on input that has changed, and therefore needs the node's page regenerated. FileInput, for example, compares the mtime of its data file to the mtime of the last generated copy of the Node's contents. RssRetriever uses the time of the most recently modified blog entry. SiteWrapper never triggers a regeneration on its own.

This solution asks: What information does our result depend upon, and how do we know when that information has changed? Such a solution would've been nearly impossible to see if all the generation code was mashed together in one eRuby file or one Amrita method call. Decomposing the generation into a chain of individual content filters made it easy for me to see the nature of the problem, and the solution practically wrote itself.
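
That check is easy to express once the filters are separate objects. A sketch of it, with method names that are guesses at the shape of the solution rather than Dauxite's actual API:

class FileInput
  def initialize( source_path, output_path )
    @source_path = source_path   # the XHTML fragment this filter reads
    @output_path = output_path   # the last generated copy of the node's page
  end

  # The page is stale if the source file is newer than the generated copy.
  def needs_update?
    return true unless File.exist?( @output_path )
    File.mtime( @source_path ) > File.mtime( @output_path )
  end
end

class SiteWrapper
  # The site template never triggers a regeneration on its own.
  def needs_update?
    false
  end
end

class Node
  def initialize( content_filters )
    @content_filters = content_filters
  end

  # Regenerate only if some filter says its input has changed.
  def needs_update?
    @content_filters.any? { |cf| cf.needs_update? }
  end
end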

XML, and what it's good for

Dauxite is an XML-centric program: All the content filters take XML as input, and they pass on XML to the next content filter. Early on I decided to think of the site as a collection of documents as opposed to the result of logical code, so Dauxite turns the pages into documents as soon as possible and then runs them through various transformations to reach their final state. (The name Dauxite is a pun on "Doc Site", though Google says it's also some sort of metal.)

There are a lot of Ruby programmers who avoid XML whenever possible. Many of us use it anyway, and we certainly have a good standard XML library in REXML, but XML's philosophy goes against the tendencies of most Ruby programmers. Mostly we prefer the local and the concise, where XML is universal and verbose. It's telling that Ruby is (to my knowledge) the only even marginally mainstream language to include YAML processing in its standard distribution.

Sometimes I like XML, because sometimes I like standards. XML is good for emitting data to be read by a wide range of clients, since there are lots of tools for defining standards and then validating documents against those standards. For example, Dauxite validates every XHTML page with xmllint before uploading it. Life's too short to spend time chasing missing </p> tags.
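
Roughly what that pre-upload check might look like; the output directory here is invented:

Dir.glob( 'output/**/*.html' ).each do |path|
  # --valid validates against the DTD the page declares; --noout keeps
  # xmllint from echoing the document back.
  system( 'xmllint', '--valid', '--noout', path ) or
      raise "#{ path } failed validation"
end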

And since XML is the focal point of many big standardization efforts, data encoded in it has a certain longevity, which is appealing to somebody with my track record. If, ten years from now, I rewrite my publishing engine in Java 14 or SmallRubyThon or C+++++, the odds are good that whatever language I use will have a good XML parser.

For some pages, I tried to keep data in intermediate formats like RSS and Docbook for as long as possible, saving the XHTML transformations for the end. But life isn't perfect, so neither are these solutions. For example, I should be able to simply use the RSS element <pubDate> to get the date string for index.html, but using XSLT to turn "Tue, 21 Oct 2003 00:00:00 GMT" into "October 21, 2003" is painful, so instead I pass along an extra element called <dauxite:date_string>. This will validate as RSS, but it's still sort of a cheat. Obviously there are limits to how useful it is to leverage other people's formats.
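
The workaround looks something like this sketch: format the date in Ruby before the XSLT ever sees it, and tack it onto each <item> in an extra element. The dauxite:date_string name comes from the article; the namespace URI and everything else here is invented.

require 'rexml/document'
require 'time'

def add_date_strings( rss_xml )
  doc = REXML::Document.new( rss_xml )
  doc.root.add_namespace( 'dauxite', 'http://fhwang.net/dauxite' )
  doc.get_elements( '//item' ).each do |item|
    pub_date = Time.parse( item.elements['pubDate'].text )
    # Turn "Tue, 21 Oct 2003 00:00:00 GMT" into "October 21, 2003",
    # stripping any leading zero from the day.
    item.add_element( 'dauxite:date_string' ).text =
        pub_date.strftime( '%B %d, %Y' ).sub( / 0(\d,)/, ' \1' )
  end
  doc.to_s
end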

Dauxite uses four well-established XML syntaxes: XHTML, RSS, Docbook, and XSLT. (If I could've found a good standard for describing a site, I would've used that too.) XHTML is obviously useful and simple, though I imagine that its focus on being a simple presentation-centric markup will make it expire faster than other formats. RSS is supported by a million programs, but eventually the stench of the <description> will be enough to make me switch to Atom. Docbook is sort of its own universe, and I find its massive tag-set both daunting and intriguing. I do feel good about the content that is coded in Docbook; with any luck I won't have to touch that stuff 'til I'm killing time in the nursing home.

XSLT considered harmful

And then there's XSLT. XSLT is profoundly useful for a narrow set of problems. Unfortunately, it has ambitions of being a world-class multipurpose language, and if you indulge that ambition it will only repay you with treachery. XSLT can do a lot of things, but most of them can only be done in a way that's convoluted and verbose. But, hey, it's XML, and your output's XML, so this must be the right solution! Right. We've seen this delusion before, only that time it was called ColdFusion.

To be fair, let's start with where XSLT works. The SiteWrapper content filter uses an XSLT stylesheet that looks like:

<xsl:template match="/xhtml:page_content">
  <body>
  <p id="page_head">
    <a href="/"><img src="/img/home.png" alt="fhwang.net" width="223"
                     height="35"/></a>
    <br /><em>Francis Hwang&apos;s site</em>
  </p>
  <hr />
  <div id="page_content">
    <xsl:copy-of select="node()" />
  </div>
  <div id="site_nav">
    <ul>
      <li><a href="/blog/">Archives</a></li>
      <li>
        <a href="/rss/latest.xml"><img src="/img/rss-blue.png" width="36"
                                       height="14" alt="rss" /></a>
      </li>
    </ul>
    <ul>
      <li><a href="/bio.html">About me</a></li>
      <li><a href="/art/">Art</a></li>
      <li><a href="/writing/">Writing</a></li>
      <li><a href="mailto:sera@fhwang.net">Contact me</a></li>
    </ul>
  </div>
  </body>
</xsl:template>

This simply says "Find the elements inside the <xhtml:page_content> element, and drop them in the middle of all this navigational stuff." This content filter is focused more on presentation than on logic, so you'd rather have a format that makes the design more apparent, and easier to change.

(And yes, I'm cheating, because <page_content> isn't an element in XHTML. After the SiteWrapper runs we'll have stripped out that element, and we'll be validating the results before uploading them anyway. But when you do this sort of stuff you really should grok your XML namespaces.)

XSLT works well at wrapping elements in static text. And it can handle fairly simple conditions and loops. index.xsl, for example, loops through an RSS file, emitting a chunk of XHTML for each RSS <item>.

Unfortunately, XSLT can't go much further than that. As mentioned above, it's quite inept at the common task of date formatting. And it failed me completely on turning Docbook into XHTML. In Docbook, footnotes are embedded in the text they're linked to, so to process them into XHTML I had to:

  1. Save the footnote text in some sort of collection.
  2. Keep a running count of what footnote you're currently processing.
  3. Drop in a numbered link to the footnote that you'll be inserting at the bottom of the page.
  4. When you get to the bottom, dump all your footnotes.

This isn't a difficult problem, but something about the structure of XSLT seems to make it really awkward to handle stateful variables. All I know is that I went online looking for XSLT solutions to this problem—Docbook being, again, a big format with lots of users—and all I found were 100-line chunks of code that made me wonder if maybe Ted Kaczynski wasn't right in some way.

So I implemented this in Ruby instead. It would've been great to use XSLT, especially since much of the work is simple-minded element mappings: turn <citation> into <cite>, <ulink> into <a>, etc. But XSLT simply wasn't the right tool for the job. It's not the right tool for most jobs.
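
For what it's worth, here's a sketch of that footnote shuffle in Ruby with REXML, simplified down to footnotes that consist of a single plain paragraph. The Docbook and XHTML element names are real; the helper itself is illustrative.

require 'rexml/document'

def move_footnotes_to_bottom( docbook_xml )
  doc = REXML::Document.new( docbook_xml )
  footnotes = []
  doc.get_elements( '//footnote' ).each do |fn|
    # (Assuming each footnote is a single plain paragraph.)
    footnotes << fn.elements['para'].text
    number = footnotes.size
    # Replace the embedded footnote with a numbered link to the bottom of
    # the page.
    link = REXML::Element.new( 'a' )
    link.add_attribute( 'href', "#fn#{ number }" )
    link.text = "[#{ number }]"
    fn.parent.replace_child( fn, link )
  end
  # When you get to the bottom, dump all your footnotes.
  footnotes.each_with_index do |text, i|
    p = doc.root.add_element( 'p', 'id' => "fn#{ i + 1 }" )
    p.text = "#{ i + 1 }. #{ text }"
  end
  doc.to_s
end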

Right now a lot of people are trying to use XSLT as a general-purpose language—apparently it's Turing-complete, whoop-di-doo—and they're spending a lot of money on those fat computer books down at the Barnes & Noble and opening another can of Mountain Dew every time they start to nod off from all those angle brackets. Ill-advised programming trends are nothing new, of course. My main hope is that when people get disgusted with XSLT they won't discard it outright, because for a narrow set of problems it's absolutely great. A language doesn't have to be general purpose to be useful. You can't use regular expressions to connect to a database, but you probably don't want to use anything else for processing raw text.

The final result

In between my day job, freelance work, art-world paperwork, Lafcadio maintenance, and karaoke, the first pass at Dauxite took about 3 months to code. It's about 6000 lines, and it's not done by a long shot, but it's to the point where I can stop thinking about it for a while. (Knock on wood.)

It has some intriguing ideas in it, but if it were expanded to a more demanding site it would need a lot more work. For one thing, it's slow as molasses: render times on most pages can be measured in seconds. Off the top of my head I suspect that this is because each content filter re-parses and re-serializes its XML, which suggests some good avenues of optimization, but of course it's quite possible that Dauxite's design is just inherently slower than other approaches.

A bigger problem is that Dauxite does not support forms or user interactivity. Model-View-Controller is a lot simpler if you leave out the Controller. I'm not certain how Dauxite would handle a highly dynamic page that contained database-driven content, forms, and designer-configurable text, though I may eventually give it a try.

One thing to note is that having all these content filters promiscuously passing XML among themselves sounds a lot like web services, only without the web. So what would happen if you added the web? Would it ever make sense to have content filters living on different servers?

And what about that site.xml file? If I made that a public resource to be consumed by others, it could be useful. Mostly it just mirrors the site, but it would offer a much clearer way to see some of the site's structure. For example, generating a site map would be easy. (The content hierarchy is not the same thing as the file system hierarchy.) Is that the sort of thing people would ever want to do to other people's websites? Maybe I need to write some sort of standard first ...