So far I've been using Template Toolkit for generating output in Daizu, but there's a problem with that approach: the templating language doesn't understand anything about the structure of the XML or HTML output it's producing. I'm not picking on TT; it's the best and probably the most popular templating language for Perl. My problem with using it is that it always generates plain text. Of course you can arrange for the output to be correct XHTML or whatever, but there's very little support for doing that.
To see why this is a problem, look at this simple TT example:
<p>Here's a link generated by some template code: <a href="/example/?get=article&id=[% article %]"> [% title %] </a> </p>
There are several problems with this code. For one thing, I forgot to
escape an & character in the link URL, which makes the
HTML invalid. The problem won't show up when I look at the output of the
template in a browser. There's no way for TT to know that this is an error,
because it might be producing output in some format where it's not a problem
to have ampersands appearing without any special treatment. A templating
language which knew it was generating XML or HTML though would be able to
detect the problem at compile time, so you wouldn't even have to wait until
the buggy code was invoked.
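As a rough illustration of what a tool that understands XML can catch, here is a minimal sketch in Python (used only as a runnable stand-in, not Perl or TT). It feeds the rendered output of the template above to an XML parser. The substituted values (`42` and `Some title`) and the helper name `well_formedness_error` are made up for the example, and it checks the output after rendering rather than at template compile time, so it shows the principle rather than the ideal.

```python
import xml.etree.ElementTree as ET

# Output of the template above after substituting made-up values
# for the article and title variables.
rendered = (
    "<p>Here's a link generated by some template code: "
    '<a href="/example/?get=article&id=42"> Some title </a> </p>'
)


def well_formedness_error(markup):
    """Return the parser's error message, or None if the markup parses."""
    try:
        ET.fromstring(markup)
    except ET.ParseError as err:
        return str(err)
    return None


# The bare ampersand is rejected by the parser...
print(well_formedness_error(rendered))

# ...while the correctly escaped version parses cleanly (prints None).
print(well_formedness_error(rendered.replace("&id=", "&amp;id=")))
```

A browser would quietly accept the first version, which is exactly why the bug is easy to miss.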
Choosing whether to escape text
Another problem with the code above is that I might have forgotten to escape the values of the two variables that are being interpolated into the output. We can choose one of the following ways to insert the title into our HTML or XHTML output:
[% title %]
[% title |html %]
If the title variable is a plain text string, then we want
the second option, and using the first would be a bug. It will almost
always work correctly, until someone wants it to link to something which has
a < or & character in its title.
That will be quite rare, and browsers will usually be able to understand the
output correctly (but not always), so you won't notice the problem unless
you validate all the pages you produce.
Of course, the first option might also be correct, if the
title variable is meant to contain markup. Perhaps titles are
allowed to contain emphasised text, so a
literal <em> is meant to be passed through unchanged.
In that case it would be a bug if we did escape the title (but
since most titles won't have any markup in them, that would also be an easy bug
to miss).
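The two failure modes can be seen side by side in a small sketch. It is in Python rather than TT, and `html.escape` only approximates what TT's `html` filter does; the two example titles are invented for illustration.

```python
from html import escape

plain_title = "Fish & Chips <cheap>"       # plain text, must be escaped
markup_title = "The <em>best</em> chips"   # already markup, must not be

# Correct for plain text: markup-significant characters become entities.
escaped_plain = escape(plain_title)
print(escaped_plain)    # Fish &amp; Chips &lt;cheap&gt;

# Wrong for markup: the <em> element turns into visible text.
escaped_markup = escape(markup_title)
print(escaped_markup)   # The &lt;em&gt;best&lt;/em&gt; chips
```

Nothing in either string says which treatment it needs, so the template author has to know, and has to get it right every time.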
This is a problem with generating HTML using any templating language that works by splicing strings of text together without regard for the structure of the output. The exact same problems occur with HTML::Template and Mason, for example.
How it should work
So if the output is treated as text, we have to spend a lot of time thinking about whether variables interpolated into the output need to be escaped in a particular case, and checking for bugs when we get it wrong. A templating language specifically designed for XML could avoid these costs, by treating the two different situations like this:
- Variable is plain text
- When you interpolate a string, it is assumed to be plain text, so the right thing conceptually is to simply make the string into a text node in the output infoset. What actually happens is that the compiled template, or some serializing stage in the template processor, will do the appropriate escaping automatically (and also check that the string doesn't contain any characters that can't be represented in XML).
- Variable is a chunk of markup
- We distinguish something with structure in it by parsing it into an XML object model as early as possible, so the templating language sees an XML::LibXML::DocumentFragment DOM object or equivalent. In this case the nodes in the XML fragment are copied to the output, and will get serialized in due course. Of course, the templating engine would have to throw an exception if you tried to insert a structured bit of content into an attribute value.
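The two cases above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard `xml.etree.ElementTree` as a stand-in for a Perl DOM such as XML::LibXML, the helper names `append_value` and `set_attribute` are hypothetical, and the check for characters that can't be represented in XML is left out.

```python
import copy
import xml.etree.ElementTree as ET


def append_value(parent, value):
    """Insert a template value as content of `parent` (hypothetical helper)."""
    if isinstance(value, ET.Element):
        # A chunk of markup: copy its nodes into the output tree.
        parent.append(copy.deepcopy(value))
    elif isinstance(value, str):
        # Plain text: store it as text; the serializer escapes it later.
        children = list(parent)
        if children:
            children[-1].tail = (children[-1].tail or "") + value
        else:
            parent.text = (parent.text or "") + value
    else:
        raise TypeError("expected a string or an element")


def set_attribute(element, name, value):
    """Set an attribute, refusing structured content (hypothetical helper)."""
    if not isinstance(value, str):
        raise TypeError("markup cannot be inserted into an attribute value")
    element.set(name, value)


p = ET.Element("p")
a = ET.SubElement(p, "a")
set_attribute(a, "href", "/example/?get=article&id=42")  # no manual escaping
append_value(a, "Fish & Chips")                          # plain text
append_value(a, ET.fromstring("<em>fresh</em>"))         # parsed markup

result = ET.tostring(p, encoding="unicode")
print(result)
# <p><a href="/example/?get=article&amp;id=42">Fish &amp; Chips<em>fresh</em></a></p>
```

The template author never decides whether to escape: the type of the value decides, and the serializer does the escaping in one place.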
I'm not the first person to say this
Henri Sivonen in HOWTO Avoid Being Called a Bozo When Producing XML:
Text-based Web templating systems (MovableType, WordPress, etc.) and active page technologies that seem to allow you to embed program code in document skeleton (ASP, PHP, JSP, Lasso, Net.Data, etc.) are designed for tag soup. They don't guarantee well-formed XML output. They don't guarantee correct HTML output, either. They seem to work with HTML, because text/html user agents are lenient and try to cope with broken HTML. The most common mistakes involve not escaping markup-significant characters or escaping them twice.
Don't use these systems for producing XML. Making mistakes with them is extremely easy and taking all cases into account is hard. These systems have failed smart people who have actively tried to get things right.
Aristotle Pagaltzis in All XML, all the time:
The tools should be XML all the way from bottom to top. The right way to build something like a CMS is to never stick user input right into the output. Input should either be of the form of something like wiki markup which can always, always be translated to valid XHTML, or if it's tagsoup, should be corrected before storing it.
There should never be any markup-related part in your publishing toolchain just gluing strings together. Ever.