Parsing content as HTML fragments

Parsing the content from files was a bit trickier than I thought. I need to do this so that I can apply various kinds of filtering to the content before the results are stuffed into a TT template. The filter I've been working on first, simply because it will be useful for this blog, is syntax highlighting of code samples with my Text::VimColor module.

So the idea is that I'll have a special non-HTML element in the content which indicates that some text should be marked up specially when the page is generated. My processing removes that element and replaces it with a pre element containing the highlighted code (just text and span elements).

Content in XML

I was originally planning to have my content in HTML 4, since I want to generate that to avoid the mime-type troubles which come with XHTML. That didn't work unfortunately, at least using XML::LibXML as the parser. It does support HTML, but it complains about any elements which aren't recognized. So I've switched to writing my content in XHTML. I'm not validating it, so I can add whatever elements I wish, and there's nothing to prevent using things like MathML in the future. I can do all my processing with XML, and just write it out as HTML 4 when I'm ready to treat it as a simple blob of text.

A nice benefit of this is that I can use XInclude to bring in code samples from other files. So I can do things like this:

<daizu:syntax-highlight filetype="perl">
<xi:include href="foo.pl" parse="text"/>
</daizu:syntax-highlight>

If I want to syntax colour some code I have the choice of using either a separate file or (for smaller code samples) including it directly in the page content. It gets treated the same either way. I will also be able to use the XInclude mechanism in other places if appropriate.

Handling included URIs

To make that work I've had to define a set of callbacks which the libxml XInclude processing uses to resolve URIs. I don't want my relative URI (foo.pl above) to be resolved by looking at the filesystem, because of course it's stored in the database. So I set the base URI of the document I'm parsing to a URI in the daizu scheme, with the path of the document as it appears in the database. The XInclude processor resolves any URIs which don't include a scheme against that, and then I just have to pull out the content from the database when libxml tries to load any URI in the daizu scheme. For now I'm just ignoring any other schemes, but it would probably be a good idea to restrict or disallow them, since they could be a security threat.

One weird thing was that the base_uri method on the XML::LibXML parser doesn't seem to work. In fact it only processes XInclude at all if you give it the filename of an XML file. Since I'm pulling the content out of the database, I have to then write it to a temporary file, and override the base URI with an xml:base attribute.
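
Here's a minimal sketch of how the callbacks and the temporary-file workaround might fit together. It isn't the actual Daizu code: the %content_for hash is a hypothetical stand-in for the database, and the sample path and content are made up for illustration. The callbacks use the XML::LibXML::InputCallback interface (match, open, read, close).

```perl
use strict;
use warnings;
use File::Temp ();
use XML::LibXML;

# Hypothetical stand-in for the database: maps a path to file content (bytes).
my %content_for = (
    '/example.com/foo/foo.pl' => 'print "Hello, world";' . "\n",
);

my $callbacks = XML::LibXML::InputCallback->new;
$callbacks->register_callbacks([
    # match: claim only URIs in the daizu scheme, leave the rest to libxml
    sub { $_[0] =~ m{^daizu:}i ? 1 : 0 },
    # open: return a simple in-memory "handle" for the requested path
    sub {
        my ($uri) = @_;
        # Tolerate both daizu:/path and daizu:///path forms.
        (my $path = $uri) =~ s{^daizu:/*}{/}i;
        die "no such content: $uri\n" unless exists $content_for{$path};
        return { data => $content_for{$path}, pos => 0 };
    },
    # read: hand back at most $length bytes; an empty string signals EOF
    sub {
        my ($handle, $length) = @_;
        my $chunk = substr $handle->{data}, $handle->{pos}, $length;
        $handle->{pos} += length $chunk;
        return $chunk;
    },
    # close: nothing to release for an in-memory handle
    sub { 1 },
]);

# The document to parse, with xml:base pointing into the daizu scheme so
# that the relative href is resolved against the database path.
my $xml = <<'XML';
<body xml:base="daizu:///example.com/foo/bar.html"
      xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xi="http://www.w3.org/2001/XInclude">
<pre><xi:include href="foo.pl" parse="text"/></pre>
</body>
XML

# Write it to a temporary file, since the parser is given a filename.
my $tmp = File::Temp->new(SUFFIX => '.xml');
print {$tmp} $xml;
close $tmp or die "can't write temporary file: $!";

my $parser = XML::LibXML->new;
$parser->expand_xinclude(1);
$parser->input_callbacks($callbacks);
my $doc = $parser->parse_file($tmp->filename);

# The included text should now be part of the document.
print $doc->documentElement->textContent;
```

The match callback is also where other schemes could be refused, by claiming them and then failing in the open callback, although I haven't tried that here.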

Wrapping the content for parsing

For character entity references to work (things like &nbsp;) I'll need to provide a DTD which defines their expansions. I've borrowed the files from the XHTML specification which do this (there are three of them) and wrapped them in a very simple DTD which just includes them. I don't define any validation rules, and validation is turned off anyway when I parse the content. I've put these files alongside my Perl modules, so that I can always provide a file: URL for the DTD.

So to make all this work I need to wrap a root element round the content before I parse it, and put a DOCTYPE declaration on the front. The root element also means that I don't have to put a root element in the content I write, so it can just be a sequence of block-level elements.

Here's what the file ends up like:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body SYSTEM "file:///.../xhtml-entities.dtd">
<body xml:base="daizu:///example.com/foo/bar.html"
      xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      xmlns:daizu="http://www.daizucms.org/ns/html-extension/">
<!-- content fragment goes here -->
</body>

(In reality I'm not putting any linebreaks in the stuff before the content, so as not to affect the line numbering in any libxml errors.)
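
The wrapping itself is just string concatenation. This is a rough sketch rather than the real code: wrap_fragment and escape_attr are names I've made up for the example, and the DTD URL is passed in as-is. Everything before the fragment goes on a single line, so the line numbers libxml reports match the lines of the original content.

```perl
use strict;
use warnings;

# Escape a value for use inside a double-quoted XML attribute.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

# Wrap a content fragment in an XML declaration, DOCTYPE and root element.
# No newline is emitted before the fragment, so its first line stays line 1.
sub wrap_fragment {
    my ($fragment, $base_uri, $dtd_uri) = @_;
    return join '',
        '<?xml version="1.0" encoding="UTF-8"?>',
        # The system identifier is used verbatim (it must not contain '"').
        '<!DOCTYPE body SYSTEM "', $dtd_uri, '">',
        '<body xml:base="', escape_attr($base_uri), '"',
        ' xmlns="http://www.w3.org/1999/xhtml"',
        ' xmlns:xi="http://www.w3.org/2001/XInclude"',
        ' xmlns:daizu="http://www.daizucms.org/ns/html-extension/">',
        $fragment,
        '</body>';
}

my $wrapped = wrap_fragment(
    "<p>First paragraph.</p>\n<p>Second&nbsp;paragraph.</p>\n",
    'daizu:///example.com/foo/bar.html',
    'file:///.../xhtml-entities.dtd',
);
print $wrapped, "\n";
```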

One final thing I've added to the syntax colouring since I first implemented it on my personal blog is that I'm allowing the name of the element into which the markup is placed to be overridden. Usually you'll want it to be pre, but you can use code to highlight very small samples, like this Perl regex: /(?:foo\s+bar)/i
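
To give an idea of the replacement step, here's a sketch which swaps each daizu:syntax-highlight element for an XHTML element built from text and span nodes. It makes a few assumptions: the attribute called element is a hypothetical name for the override, expand_highlighting is a made-up function name, and the highlighter is passed in as a code ref returning [type, text] pairs (the shape returned by the marked method of Text::VimColor), so that the example can run with a stub instead of needing Vim.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $XHTML_NS = 'http://www.w3.org/1999/xhtml';
my $DAIZU_NS = 'http://www.daizucms.org/ns/html-extension/';

# Replace each daizu:syntax-highlight element with an XHTML element holding
# text and span children. $marker takes (code, filetype) and returns an
# array ref of [type, text] pairs; an empty type means unmarked text.
sub expand_highlighting {
    my ($doc, $marker) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(daizu => $DAIZU_NS);

    for my $elem ($xpc->findnodes('//daizu:syntax-highlight')) {
        # 'element' is a hypothetical attribute name for the override.
        my $name = $elem->getAttribute('element') || 'pre';
        my $new  = $doc->createElementNS($XHTML_NS, $name);

        my $pairs = $marker->($elem->textContent,
                              $elem->getAttribute('filetype'));
        for my $pair (@$pairs) {
            my ($type, $text) = @$pair;
            if ($type eq '') {
                $new->appendText($text);
            }
            else {
                my $span = $doc->createElementNS($XHTML_NS, 'span');
                $span->setAttribute(class => "syn$type");
                $span->appendText($text);
                $new->appendChild($span);
            }
        }
        $elem->replaceNode($new);
    }
    return $doc;
}

my $doc = XML::LibXML->load_xml(string => <<'XML');
<body xmlns="http://www.w3.org/1999/xhtml"
      xmlns:daizu="http://www.daizucms.org/ns/html-extension/">
<p>A regex: <daizu:syntax-highlight filetype="perl" element="code">/(?:foo\s+bar)/i</daizu:syntax-highlight></p>
</body>
XML

# Stub highlighter so the sketch runs without Vim. A real one might be:
#   sub { Text::VimColor->new(string => $_[0], filetype => $_[1])->marked }
my $stub = sub { [ [ 'Constant', $_[0] ] ] };

expand_highlighting($doc, $stub);
print $doc->toString;
```

Building the span elements as DOM nodes, rather than pasting in an HTML string, means the highlighted text never needs to be parsed a second time.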
