Parsing the content from files was a bit trickier than I thought. I need to do this to apply various types of filtering to the content before the results are stuffed into a TT template. The one I've been working with, simply because it will be useful for this blog, is having code samples syntax highlighted, with my Text::VimColor module.
So the idea is that I'll have a special non-HTML element in the content
which indicates that some text should be marked up specially when the page
is generated. My processing removes that element and replaces it with a
pre element containing the highlighted code (just text and
span elements).
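
The replacement step can be sketched roughly as follows. The real code is Perl built on XML::LibXML and Text::VimColor; this is only an illustration in Python's standard library. The toy_highlight function is a made-up stand-in for the real highlighter, the "Statement" class name is invented, and the tag parameter is a hypothetical way of choosing the output element.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DAIZU_NS = "http://www.daizucms.org/ns/html-extension/"
HIGHLIGHT_TAG = f"{{{DAIZU_NS}}}syntax-highlight"


def toy_highlight(code):
    """Stand-in highlighter (assumption): yields (css_class or None, text).

    It only marks the word 'print', so that the output contains a span.
    """
    for piece in re.split(r"(\bprint\b)", code):
        if piece:
            yield ("Statement" if piece == "print" else None), piece


def build_highlighted(code, tag="pre"):
    """Build an element holding just text and span children."""
    elem = ET.Element(f"{{{XHTML_NS}}}{tag}")
    last_span = None
    for css_class, text in toy_highlight(code):
        if css_class is not None:
            last_span = ET.SubElement(
                elem, f"{{{XHTML_NS}}}span", {"class": css_class})
            last_span.text = text
        elif last_span is None:
            # Plain text before the first span belongs to the element itself.
            elem.text = (elem.text or "") + text
        else:
            # Plain text after a span is stored as that span's tail.
            last_span.tail = (last_span.tail or "") + text
    return elem


def replace_highlight_elements(root):
    """Swap every daizu:syntax-highlight element for highlighted markup."""
    # ElementTree has no parent pointers, so visit each parent and
    # replace matching children in place, keeping the same position.
    for parent in list(root.iter()):
        for index, child in enumerate(list(parent)):
            if child.tag == HIGHLIGHT_TAG:
                new = build_highlighted("".join(child.itertext()))
                new.tail = child.tail  # keep the text that followed
                parent[index] = new
```

Because the special element is replaced at the same index, the content around it is left untouched.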
Content in XML
I was originally planning to have my content in HTML 4, since that's what I want to generate, to avoid the mime-type troubles which come with XHTML. Unfortunately that didn't work, at least with XML::LibXML as the parser. It does support HTML, but it complains about any elements it doesn't recognize. So I've switched to writing my content in XHTML. I'm not validating it, so I can add whatever elements I wish, and there's nothing to prevent using things like MathML in the future. I can do all my processing with XML, and just write it out as HTML 4 when I'm ready to treat it as a simple glob of text.
A nice benefit of this is that I can use XInclude to bring in code samples from other files. So I can do things like this:
<daizu:syntax-highlight filetype="perl">
<xi:include href="foo.pl" parse="text"/>
</daizu:syntax-highlight>
If I want to syntax colour some code I have the choice of using either a separate file or (for smaller code samples) including it directly in the page content. It gets treated the same either way. I will also be able to use the XInclude mechanism in other places if appropriate.
Handling included URIs
To make that work I've had to define a set of callbacks which the libxml
XInclude processing uses to resolve URIs. I don't want my relative URI
(foo.pl above) to be resolved by looking at the filesystem, because of
course it's stored in the database. So I set the base URI of the document
I'm parsing to a URI in the daizu scheme, with the path of
the document as it appears in the database. The XInclude processor resolves
any URIs which don't include a scheme against that, and then I just have
to pull out the content from the database when libxml tries to load any
URI in the daizu scheme. For now I'm just ignoring any other
schemes, but it would probably be a good idea to restrict or disallow them,
since loading arbitrary URIs is a potential security risk.
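
To make the resolution step concrete, here is a rough sketch using only Python's standard library, not the Perl and libxml callbacks the site actually uses. FAKE_DB is a hypothetical stand-in for the database, and the function names are invented. Unlike what I do at the moment, this loader refuses any scheme other than daizu, which is the stricter behaviour suggested above.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit
from xml.etree import ElementInclude

# Hypothetical stand-in for the database: daizu: URI -> file content.
FAKE_DB = {
    "daizu:///example.com/foo/foo.pl": 'print "Hello, world\\n";\n',
}


def resolve(base_uri, href):
    """Resolve href against base_uri; hrefs with a scheme pass through.

    urllib.parse.urljoin leaves relative references unjoined for schemes
    it doesn't know (such as daizu), so the path is joined by hand here.
    """
    if urlsplit(href).scheme:
        return href
    base = urlsplit(base_uri)
    path = posixpath.normpath(
        posixpath.join(posixpath.dirname(base.path), href))
    return f"{base.scheme}://{base.netloc}{path}"


def make_loader(base_uri):
    """Return an XInclude loader which only serves daizu: URIs."""
    def loader(href, parse, encoding=None):
        uri = resolve(base_uri, href)
        if urlsplit(uri).scheme != "daizu":
            # Assumption: other schemes are rejected outright.
            raise ElementInclude.FatalIncludeError(
                f"refusing to load {uri!r}")
        data = FAKE_DB[uri]  # a real loader would query the database
        return ET.fromstring(data) if parse == "xml" else data
    return loader


def expand_includes(xml_text, base_uri):
    """Parse xml_text and expand its xi:include elements in place."""
    root = ET.fromstring(xml_text)
    ElementInclude.include(root, loader=make_loader(base_uri))
    return root
```

With the base URI set to daizu:///example.com/foo/bar.html, the relative reference foo.pl is looked up as daizu:///example.com/foo/foo.pl, and nothing is read from the filesystem.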
One weird thing was that the base_uri method on the
XML::LibXML parser doesn't seem to work. In fact the parser only processes
XInclude at all if you give it the filename of an XML file. Since I'm
pulling the content out of the database, I then have to write it to
a temporary file, and override the base URI with an xml:base
attribute.
Wrapping the content for parsing
For character entity references to work (things like
&nbsp;) I'll need to provide a DTD which defines
their expansions. I've borrowed the files from the XHTML specification
which do this (there are three of them) and wrapped them in a very simple
DTD which just includes them. I don't define any validation rules, and
validation is turned off anyway when I parse the content. I've put these
files alongside my Perl modules, so that I can always provide a
file: URL for the DTD.
So to make all this work I need to wrap a root element round the
content before I parse it, and put a DOCTYPE declaration
on the front. The root element also means that I don't have to put
a root element in the content I write, so it can just be a sequence
of block-level elements.
Here's what the file ends up like:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE body SYSTEM "file:///.../xhtml-entities.dtd">
<body xml:base="daizu:///example.com/foo/bar.html"
      xmlns="http://www.w3.org/1999/xhtml"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      xmlns:daizu="http://www.daizucms.org/ns/html-extension/">
<!-- content fragment goes here -->
</body>
(In reality I don't put any line breaks in the part before the content, so that the line numbers in any libxml errors still match the lines of the original content.)
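
The wrapping itself is little more than string concatenation. Here is a small sketch of the idea, again in Python rather than the Perl I'm actually using. The function name and its parameters are made up for illustration, and the DTD location is passed in by the caller rather than fixed.

```python
from xml.sax.saxutils import quoteattr


def wrap_fragment(fragment, base_uri, dtd_url):
    """Wrap a content fragment in a DOCTYPE and a body root element.

    Everything before the fragment is kept on a single line, so a parse
    error reported on line N is also on line N of the fragment.
    Assumption: dtd_url contains no double quote characters.
    """
    prefix = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<!DOCTYPE body SYSTEM "{dtd_url}">'
        f'<body xml:base={quoteattr(base_uri)}'
        ' xmlns="http://www.w3.org/1999/xhtml"'
        ' xmlns:xi="http://www.w3.org/2001/XInclude"'
        ' xmlns:daizu="http://www.daizucms.org/ns/html-extension/">'
    )
    return prefix + fragment + "</body>"
```

The fragment can then be any sequence of block-level elements, and the parser still sees a single well-formed document.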
One final thing I've added to the syntax colouring since I first
implemented it on my personal blog
is that I'm allowing the name of the element into which the markup is
placed to be overridden. Usually you'll want it to be pre,
but you can use code to highlight very small samples,
like this Perl regex:
/(?:foo\s+bar)/i