Initial notes on the CMS design

Some notes to help me get a clearer picture of the CMS I'm starting to work on.

I haven't decided on a name for this thing yet. There are a few examples below which use the name ‘foocms’ as a placeholder.

Bullet lists below mostly indicate things I have to do in order for the CMS to work (at least well enough for my own use).

Editing interface

Obviously I'll eventually want a web interface for editing and organizing content, but for now I'm happy to use Vim and svn for that. The only annoyance is that I want to put all the metadata (such as a document's title) into Subversion properties, so that the CMS can keep track of versioned metadata without placing any constraints on the format of the actual content of the files.

To make my life easier, I propose:

  • A program to make it convenient to edit textual content and metadata in properties with $EDITOR.

I'll call it se, short for ‘Subversion edit’, since I'm likely to be running it a lot. For each filename listed, it should make a temporary file containing headers (something like RFC 2822 format) followed by the file's contents, and then run the editor on those files. After the editor quits, it should extract the content, put it back in the right place, and run svn propset to set the right properties.

For now se can just raise an error if it encounters random binary data in a property, but eventually it would be nice if it put that content in a separate file and referenced the file from the headers. The data could be changed either by changing the binary temp file for the property, or changing the filename in the headers.
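As a rough illustration of the temporary file format se might use, here is a sketch in Python (chosen only for brevity; the real tool would presumably be Perl like the rest of the system). The function names and the exact layout (one ‘name: value’ header per property, a blank line, then the content) are assumptions, not a settled design. Multi-line values are simply rejected, in the same spirit as the binary-data case.

```python
import re

# Property names may themselves contain colons (foocms:title), so the
# header name is taken to be everything up to the last colon that comes
# before the first whitespace. This is an assumption of this sketch.
_HEADER_RE = re.compile(r"(\S+):\s*(.*)$")


def build_temp_text(props, content):
    """Render properties as 'name: value' headers, a blank line, then content."""
    lines = []
    for name in sorted(props):
        value = props[name]
        if "\n" in value or "\r" in value:
            raise ValueError(f"property {name!r} has a multi-line value")
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n\n" + content


def parse_temp_text(text):
    """Split edited text back into (props, content)."""
    head, separator, content = text.partition("\n\n")
    if not separator:
        raise ValueError("missing blank line between headers and content")
    props = {}
    for line in head.split("\n"):
        if not line:
            continue
        match = _HEADER_RE.match(line)
        if match is None:
            raise ValueError(f"malformed header line: {line!r}")
        props[match.group(1)] = match.group(2).strip()
    return props, content
```

After the editor exits, the parsed properties would be compared with the originals, so that svn propset only needs to run for those that changed.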

Configuration

A module for finding and reading a configuration file. I've already got code from a previous effort for the ‘finding’ part.

For now the config file needs:

  • Database info (DSN, username, password).
  • Subversion repository URL.
  • Username/password for the publishing daemon (see below). These can be anything, as long as both the daemon and client code can read this file to get them.

Later it would be nice to add Subversion authentication information.

I think an XML file would be best. YAML might be nicer, but I don't see any need for YAML in other parts of the system, and the Perl modules for reading YAML seem not to be very mature. I need XML::LibXML for generating XML content (such as RSS feeds) anyway, so I might as well use that.
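To make the list above concrete, here is one shape the config file could take, with a small reader sketched in Python for illustration only (the real reader would use XML::LibXML). Every element name, attribute name, and value below is a placeholder I've made up, not a decided format.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: element and attribute names are placeholders.
EXAMPLE_CONFIG = """\
<foocms>
  <database dsn="dbi:Pg:dbname=foocms" username="cms" password="secret"/>
  <repository url="file:///var/svn/websites"/>
  <publisher username="publisher" password="also-secret"/>
</foocms>
"""

# Sections the notes say are needed for now, with their required attributes.
REQUIRED = {
    "database": ("dsn", "username", "password"),
    "repository": ("url",),
    "publisher": ("username", "password"),
}


def read_config(xml_text):
    """Parse the config and check that every required value is present."""
    root = ET.fromstring(xml_text)
    config = {}
    for section, attributes in REQUIRED.items():
        element = root.find(section)
        if element is None:
            raise ValueError(f"missing <{section}> element")
        missing = [a for a in attributes if a not in element.attrib]
        if missing:
            raise ValueError(f"<{section}> lacks: {', '.join(missing)}")
        config[section] = {a: element.attrib[a] for a in attributes}
    return config
```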

Authentication

Since it's just me who'll be using this for now, I don't need information about users, passwords, etc. in the database. Scripts which update things in the database can, for now, just access it directly. Later they might be changed to go through the web interface so that users can authenticate.

The preview feature described below can rely on Apache doing the authentication, which makes it simple to ‘open up’ access to specific IP addresses, e.g. for validating HTML. If the CMS did the authentication it might have to provide a special feature for that to work.

Timestamps

It should be easy to find out important time information about URLs. This is needed at least for displaying a publication date on a document's web output, providing the right timestamps in RSS/Atom feeds, and doing sensible cache control on a dynamically served site.

Timestamps required:

  • Publication: by default this should be the time the user indicated that the file should be published, but it can be overridden.
  • Updated: the time at which a new revision of a file was committed.

These should be suitable for use directly as the atom:published and atom:updated times in Atom feeds. If others are needed they can either be manually stored in Subversion properties or calculated from the revision history.

If some scheduling system is used to schedule a file for publication in the future, then the time it is scheduled for is the publication time (not the time at which the user asked for it to be scheduled, or the time it actually goes live). It's also important that the user be able to override the publication time, for example to manually add old content or republish content from another source which has already published it.

  • Modify se to provide a default foocms:published property when editing a new uncommitted file.
  • Keep track of the ‘updated’ timestamp during working copy updates, by setting it to the time at which a revision was committed (the svn:date revision property) for every revision which adds or modifies the file.
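The ‘updated’ bookkeeping could look something like the following sketch (Python, for illustration only). It assumes the revisions are supplied in commit order as pairs of an svn:date value and the list of paths added or modified; that input shape and the function names are my own invention. Deletions are ignored here.

```python
from datetime import datetime, timezone


def parse_svn_date(value):
    """Parse an svn:date value such as '2006-03-05T12:34:56.123456Z'."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def updated_times(revisions):
    """Map each path to the commit time of the latest revision touching it.

    revisions: iterable of (svn_date, changed_paths), oldest first.
    """
    updated = {}
    for svn_date, paths in revisions:
        when = parse_svn_date(svn_date)
        for path in paths:
            updated[path] = when  # later revisions overwrite earlier ones
    return updated


def atom_timestamp(when):
    """Format a timezone-aware datetime for atom:published or atom:updated."""
    return when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
```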

Database working copies

The ability to check out and update a working copy in the database. Without a web interface for editing the content in the working copies, this can be simplified by saying that the content in the database won't be changed except by updates, so merges should never be needed. So the really hard part (handling conflicts) doesn't need to be dealt with yet.

There should be a single working copy, which for now I'll assume is always working copy number 1, which represents the state of the live site(s), and tracks the trunk. For the time being there won't be any way of switching this to a branch, or of switching the live sites over to a different working copy. I can save space and time with other working copies by using the live one as a base and referencing it from other WCs when their content would be the same.
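The idea of using the live working copy as a base can be sketched with plain dictionaries standing in for database tables (Python, for illustration; the class and method names are hypothetical). A secondary working copy stores only the entries that differ from the live one, plus a marker for paths it has deleted.

```python
DELETED = object()  # marks a path that exists in the base but not here


class OverlayWorkingCopy:
    """A working copy which only stores its differences from a base one."""

    def __init__(self, base):
        self.base = base   # path -> content for the live working copy
        self.local = {}    # only the entries that differ from the base

    def put(self, path, content):
        if self.base.get(path) == content:
            self.local.pop(path, None)   # same as base: store nothing
        else:
            self.local[path] = content

    def delete(self, path):
        if path in self.base:
            self.local[path] = DELETED
        else:
            self.local.pop(path, None)

    def get(self, path):
        value = self.local.get(path, self.base.get(path))
        return None if value is DELETED else value
```

Note that this sketch says nothing about what should happen to the overlays when the live working copy itself is updated.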

I do need to be able to preview large changes to websites before making them live, but I can do that by committing everything on a branch and checking out a working copy from the branch to preview on.

  • DB schema for storing content in working copies.
  • A program to do some validation of the DB (whatever can't simply be done with constraints in the database).
  • Perl module for creating and updating a working copy.
  • Way for plugin modules to override normal storage of Subversion properties in the database, such as storing them in a way that can be better indexed. The standard code should do special things for some properties. For example, these should be stored directly in the file table, for easier searching:
    • svn:mime-type
    • foocms:published
    • foocms:title
    • foocms:description
    • foocms:status—currently only ‘retired’ is allowed, indicating the content should be published, but not show up in navigation menus, blog feeds, etc.
  • Script to create, delete, and update a working copy.

Note that because some properties are stored in a special way, and because we need to be able to tell exactly what they are in the repository by looking at the database, it will sometimes be necessary to enforce constraints on them. Certainly if a file has a bad svn:mime-type value, I don't want that bad value appearing in the database, so an update which sets a bad value has to fail. If you fix it in a new commit to the repository and then retry the update, it should be fine. This might also mean that I'll need to place some rather trivial constraints on values, for example forbidding excess whitespace in the svn:mime-type property. I don't want the extra whitespace in the database because it would make it harder to compare mime types, but it would be annoying to fail an update because someone editing content in a svn working copy accidentally put a space at the end of their mime type name. I'll need to give this one some more thought.
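One possible policy, sketched in Python for illustration, is to silently tidy up harmless differences (surrounding whitespace, letter case) and fail only on values which are still malformed afterwards. This is just one option, not a decision; the function name and the rather loose pattern are assumptions, and parameters such as charset are rejected here for simplicity.

```python
import re

# Loose 'type/subtype' check; deliberately does not accept parameters.
_TOKEN = r"[a-z0-9][a-z0-9!#$&^_.+-]*"
_MIME_RE = re.compile(rf"^{_TOKEN}/{_TOKEN}$")


def normalize_mime_type(raw):
    """Return a canonical form of an svn:mime-type value, or raise ValueError."""
    value = raw.strip().lower()   # forgive stray whitespace and case
    if not _MIME_RE.match(value):
        raise ValueError(f"bad svn:mime-type value: {raw!r}")
    return value
```

With this approach the database holds the normalized value, so it no longer matches the repository byte for byte, which is the trade-off that still needs thought.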

There's no standard for how character encoding should be indicated in a Subversion repository. For now I'll just assume content is in UTF-8, which is probably good enough for my use.

Publishing content

To start with I'll assume generated content needs to be saved in some document root directory. Later I'll want to be able to have some content served dynamically, straight from the database.

I've thought about different ways of configuring where content should be published to (which local path, or which server to upload it to). It might be nice to have this information in the database, for easier configuration through the eventual web UI, or in Subversion properties, since different websites are represented by directories within the repository, but I think the best thing for now is just to put it in the XML config file. So there will be a mapping from paths in the repository where the websites live to the location of the document root, and eventually details about how to upload the files to one or more webservers.
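Looking up the document root for a repository path is then a longest-prefix match over that mapping. A minimal sketch in Python (the function name and the example paths in the usage are made up, and the mapping is assumed to have already been read from the config file into a dictionary):

```python
def find_document_root(site_roots, repo_path):
    """Return (document_root, path_within_site), or None if unmapped.

    site_roots maps repository directories to document roots. The most
    specific (longest) matching directory wins.
    """
    best = None
    for prefix, docroot in site_roots.items():
        prefix = prefix.rstrip("/")
        # Match whole path components only, so /a/site does not match /a/sites
        if repo_path == prefix or repo_path.startswith(prefix + "/"):
            if best is None or len(prefix) > len(best[0]):
                best = (prefix, docroot)
    if best is None:
        return None
    prefix, docroot = best
    return docroot, repo_path[len(prefix):].lstrip("/")
```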

Actual generation of output files should be done by Perl classes. To add some custom code for generating content, you'll supply a new Perl class, possibly inheriting useful stuff from a standard one. To decide which classes should handle generating content from a file, look at the properties on that file and its ancestor directories, something like this:

  • foocms:generate, used on a file or directory which generates URLs, containing a list of classes
  • foocms:generate-descendents, used on a directory, containing some kind of mapping from descendent files (specified with globs or regexes) to the classes which should be used for them.
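The lookup could work roughly as follows (a Python sketch, for illustration only). The property value formats are assumptions: foocms:generate is taken to be a whitespace-separated list of class names, and foocms:generate-descendents to hold one ‘glob class...’ entry per line, matched against the path relative to the directory. The nearest ancestor with a matching entry wins.

```python
import posixpath
from fnmatch import fnmatchcase


def generator_classes(path, props_for):
    """Return the list of generator class names for a repository path.

    props_for(path) must return a dict of the properties set on that path.
    """
    own = props_for(path).get("foocms:generate")
    if own is not None:
        return own.split()
    directory = posixpath.dirname(path)
    while True:
        mapping = props_for(directory).get("foocms:generate-descendents", "")
        relative = path[len(directory):].lstrip("/")
        for line in mapping.splitlines():
            fields = line.split()
            if len(fields) >= 2 and fnmatchcase(relative, fields[0]):
                return fields[1:]
        if directory in ("/", ""):
            return []   # reached the top without finding a match
        directory = posixpath.dirname(directory)
```

Note that with fnmatch-style globs a ‘*’ also matches across slashes, so ‘*.html’ on a directory applies to files at any depth beneath it.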

The content-generating classes should have methods which write each output file to some file handle, and which provide information about the URLs they generate. They should also be able to say which other files will need to be regenerated when they're changed in particular ways, so that for example a change of the title of a blog article causes the archive pages and perhaps the front page of the blog and its feeds to be regenerated.

  • Perl module to act as base class for the classes which generate output.
  • Some kind of wrapper around the templating system(s). I'll go for TT to start with. The code which runs the template generation should know how to find the right templates in the working copy (so that templates are also versioned), allowing particular websites or sections of websites to override the more generic templates.
  • DB schema for publishing jobs.
  • Working copy code should, if asked, create a new job for the files changed or added during an update (but it won't do that by default). Doing this during the update means the publishing job is created in the same transaction as the live working copy is updated, which seems like the best thing to me.
  • Perl module for doing a publishing job, or processing the queue of jobs.
  • Publishing daemon program.

The publishing process should keep track of progress so that large jobs can be restarted part way through. So perhaps for each input file store a state, one of:

  • Not yet done anything.
  • Output files generated locally, in a temporary directory.
  • Output files copied to destination server, if we're publishing to a remote server with ssh or something. To start with I can assume everything gets published on the local filesystem, so I don't need this yet.
  • Output files renamed into place for live serving. If this stage gets interrupted part way through, we're in trouble. The live site may show partly published content (some files present but others not). To make this a bit safer the renaming should be done in the order in which the files are generated, so that for example the printer-friendly version of an article can be put in place before the normal version which links to it.
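The final stage might be sketched like this (Python, for illustration). The job structure, a list of items in generation order, each with a stage and its output paths, is hypothetical, and the sketch assumes the temporary directory is on the same filesystem as the document root so that each rename is atomic.

```python
import os
from enum import IntEnum


class Stage(IntEnum):
    PENDING = 0     # not yet done anything
    GENERATED = 1   # output files exist in the temporary directory
    COPIED = 2      # copied to a remote server (unused for local publishing)
    LIVE = 3        # renamed into place


def make_live(job, temp_dir, doc_root):
    """Rename generated files into place, in the order they were generated."""
    for item in job:
        if item["stage"] == Stage.LIVE:
            continue   # finished by an earlier, interrupted run
        if item["stage"] == Stage.PENDING:
            raise RuntimeError("output has not been generated yet")
        for relative in item["outputs"]:
            source = os.path.join(temp_dir, relative)
            target = os.path.join(doc_root, relative)
            if not os.path.exists(source):
                # Assumption: it was renamed before an earlier interruption
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            os.replace(source, target)
        item["stage"] = Stage.LIVE
```

In the real system the stage would be recorded in the database after each item, which is what makes restarting part way through possible.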

Once all files have reached the final state, some post-processing may need to be done, such as sending trackbacks, pingbacks, or pings to Pingomatic. Search engine indexing may fall into this category too, but I don't need support for internal site search engines yet.

Tracking URLs

It's important to keep track of which URLs have been (or will be) created from CMS content. This should make it easy to serve content dynamically, make redirects for changed URLs simple to implement, and allow preview versions of output to have the appropriate URLs adjusted to make links stay within the preview.

It's not unreasonable to have a set of websites generate, say, a million URLs, so we don't want every single working copy to generate its own full set.

The URLs for a file or directory need to be updated (deleted from the database table and if necessary replaced) every time a file is modified by a working copy update from the repository or by being saved by a user (when a web-based editing interface is added). The latter ensures that links will work properly in previews when several different files have uncommitted modifications which change the URLs they generate.
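The delete-and-replace step is easy to do in one transaction. Here is a sketch using SQLite from Python, for illustration only; the table and column names are invented, and this simple version gives every working copy its own rows rather than sharing them with the live one.

```python
import sqlite3


def create_url_table(db):
    # Hypothetical schema: one row per URL generated in a working copy
    db.execute(
        """CREATE TABLE url (
               wc_id     INTEGER NOT NULL,
               file_path TEXT    NOT NULL,
               url       TEXT    NOT NULL,
               PRIMARY KEY (wc_id, url)
           )"""
    )


def replace_urls(db, wc_id, file_path, urls):
    """Replace all URLs recorded for one file in one working copy."""
    with db:   # commits on success, rolls back if anything fails
        db.execute(
            "DELETE FROM url WHERE wc_id = ? AND file_path = ?",
            (wc_id, file_path),
        )
        db.executemany(
            "INSERT INTO url (wc_id, file_path, url) VALUES (?, ?, ?)",
            [(wc_id, file_path, url) for url in urls],
        )
```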

Blogging

  • Perl modules for generating Atom 1 and RSS 2 feeds. It's probably best if modules which generate pure XML content do so by generating SAX events, so that they scale well to large output files.
  • Send trackbacks or pingbacks after files have been published. To do trackbacks I'll need some code to produce a reasonable ‘extract’ from an HTML/XHTML document. By default it should probably pull out the first paragraph.
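Pulling out the first paragraph can be done with a small event-based parser. A sketch in Python using the standard html.parser module (for illustration; the class and function names are made up, and real code would probably also want to limit the length of the extract):

```python
from html.parser import HTMLParser


class FirstParagraph(HTMLParser):
    """Collect the text of the first <p> element in a document."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.inside = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag != "p" or self.done:
            return
        if self.inside:
            self.done = True     # unclosed <p> followed by another one
        else:
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "p" and self.inside:
            self.inside = False
            self.done = True

    def handle_data(self, data):
        if self.inside and not self.done:
            self.parts.append(data)


def first_paragraph(html_text):
    """Return the first paragraph as plain text with whitespace collapsed."""
    parser = FirstParagraph()
    parser.feed(html_text)
    parser.close()
    return " ".join("".join(parser.parts).split())
```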

Previewing content

  • Shell of what will become the web editing interface, which just needs to handle URLs for previews.
  • For that to work well I'll need some sort of architecture for running a web app in different environments, such as different versions of mod_perl. I might be able to borrow bits of Catalyst for that.
  • Perl module to generate content for a specific URL and, if it's HTML or XHTML, adjust URLs that are managed by the CMS.
  • Optionally: some way for plugin modules to provide filters for particular mime types, so that a plugin could be supplied for adjusting links in RSS or Atom feeds for example.
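The URL adjustment might look something like this very rough sketch (Python, for illustration). It only handles double-quoted href and src attributes, and the preview URL scheme, a prefix followed by the escaped live URL, is entirely made up. A real implementation would want a proper HTML parser rather than a regular expression.

```python
import re
from urllib.parse import quote

# Only double-quoted href/src attributes are recognised in this sketch.
_ATTR_RE = re.compile(r'\b(href|src)="([^"]*)"')


def adjust_urls(html_text, managed_urls, preview_prefix):
    """Point links to CMS-managed URLs at the preview instead of the live site.

    managed_urls: set of live URLs known to the CMS.
    preview_prefix: hypothetical prefix to which the escaped URL is appended.
    """
    def replace(match):
        attribute, url = match.group(1), match.group(2)
        if url in managed_urls:
            url = preview_prefix + quote(url, safe="")
        return f'{attribute}="{url}"'

    return _ATTR_RE.sub(replace, html_text)
```

Links to URLs the CMS doesn't manage are left alone, so external links in a preview still go to the real sites.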

Unique IDs

There should be some easy way to get a globally unique, never-changing identifier for a file. I would assume that if a file generates multiple URLs they will all have basically the same content and so should have the same unique ID. So unique IDs should be tied to Subversion files. Committing a new revision of a file, and even renaming it, should not change the ID. Deleting it and replacing it with a new file should cause the new file to have a different ID.

I think I've figured out how to do this by keeping track of the path of a file at each revision it exists in. I'll write that up separately.

One note though: to allow importing content from elsewhere it may be necessary to allow a Subversion property to override the unique ID by providing a unique URL for a file.
