Loading metadata

I'm storing metadata about content (such as the titles of articles) in Subversion properties. This has the advantage that the metadata is all version controlled by Subversion, and can be adjusted using the normal svn command. The main reason for this though is so that I can pull out these individual values from the repository without having to parse them out of the actual content. So if you want to adorn your images with descriptions, you can just add a property and keep the binary data in the proper format.

When I update a working copy, I load all the properties for files into records in the wc_property table. Eventually this will be where properties are edited (in some hypothetical web interface) before the changes are committed.

For the most important metadata, though, I want the values to be more readily available, and some metadata needs to be processed before it becomes useful. For example, a few values, like the title and publication date of an article, should be stored alongside the content in the wc_file table, so that you don't have to do lots of extra queries or joins to get at them.

I've put together a little architecture for loading these values. Eventually I want to allow plugins to add their own metadata processing code, but the plugin loading stuff will come later. For now, I've got a hash which maps patterns, matched against the names of properties, to callback functions. Each function is passed a hash of all the properties, so that it can process them all at once if that's more efficient. A pattern can be just the name of a property, or something like foo:* for properties with a foo prefix, or * for all properties.
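To make the pattern matching concrete, here is a minimal sketch in Python (not the language Daizu itself is written in). The names pattern_matches and run_callbacks are made up for illustration, and so is the rule that a callback runs once whenever at least one property name matches its pattern; the real loader may well work differently.

```python
from typing import Callable, Dict, List

Properties = Dict[str, str]
Callback = Callable[[Properties], None]


def pattern_matches(pattern: str, name: str) -> bool:
    """Return True if a property name matches a pattern.

    Three forms are handled: '*' matches every name, 'foo:*' matches any
    name with the 'foo:' prefix, and anything else must match exactly.
    """
    if pattern == "*":
        return True
    if pattern.endswith(":*"):
        # Drop the trailing '*' and keep the colon, so 'foo:*' -> 'foo:'.
        return name.startswith(pattern[:-1])
    return name == pattern


def run_callbacks(handlers: Dict[str, Callback], props: Properties) -> List[str]:
    """Call each handler whose pattern matches at least one property name.

    Assumption: a handler is called once per file and receives the whole
    property hash. Returns the patterns that fired, in registration order.
    """
    fired = []
    for pattern, callback in handlers.items():
        if any(pattern_matches(pattern, name) for name in props):
            callback(props)
            fired.append(pattern)
    return fired
```

Passing the whole hash to every callback keeps the dispatcher simple: a handler registered for dc:title can still look at other properties if it needs them.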

These are the properties which I'm currently parsing with a predefined custom metadata callback, most of which simply store the value in the wc_file table alongside the content:

(Update: the complete list of properties understood by Daizu is now documented properly.)

svn:mime-type
Stored in the content_type column, if it contains a single valid MIME type.
dcterms:issued
Stored in the issued_at column, if it contains a single valid date and time. Currently only the Subversion datetime format is understood, but I'd like to make it accept the full range of W3CDTF formats.
dc:title
Stored in the title column, if it contains anything other than whitespace. Leading and trailing whitespace is removed.
dc:description
Stored in the description column, with the same processing as dc:title.
daizu:status
Sets the retired column to true if and only if it has the value retired. A retired file should be published as normal, but it shouldn't show up in navigation menus, section indexes, or blog feeds. The idea is that you can use this for an old file which is only kept for historical purposes. I'll probably want the templates to display an appropriate message on the actual page, saying that it is of historical use only. I might want to add other status values in the future, but I can't think of any useful ones yet.
daizu:tags
Stored separately in the wc_file_tag table. The value should be a list of tag names (terms), each on a separate line. This allows tags which contain spaces and commas to be listed. These tags are ‘folksonomy tags’, like the ones used by Technorati.
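
The simpler rules in the list above can be sketched as a few small Python functions. This is only an illustration under stated assumptions: the function names are hypothetical, trimming whitespace before comparing daizu:status with retired is my guess, and so is skipping blank lines in daizu:tags. The svn:mime-type and dcterms:issued checks are left out because their exact validation rules aren't spelled out here.

```python
from typing import Dict, List, Optional


def clean_text(value: Optional[str]) -> Optional[str]:
    """dc:title and dc:description: trim whitespace, None if nothing is left."""
    if value is None:
        return None
    stripped = value.strip()
    return stripped or None


def is_retired(value: Optional[str]) -> bool:
    """daizu:status: true only for the value 'retired'.

    Assumption: surrounding whitespace (such as a trailing newline) is ignored.
    """
    return value is not None and value.strip() == "retired"


def parse_tags(value: Optional[str]) -> List[str]:
    """daizu:tags: one tag per line, so tags may contain spaces and commas.

    Assumption: blank lines are skipped and each tag is trimmed.
    """
    if value is None:
        return []
    return [line.strip() for line in value.splitlines() if line.strip()]


def file_columns(props: Dict[str, str]) -> Dict[str, object]:
    """Collect the values destined for columns in the wc_file table."""
    return {
        "title": clean_text(props.get("dc:title")),
        "description": clean_text(props.get("dc:description")),
        "retired": is_retired(props.get("daizu:status")),
    }
```

The tags are returned separately rather than as part of file_columns, mirroring the fact that they live in the wc_file_tag table instead of wc_file.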
