One of the unexpected complications of using Subversion to store content was the problem of keeping track of the identity of files. By ‘identity’ I mean that even though a file might be changed in various ways, including a change of filename, it can still be considered to be the same file in some sense.
This concept is important if you're generating an Atom or RSS feed, where a blog article should have a unique identifier (in the form of a URI) which doesn't change when the article is updated. That way, even if the URL of the article changes, or if it is published on multiple websites, a clever news reader can avoid showing it to the user more than once. There may or may not be other cases where a unique identifier is needed for files in a CMS, but it's hard to see how to manage content without some way of identifying a file that doesn't fall back on the path it happened to have at some point in time.
Nomenclature
Identifying unique files is further complicated by branching. I think
it makes sense to think about a file as being unique within a branch.
So a branch is a complete universe of content which isn't going to be
published in a normal way at the same time as any other branch. When I
talk about a branch I mean (in Subversion terminology) a branch,
a tag, or the trunk. Keeping track of which branches exist and what
their paths are is a whole different problem, so I'll tackle that elsewhere,
but for now I'm assuming that each branch has an ID number in the database's
branch table, which also records its root path. When I talk
about the path of a file I mean its path in the database, which
is relative to the path of the branch it's in. Of course in the Subversion
repository these are just parts of the same path. I'm assuming that my
relative paths always start from the same level of hierarchy, which means
that branches should always be a complete copy of all content.
One final bit of non-obvious terminology: I need a name for the
identity of a file which stays the same as it moves through the
revision history. I'm already using the term file for a
file in a particular revision, or as a more general term, so I'm
using the term GUID (as in ‘Globally Unique Identifier’)
to refer to a unique file, irrespective of which revisions it appears in.
It's not an ideal name, because I've ended up with database columns called
guid_id, but it kind of makes sense.
Database representation
So each GUID has a record in this table:
    create table file_guid (
        id serial primary key,
        is_dir boolean not null,
        uri text not null unique,
        old_uri text,
        custom_uri boolean not null default false,
        constraint file_guid_old_uri_missing
            check (custom_uri = (old_uri is not null))
    );
The id column is used as an internal unique ID for
these things, but each one has an associated uri column
for use as a real GUID. There will be some configuration in the future
to allow these to be generated based on your domain name or email address,
but I'll deal with that later.
The URIs chosen by the CMS can be overridden by putting a Subversion
property on a file or directory (currently called daizu:guid).
This is so that content imported from some other publishing system can
keep the same GUIDs that were used previously, avoiding any breakage in
things like Atom feeds.
If the property contains a valid URI then it becomes the GUID URI for the file.
The standard URI which the CMS generates is stored in old_uri
so that it can still be used if the daizu:guid property is
later removed. I only look at this property on the trunk, since I don't
want to end up with different GUIDs for each branch. That might mean that
if you're previewing changes made on a branch then the GUIDs might not be
what you'd expect, but I can't imagine that being a serious problem. The
ones on the live site should always be correct, providing the live site
is based on the trunk.
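To make the override rule concrete, here is a minimal sketch in Python (not the Perl the real loader is written in) of how a file_guid row might be updated from the trunk's daizu:guid property. The FileGuid class, the apply_guid_property and looks_like_uri helpers, and the example URIs are all hypothetical, and "valid URI" is assumed to mean nothing more than "parses with a scheme".

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse


@dataclass
class FileGuid:
    """In-memory stand-in for one row of the file_guid table."""
    uri: str
    old_uri: Optional[str] = None
    custom_uri: bool = False


def looks_like_uri(value: str) -> bool:
    # Assumption: a value counts as valid if it parses with a scheme.
    return bool(urlparse(value.strip()).scheme)


def apply_guid_property(row: FileGuid, prop_value: Optional[str]) -> None:
    """Update a row from the trunk's daizu:guid value (None if absent)."""
    # The CMS-generated URI is whichever one is not the custom override.
    generated = row.old_uri if row.custom_uri else row.uri
    if prop_value is not None and looks_like_uri(prop_value):
        # Custom URI wins; keep the generated one so it can be restored.
        row.uri = prop_value.strip()
        row.old_uri = generated
        row.custom_uri = True
    else:
        # Property removed or not a URI: fall back to the generated URI.
        row.uri = generated
        row.old_uri = None
        row.custom_uri = False
```

Both branches leave the row satisfying the table's check constraint: custom_uri is true exactly when old_uri is set.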
is_dir makes sense because, in a Subversion filesystem,
it's impossible to change a file to a directory or vice versa through
copying. At least I think it is.
There's another table which records every path which each GUID has had in each revision, in each branch where it existed:
    create table file_path (
        guid_id int not null references file_guid,
        path text not null,
        branch_id int not null references branch,
        first_revnum int not null references revision
            check (first_revnum >= 1),
        last_revnum int references revision
            check (last_revnum >= 1),
        constraint file_path_bad_revnums
            check (last_revnum >= first_revnum),
        primary key (guid_id, branch_id, first_revnum)
    );
It might help to give an example here. The entries for a particular
GUID might look like this (where guid_id
would be the same for each record):
| Branch | Path | Revnums | Description |
|---|---|---|---|
| 1 | orig.html | 3–5 | Created in r3, with this path, and had the same path in revisions 4 and 5. |
| 2 | other.html | 4– | Copied into branch 2 in r4 with a different path, and has had the same path since in that branch, unaffected by changes on branch 1. |
| 1 | new.html | 6–6 | Meanwhile, back on branch 1, renamed in r6, but only had this path in r6. |
| 1 | revived.html | 9– | This GUID doesn't have any record for revisions 7 or 8, so it must have been deleted in r7, but it appears again with a different path in r9. It must have been copied from an earlier revision where it did exist, or from a different branch. It has the same path up to and including the latest revision. |
I've shown the first_revnum and last_revnum
columns combined. last_revnum is null when the path
is currently correct, in the most recent revision.
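As an illustration of how this table can be queried, the sketch below loads the example rows into an in-memory SQLite database (a stand-in for the real database, with the foreign keys and check constraints left out) and looks up which GUID had a given path in a given branch and revision. The guid_at function is a hypothetical helper, and the guid_id value of 1 is made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    create table file_path (
        guid_id int not null,
        path text not null,
        branch_id int not null,
        first_revnum int not null,
        last_revnum int,
        primary key (guid_id, branch_id, first_revnum)
    )
""")
# The four example rows from the table above, all for the same GUID.
conn.executemany(
    "insert into file_path values (?, ?, ?, ?, ?)",
    [
        (1, "orig.html", 1, 3, 5),
        (1, "other.html", 2, 4, None),
        (1, "new.html", 1, 6, 6),
        (1, "revived.html", 1, 9, None),
    ],
)


def guid_at(conn, branch_id, path, revnum):
    """Return the guid_id which had this path in this branch and revision."""
    row = conn.execute(
        """
        select guid_id from file_path
        where branch_id = ? and path = ?
          and first_revnum <= ?
          and (last_revnum is null or last_revnum >= ?)
        """,
        (branch_id, path, revnum, revnum),
    ).fetchone()
    return row[0] if row else None
```

A null last_revnum is treated as open-ended, so the caller is assumed to pass only revision numbers which have already been loaded.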
Implementation
I've got a Perl module which loads information from new revisions into these tables. Since a Subversion revision can never change once it has been committed this only needs to be done once each time there are new revisions which the database needs to know about. I do this automatically whenever a database working copy is checked out or updated, so the information is always there for revisions we care about.
(It is actually possible to change the unversioned ‘revision properties’
on a revision after it has been committed. The only use I've heard of
for that is to correct an old commit log message. For that reason I don't
currently store the log messages in the database. The only revision property
I do store is svn:date (the date and time of the commit).
I'm not sure if it's possible to change that, but it seems unlikely to
cause problems. The reason I store it is so that in the future I'll be
able to use it for date-based searches, which can be unreliable in Subversion
if the commits aren't done in date order, for example if you're adding
revisions for history derived from some other source, like CVS.)
To get the information about new revisions I'm using the
SVN::Ra get_log method. This supplies a hash of paths for
each revision and information about whether they were added, modified, or
deleted. When a file is copied Subversion counts it as an added file,
but provides the full path and revision it was copied from. My code
splits that path into the branch path (which yields the
branch_id) and the path relative to the branch. With
those three pieces of information I can look up the GUID of the
original file, and use that for the newly added path, so that moved
or revived files keep their original GUID.
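The path-splitting step might look something like the following Python sketch (the real code is Perl and is not shown here). The BRANCHES mapping is a hypothetical stand-in for the branch table, with made-up root paths, and split_repo_path and copy_source_key are illustrative names.

```python
from typing import Optional, Tuple

# Hypothetical contents of the branch table: branch_id -> root path.
BRANCHES = {1: "trunk", 2: "branches/redesign"}


def split_repo_path(repo_path: str) -> Optional[Tuple[int, str]]:
    """Return (branch_id, path relative to the branch), or None if the
    path is not inside any known branch."""
    path = repo_path.strip("/")
    # Try longer roots first so a nested root is never shadowed.
    for branch_id, root in sorted(BRANCHES.items(), key=lambda kv: -len(kv[1])):
        if path == root:
            return branch_id, ""
        if path.startswith(root + "/"):
            return branch_id, path[len(root) + 1:]
    return None


def copy_source_key(copyfrom_path: str, copyfrom_rev: int):
    """Build the three values needed to look up the source file's GUID."""
    split = split_repo_path(copyfrom_path)
    if split is None:
        return None
    branch_id, rel_path = split
    return branch_id, rel_path, copyfrom_rev
```

Matching on the root followed by a slash, rather than on a bare prefix, stops a path like /trunkish/x from being mistaken for something inside the trunk.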
GUID conflicts
Sometimes a copy of a file might cause a GUID conflict, where the GUID
of the file you're copying from is already present in the branch
and revision you're copying it to. This will happen if you do
svn copy within a branch (providing you don't also delete the
old file at the same time, which would make it the same as
svn move), or if you're copying a file from a branch or
reviving a file which already exists in the place you're copying to.
I've solved this by collecting all the path changes for a single revision, organising them according to the type of operation, and then processing them in this order:
- Deleting
- Copying
- Adding
The deletions can't cause any conflicts, so I update the database to account for them first. This is all done in a transaction of course, so I don't have to worry about changing particular files atomically.
Copies (additions ‘with history’) are the only place conflicts can occur. I organise the copy operations by the GUID of the source. If that GUID already exists in the current branch and revision then clearly the new file can't keep its GUID, so it needs to get a new one minted. I do that simply by moving the copy operation on to the pile of additions which will be processed later. Effectively I'm discarding the historical information about this file, because for this purpose I can't do anything with it.
Usually there's only going to be one copy for each source GUID, but it is possible to copy the same source (or several different sources which happen to be the same file, in the sense that they have the same GUID) into the same branch in a single revision. If that happens then I pick one of them which gets to keep the GUID, providing it's not already used in the current branch and revision, and change all the others into plain additions. I've arbitrarily chosen just to pick the first one after sorting them by their full path in the repository.
Finally I do the additions, including any which were actually copies but couldn't preserve the GUID of the source. Each gets a completely new GUID.
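The three passes described above can be sketched in Python as follows. This is a simplified in-memory model under stated assumptions, not the actual Perl loader: it handles files only (no directories), keeps a full snapshot of every revision instead of the file_path table, and the Change and GuidTracker names are invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

PathKey = Tuple[int, str]          # (branch_id, path relative to the branch)
CopySource = Tuple[int, str, int]  # (branch_id, path, revnum)


@dataclass
class Change:
    action: str                    # "A" (added or copied) or "D" (deleted)
    branch_id: int
    path: str
    copy_from: Optional[CopySource] = None


class GuidTracker:
    def __init__(self) -> None:
        self.live: Dict[PathKey, int] = {}                  # latest revision
        self.snapshots: Dict[int, Dict[PathKey, int]] = {}  # per revision
        self.next_guid = 1

    def apply_revision(self, revnum: int, changes: List[Change]) -> None:
        deletes = [c for c in changes if c.action == "D"]
        copies = [c for c in changes if c.action == "A" and c.copy_from]
        adds = [c for c in changes if c.action == "A" and not c.copy_from]

        # 1. Deletions cannot conflict, so they go first.
        for c in deletes:
            self.live.pop((c.branch_id, c.path), None)

        # 2. Copies, grouped by destination branch and source GUID.
        groups: Dict[Tuple[int, int], List[Change]] = {}
        for c in copies:
            src_branch, src_path, src_rev = c.copy_from
            guid = self.snapshots.get(src_rev, {}).get((src_branch, src_path))
            if guid is None:
                adds.append(c)     # unknown source: treat as a plain add
            else:
                groups.setdefault((c.branch_id, guid), []).append(c)
        for (branch_id, guid), group in groups.items():
            # Same branch throughout, so sorting by relative path gives
            # the same order as sorting by full repository path.
            group.sort(key=lambda c: c.path)
            in_use = any(b == branch_id and g == guid
                         for (b, _), g in self.live.items())
            keeper = None if in_use else group[0]
            for c in group:
                if c is keeper:
                    self.live[(c.branch_id, c.path)] = guid
                else:
                    adds.append(c)  # demoted to a plain addition

        # 3. Additions, including demoted copies, each get a new GUID.
        for c in adds:
            self.live[(c.branch_id, c.path)] = self.next_guid
            self.next_guid += 1

        self.snapshots[revnum] = dict(self.live)
```

Because deletions are applied before copies, a deletion plus a copy in the same revision (which is what svn move produces) leaves the GUID free for the copy to reuse, with no special handling for moves.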
As far as I can tell this algorithm behaves reasonably in all the important situations:
- If a file is moved (with svn move) then Subversion records it as a deletion of the original file plus a copy of it to a new path from the previous revision. This works for my algorithm because after processing the deletion, the file's GUID is no longer present in the current revision, so that copy is free to keep the same GUID and give it a new path. So I'm not trying to detect when a deletion plus copy is actually a ‘move’ operation, but the result is as if I had.
- If there would be a conflict of GUIDs, the algorithm guarantees that only one will keep it, but preserves GUIDs across as many copies as possible.
- If you delete a file and then undelete it (that is, copy it from an earlier revision before it was deleted) then the recovered version will have the same GUID as the original, even if it was recovered with a different path or in a different branch, providing that wouldn't cause a conflict.
OK, so I think that's everything. Having written it all down I'm a bit more confident that this approach will actually work in practice. I'll post again if I find any serious problems with it.