How to store file data in working copies

How should the actual data for files in my CMS be stored? I could store the data in files on disk, which might make it easier to deal with very large files, but it would make it really hard to ensure transaction safety when updating data in a working copy. So I'm storing the data in a bytea column in the PostgreSQL database.

I've got one optimization though. Since I want to allow many working copies to be checked out into the database (for example from different branches, to allow previewing a branch) I don't want to keep many copies of the same data. So for any working copy other than the live WC I'm allowing the data itself to be replaced with a reference to another file (the data_from_file_id column) which has the same data. Most branches will only have changes to a few files, so all the other files just need a separate copy of the metadata. For simplicity I'm only allowing references to the live working copy.

I'm storing a SHA1 hash of the content as well, and using that (as well as the length of the content) to search for matching data in the live WC which I can reference. I think that combination should make it unlikely enough that they'll be a mess-up due to the wrong data being referenced. And since the live WC always has a complete copy it won't be possible for the live site to get the wrong data published even if there's a hash and length collision.

One annoyance is that this might make the Perl code for publishing content slightly more complicated. I first thought that I wouldn't have to worry about it, since I'd be publishing from the live WC, so I could just pull the data column out without checking to see if the data was actually stored in a different record. But that's not good enough, because the publishing code should also work on other working copies, to allow previews to be generated. Still, I can probably just provide a utility function to take care of getting the data.

I may have gone slightly overboard on the PostgreSQL constraints though. This is what the four fields for storing content in the wc_file table currently look like:

data_from_file_id int references wc_file,
data bytea,
data_len int not null
    check (data_len >= 0),
data_sha1 char(27)
    check (length(data_sha1) = 27 and
           data_sha1 similar to '[A-Za-z0-9+/]+'),
constraint wc_file_wrong_data_len
    check (data_len = length(data)),
constraint wc_file_bad_dir_data
    check (not is_dir or
           (data_from_file_id is null and data is null and
            data_len = 0 and data_sha1 is null)),
constraint wc_file_bad_file_data
    check (is_dir or
           (data_sha1 is not null and
            ((data is not null and data_from_file_id is null) or
             (data is null and data_from_file_id is not null)))),
constraint wc_file_bad_empty_file_data
    check (is_dir or data_len > 0 or data is not null)

< Organising content | Loading metadata >