How should the actual data for files in my CMS be stored? I could
store the data in files on disk, which might make it easier to deal with very
large files, but it would make it really hard to ensure transaction safety
when updating data in a working copy. So I'm storing the data in a
bytea column in the PostgreSQL database.
I've got one optimization though. Since I want to allow many working
copies to be checked out into the database (for example from different
branches, to allow previewing a branch) I don't want to keep many copies
of the same data. So for any working copy other than the live WC I'm
allowing the data itself to be replaced with a reference to another file
(the data_from_file_id column) which has the same data.
Most branches will only have changes to a few files, so all the other
files just need a separate copy of the metadata. For simplicity I'm
only allowing references to the live working copy.
I'm storing a SHA1 hash of the content as well, and using that (as well as the length of the content) to search for matching data in the live WC which I can reference. I think that combination should make it unlikely enough that they'll be a mess-up due to the wrong data being referenced. And since the live WC always has a complete copy it won't be possible for the live site to get the wrong data published even if there's a hash and length collision.
One annoyance is that this might make the Perl code for publishing
content slightly more complicated. I first thought that I wouldn't
have to worry about it, since I'd be publishing from the live WC, so
I could just pull the data column out without checking
to see if the data was actually stored in a different record.
But that's not good enough, because the publishing code should also
work on other working copies, to allow previews to be generated.
Still, I can probably just provide a utility function to take care
of getting the data.
I may have gone slightly overboard on the PostgreSQL constraints
though. This is what the four fields for storing content in the
wc_file table currently look like:
data_from_file_id int references wc_file, data bytea, data_len int not null check (data_len >= 0), data_sha1 char(27) check (length(data_sha1) = 27 and data_sha1 similar to '[A-Za-z0-9+/]+'), constraint wc_file_wrong_data_len check (data_len = length(data)), constraint wc_file_bad_dir_data check (not is_dir or (data_from_file_id is null and data is null and data_len = 0 and data_sha1 is null)), constraint wc_file_bad_file_data check (is_dir or (data_sha1 is not null and ((data is not null and data_from_file_id is null) or (data is null and data_from_file_id is not null)))), constraint wc_file_bad_empty_file_data check (is_dir or data_len > 0 or data is not null)