Character encodings in Perl

Juerd Waalboer has published a new piece of Perl documentation, perlunitut, which explains how character encodings work in Perl and, more importantly, how to deal with them. I found it made the whole thing much clearer, and I'd recommend that every Perl programmer read it.

I think the most important thing to take away from it is that you should keep text internally in Perl's own text strings, and be strict about doing decoding and encoding whenever text goes in or out of your program.
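
To make that concrete, here is a minimal sketch using the core Encode module. The byte string is made up for illustration; in a real program it would come from a file, a socket or a database.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from outside: "café" encoded as UTF-8.
# (Illustrative data only.)
my $bytes_in = "caf\xC3\xA9";

# Decode at the boundary: bytes in, Perl text string out.
my $text = decode('UTF-8', $bytes_in);

# Inside the program, length() now counts characters rather than bytes.
my $char_count = length $text;
my $byte_count = length $bytes_in;
print "$char_count characters, $byte_count bytes\n";

# Encode at the boundary again before the text leaves the program.
my $bytes_out = encode('UTF-8', $text);
```

The point of the sketch is only that decoding and encoding each happen once, at the edges, and everything in between works on characters.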

Encodings in Daizu

This isn't something I've tried too hard to get right yet, but obviously I'll want everything working before the release, so I've added some items to the TODO list:

  • Get rid of all the utf8::decode() calls and the like and use the Encode module instead, firing off warnings or errors when something isn't in the expected encoding. Also explicitly encode output as UTF-8.
  • Decode Subversion properties which are to be treated as text. There's nowhere to declare their encodings, so I'll assume things like dc:title are always in UTF-8. I also make that assumption for XHTML content, but that will probably change in the future.
  • Instead of using binmode $fh, ':utf8' to handle encoding on filehandles, either use Encode explicitly or open the filehandle with the encoding layer in the first place, so that the encoding is documented in the right place:

    open my $fh, '>:encoding(UTF-8)', $filename
        or die "Can't open $filename: $!";
  • Turn off the DBD::Pg UTF-8 handling (I don't think it really works, and it probably won't be very robust in the face of badly encoded data). Instead, be careful about always decoding stuff from the database at the right point.
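
As a rough idea of what strict decoding with Encode could look like for things like Subversion properties or database values, here is a sketch. The helper name decode_utf8_strict and its error message are hypothetical, not anything that exists in Daizu; the Encode::FB_CROAK and Encode::LEAVE_SRC flags are part of the Encode module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decode octets that are expected to be UTF-8, and die
# with a message naming the source if they turn out not to be.
sub decode_utf8_strict {
    my ($octets, $what) = @_;
    my $text = eval {
        # FB_CROAK makes decode() die on malformed input;
        # LEAVE_SRC stops it from modifying $octets in place.
        Encode::decode('UTF-8', $octets,
                       Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    die "$what is not valid UTF-8\n" unless defined $text;
    return $text;
}

# Well-formed UTF-8 decodes to a text string.
my $title = decode_utf8_strict("Caf\xC3\xA9", 'dc:title');

# A lone 0xE9 byte (Latin-1 'é') is not valid UTF-8, so this call dies
# and the eval returns false.
my $ok = eval { decode_utf8_strict("Caf\xE9", 'dc:title'); 1 };
print "bad input rejected\n" unless $ok;
```

Whether badly encoded data should be a fatal error or just a warning is a separate decision; the sketch dies because that is the simplest way to make the problem visible.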
