Thursday, July 10, 2008

Getting Rid of Legacy Text Encoding

In the process of migrating my sites to a new server, I am also making my HTML code 7-bit clean. This should finally relieve me of ever having to manually change the character set in my terminal and the $LANG variable in my shell environment.

I tried to find a program that would do the conversion for me but, alas, I did not find anything that would help me get the job done quickly. Hence, I decided to write my own script.

Here’s my Perl script to convert non-ASCII characters to numerical character references (such as &#65;, which a browser renders as A):

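#!/usr/bin/perl
use strict;
use warnings;

# Decode the input as ISO-8859-15; the output only ever contains ASCII.
binmode(STDIN, ':encoding(iso-8859-15)');
binmode(STDOUT, ':encoding(iso-8859-15)');

while (<>) {
    # Walk the current line one character at a time.
    foreach (split(//)) {
        if (ord() < 2**7) {
            # 7-bit ASCII: pass through unchanged.
            print;
        } else {
            # Anything else: emit a decimal character reference.
            print '&#', ord(), ';';
        }
    }
}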

Call it like this: ./make-html-ascii-clean.pl < input.html > output.html.

If your files are not ISO-8859-15 encoded then change the STDIN mode appropriately. You can leave the STDOUT mode as is since the script will only print ASCII characters.
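A quick way to check that a converted file really is 7-bit clean is to count the bytes outside the ASCII range. The following is only a minimal sketch; count_non_ascii is a name made up for this example and is not part of the script above.

```shell
#!/bin/sh
# count_non_ascii FILE: print the number of bytes in FILE that lie outside
# the 7-bit ASCII range (0x00-0x7F). A 7-bit clean file yields 0.
# (Hypothetical helper, named for illustration only.)
count_non_ascii() {
    # In the C locale tr works on single bytes: delete every ASCII byte,
    # then count what is left. The final tr strips the padding that some
    # wc implementations print around the number.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running count_non_ascii output.html on a converted file should then print 0.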

The attentive reader may notice that this script can easily be modified to convert non-ASCII characters between different character sets:

#!/usr/bin/perl

binmode(STDIN, ':encoding(iso-8859-15)');
binmode(STDOUT, ':encoding(utf-8)');

while(<>) {
    foreach(split(//)) {
        print;
    }
}
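For a plain charset conversion like this one, the standard iconv utility can most likely do the same job without any Perl. A minimal sketch, assuming iconv is installed and recognizes the encoding names ISO-8859-15 and UTF-8 (latin9_to_utf8 is a hypothetical helper name):

```shell
#!/bin/sh
# latin9_to_utf8: read ISO-8859-15 text on stdin, write UTF-8 to stdout.
# (Hypothetical wrapper around iconv, named for illustration only.)
latin9_to_utf8() {
    iconv -f ISO-8859-15 -t UTF-8
}
```

It would be called just like the Perl script: latin9_to_utf8 < input.html > output.html.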

One final trick. If you have a whole bunch of files to convert, you can use find to automate the process:

find . \( -iname '*.html' -or -iname '*.php' \) \
      -exec sh -c '/path/to/make-html-ascii-clean.pl \
      < "{}" > "{}.tmp"; mv "{}.tmp" "{}"' ';'

Make sure to make a backup first though. In particular, please note that not changing the path in the command above will destroy all data in all .html and .php files in the current directory and all subdirectories!

This find command may look a bit scary. It’s really useful though so I’ll explain it step by step.

First observe that I ended every line but the last with a backslash (\). That indicates that the command continues on the next line. You could also delete these backslashes and put it all on one line. What we make the find command do is recursively walk all files and directories in the current directory and all subdirectories therein. For every match (a file or a directory; technically speaking there’s a bug here, since nothing restricts the search to regular files) it will run the make-html-ascii-clean.pl script and replace the content of said file with a 7-bit clean version of it.

The first line of the find command is easy to understand. We tell find to look for files whose name ends in .html or .php. On the second line we tell it to run, for each match, the command after -exec and before ';'. That is, it runs

sh -c '/path/to/make-html-ascii-clean.pl < "{}" > "{}.tmp"; \
    mv "{}.tmp" "{}"'

for each such file. First, however, it substitutes the match’s filename for each instance of {} (which I consistently enclosed in double quotes here in case a filename contains whitespace). Notice that this expression really contains two commands. To pack them into the single find statement, I call sh -c followed by the two commands, enclosed in single quotes and separated by a semicolon (;). Thus for each .html and .php file we run

/path/to/make-html-ascii-clean.pl < "{}" > "{}.tmp"

which converts the file {} and stores the result in {}.tmp. Finally, we overwrite the original file with our 7-bit clean version of it:

mv "{}.tmp" "{}"
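The same loop can be written more defensively. The sketch below is one possible variant, not the command used above, and convert_tree is a name invented for this example. It differs in four ways:

- It groups the two name tests with \( and \). find's implicit "and" binds tighter than -or, so without the grouping -exec would apply only to the .php files.
- It adds -type f to skip directories.
- It hands the filename to sh as a real argument instead of pasting {} into the script text. POSIX leaves it implementation-defined whether {} is replaced inside a longer argument.
- It only moves the temporary file into place when the converter succeeded, so a wrong converter path no longer empties the files.

```shell
#!/bin/sh
# convert_tree CONVERTER [DIR]: run CONVERTER (a filter that reads stdin and
# writes stdout) over every .html and .php regular file below DIR.
# (Hypothetical helper, named for illustration only.)
convert_tree() {
    converter=$1
    dir=${2:-.}
    find "$dir" -type f \( -iname '*.html' -o -iname '*.php' \) \
        -exec sh -c '
            # $1 is the converter and $2 the file; both arrive as real
            # arguments, so odd characters in file names are harmless.
            if "$1" < "$2" > "$2.tmp"; then
                mv "$2.tmp" "$2"
            else
                rm -f "$2.tmp"
                echo "conversion failed, left untouched: $2" >&2
            fi
        ' sh "$converter" {} ';'
}
```

With the Perl script from above this would be called as convert_tree /path/to/make-html-ascii-clean.pl. A backup beforehand is still a good idea.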