Wednesday, March 18, 2009

Jakarta-ORO's Perl5Util: Subexpressions

Hello,

Lately I've had the job of fixing a Java-based content management system which was created back in 2003 and has barely been maintained since - the production server runs on Tomcat 4.1.12, which is stone-age technology right there. But you know how it is with IT - if it works, don't break it.

Which is precisely why I have been slated to work on it. It works, but the W3C validator whines to bloody glory over the ancient HTML that resides within it. It carries the weight of years of WYSIWYG editors that spat out non-compliant HTML code.

The system has an interesting feature that allows you to preview the saved pages before any layout and variable substitution occurs, so you can run it through a validator to make sure the markup is up to scratch. The order was given to make the entire site XHTML 1.0 Strict compliant, so they added a doctype to each page. Unfortunately, while the page formatter knew to strip out html, head and body tags, it knew not about doctype tags. A rogue doctype tag appeared and added another error to the exhaustive list (one page had over 300 errors!).

The stripping was done with the Perl5Util class of the Jakarta-ORO library, which lets you use Perl 5 style regular expressions from Java. I added another line to strip out the doctype tag. Then other issues arose - list items that had an opening tag but no closing tag. Omitting closing tags is a big no-no in XHTML, so in went another regular expression to add the missing tag. Another one again to get rid of any duplicate tags (as a programmer I am inherently lazy), and there went a good 80 errors off one page.
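As a rough sketch of what stripping a doctype with a regular expression can look like, here is a version using the JDK's java.util.regex instead of Perl5Util, so it runs without the ORO jar. The class name, method name and pattern are made up for illustration and are not the ones from the CMS. It also assumes a plain doctype with no internal subset, since a '>' inside the declaration would cut the match short.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the CMS's actual page formatter.
public class DoctypeStripper {
    // "<!DOCTYPE", then anything up to the first '>', then trailing whitespace.
    // The negated class [^>] also matches newlines, so a doctype that wraps
    // across lines is still removed.
    private static final Pattern DOCTYPE =
            Pattern.compile("<!DOCTYPE[^>]*>\\s*", Pattern.CASE_INSENSITIVE);

    public static String strip(String html) {
        return DOCTYPE.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
                + "<p>Hello</p>";
        System.out.println(strip(page));
    }
}
```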

Another issue had the W3C validator crying itself to sleep at night - HTML tag attributes with no quotes. Not a problem in traditional HTML, but not allowed in XHTML. So another bit of regex to add them in:
s|=([^\s">]+)|="\1"|g
Recompile, load the page... hang on, where did the images go?

Oh, they were still there, all right - one pixel by one pixel. My well-meaning regular expression had in fact replaced their attribute values with '1'. What the...

I tested the regex in Perl:
$bob = "<font size=-1 family=Arial>";

$bob =~ s|=([^\s">]+)|="\1"|g;

print "$bob\n";
As it should, it proudly printed the result:
<font size="-1" family="Arial">
But why wasn't it working in Perl5Util? I added a bit of debugging code to test:

Perl5Util util = new Perl5Util(); // org.apache.oro.text.perl.Perl5Util
String test = "<font size=-1 family=Arial>";
System.err.println("test(s|=([^\\s\">]+)|=\"\\1\"|g): "
+util.substitute("s|=([^\\s\">]+)|=\"\\1\"|g", test));
In the error log I saw:
<font size="1" family="1">
Argh.

A frenzy of Googling ensued before I discovered someone who was also having trouble with subexpressions in Perl5Util - except his were sort of working. The difference was he was using dollar signs instead of backslashes. A little adjustment to the code:
String test = "<font size=-1 family=Arial>";
System.err.println("test(s|=([^\\s\">]+)|=\"$1\"|g): "
+util.substitute("s|=([^\\s\">]+)|=\"$1\"|g", test));
And a peek in the error log:
<font size="-1" family="Arial">
Success! Now to adjust the last bit of code and the images appear again... and 200+ errors disappear from the W3C Validator report.

So remember, if you're using Perl5Util: Perl accepts both backslashes and dollar signs for subexpression references in the replacement, but Perl5Util only supports dollar signs.
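For what it's worth, the JDK's own java.util.regex behaves the same way in replacement strings: $1 refers to the first group, while a backslash simply escapes the next character, so \1 comes out as a literal '1'. The sketch below uses java.util.regex rather than Perl5Util, so it says nothing about how ORO works internally, and the class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: same pattern as the post, run through java.util.regex.
public class AttributeQuoter {
    // An '=' followed by a run of characters that are not whitespace,
    // a double quote, or '>' (i.e. an unquoted attribute value).
    private static final Pattern UNQUOTED = Pattern.compile("=([^\\s\">]+)");

    // "$1" in the replacement is a reference to capture group 1.
    public static String quoteWithDollar(String html) {
        return UNQUOTED.matcher(html).replaceAll("=\"$1\"");
    }

    // "\1" in the replacement is an escaped literal character '1',
    // not a group reference.
    public static String quoteWithBackslash(String html) {
        return UNQUOTED.matcher(html).replaceAll("=\"\\1\"");
    }

    public static void main(String[] args) {
        String test = "<font size=-1 family=Arial>";
        System.out.println(quoteWithDollar(test));    // <font size="-1" family="Arial">
        System.out.println(quoteWithBackslash(test)); // <font size="1" family="1">
    }
}
```

Attributes that are already quoted are left alone, because the character right after the '=' is a double quote and the group needs at least one non-quote character to match.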
