Saturday, December 1, 2012

Hairy Escaping Problems (Keep the Pieces 2)

I was just settling in to hack out a Smarty-like template system (or at least an interpreter for it) in a non-PHP language, when my brain went all meta on me. ‘How can I never, ever have to deal with careful manual control over output encoding, ever again?’


The simplest way (to my Perl-warped brain) would be to have the interpreter aware of substitution context in much the same way a Perl subroutine can check wantarray to know whether it’s being invoked in list, scalar, or void context. In the case of the template engine, the context would be HTML element, attribute… or URI, for instance <a href="/search?user={$name}&nextid=241">. But how do we know whether the user has a literal URL fragment to paste, or whether they intend $name to be a single URI component?
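To make the context problem concrete, here is a minimal Python sketch of how one value needs different escaping depending on where it lands. The function name escape_for and the three context labels are made up for illustration; only html.escape and urllib.parse.quote are real standard-library calls.

```python
from html import escape
from urllib.parse import quote


def escape_for(context: str, value: str) -> str:
    """Escape value for a (hypothetical) named substitution context."""
    if context == "element":
        # Text between tags: &, < and > must be escaped, quotes may stay.
        return escape(value, quote=False)
    if context == "attribute":
        # Inside a quoted attribute value: quotes must be escaped as well.
        return escape(value, quote=True)
    if context == "uri_component":
        # A single query-string value: percent-encode everything reserved.
        return quote(value, safe="")
    raise ValueError(f"unknown context: {context}")


name = 'Tom & "Jerry"'
for context in ("element", "attribute", "uri_component"):
    print(context, "->", escape_for(context, name))
```

The catch is the one raised above: the caller still has to name the context by hand, and a literal URL fragment would be mangled by the uri_component rule, so the engine cannot guess the intent from the string alone.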

To understand the context, the template library would want to have the application keep everything in a data structure “like the DOM” so that the final context would be known. Instead of building up giant HTML strings to print, piece by piece, every variable carefully escaped, you would build up a native object in the programming language, then serialize that to HTML. The serializer would know through the object structure what node type a given string was going to become, and be able to escape it accordingly.
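As a rough sketch of that idea in Python: the Element class and serialize function below are invented for this post, not taken from any library, and they ignore real-world details such as void elements and script bodies. The point is only that the serializer picks the escaping rule from the node type, so the application never escapes anything itself.

```python
from dataclasses import dataclass, field
from html import escape
from typing import Dict, List, Union


@dataclass
class Element:
    """A hypothetical, minimal DOM-like node."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List[Union["Element", str]] = field(default_factory=list)


def serialize(node: Union[Element, str]) -> str:
    """Turn the tree into HTML, escaping each string by its position."""
    if isinstance(node, str):
        # A bare string is a text node: escape &, < and > only.
        return escape(node, quote=False)
    # Attribute values also need their quotes escaped.
    attrs = "".join(
        f' {name}="{escape(value, quote=True)}"'
        for name, value in node.attrs.items()
    )
    inner = "".join(serialize(child) for child in node.children)
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


link = Element("a", {"title": 'say "hi" & <wave>'}, ["Tom & Jerry <3"])
print(serialize(link))
```

The application code only builds the tree; nothing in it mentions escaping at all.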

Most template systems can’t achieve this sort of thing: they care only for their string-substitution and application-defined escaping mechanism, and ignore the type of the document. Without knowing the document type, the escaping rules for that type are also unknown. Most frameworks with an auto-escape mechanism are built for the web, assume you’ll be generating HTML, and make all possible substitutions whether they’re needed or not, just to be safe. They’re still doing context-insensitive string replacement, after all.

But you notice I said “most,” because there’s one that I ran into that doesn’t: I met it in Clojure land, and it’s called enlive. It cares for the document type, because it takes a straight HTML file as a template, and gives you an interface for finding parts within and replacing them as you will. It can thus escape correctly on the output side.
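This is not enlive's API, just a loose Python analogue of the same shape using the standard library's xml.etree.ElementTree, which assumes the template is well-formed XML. The template string, the id values, and the render function are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# A plain markup file serves as the template: no substitution syntax at all.
TEMPLATE = '<div><h1 id="title">placeholder</h1><a id="more" href="#">more</a></div>'


def render(template: str, title: str, href: str) -> str:
    """Parse the template, replace selected parts, serialize back out."""
    root = ET.fromstring(template)
    # Find nodes by id and replace their content with raw, unescaped values.
    root.find(".//*[@id='title']").text = title
    root.find(".//*[@id='more']").set("href", href)
    # The serializer knows which strings are text and which are attributes.
    return ET.tostring(root, encoding="unicode", method="html")


print(render(TEMPLATE, "Tom & Jerry <3", "/search?user=tom&nextid=241"))
```

Because the values are placed into a parsed tree rather than spliced into a string, the escaping happens once, at output time, according to where each value sits.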

(I haven’t studied it very closely, but apparently the Perl world has a similar option, HTML::Seamstress.)

It seems like the approach taken by enlive is another instance of Keep the Pieces, where having data in its non-serialized form just generally helps in dealing with it.
