Thursday, July 12, 2012

Unintended Consequences (an API Design Rant)

tl;dr

  • Your package needs to understand encoding. It can’t just throw away structure and hope for the best.
  • If a package is overly simple, it’s likely to be too simple for real-world use; and more likely to be reimplemented with a “slightly less awful” hack because it wasn’t that big in the first place.
  • MIME won.  Email packages should understand/generate MIME by default where necessary, and only avoid MIME processing at the caller’s option.  If that.
  • Perl’s Unicode implementation adds too much complexity (and it doesn’t help that people don’t agree on terminology).  Now package authors get to write two APIs, one for Unicode and one for octets.

Exposition

Consider the simple program:
use Email::MIME;
$m = Email::MIME->create(
header_str => [
To => join(', ', qw(
  foobar2000@example.com
  spaceball99@other.example.com
  break@domain.example.net
))]);
print $m->as_string;


What happens if you run this on Perl 5.10?  The final email address is “break@domain.example .net”—note the space between “example” and “.net”!

As it turns out, such a thing may actually be legal per RFC 2822: the obs-domain syntax allows for embedded CFWS around the dots of the domain.  However, Amazon SES doesn’t support it, so some email was bouncing with the error message, “Domain contains control or whitespace.”

The rogue space explains the SES failure, but how did it get there in the first place?

The Debugging

As part of its work, the header_str method takes a Unicode string (i.e. a regular Perl string) and encodes it in the MIME =?UTF-8?Q?...?= syntax (if it happens to have code points above 127; indeed, it leaves everything from NUL \x00 to DEL \x7F intact) via Encode::encode("MIME-Q", $value).

Besides raw encoding, MIME-Q also attempts to fold encoded values to lines of 75 octets, and it considers its $especials regex to be break-able points in the string.  That regex in turn carries all of the “special” characters, including dots.

Email::MIME actually passes in the entire header value, even for headers that it shouldn’t (because an addr-spec MUST be US-ASCII), which means that the email address itself is included as part of the string being encoded.  Since it contains dots and Encode::MIME::Header is too low-level to understand email structure, the encoder sees the dot and breaks the line there: it inserts a newline and space.

RFC 2822 expects folding to happen at whitespace (hence the FWS and CFWS everywhere), so when unfolding, the leading whitespace of the unfolded line is kept.  With the folded line unfolded, the email address still contains the space that was inserted to fold the header in the first place.

But wait.  This space is showing up locally, before we send the message to SES.  Who has unfolded it?  We’re passing in a valid email address and more-or-less expecting it to be pasted as-is into an email message.  But that’s not what happens.

Email::Simple’s create implementation, as of version 2.101, immediately constructs a message string, then passes it to new(), which parses it back out.  Thus, you pass in pieces of an email, only to have it build you a string, then rip the string back into pieces.  It doesn’t just keep the pieces and validate them directly.

It so happens that there’s another problem lurking in Email::MIME: for a very narrow range of string lengths, the output can actually contain a byte sequence like “..., bug@example.com,\n\x20trigger@example.\x20com”.  The evidence suggests that Email::MIME is also trying to encode the final completed header (“To: ...”) as well as the header value that went into it.  Because of the addition of “To: ” to the line, the fold happens at a different place.

Later changes to Encode::MIME::Header have altered the behavior so that on Perl 5.14.2, the encoder no longer inserts the break in the domain; it is after the angle bracket instead, which is just as incorrect.  Only the atom of obs-local-part would allow for FWS or CFWS at that position.

Lessons / Rant

Part of this bug is clearly caused by trying to deal with too little detail inside Email::MIME, which ends up passing inappropriate strings to Encode::MIME::Header for encoding.  The encoder is at too low of a level to understand what context the string is in, so it can’t really do much except encode the entire thing.  If I hand you a block of text and ask you to encode it all, who are you to start guessing at what it might mean?

I ended up solving my problem by effectively implementing header_str correctly and robustly in my own code, then passing the results to header—which, fortunately, folds at true white space and not random special characters.  If a package is too simple, then it fails in the real world, and because it’s so simple, it’s easy to reimplement.  Then the package is failing to provide value.

Why do header and header_str exist as separate functions?  The former is the old-style “strings are bytes” world where callers promise that the string is encoded correctly.  header_str is meant to deal with Unicode strings, where callers instead promise they decoded the input properly and this is valid Unicode, pushing the task of handling the output encoding down into header_str.  This split would be pretty much unnecessary if strings could remember whether they were decoded or not, and what encoding they had if they were encoded.

It also seems like it would be a lot easier and more accurate to build up header values in components, something like my $addr = Fantasy::Email::Address ->new($email) ->phrase(trim("$first $last"));.  Instead of orienting the interface around parsing, it would be built around the decomposed data structure and a parser would be added atop it.

Consider, after all, XML parsing.  The major divisions of parsers—DOM, SAX, and pull—all serve different needs.  Building a DOM-style parser into the core representation is the wrong design.

(Updated 7/13) – It even turns out that a bug was filed on the CPAN RT, ages ago, along with a patch in a pull request on github five months ago, either of which could have avoided my problem.  This has clearly been repeatedly identified and now people's time is just getting wasted on it.

No comments: