Monday, July 30, 2012

DynamoDB Performance

Things I learned this past week:
  1. AWS Signature V4 no longer requires temporary credentials.  If you aren’t caching your tokens, this can give you a nice speedup because it cuts IAM/STS out of the connection sequence.
  2. AWS service endpoints use SSL.  If you make a lot of fresh connections, each one pays the handshake overhead again.
  3. Net::Amazon::DynamoDB and CGI are terrible things to mix.
Read on for details.

Wednesday, July 18, 2012

Add Multiply Exponentiate Tetrate

A random thought occurred to me today:
  1. Multiplication is iterated addition.  (= (* 5 4) (reduce + (take 4 (repeat 5))))
  2. Exponentiation is iterated multiplication.  (= (expt 5 4) (reduce * (take 4 (repeat 5)))) ; if you've imported clojure.contrib.math/expt
  3. So what do you get if you iterate exponentiation?  Is it useful?
Another way to look at the question is: logarithms strength-reduce by one level, hence the log rules like (= (+ (log a) (log b)) (log (* a b))) and the definition of log as the inverse of exponentiation, just as division and subtraction invert multiplication and addition.  What, then, is the law for (expt (log a) (log b))?  Again, this should be the log of our mystery operation on a and b.

It turns out Wikipedia has me covered.  A tiny little section on the Exponentiation page links off to tetration, Ackermann function, and Knuth's up-arrow notation.  There goes my night.
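
For anyone who wants to poke at the numbers, here is a minimal Python sketch of the three steps in the list above. The helper names iterate and tetrate are made up for this post. One thing it shows is that a plain left-to-right reduce over exponentiation does not give tetration: tetration is conventionally defined as a right-to-left power tower, so it needs its own loop.

```python
from functools import reduce
from itertools import repeat
import operator


def iterate(op, a, n):
    """Fold n copies of a with op, left to right (n must be at least 1).

    This mirrors (reduce op (take n (repeat a))) from the list above.
    """
    return reduce(op, repeat(a, n))


def tetrate(a, n):
    """Power tower of n copies of a, evaluated right to left: a ** (a ** ...)."""
    result = 1
    for _ in range(n):
        result = a ** result
    return result


print(iterate(operator.add, 5, 4))  # same as 5 * 4
print(iterate(operator.mul, 5, 4))  # same as 5 ** 4
print(iterate(operator.pow, 2, 4))  # ((2 ** 2) ** 2) ** 2 = 256
print(tetrate(2, 4))                # 2 ** (2 ** (2 ** 2)) = 65536
```

The left fold collapses into an ordinary power because (a^b)^c is just a^(b*c), which is why the interesting operation is the right-associated one.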

(Bonus: I finally have all the background to understand the third panel of xkcd 207. #latetotheparty)

Tuesday, July 17, 2012

find: arguments are not actually one big expression

I first learned about find somewhere around 12 years ago, so the documentation today might contradict me, but I’ve been carrying around in my head this false notion: that flags like -print0 participate in the conditions as a true value while they update a flag that sets the eventual output format.

They don’t, in fact, participate.  find accepts them with -print0 -a «expr» syntax just to mess with you, and outputs the filename as soon as evaluation reaches -print0, before «expr» has been tested at all.  That means these two commands print the same thing (provided «expr» has no actions of its own):
find . -print0 -a «expr»
find . -print0
And in fact, if you offer multiple -print options, you'll get each filename printed out multiple times.

I was intending to do CRLF→LF translations only on text files, and the extension-matching came after the -print0.  Since find emitted no warning, I only noticed the damage when I deployed the website and looked at Firefox valiantly trying to make sense of all the broken images.

The correct way to write the command is actually:
find . «expr» -print0
This triggers the -print0 “option” only once the complete expression has matched, and it suppresses the regular/default -print, of course.
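
The ordering behaviour can be sketched as a toy model in Python. This is not how find is implemented; run_find, name and print0 are hypothetical stand-ins, and the only assumption is that primaries are evaluated left to right with a short-circuit AND, where an action like -print0 has a side effect and always counts as true.

```python
import fnmatch


def run_find(paths, expression):
    """Evaluate each primary in order for every path, stopping at the first
    false one (a toy stand-in for find's implicit -a between primaries)."""
    printed = []
    for path in paths:
        for primary in expression:
            if not primary(path, printed):
                break  # a false test ends evaluation for this path
    return printed


def name(pattern):
    """Test primary: true when the path matches the glob; no side effects."""
    return lambda path, printed: fnmatch.fnmatchcase(path, pattern)


def print0(path, printed):
    """Action primary: records the path as a side effect, always true."""
    printed.append(path)
    return True


paths = ["index.html", "logo.png", "notes.txt"]

# Action first: every path is emitted before the name test is even consulted.
action_first = run_find(paths, [print0, name("*.txt")])

# Test first: only paths that survive the test reach the action.
test_first = run_find(paths, [name("*.txt"), print0])

print(action_first)  # all three paths
print(test_first)    # only notes.txt
```

Passing [print0, print0] to the same model emits every path twice, which matches the duplicate output from multiple -print options.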

Thursday, July 12, 2012

Unintended Consequences (an API Design Rant)

tl;dr

  • Your package needs to understand encoding. It can’t just throw away structure and hope for the best.
  • If a package is overly simple, it’s likely to be too simple for real-world use; and more likely to be reimplemented with a “slightly less awful” hack because it wasn’t that big in the first place.
  • MIME won.  Email packages should understand/generate MIME by default where necessary, and only avoid MIME processing at the caller’s option.  If that.
  • Perl’s Unicode implementation adds too much complexity (and it doesn’t help that people don’t agree on terminology).  Now package authors get to write two APIs, one for Unicode and one for octets.

Exposition

Consider the simple program:
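use strict;
use warnings;
use Email::MIME;

# Build a message whose To header joins three addresses into one string.
my $m = Email::MIME->create(
  header_str => [
    To => join(', ', qw(
      foobar2000@example.com
      spaceball99@other.example.com
      break@domain.example.net
    )),
  ],
);
print $m->as_string;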


What happens if you run this on Perl 5.10?  The final email address is “break@domain.example .net”—note the space between “example” and “.net”!

As it turns out, such a thing may actually be legal per RFC 2822: the obs-domain syntax allows for embedded CFWS around the dots of the domain.  However, Amazon SES doesn’t support it, so some email was bouncing with the error message, “Domain contains control or whitespace.”

The rogue space explains the SES failure, but how did it get there in the first place?

Friday, July 6, 2012

Keep the Pieces

When a low-level function is going to write out a string, for instance the To header of an email, I often find myself tempted to “just” pass down strings.  Often, I find later that I would rather have the header in array form at some intermediate level, so that I can add recipients only if they’re not already present.  I’m then forced to parse the string in some manner, with that choice requiring some balance of performance and correctness.  (It’s tempting to make code deal only with the subset of the RFC you think you’ll need.)

If this happens more than once on the way down (“some errors should email us admins if we’re not already involved in this transaction”), it gets even worse: build original array, reify to string, {parse, modify, reify} × 2.  Whereas just handing the array down through the layers looks more like build, modify × 3, reify and send.

Letting the lowest layer put the data on the wire in the format the wire requires can also be more robust: if there are only Bcc recipients and the To address is “Undisclosed-recipients: ;” then checking whether To is empty loses its simplicity: it can have a non-empty value and yet not have real recipients.  Also, nobody at the higher layers has to care whether your addresses are actually separated by comma or by semicolon.

Finally, this lets you push down basic cleaning like calling array_unique() into the lowest level, meaning each modification along the way can quickly append and trust the result will be safe on the wire.  All those layers become more concise and readable.
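
A minimal sketch of that shape, in Python rather than PHP. The function names are invented for illustration, dict.fromkeys stands in for array_unique(), and comparing addresses as exact strings is a simplification.

```python
def add_recipient(recipients, address):
    """Intermediate layers only append; nobody parses a header string."""
    recipients.append(address)


def render_to_header(recipients):
    """Lowest layer: clean the list, then reify it into the wire format."""
    # Order-preserving de-duplication, playing the role of array_unique().
    unique = list(dict.fromkeys(recipients))
    if not unique:
        # Only this layer needs to know about the placeholder value.
        return "To: Undisclosed-recipients: ;"
    return "To: " + ", ".join(unique)


to = ["bob@example.com"]                 # build
add_recipient(to, "admin@example.com")   # modify
add_recipient(to, "bob@example.com")     # modify again; duplicate is harmless
header = render_to_header(to)            # reify once, just before sending

print(header)
```

Checking for "no real recipients" anywhere above the bottom layer is just a test for an empty list, and the separator is decided in exactly one place.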

Thursday, July 5, 2012

Auth doesn't belong in the session

PHP locks session access by default, from the time you call session_start() until session_commit(), or until the response is fully written if you didn't commit it earlier.  If you store your authenticated state ("user bob; expires at 12:30") inside the session, then you have to open the session any time you need to know who the user is.  If that leads you to open the session automatically and hold it open for the whole request, you hurt parallelism: concurrent requests from the same user wait on the session lock even when they only need to read.

If you store the auth info in a separate, MAC'd* cookie instead, then you can read the auth state without affecting the session.  Of course, the auth cookie is the most powerful one, so all possible protections should apply: HttpOnly and Secure, served over HTTPS.

* Don't let your users impersonate each other by editing their own cookies.
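
Here is a minimal sketch of a MAC'd auth value, written in Python rather than PHP. The "user|expiry|mac" layout, the function names and the key are all assumptions made up for this example. It covers only signing and checking; actually setting the cookie with HttpOnly and Secure is left to whatever framework you use.

```python
import hashlib
import hmac
import time

# Placeholder only: a real key should be long, random and kept server-side.
SECRET_KEY = b"replace-with-a-long-random-server-side-secret"


def _mac(payload, key):
    """Hex HMAC-SHA256 of the payload string."""
    return hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()


def sign_auth_cookie(user, expires_at, key=SECRET_KEY):
    """Return 'user|expiry|mac', where the MAC covers 'user|expiry'."""
    payload = f"{user}|{int(expires_at)}"
    return f"{payload}|{_mac(payload, key)}"


def read_auth_cookie(cookie, key=SECRET_KEY, now=None):
    """Return the user name, or None if malformed, forged, or expired."""
    try:
        payload, mac = cookie.rsplit("|", 1)
    except ValueError:
        return None
    # Constant-time comparison, done before trusting any field.
    if not hmac.compare_digest(mac.encode(), _mac(payload, key).encode()):
        return None
    try:
        user, expires = payload.rsplit("|", 1)
        expires_at = int(expires)
    except ValueError:
        return None
    current = time.time() if now is None else now
    return user if current < expires_at else None
```

Reading the user this way touches no session state, so it takes no session lock. A user who edits "bob" into "alice" in their own cookie fails the MAC check and gets None back.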