Friday, November 23, 2018

A Debugging Session

With the company-wide deprecation of Perl, I’ve been rewriting memcache-dynamo in Python 3.  The decision has been made to use a proxy forever.  All of the other options require effort from programmers.  Worse, failures would be inconsequential in dev, but manifest in production.

I’d like to take a moment to walk through the process of debugging this new system, because I built it, I added tests, I felt good about it, and it simply didn’t work with either PHP or Perl clients.


The first thing I decided to do was capture a packet trace.  I wanted to see what was happening on the wire, so that I could understand how the server was really being used by these clients.  This was full of surprises.

PHP failed immediately and noisily.  There were a couple of unimplemented commands, and a validation exception from botocore where PHP used add with a 0-length value.  (This is a security feature added in 7.x, making sure the session ID it wants is not, through amazing bad luck, already in use.) Those were easy enough to build out, and a little tricky to test, but eventually, I got there.

Perl, though.  It was weird. It seemed to be reading and writing fine, except that it rejected all the reads as invalid for some reason.

I couldn’t see what else to do, so I looked carefully over everything with a fine-toothed comb.  This is the part of debugging where you just don’t know what’s happening.

(POST /login)
> GET cgi-fab31…
< VALUE cgi-fab31… 0 115 0
< END
> SET cgi-5334f… 0 623
< STORED
(GET /home)
> GET cgi-5334f…
< VALUE cgi-5334f… 0 623 0
< END
> SET cgi-492ac… 0 115
< STORED

Here, length 115 corresponds to a logged-out session, and 623 to a logged-in one.  What could be the problem?

If you, dear reader, want to do this as an exercise, look through the above trace.  Compare with the protocol document that I used to implement the server.

Below, I’ll continue, describing what I tried and explaining why it didn’t work, before revealing the real issue.

I thought it unlikely, but the first thing I checked was the stored values vs. the returned values.  They matched, bit-for-bit.  Furthermore, the embedded expiration timestamps were accurate (session created 4 seconds ago, etc.) so the session should clearly not be rejected by the data itself.

I noticed the expiration timestamp is 0.  Was that the issue?  I wrote some more tests to simulate Perl’s calls on the wire, but with expiration of 0, they still passed.  (As expected.  The server just converts this to 86400 because I need an expiration value, and that’s larger than our absolute session lifetime.)

Comparing the memcache-dynamo packet trace with another trace from real memcached, the other thing I saw was that END was being sent in a separate packet from VALUE and the value contents.  Surely, that wasn’t the issue?  I worked a bit on packing my writes and not flushing the stream in between, and finally verified the result.  It didn’t help.

Thank goodness, really.  I’d worry about my sanity if a packet boundary affected the protocol.

I began to wonder if there was a problem in a Perl library.  I checked out CGI::Session, Cache::Memcached, and CGI::Session::Driver::memcached, but none of them have had an update since 2012.

I started going through the packets byte-by-byte. The GET command had a trailing space on the line, but the server processes it with line.rstrip(), so that wasn’t an issue.

But then I noticed something else.  Real memcached was returning the line as:

< VALUE cgi-9b992… 0 623

Should I leave off the CAS value when it’s 0?  This didn’t make any sense, but I tried it, and it fixed Perl without harming PHP.

It also set the test suite on fire.  Any time aiomcache did a get, it would throw an exception, because the CAS value was missing from the response.  As it turns out, the library actually issues a gets command when the get() method is called.

The difference between the get and gets commands is not multiple-key support, but that gets returns CAS values, and get does not.  This is not indicated in the section about the commands in the protocol document; it’s mentioned in passing in the discussion of <cas unique>.  I took some notes on the protocol, and implemented from them, and missed this nuance.

The conspiracy was extended by Wireshark.  It behaved a little oddly, not showing data values for those improper responses, but otherwise not issuing any warnings.  (Unlike its “add must have a value” warning I had seen earlier.)

PHP also played into the conspiracy, either by sending gets commands itself, or ignoring the extraneous CAS value in the response.

In retrospect, it might have been clever of me to capture a packet trace of the test suite. Assuming Wireshark would show a gets command or response differently, that would have helped narrow the problem down.

But in the end, that’s what debugging is: generating ideas, testing them, then looking for more if the test proves them wrong.  I know “if you run out of ideas, then look harder” isn’t really great how-to advice?  But that’s all I’ve got.  Debugging is the most creative part of coding, and I don’t know how to teach creativity, either.

No comments: