Saturday, October 7, 2023

The Logging Tarpit

Back in August, Chris Siebenmann had some thoughts on logging.

A popular response in the comments was “error numbers solve everything,” perhaps paired with a vendor-provided list detailing every error number.

The first problem is: what if the error number changes?  When MySQL changed its duplicate-key error to report the key name instead of the key index, the error code changed from 1062 to 1586 along with it.  Code or tools that were monitoring for 1062 would never hit a match again.  Conversely, if “unknown errors” were being monitored for emergency alerts, the appearance of 1586 might get more attention than it deserves.
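A minimal sketch of how this bites a monitor; the log format, regex, and categories here are illustrative assumptions, not anyone’s real tooling:

    import re

    # Codes the monitor "knows about" -- written before MySQL added 1586
    # (ER_DUP_ENTRY_WITH_KEY_NAME) as a second duplicate-key code.
    KNOWN_DUPLICATE_KEY_CODES = {1062}

    def classify(line: str) -> str:
        m = re.search(r"Error_code: (\d+)", line)
        if m is None:
            return "ignore"
        code = int(m.group(1))
        if code in KNOWN_DUPLICATE_KEY_CODES:
            return "routine duplicate key"  # muted, handled elsewhere
        return "unknown error"              # escalates to an emergency alert

    print(classify("... Error_code: 1062"))  # routine duplicate key
    print(classify("... Error_code: 1586"))  # unknown error: a false alarm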

In other cases, the error numbers may not capture enough information to provide a useful diagnostic.  MySQL code 1586 may tell us that there was a duplicate value for a unique key, but we need the log message to tell us which value and key.  Unfortunately, that is still missing the schema and table!
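For instance, the message that accompanies 1586 looks roughly like this (the value and key name are made up):

    Duplicate entry 'alice@example.com' for key 'email'

That names the value and the key, but not which schema or table the key belongs to.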

Likewise, one day, my Windows 10 PC hit a blue screen, and the only information logged was the error code for a machine check, 0x3E.  The message “clarified” that this was a machine check exception with code 3e.  No running thread/function, no stack trace, no context.

Finally, logging doesn’t always capture intent fully.  If a log message is generated, is it because of a problem, or is it operationally irrelevant?  Deciding this is the real tarpit of log monitoring, and the presence of an error number doesn’t really make a difference.  There’s no avoiding the decisions.

In the wake of Chris’ posts, I changed one of our hacky workaround services to log a message if it decides to take action to fix the problem.  Now we have a way to find out whether the service is actually taking action, not simply that it is being started regularly.  Would allocating an error number (for all time) help with that?
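Roughly what that change looks like, with the service’s check and fix as hypothetical stand-ins:

    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("workaround-service")

    def problem_detected() -> bool:
        # Stand-in for the real check the service performs.
        return False

    def fix_the_problem() -> None:
        # Stand-in for the real repair action.
        pass

    def run_once() -> None:
        if not problem_detected():
            return  # nothing wrong, so nothing logged
        fix_the_problem()
        # The message that matters: the service actually took action,
        # not merely that it was started on schedule.
        log.warning("workaround applied: problem detected and corrected")

    if __name__ == "__main__":
        run_once()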

All of this ends up guiding my log monitoring philosophy: look at the logs sometimes, and find ways to clear the highest-frequency messages.  I don’t want dozens of lines of known-uninteresting messages clogging the log during incident response.  For example, “can’t load font A, using fallback B.”  We’d either install font A properly, or mute the message for font A specifically.  But I want to avoid trying to categorize every single message, because that way lies madness.
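A minimal sketch of the “highest-frequency messages” pass, assuming plain-text logs fed on standard input and a crude normalization step (both are assumptions, not a description of our real tooling):

    import re
    import sys
    from collections import Counter

    def normalize(line: str) -> str:
        # Crude template extraction: collapse numbers and quoted strings so
        # that variants of the same message fall into one bucket.
        line = re.sub(r"'[^']*'", "'...'", line)
        return re.sub(r"\d+", "N", line).strip()

    counts = Counter(normalize(line) for line in sys.stdin)

    # The noisiest templates are the first candidates to fix at the
    # source, or to mute explicitly.
    for template, count in counts.most_common(10):
        print(f"{count:8d}  {template}")

Surfacing the noisiest templates is the easy part; deciding whether each one is a problem or operationally irrelevant is still the judgment call described above.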