Sunday, May 19, 2024

Everything Fails, FCGI::ProcManager::Dynamic Edition

I have been reading a lot of rachelbythebay, which has led me to think about the reliability of my own company’s architecture.

It’s top of mind right now, because an inscrutable race condition caused a half-hour outage of our primary site.  It was a new, never-before-seen condition that slipped right past all our existing defenses.  Using Site as a placeholder for our local namespace, the preload code looked like this:

use Try::Tiny qw(try catch);
try {
  require Site::Response;
  require Site::Utils;
  ...;
} catch {
  # EX_PRELOAD => exit status 1
  exit_manager(EX_PRELOAD, $_);
};
...;
$res = Site::Response->new();

Well, this time (this one time), on both running web servers, it started throwing an error along the lines of Can't locate object method "new" via package "Site::Response" (perhaps you forgot to load "Site::Response"?).  Huh?  Of course I loaded it; I would’ve exited if that had failed.

In response, I added a lot more try/catch, improved exit_manager(), and set up a separate site-monitoring service that issues systemctl restart on the site if it starts rapidly cycling through workers.

Does the Module Have Code Inside?

First, the preload code has a new block at the end:

# Try::Tiny still in effect
try {
  require Site::Response;
  Site::Response->can('new') or die "can't Site::Response->new";
} catch { ... };

This checks the crucial modules used by the request loop for actual usability, in case this particular problem ever reappears.
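
The same check generalizes to a list of modules. Here is a minimal sketch, not the production code: unusable_modules() is a hypothetical helper, core modules stand in for the Site:: ones, and plain eval replaces Try::Tiny only to keep the example dependency-free.

```perl
use strict;
use warnings;

# Hypothetical helper: takes (module => method, ...) pairs and returns a list
# of human-readable problems; an empty list means everything looked usable.
sub unusable_modules {
    my (%required) = @_;
    my @problems;
    for my $module (sort keys %required) {
        my $method = $required{$module};
        # require with a string wants a file path, not a package name
        (my $file = "$module.pm") =~ s{::}{/}g;
        if (!eval { require $file; 1 }) {
            my $error = $@ || 'unknown error';
            chomp $error;
            push @problems, "can't load $module: $error";
            next;
        }
        # A require that "succeeded" is not enough: confirm the method exists.
        push @problems, sprintf("can't %s->%s", $module, $method)
            unless $module->can($method);
    }
    return @problems;
}

# Core modules stand in for Site::Response and Site::Utils here.
my @problems = unusable_modules(
    'File::Temp' => 'new',
    'IO::Handle' => 'new',
);
print @problems ? join("\n", @problems, '') : "preload OK\n";
```

In the real preload section, a non-empty list would be passed along to exit_manager(EX_PRELOAD, ...) instead of printed.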

After that point, every call to anything is wrapped in try/catch.  There are many more opportunities to stop the whole service if something is badly wrong.

A Better Request Loop

UPDATE 2024-06-25: I ripped all the SIGTERM stuff back out. It did not work in practice; the workers would no longer exit in response to the signal.  More research is required.

We have something of this shape now:

# Try::Tiny still in effect
my $fcgi;
try {
  $fcgi = FCGI::Request();
} catch {
  exit_manager(EX_FCGI_CREATE, $_);
};

while ($pm->pm_loop && $fcgi->Accept() >= 0) {
  # request handling goes here
}

warn "Request loop exited; shutting down\n";
CORE::exit(0);

Delayed Exit in the Worker

In testing all of this, I found that FCGI::ProcManager::Dynamic and/or Perl itself get confused about the status of the workers.  exit_manager used to simply warn-and-exit, but now it signals the manager and gives the manager a chance to signal the worker back:

my $MANAGER_PID = $$; # before calling pm_manage!
...;
sub exit_manager { my ($status, $message) = @_;
  warn "exit($status): ", $message;
  $SIG{TERM} = 'DEFAULT';
  kill TERM => $MANAGER_PID;
  sleep(10); # wait for the TERM back
  exit($status);
}

A SIGTERM received during sleep() will be processed by Perl, so we don’t have to do anything fancy in there.  But if it doesn’t get back to us, we’ll still exit within a bounded period of time.
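
To see that behavior in isolation, here is a small sketch. It is a standalone demo, not part of the service: term_during_sleep() is a hypothetical helper that forks a child which sleeps the way exit_manager does, sends it SIGTERM, and reports how the child ended.

```perl
use strict;
use warnings;
use POSIX qw(SIGTERM);

# Hypothetical demo helper: returns (terminating signal, exit status)
# for a child that receives SIGTERM while it is in sleep().
sub term_during_sleep {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        $SIG{TERM} = 'DEFAULT';
        sleep(10);          # the default TERM disposition ends the process here
        POSIX::_exit(1);    # bounded fallback if no signal ever arrives
    }
    select(undef, undef, undef, 0.2);  # give the child time to reach sleep()
    kill TERM => $pid;
    waitpid($pid, 0);
    return ($? & 127, $? >> 8);
}

my ($signal, $status) = term_during_sleep();
print "child ended by signal $signal, exit status $status\n";
```

The low seven bits of $? carry the terminating signal, so a child killed by TERM reports SIGTERM there and never reaches its own exit call.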

Monitoring

The external monitoring service, responsible for issuing systemctl restart on our FastCGI service when necessary, tails the journal and reads the timestamp on messages with “increase workers to minimal” in them.  This happens when a worker exits, but without signaling the whole manager to exit with it.

If the workers start crashing rapidly on startup, the manager will desperately try to keep starting them.  It doesn’t have any control over when it should give up entirely.  This will spin the CPU and flood the journal.

A worker that manages to enter the request loop, but then fails, doesn’t spin the CPU as hard (it needs a request to come in first), but the site is just as broken for the users.  Such an event puts the same message in the journal.

In both cases, we want to restart the whole manager, so the monitoring task simply waits until, say, 10 unexpected failures are seen within 90 seconds, before issuing restart.  Then, it ignores failure messages for the next 30 seconds, to give the service time to come back up.
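
That decision logic fits in a few lines. What follows is a sketch under assumptions, not the actual monitoring service: RestartPolicy and record_failure() are hypothetical names, timestamps are plain epoch seconds, and reading the journal and running systemctl are left out.

```perl
use strict;
use warnings;

# Hypothetical restart policy; the defaults mirror the numbers in the text.
package RestartPolicy;

sub new {
    my ($class, %opt) = @_;
    return bless {
        max_failures => $opt{max_failures} // 10,
        window       => $opt{window}       // 90,
        cooldown     => $opt{cooldown}     // 30,
        failures     => [],
        quiet_until  => 0,
    }, $class;
}

# Record one failure seen at epoch time $t; returns 1 when a restart is due.
sub record_failure {
    my ($self, $t) = @_;
    return 0 if $t < $self->{quiet_until};    # still in cooldown: ignore it
    my $oldest = $t - $self->{window};
    # Keep only failures inside the sliding window, plus the new one.
    $self->{failures} = [ grep { $_ >= $oldest } @{ $self->{failures} }, $t ];
    return 0 if @{ $self->{failures} } < $self->{max_failures};
    $self->{failures}    = [];
    $self->{quiet_until} = $t + $self->{cooldown};
    return 1;
}

package main;

my $policy    = RestartPolicy->new;
my @decisions = map { $policy->record_failure($_) } 1 .. 10;
print "restart after 10th failure: $decisions[9]\n";
```

Keeping the policy free of I/O means the thresholds can be exercised with made-up timestamps, without a journal or a running service.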

Overall

I hope that’s enough guardrails to keep the system on the road.  But, I also hope to be done with Perl at some point in the future…

I half-wish I had used CGI::Emulate::PSGI, but on the other hand, I have great control over the failure behavior in this system.  If Starman started cycling workers rapidly, could I have done anything about it in-process? Or would I have needed to go straight to the external monitoring process?
