Sunday, November 17, 2024

Fixing a Random ALB Alarm Failure

tl;dr: if an Auto Scaling Group’s capacity is updated on a schedule, the max instance lifetime is an exact number of days, and instances take a while to reach a healthy state after launching, then Auto Scaling can terminate running, healthy instances before new instances are ready to replace them.

I pushed our max instance lifetime 2 hours further out, so that the max-lifetime terminations happen well after scheduled launches.
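As a rough sketch of that change: the max instance lifetime is configured in seconds, so the new value is a whole number of days plus a two-hour offset. The 7-day base and the group name are assumptions for illustration, and `max_lifetime_seconds` and `apply_max_lifetime` are hypothetical helpers. The boto3 call is `update_auto_scaling_group` with its `MaxInstanceLifetime` parameter.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR


def max_lifetime_seconds(days: int, extra_hours: int = 2) -> int:
    """Whole days plus an offset, so expiry lands after the scheduled launch."""
    return days * SECONDS_PER_DAY + extra_hours * SECONDS_PER_HOUR


def apply_max_lifetime(group_name: str, seconds: int) -> None:
    """Push the new lifetime to the group (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the arithmetic above works without the SDK

    client = boto3.client("autoscaling")
    client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,  # assumed name, e.g. "front-end"
        MaxInstanceLifetime=seconds,
    )


# Assumed 7-day base lifetime: 604800 seconds becomes 612000.
print(max_lifetime_seconds(7))  # 612000
```

The offset only needs to be longer than the time new instances take to become healthy; two hours is simply the margin I picked.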

At work, we run multiple front-end instances during the work day, with a reduced number during off hours.  At night, we have a maximum of one instance per configured availability zone.

We sometimes received our “no healthy instances” alert during the early-morning “launch additional instances” window.  When I finally dug into it, I saw the existing instances had logged a “power button pressed short” event, and duly shut down.

That was confusing.  I quickly learned that this power-button emulation is how AWS signals an EC2 instance to shut itself down during termination, but that didn’t explain why the instances were being terminated.

I finally found the culprit in the Auto Scaling activity log: the group’s configuration had a max instance lifetime set to an exact number of days.

Why was it set at all?  When AWS terminates instances for the nightly scale-in, the “oldest instance” rule isn’t region-wide.  Instead, AWS picks a random availability zone, and then shuts down the oldest instance within that zone.  (Our zones have an equal number of instances by day.  YMMV.)  In theory, the randomness may never pick a particular zone, allowing the instances within it to get very old.
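The selection rule described above can be sketched like this. It is a toy model of the behavior as I understand it, not AWS’s actual implementation; `pick_instance_to_terminate` and the instance tuples are made up for illustration.

```python
import random
from datetime import datetime


def pick_instance_to_terminate(instances, rng):
    """Pick a random zone, then the oldest instance inside that zone.

    instances: list of (instance_id, zone, launch_time) tuples.
    """
    zones = sorted({zone for _, zone, _ in instances})
    zone = rng.choice(zones)
    in_zone = [inst for inst in instances if inst[1] == zone]
    return min(in_zone, key=lambda inst: inst[2])


instances = [
    ("i-a-old", "zone-a", datetime(2024, 11, 1, 6, 0)),
    ("i-a-new", "zone-a", datetime(2024, 11, 10, 6, 0)),
    ("i-b-old", "zone-b", datetime(2024, 10, 20, 6, 0)),
    ("i-b-new", "zone-b", datetime(2024, 11, 10, 6, 0)),
]

# "i-b-old" is the oldest overall, but it is only chosen if zone-b is drawn.
victim = pick_instance_to_terminate(instances, random.Random())
print(victim[0])
```

Nothing in this rule bounds how long a zone can go unpicked, which is the gap a max instance lifetime is meant to cover.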

Thus, the max instance lifetime had been set to provide a backstop; if an instance’s zone weren’t selected for a week, it would still be terminated and replaced.

When these two systems interacted, it caused the problem: long-running, healthy instances would be shut down, based on their initial launch times, before the new ones were up and running.  This left us below our minimum level of service for a brief period, triggering the alert.  Of course, by the time we got the alert and looked, everything would be “back to normal,” because those new instances would be healthy.
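The timing can be sketched with a little date arithmetic. The dates, the 6:00 launch time, the 7-day lifetime, and the 10-minute warm-up are all assumed values, and `coverage_gap` is a hypothetical helper; the model is deliberately simplified to one old instance and one replacement.

```python
from datetime import datetime, timedelta


def coverage_gap(old_launch, lifetime, new_launch, warmup):
    """Time between the old instance expiring and the new one being healthy."""
    old_terminated = old_launch + lifetime
    new_healthy = new_launch + warmup
    return max(new_healthy - old_terminated, timedelta(0))


old_launch = datetime(2024, 11, 4, 6, 0)   # started by a scheduled morning launch
new_launch = datetime(2024, 11, 11, 6, 0)  # same schedule, exactly 7 days later
warmup = timedelta(minutes=10)             # assumed time to reach healthy state

# Lifetime of exactly 7 days: the old instance expires as the new one boots.
print(coverage_gap(old_launch, timedelta(days=7), new_launch, warmup))  # 0:10:00

# Lifetime of 7 days + 2 hours: the new instance is healthy long before expiry.
print(coverage_gap(old_launch, timedelta(days=7, hours=2), new_launch, warmup))  # 0:00:00
```

Because the old instance’s launch time and the schedule share the same time of day, an exact-day lifetime lines the expiry up with the next scheduled launch every time it comes due.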

Moving max-lifetime terminations later in the day still reduces the service level when they happen, but with the scheduled daytime instances already running, it is not enough to send an “everything failed!!!11!!one” type of alert.
