Sunday, February 8, 2015

Pointless metrics: Uptime

Back in the day when I was on Slashdot, a popular pastime was making fun of Windows. The 9x line had a millisecond tick counter that rolled over after about 49.7 days (2^32 milliseconds), which would hang the system. Linux could stay up, like, forever dude, because Unix is just that stable.

So I spent a while thinking that ‘high uptime’ was a good thing. Once upon a time, I was annoyed at a VPS provider that rebooted my system approximately monthly for security upgrades, because they were using Virtuozzo and needed to really, seriously, and truly block privilege escalations.

About monthly. Uncomfortably close to that 49.7-day mark…

I thought that was bad, but nowadays, the public-facing servers my employer runs live for less than a week. Maybe a whole week if they get abnormally old before Auto Scaling gets around to culling them. And I’m cool with this!
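(Spotting the stragglers is a quick audit. Here’s a minimal sketch using boto3; the seven-day threshold and the region are illustrative assumptions, not anything our setup actually enforces.)

    # Minimal audit sketch: list running EC2 instances older than a week.
    # Assumes boto3 is installed and AWS credentials are configured;
    # the 7-day threshold and the region are illustrative, not from the post.
    from datetime import datetime, timedelta, timezone

    import boto3

    MAX_AGE = timedelta(days=7)
    ec2 = boto3.client("ec2", region_name="us-east-1")

    now = datetime.now(timezone.utc)
    for reservation in ec2.describe_instances(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]:
        for instance in reservation["Instances"]:
            # LaunchTime comes back as a timezone-aware UTC datetime
            age = now - instance["LaunchTime"]
            if age > MAX_AGE:
                print(f"{instance['InstanceId']} is {age.days} days old -- overdue for culling")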

I try to rebuild an image monthly even if “nothing” has happened; definitely whenever upstream releases a new base image; and sometimes just because I know “major” updates landed in our repos (e.g., I just ran composer update everywhere) and it’ll save some time in the self-update scripts when the image relaunches.
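(To make that concrete: the self-update step is roughly “pull the latest code, sync dependencies” at boot. A hypothetical sketch follows; the path and commands are illustrative, not our real scripts. The point is that a freshly baked image leaves both steps almost nothing to do, so instances come up faster.)

    # Hypothetical boot-time self-update sketch; APP_DIR and the exact
    # commands are illustrative assumptions. A freshly rebuilt image means
    # both steps find almost nothing new, so relaunch is quick.
    import subprocess

    APP_DIR = "/srv/app"  # assumed deployment path

    def run(*cmd):
        print("+", " ".join(cmd))
        subprocess.run(cmd, cwd=APP_DIR, check=True)

    # Fetch whatever shipped since the image was baked
    run("git", "pull", "--ff-only")
    # Sync PHP dependencies with the lockfile
    run("composer", "install", "--no-dev", "--no-interaction")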

It turns out that the ‘uptime’ the Slashdot crowd were so proud of was basically counterproductive. I do not want to trade security, agility, or anything else just to make that number larger. There is no benefit to it! Nothing improves solely because the kernel has been running longer, and if something does, then the kernel needs to be fixed so it provides that improvement immediately.

And if the business is structured around one physical system that Must Stay Running, then whatever runs on that server is crucial enough to deserve redundancy, backups, failovers… and regular testing of those failovers by switching onto fresh hardware with fresh, low uptimes.
