Sunday, September 21, 2025

Reflections on Breaking Something

Last week, I deployed some code, and the website promptly started exhibiting impossible behavior.  Ultimately, it was all my fault, because my design sensibilities have changed over time.

Distant past me figured commands would be shorter if we left the .service suffix off unit names.  The suffix would be added automatically at the boundary, when actually invoking a systemctl command.

Present me is less tolerant of magic, hates checking in several places whether to add .service, and worries about whether the code works with other types of systemd units.

Hence, when I recently updated our deployment code, it also began passing the full service name to the reload script.  That script is a wrapper that sits between the deployment user and systemctl.  The deployment user has sudo rights to the script, which can only run very specific systemctl commands, and which validates the unit name against an “allowed” list.

For simplicity, and because it is run through sudo, this wrapper script had zero magic.  It expected the caller to give it an abbreviated name, to which it would add .service itself.  The change to the deployment code then broke that handoff.  Tests didn’t catch it, not only because there are none, but because the wrapper script lives outside the deployment repository.  It’s an externally provided service.
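As a sketch, the original zero-magic wrapper might have looked something like this.  The function name and the allowed list are illustrative, not the actual script:

```shell
#!/bin/sh
# Illustrative sketch of the original wrapper: the caller passes an
# abbreviated name like "website", and the script appends .service
# itself after checking the name against an allowed list.
set -eu

validate_unit() {
    case "$1" in
        website) printf '%s.service\n' "$1" ;;   # allowed list (one entry here)
        *) echo "unit not allowed: $1" >&2; return 1 ;;
    esac
}

# The real script would then run something like:
#   systemctl reload "$(validate_unit "$1")"
```

Passing the full name `website.service` falls through to the `*` case, so the reload is refused outright.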

Consequently…

The “impossible phenomena” happened because the new files, including templates, were unpacked while the failed reload left the old code running.  The old code didn’t set the variables the new templates relied on, so the parts of the pages using those variables malfunctioned.  I had a lot of difficulty reproducing this effect, because out of habit, I restart the daemon with sudo systemctl ... after making code changes in dev.  I don’t use the wrapper script.  (Maybe I should.)

The first thing to do was fix the wrapper script to accept names with the .service suffix.
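In shell, that fix is essentially a one-liner: strip any existing .service suffix, then add it back, so the result always carries exactly one suffix no matter which form the caller used.  A sketch (function name invented for illustration):

```shell
# Accept both "website" and "website.service".  The POSIX ${var%pattern}
# expansion removes a trailing .service if present; printf re-adds it,
# so both spellings normalize to the same unit name.
normalize_unit() {
    printf '%s.service\n' "${1%.service}"
}
```

With this in place, the allowed-list check can compare against the normalized name, and neither the old nor the new deployment code breaks.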

But after that, the biggest thing is that the deployer needs to cancel the operation and issue a rollback if the final reload fails.  This will restore consistency between the old code and the original files on disk.
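A sketch of that rollback logic, assuming the common repoint-a-symlink deployment layout.  The paths are hypothetical, and the reload command is passed in as a parameter purely to keep the sketch self-contained:

```shell
# Hypothetical deploy step: repoint the "current" symlink to the new
# release, run the reload, and if the reload fails, restore the old
# symlink so the running code and the files on disk stay consistent.
deploy() {
    release="$1"
    reload="$2"
    prev=$(readlink current)        # remember the old release
    ln -sfn "$release" current      # switch to the new one
    if ! "$reload"; then
        ln -sfn "$prev" current     # reload failed: roll back
        return 1
    fi
}
```

In the real deployer, `"$reload"` would be the sudo call to the wrapper script, e.g. `sudo local-reload website.service`.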

I might also be able to improve robustness overall by using a relative path for the template root dir.  If we stay in a working directory below the symlink that is updated during deployment, instead of traversing that symlink on an absolute path, we’ll always get the templates that correspond to the running code. However, that’s more subtle and tricky than issuing a rollback, and hence, more likely to get broken in the future.
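The effect is easy to demonstrate with throwaway directories: a process whose working directory sits below the symlink keeps seeing the release it started in, even after a deploy repoints the symlink.

```shell
tmp=$(mktemp -d)
cd "$tmp"
mkdir -p releases/old releases/new
echo old-template > releases/old/page.tmpl
echo new-template > releases/new/page.tmpl
ln -s releases/old current

cd current                            # the daemon's working directory
ln -sfn releases/new "$tmp/current"   # a deploy swaps the symlink

cat page.tmpl                         # -> old-template
```

The relative open of page.tmpl still resolves against the old release, because the working directory was pinned when the symlink was first traversed; only a fresh absolute lookup through $tmp/current sees the new files.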

I like the sudo local-reload website.service approach.  The script can be tracked in version control easily, and the sudoers file remains as short and simple as possible.  Meanwhile, the deployment user isn’t given broad access to the entire set of subcommands that systemctl has to offer.
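In sudoers terms, that keeps the whole grant down to a single line; the username and script path here are assumptions, not the actual configuration:

```
deploy ALL=(root) NOPASSWD: /usr/local/sbin/local-reload
```

All of the actual policy (which units, which subcommands) lives in the script, where it can be reviewed and versioned like any other code.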
