Sunday, June 23, 2024

Sorted by What?

I came across some ancient code of the following form (shortened for illustrative purposes):

SELECT DATE_FORMAT(o.created_at, '%c/%e/%Y') AS dt,
    DATE_FORMAT(c.updated_at, '%l:%i %p') AS tm,
    …
FROM orders o JOIN customers c … WHERE …
ORDER BY c.last_name, dt DESC, tm;

The ORDER BY dt caught my eye because it’s not an actual column in any table. Its value turns out to be the American-style “6/23/2024” format, which is reasonable to display, but completely wrong to sort on.  Compared as text, October comes before February, as “10” begins with “1”, which is less than “2”.
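The effect is easy to reproduce outside the database. This is a small sketch using sort(1) rather than MySQL; the sample dates are made up, and only their M/D/YYYY shape matches what '%c/%e/%Y' produces.

```sh
#!/bin/sh
# Hypothetical sample values in the M/D/YYYY shape that '%c/%e/%Y' produces.
dates='2/14/2024
10/5/2023
6/23/2024
10/31/2024'

# Plain text sort: compares character by character, so "10/..." precedes "2/...".
as_text() { printf '%s\n' "$dates" | LC_ALL=C sort; }

# Chronological sort: split on "/" and compare year, month, day as numbers.
as_dates() { printf '%s\n' "$dates" | LC_ALL=C sort -t/ -k3,3n -k1,1n -k2,2n; }

as_text
as_dates
```

The text sort prints 10/31/2024, 10/5/2023, 2/14/2024, 6/23/2024; the numeric sort prints 10/5/2023, 2/14/2024, 6/23/2024, 10/31/2024. Sorting on the underlying timestamp column avoids the problem entirely.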

I cannot guess why it pulls the time from an unrelated column as tiebreaker, sorting it the other direction.  Its remaining problems likely have the same cause as the date’s: it is compared as formatted text, so “1:00 PM” lands before “9:00 AM”.

I assume the chaotic arrangement of orders within a customer was never raised as a concern only because duplicating orders would be rare enough—ideally, never happening—that it didn’t matter.

Nonetheless, I queued a change to sort on the full date+time held in created_at, so that records will be fully chronological in the future.

Sunday, June 16, 2024

Availability and Automatic Responses

At work, we have built up quite a bit of custom monitoring, for example. It’s all driven by things failing in new and exciting ways.

Before that particular day, there were other crash-loop events with that code, which mainly manifested as “the site is down” or “being weird,” and which showed up in the metrics we were collecting as high load average (loadavg.) Generally, we’d get onto the machine—slowly, because it was busy looping—see the flood in the logs, and then ask Auto Scaling to terminate-and-replace the instance.  It was the fastest way out of the mess.

This eventually led to the suggestion that the instance should monitor its own loadavg, and terminate itself (letting Auto Scaling replace it) if it got too high.
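For illustration, the suggested check could have looked roughly like the sketch below. The threshold of 8.0 and the function names are assumptions, not anything we actually ran, and the sketch stops at printing a message instead of terminating anything.

```sh
#!/bin/sh
# Succeeds (exit 0) when the given load average exceeds the given threshold.
# awk does the comparison because sh arithmetic is integer-only.
load_too_high() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load + 0 > limit + 0) }'
}

# On Linux, the first field of /proc/loadavg is the 1-minute average.
current_load() {
    cut -d' ' -f1 /proc/loadavg
}

# Hypothetical threshold; a real check would go on to terminate the instance.
if [ -r /proc/loadavg ] && load_too_high "$(current_load)" 8.0; then
    echo "load is too high"
fi
```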

We didn’t end up doing that, though.  What if we had legitimate high CPU usage?  We’d stop the instance right in the middle of doing useful work.

Instead, during that iteration, we built the exit_manager() function that would bring down the service from the inside (for systemd to replace) if that particular cause happened again.

Some other time, I accidentally poisoned php-fpm.  The site appeared to run fine with the existing pages.  However, requests involving the newly updated extension would somehow both generate a segfault, and tie up the request worker forevermore.  FPM responded by starting up more workers, until it hit the limit, and then the entire site was abruptly wedged.

It became a whole Thing because the extension was responsible for EOM reporting that big brass was trying to run… after I left that evening.  The brass tried to message the normally-responsible admin directly.  It would have worked, but the admin was strictly unavailable that night, and they didn’t reach out to anyone else.  I wouldn’t find out about the chaos in my wake until reading the company chat the next morning.

Consequently, we also have a daemon watching php-fpm for segfaults, so it can run systemctl restart from the outside if too many crashes are happening.  It actually does have the capability to terminate the instance, if enough restarts fail to alleviate the problem.
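The escalation logic amounts to a small decision function. This is a sketch of the idea only, not the daemon's actual code; the thresholds and the decide_action name are invented for illustration.

```sh
#!/bin/sh
# All thresholds here are hypothetical, chosen only for illustration.
MAX_CRASHES=5     # segfaults tolerated per observation window
MAX_RESTARTS=3    # restarts attempted before giving up on the instance

# Prints the action to take, given the segfaults seen in the current window
# and the number of restarts already attempted without curing the problem.
decide_action() {
    crashes=$1
    restarts=$2
    if [ "$crashes" -le "$MAX_CRASHES" ]; then
        echo none
    elif [ "$restarts" -lt "$MAX_RESTARTS" ]; then
        echo restart      # i.e. systemctl restart of the php-fpm unit
    else
        echo terminate    # let Auto Scaling replace the instance
    fi
}
```

With these numbers, decide_action 9 1 prints "restart" and decide_action 9 3 prints "terminate".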

I’m not actually certain if that daemon is useful or unnecessary, because we changed our update process.  We now deploy new extension binaries by replacing the whole instance.

Having a daemon which can terminate the instance opens a new failure mode for PHP: if the new instance is also broken, we might end up rapidly cycling whole instances, rather than processes on a single instance.

Rapidly cycling through main production instances will be noticed and alerted on within 24 hours.  It has been a long-standing goal of mine to alert on any scaling group’s instances within 15 minutes.

On the other hand, we haven’t had rapidly-cycling instances in a long time, and the cause was almost always crash-looping on startup due to loading unintended code, so expanding and improving the system isn’t much of a business priority.

It doesn’t have to be well-built; it just has to be well-built enough that it never, ever stops the flow of dollars.  Apparently.

Sunday, June 9, 2024

My firewall, as of 2024

On my old Ubuntu installation, I had set up firewall rules to keep me focused on things (and to keep software in line, like blocking plain DNS to require DoT to Cloudflare.)

Before doing a fresh installation, I saved copies of /etc/gufw and /etc/ufw, but they didn’t turn out to be terribly useful.  I don’t know what happened, but some of the rules lost address information.  The ruleset ended up allowing printing to the whole internet, for instance.

I didn’t have a need for profiles (I don’t take my desktop to other networks), so I ended up reconstructing it all as a script that uses ufw, and removing gufw from the system entirely (take that!)

That script looks in part like this:

#!/bin/sh
set -eufC
# -- out --
ufw default reject outgoing
ufw allow out 443/udp comment 'HTTP 3'
ufw allow out 80,443/tcp comment 'Old HTTP'
ufw allow out proto udp \
    to 224.0.0.251 port 5353 \
    comment 'mDNS to LAN for printing'
ufw allow out proto tcp \
    to 192.168.0.251 port 631,9100 \
    comment 'CUPS to Megabrick'
ufw allow out on virbr0 proto tcp \
    to any port 22 comment 'VM SSH'
# -- in --
ufw default deny incoming
ufw allow in 9000:9010/tcp \
    comment 'XDebug listener'

This subset captures all of the syntax I’m using: basic and advanced forms, and all of the shapes of multi-port rules.  One must use the ‘advanced’ form to specify address or interface restrictions.  However, ufw is extremely unhelpful about error messages, usually only giving out “wrong number of arguments.”  The typical recourse is either to look harder at the man page syntax, or to try to roll back conditions until it gets accepted.

For deleting those test rules, the best way is ufw status numbered followed by ufw delete N where N is the desired rule number.  (You can also do ufw reset and start over.)

Note that the ufw port range syntax is “low:high” with a colon, like iptables. For example, 9000:9010 is a range of 11 ports; 9000,9010 is a list of only those two ports.
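To double-check the arithmetic, here is a throwaway helper that counts the ports a spec covers. It is not part of ufw; count_ports is a name I made up, and it only understands the comma and colon forms described above.

```sh
#!/bin/sh
# Counts the ports covered by a ufw-style port spec:
# comma-separated items, each either a single port or a low:high range.
count_ports() {
    printf '%s\n' "$1" | tr ',' '\n' | awk -F: '
        NF == 2 { total += $2 - $1 + 1; next }   # low:high, inclusive
        NF == 1 { total += 1 }                   # single port
        END { print total + 0 }'
}

count_ports 9000:9010   # range: prints 11
count_ports 9000,9010   # list: prints 2
```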

(I gave the printer a static IP because Windows; thus, the printer’s static IP appears in the ruleset.)

This script, then, only has to be run once per fresh install; after that, ufw will remember these rules and apply them at boot.

Sunday, June 2, 2024

Stateful Deployment was Orthogonal

I used to talk about “stateful, binary” deployment, thinking that both things would happen together:

  1. We would deploy from a built tarball, without any git pull or composer install steps
  2. We would record the actual version (or whole tarball path) that was deployed

This year, auto-deploy picked up pushed code that wasn’t ready often enough that we finally decided we had to solve that issue. It turned out to be unimportant that we weren’t deploying from tarballs.

We introduced a new flag for “auto mode” for the instance-launch scripts to use. Without the flag, deployment happens in manual mode: it performs the requested operation (almost) as it always has, then writes the resulting branch, commit, and (if applicable) tarball overlay as the deployed state.

In contrast, auto mode simply reads the deployed state, and applies that exact branch, commit, and overlay as requested.

I say “simply,” but watch out for what happens to a repository which doesn’t have any state stored.  This isn’t a one-time thing: when adding new repositories later, their first deployment won’t have state yet, either.  This can disrupt both auto and manual deployments.
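A minimal sketch of the two halves, with the missing-state case made explicit. The storage layout is an assumption for illustration (one file per repository under a hypothetical STATE_DIR, holding "branch commit overlay" on one line); the function names are invented as well.

```sh
#!/bin/sh
# Hypothetical layout: one file per repository under STATE_DIR, holding
# "branch commit overlay" on a single line ("-" when there is no overlay).
STATE_DIR=${STATE_DIR:-/var/lib/deploy-state}

# Manual mode: after deploying, record what was deployed.
write_state() {   # usage: write_state REPO BRANCH COMMIT [OVERLAY]
    mkdir -p "$STATE_DIR" &&
    printf '%s %s %s\n' "$2" "$3" "${4:--}" > "$STATE_DIR/$1"
}

# Auto mode: print the recorded state, or fail if the repository has none yet.
read_state() {    # usage: read_state REPO
    if [ ! -s "$STATE_DIR/$1" ]; then
        echo "no deployed state for $1" >&2
        return 1
    fi
    cat "$STATE_DIR/$1"
}
```

The point of the non-zero return is that a launch script has to decide what a missing state means (fail loudly, or fall back to some default) instead of deploying an empty branch name by accident.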