Sunday, July 21, 2024

The Case of the Unknown Errors

For a number of reports, we did the lazy thing: the job prints errors on stderr, and we let cron email them to us.

Unfortunately, email is unreliable, and transient.  If a remote system accepts the message from us, then drops it for anti-spam reasons, we don’t have a log of that, nor a copy to resend.  We noticed these problems with report data, and now all output files are archived to S3.  However, the cron emails are the only source of truth for errors, so if we don’t get them, they’re lost forever!

I think the solution will be changing the error_log() calls to syslog().  That will create an on-host record, then forward it to the central server for searching and archiving.  We can even still get cron emails (normally) if we include the flag to print the messages to stderr.
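The PHP flag for that is LOG_PERROR, given to openlog().  For jobs that are shell scripts rather than PHP, logger(1) has the same idea in its -s option.  A minimal sketch, where the report-job tag and the log_error name are made up for illustration, and LOGGER exists only so the function can be tried without a syslog daemon:

```shell
#!/bin/sh
# Hypothetical helper for a cron job: record an error in syslog and
# also print it on stderr, so cron still has something to email.
# LOGGER is overridable purely for testing; "report-job" is a made-up tag.
log_error() {
    "${LOGGER:-logger}" -s -t report-job -p user.err -- "$*"
}

# Usage inside a job (left as a comment so sourcing this has no side effects):
#   generate_report || log_error "report generation failed"
```

With -s, the message goes to syslog and to stderr, so the on-host record exists even if the email never arrives.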

I’m just kind of surprised that I have left a “can’t get email errors about email errors” loop in production for over a decade.

Wednesday, July 17, 2024

Reverting Flatpaks

Today, I learned a lot about Flatpak, motivated by Thunderbird giving me an error at startup.  The error message said that I had used the profile with “a newer version of Thunderbird,” and now it was not compatible.  It gave me the choice of creating a new profile, or quitting.  This was incredibly confusing, since I simply ran flatpak update as usual this morning, and took whatever was there.

That turned out to have been 115.13.0.  By the end of the day, it was “resolved” in the sense that a new version was published on Flathub (128.0esr) and that version was capable of opening my profile.

In the meantime, I learned more commands:

flatpak history

flatpak history produces a list of install/update activity, but it does not have version information at all. We can at least get the commit by asking for it specifically:

flatpak history --columns time,change,app,commit

Or with a sufficiently large terminal window (1920+ pixels wide), try using --columns all.

flatpak remote-info --log

It turns out that the remote can—and Flathub in particular does—keep older versions around for “a while.”  We can get these older versions, with their full commit IDs, by using flatpak remote-info with the repository and the package name.

flatpak remote-info --log flathub \
    org.mozilla.Thunderbird | less

(Line wrapped for readability on mobile; remove the backslash and put it all on one line.)

This prints out some nice header information, then the commit, followed by History.  For Thunderbird in particular, as I write this, the top 3 are:

  • Commit: 2131b9d618… on 2024-07-17 16:26:34 +0000
  • Commit: c2e09fc595… on 2024-07-16 18:58:52 +0000 (this is the one I installed this morning, about 2024-07-17 13:00:00 +0000.)
  • Commit: 2151b1e101… on 2024-07-11 18:18:41 +0000 (which I installed 2024-07-12)

I opened a second terminal window, so I could copy and paste the full commit IDs between them, while experimenting.

sudo flatpak update --commit

Now that we have our old version, how do we install it?  Let’s assume the most-recent version wasn’t published yet, and I just wanted to roll back to my version from 2024-07-12.  We’ll pass its hash to the update command, and run it with root privileges:

sudo flatpak update --commit=2151b1e101… \
    org.mozilla.Thunderbird

Of course I did it without sudo at first, but after confirming, it failed, stating I needed root privileges.  I guess it makes sense (they don’t want someone who doesn’t know my password to downgrade to a known-vulnerable app and then exploit it) but I’m also miffed that it couldn’t tell me this before confirmation.

Anyway, one quick test later, I had my email again.

Versions in Flatpak

After the rollback, I checked the Thunderbird version through Help → About: it was “128.0esr, Mozilla Flatpak version 1.0”.

I followed up with a plain flatpak update org.mozilla.Thunderbird to get the latest 128.0esr build, and verified that was able to access my email as well.  I checked the version again in Help → About: it was “128.0esr, Mozilla Flatpak version 1.0”.

That’s why flatpak update and flatpak history don’t have version numbers at all.  Version strings don’t come with any guarantees of clarity or accuracy: two different builds both reported exactly the same one.

What I didn’t learn

I might have been able to give the short commit ID (from flatpak history) directly to flatpak update without going through flatpak remote-info --log in between.  I didn’t actually try it.

I kept trying to find information about branches, to see if there was a Thunderbird beta branch I could try since stable was broken, but I never did find any information about that.  There’s some build-related documentation about how to specify the branch during build, but absolutely nothing about listing available branches.

I also didn’t find anything about this situation in web searches.  How did version 115.x get pushed after 128.x?  Why did it take 21 hours to get it fixed?  Where would I find out whether Mozilla even knew about the problem?  I discovered it around 15:00, and couldn’t tell if anyone else was having the issue!

There’s a “Subject” in the flatpak remote-info --log data for each commit, but it is invariably “Export org.mozilla.Thunderbird”, so that didn’t add any signal, either.

Sunday, July 14, 2024

fopen() modes vs. Unix modes

PHP has a function for creating temporary files, tempnam. One limitation is that it only takes a filename prefix, and most often, I want to have a file “type” as the suffix, like “report-20240701-RANDOM.csv”.

new SplTempFileObject() creates a file in memory, which isn’t usable for anything where an actual “file on disk” is required.  The related tmpfile() function does not give access to the file name itself.

Meanwhile, fopen() and new SplFileObject() don’t offer control of the Unix permissions.  We can create files in exclusive-write mode by setting the mode argument to “x”, but we can’t pass 0o600 (rw- --- ---) at that stage.  We have to create the file, and if it works, call chmod() separately.

fopen() and anything modeled on it offer a context parameter, but there are no context options for the file: scheme, only for other stream wrappers.

Underneath fopen()—at least on Linux—is the open syscall.  That call accepts a mode_t mode argument, to indicate what Unix permissions to use when creating a file, which is exactly what we are after.  But thanks to history and standards, we can’t access that directly from PHP now.
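For contrast, a shell script can get both behaviors at creation time, because the umask applies to the open call itself.  This is a sketch of the shell equivalent, not a PHP workaround; create_private_file is a name I made up:

```shell
#!/bin/sh
# Create an empty file that must not already exist, with permissions
# limited to the owner. The function body is a subshell, so the umask
# and noclobber changes do not leak into the caller.
create_private_file() (
    umask 077   # new files get at most rw- --- ---
    set -C      # noclobber: ">" refuses to overwrite an existing file
    : > "$1"
)
```

Calling it a second time with the same path returns a non-zero status instead of truncating the file.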

P.S.: there’s actually another possibility: we can rename() the file from tempnam() to add a suffix in the same directory.  One caveat: on Unix, rename() replaces an existing target file rather than failing, so if an attacker can observe our original file and create the target file with something unexpected, we won’t get an error; their file just gets replaced by ours.  If tempnam() didn’t give us the permissions we wanted, though, we’d be out of luck with that, and it’s still a two-step process.
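In shell terms, the same two-step dance looks like the sketch below.  It assumes a mktemp that accepts a template, as the GNU and BSD versions do, and it uses ln for the second step because a hard link fails if the target name is already taken.  make_temp_with_suffix is a hypothetical helper:

```shell
#!/bin/sh
# Sketch: make a private temp file, then give it a suffix without
# replacing anything that already has that name.
# Usage: make_temp_with_suffix DIR PREFIX SUFFIX   (prints the final path)
make_temp_with_suffix() {
    tmp=$(mktemp "$1/$2-XXXXXX") || return 1
    # ln (without -f) fails if the target name already exists.
    if ln "$tmp" "$tmp$3"; then
        rm -f "$tmp"
        printf '%s\n' "$tmp$3"
    else
        rm -f "$tmp"
        return 1
    fi
}
```

For example, make_temp_with_suffix /tmp report .csv prints something shaped like /tmp/report-RANDOM.csv.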

Sunday, July 7, 2024

Revisiting Backups

Since comparing DejaDup and Pika Backup for work, I’ve also used KDE backup/kup at home, and gotten a little more experience with both systems.  How are things going?

Pika Backup

Pika has a firm internal idea of the schedule.  At the end of a vacation, where the normal weekly backup was skipped, I manually asked it to run the backup “6 days late / 1 day early”, hoping I could skip it the next day.  No such luck: it promptly asked for the drive the next morning.  No big deal; I just didn’t know the details of the system.

One small file recovery was quick and effective through the GUI.  It looked like a special folder in the normal file explorer, which allowed for drag-and-drop copying from the backup to the desired location in another file window.  As usual when coding, I had done something, changed my mind, deleted it, and then changed my mind back later.  And probably changed a couple more times.

KDE Backup

KDE Backup had been set up as a redundant system behind my handcrafted (smaller, faster, non-versioned) script, the latter of which was—er, intended—to store only my critical data.

A few months after a trial-by-fire backup test, I discovered that ~/.gnupg was not in my manual backup.  This could have been a critical fault… but KDE Backup had the data.  Clicking the “Restore” button opened a “File Digger” window, and from there it was just like using Pika Backup.

I got my key back!

I added the missing ~/.gnupg directory to manual backup, then reconfigured KDE Backup to back up into the existing repository.  That went smoothly, too.  It noticed that the directory I gave it already had data, so it verified the integrity, and then backed up into it.

Format-wise, KDE uses kup to do the backup, which is a front-end to bup, storing the data as a bare git repository.  I didn’t need a password to get data out of it, which makes sense, because I didn’t need one to set up the backup.  That’s also great news for actually recovering my data, because I have no idea what I would have chosen for a password when I started KDE Backup in the first place.

Conclusion

Both systems are working great, i.e., better than my manual one.

Sunday, June 23, 2024

Sorted by What?

I came across some ancient code of the form (shortened for illustrative purposes):

SELECT DATE_FORMAT(o.created_at, '%c/%e/%Y') AS dt,
    DATE_FORMAT(c.updated_at, '%l:%i %p') AS tm,
    …
FROM orders o JOIN customers c … WHERE …
ORDER BY c.last_name, dt DESC, tm;

The ORDER BY dt caught my eye because it’s not an actual column in any table. Its value turns out to be the American-style “6/23/2024” format, which is reasonable to display, but completely wrong to sort on.  Doing that puts October prior to February, as “10” begins with “1”, which is less than “2”.
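The effect is easy to reproduce outside the database.  A small shell sketch with made-up dates, comparing a text sort against a sort on the numeric year, month, and day (the function names are mine):

```shell
#!/bin/sh
# Sort M/D/YYYY dates (one per line on stdin) two ways.

# As plain text, which is what sorting on the formatted string does.
sort_as_text() {
    LC_ALL=C sort
}

# Chronologically: year, then month, then day, each compared as a number.
sort_chronological() {
    sort -t / -k 3,3n -k 1,1n -k 2,2n
}

printf '%s\n' 2/14/2024 10/5/2024 6/23/2024 | sort_as_text
# 10/5/2024 comes out first, ahead of February and June
```

The DESC in the query only reverses the text order; the result is still not chronological.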

I cannot guess why it pulls the time from an unrelated column as tiebreaker, sorting it the other direction.  The rest of the issues are likely for the same reasons as the date.

I assume the chaotic arrangement of orders within a customer was never raised as a concern only because duplicating orders would be rare enough—ideally, never happening—that it didn’t matter.

Nonetheless, I queued a change to sort on the full date+time held in created_at, so that records will be fully chronological in the future.

Sunday, June 16, 2024

Availability and Automatic Responses

At work, we have built up quite a bit of custom monitoring.  It’s all driven by things failing in new and exciting ways.

Before that particular day, there were other crash-loop events with that code, which mainly manifested as “the site is down” or “being weird,” and which showed up in the metrics we were collecting as high load average (loadavg.) Generally, we’d get onto the machine—slowly, because it was busy looping—see the flood in the logs, and then ask Auto Scaling to terminate-and-replace the instance.  It was the fastest way out of the mess.

This eventually led to the suggestion that the instance should monitor its own loadavg, and terminate itself (letting Auto Scaling replace it) if it got too high.

We didn’t end up doing that, though.  What if we had legitimate high CPU usage?  We’d stop the instance right in the middle of doing useful work.

Instead, during that iteration, we built the exit_manager() function that would bring down the service from the inside (for systemd to replace) if that particular cause happened again.

Some other time, I accidentally poisoned php-fpm.  The site appeared to run fine with the existing pages.  However, requests involving the newly updated extension would somehow both generate a segfault, and tie up the request worker forevermore.  FPM responded by starting up more workers, until it hit the limit, and then the entire site was abruptly wedged.

It became a whole Thing because the extension was responsible for EOM reporting that big brass was trying to run… after I left that evening.  The brass tried to message the normally-responsible admin directly.  It would have worked, but the admin was strictly unavailable that night, and they didn’t reach out to anyone else.  I wouldn’t find out about the chaos in my wake until reading the company chat the next morning.

Consequently, we also have a daemon watching php-fpm for segfaults, so it can run systemctl restart from the outside if too many crashes are happening.  It actually does have the capability to terminate the instance, if enough restarts fail to alleviate the problem.

I’m not actually certain if that daemon is useful or unnecessary, because we changed our update process.  We now deploy new extension binaries by replacing the whole instance.

Having a daemon which can terminate the instance opens a new failure mode for PHP: if the new instance is also broken, we might end up rapidly cycling whole instances, rather than processes on a single instance.

Rapidly cycling through main production instances will be noticed and alerted on within 24 hours.  It has been a long-standing goal of mine to alert on any scaling group’s instances within 15 minutes.

On the other hand, we haven’t had rapidly-cycling instances in a long time, and the cause was almost always crash-looping on startup due to loading unintended code, so expanding and improving the system isn’t much of a business priority.

It doesn’t have to be well-built; it just has to be well-built enough that it never, ever stops the flow of dollars.  Apparently.

Sunday, June 9, 2024

My firewall, as of 2024

On my old Ubuntu installation, I had set up firewall rules to keep me focused on things (and to keep software in line, like blocking plain DNS to require DoT to Cloudflare.)

Before doing a fresh installation, I saved copies of /etc/gufw and /etc/ufw, but they didn’t turn out to be terribly useful.  I don’t know what happened, but some of the rules lost address information.  The ruleset ended up allowing printing to the whole internet, for instance.

I didn’t have a need for profiles (I don’t take my desktop to other networks), so I ended up reconstructing it all as a script that uses ufw, and removing gufw from the system entirely (take that!)

That script looks in part like this:

#!/bin/sh
set -eufC
# -- out --
ufw default reject outgoing
ufw allow out 443/udp comment 'HTTP 3'
ufw allow out 80,443/tcp comment 'Old HTTP'
ufw allow out proto udp \
    to 224.0.0.251 port 5353 \
    comment 'mDNS to LAN for printing'
ufw allow out proto tcp \
    to 192.168.0.251 port 631,9100 \
    comment 'CUPS to Megabrick'
ufw allow out on virbr0 proto tcp \
    to any port 22 comment 'VM SSH'
# -- in --
ufw default deny incoming
ufw allow in 9000:9010/tcp \
    comment 'XDebug listener'

This subset captures all of the syntax I’m using: basic and advanced forms, and all of the shapes of multi-port rules.  One must use the ‘advanced’ form to specify address or interface restrictions.  However, ufw is extremely unhelpful about error messages, usually only giving out “wrong number of arguments.”  The typical recourse is either to look harder at the man page syntax, or to try to roll back conditions until it gets accepted.

For deleting those test rules, the best way is ufw status numbered followed by ufw delete N where N is the desired rule number.  (You can also do ufw reset and start over.)

Note that the ufw port range syntax is “low:high” with a colon, like iptables. For example, 9000:9010 is a range of 11 ports; 9000,9010 is a list of only those two ports.
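To double-check the arithmetic, here is a throwaway shell function (count_ports is my own name, not part of ufw) that counts the ports in a ufw-style port spec:

```shell
#!/bin/sh
# Count the ports named by a spec such as "9000:9010" or "631,9100".
# The function body is a subshell so the IFS change stays local.
count_ports() (
    IFS=,
    total=0
    for part in $1; do
        case $part in
            *:*)
                # A low:high range includes both endpoints.
                low=${part%%:*}
                high=${part##*:}
                total=$((total + high - low + 1))
                ;;
            *)
                total=$((total + 1))
                ;;
        esac
    done
    echo "$total"
)

count_ports 9000:9010   # prints 11
count_ports 9000,9010   # prints 2
```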

(I gave the printer a static IP because Windows; thus, the printer’s static IP appears in the ruleset.)

This script, then, only has to be run once per fresh install; after that, ufw will remember these rules and apply them at boot.

Sunday, June 2, 2024

Stateful Deployment was Orthogonal

I used to talk about “stateful, binary” deployment, thinking that both things would happen together:

  1. We would deploy from a built tarball, without any git pull or composer install steps
  2. We would record the actual version (or whole tarball path) that was deployed

This year, we finally accumulated enough failures caused by auto-deploy picking up pushed code that wasn’t ready that we decided we had to solve that issue. It turned out to be unimportant that we weren’t deploying from tarballs.

We introduced a new flag for “auto mode” for the instance-launch scripts to use. Without the flag, deployment happens in manual mode: it performs the requested operation (almost) as it always has, then writes the resulting branch, commit, and (if applicable) tarball overlay as the deployed state.

In contrast, auto mode simply reads the deployed state, and applies that exact branch, commit, and overlay as requested.

I say “simply,” but watch out for what happens to a repository which doesn’t have any state stored.  This isn’t a one-time thing: when adding new repositories later, their first deployment won’t have state yet, either.  This can disrupt both auto and manual deployments.
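As a rough sketch of the idea, and not our actual scripts: the state file name, its one-line format, and both function names below are invented for illustration.  The point is that the read side treats missing state as an explicit error instead of guessing.

```shell
#!/bin/sh
STATE_FILE=.deployed-state   # hypothetical file name

# Manual mode: after deploying, record exactly what was deployed.
# Usage: write_state REPO_DIR BRANCH COMMIT [OVERLAY]
write_state() {
    printf '%s %s %s\n' "$2" "$3" "${4:--}" > "$1/$STATE_FILE"
}

# Auto mode: print the recorded state, or fail loudly if there is none
# (for instance, a repository that has never had a manual deployment).
# Usage: read_state REPO_DIR
read_state() {
    if [ ! -s "$1/$STATE_FILE" ]; then
        echo "no deployed state in $1; run a manual deploy first" >&2
        return 1
    fi
    cat "$1/$STATE_FILE"
}
```

A launch script could then refuse to continue when read_state fails, rather than falling back to whatever happens to be on the branch.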

Sunday, May 26, 2024

My ssh/sshd Configurations

Let’s look at my SSH configurations!

File Layout

Starting with Ubuntu 22.04 LTS and Debian 12, the OpenSSH version in the distribution is new enough that the Include directive is supported, and works properly with Match blocks in included files.  Therefore, most of the global stuff ends up in /etc/ssh/sshd_config.d/01-security.conf and further modifications are made at higher numbers.

Core Security

To minimize surface area, I turn off features I don’t use, if possible:

GSSAPIAuthentication no
HostbasedAuthentication no
PasswordAuthentication no
PermitEmptyPasswords no

AllowTcpForwarding no
X11Forwarding no
Compression no

PermitUserRC no
# Debian and derivatives
DebianBanner no

Some of these are defaults, unless the distribution changes them, which means “explicit is better than implicit” is strongly advised.

Next, I use a group to permit access, allowing me to explicitly add the members to the group without needing to edit the ssh config when things change.  Don’t forget to groupadd ssh-users (once) and gpasswd -a USER ssh-users (for each user.) Then, permit only that group:

AllowGroups ssh-users
# extra paranoia
PermitRootLogin no

Note that all of the above may be overridden in Match blocks, where required. TCP forwarding may also be more finely controlled through PermitListen and PermitOpen directives.

Note also that my systems are essentially single-user.  The group doesn't permit any sharing (and doesn't participate in quotas or anything) that would otherwise be forbidden.

Performance

Machines I use for ssh and sshd are all amd64, so for personal usage, I bump the AES algorithms to the front of the list:

Ciphers ^aes256-gcm@openssh.com,aes256-ctr

SFTP

The biggest trouble is the SFTP subsystem.  I comment that out in the main config, then set it in my own:

# /etc/ssh/sshd_config:
#Subsystem sftp ...

# /etc/ssh/sshd_config.d/02-sftp.conf:
Subsystem sftp internal-sftp
Match group sftp-only
    # ForceCommand, ChrootDirectory, etc.

I forget the details of what goes in that Match block.  It’s work stuff, set up a while ago now.

Ongoing Hardening

I occasionally run ssh-audit and check out what else shows up.  Note that you may need to run it with the --skip-rate-test option these days, particularly if you have set up fail2ban (guess how I know.)

There are also other hardening guides on the internet; I have definitely updated my moduli to only include 3072-bit and up options.  Incidentally, if you wonder how that works:

awk '$5 >= 3071' ...

The default action for awk is print, so that command prints lines that fulfill the condition.  The fifth field is the length of the modulus, so that’s what we compare to.  The actual bit count is 3071 instead of 3072, because the first digit must be 1 to make a 3072-bit number, so there are only 3071 bits that aren’t predetermined.
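To see the filter in action without touching the real file, here is the same awk condition run on three made-up lines in the moduli column layout (the filter_moduli wrapper is just for illustration):

```shell
#!/bin/sh
# Keep only lines whose size field (column 5) is at least 3071.
filter_moduli() {
    awk '$5 >= 3071'
}

# Made-up sample lines; the columns are
# time, type, tests, tries, size, generator, modulus.
printf '%s\n' \
    '20240101000000 2 6 100 2047 2 AAAA' \
    '20240101000000 2 6 100 3071 2 BBBB' \
    '20240101000000 2 6 100 4095 2 CCCC' | filter_moduli
# prints only the 3071 and 4095 lines
```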

Client Config Sample

Host site-admin
  # [HostName, Port, User undisclosed]
  IdentityFile ~/.ssh/id_admin
  IdentitiesOnly yes

Host 192.168.*
  # Allow talking to Dropbear 2022.83+ on this subnet
  KexAlgorithms +curve25519-sha256,curve25519-sha256@libssh.org
  MACs +hmac-sha2-256

Host *
  Ciphers aes256-gcm@openssh.com,aes256-ctr
  KexAlgorithms sntrup761x25519-sha512@openssh.com
  MACs hmac-sha2-256-etm@openssh.com,umac-128-etm@openssh.com
  GSSAPIAuthentication no

It’s mostly post-quantum key exchange, plus assigning a very specific private key to the administrative user on my Web server.

Sunday, May 19, 2024

Everything Fails, FCGI::ProcManager::Dynamic Edition

I have been reading a lot of rachelbythebay, which has led me to thinking about the reliability of my own company’s architecture.

It’s top of my mind right now, because an inscrutable race condition caused a half-hour outage of our primary site.  It was a new, never-before-seen condition that slipped right past all our existing defenses.  Using Site as a placeholder for our local namespace, it looked like this:

use Try::Tiny qw(try catch);
try {
  require Site::Response;
  require Site::Utils;
  ...;
} catch {
  # EX_PRELOAD => exit status 1
  exit_manager(EX_PRELOAD, $_);
};
...;
$res = Site::Response->new();

Well, this time—this one time—on both running web servers… it started throwing an error that Method "new" wasn't found in package Site::Response (did you forget to load Site::Response?).  Huh?  Of course I loaded it; I would’ve exited if that had failed.

In response, I added a lot more try/catch, exit_manager() has been improved, and there is a separate site-monitoring service that will issue systemctl restart on the site, if it starts rapidly cycling through workers.

Sunday, May 12, 2024

Using tarlz with GNU tar

I have an old trick that looks something like:

$ ssh HOST tar cf - DIR | lzip -9c >dir.tar.lz

The goal here is to pull a tar from the server, compressing it locally, to trade bandwidth and client CPU for reduced server CPU usage.  I keep this handy for when I don’t want to disturb a small AWS instance too much.

Since then, I learned about tarlz, which can compress an existing tar archive with lzip.  That seemed like what I wanted, but naïve usage would result in errors:

$ ssh HOST tar cf - DIR | tarlz -z -o dir.tar.lz
tarlz: (stdin): Corrupt or invalid tar header.

It turned out that tarlz only works on archives in POSIX format, and (modern?) GNU tar produces them in GNU format by default.  Pass it the --posix option to make it all work together:

$ ssh HOST tar cf - --posix DIR | \
    tarlz -z -o dir.tar.lz

(Line broken on my blog for readability.)

Bonus tip: it turns out that GNU tar will auto-detect the compression format on read operations these days.  Running tar xf foo.tar.lz will transparently decompress the archive with lzip.

Tuesday, April 30, 2024

Things I learned Reinstalling My Ubuntu

I did not want to wait for Ubuntu Studio 24.04 to be offered as an update to 23.10, so I got the installer and tried it.  Also, I thought I would try repartitioning the disk as UEFI.

Brief notes:

  • I did not feel in control of manual partitioning
  • I found out one of my USB sticks is bad, thanks to F3…
  • …and no thanks to the Startup Disk Creator!
  • If the X11 window manager crashes/doesn’t start, goofy things happen
  • Wayland+KWin still don’t support sticky keys, smh
  • snap remove pops up the audio device overlay… sometimes repeatedly
  • I depend on a surprising amount of configuration actually

Tuesday, April 23, 2024

Getting fail2ban Working [with my weird choices] on Ubuntu 22.04 (jammy)

To put the tl;dr up front:

  1. The systemd service name may not be correct
  2. The service needs to be logging enough information for fail2ban to process
  3. Unrelatedly, Apple Mail on iPhone is really bad at logging into Dovecot
  4. Extended Research

[2024-04-26: Putting the backend in the DEFAULT section may not actually work on all distributions.  One may need to copy it into each individual jail (sshd, postfix, etc.) for it to take effect.]

A minimalist /etc/fail2ban/jail.local for a few services, based on mine:

[DEFAULT]
backend = systemd
[sshd]
enabled = true
journalmatch = _SYSTEMD_UNIT=ssh.service + _COMM=sshd
[postfix]
enabled = true
journalmatch = _SYSTEMD_UNIT=postfix@-.service
[pure-ftpd]
enabled = true
journalmatch = _SYSTEMD_UNIT=pure-ftpd.service

(The journalmatch for pure-ftpd removes the command/_COMM field entirely.)

Sunday, March 3, 2024

vimrc tips

On Debian-family systems, vim.tiny may be providing the vim command, through the alternatives system. If I bring in my dotfiles and haven’t installed a full vim package yet, such as vim-gtk3, then dozens of errors might show up.  vim.tiny really does not support many features.

Other times, I run gvim -ZR for quickly checking some code, to get read-only restricted mode.  In that case, anything that wants to run a shell command will fail.  Restricted mode is also a signal that I don’t trust the files I’m viewing, so I don’t want to process their modelines at all.

To deal with these scenarios, my vimrc is shaped like this (line count heavily reduced for illustration):

set nocompatible ruler laststatus=2 nomodeline modelines=2
if has('eval')
    call plug#begin('~/.vim/plugged')
    try
        call system('true')
        Plug 'dense-analysis/ale'
        Plug 'mhinz/vim-signify' | set updatetime=150
        Plug 'pskpatil/vim-securemodelines'
    catch /E145/
    endtry
    Plug 'editorconfig/editorconfig-vim'
    Plug 'luochen1990/rainbow'
    Plug 'tpope/vim-sensible'
    Plug 'sapphirecat/garden-vim'
    Plug 'ekalinin/Dockerfile.vim', { 'for': 'Dockerfile' }
    Plug 'rhysd/vim-gfm-syntax', { 'for': 'md' }
    Plug 'wgwoods/vim-systemd-syntax', { 'for': 'service' }
    call plug#end()
    if !has('gui_running') && exists('&termguicolors')
        set termguicolors
    endif
    let g:rainbow_active=1
    colorscheme garden
endif

We start off with the universally-supported settings.  Although I use the abbreviated forms in the editor, my vimrc has the full spelling, for self-documentation.

Next is the feature detection of if has('eval') … endif.  This ensures that vim.tiny doesn’t process the block.  Sadly, inverting the test and using the finish command inside didn’t work.

If we have full vim, we start loading plugins, with a try-catch for restricted mode.  If we can’t run the true shell command, due to E145, we cancel the error and proceed without that subset of non-restricted plugins.  Otherwise, ALE and signify would load in restricted mode, but throw errors as soon as we opened files.

After that, it’s pretty straightforward; we’re running in a full vim, loading things that can run in restricted mode.  When the plugins are over, we finish by configuring and activating the ones that need it.

Friday, February 2, 2024

My Issues with Libvirt / Why I Kept VirtualBox

At work, we use VirtualBox to distribute and run development machines.  The primary reasons for this are:

  1. It is free (gratis), at least the portions we require
  2. It has import/export

However, it isn’t developed in the open, and it has a worrying tendency to print sanitizer warnings on the console when I shut down my laptop.

Can I replace it with kvm/libvirt/virt-manager?  Let’s try!