Decoded Node: PHP & Perl Web Development, and the Craft of Programming

<h1>vimrc tips (2024-03-03)</h1>
<p>On Debian-family systems, <code>vim.tiny</code> may be providing the <code>vim</code> command, through the <a href="https://wiki.debian.org/DebianAlternatives">alternatives system</a>. If I bring in my dotfiles and haven’t installed a full vim package yet, such as <code>vim-gtk3</code>, then dozens of errors might show up. <code>vim.tiny</code> really <em>does not</em> support many features.</p>
<p>Other times, I run <code>gvim -ZR</code> for quickly checking some code, to get read-only restricted mode. In that case, anything that wants to run a shell command will fail. Restricted mode is also a signal that I don’t trust the files I’m viewing, so I don’t want to process their modelines at all.</p>
<p>To deal with these scenarios, my vimrc is shaped like this (line count heavily reduced for illustration):</p>
<pre><code>set nocompatible ruler laststatus=2 nomodeline modelines=2
if has('eval')
call plug#begin('~/.vim/plugged')
try
call system('true')
Plug 'dense-analysis/ale'
Plug 'mhinz/vim-signify' | set updatetime=150
Plug 'pskpatil/vim-securemodelines'
catch /E145/
endtry
Plug 'editorconfig/editorconfig-vim'
Plug 'luochen1990/rainbow'
Plug 'tpope/vim-sensible'
Plug 'sapphirecat/garden-vim'
Plug 'ekalinin/Dockerfile.vim', { 'for': 'Dockerfile' }
Plug 'rhysd/vim-gfm-syntax', { 'for': 'md' }
Plug 'wgwoods/vim-systemd-syntax', { 'for': 'service' }
call plug#end()
if !has('gui_running') && exists('&termguicolors')
set termguicolors
endif
let g:rainbow_active=1
colorscheme garden
endif</code></pre>
<p>We start off with the universally-supported settings. Although I use the abbreviated forms in the editor, my vimrc has the full spelling, for self-documentation.</p>
<p>Next is the feature detection of <code>if has('eval') … endif</code>. This ensures that <code>vim.tiny</code> doesn’t process the block. Sadly, inverting the test and using the <code>finish</code> command inside didn’t work.</p>
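<p>For reference, the inverted shape would have looked like the sketch below. My guess as to why it fails: <code>vim.tiny</code> skips the whole <code>if</code> block, <code>finish</code> included, and then still chokes on the rest of the file.</p>
<pre><code>" inverted form: did NOT work for me under vim.tiny
if !has('eval')
  finish
endif</code></pre>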
<p>If we have full vim, we start loading plugins, with a try-catch for restricted mode. If we can’t run the <code>true</code> shell command, due to E145, we cancel the error and proceed without the plugins that need an unrestricted shell. Otherwise, ALE and signify would <em>load</em> in restricted mode, but throw errors as soon as we opened files.</p>
<p>After that, it’s pretty straightforward; we’re running in a full vim, loading things that can run in restricted mode. When the plugins are over, we finish by configuring and activating the ones that need it.</p>
<h1>My Issues with Libvirt / Why I Kept VirtualBox (2024-02-02)</h1>
<p>At work, we use <a href="https://www.virtualbox.org/">VirtualBox</a> to distribute and run development machines. The primary reasons for this are:</p>
<ol>
<li>It is free (gratis), at least the portions we require</li>
<li>It has import/export</li>
</ol>
<p>However, it isn’t developed in the open, and it has a worrying tendency to print sanitizer warnings on the console when I shut down my laptop.</p>
<p>Can I replace it with kvm/libvirt/virt-manager? Let’s try!</p>
<h2>Import?</h2>
<p>There’s no import function. There’s not even a <em>vendor-specific</em> way to import/export for other <code>qemu-system-x86</code> instances. As far as I know, the best we can do is dump the libvirt XML defining the machine, and the disk image… and those bake in absolute paths.</p>
<p>For instance, an unprivileged guest will store in <code>$HOME</code>, so an image produced by user foo will have a path like <code>/home/foo/.local/share/libvirt/images/vm-1.qcow2</code> stored into it. Anyone who wants to import it, who is not named “foo”, will have to edit the file.</p>
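<p>The manual fix is not hard, just tedious. A rough sketch, where the VM name, user paths, and session URI are placeholders rather than anything from a real setup:</p>
<pre><code># dump the definition, rewrite the baked-in path, and re-define it
virsh -c qemu:///session dumpxml vm-1 > vm-1.xml
sed -i 's|/home/foo/|/home/bar/|g' vm-1.xml
virsh -c qemu:///session define vm-1.xml
# (the qcow2 itself still has to be copied to the new location)</code></pre>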
<p>In VirtualBox, on the other hand, I can export an appliance as an OVA file that my colleagues can import without changes. The import process takes care of unpacking the disk inside to VirtualBox’s storage path, so we don’t have to know the details of that, either.</p>
<p>To truly replace VirtualBox, libvirt/kvm/qemu would <em>need</em> that level of frictionless exchange.</p>
<h2>Static IP?</h2>
<p>This particular VirtualBox guest has a second NIC on a host-only network, where we can access the Web server by a predetermined IP using standard ports, without any NAT rules. We can then forward traffic there with <a href="https://github.com/sapphirecat/devproxy2">devproxy</a>. (With Let’s Encrypt providing a TLS certificate, and a proxy manager in the browser, we can use the site’s <em>real</em> domain name to access the VM instead. This is much less error-prone than having host-specific configuration settings.)</p>
<p>It looks like I should create an isolated network for the equivalent in kvm, but limited users can’t do that. <code>virbr1</code> can’t be created. I was also unsuccessful at finding a way to control the IP address assignment done by usermode networking. Even if I could, I don’t think I could directly access listening ports on the guest.</p>
<p>I want to avoid giving my limited user full rights to fully reconfigure the entire network just to get a guest to have a static IP, but I don’t think it’s possible.</p>
<h2>virtio-fs?</h2>
<p>To avoid having to run <a href="https://github.com/bcpierce00/unison">file sync</a> all the time, and to have access to the files on the host without needing the VM running, this guest mounts the code via NFS. (To have the correct Unix permissions, it does not use VirtualBox shared folders.)</p>
<p>kvm supports virtio-fs as a replacement for NFS, 9p, etc. It is specifically meant to improve the performance over any networked file system by removing the network stack traversals.</p>
<p>Unfortunately, for limited users, it’s not usable yet. A patch to allow unprivileged use of virtio-fs was landed upstream in mid-December 2023, but it’s unclear whether this is soon enough that it will be included in Ubuntu 24.04 LTS. If not, I may not have this feature available until 2026.</p>
<h2>Just be privileged?</h2>
<p>I want my account for daily usage to be as isolated as possible from root. There is no ssh daemon running; what <code>sudo</code> permissions exist allow running bounce scripts in /usr/local/sbin, that carefully limit the real operations.</p>
<p>Meanwhile, the libvirt documentation states, “A read-write connection to daemons in system mode typically implies privileges equivalent to having a root shell.” Using privileged mode clearly means the account is no longer isolated.</p>
<p>A good compromise would be a setup where launching <code>virt-manager.desktop</code> prompts for my (or a custom) password, to gain access to libvirtd and <code>qemu:///system</code> <strong>only</strong> for its process tree. That would at least be better than me and my <em>entire desktop session</em> having the group ambiently available.</p>
<p>I am aware that libvirtd <em>may</em> support authentication, but I haven’t worked on that angle too hard. It’s looking like a lot of weeds over there. There are compile-time options for what’s supported, but no trivial way to find out which of those are used by the binary on the system. The <code>--version</code> switch that often holds such optional information for other programs is of no use with <code>libvirtd</code>.</p>
<h2>Where does this leave me?</h2>
<p>The <strong>most critical</strong> problem is the static IP. If I can’t give the guest a fixed IP that the host can reach it on, I can’t use the web server for testing. I’ve invested quite a bit into avoiding host-specific configuration or having the web app ‘know’ its address; I am not backing down now.</p>
<p>(The problem with using a port-forward and accessing localhost:8443 is that the app still sees itself running on port 443. When it builds a self-redirect, the port is wrong. Moreover, I’d need a TLS certificate for localhost, and everyone would need to trust it.)</p>
<p>I’m not sure how to order the next two issues. Both of them should be solved before I could give an enthusiastic “Yes, I will use this,” but neither of them makes libvirt unusable the way the static IP issue is a complete blocker.</p>
<p>One, import/export should exist. Machine description and disk image(s) in one file to transfer, and no XML editing required to import such a file. This would reach parity with the VirtualBox feature. I don’t think it must be OVA; I would anticipate only using it for libvirt-to-libvirt transfers.</p>
<p>Two, the libvirtd security model should be revised. Either I would like to understand the path to escalate privileges and why it is necessary, or libvirtd should be audited and improved to avoid handing out root implicitly. This would be less relevant if everything else (static IP) worked without privilege. And, noted above, it would be less relevant if the <code>.desktop</code> file could elevate itself instead of every process in my session having <code>libvirt</code> group rights.</p>
<p>The final problem is the state of the documentation. Scattered across several projects (kernel, qemu, libvirt) as well as Red Hat and the Stack Overflow network of sites… I feel like I spent a lot of time on research, and gained much less knowledge than might be hoped for.</p>
<p>Even if upstream makes changes to any of this <em>right now,</em> the results may not be available on LTS distros for years.</p>
<h2>The ancillary problem</h2>
<p>To get started quickly, I converted the VirtualBox disk image, and imported it to a new guest in virt-manager. When I booted, <strong>systemd hung forever</strong> due to the change in bus path for the primary NIC. The guest (Ubuntu 22.04) uses netplan, which baked the VirtualBox device name into the configuration, and that (DHCP/NAT connection) was mandatory.</p>
<p>So, systemd said it was waiting for the network (no timeout), and things sure hadn’t improved after 90 seconds. I used recovery mode to change the file. (Of course, then NFS failed, since it didn’t have the static IP. But at least that had a 90-second timeout.)</p>
<p>This isn’t libvirt’s fault, but it does show the importance of retaining the bus layout through an export/import cycle.</p>
<h2>But what is the performance? We must know!</h2>
<p>Best-of-three numbers for <code>npm run build</code> on a React site I had handy:</p>
<pre><code>Setup          : Real/Wall Time : Normalized
---------------:----------------:-----------
distrobox      : 16.468 sec.    : 1.000 x
KVM virtiofs   : 22.810 sec.    : 1.385 x
VirtualBox NFS : 39.300 sec.    : 2.386 x</code></pre>
<p>I didn’t go to the trouble of installing Node on the bare metal when I already had a distrobox container ready, but I expect <code>distrobox</code> to be <em>extremely close</em> to native performance. There’s no hypervisor and no other filesystem interposed. (The problem with distrobox <strong>is</strong> the lack of isolation from the host. Its main purpose is to make the container seamless, with access to the host’s Wayland/X11, etc.)</p>
<p>virtio-fs <em>would</em> be nice to have. Alas!</p>
<p>I also found that the <code>libvirt</code> group membership doesn’t help assign a static IP. I still didn’t have permission to create the virtual bridge. Alas…</p>
<h1>No More Realtek WiFi (2023-12-31)</h1>
<p>The current Debian kernel (based on 6.1.66, after the ext4 corruption mess) seems to be locking up with <a href="https://github.com/morrownr/88x2bu-20210702">the Realtek USB wireless drivers</a> I use. Anything that wants the IP address (like <code>agetty</code> or <code>ip addr</code>) hangs, as does shutdown. It all works fine on the "old" kernel, which is the last version prior to the ext4 issue.</p>
<p>Meanwhile in Ubuntu 23.10, the in-kernel RTW drivers were flaky and bouncing the connection, so I had returned to morrownr’s driver there, as well. But now that I don’t trust any version of this driver? Forget this company. In the future, I will be using any other option:</p>
<ol>
<li>A Fenvi PCIe WiFi card with an Intel chip on board, or the like</li>
<li>Using an extra router as a wireless client/media bridge, with its Ethernet connected to the PC</li>
<li>If USB <em>were</em> truly necessary, as opposed to simply “convenient,” <a href="https://github.com/morrownr/USB-WiFi/blob/main/home/USB_WiFi_Adapters_that_are_supported_with_Linux_in-kernel_drivers.md">a Mediatek adapter</a></li>
</ol>
<p>Remember that speed testing and studying <code>dmesg</code> output led me to the conclusion that this chipset comes up in USB 2.0 mode, and even the Windows drivers just use it that way. While morrownr’s driver offers the ability to switch it to USB 3.0 mode under Linux, doing so prevents the adapter from connecting properly. I never researched hard enough to find out whether there is a way to make that work, short of warm rebooting <em>again</em> so that it comes up already in USB 3.0 mode.</p>
<p>It’s clearly deficient by design, and adding injury to insult, the drivers aren’t even stable. Awful experience, one star ★☆☆☆☆, would not recommend. Intel or Mediatek are much better choices.</p>
<p><strong>Addendum, 2024-01-13:</strong> I purchased an AX200-based Fenvi card, the FV-AXE3000Pro. It seemed not to work at all. In Windows it would fail to start with error code 10, and in Linux it would fail to load RT ucode with error -110. And then, Linux would report hangs for thermald, and systemd would wait forever for it to shut down. When the timer ran out at 1m30s, it would just kick up to 3m.</p>
<p>Embarrassingly enough, all problems were solved by plugging it into the correct PCIe slot. Apparently, despite being physically compatible, graphics card slots (which already had the punch-outs on my case, um, punched out) are for graphics cards only. (My desktop is sufficiently vintage that it has two PCIe 3.0 x16 slots, one with 16 lanes and one with 4 lanes, and two classic PCI slots between them.)</p>
<p>Result: my WiFi is 93% faster, matching the WAN rate as seen on the Ethernet side of the router. Good riddance, Realtek!</p>

<h1>Diving too deeply into DH Moduli and OpenSSH (2023-12-26)</h1>
<p>tl;dr:</p>
<ul>
<li>Debian/Ubuntu use the <code>/etc/ssh/moduli</code> file as distributed by the OpenSSH project at the time of the distribution’s release</li>
<li>This file is only used for <code>diffie-hellman-group-*</code> KexAlgorithms</li>
<li>The default KEX algorithm on my setup is the post-quantum <code>sntrup761x25519-sha512@openssh.com</code> instead</li>
<li>Therefore, you can generate your own moduli, but it is increasingly irrelevant</li>
<li>Having more moduli listed means that <code>sshd</code> will do more processing during every connection attempt that uses the file</li>
</ul>
<p>There is also a “fallback” behavior if the server can’t read the <code>moduli</code> file or find a match, which I don’t fully understand.</p>
<h3>Intro</h3>
<p>There’s a lot on the internet about Diffie-Hellman, primes, and groups. I will be skipping over most of this, and looking more closely at how OpenSSH in particular handles the Diffie-Hellman group exchange (all <code>diffie-hellman-group-exchange-*</code> KEX algorithms.)</p>
<p>References to the source code and “current behavior” of OpenSSH specifically refer to Portable OpenSSH version 9.3p1. This upstream version was chosen for inclusion in Ubuntu 23.10, Mantic Minotaur, which happens to be my current desktop.</p>
<h3>The Fallback</h3>
<p>My curiosity was initially stoked by <a href="https://github.com/jtesta/ssh-audit/">ssh-audit</a> mentioning the “OpenSSH DH GEX fallback mechanism.” ⚠️ It turns out that if OpenSSH cannot read <code>/etc/ssh/moduli</code> or find an acceptable group in there (between min and max preferences from the client), then it falls back to one of groups 14/16/18 (2048/4096/8192 bits), defined in <a href="https://www.rfc-editor.org/rfc/rfc3526">RFC 3526</a>.</p>
<p>Curiously, the fallback function is passed the client’s <strong>max</strong> preference, but rounds that (possibly <strong>upwards</strong>) to the <em>closest eligible</em> group. Thus, if the mechanism is activated by a client that accepted "2048–7168" bits, OpenSSH would choose the 8192-bit fallback. This happens even though the server is otherwise willing to use the 4096-bit group.</p>
<p>(This code appears in <code>dh.c</code>, functions <code>choose_dh()</code> and <code>dh_new_group_fallback()</code>.)</p>
<p>Either I don’t understand correctly, or it is only a theoretical problem, and not a practical one.</p>
<h3>The Client’s Preferred Size</h3>
<p>There is—probably wisely—no configuration for the length of moduli in the request sent by the client. OpenSSH chooses the lengths to send by hard-coded settings for min and max (currently 2048 and 8192, respectively), and determines its preferred setting by calling a <code>dh_estimate()</code> function.</p>
<p>That function, in turn, converts a “symmetric bit length” to a DH length. This appears to take into account the cipher key size, cipher block length, cipher IV length (except those are mostly defined to 0), and MAC length (except that only UMAC has a non-zero value.)</p>
<p>This yields 3072 for 128-bit, 7680 for 192-bit, and 8192 for 256-bit. The first two numbers correspond to the list of equivalent strengths listed in NIST SP 800-57 Part 1 (currently in Rev. 5, published in 2020.) The last is obviously limited by the maximum length that OpenSSH is willing to use, as the publication suggests a 15360-bit length instead.</p>
<p>It may be possible to get a preferred size of 2048 when using <code>3des-cbc</code>🛑, which is only 112 bits of security. But don’t do that!</p>
<h3>Rereading Moduli</h3>
<p>OpenSSH appears to read through the <code>/etc/ssh/moduli</code> file twice on every request: once to find candidates, and once more to rediscover the candidate that was chosen. Internally, it parses every line (first pass) and every line up to the chosen candidate (second pass), <strong>including</strong> allocating the generator and prime, and converting them from hex to binary. If the candidate disappears from the file between passes, then the fallback mechanism is activated.</p>
<p>I know, worrying about the performance of a once-per-connection operation is a bit much; still, it appears that the moduli file should be <em>as short as is practical,</em> while providing enough moduli/lengths to offer suitable coverage. Likewise, it should contain only lengths that clients are actually willing to use.</p>
<h3>Moduli File Contents</h3>
<p>OpenSSH regularly distributes updated versions of their <code>moduli</code> file. Debian and Ubuntu pull that version of the file into their own release process. This limits the usefulness of precomputing anything based on the moduli; after some time, those moduli are phased out by new OS releases.</p>
<p>OpenSSH provides moduli with lengths of 2048, 3072, 4096, 6144, 7680, and 8192. But from what we learned earlier, it will only really request 3072, 7680, and 8192 itself. The other three sizes are for the benefit of <em>other</em> clients. But I wanted to wrap up my research, so I did not study any of them.</p>
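<p>Since the tl;dr mentions generating your own moduli, the process on a recent OpenSSH looks roughly like this (older releases used the <code>-G</code>/<code>-T</code> options instead of <code>-M</code>; check your <code>ssh-keygen</code> man page):</p>
<pre><code># generate candidate primes, then screen them for safe ones
ssh-keygen -M generate -O bits=3072 moduli-3072.candidates
ssh-keygen -M screen -f moduli-3072.candidates moduli-3072
# the screened output is in the same format as /etc/ssh/moduli</code></pre>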
<h3>The 2048-bit Moduli</h3>
<p><a href="https://www.sshaudit.com/hardening_guides.html">SSH hardening guides</a> recommend removing entries from <code>/etc/ssh/moduli</code> that are less than 3072 bits with commands like:</p>
<pre><code>awk '$5 >= 3071' /etc/ssh/moduli > /etc/ssh/moduli.safe
mv /etc/ssh/moduli.safe /etc/ssh/moduli</code></pre>
<p>(This file lists N-bit primes as N-1 bits, because <em>technically</em> the first bit is always 1. IDK that it really makes a difference, but to keep the 3072-bit primes, the command has to match <code>3071</code> instead.)</p>
<p>Aside from the fallback mechanism, this guarantees that the server won’t be able to find any smaller primes to use, even if the client sends a request like “min 1024, prefer 1024” that would <em>normally</em> select smaller primes.</p>
<h3>The Anticlimactic Conclusion</h3>
<p>After all that, I realized that this was theoretical in my case. Since OpenSSH 8.5 (March 2021), the package has supported the post-quantum <code>sntrup761x25519-sha512@openssh.com</code> KEX algorithm, which does not use classic Diffie-Hellman… and therefore, doesn’t use the <code>moduli</code> file. This appears to have been made the default in OpenSSH 8.9 (February 2022), which was included with Ubuntu 22.04 LTS.</p>
<p>I ended up limiting my SSH client via adding to <code>~/.ssh/config</code>:</p>
<pre><code>Host *
KexAlgorithms sntrup761x25519-sha512@openssh.com
Ciphers aes256-gcm@openssh.com</code></pre>
<p>I have AES-NI on both ends. I’ll add chacha20-poly1305 and/or per-host exceptions, if circumstances change.</p>

<h1>Viewing the Expiration Date of an SSH Certificate (2023-12-13)</h1>
<p>A while ago, I set up my server with SSH certificates (following <a href="https://dev.to/gvelrajan/how-to-configure-and-setup-ssh-certificates-for-ssh-authentication-b52">a guide like this one</a>), and today, I began to wonder: “When do these things expire?”</p>
<h2>Host certificate (option 1)</h2>
<p>Among the output of <code>ssh -v</code> (my client is <code>OpenSSH_9.3p1 Ubuntu-1ubuntu3</code>) is this line about the host certificate:</p>
<p><code>debug1: Server host certificate: ssh-ed25519-cert-v01@openssh.com [...] valid from 2023-04-07T19:58:00 to 2024-04-05T19:59:44</code></p>
<p>That tells us the <em>server host certificate</em> expiration date, where it says “valid … to 2024-04-05.” For our local host to continue trusting the server, without using <code>~/.ssh/known_hosts</code> and the trust-on-first-use (<abbr>TOFU</abbr>) model, we must re-sign the server key and install the new signature before that date.</p>
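<p>Re-signing the host key is a single <code>ssh-keygen</code> call; the names and the validity interval below are illustrative, not copied from my actual scripts:</p>
<pre><code>$ ssh-keygen -s host_ca -I demo.example.org -h \
    -n demo.example.org -V +52w \
    /etc/ssh/ssh_host_ed25519_key.pub
# then install the resulting -cert.pub as HostCertificate in sshd_config</code></pre>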
<h2>User certificate</h2>
<p>I eventually learned that <code>ssh-keygen -L -f id_25519-cert.pub</code> will produce some lovely human-readable output, which includes a line:</p>
<p><code>Valid: from 2023-04-07T20:14:00 to 2023-05-12T20:15:56</code></p>
<p>Aha! I seem to have signed the <em>user</em> for an additional month-ish beyond the host key’s signature. I will be able to log into the server without my key listed in <code>~/.ssh/authorized_keys</code> (on the server) until 2023-05-12.</p>
<p>This looks like a clever protection mechanism left by my past self. As long as I log into my server at least once a month, I'll see an untrusted-host warning <em>before</em> my regular authentication system goes down. (If that happened, I would probably have to use a recovery image and/or the VPS web console to restore service.)</p>
<h2>Host certificate (option 2)</h2>
<p>There’s an <code>ssh-keyscan</code> command, which offers a <code>-c</code> option to print certificates instead of keys. It turns out that we can paste its output to get the certificate validity again. (Lines shown with <code>$</code> or <code>></code> are input, after that prompt; the other lines, including <code>#</code>, are output.)</p>
<pre><code>$ ssh-keyscan -c demo.example.org
# demo.example.org:22 SSH-2.0-OpenSSH_8.9p1
# demo.example.org:22 SSH-2.0-OpenSSH_8.9p1
ssh-ed25519-cert-v01@openssh.com AAAA[.....]mcwo=
# demo.example.org:22 SSH-2.0-OpenSSH_8.9p1</code></pre>
<p>The <code>ssh-ed25519-cert</code> line is the one we need. We can pass it to <code>ssh-keygen</code> with a filename of <code>-</code> to read standard input, then use the shell’s “heredoc” mechanism to provide the standard input:</p>
<pre><code>$ ssh-keygen -L -f - <<EOF
> ssh-ed25519-cert-v01@openssh.com AAAA[.....]mcwo=
> EOF</code></pre>
<p>Now we have the same information as before, but from the host certificate. This includes the <code>Valid: from 2023-04-07T19:58:00 to 2024-04-05T19:59:44</code> line again.</p>
<h2>Tips for setting up certificates</h2>
<p>Document what you did, and <em>remember the passphrases for the CA keys!</em> This is my second setup, and now I have scripts to do the commands with all of my previous choices. They’re merely one-line shell scripts with the ssh-keygen command. But they still effectively record everything like the server name list, identity, validity period, and so forth.</p>
<p>To sign keys for multiple users/servers, it may be convenient to add the CA key to an SSH agent. Start a separate one to keep things extra clean, then specify the signing key slightly differently:</p>
<pre><code>$ ssh-agent $SHELL
$ ssh-add user-ca
$ ssh-keygen -Us user-ca.pub ...
(repeat to sign other keys)
$ exit</code></pre>
<p>Note the addition of <code>-U</code> (specifying the CA key is in an agent) and the use of the <code>.pub</code> suffix (the public half of the key) in the signing process.</p>
<h1>The Logging Tarpit (2023-10-07)</h1>
<p>Back in August, Chris Siebenmann had some thoughts on logging:</p>
<ul>
<li><a href="https://utcc.utoronto.ca/~cks/space/blog/sysadmin/LogMonitoringTarpit">Monitoring your logs is mostly a tarpit</a></li>
<li><a href="https://utcc.utoronto.ca/~cks/space/blog/programming/LogMessagesNoPromises">Programs shouldn't commit to fixed and predictable log messages</a></li>
<li><a href="https://utcc.utoronto.ca/~cks/space/blog/programming/LogMessagesManySources">A program's (effective) log messages can have many sources</a> – not as relevant, but linked for completeness</li>
</ul>
<p>A popular response in the comments was “error numbers solve everything,” possibly along with a list (provided by the vendor) detailing all error numbers.</p>
<p>The first problem is, what if the error number changes? MySQL changed from key <em>index</em> to <em>name</em> in their duplicate-key error, and consequently changed from error code 1062 to 1586. Code or tools that were monitoring for 1062 would never hit a match again. Conversely, if “unknown errors” were being monitored for emergency alerts, the appearance of 1586 might get more attention than it deserves.</p>
<p>In other cases, the error numbers may not capture enough information to provide a useful diagnostic. MySQL code 1586 may tell us that there was a duplicate value for a unique key, but we need the <em>log message</em> to tell us <strong>which</strong> value and key. Unfortunately, that is still missing the schema and table!</p>
<p>Likewise, one day, my Windows 10 PC hit a blue screen, and the only information logged was an error code for <code>machine check 0x3E</code>. The message “clarified” that this was a machine check exception with code <code>3e</code>. No running thread/function, no stack trace, <em>no context.</em></p>
<p>Finally, logging doesn’t always fully capture intent. If a log message is generated, is it <em>because of</em> a problem, or is it operationally irrelevant? Deciding this is the <strong>real</strong> tar pit of log monitoring, and the presence of an error number doesn’t really make a difference to it. There’s no avoiding the decisions.</p>
<p>In the wake of Chris’ posts, I changed one of our hacky workaround services to log a message if it <em>decides to take action</em> to fix the problem. Now we have the opportunity to find out if the service <strong>is</strong> taking action, not simply being started regularly. Would allocating an error number (for all time) help with that?</p>
<p>All of this ends up guiding my log monitoring philosophy: <strong>look at the logs sometimes, and find ways to clear the highest-frequency messages.</strong> I don't want dozens of lines of <em>known uninteresting</em> messages clogging the log during incident response. For example, “can’t load font A, using fallback B.” We’d either install font A properly, or mute the message for font A, specifically. But, I want to avoid trying to categorize every single message, because that way lies madness.</p>
<h1>AWS: Requesting gp3 Volumes in SSM Automation Documents (2023-09-29)</h1>
<p>I updated our EC2 instance-building pipeline to use the <code>gp3</code> volume type, which offers more IOPS at lower costs.</p>
<p>Our initial build runs as an SSM [Systems Manager] Automation Document. The first-stage build instance is launched from an <a href="https://cloud-images.ubuntu.com/locator/ec2">Ubuntu AMI</a> (with <code>gp2</code> storage), and produces a “core” image with our standard platform installed. This includes things like monitoring tools, our language runtime, and so forth. The core image is then used to build final AMIs that are customized to specific applications. That is, the IVR system, Drupal, internal accounting platform, and antique monolith all have separate instances and AMIs underlying them.</p>
<p>Our specific SSM document uses the <code>aws:runInstances</code> action, and one of the optional inputs to it is <strong>BlockDeviceMappings</strong>. Through some trial and error, I found that the value it requires is the same structure as the AWS CLI uses:</p>
<pre><code>- DeviceName: "/dev/sda1"
  Ebs:
    VolumeType: gp3
    Encrypted: true
    DeleteOnTermination: true</code></pre>
<p><strong>Note 1:</strong> this is in YAML format, which requires spaces for indentation. Be sure “Ebs” is indented two spaces, and the subsequent lines four spaces. The structure above is a 1-element array, containing a dictionary with two keys, and the “Ebs” key is another dictionary (with 3 items.)</p>
<p><strong>Note 2:</strong> the DeviceName I am using comes from the Ubuntu AMI that I am using to start the instance. DeviceName <strong>may vary</strong> with different distributions. Check the AMI you are using for its root device setting.</p>
<p>The last two lines (Encrypted and DeleteOnTermination) may be unnecessary, but I don’t like leaving things to chance.</p>
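<p>For context, here is roughly how that mapping sits inside the step in our automation document; the step name, instance type, and parameter reference are illustrative stand-ins:</p>
<pre><code>mainSteps:
  - name: launchBuildInstance
    action: aws:runInstances
    inputs:
      ImageId: "{{ SourceAmiId }}"
      InstanceType: t3.medium
      BlockDeviceMappings:
        - DeviceName: "/dev/sda1"
          Ebs:
            VolumeType: gp3
            Encrypted: true
            DeleteOnTermination: true</code></pre>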
<p>Doing this in a launch template remains a mystery. The best I have been able to do when trying to use a launch template is get Amazon to warn me that it’s planning to ignore the entire volume as described in the template. It appears as if it will replace the volume with the one from the AMI, rather than merging the configurations.</p>
<p>I know I have complained about Amazon in the past for not providing a “launch from template” operation in SSM, but in this case, it appears to have worked out in my favor.</p>
<h1>Using Cloudflare DNS over TLS system-wide on Ubuntu (2023-09-28)</h1>
<p>My current Linux distributions (Ubuntu 23.04 and the Ubuntu-derived Pop!_OS 22.04) use NetworkManager for managing connections, and <code>systemd-resolved</code> for resolving DNS queries. I’ve set up Cloudflare’s public DNS service with DoT (DNS over TLS) support twice… and I don’t really have a solid conclusion. Is one “better?” 🤷🏻</p>
<h2>Contents</h2>
<ul>
<li>Per-connection mode with NetworkManager only</li>
<li>Globally with systemd-resolved / NetworkManager</li>
<li>Useful background info</li>
</ul>
<h2>Per-Connection Mode with NetworkManager</h2>
<p>First, I set the IP addresses <a href="https://developers.cloudflare.com/1.1.1.1/setup/">according to instructions</a> via the NetworkManager GUI, although I could have used the command line as well. The latter would have been like:</p>
<pre><code>$ nmcli c modify id "AirTubes WiFi" \
ipv4.dns 1.1.1.1,1.0.0.1</code></pre>
<p>And similar for the IPv6 addresses with the <code>ipv6.dns</code> setting. The “AirTubes WiFi” is the connection name, visible as the “NAME” column of <code>nmcli c show</code>. Using <code>nmcli c</code> is short for <code>nmcli connection</code>. For brevity, this post will always use the abbreviation.</p>
<p>The laptop also has an outbound firewall, so I added a rule to allow port 853 out:</p>
<pre><code>$ sudo ufw allow out \
proto tcp to any port 853 \
comment "DNS over TLS"</code></pre>
<p>Finally, activating DNS over TLS required the NetworkManager command line (this setting is not available in other interfaces):</p>
<pre><code>$ nmcli c modify id "AirTubes WiFi" \
connection.dns-over-tls opportunistic</code></pre>
<p>I chose “opportunistic” mode for DoT instead of “yes”, reasoning that for the most part, I’m at home. I’ll forget about this entire thing if I’m out, and that’s when I need the internet to just work, even if the <em>network</em> is blocking port 853.</p>
<p>Only then did I realize the flaw in per-connection settings: <strong>these would not apply if I were actually out.</strong> There’s no way a public network would be called “AirTubes WiFi,” so NetworkManager would consider it a different connection.</p>
<p>I put the nmcli settings back to their defaults:</p>
<pre><code>$ nmcli c modify id "AirTubes WiFi" \
connection.dns-over-tls ""
$ nmcli c modify id "AirTubes WiFi" \
ipv4.dns ""</code></pre>
<p>And one more time, for <code>ipv6.dns</code>.</p>
<p>Then, I went for global mode.</p>
<h2>Globally with systemd-resolved / NetworkManager</h2>
<p>The firewall settings to allow port 853 out, above, remained in place.</p>
<p>This approach uses a drop-in file for <code>systemd-resolved</code>, configuring it to do DNS over TLS with CloudFlare by default. The file looks like this, without comments, and with the DNS line abbreviated:</p>
<pre><code>[Resolve]
DNS=1.1.1.1#cloudflare-dns.com ... ...
DNSOverTLS=yes</code></pre>
<p>In fact, the DNS line is from the example value for CloudFlare, given in <code>/etc/systemd/resolved.conf</code>. It was abbreviated here for the blog to display better on phones.</p>
<p>Also note the capitalization here. <code>DNSoverTLS</code>, with a lowercase <code>o</code>, is wrong, and will be <em>ignored.</em></p>
<p>Once it’s ready, the file gets copied into place:</p>
<pre><code>$ sudo mkdir /etc/systemd/resolved.conf.d
$ sudo cp local-resolve.conf \
/etc/systemd/resolved.conf.d</code></pre>
<p>I changed NetworkManager from fully “Automatic” mode to “Automatic (addresses only)”, then left the DNS Servers configuration blank. It appears that the command-line method to do this (if necessary) is:</p>
<pre><code>$ nmcli c modify id "AirTubes WiFi" \
ipv4.ignore-auto-dns yes
$ nmcli c modify id "AirTubes WiFi" \
ipv6.ignore-auto-dns yes</code></pre>
<p>Then, it was a matter of reloading the configurations:</p>
<pre><code>$ sudo systemctl restart systemd-resolved.service
$ nmcli c down id "AirTubes WiFi"
$ nmcli c up id "AirTubes WiFi"</code></pre>
<p>After that, using tcpdump to show traffic to port 853 started reporting packets captured! I was in business.</p>
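<p>The check was something along these lines:</p>
<pre><code>$ sudo tcpdump -ni any 'tcp port 853'</code></pre>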
<p>The limitation of <em>this</em> approach is that I have to remember to set up any new network connections this way: addresses only, no DNS. Otherwise, NetworkManager will tell <code>systemd-resolved</code> what DNS settings it wants to use, and they will be applied to the connection.</p>
<h2>Useful Background Info</h2>
<p>As mentioned in passing above, a NetworkManager setting can be reverted to its default value by setting the empty string:</p>
<pre><code>$ nmcli c modify id "AirTubes WiFi" \
connection.dns-over-tls ""</code></pre>
<p>A detailed listing of all of a connection’s settings can be seen with:</p>
<pre><code>$ nmcli c show id "AirTubes WiFi" | less</code></pre>
<p>More information about the settings are also available via man page.</p>
<pre><code>$ man nm-settings-nmcli</code></pre>
<p>Getting somewhat unrelated, there is a command to search man pages for keywords. This is how I discovered that there is an <code>nmtui</code> (text user interface) command, which gives a terminal-based menu similar to the GUI. Anyway, to search the man pages for NetworkManager:</p>
<pre><code>$ man -k NetworkManager</code></pre>
<p>Firewall rules can be restricted by IP. It was straightforward enough to do IPv4, but for IPv6, the address needs to be “complete” (note the double colon):</p>
<pre><code>$ sudo ufw allow out proto tcp \
to 1.0.0.0/15 port 853 \
comment 'CloudFlare DoT'
$ sudo ufw allow out proto tcp \
to 2606:4700:4700::/112 port 853 \
comment 'CloudFlare DoT v6'</code></pre>
<p>I chose the network masks so that one rule would cover both respective addresses, without allowing the port to the entire Internet. (Discussion of network masks and <a href="https://www.digitalocean.com/community/tutorials/understanding-ip-addresses-subnets-and-cidr-notation-for-networking">CIDR notation</a> for them—the /15 and /112—are out of scope for this blog post.)</p>
<p>When using <code>systemd-resolved</code> for name resolution, the status can be inspected with its own command:</p>
<pre><code>$ resolvectl status</code></pre>
<p>Here, the output should say things like <code>+DNSOverTLS</code>, and there shouldn’t be any <em>per-link</em> overrides of the “Global” section. It should be more like (reformatted to reduce width):</p>
<pre><code>Global
Protocols: -LLMNR -mDNS +DNSOverTLS
DNSSEC=no/unsupported
resolv.conf mode: stub
DNS Servers 1.1.1.1#cloudflare-dns.com
1.0.0.1#cloudflare-dns.com
2606:4700:4700::1111#cloudflare-dns.com
2606:4700:4700::1001#cloudflare-dns.com
Link 2 (enp2s0)
Current Scopes: none
Protocols: -DefaultRoute +LLMNR -mDNS
+DNSOverTLS DNSSEC=no/unsupported
Link 3 (wlx........)
Current Scopes: none
Protocols: -DefaultRoute +LLMNR -mDNS
+DNSOverTLS DNSSEC=no/unsupported</code></pre>
<p>I hope that covers it!</p>
<h1>Update on earlyoom (2023-09-20)</h1>
<p>Back in <a href="https://www.decodednode.com/2022/12/linux-behavior-without-swap.html">Linux Behavior Without Swap</a>, I noted that the modern Linux kernel will still let the system thrash. The OOM killer does not come out until it is <em>extremely</em> desperate, long after responsiveness is near zero.</p>
<p>It has been long enough since installing <a href="https://github.com/rfjakob/earlyoom">earlyoom</a> on our clouds that I did another stupid thing. earlyoom was able to terminate the script, and the instance stayed responsive.</p>
<p>I also mentioned “swap on zram” in that post. It turns out, the ideal use case for <code>zram</code> is when <strong>there is no other swap device.</strong> When there’s a disk-based swap area (file or partition), one should <a href="https://www.addictivetips.com/ubuntu-linux-tips/enable-zswap-on-linux/">activate <code>zswap</code> instead</a>. zswap acts as a front-end buffer to swap, keeping compressible pages in a compressed RAM pool and letting the rest go to the swap device.</p>
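<p>Enabling zswap is quick; the sysfs path and kernel parameter below are the standard ones, but check them against your kernel’s documentation:</p>
<pre><code># try it immediately, without a reboot
echo 1 | sudo tee /sys/module/zswap/parameters/enabled
# to persist it, add zswap.enabled=1 to GRUB_CMDLINE_LINUX_DEFAULT
# in /etc/default/grub, then run: sudo update-grub</code></pre>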
<p>One other note, <code>zswap</code> is compiled into the default Ubuntu kernels, but <code>zram</code> is part of the rather large <code>linux-modules-extra</code> package set. If there’s no other need for the extra modules, uninstalling them saves a good amount of disk space.</p>
<h1>Upgrading a debootstrap'ped Debian Installation (2023-09-09)</h1>
<p>For context, last year, I <a href="https://www.decodednode.com/2022/12/debootstraping-recovery-partition.html">created a Debian 11 recovery partition using debootstrap</a> for recovering from issues with the <code>88x2bu</code> <a href="https://github.com/morrownr/88x2bu-20210702">driver</a> I was using.</p>
<p>This year, I realized <a href="https://mjg59.dreamwidth.org/67126.html">while reading</a> that I have never used anything but Ubuntu’s <code>do-release-upgrade</code> tool to upgrade a non-rolling-release system. <a href="https://www.decodednode.com/2023/06/installing-debian-12-and-morrownrs.html">My Debian 12 desktop</a> felt far less polished than Ubuntu Studio, so I reinstalled the latter, and that means I once again don’t need the <code>88x2bu</code> driver.</p>
<p>Therefore, if I trashed the Debian partition, it wouldn’t be a <em>major</em> loss. It was time to experiment!</p>
<h2>Doing the upgrade</h2>
<p>The update process was straightforward, if more low-level than <code>do-release-upgrade</code>. There are a couple of different procedures online that vary in their details, so I ended up <del>winging it</del> combining them:</p>
<ol>
<li>apt update</li>
<li>apt upgrade</li>
<li>apt full-upgrade</li>
<li>apt autoremove --purge</li>
<li>[edit my sources from <code>bullseye</code> to <code>bookworm</code>]</li>
<li>apt clean</li>
<li>apt update</li>
<li>apt upgrade --without-new-pkgs</li>
<li>apt full-upgrade</li>
<li>reboot</li>
<li>apt autoremove</li>
</ol>
<p>DKMS built the <code>88x2bu</code> driver for the new kernel, and userspace appeared to be fine.</p>
<h2>Fixing the Network</h2>
<p>The link came up with an IP, but the internet didn’t work: there was no DNS. I didn’t have systemd-resolved, named, dnsmasq, nor nscd. Now, to rescue the rescue partition, I rebooted into Ubuntu, chroot’ed to Debian, and installed <code>systemd-resolved</code>.</p>
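<p>The chroot dance was roughly the usual one; the device node below is a placeholder for wherever the Debian root partition lives:</p>
<pre><code>sudo mount /dev/sdXN /mnt
for d in dev proc sys; do sudo mount --bind /$d /mnt/$d; done
sudo chroot /mnt apt install systemd-resolved
# unmount everything in reverse order when finished</code></pre>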
<h2>Fixing Half of the Races</h2>
<p>One of the Debian boots left me confused. Output to the console appeared to have stopped after some USB devices were initialized. I thought it had crashed. I unplugged a keyboard and plugged it in, generating more USB messages on screen, so I experimentally pressed <code>Enter</code>. What I got was an <code>(initramfs)</code> prompt! The previous one had been lost in the USB messages printed after it had appeared.</p>
<p>It seems that the kernel had done something different in probing the SATA bus vs. USB this time, and <code>/dev/sdb3</code> didn’t have the root partition on it. I ended up rebooting (I don’t know how to get the boot to resume properly if I had managed to mount <code>/root</code> by hand.)</p>
<p>When that worked, I updated the Ubuntu partition’s <code>/boot/grub/custom.cfg</code> to use the UUID instead of the device path for Debian.</p>
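<p>The entry in <code>custom.cfg</code> ends up shaped like this; the UUID and file names below are placeholders (get the real UUID from <code>lsblk -f</code> or <code>blkid</code>):</p>
<pre><code>menuentry "Debian recovery" {
    search --no-floppy --fs-uuid --set=root 1234abcd-0000-4000-8000-56789abcdef0
    linux /vmlinuz root=UUID=1234abcd-0000-4000-8000-56789abcdef0 ro
    initrd /initrd.img
}</code></pre>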
<p>It seems that the kernel itself only supports <em>partition</em> UUIDs, but Debian and Ubuntu use initrds (initial RAM disks) that contain the code needed to find the <em>filesystem</em> UUID. That’s why <code>root=UUID={fs-uuid}</code> has always worked for me! Including this time.</p>
<p><code>os-prober</code> (the original source of this entry) has to be more conservative, though, so it put <code>root=/dev/sdb3</code> on the kernel command line instead.</p>
<h2>The Unfixed Race</h2>
<p>Sometimes, the <code>wlan0</code> interface can’t be renamed to <code>wlx{MAC_ADDRESS}</code> because the device is busy. I tried letting <code>wlan0</code> be an alias in the configuration for the interface (using <code>systemd-networkd</code>) but it doesn’t seem to take.</p>
<p>I resort to simply rebooting if the login prompt doesn’t reset itself and show a DHCP IP address in the banner within a few seconds.</p>
<p>You have to admire the kernel and systemd teams’ dedication to taking a stable, functional process and replacing it with a complex and fragile mess.</p>
<h2>A Brief Flirtation with X11</h2>
<p>I installed Cinnamon. It <em>ran out of space;</em> I ran <code>apt clean</code> and then continued, successfully. This is my fault; I had “only” given the partition 8 GiB, because I expected it to be a CLI-only environment.</p>
<p>Cinnamon, however, is <em>insistent</em> on NetworkManager, and I was already using systemd-networkd. It’s very weird to have the desktop showing that there is “no connection” while the internet is actually working fine.</p>
<p>Due to the space issue, I decided to uninstall everything and go back to a minimal CLI. I would definitely not be able to perform another upgrade to Debian 13, for instance, and it was unclear if I would even be able to do <em>normal</em> updates.</p>
<h2>In Conclusion</h2>
<p>The Debian upgrade went surprisingly well, considering it was initially installed with <code>debootstrap</code>, and is therefore an unusual configuration.</p>
<p>Losing DNS might have been recoverable by editing <code>/etc/resolv.conf</code> instead, but I wasn’t really in a “fixing this from here is my only option” space. Actually, one might wonder what happened to the DHCP-provided DNS server? I don’t know, either.</p>
<p>Trying to add X11 to a partition never designed for it did not work out, but it was largely a whim anyway.</p>
<h1>Sound for Firefox Flatpak on Ubuntu 23.04 with PipeWire (2023-08-06)</h1>
<p>I reinstalled Ubuntu Studio recently, <a href="https://www.baeldung.com/linux/snap-remove-disable">excised all Snaps</a>, and installed Firefox from Flatpak. Afterward, I didn’t have any audio output in Firefox. Videos would just freeze.</p>
<p>I don’t know how much of this is <em>fully necessary,</em> but I quit Firefox, installed more PipeWire pieces, and maybe signed out and/or rebooted.</p>
<pre><code>sudo apt install pipewire-audio pipewire-alsa</code></pre>
<p>The <code>pipewire-jack</code> and <code>pipewire-pulse</code> packages were already installed.</p>
<p>AIUI, this means that “PipeWire exclusively owns the audio hardware” and provides ALSA, JACK, Pulse, and PipeWire interfaces <em>into</em> it.</p>
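<p>A quick way to confirm that the Pulse interface really is PipeWire underneath (assuming <code>pactl</code> is installed) is to check the server name it reports:</p>
<pre><code>$ pactl info | grep 'Server Name'
# expect something mentioning PipeWire rather than plain PulseAudio</code></pre>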
<p>It’s not perfect. Thunderbird (also flatpak; I’d rather have “some” sandbox than “none”) makes a bit of cacophony when emails come in, but at least there’s sound for Firefox.</p>

<h1>Boring Code Survives (2023-07-04)</h1>
<p>Over on <a href="https://utcc.utoronto.ca/~cks/space/blog/">Wandering Thoughts</a>, Chris writes about <a href="https://utcc.utoronto.ca/~cks/space/blog/python/SantoolsCodeDurability">some fileserver management tools</a> being fairly unchanged over time by changes to the environment. There is a Python 2 to 3 conversion, and some changes when the disks being managed are no longer on iSCSI, “but in practice a lot of code really has carried on basically as-is.”</p>
<p>This is completely different than my experience <strong>with async/await</strong> in Python. Async was new, so the library I used with it was in 0.x, and in 1.0, the authors <em>inverted the entire control structure.</em> Instead of being able to create an AWS client deep in the stack and return it upwards, clients could only be used as context managers. It was quite a nasty surprise.</p>
<p>To allow testing for free, my code dynamically instantiated a module to “manage storage,” and whether that was AWS or in-memory was an implementation detail. Suddenly, one of the clients couldn’t write <code>self.client = c; return</code> anymore. The top-level had to know about the change. <em>Other storage clients</em> would have to know about the change, to become context managers themselves, for no reason.</p>
<p>I held onto the 0.x version for a while, until the Python core team felt like “explicit event loop” was a mistake big enough that <strong>everyone’s</strong> code had to be broken.</p>
<p>Async had been hard to write in the first place, because so much example code out there was for the <code>asyncio</code> module’s decorators, which had preceded the actual async/await syntax. What the difference between tasks and coroutines even was, and why one should choose one over the other, was never clear. Why an explicit <code>loop</code> parameter should exist was especially unclear, but it was “best practice” to include it everywhere, so everyone did. Then Python set it on fire.</p>
<p>(I never liked the Python packaging story, and <a href="https://www.decodednode.com/2020/04/pipenvs-surprise.html">pipenv didn’t solve it.</a> To pipenv, every Python minor version is an incompatible version?)</p>
<p>I had a rewrite on my hands either way, so I went <a href="https://www.decodednode.com/2021/12/the-best-tool-for-job.html">looking for something else</a> to rewrite in, and v3 is in Go. The <a href="https://github.com/sapphirecat/cloud-maker/">other Python I was using</a> in my VM build pipeline was replaced with a half-dozen lines of shell script. It’s much less flexible, perhaps, but it’s clear and concise now.</p>
<p>In the end, it seems that <strong>boring</strong> code survives the changing seasons. If you’re just making function calls and doing some regular expression work… there’s little that’s likely to change in that space. If you’re <a href="https://journal.stuffwithstuff.com/2015/02/01/what-color-is-your-function/">coloring functions</a> and people are <a href="https://trio.readthedocs.io/en/stable/">inventing brand-new libraries</a> in the space you’re working in, your code will find its environment altered much sooner. The newer, fancier stuff is inherently closer to the fault-line of future shifts in the language semantics.</p>

<h1>Installing Debian 12 and morrownr's Realtek driver (2023-06-25; 2023-07-03 edit)</h1>
<p>Due to deepening dissatisfaction with Canonical, I replaced my Ubuntu Studio installation with Debian 12 “bookworm” recently.</p>
<p>tl;dr:</p>
<ol>
<li>My backups, including the driver source, were compressed with lzip <a href="https://www.nongnu.org/lzip/safety_of_the_lzip_format.html">for reasons</a>, but I fell back on a <strong><a href="https://www.decodednode.com/2022/12/debootstraping-recovery-partition.html">previously-built</a> rescue partition</strong> to get the system online.</li>
<li>I ended up with an improper <code>grub</code> installation, that couldn’t find <code>/boot/grub/i386-pc/normal.mod</code>. I rebooted the install media in rescue mode, got a shell in the environment, verified the disk structure with <code>fdisk -l</code>, and then ran <code>grub-install /dev/___</code> to fix it. Replace the blank with your device, but beware: using the wrong device may make the OS on it unbootable.</li>
<li>The USB doesn’t work directly with apt-cdrom to install more packages offline. I “got the ISO back” from <code>dd if=/dev/sdd of=./bookworm.iso bs=1M count=4096 status=progress conv=fsync</code> (1M * 4096 = 4G total, which is big enough for the 3.7G bookworm image; you may need to adjust to suit), then made it available with <code>mount -o loop,ro ~+/bookworm.iso /media/cdrom0</code> (the mount point is the target of <code>/media/cdrom</code>.)</li>
<li>Once finished, I found out the DVD had <code>plzip</code>, and if I’d <em>searched</em> for it (lzip), I could have used it (plzip). I didn’t actually need the rescue partition.</li>
<li>Once finished, I realized I hadn’t needed to <code>dd</code> the ISO back from the USB stick. The downloaded ISO was on my external drive all along, and I could have loop-mounted that.</li>
<li>[Added 2023-07-02]: Letting the swap partition get formatted gave it a new UUID. Ultimately, I would need to update the recovery partition’s <code>/etc/fstab</code> with the new UUID, and follow up with <code>update-initramfs -u</code> to get the recovery partition working smoothly again.</li>
</ol>
<p>Full, detailed rambling with too much context (as usual) below.</p>
<h3>The Wi-Fi driver</h3>
<p>My Wi-Fi adapter is an Archer T4U Plus (realtek / RTL 8822BU) wi-fi adapter. To get the system online prior to kernel 6.2, I need to build <a href="https://github.com/morrownr/88x2bu">morrownr’s driver</a> (generic link, not the specific version.) I knew from DistroWatch that Debian 12 has kernel 6.1, so I downloaded the DVD-1 image to be reasonably sure it had all of the development tools necessary to do this. I did not expect Debian to provide this driver on the DVD.</p>
<p>I had a working, up-to-date copy of the driver code on the Ubuntu partition I would be replacing. That code was included into my backups, and copied onto an external hard disk. The plan was to install Debian, install <code>build-essential</code>, <code>iw</code>, and the like from the install media, unpack my backups, and build the driver.</p>
<h3>On backups</h3>
<p>I ran into this issue last, but it is <em>incredibly important</em> to know before beginning a process like this, so I’m putting it first: my backups were compressed with lzip. It wasn’t pre-installed, there was no <code>lzip</code> package, and I didn’t know about <code>plzip</code> or do any searches that would find it. I may have tried harder in different circumstances, but I already had another plan: I also had a rescue partition I built (Debian 11 / bullseye) for when I broke the Wi-Fi driver in Ubuntu.</p>
<p>To get at the backups, then, I jumped into the rescue partition (with networking), chroot’ed to the new install, and pulled <code>lzip</code> from the online repositories.</p>
<h3>The missing normal.mod</h3>
<p>The installation appeared to go smoothly, but on rebooting, I met my first problem: grub couldn’t find <code>/boot/grub/i386-pc/normal.mod</code>. I <em>thought</em> “do not install on my primary hard drive” would have meant “don’t write it onto /dev/sda”, especially since I could pick “sdb” from the next menu. This did indeed skip installing to sda, but also clearly didn’t finish installing to sdb and sdb1.</p>
<p>(When I first installed it, Ubuntu managed to irretrievably corrupt Windows booting when I wrote the bootloader to the primary disk. Windows won’t install itself over grub, either. I found something else that was supposed to be a Windows-flavor bootblock; it didn’t allow Windows 10 to boot, but it <em>did</em> allow Windows to reinstall to that drive/partition afterward. It even kindly hid all my user/app data in C:\Windows.old without telling me. yay.)</p>
<p>To fix the issue with grub on Debian 12, I rebooted into the installation media in rescue mode, had a worrying number of identical prompts to the initial installation, and finally got into a shell in the new root partition. From there, I figured out the necessary commands:</p>
<p><strong>CAUTION: your device nodes will differ. You can break things doing this. Be sure you have the right device! I have changed the drive letter to something extremely unlikely to ruin your OS if you copy and paste this, but you have been warned anyway.</strong></p>
<pre><code># fdisk -l /dev/sdz
[made sure it’s my 256 GB Linux SSD,
not my 1 TB SSD with Windows]
# grub-install /dev/sdz
# ls /boot/grub/i386-pc
[there should be a lot of files there now]</code></pre>
<p>After rebooting, I was blessed with the Grub menu, instead of the error and the <code>grub rescue></code> prompt. The system was then able to boot into Debian 12 just fine.</p>
<p>Preview: the <em>other</em> problem with the installation was that I let it reformat the swap partition, which gave it a new UUID. I should have changed it to “use as swap, do not format.”</p>
<h3>The USB stick is not /media/cdrom</h3>
<p>The next order of business was to get <code>build-essential</code> and friends installed. Unfortunately, trying to run apt commands would only endlessly ask for the disc to be put into /media/cdrom.</p>
<p>The internet suggested copying the image of the USB stick back into a file, then loop-mounting that.</p>
<p>(I would realize later that I had saved the ISO on an external/portable drive, and I could have loop-mounted that directly, instead of using the <code>dd</code> command below. But since that’s not what I did…)</p>
<p>First, I had to log out and log back in to unmount the stick. I had tried reading some of the offline documentation, which left a ‘zygote’ browser process running in the background with its working directory on the stick, preventing unmount.</p>
<p>Anyway, we had to learn about the mount point.</p>
<pre><code># ls -ld /media/cdrom
[…] -> cdrom0
# ls -ld /media/cdrom0
drwxr-xr-x […] /media/cdrom0</code></pre>
<p>Okay. apt wants to use /media/cdrom, which is a symlink to a normal directory. We’ll get the image, then we’ll mount it over /media/cdrom0, and finally we’ll hope it works even though we never saw a Packages.gz or anything on the USB file system.</p>
<p>The commands for getting the image back off the USB (attached at <code>/dev/sdd</code>) and using it are:</p>
<pre><code># dd if=/dev/sdd of=bookworm.iso bs=1M count=4096 \
status=progress conv=fsync
# mount -o loop,ro ~+/bookworm.iso /media/cdrom0
# apt install build-essential make</code></pre>
<p><strong>Note for the future:</strong> I knew the image was 3.7G from putting it on the USB stick in the first place, so the command above dumps the first 4.0G of the stick (4096 blocks × 1 MiB) back to a file. Your image may be larger. (I actually ran it without a <code>count</code> at all, and it started dumping my entire 30G stick, so I interrupted it with Ctrl+C when I noticed around the 10G mark.)</p>
<p>Hey, good news: it worked. I got a bunch of packages installed, without setting up a network connection.</p>
<h3>Unpacking the backups</h3>
<p>I finally learned I couldn’t unpack my backups. They were compressed with <a href="https://www.nongnu.org/lzip/">lzip</a>, but the DVD only contained gzip, bzip2, xz, and lzma (from xz-utils).</p>
<p>I would find out later that the DVD contains <code>plzip</code>, a parallel version of lzip that is fully compatible with it, but I guess I didn’t <em>search</em> for lzip so much as expected to find “the package <em>named</em> lzip.”</p>
<p>Fortunately, I had run into problems rebuilding the Wi-Fi driver in the past, compounded by its recommendation to reboot and my own misunderstanding of how to do an upgrade. (tl;dr on that: <strong>never</strong> run <code>remove-driver.sh</code> and reboot if you have dkms. Installing the new version removes the old one from dkms anyway, which is effectively the same thing, but it does so <em>after</em> checking the installation requirements, which is helpful when a new requirement has been added.)</p>
<p>So… I had used <code>resize2fs</code> to make some space, modified my partition table (adding swap space while I was in there, <a href="https://www.decodednode.com/2022/12/linux-behavior-without-swap.html">because having no swap hurts</a>), and made room for a Debian 11 install (<a href="https://www.decodednode.com/2022/12/debootstraping-recovery-partition.html">notes on that</a>). The Debian 11 install also needed the driver, but the basic theory was that I should have a working driver in one OS or the other at all times. I’d upgrade the driver on the Ubuntu partition periodically, and if it worked, go upgrade the driver in Debian. Ubuntu could fix Debian, or vice versa, depending on what happened.</p>
<h3>Aside: swap and UUID</h3>
<p>During the Debian 12 installation, I had done a custom installation, formatting my Ubuntu partition as ext4 for <code>/</code>, and letting it format the swap partition for swap, which it wanted to do already.</p>
<p>When I booted Debian 11, I had to wait a minute and a half for the swap mount to time out. It had been listed by UUID in <code>/etc/fstab</code>, so systemd waited for the UUID to appear, but it had been overwritten.</p>
<p>I used <code>lsblk -f</code> to get the updated UUID, copied it, and updated the fstab file. (Pro tip from times long past: if you’ve installed Linux text-only, also install <code>gpm</code> to copy and middle-click-paste with the mouse on the console.)</p>
<p>[<strong>Updated 2023-07-03:</strong> I would later discover I should have run <code>update-initramfs -u -k all</code> afterward, to put the new swap UUID into initramfs. Otherwise, <em>the initrd</em> waits around a bit for the original UUID to appear. This is what it’s doing when it repeatedly prints something like <code>Begin: Running /scripts/local-bottom ... done.</code> It’s trying to wait for local filesystems before continuing the init process.]</p>
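<p>Pieced together, the whole fix looks something like this (the usual warning applies: your device names and UUIDs will differ):</p>
<pre><code># lsblk -f
[copy the new UUID of the swap partition]
# nano /etc/fstab
[replace the old swap UUID with the new one]
# update-initramfs -u -k all</code></pre>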
<h3>Using my recovery partition</h3>
<p>Back to the present. With Debian 11 booted, I had to do a few things to get everything working. Properly entering Debian 12 as a chroot required:</p>
<pre><code># mount /dev/sdb1 /mnt
# for i in dev dev/pts sys proc
do mount --bind "/$i" "/mnt/$i"
done
# cp /etc/resolv.conf /mnt/etc/resolv.conf
# cp /etc/apt/sources.list \
/mnt/etc/apt/sources.list.d/network.list
# chroot /mnt /bin/bash
# nano /etc/apt/sources.list.d/network.list
[change ‘bullseye’ to ‘bookworm’]
[change ‘main’ to ‘main non-free-firmware’]
# apt update</code></pre>
<p>If <code>/dev/pts</code> isn’t available in the chroot, then <code>sudo</code> will fail with “Could not allocate pty.” The shaggy dog here is that <code>sudo</code> wasn’t actually important.</p>
<p>If <code>/etc/resolv.conf</code> isn’t available, DNS doesn’t work, and apt can’t really get anything from the network.</p>
<p>If the copied sources.list file has the wrong distribution… Bad Things May Happen. Fortunately, I hadn’t copied the resolv.conf the first time, so it just failed to update, and I spotted the wrong distro in the output rather quickly.</p>
<p>The final change, adding <code>non-free-firmware</code>, quiets a message from apt about that component, and it was one fix I didn’t have to look up on the internet. Quite a bit of time had been spent already.</p>
<p>Anyway, that let me install <code>lzip</code> and <code>rfkill</code> from the tubes. I probably could have gotten rfkill locally, but since I had the network temporarily working, this was the path of least resistance.</p>
<p>I unpacked the driver from my backups and tried to run <code>install-driver.sh</code>, but it needs headers for the kernel <em>currently running,</em> which of course is not from the right distribution, so the headers aren’t there. I had to reboot back into Debian 12 to run the install.</p>
<p>Don’t forget to clean up safely when you’re leaving the chroot:</p>
<pre><code># exit
# umount /mnt/{dev/pts,dev,sys,proc} /mnt
# reboot</code></pre>
<h3>Driver installation</h3>
<p>With all that, the driver installed smoothly in Debian 12.</p>
<p>[<strong>Updated 2023-07-03:</strong> The rest of this section was basically rewritten to be useful, instead of a bland “It worked!”]</p>
<p>It was a matter of changing to the directory with the driver code, <code>88x2bu-20210702</code>, and running <code>sudo ./install-driver.sh</code> from the (still offline at this point) Debian 12 environment. After rebooting, I was able to configure the network in System Settings, and it all worked.</p>
<p>Arcana for anyone who needs it: I installed KDE, and the underlying network configuration tool is NetworkManager. systemd-networkd is not installed.</p>
<p>Per previous experience, I also decided to store the Wi-Fi password “for everyone (unencrypted)”. Last time, with the password in the wallet and the wallet configured to lock along with the screen, I had to type my password twice when unlocking after suspend: once to log in, and once more to unlock the Wi-Fi. That’s too much typing.</p>
<h3>A note about driver uninstallation</h3>
<p>When upgrading to Ubuntu 23.04, I found out that the <code>remove-driver.sh</code> script <strong>may not remove</strong> a file that prevents loading the in-tree driver. It seems that in the <em>current</em> driver, the relevant line is part of <code>/etc/modprobe.d/88x2bu.conf</code>, but I could swear it was somewhere else in older versions.</p>
<p>Regardless, if you upgrade to kernel 6.2+, remove morrownr’s 88x2bu driver, and you still don’t have Wi-Fi, try <code>grep -r 'blacklist.*rtw' /etc/modprobe.d</code> to see if a file shows up. I can’t give any general advice on whether to delete the file or comment out the line, but getting <code>rtw88_8822bu</code> off the list of blocked modules should make it work again. (At least, for my Archer T4U Plus, which is an 8822BU chip.)</p>
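<p>If a file does show up, the fix is probably along these lines (I’m assuming the line lives in <code>88x2bu.conf</code>, as it does in the current driver; adjust to whatever the grep actually finds):</p>
<pre><code># nano /etc/modprobe.d/88x2bu.conf
[comment out or delete the "blacklist rtw88_8822bu" line]
# modprobe rtw88_8822bu</code></pre>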
<h3>The end?</h3>
<p>I want to set up mainline kernels in Debian 12, preferably in a way that I don’t have to track my own from <a href="https://www.kernel.org/">kernel.org</a>, so that I can have 6.2+ and not worry about this driver anymore. It’s such a pain. But, whatever happens there will be a different post.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-70015819797822050182023-03-31T21:03:00.002-04:002023-03-31T21:03:55.576-04:00Passing data from AWS EventBridge Scheduler to Lambda<p>The <a href="https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-run-lambda-schedule.html#eb-schedule-create-rule" target="_blank">documentation</a> was lacking images, or even descriptions of some screens ("Choose Next. Choose Next.") So, I ran a little experiment to test things out.</p><p>When creating a new scheduled event in AWS EventBridge Scheduler, then choosing AWS Lambda: Invoke, a field called "<b>Input</b>" will be available. It's pre-filled with the value of <code>{}</code>, that is, an empty JSON object. This is the value that is passed to the <strong><code>event</code></strong> argument of the Lambda handler:</p><pre>export async function handler(event, context) {
// handle the event
}</pre><p>With an event JSON of <code>{"example":{"one":1,"two":2}}</code>, the handler could read <code>event.example.two</code> to get its value, 2.</p><p>It appears that EventBridge Scheduler allows one complete control over this data, and the <code>context</code> argument is only filled with Lambda-related information. Therefore, AWS provides the ability to include the <code><aws.scheduler.*></code> values in this JSON data, to be passed to Lambda (or ignored) as one sees fit, rather than imposing <i>any</i> constraints of its own on the data format. (Sorry, no examples; I was only testing the basic features.)<br /></p>
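<p>For anyone scripting this instead of clicking through the console, the equivalent CLI call should look roughly like the following. This is an untested sketch: the names and ARNs are placeholders, and the role has to be allowed to invoke the function.</p>
<pre><code>aws scheduler create-schedule \
  --name my-test-schedule \
  --schedule-expression 'rate(5 minutes)' \
  --flexible-time-window Mode=OFF \
  --target '{
    "Arn": "arn:aws:lambda:us-east-1:111111111111:function:my-function",
    "RoleArn": "arn:aws:iam::111111111111:role/my-scheduler-role",
    "Input": "{\"example\":{\"one\":1,\"two\":2}}"
  }'</code></pre>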
<p>Note that the <code>handler</code> example above is written with ES Modules. This requires the Node 18.x runtime in Lambda, along with a filename of "index.mjs".</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-39974628159713522712023-03-27T18:53:00.006-04:002023-04-26T20:46:57.742-04:00DejaDup and PikaBackup, early impressions (Update 2)<p>I tried a couple of backup programs:</p>
<ul>
<li><a href="https://flathub.org/apps/details/org.gnome.DejaDup">DejaDup</a> (the one I had heard of), a front-end for <a href="https://duplicity.gitlab.io/">Duplicity</a></li>
<li><a href="https://flathub.org/apps/details/org.gnome.World.PikaBackup">PikaBackup</a>, a front-end for <a href="https://www.borgbackup.org/">BorgBackup</a></li>
</ul>
<p>I installed both of them as Flatpaks, although deja-dup also has a version in the Pop!_OS 22.04 repository. I have been using DejaDup for four months, and PikaBackup for one month. This has been long enough for DejaDup to make a second full backup, but not so long for Pika to do anything special.</p>
<p><strong>Speed:</strong></p>
<p>For a weekly incremental backup of my data set…</p>
<ul>
<li>DejaDup: about 5 minutes, lots of fan speed changes</li>
<li>PikaBackup: about 1 minute, fans up the whole time</li>
</ul>
<p>Part of Pika’s speed is probably the better exclusion rules; I can use patterns of <code>**/node_modules</code> and <code>**/vendor</code>, to exclude those folders, wherever they are in the tree. With DejaDup, I would apparently have to add each one individually, and I did not want to bother, nor keep the list up-to-date over time.</p>
<p>Part of DejaDup’s slowness might be that it executes thousands of <code>gpg</code> calls as it works. Watching with <code>top</code>, DejaDup is frequently running, and sometimes there’s a <code>gpg</code> process running with it. Often, DejaDup is credited with much less than 100% of a single CPU core.</p>
<p><strong>Features:</strong></p>
<p>PikaBackup offers multiple backup configurations. I keep my main backup as a weekly backup, on an external drive that’s only plugged in for the occasion. I was able to configure an additional hourly backup of my most-often-changed files in Pika. (This goes into <code>~/.borg-fast</code>, which I excluded from the weekly backups.) The hourly backups, covering about 2 GB of files, aren’t noticeable at all when using the system.</p>
<p>Noted under “speed,” PikaBackup offers better control of exclusions. It tracks how long operations took, so I know that it has been <em>exactly</em> 53–57 seconds to make the incremental weekly backups.</p>
<p>On the other hand, Pika appears to <em>always</em> save the backup password. DejaDup gives the user the option of whether it should be remembered.</p>
<p>There is a DejaDup plugin for Caja (the MATE file manager) in the OS repo, which may be interesting to MATE users.</p>
<p><strong>Space Usage:</strong></p>
<p>PikaBackup did the weekly backup on 2023-04-24 in 46 seconds; it reports a total backup size of 28 GB and 982 MB (0.959 GB = 3.4%) written out.</p>
<p>With scheduled backups, Pika offers control of the number of copies kept. One can choose from a couple of presets, or provide custom settings. Of note, these are <em>count-based</em> rather than <em>time-based;</em> if a laptop is only running for 8-9 hours a day, then 24 hourly backups will be able to provide up to 3 days back in time.</p>
<p>For unscheduled backups, it’s not clear that Pika offers any ‘cleanup’ options, because the cleanup is tied to the schedule in the UI.</p>
<p>I do not remember being given many options to control space usage in DejaDup.</p>
<p><strong>Disaster Simulation:</strong></p>
<p>To ensure backups were really encrypted, I rebooted into the OS Recovery environment and tried to access them. Both programs’ CLI tools (<code>duplicity</code> and <code>borgbackup</code>) from the OS repository were able to verify the data sets. I don’t know what the stability guarantees are, but it’s nice that this worked in practice.</p>
<ul>
<li><code>duplicity</code> verified the DejaDup backup in about 9m40s</li>
<li><code>borgbackup</code> verified the PikaBackup backup in 3m23s</li>
</ul>
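<p>For reference, the verification commands were along these lines (the paths are placeholders for wherever each backup lives; the <code>borgbackup</code> package provides the <code>borg</code> command):</p>
<pre><code># borg check /media/backups/pika-repo
# duplicity verify file:///media/backups/dejadup /home/me</code></pre>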
<p>This isn’t a benchmark at all; after a while, I got bored of duplicity being credited with 30% of 1 core CPU usage, and started the borgbackup task in parallel.</p>
<p>Both programs required the password to unlock the backup, because my login keychain isn’t available in this environment.</p>
<p>Curiously, <code>borgbackup</code> changed the ownership of a couple of files in the repository during the verification: the <code>config</code> and <code>index</code> files became owned by root. This made it impossible to access the backups as my normal user, including to take a new one. I needed to return to my admin user and set the ownership back to my limited account. The error message made it clear an unexpected exception had occurred, but wasn’t very useful beyond that.</p>
<p><strong>Major limitations of this post:</strong></p>
<p>My data set is a few GB, consisting mainly of git repos and related office documents. The performance of other data sets is likely to vary.</p>
<p>I started running Pika about the same time that DejaDup wanted to make a second backup, so the full-backup date and number of incremental snapshots since <em>should</em> be fairly close to each other. I expect this to make the verification times comparable.</p>
<p>I haven’t actually done any restores yet.</p>
<p><strong>Final words:</strong></p>
<p>Pika has become my primary backup method. Together, its speed and its support for multiple configurations made hourly backups reasonable, without compromising the offline weekly backup.</p>
<p><strong>Update History:</strong></p>
<p>This post was updated on 2023-03-31, to add information about multiple backups to “Features,” and about BorgBackup’s file permission change during the verification test. Links were added to the list above, and a new “Final Words” concluding section was written.</p>
<p>It was updated again on 2023-04-26, to add the “Space Usage” section, and to reduce “I will probably…” statements to reflect the final decisions made.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-10769055739921586442023-03-16T20:40:00.003-04:002023-03-16T20:40:36.201-04:00Using sshuttle with ufw outbound filtering on Linux (Pop!_OS 22.04)<p>I am using <a href="https://github.com/sshuttle/sshuttle">sshuttle</a> and <a href="https://en.wikipedia.org/wiki/Uncomplicated_Firewall">UFW</a> on my Linux system, and I recently set up outbound traffic filtering (instead of default-allow) in ufw. Immediately, I noticed I couldn’t make connections via sshuttle anymore.</p>
<p>The solution was to add another rule to ufw:</p>
<pre><code>allow out from anywhere to IP 127.0.0.1, TCP port 12300
</code></pre>
<p>Note that this is “all interfaces,” <strong>not</strong> tied to the loopback interface, <code>lo</code>.</p>
<p>Now… why does this work? Why <em>doesn’t</em> this traffic already match one of the “accept all on loopback” rules?</p>
<p>To receive the traffic it is responsible for, <code>sshuttle</code> listens at 127.0.0.1:12300 (by default) and <strong>creates some NAT rules</strong> to redirect traffic for its subnets to that IP and port. That is, running <code>sshuttle -r example.com 192.168.99.0/24</code> creates a NAT rule to catch traffic to any host within <code>192.168.99.0/24</code>. This is done in <a href="https://www.netfilter.org/">netfilter’s</a> <code>nat</code> tables.</p>
<p>UFW has its rules in the <code>filter</code> tables, and <strong>the <code>nat</code> tables run first.</strong> Therefore, UFW sees a packet that has <em>already been redirected,</em> and this redirection changes the packet’s <strong>destination</strong> while its <strong>interface and source</strong> remain the same!</p>
<p>That’s the key to answering the second question: the “allow traffic on loopback” rules are written to allow traffic on <strong>interface <code>lo</code></strong>, and these redirected packets have a different interface (Ethernet or Wi-Fi.) The public interfaces are not expected to have traffic for local addresses on them… but if they do, they don’t get to take a shortcut through the firewall.</p>
<p>With this understanding, we can also see what’s going wrong in the filtering rules. Without a specific rule to allow port 12300 outbound, the packet reaches the default policy, and if that’s “reject” or “deny,” then the traffic is blocked. sshuttle never receives it.</p>
<p>Now we can construct the proper match rule: we need to allow traffic to IP 127.0.0.1 on TCP port 12300, and use either “all interfaces” or our public (Ethernet/Wi-Fi) interface. I left mine at “all interfaces,” in case I should ever plug in the Ethernet.</p>
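<p>In ufw’s syntax, that comes out to something like this (12300 being sshuttle’s default listen port):</p>
<pre><code>sudo ufw allow out to 127.0.0.1 port 12300 proto tcp</code></pre>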
<p>(I admit to a couple of dead-ends along the way. One, allowing port 3306 out didn’t help. Due to the NAT redirection, the firewall never sees a packet with port 3306 itself. This also means that traffic being forwarded by <code>sshuttle</code> can’t be usefully firewalled on the client side. The other problem was that I accidentally created the rule to allow <em>UDP</em> instead of TCP the first time. Haha, oops.)</p>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-39038757394328811912023-02-02T19:30:00.000-05:002023-02-02T19:30:00.224-05:00On Handling Web Forms [2012]<p><em>Editor’s Note: I found this in my drafts from 2012. The login form works as described, but few of the other forms do. However, the login form has not had glitches, even with 2FA being added to it recently. The site as a whole retains its multi-page architecture. Without further ado, the original post follows…</em></p>
<p>I’ve been working on some fresher development at work, which is generally an opportunity for a lot of design and reflection on that design.</p>
<p>Back in 2006 or so, I did some testing and tediously developed the standard “302 redirect on POST responses” technique for preventing pages that handled inbound form data from showing up in the browser history. Thus, sites would be Back- and Reload-friendly, as they’d never show that “About to re-submit a form, do you really want to?” box. (I would bet it was a well-known technique at the time, but <acronym title="Not Invented Here">NIH</acronym>.)</p>
<p>That’s pretty much how I’ve written my sites since, but a necessary consequence of the design is that on submission failure, data for the form must be stored “somewhere,” so it can be retrieved for pre-filling the form after the redirection completes.</p>
<p>My recent app built in this style spends a lot of effort on all that, and then throws in extra complexity: when it detects that you’re logged out, it decides to minimize the session storage size, so it cleans up all the keys. Except for the flash, the saved form data, and the redirection URL.</p>
<p>The latter gets stored because I don’t trust Referer as a rule, and if login fails and redirects back to the login page for a retry, the Referer won’t be accurate by the time a later attempt succeeds. So every page that checks login also stores the form URL to return to after a login.</p>
<p>There’s even an extra layer of keys in the form area, so that each form’s data is stored independently in the session, although I don’t think people actually browse multiple tabs concurrently. All <em>that</em> is serving to do is bloat up the session when a form gets abandoned somehow.</p>
<p>Even then, it <strong>still doesn't work</strong> if the form was inside a jQueryUI dialog, because I punted on that one. The backend and page don’t know the dialog was open, and end up losing the user’s data.</p>
<h3>Simplify, Young One</h3>
<p>That's a whole lot of complexity just to handle a form submission. Since the advent of GMail, YouTube, and many other sites which are <em>only</em> useful with JavaScript enabled, and since this latest project is a strictly internal app, I've decided to throw all that away and try again.</p>
<p>Now, a form submits as an AJAX POST and the response comes back. If there was an error, the client-side code can inform the user, and no state needs to be saved or restored because <em>the page never reloaded.</em> All the state it had is still there.</p>
<p>But while liberating, that much is old news. “Everyone” builds single-page apps that way, right?</p>
<p>But here’s the thing: if I build out separate pages for each form, then I’m effectively building private sections of code and state from the point of view of “the app” or “the whole site.” No page is visible from any other, so each one only has to worry about a couple of things going on in the global navbar.</p>
<p>This means changes roll out more quickly, as well, since users do reload when moving from area to area of the app.</p>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-65515463307041040172023-01-31T19:16:00.003-05:002023-01-31T19:16:43.587-05:00Argon2id Parameters<p><strong>There are no specific recommendations in this post.</strong> You will need to choose parameters for your specific situation. However, it is my hope that this post will add deeper understanding to the recommendations that others make. With that said…</p>
<p>The <em>primary</em> parameter for Argon2 is <strong>memory.</strong> Increasing memory also increases processing time.</p>
<p>The <strong>time cost</strong> parameter is intended to make the running time longer when memory usage can’t be increased further.</p>
<p>The <strong>threads</strong> (or parallelism, or “lanes” when reading the RFC) parameter sub-divides the memory usage. When the memory is specified as 64 MiB, that is the total amount used, whether threads are 1 or 32. However, the synchronization overhead causes a <strong>sub-linear speedup,</strong> and this is more pronounced with smaller memory sizes. SMT cores offer even less speed improvement than the same number of non-SMT cores, as expected.</p>
<p>I did some tests on my laptop, which has 4 P-cores and 8 E-cores (16 threads / 12 physical cores.) The 256 MiB tests could only push actual CPU usage to about 600% (compared to the 1260% we might expect); it took 1 GiB or more to reach 1000% CPU. More threads than cores didn’t achieve anything.</p>
<p>Overall then, <strong>higher threads allow for using more memory, if enough CPU is available</strong> to support the thread count. If memory and threads are both in limited supply, then time cost is the last resort for extending the operation time until it takes long enough.</p>
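<p>If you want a feel for these trade-offs before touching application code, the reference <code>argon2</code> CLI (packaged as “argon2” on Debian and Ubuntu) prints its own timing; a rough sketch, with a throwaway password and salt:</p>
<pre><code># 64 MiB (2^16 KiB), 3 passes, 4 lanes
echo -n 'not-my-password' | argon2 somesalt -id -m 16 -t 3 -p 4
# same memory and passes, 8 lanes
echo -n 'not-my-password' | argon2 somesalt -id -m 16 -t 3 -p 8</code></pre>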
<p>Bonus discovery: in PHP, the argon2id memory is separate from the memory limit. <code>memory_get_peak_usage()</code> reported the same number at the beginning and end of my test script, even for the 1+ GiB tests.</p>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-75964803590583424352023-01-28T06:13:00.000-05:002023-01-28T06:13:20.082-05:00Experiences with AWS<p>Our core infrastructure is still EC2, RDS, and S3, but we interact with a much larger number of AWS services than we used to. Following are quick reviews and ratings of them.</p>
<p><strong>CodeDeploy</strong> has been mainly a source of irritation. It works <em>wonderfully</em> to do all the steps involved in a blue/green deployment, but it is <em>never</em> ready for the next Ubuntu LTS after it launches. As I write, AWS said they planned to get the necessary update out in May, June, September, and September 2022; it is now January 2023 and Ubuntu 22.04 support has not officially been released. <a href="https://github.com/aws/aws-codedeploy-agent/issues/301#issuecomment-1124810261">Ahem.</a> 0/10 am thinking about writing a Go daemon to manage these deployments instead. I am more bitter than a Switch game card.</p>
<p><strong>CodeBuild</strong> has ‘environments’ thrown over the wall periodically. We added our scripts to install PHP from <a href="https://launchpad.net/~ondrej/+archive/ubuntu/php">Ondřej Surý’s PPA</a> instead of having the environment do it, allowing us to test PHP 8.1 separately from the Ubuntu 22.04 update. (Both went fine, but it would be easier to find the root cause with the updates separated, if anything had failed.) “Build our own container to route around the damage” is on the list of stuff to do eventually. Once, the CodeBuild environment had included a buggy version of git that segfaulted unless a config option was set, but AWS did fix that after a while. 9/10 solid service that runs well, complaints are minor.</p>
<p><strong>CodeCommit</strong> definitely had some growing pains. It’s not as bad now, but it remains obviously slower than GitHub. After a long pause with 0 objects counted, all objects finish counting at once, and then things proceed pretty well. The other thing of note is that it only accepts RSA keys for SSH access. 6/10 not bad but has clearly needed improvement for a long time. We are still using it for all of our code, so it’s not <em>terrible.</em></p>
<p><strong>CodePipeline</strong> is great for what it does, but it has limited built-in integrations. It can use AWS Code services… or Lambda or SNS. 8/10 conceptually sound and easy to use as intended, although I would rather implement my own webhook on an EC2 instance for custom steps.</p>
<p><strong>Lambda</strong> has been quarantined to “only used for stuff that has no alternative,” like running code in response to CodeCommit pushes. It appears that we are charged for the <em>wall time</em> to execute, which is okay, but means that we are literally paying for the latency of every AWS or webhook request that Lambda code needs to make. 3/10 all “serverless” stuff like Lambda and Fargate are significantly more expensive than being server’d. Would rather implement my own webhook on an EC2 instance.</p>
<p><strong>SNS</strong> [Simple Notification Service] once had a habit of dropping subscriptions, so our ALB health checks (formerly ELB health checks) embed a subscription-monitor component that automatically resubscribes if the instance is dropped. One time, I had a topic deliver to email before the actual app was finished, and the partner ran a load test without warning. I ended up with 10,000 emails the next day, 0 drops and 0 duplicates. 9/10 has not caused any trouble in a long time, with plenty of usage.</p>
<p><strong>SQS</strong> [Simple Queue Service] has been 100% perfectly reliable and intuitive. 10/10 exactly how an AWS service should run.</p>
<p><strong>Secrets Manager</strong> has a lot of caching in front of it these days, because it seems to be subject to global limits. We have observed throttling at rates that are 1% or possibly even less of our account’s stated quota. The caching also helps with latency, because they are either overloaded (see previous) or doing Serious Crypto that takes time to run (in the vein of bcrypt or argon2i). 8/10 we have made it work, but we might actually want AWS KMS instead.</p>
<p><strong>API Gateway</strong> has ended up as a fancy proxy service. Our older APIs still have an ‘API Definition’ loaded in, complete with <a href="https://www.decodednode.com/2016/09/aws-api-gateway-returning-404-errors.html">stub paths to return 404</a> instead of the default 403 (which had confused partners quite a bit.) Newer ones are all simple proxies. We don’t gzip-encode responses to API Gateway because <a href="https://www.decodednode.com/2016/12/api-gateway-vs-gzip-errcontentdecodingf.html">it failed badly in the past.</a> 7/10 not entirely sure what value this provides to us at this point. We didn’t end up integrating IAM Authentication or anything.</p>
<p><strong>ACM</strong> [AWS Certificate Manager] provides all of our certificates in production. The whole point of the service is to hide private keys, so the development systems (not behind the load balancer) use Let’s Encrypt certificates instead. 10/10 works perfectly and adds security (vs. having a certificate on-instance.)</p>
<p><strong>Route53 Domains</strong> is somewhat expensive, as far as registering domains goes, but the API accessibility and integration with plain Route53 are nice. It is one of the top-3 services on our AWS bill because we have a “vanity domain per client” architecture. 9/10 wish there was a bulk discount.</p>
<p><strong>DynamoDB</strong> is perfect for workloads that suit <em>non-queryable</em> data, which is to say, <a href="https://www.decodednode.com/2018/05/why-we-have-memcached-dynamo-proxy.html">we use it for sessions,</a> and not much else. It has become usable in far more scenarios with the additions of TTL (expiration times) and secondary indexes, but still remains niche in our architecture. 9/10 fills a clear need, just doesn’t match very closely to our needs.</p>
<p><strong>CloudSearch</strong> has been quietly powering “search by name” for months now, without complaints from users. 10/10 this is just what the doctor ordered, plain search with no extra complexity like “you will use this to parse logs, so here are extra tools to manage!”</p>
<p>That’s it for today. Tune in next time!</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-87973086044108498572023-01-26T18:30:00.000-05:002023-01-26T18:30:00.240-05:00FastCGI in Perl, but PHP Style [2012]<p><i>Editor's note: I found this in my drafts from 2012. By now, everything that can be reasonably
converted to FastCGI has been, and a Perl-PHP bridge has been built to allow new code to be written
for the site in PHP instead. However, the conclusion still seems relevant to designers working on
frameworks, so without further ado, the original post follows...</i></p>
<p>The first conversions of CGI scripts to FastCGI have been launched into production. I have both the
main login flow and six of the most popular pages converted, and nothing has run away with the CPU
or memory in the first 50 hours. It’s been completely worry-free on the memory front, and I owe it
to the PHP philosophy.</p>
<p>In PHP, users generally don’t have the option of persistence. Unless something has been carefully
allocated in persistent storage in the PHP kernel (the C level code), everything gets cleaned up at
the end of the request. Database connections are the famous example.</p>
<p>Perl is obviously different, since data can be trivially kept by using package level variables to
stash data, but my handler-modules (e.g. <code>Site::Entry::login</code>) don’t use them. Such
handler-modules define one well-known function, which returns an object instance that carries all
the necessary state for the dispatch and optional post-dispatch phases. When this object is
destroyed in the FastCGI request loop, so too are all its dependencies.</p>
<p>Furthermore, dispatching <em>returns</em> its response, WSGI style, so that if dispatch
<code>die</code>s, the FastCGI loop can return a generic error for the browser. Dispatch isn’t
allowed to write anything to the output stream directly, including headers, which guarantees a blank
slate for the main loop’s error page. (I once wrote a one-pass renderer, then had to grapple with
questions like “How do I know whether HTML has been sent?”, “How do I close half-sent HTML?”, and
“What if it’s not HTML?” in the error handler.)</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-30791411219167129882023-01-22T10:28:00.001-05:002023-01-28T06:16:40.104-05:00PHP’s PDO, Single-Process Testing, and 'Too Many Connections'<p>Quite some time ago now, I ran into a problem with running a test suite: at some point, it would fail to connect to the database, due to too many connections in use.</p>
<p>Architecturally, each test sent a PSR-7 Request through the HTTP layer, which caused the back-end code under test to connect to the database in order to fulfill the request. All of these resources (statement handles and the database handle itself) <em>should</em> have been out of scope by the end of the request.</p>
<p>But every PDOStatement has a reference to its parent PDO object, and apparently each PDO keeps a reference to all of its PDOStatements. There was no memory pressure (virtually all <em>other</em> allocations were being cleaned up between tests), so PHP wasn’t trying to collect cycles, and the PDO objects were keeping connections open the whole duration of the test suite.</p>
<p>Lowering the connection limit in the database engine (a local, anonymized copy of production data) caused the failure to occur much sooner in testing, proving that it was an environmental factor and not simply “unlucky test ordering” that caused the failure.</p>
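<p>For what it’s worth, on a MySQL-family engine (an assumption on my part; the point doesn’t depend on which engine it is), lowering the ceiling for an experiment like that is a one-liner, and the exact value just needs to be smaller than the test suite’s appetite:</p>
<pre><code>mysql -u root -p -e "SET GLOBAL max_connections = 40"</code></pre>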
<p>Using phpunit’s <code>--process-isolation</code> cured the problem entirely, but at the cost of <em>a lot</em> of time overhead. This was also expected: with the PHP engine shut down entirely between tests, all of its resources (including open database connections) were cleaned up by the OS.</p>
<p>Fortunately, I already had a database connection helper for other reasons: loading credentials securely, setting up exceptions as the error mode, choosing the character set, and retrying on failure if AWS was in the middle of a failover event (“Host not found”, “connection refused”, etc.) It was a relatively small matter to detect “Too many connections” and, if it was the first such error, issue <code>gc_collect_cycles()</code> before trying again.</p>
<p>(Despite the “phpunit” name, there are functional and integration test suites for the project which are also built on phpunit. Then, the actual tests to run are chosen using <code>phpunit --testsuite functional</code>, or left at the default for the unit test suite.)</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-46432362899932490702022-12-28T16:07:00.004-05:002024-02-02T19:15:42.641-05:00Linux Behavior Without Swap<p>We had a runaway script clog all of the memory on a micro EC2 Ubuntu instance. Not enough that the kernel <acronym title="Out-of-Memory">OOM</acronym> killer would do anything, and not enough that <em>the script itself</em> hit the PHP memory limit, but enough to make the instance become unresponsive for 45 minutes.</p>
<p>I <em>have</em> sent Linux into thrashing, back in the old days when typical desktop RAM sizes were less than 1 GB and SSDs weren’t available yet. What surprised me was <em>just how similar</em> “running out of RAM” felt in modern times, even with the OOM killer. It let the system bog down instead of killing a process!</p>
<p>We chose to mitigate the issue at work by expanding the instance, so that it has more RAM than <code>memory_limit</code> now. It will take more than one simultaneous runaway script to bring it down in the future. (We also fixed the script. I don’t like throwing resources at problems, in general.)</p>
<p>Then one day, via <a href="https://utcc.utoronto.ca/~cks/space/blog/linux/SystemdOomdNowDisabled?showcomments#comments">pure serendipity</a>, I found out about <a href="https://github.com/rfjakob/earlyoom">earlyoom</a>. I have added it to our pet instances, and I’m considering it for the cattle template, but it hasn’t been well-tested due to our previous mitigations. The instance simply <em>doesn’t</em> run out of RAM anymore.</p>
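<p>Getting it going is about as simple as it gets; the package is in the Debian and Ubuntu repositories (the service may already be enabled by the package, but being explicit doesn’t hurt):</p>
<pre><code># apt install earlyoom
# systemctl enable --now earlyoom</code></pre>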
<p>At home, I first set up swap on zram so that Ubuntu Studio would have a place to “swap out” 2+ GB (out of 12 GB installed), and then recently added a swap partition <a href="https://www.decodednode.com/2022/12/debootstraping-recovery-partition.html">while I was restructuring things</a> anyway. It’s not great for realtime audio to swap; but “not having swap” doesn’t appear to change the consequences of memory pressure, so I put some swap in. With a dedicated swap partition added, I reduced the zram area to 512 MB. I still want to save the SSD if there’s a small-to-moderate amount of swap usage.</p>
<p><strong>UPDATE: <a href="https://www.decodednode.com/2023/09/update-on-earlyoom.html">This was imperfect, as it turns out;</a></strong> if you have a swap partition, you should <em>remove</em> <code>zram</code>, and use <code>zswap</code> instead.</p>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-41838161298119239802022-12-27T15:40:00.000-05:002022-12-27T15:40:00.093-05:00debootstrap’ing a Recovery Partition<p>One of the nicer things about trying Fedora on my <a href="https://system76.com/laptops/darter">work laptop</a> is that when it broke the boot loader, there was a functioning Recovery Mode.</p>
<p>My desktop relies on <a href="https://github.com/morrownr/88x2bu">a particular driver</a> for WiFi, and upgrading the kernel (e.g. from Ubuntu 22.04 to 22.10) requires fully reinstalling it. But what if the kernel upgrades to a version that isn’t supported by the copy of the driver I happened to have on disk? And I didn’t want to “just” (disassemble and move the PC to) plug in an Ethernet cable?</p>
<p>I used the Fedora live environment to make a little bit of room for an 8 GiB partition at the end of the disk (and a 2 GiB swap partition, as long as I was there), and then I ran <code>debootstrap</code> to fill it in. This is about what surprised me doing that.</p>
<p>tl;dr: debootstrap is <em>a lot more aggressively minimalist,</em> more like Arch, than I would have expected.</p>
<a name='more'></a>
<p>Using debootstrap from a later Ubuntu (22.10) to install Debian Bullseye (2021-08) seemed to work fine.</p>
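<p>The invocation itself is short. Something like the following, with <code>/dev/sdz9</code> standing in for the new 8 GiB partition (as always, substitute your own device node):</p>
<pre><code># mkfs.ext4 /dev/sdz9
# mount /dev/sdz9 /mnt
# debootstrap --arch amd64 bullseye /mnt http://deb.debian.org/debian
# chroot /mnt /bin/bash</code></pre>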
<p>The guide I was reading had an example sequence of how debootstrapping should work, suggesting I run <code>dselect</code> as my first thing in the chroot. Well, the command wasn’t found, and when I installed it, I hated it. The package list is grouped by <em>several</em> layers I don’t find terribly relevant, as I’m not a Debian release manager.</p>
<p>I went back to the familiar <code>apt</code> command to install dkms, build-essential, and git. I cloned the driver from github. Those were the things I knew I would need. I would come back later to install the things I didn’t know I would need: <strong>linux-image-amd64,</strong> iw, rfkill, iwd, and (because I use Dvorak) console-setup.</p>
<p>I also tracked down and ran <code>dpkg-reconfigure locales</code> in the chroot, because perl was constantly complaining that there was no <code>en_US.UTF-8</code> locale. The locale setting had leaked in from the host environment outside the chroot, but no locales had actually been generated inside it.</p>
<p>I found the missing kernel right away; I couldn’t set up GRUB in Ubuntu to make use of Debian’s kernel, because <em>it wasn’t there.</em></p>
<p>The next hurdle was that the system booted with a read-only root partition. It turns out that debootstrap leaves <code>/etc/fstab</code> blank. I ran <code>mount -o remount,rw /</code> to fix it, then mounted the Ubuntu partition and copied the syntax from its <code>fstab</code> file. I forgot about <code>blkid</code>; I got the correct UUID with <code>lsblk -o NAME,UUID</code> instead. I don’t want to lock in a fixed physical architecture (i.e. the traditional /dev/sdb3 names.)</p>
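<p>A minimal root entry, with a placeholder UUID, looks something like this:</p>
<pre><code># /etc/fstab  (get the real UUID from lsblk -o NAME,UUID)
UUID=01234567-89ab-cdef-0123-456789abcdef  /  ext4  errors=remount-ro  0  1</code></pre>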
<p>With that out of the way, I moved on to installing the driver, which is where iw and rfkill came in. The installation only told me that <code>iw</code> was missing, but I suspected there would be more, so I searched the script and found everything it wanted from <code>iw</code> downward. It was back to Ubuntu (working WiFi) to chroot in and install those.</p>
<p>With that, the driver built fine; it took another cycle through Ubuntu to read the Arch wiki, install <code>iwd</code>, set up a <code>.network</code> file for systemd, and reboot to Debian to supply the WPA2 password. That last bit was a matter of running <code>iwctl</code>, typing <code>station wlx984827[…] connect [SSID]</code>, and following the password prompt. Success! I had an IP address!</p>
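<p>The <code>.network</code> file doesn’t need much. A minimal sketch, assuming DHCP and the usual <code>wlx…</code> predictable interface names (the file name itself is arbitrary):</p>
<pre><code># /etc/systemd/network/25-wireless.network
[Match]
Name=wlx*

[Network]
DHCP=yes</code></pre>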
<p>I went ahead and installed <code>links</code> and <code>gpm</code>, and then I was done.</p>
<p>(To boot Debian, I added a <code>/boot/grub/custom.cfg</code> file, sourced by Ubuntu’s <code>/etc/grub.d/41_custom</code>. This must not be confused with <code>/etc/grub.d/40_custom</code>, which one would edit directly and thus cause conffile conflicts forever after. The custom.cfg is a pair of menuentry blocks for the <code>/vmlinuz</code> and <code>/vmlinuz.old</code> symlinks in Debian’s root partition, with kernel options <code>ro quiet splash</code>. debootstrap didn’t install a boot loader, so I left the partition that way. The Debian GRUB can’t take over the boot process if there’s no Debian GRUB. If I break Ubuntu’s GRUB, I can’t boot, but the main thing I’m worried about is breaking Ubuntu’s <em>internet.</em>)</p>
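<p>A menuentry for this kind of setup might look roughly like the following; this is my sketch rather than the exact file, and the UUID is a placeholder for the Debian root filesystem. The second entry is the same, but pointing at the <code>.old</code> symlinks.</p>
<pre><code>menuentry "Debian recovery" {
    search --no-floppy --fs-uuid --set=root 01234567-89ab-cdef-0123-456789abcdef
    linux /vmlinuz root=UUID=01234567-89ab-cdef-0123-456789abcdef ro quiet splash
    initrd /initrd.img
}</code></pre>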
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-85904677885221970372022-12-17T19:39:00.004-05:002022-12-19T20:38:40.721-05:00Container Memory Usage<p>How efficient is it to run multiple containers with the same code, serving different data? I am most familiar with a “shared virtual hosting” setup, with a few applications behind a single Web frontend and PHP-FPM pool. How much would I lose, trying to run each app in its own container?</p>
<p>To come up with a first-order approximation of this, I made a pair of minimal static sites and used the <code>nginx:mainline-alpine</code> image (ID 1e415454686a) to serve them. The overall question was whether the image layers would be shared between multiple containers <em>in the Linux memory system,</em> or whether each container would end up with its own copy of everything.</p>
<p><strong>Updated 2022-12-19:</strong> This post has been substantially rewritten and expanded, because Ubuntu would not cleanly reproduce the numbers, requiring deeper investigation.</p>
<a name='more'></a>
<h3>Test Setups</h3>
<p>The original test ran on Debian 11 (Bullseye) using Podman, rootless Podman, the “docker.io” package, and uncontained nginx-mainline. The first three are from the Debian 11 package repository. Due to installing the VM guests on separate days, the Podman VM had kernel 5.10.0-19, and the Docker and uncontained VMs had kernel 5.10.0-20. Debian VMs were configured with 2 CPUs, 1024 MiB of RAM, and a 1024 MiB swap partition (unneeded.)</p>
<p>The Ubuntu test ran on Ubuntu 22.10 (Kinetic) with Podman, rootless Podman, and “docker.io” only; the uncontained test was not reproduced. Ubuntu also used the <code>fuse-overlayfs</code> package, which was not installed on Debian, so rootless Podman shows different sharing behavior in the <code>lsof</code> test.</p>
<p>The following versions were noted on the Ubuntu installations: <strong>docker.io</strong> used docker.io 20.10.16-0ubuntu1, containerd 1.6.4-0ubuntu1.1, and runc 1.1.2-0ubuntu1.1. <strong>podman</strong> used podman 3.4.4+ds1-1ubuntu1, fuse-overlayfs 1.9-1, golang-github-containernetworking-plugin-dnsname 1.3.1+ds1-2, and slirp4netns 1.2.0-1.</p>
<p>In an attempt to increase measurement stability, ssh, cron, and unattended-upgrades services were all stopped and deactivated on Ubuntu.</p>
<h3>Procedures</h3>
<p>The test cycle involved cold-booting the appropriate VM, logging into the console, checking <code>free</code>, and starting two prepared containers. (The containers were previously run with a bind-mount from the host to the container’s document root.) I accessed the Web pages using <code>links</code> to be sure that they were working properly, and then alternately checked <code>free</code> and stopped each container. I included a <code>sleep 1</code> command between stopping the container and checking the memory, to give the container a chance to exit fully.</p>
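<p>For concreteness, the run commands for a setup like this look something like the following; this is a reconstruction rather than the exact commands, and the ports and host paths are placeholders (rootful Podman shown; the Docker commands are the same apart from the binary name):</p>
<pre><code># podman run -d --name site1 -p 8081:80 \
    -v /srv/site1:/usr/share/nginx/html:ro \
    docker.io/library/nginx:mainline-alpine
# podman run -d --name site2 -p 8082:80 \
    -v /srv/site2:/usr/share/nginx/html:ro \
    docker.io/library/nginx:mainline-alpine</code></pre>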
<p>On a separate run from finding the memory numbers, I also used <code>lsof</code> to investigate what the kernel reported as open files for the containers. In particular, lsof provides a “NODE” column with the file’s inode number. If these are different for the same file in the container image, then it shows that the container is not sharing the files.</p>
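<p>The check itself boils down to pulling up the nginx processes from both containers and comparing the DEVICE and NODE columns for their open files; roughly:</p>
<pre><code># lsof -p "$(pgrep -d, -x nginx)"
[compare DEVICE and NODE for each container's copy of the same file]</code></pre>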
<p>The uncontained test is similar: boot, login on the console, check RAM, start nginx.service, access the pages, check the memory, stop nginx.service, and check the memory. The <code>lsof</code> research does not apply; multiple nginx instances do not exist.</p>
<p>Due to memory instability observed in the first round of Ubuntu testing, the tests were repeated with <a href="https://github.com/pixelb/ps_mem/">ps_mem</a> used to observe the PIDs associated with the containers, in order to get a clearer view of RAM usage of the specific containers.</p>
<p>Finally, a separate round of tests was done with <code>ps_mem</code> again, to get the breakdown by process with both containers running.</p>
<h3>Debian Results</h3>
<p>Limitation: I used <code>free -m</code> which was not terribly precise.</p>
<p>The Podman instance boots with 68-69 MiB of RAM in use, while the Docker instance takes 122-123 MiB for the same state (no containers running.)</p>
<p>Rootless Podman showed <strong>different</strong> inode numbers in lsof, and consumed the most memory per container: shutting things down dropped the used memory from 119 to 96 to 72 MiB. Those are drops of <strong>23 and 24 MiB.</strong></p>
<p>Podman in its default (rootful) mode shows the <strong>same</strong> inode numbers, and consumes the least memory per container: the shutdown sequence went from 77 to 75 to 73 MiB, dropping <strong>2 MiB</strong> each time.</p>
<p>Docker also shows the <strong>same</strong> inode numbers when running, but falls in between on memory per container: shutdown went from 152 to 140 to 129 MiB, which are drops of <strong>12 and 11 MiB.</strong></p>
<p>In the uncontained test, for reference, memory was difficult to measure. On the final run, <code>free -m</code> reported 68 MiB used after booting, 70 MiB while nginx was running, and 67 MiB after nginx was stopped. This is reasonable, since the nginx instance shares the host’s dynamic libraries, especially glibc.</p>
<h3>Ubuntu Results</h3>
<p>In the interests of being open and transparent about the quality of the methodology, the discredited data is also being reported here.</p>
<h4>Rootless Podman</h4>
<pre><code> (KiB) used free shared
boot 184360 1501076 1264
1 container 183144 1494032 1336
2 containers 205780 1470908 1400
1 container 177096 1493032 1356
final 190840 1479180 1312
</code></pre>
<p>Note that overall memory usage goes “down” after starting the first container, and “up” when stopping the second container.</p>
<p>The ps_mem results for slirp4netns, containers-rootlessport(-child), conmon, and the nginx processes:</p>
<pre><code>2 containers 64.3 MiB RAM
1 container 46.0 MiB
</code></pre>
<p>Matching the Debian results, rootless podman adds significant memory overhead (39.8% or 18.3 MiB) in this test.</p>
<p>With <code>fuse-overlayfs</code> installed, <code>lsof</code> showed the same inode numbers being used between the two containers, but on different devices. Previously, on Debian, they appeared on the actual SSD device, but with different inode numbers. The “same inodes, different device” matches the results when running the containers in rootful mode on Ubuntu. I did not pay attention to the device numbers in rootful mode on Debian.</p>
<p>The two-container breakdown (note again, this is a separate boot from the previous report, so does not total the 64.3 MiB shown above):</p>
<pre><code>Private Shared Sum Processes
708.0 K 434.0 K 1.1 M conmon
664.0 K 561.0 K 1.2 M slirp4netns
2.5 M 5.4 M 7.9 M nginx
15.4 M 12.2 M 27.6 M podman
15.3 M 12.6 M 27.9 M exe
65.6 MiB total
</code></pre>
<p>“podman” corresponds to <code>containers-rootlessport-child</code> in the output of <code>ps</code>, and “exe” is <code>containers-rootlessport</code>.</p>
<h4>Rootful Podman</h4>
<pre><code> (KiB) used free shared
boot 174976 1503404 1268
1 container 183880 1478800 1352
2 containers 194252 1467796 1420
1 container 164008 1497780 1372
final 184480 1477396 1324
</code></pre>
<p>The measurement problem was even more dramatic. Memory usage plummeted to “lower than freshly booted” levels after stopping one container, then bounced back <strong>up</strong> after stopping the second container. Neither of these fit expectations.</p>
<p>Rootful podman only needs the conmon and nginx processes, which leads to the following ps_mem result:</p>
<pre><code>2 containers 9.1 MiB RAM
1 container 6.6 MiB
</code></pre>
<p>The overhead remains high at 37.9%, but it is only 2.5 MiB due to the much lower starting point.</p>
<p>Here’s the breakdown with both containers running:</p>
<pre><code>Private Shared Sum Processes
708.0 K 519.0 K 1.2 M conmon
2.6 M 5.4 M 8.0 M nginx
9.2 MiB total
</code></pre>
<p>Without the <code>containers-rootlessport</code> infrastructure, memory usage is vastly lower.</p>
<h4>Docker</h4>
<pre><code> (KiB) used free shared
boot 192088 1430660 1104
1 container 213896 1355964 1192
2 containers 246264 1322052 1280
1 container 245940 1322052 1192
final 194276 1373788 1104
</code></pre>
<p>Calculating the deltas would suggest 21.3 MiB and 31.6 MiB to start the containers, but then 0.32 MiB and 50.4 MiB released when shutting them down.</p>
<p>Testing with ps_mem across all the container-related processes (<code>docker-proxy</code>, <code>containerd-shim-runc-v2</code>, and the <code>nginx</code> main+worker processes), I got the following:</p>
<pre><code>2 containers 25.4 MiB RAM
1 container 19.3 MiB
</code></pre>
<p>That suggests that the second container added 31.6% overhead (6.1 MiB) to start up.</p>
<p>The breakdown for 2-container mode:</p>
<pre><code>Private Shared Sum Processes
2.8 M 1.7 M 4.4 M docker-proxy
2.8 M 5.3 M 8.0 M nginx
5.0 M 7.9 M 12.9 M containerd-shim-runc-v2
25.4 MiB total
</code></pre>
<p>We see that containerd-shim-runc-v2 is taking just over half of the memory here. Of the rest, a third goes to docker-proxy, leaving less than one-third of the total allocation dedicated to the nginx processes inside the container.</p>
<h4>Uncontained</h4>
<p>I only collected stats for <code>ps_mem</code> this time:</p>
<pre><code>Private Shared Sum Processes
1.4 M 1.6 M 3.0 M nginx
</code></pre>
<p>This configuration is two document roots served by one nginx setup, rather than two nginx setups, so isolation is even lower than simply being uncontained. However, it represents a lower bound on what memory usage could possibly be.</p>
<h3>Conclusions</h3>
<p>Running podman rootless costs quite a bit of memory, but running it in rootful mode beats Docker’s consumption.</p>
<p>Both container managers can share data from the common base layers while running in memory, but Podman may require <code>fuse-overlayfs</code> to do so when running rootless.</p>
<p>For every answer, another question follows. It’s not that the project is finished; I simply quit working on it.</p>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-3922219755971684412.post-62396163372534150582022-12-03T16:55:00.003-05:002022-12-19T20:42:36.319-05:00Failing to Install Fedora 37<p>Waiting for yet another 80+ MB download of a <code>.deb</code> file, I decided to try to dual-boot Fedora on my Pop!_OS <a href="https://system76.com/laptops/darter">laptop</a> (darp8). Because <code>drpm</code>s exist. That’s it, that’s the whole reason. <br /></p>
<p>[<strong>Update 2022-12-04:</strong> Because all resize-related commands use “as much space as possible” by default, I decided to try again, shrinking the filesystem/LV/PV more than was necessary, and then re-expanding them after the partition was changed. I got Fedora installed, but the system76 extension for the power manager <a href="https://github.com/pop-os/gnome-shell-extension-system76-power/issues/85">doesn’t work on Gnome 43</a>. Gnome claims <a href="https://gjs.guide/extensions/overview/updates-and-breakage.html#monkey-patching">it’s not their fault that the system they created</a> frequently breaks extensions in general; I assume they feel the same about the specific case here, where they <a href="https://gjs.guide/extensions/upgrading/gnome-shell-43.html#quick-settings">changed the menu API</a> instantly and completely.]</p>
<p>[<strong>Update 2022-12-19:</strong> I pretty quickly gave up on Fedora. It had a tendency to result in the laptop rebooting into recovery mode after using the Fedora install. Booting Linux is clearly not important enough to get standardized. Oh well!]</p>
<a name='more'></a>
<p>I had asked for encryption during the initial setup, and the machine is UEFI with coreboot and open firmware, so the disk has a GPT setup allocated thus:</p>
<ol>
<li>0.5 GB EFI SYSTEM</li>
<li>4.0 GB RECOVERY</li>
<li>923 GB (no name LUKS type)</li>
<li>4.0 GB (no name encrypted swap)</li>
</ol>
<p>I knew <a href="https://www.decodednode.com/2022/07/rescuing-encrypted-popos-2204.html">from past exploration</a> that the LUKS partition contained an LVM physical volume, with a volume group and logical volume filling the whole space. It’s not so hard to <code>resize2fs</code> and then <code>lvreduce</code>… very carefully. I had 99 GB of data in use, so I booted into the on-disk recovery image this time, shrank the filesystem on the LV to 400 GiB, then set the LV itself to the matching number of LVM extents. That part worked flawlessly.</p>
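<p>For the record, the shrink sequence is conceptually the following. This is only a sketch with placeholder VG/LV names, and shrinking filesystems can destroy data, so double-check sizes and have a backup first. Shrinking the filesystem a little below the target and growing it back afterward avoids any off-by-one-extent surprises:</p>
<pre><code># e2fsck -f /dev/mapper/data-root
# resize2fs /dev/mapper/data-root 395G
# lvreduce -L 400G /dev/data/root
# resize2fs /dev/mapper/data-root</code></pre>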
<p>At this point, I headed into the Fedora installer. Things went okay, until selecting partitioning. I exclusively used the “Advanced Custom (Blivet GUI)” option, and that was kind of a mess. I had to unlock the LUKS partition; afterward, the LVM volume group appears as a second entry in the sidebar. Fair enough. It had my 400 GB LV inside, and some free space, and then things started going off the rails.</p>
<p>Failure #1: I didn’t understand the <code>Name</code> field, so I left it blank, and Blivet named it (the new logical volume) <code>00</code>. That seemed like it would be unnecessarily confusing later on, so I deleted that, and put in a new LV with a better name.</p>
<p>Failure #2: At this point, Blivet said it had 3 actions pending: create/format, delete, create/format. “Undo” would take out the only change I wanted to keep, so I clicked Reset All. It cleared the operations queue, but it also put the GUI into a state where it reported the LUKS partition was locked (it wasn’t), but <i>also</i> would not let me unlock it. I re-locked it through the Terminal, but it didn’t help. I rebooted to try again.</p>
<p>Alternative hypothesis: maybe the VG was still in the sidebar and I didn’t notice it. I was too n00b to cover every case yet.</p>
<p>Failure #3: After this point, Blivet never had the keyboard in Dvorak ever again, even after cold boots of the installer. I’m actually not certain if it did the first time, or if that detail just got lost with all the other stuff going on at that point.</p>
<p>Failure #4: Fedora does not want /boot to be on a logical volume. There is a bug, where the discussion is basically, “we should support everything reasonable, but md-raid is too hard.” I found another thread somewhere, where Lennart “You’re Doing It Wrong” Poettering said that the EFI System Partition (ESP) should be mounted at <code>/efi</code>, or even <code>/boot</code> because there’s no point to having <code>/boot</code> when using EFI, and <code>/boot/efi</code> is “dumb” [sic] (no reasoning given.) Fedora does not want the ESP to be mounted anywhere except <code>/boot/efi</code>, though.</p>
<p>Failure #5: I discovered it was possible to use <code>pvresize</code> to shrink the LVM PV. I wasn’t entirely sure that I could have the partitions out of order (it seemed like the easiest thing for an editor to do would be to append a 5th GPT entry, pointing to the 4th extent on disk) but I decided to try. Unfortunately, <code>fdisk</code> reported 32,768 more sectors allocated to nvme0n1p2 than <code>pvresize</code> said in its “pretending the disk is … not Y sectors” message, which completely destroyed my confidence that I could resize the partition around it correctly.</p>
<p>I gave up at this point. I can’t coax the system into a “supported” configuration, and I’m not really willing to install an OS on a production machine—this is the one where the money is made—in an unsupported, “may fail at any time” configuration.</p>
<p>Meanwhile, in Pop!_OS, the system boots just fine with /boot on LVM and the kernel/initrd stored in the ESP. I really thought the whole point of EFI was to make OS-specific boot loaders irrelevant. (Like Multiboot, but complicated, albeit with Microsoft and Intel supporting it.) I don’t understand why the Fedora “Let’s Remove BIOS Support” Project requires GRUB and <code>/boot</code> (non-nested) with UEFI.</p>Unknownnoreply@blogger.com0