Sunday, September 15, 2024

The Wrong Terminal

Somewhere in my Pop!_OS 22.04 settings, I set Tilix as the preferred terminal emulator. When I use the Super+T* keyboard shortcut, I get a Tilix window.  However, when I use one of the launchers that Distrobox created for a container, via the Super+A (for Applications) UI, the command line doesn’t come up in Tilix… it comes up in gnome-terminal instead. Why is that, and can I fix it?

AIUI, all the Application Launcher UI does is ask the system to open the .desktop file that Distrobox added.  That file has the “run in terminal” option, but lacks the ability to request some specific terminal. That gets chosen, eventually, by the GIO library.
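
For illustration (this is my own reconstruction, not copied from the real file), the Distrobox launcher is a .desktop entry roughly along these lines, with a made-up container name:

    [Desktop Entry]
    Type=Application
    Name=my-container
    Exec=/usr/bin/distrobox enter my-container
    Terminal=true

That Terminal=true line is the entire request: “some terminal, please,” with no field for saying which one.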

GIO isn’t desktop-specific, so it doesn’t read the desktop settings.  It actually builds in a hard-coded list of terminals that it knows how to work with, like gnome-terminal, konsole, and eventually (I am assuming) ye olde xterm.  It walks down the list and runs the first one that exists on the system, which happens to be gnome-terminal.  AFAIK, there is no configuration for this, at any level.

It is also possible that one of the distributions in the chain (Debian, Ubuntu, or Pop!_OS) patched GIO to try x-terminal-emulator first.  If so, it would go through the alternatives system, which would send it directly to gnome-terminal, since between that and Tilix, gnome-terminal has priority.  We are deep into speculative territory, but if all of that were the case, I could “make it work” by making a system-level change resulting in all users now preferring Tilix… but only for cases where x-terminal-emulator is involved, specifically.
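
For reference, the system-level change in question would be the usual Debian alternatives dance, something like:

    # Show the registered terminals and their priorities.
    update-alternatives --display x-terminal-emulator

    # Interactively pick Tilix for every user on the machine.
    sudo update-alternatives --config x-terminal-emulator

Again, that only helps if x-terminal-emulator is actually in the lookup path, which is the speculative part.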

I want the admin user to keep all the defaults, like gnome-terminal, because the less that account deviates from stock, the less likely I am to configure a weird problem for myself.** (Especially for Gnome, which has a configuration system, but they don’t want anyone to use it.  For simplicity.) Changing the alternatives globally is in direct contradiction to that goal.

It seems that the “simplest” solution is to change the .desktop file to avoid launching in a terminal, and then update the command to include the desired terminal.  It would work in the short term, but fall “out of sync” if I ever changed away from Tilix as default in the desktop settings, or uninstalled Tilix.  It’s not robust.
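
Concretely, the edit would be something like the following, where the tilix -e flag and its quoting are untested assumptions on my part:

    [Desktop Entry]
    Type=Application
    Name=my-container
    Exec=tilix -e "distrobox enter my-container"
    Terminal=false

And that Exec line quietly rots the moment Tilix stops being the terminal I want.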

It seems like there’s some sort of desktop-environment standard missing here.  If we don’t want to invoke threads or communication inside GIO, then there would need to be a way for the Gnome libraries to provide an “XDG Configuration” or something, so that settings like “current terminal app” could be passed in.

If we relax the constraints, then a D-Bus call would be reasonable… but in that case, maybe GIO could be removed from the sequence entirely.  The Applications UI would make the D-Bus call to “launch the thing,” and a desktop-specific implementation would pick it up and do it properly.

It seems like there should be solutions, but from searching the web, it looks like the UX has “just worked this way” for years.

* Super is the |□| key, because it's a System76 laptop, a Darter Pro 8 in particular.

** As a side effect, this makes the admin account feel like “someone else’s computer,” which makes me take more care with it.  I may not want to break my own things, exactly, but I would feel even worse about breaking other people’s stuff.

Monday, September 9, 2024

Some Solutions to a Problem

We have an EC2 instance with a quickly-produced shell script that runs on boot.  It points a DNS name at the instance’s public IPv4 address.  Since time was of the essence, it hard-codes everything about this, particularly the DNS name to use.
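
The script is roughly this shape; the hosted zone ID and hostname below are stand-ins, and the real one differs in details:

    #!/bin/sh
    # Ask the instance metadata service for this instance's public IPv4 address.
    IP="$(curl -s http://169.254.169.254/latest/meta-data/public-ipv4)"

    # Point the (hard-coded!) production hostname at that address.
    aws route53 change-resource-record-sets \
      --hosted-zone-id Z0000000EXAMPLE \
      --change-batch "{\"Changes\": [{\"Action\": \"UPSERT\",
        \"ResourceRecordSet\": {\"Name\": \"service.example.com\", \"Type\": \"A\",
        \"TTL\": 60, \"ResourceRecords\": [{\"Value\": \"$IP\"}]}}]}"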

This means that if we bring up a copy of this instance, based on a snapshot of its root volume, the copy will overwrite the DNS record for the production service. We need to stop this.

As a side project, it would be nice to remove the hard-coding of the DNS name.  It would be trivial to “stop DNS name conflicts” if we did not have a DNS name stored on the instance’s disk image to begin with.

What are the options?

Sunday, September 1, 2024

A Problem of Semantic Versioning

For a while, we’ve been unable to upgrade to PHPUnit 11 due to a conflict in transitive dependencies.  The crux of the problem is:

  1. Psalm (5.25.0) directly requires nikic/php-parser: ^4.16, prohibiting 5.x.
  2. PHPUnit (11.3.1) transitively requires nikic/php-parser: ^5.1, prohibiting 4.x.

It is possible in the short term to retain PHPUnit 10.x, but the conflict brings to light a certain limitation of Semantic Versioning: it tells you how to create version numbers for your own code base, but it does not carry information about the dependencies of that code.
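
The short-term fix is nothing more than a constraint in composer.json, something like the following (the exact version numbers here are illustrative):

    {
        "require-dev": {
            "phpunit/phpunit": "^10.5",
            "vimeo/psalm": "^5.25"
        }
    }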

When the required PHP runtime version goes up, what kind of change is that?  SemVer prescribes incrementing the major number for “incompatible API changes,” or the patch for “backward compatible bug fixes.”

So, is it a bug fix?  Is it incompatible? Or is the question ill-formed?

It feels wrong to increment the patch version with such a change.  Such a release states, “We are now preventing the installation on {whatever Enterprise Linux versions} and below, and in exchange, you get absolutely nothing. There are no new features.  Or perhaps we fixed bugs, but now you can’t access those fixes.”  That sounds… rude.

Meanwhile, it seems gratuitous to bump the major version on a strict time schedule, merely because another old PHP version falls out of upstream support every year.  It appears to cause a lot of churn in the API, simply because making a major version change is an opportunity to “break” that API.  PHPUnit is particularly annoying about this, constantly moving the deck chairs around.

In between is the feature release.  I have the same misgivings as with the patch version, although weaker.  Hypothetically, a project could release X.3.0 while continuing to maintain X.2.Y, but I’m not sure how many of them do.  When people have a new shiny thing to chase, they don’t enjoy spending any time on the old, tarnishing one.

What if we take the path of never upgrading the minimum versions of our dependencies?  I have also seen a project try this.  They were starving themselves of contributors, because few volunteers want to make their patch work on PHP 5.2–8.1.  (At the PHP 8.1 release in 2021, PHP 5.2 had reached its “end of life” about 11 years prior, four years after its own release in 2006.) Aside from that issue, they were also either unable to pick up new features in other packages they may use, or they were forever adding run-time feature detection.

As with most things in engineering, it comes down to trade-offs… but versions end up being a social question, and projects do not determine their answers in isolation.  The ecosystem as a whole has to work together.  When they don’t, users have to deal with the results, like the nikic/php-parser situation.  And maybe that means users will migrate away from Psalm, if it’s not moving fast enough to permit use with other popular packages.

Sunday, August 18, 2024

The Missing Call

I decided to combine (and minify) some CSS files for our backend administration site, so I wrote the code to load, minify, and output the final stylesheet.  I was very careful to write to a temporary file, even check the fclose() return code, rename it into place, and so on.  I even renamed the original to a backup so that I could attempt to rename it back if the first rename succeeded but the second failed.

For style points, I updated it to set the last-modified time of the generated file to the timestamp of the latest input, so that If-Modified-Since headers will work correctly.

I tested it, multiple times, with various states of having the main and backup filenames. It looked great.  I pushed it out to production… and that wasn’t so great.

We just had no styles at all. Yikes!  I had some logic in there for “if production and minified CSS exists, use it; else, fall back to the source stylesheets.”  I hastily changed that to if (false) and pushed another deployment, so I could figure out what happened.

It didn’t take long.  The web server log helpfully noted that the site.min.css file wasn’t accessible to the web server’s OS user.

I had used tempnam(), which created an output file with mode 600, rw- --- ---.  Per long-standing philosophy, the deployment runs as a separate user from the web server, so a file that’s only readable to the deployer can’t be served by the web server.  Oops.

I had considered the direct failure modes of all of the filesystem calls I was making, but I hadn’t considered the indirect consequences of the actions being performed.  I added a chmod(0o644) call and its error check, and deployed again.  After that, the site worked.
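
For the curious, the fixed sequence is approximately the sketch below; variable names are simplified stand-ins, and the real code checks more than this:

    $tmp = tempnam(__DIR__, 'css');   // tempnam() creates the file with mode 600
    $fh = fopen($tmp, 'wb');
    fwrite($fh, $minifiedCss);
    if (!fclose($fh)) {
        throw new RuntimeException('fclose failed');
    }

    // The missing call: let the web server's user read the file.
    if (!chmod($tmp, 0o644)) {
        throw new RuntimeException('chmod failed');
    }

    // Carry the newest input's mtime so If-Modified-Since keeps working.
    touch($tmp, $latestInputMtime);

    // Back up the old file, move the new one into place, restore on failure.
    if (file_exists('site.min.css')) {
        rename('site.min.css', 'site.min.css.bak');
    }
    if (!rename($tmp, 'site.min.css')) {
        rename('site.min.css.bak', 'site.min.css');
        throw new RuntimeException('rename failed');
    }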

Sunday, August 11, 2024

Our Long-Term AWS CloudSearch Experience

AWS has announced the deprecation of CloudSearch, among other services, just as I wanted to share why we chose it, and how it worked out.

Competitors

The field we considered when choosing CloudSearch included Sphinx, ElasticSearch (the real one and AWS’ ripoff), MySQL FULLTEXT indexes, rolling our own in-SQL trigram search, and of course, CloudSearch.

We had operational experience with Sphinx. It performed well enough, but it is oriented toward documents, not the smaller constellation of attributes I was interested in here.  It took quite a chunk of memory to index our tickets (description/comments), required a pet machine, and didn’t vibe correctly with the team.  I didn’t want to commit to putting 100 times more entries in it, then defending it politically for all eternity.

ElasticSearch appeared to be hyper-focused on log searching specifically, more like what we’re already doing with Papertrail.  It was not clear that it could be used for other purposes, let alone how to go about such things.

We actually had an in-SQL trigram search already, but only for customer names.  I built it because MySQL’s full-text index features were not in great health at the time. (I thought full-text indexes had been deprecated ever since, but in checking today, this appears not to be the case.  Even the MySQL 9.0 docs don’t mention a deprecation.) I started populating an additional trigram index for all the data I was interested in searching, and it blew up our storage size so fast that I had to stop it and definitely find something else. That’s also how I found out that RDS can’t reclaim storage; once it expands, it has expanded for good.

The problem with using MySQL’s full-text indexing was that the related integer fields needed to be searched as well.  We wanted to have a general search field, where the user could put in “Sunesh” or “240031” and get the related customer or transaction number, without a complex multi-part form.  Doing that with nothing but MySQL features seemed difficult and/or slow.

“Do nothing” wasn’t really an alternative, either; to search all the relevant fields, MySQL wanted to do two full table scans.  Searches would be running against the largest tables in the database, which makes even a single full scan prohibitively expensive.

CloudSearch

CloudSearch got a great review in my collection of blurbs about AWS services, but further experience has been somewhat less rosy.

For background, CloudSearch is arranged into one-dimensional domains, with a limited second dimension in the form of array attributes.  To contain costs, I chose to index our customers, attaching their VINs as array attributes, rather than have separate customer and vehicle domains or redundantly index the customer attributes on every vehicle.  This results in a domain with 2.5M records.  (Doing some serious guesswork, that means around 12M contracts in total.  Give or take a couple million.)
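
To make the shape concrete, a document batch for that domain looks approximately like this, with field names invented for the example rather than taken from our actual schema:

    [
      {
        "type": "add",
        "id": "customer-1234567",
        "fields": {
          "name": "Sunesh Example",
          "vin": ["1FTEXAMPLE0000001", "1GCEXAMPLE0000002"]
        }
      }
    ]

The vin field being array-valued is the “limited second dimension” in action.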

Things worked fine with a ‘small’ search instance for a while, but it didn’t handle bursty traffic.  Last month, I resized the instance to ‘medium’, and rebuilt the index… which took an unknown number of hours between 2 and 18, inclusive.

Why don’t I know exactly how long it took?  Well, that’s the next problem: metrics. CloudSearch only keeps metrics for three hours, and doesn’t have an event log.  (They appear to go into CloudWatch, but with a custom 3-hour expiration time.) When did the rebuild finish?  Dunno!  Did the system get overwhelmed overnight?  Too bad; that’s gone! With the basic metrics being so anemic, there’s definitely nothing as useful as RDS’ Performance Insights, which is what I would really want here.

Our instance has managed to survive adequately at medium for a while, but I don’t know when I’ll have to scale it up as we roll out this search to more parts of the system.  We just don’t have the metrics here to plan capacity.

Considering that, and its deprecation by AWS, I would love to have an alternative… except what I really want is just CloudSearch, improved.

Wednesday, August 7, 2024

AWS CodeDeploy’s Blue/Green Deployment Gotcha

Once, well after I no longer remembered how the whole thing was bootstrapped, I accidentally deleted the target group associated with a CodeDeploy application that was configured for blue/green deployment.  That’s how I found out (rediscovered?) that CodeDeploy doesn’t create a target group for blue/green deployments; it copies an existing one.  Since I had just deleted that existing one, I couldn’t do a (re)deployment and bring the system back online!

(Also, it cemented my opinion that prompts should be like, “Type ‘delete foo-production-dAdw4a1Ta’ to delete the target group” rather than “Type ‘delete’ to delete.” Guess which way the AWS Console is set up.)

I started up an instance to add to a new target group, and it promptly fell over.  The AMI had health monitoring baked in, and one of the health checks was “CodeDeploy has installed the application on the instance.”  Since it was not CodeDeploy starting the instance for the purpose of installing the application, the health check failed, and Auto Scaling dutifully pulled it down to replace it.

Meanwhile, the lack of healthy instances was helpfully sending alerts and bringing my boss’ attention to the problem.

[Now I wonder if it could have worked to issue a redeploy at this point.  The group was there to copy, even if the instances weren’t functional.  I guess we’ll never know; I’m not about to delete the target group again, just to find out!]

I ended up flipping the configuration to use EC2 health checks instead of HTTP, and then everything was stable enough to issue a proper redeployment through CodeDeploy.  With service restored, I finally put the health checks back to HTTP.
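
For anyone else in the same spot: whether done in the console or the CLI, the flip amounts to one setting on the Auto Scaling group.  The group name here is made up.

    aws autoscaling update-auto-scaling-group \
      --auto-scaling-group-name production-web \
      --health-check-type EC2

Setting the type back to ELB afterward restores the HTTP-based checks.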

And then, with production in service again, I finally got to work on moving staging from in-place to blue/green.  Ironically, I would have learned the lesson either way; but by breaking production, it really stuck with me.

Sunday, August 4, 2024

qbXML is REST – Distilled

The design of Quickbooks XML is fundamentally REST.  Allow me to rephrase an old post of mine that used way too many words to say this.

The Quickbooks Web Connector (QBWC) would run on the client with the Quickbooks GUI, and periodically make calls out to a SOAP server to set up a session, get “qbXML” documents, and execute them.

Each of those documents contained a series of elements that essentially mapped to commands within Quickbooks.  To make an edit request, one included the type of object being edited, its ID, its sequence number (for conflict detection), and the desired changes.  Crucially, everything Quickbooks needed to carry out that request was embedded within the XML.  The XML could only reference objects that existed inside of Quickbooks.  There was no concept of “session data,” “temporary IDs,” locks, or anything, and no way to create or access them.
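
From memory, and therefore approximate in its details, an edit request looked something like this:

    <?xml version="1.0"?>
    <?qbxml version="13.0"?>
    <QBXML>
      <QBXMLMsgsRq onError="stopOnError">
        <CustomerModRq>
          <CustomerMod>
            <ListID>80000001-1234567890</ListID>
            <EditSequence>1626121697</EditSequence>
            <Phone>555-0100</Phone>
          </CustomerMod>
        </CustomerModRq>
      </QBXMLMsgsRq>
    </QBXML>

The ListID names the object, the EditSequence is the conflict-detection number, and the remaining elements are the changes.  Nothing else rides along with the request.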

If memory serves, one could “name” objects being created, then reference them later by that name within the same qbXML document.  Thus, “create a new thing and update something else to reference it” was expressible.

In other words, qbXML transferred a complete representation of the state necessary to carry out the request; therefore, by my understanding, it is REST.

The overall system wasn’t pure REST.  Everything happened within the context of “a session” which had “a specific company file” open in the GUI.  Outside of that, the fact that SOAP/WSDL (normally a full-blown RPC mechanism) was the transport was practically irrelevant.

I’m also aware there is no HTTP, thus no HTTP integration, no URLs, and no HATEOAS.  However, I don’t think these things are required to call something REST; those are simply things that REST was co-popularized with.