Friday, March 31, 2023

Passing data from AWS EventBridge Scheduler to Lambda

The documentation was lacking images, or even descriptions of some screens ("Choose Next. Choose Next.")  So, I ran a little experiment to test things out.

When creating a new scheduled event in AWS EventBridge Scheduler, then choosing AWS Lambda: Invoke, a field called "Input" will be available.  It's pre-filled with the value of {}, that is, an empty JSON object. This is the value that is passed to the event argument of the Lambda handler:

export async function handler(event, context) {
  // handle the event
}

With an event JSON of {"example":{"one":1,"two":2}}, the handler could read event.example.two to get its value, 2.
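As a minimal sketch of that example, here is a handler that reads the value and returns it.  The return value and the log line are additions for illustration only; the export keyword is left off so the snippet can also be run directly with Node outside of Lambda.

```javascript
// Sketch of a handler reading the schedule's "Input" JSON.
// In Lambda this function would be exported from index.mjs, as in the
// handler shown earlier; "export" is omitted so it runs as a plain script.
async function handler(event, context) {
  // With Input set to {"example":{"one":1,"two":2}}, the event argument
  // is that object, already parsed.
  const two = event.example.two;
  console.log("example.two =", two);
  return two;
}

// Local usage: pass the same object the scheduler would deliver.
handler({ example: { one: 1, two: 2 } }, {}).then((value) => {
  console.log("handler returned", value);
});
```

Running it locally this way is a quick check of the event shape before creating a schedule.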

It appears that EventBridge Scheduler allows one complete control over this data, and the context argument is only filled with Lambda-related information.  Therefore, AWS provides the ability to include the <aws.scheduler.*> values in this JSON data, to be passed to Lambda (or ignored) as one sees fit, rather than imposing any constraints of its own on the data format.  (Sorry, no examples; I was only testing the basic features.)
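For what it's worth, here is an untested sketch of what such an Input might look like.  The <aws.scheduler.*> placeholders are the context attributes named in the Scheduler documentation; the JSON keys (scheduleArn, scheduledTime, executionId, attempt, payload) are names made up for this illustration, and the sample event at the bottom is hand-written, not captured from a real invocation.

```javascript
// Hypothetical "Input" template for a schedule.  Scheduler is expected to
// replace each <aws.scheduler.*> placeholder before invoking the target.
const inputTemplate = `{
  "scheduleArn": "<aws.scheduler.schedule-arn>",
  "scheduledTime": "<aws.scheduler.scheduled-time>",
  "executionId": "<aws.scheduler.execution-id>",
  "attempt": "<aws.scheduler.attempt-number>",
  "payload": {"example": {"one": 1, "two": 2}}
}`;

// Handler that uses the scheduler metadata if it was included in the Input.
// (In Lambda, this would be exported from index.mjs.)
async function handler(event, context) {
  const when = event.scheduledTime ?? "(not provided)";
  return `scheduled for ${when}, two = ${event.payload.example.two}`;
}

// Hand-written stand-in for an event after substitution (values are invented).
const sampleEvent = {
  scheduledTime: "2023-03-31T12:00:00Z",
  payload: { example: { one: 1, two: 2 } },
};
handler(sampleEvent, {}).then((message) => console.log(message));
```

Because the template is ordinary JSON, the handler only sees the attributes one chooses to include, under whatever keys one likes.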

Note that the handler example above is written with ES Modules.  This needs a Lambda runtime with ES module support (I used Node 18.x), along with a filename of "index.mjs" so that Node treats the file as a module.

Monday, March 27, 2023

DejaDup and PikaBackup, early impressions (Update 2)

I tried a couple of backup programs: DejaDup and PikaBackup.

I installed both of them as Flatpaks, although deja-dup also has a version in the Pop!_OS 22.04 repository.  I have been using DejaDup for four months, and PikaBackup for one month.  This has been long enough for DejaDup to make a second full backup, but not so long for Pika to do anything special.

Speed:

For a weekly incremental backup of my data set…

  • DejaDup: about 5 minutes, lots of fan speed changes
  • PikaBackup: about 1 minute, fans up the whole time

Part of Pika’s speed is probably the better exclusion rules; I can use patterns of **/node_modules and **/vendor, to exclude those folders, wherever they are in the tree.  With DejaDup, I would apparently have to add each one individually, and I did not want to bother, nor keep the list up-to-date over time.

Part of DejaDup’s slowness might be that it executes thousands of gpg calls as it works.  Watching with top, DejaDup is frequently running, and sometimes there’s a gpg process running with it.  Often, DejaDup is credited with much less than 100% of a single CPU core.

Features:

PikaBackup offers multiple backup configurations.  I keep my main backup as a weekly backup, on an external drive that’s only plugged in for the occasion. I was able to configure an additional hourly backup of my most-often-changed files in Pika.  (This goes into ~/.borg-fast, which I excluded from the weekly backups.) The hourly backups, covering about 2 GB of files, aren’t noticeable at all when using the system.

As noted under “Speed,” PikaBackup offers better control of exclusions.  It also tracks how long operations took, so I know that the incremental weekly backups have consistently taken 53–57 seconds.

On the other hand, Pika appears to always save the backup password.  DejaDup gives the user the option of whether it should be remembered.

There is a DejaDup plugin for Caja (the MATE file manager) in the OS repo, which may be interesting to MATE users.

Space Usage:

PikaBackup did the weekly backup on 2023-04-24 in 46 seconds; it reports a total backup size of 28 GB and 982 MB (0.959 GB = 3.4%) written out.

With scheduled backups, Pika offers control of the number of copies kept.  One can choose from a couple of presets, or provide custom settings.  Of note, these are count-based rather than time-based; if a laptop is only running for 8-9 hours a day, then 24 hourly backups will be able to provide up to 3 days back in time.
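The arithmetic behind that estimate is simple enough to sketch.  This assumes exactly one backup per hour of uptime, and the function and parameter names are invented for the example.

```javascript
// How far back count-based retention reaches, assuming one backup is made
// for every hour the machine is running.
function daysCovered(backupsKept, hoursRunningPerDay) {
  return backupsKept / hoursRunningPerDay;
}

console.log(daysCovered(24, 8));  // 3 days for a laptop used 8 hours a day
console.log(daysCovered(24, 24)); // 1 day for a machine that never sleeps
```

A time-based policy would give the always-on machine and the laptop the same reach; the count-based one favors the machine that runs less.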

For unscheduled backups, it’s not clear that Pika offers any ‘cleanup’ options, because the cleanup is tied to the schedule in the UI.

I do not remember being given many options to control space usage in DejaDup.

Disaster Simulation:

To ensure backups were really encrypted, I rebooted into the OS Recovery environment and tried to access them.  Both programs’ CLI tools (duplicity and borgbackup) from the OS repository were able to verify the data sets. I don’t know what the stability guarantees are, but it’s nice that this worked in practice.

  • duplicity verified the DejaDup backup in about 9m40s
  • borgbackup verified the PikaBackup backup in 3m23s

This isn’t a benchmark at all; after a while, I got bored of duplicity being credited with 30% of 1 core CPU usage, and started the borgbackup task in parallel.

Both programs required the password to unlock the backup, because my login keychain isn’t available in this environment.

Curiously, borgbackup changed the permissions on a couple of files on the backup during the verification: the config and index files became owned by root.  This made it impossible to access the backups as my normal user, including to take a new one.  I needed to return to my admin user and set the ownership back to my limited account.  The error message made it clear an unexpected exception occurred, but wasn’t very useful beyond that.

Major limitations of this post:

My data set is a few GB, consisting mainly of git repos and related office documents.  The performance of other data sets is likely to vary.

I started running Pika about the same time that DejaDup wanted to make a second backup, so the full-backup date and number of incremental snapshots since should be fairly close to each other.  I expect this to make the verification times comparable.

I haven’t actually done any restores yet.

Final words:

Pika has become my primary backup method.  Together, its speed and its support for multiple configurations made hourly backups reasonable, without compromising the offline weekly backup.

Update History:

This post was updated on 2023-03-31, to add information about multiple backups to “Features,” and about BorgBackup’s file permission change during the verification test.  Links were added to the list above, and a new “Final Words” concluding section was written.

It was updated again on 2023-04-26, to add the “Space Usage” section, and to reduce “I will probably…” statements to reflect the final decisions made.

Thursday, March 16, 2023

Using sshuttle with ufw outbound filtering on Linux (Pop!_OS 22.04)

I am using sshuttle and UFW on my Linux system, and I recently set up outbound traffic filtering (instead of default-allow) in ufw.  Immediately, I noticed I couldn’t make connections via sshuttle anymore.

The solution was to add another rule to ufw:

allow out from anywhere to IP 127.0.0.1, TCP port 12300

Note that this is “all interfaces,” not tied to the loopback interface, lo.

Now… why does this work?  Why doesn’t this traffic already match one of the “accept all on loopback” rules?

To receive the traffic that sshuttle is responsible for, sshuttle listens at 127.0.0.1:12300 (by default) and creates some NAT rules to redirect traffic for its subnet to that IP and port.  That is, running sshuttle -r example.com 192.168.99.0/24 creates a NAT rule to catch traffic to any host within 192.168.99.0/24.  This is done in netfilter’s nat tables.

UFW has its rules in the filter tables, and the nat tables run first. Therefore, UFW sees a packet that has already been redirected, and this redirection changes the packet’s destination while its interface and source remain the same!

That’s the key to answering the second question: the “allow traffic on loopback” rules are written to allow traffic on interface lo, and these redirected packets have a different interface (Ethernet or Wi-Fi).  The public interfaces are not expected to have traffic for local addresses on them… but if they do, they don’t get to take a shortcut through the firewall.

With this understanding, we can also see what’s going wrong in the filtering rules.  Without a specific rule to allow port 12300 outbound, the packet reaches the default policy, and if that’s “reject” or “deny,” then the traffic is blocked.  sshuttle never receives it.

Now we can construct the proper match rule: we need to allow traffic to IP 127.0.0.1 on TCP port 12300, and use either “all interfaces” or our public (Ethernet/Wi-Fi) interface.  I left mine at “all interfaces,” in case I should ever plug in the Ethernet.
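The matching logic can be illustrated with a toy model.  This is not how netfilter is implemented; it only mirrors the order of operations described above (NAT redirect first, then filter rules).  The interface name wlan0, the host 192.168.99.10, and the rule objects are all invented for the illustration, with 3306 (MySQL's default port) as the example destination.

```javascript
// NAT step: sshuttle's redirect rewrites the destination of packets bound
// for its subnet, but leaves the outgoing interface alone.
function natRedirect(packet) {
  if (packet.dstIp.startsWith("192.168.99.")) {
    return { ...packet, dstIp: "127.0.0.1", dstPort: 12300 };
  }
  return packet;
}

// Filter step: the first matching rule wins, else the default policy applies.
// A field left out of a rule matches anything.
function filterOutbound(packet, rules, defaultPolicy) {
  for (const rule of rules) {
    const ifaceOk = rule.iface === undefined || rule.iface === packet.iface;
    const ipOk = rule.dstIp === undefined || rule.dstIp === packet.dstIp;
    const portOk = rule.dstPort === undefined || rule.dstPort === packet.dstPort;
    const protoOk = rule.proto === undefined || rule.proto === packet.proto;
    if (ifaceOk && ipOk && portOk && protoOk) return rule.action;
  }
  return defaultPolicy;
}

const loopbackOnly = [{ iface: "lo", action: "allow" }];
const withSshuttleRule = [
  ...loopbackOnly,
  { dstIp: "127.0.0.1", dstPort: 12300, proto: "tcp", action: "allow" },
];

// A connection to a host behind the tunnel, leaving via Wi-Fi.
const packet = { iface: "wlan0", dstIp: "192.168.99.10", dstPort: 3306, proto: "tcp" };
const seenByFilter = natRedirect(packet);

console.log(filterOutbound(seenByFilter, loopbackOnly, "deny"));     // deny
console.log(filterOutbound(seenByFilter, withSshuttleRule, "deny")); // allow
```

In this model, a rule that allows port 3306 never matches either, because the filter step only ever sees port 12300.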

(I admit to a couple of dead-ends along the way.  One, allowing port 3306 (MySQL’s default port) out didn’t help.  Due to the NAT redirection, the firewall never sees a packet with port 3306 itself.  This also means that traffic being forwarded by sshuttle can’t be usefully firewalled on the client side.  The other problem was that I accidentally created the rule to allow UDP instead of TCP the first time.  Haha, oops.)

Thursday, February 2, 2023

On Handling Web Forms [2012]

Editor’s Note: I found this in my drafts from 2012.  The login form works as described, but few of the other forms do.  However, the login form has not had glitches, even with 2FA being added to it recently.  The site as a whole retains its multi-page architecture.  Without further ado, the original post follows…

I’ve been working on some fresher development at work, which is generally an opportunity for a lot of design and reflection on that design.

Back in 2006 or so, I did some testing and tediously developed the standard “302 redirect on POST responses” technique for preventing pages that handled inbound form data from showing up in the browser history.  Thus, sites would be Back- and Reload-friendly, as they’d never show that “About to re-submit a form, do you really want to?” box.  (I would bet it was a well-known technique at the time, but NIH.)

That’s pretty much how I’ve written my sites since, but a necessary consequence of the design is that on submission failure, data for the form must be stored “somewhere,” so it can be retrieved for pre-filling the form after the redirection completes.
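The shape of that technique can be sketched without any HTTP machinery.  Here the session is a plain object, and validate(), the savedForm key, and the URLs are placeholders rather than anything from the actual sites.

```javascript
// Placeholder validation: returns an object of field errors (empty if valid).
function validate(fields) {
  const errors = {};
  if (!fields.email || !fields.email.includes("@")) errors.email = "Invalid email";
  return errors;
}

// POST handler: never renders a page, always answers with a redirect.
function handlePost(session, fields) {
  const errors = validate(fields);
  if (Object.keys(errors).length > 0) {
    // Failure: park the submission so the form page can pre-fill itself.
    session.savedForm = { fields, errors };
    return { status: 302, location: "/signup" };
  }
  return { status: 302, location: "/welcome" };
}

// GET handler for the form page: takes the saved data and clears it,
// so that it is only used once.
function handleGet(session) {
  const saved = session.savedForm ?? { fields: {}, errors: {} };
  delete session.savedForm;
  return { status: 200, prefill: saved.fields, errors: saved.errors };
}

const session = {};
console.log(handlePost(session, { email: "nope" })); // redirect back to the form
console.log(handleGet(session));                     // form page, pre-filled
```

Everything the form page needs after a failure has to survive the redirect, which is where the session storage comes in.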

My recent app built in this style spends a lot of effort on all that, and then throws in extra complexity: when detecting you’re logged out, it decides to minimize the session storage size, so it cleans up all the keys.  Except for the flash, the saved form data, and the redirection URL.

That latter gets stored because I don’t trust Referer as a rule, and if login fails and redirects to the login page for a retry, it won’t be accurate by the time a later attempt succeeds.  So every page that checks login also stores the form URL to return to after a login.

There’s even an extra layer of keys in the form area, so that each form’s data is stored independently in the session, although I don’t think people actually browse multiple tabs concurrently.  All it really does is bloat the session when a form gets abandoned somehow.

Even then, it still doesn't work if the form was inside a jQueryUI dialog, because I punted on that one.  The backend and page don’t know the dialog was open, and end up losing the user’s data.

Simplify, Young One

That's a whole lot of complexity just to handle a form submission.  Since the advent of GMail, YouTube, and many other sites which are only useful with JavaScript enabled, and since this latest project is a strictly internal app, I've decided to throw all that away and try again.

Now, a form submits as an AJAX POST and the response comes back.  If there was an error, the client-side code can inform the user, and no state needs to be saved/restored because the page never reloaded. All the state it had is still there.
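A rough sketch of the client side, written with fetch rather than the jQuery of the era.  The response shape ({ ok, errors }) and the fetchImpl parameter, which exists only so the function can be exercised without a server, are assumptions made for the example.

```javascript
// Submit form fields without reloading the page.
// Assumes the server answers with JSON shaped like { ok: boolean, errors: {...} }.
async function submitForm(url, fields, fetchImpl = fetch) {
  const response = await fetchImpl(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(fields),
  });
  const result = await response.json();
  if (!response.ok || !result.ok) {
    // The page never reloaded, so the inputs still hold what the user typed;
    // only the errors need to be shown.
    return { submitted: false, errors: result.errors ?? {} };
  }
  return { submitted: true, errors: {} };
}
```

The caller would display the returned errors next to the fields and leave everything else on the page alone.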

But while liberating, that much is old news.  “Everyone” builds single-page apps that way, right?

But here’s the thing: if I build out separate pages for each form, then I’m effectively building private sections of code and state from the point of view of “the app” or “the whole site.”  No page is visible from any other, so each one only has to worry about a couple of things going on in the global navbar.

This means changes roll out more quickly, as well, since users do reload when moving from area to area of the app.

Tuesday, January 31, 2023

Argon2id Parameters

There are no specific recommendations in this post. You will need to choose parameters for your specific situation.  However, it is my hope that this post will add deeper understanding to the recommendations that others make.  With that said…

The primary parameter for Argon2 is memory. Increasing memory also increases processing time.

The time cost parameter is intended to make the running time longer when memory usage can’t be increased further.

The threads (or parallelism, or “lanes” when reading the RFC) parameter sub-divides the memory usage.  When the memory is specified as 64 MiB, that is the total amount used, whether threads are 1 or 32.  However, the synchronization overhead causes a sub-linear speedup, and this is more pronounced with smaller memory sizes.  SMT cores offer even less speed improvement than the same number of non-SMT cores, as expected.

I did some tests on my laptop, which has 4 P-cores and 8 E-cores (16 threads / 12 physical cores.) The 256 MiB tests could only push actual CPU usage to about 600% (compared to the 1260% we might expect); it took 1 GiB or more to reach 1000% CPU.  More threads than cores didn’t achieve anything.

Overall then, higher threads allow for using more memory, if enough CPU is available to support the thread count.  If memory and threads are both in limited supply, then time cost is the last resort for extending the operation time until it takes long enough.
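To make the “threads sub-divide the memory” point concrete, here is a small calculation following the memory layout in RFC 9106, where the requested memory m (in KiB) is rounded down to m' = 4 * p * floor(m / (4 * p)) one-KiB blocks and split evenly across p lanes.  The function name and return shape are invented for the example.

```javascript
// Memory per lane for a given total memory and lane count (RFC 9106 layout).
function laneLayout(memoryKiB, lanes) {
  const totalBlocks = 4 * lanes * Math.floor(memoryKiB / (4 * lanes));
  return { totalKiB: totalBlocks, perLaneKiB: totalBlocks / lanes };
}

console.log(laneLayout(64 * 1024, 1));  // { totalKiB: 65536, perLaneKiB: 65536 }
console.log(laneLayout(64 * 1024, 32)); // { totalKiB: 65536, perLaneKiB: 2048 }
```

At 64 MiB and 32 lanes, each lane only has 2 MiB to fill, which fits with the synchronization overhead being more noticeable at small memory sizes.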

Bonus discovery: in PHP, the argon2id memory is separate from the memory limit.  memory_get_peak_usage() reported the same number at the beginning and end of my test script, even for the 1+ GiB tests.

Saturday, January 28, 2023

Experiences with AWS

Our core infrastructure is still EC2, RDS, and S3, but we interact with a much larger number of AWS services than we used to.  Following are quick reviews and ratings of them.

CodeDeploy has been mainly a source of irritation.  It works wonderfully to do all the steps involved in a blue/green deployment, but it is never ready for the next Ubuntu LTS after it launches.  As I write, AWS said they planned to get the necessary update out in May, June, September, and September 2022; it is now January 2023 and Ubuntu 22.04 support has not officially been released. Ahem. 0/10 am thinking about writing a Go daemon to manage these deployments instead.  I am more bitter than a Switch game card.

CodeBuild has ‘environments’ thrown over the wall periodically.  We added our scripts to install PHP from Ondřej Surý’s PPA instead of having the environment do it, allowing us to test PHP 8.1 separately from the Ubuntu 22.04 update.  (Both went fine, but it would be easier to find the root cause with the updates separated, if anything had failed.) “Build our own container to route around the damage” is on the list of stuff to do eventually.  Once, the CodeBuild environment had included a buggy version of git that segfaulted unless a config option was set, but AWS did fix that after a while.  9/10 solid service that runs well, complaints are minor.

CodeCommit definitely had some growing pains.  It’s not as bad now, but it remains obviously slower than GitHub.  After a long pause with 0 objects counted, all objects finish counting at once, and then things proceed pretty well.  The other thing of note is that it only accepts RSA keys for SSH access.  6/10 not bad but has clearly needed improvement for a long time.  We are still using it for all of our code, so it’s not terrible.

CodePipeline is great for what it does, but it has limited built-in integrations.  It can use AWS Code services… or Lambda or SNS.  8/10 conceptually sound and easy to use as intended, although I would rather implement my own webhook on an EC2 instance for custom steps.

Lambda has been quarantined to “only used for stuff that has no alternative,” like running code in response to CodeCommit pushes.  It appears that we are charged for the wall time to execute, which is okay, but means that we are literally paying for the latency of every AWS or webhook request that Lambda code needs to make.  3/10 all “serverless” stuff like Lambda and Fargate is significantly more expensive than being server’d.  Would rather implement my own webhook on an EC2 instance.

SNS [Simple Notification Service] once had a habit of dropping subscriptions, so our ALB health checks (formerly ELB health checks) embed a subscription-monitor component that automatically resubscribes if the instance is dropped.  One time, I had a topic deliver to email before the actual app was finished, and the partner ran a load test without warning.  I ended up with 10,000 emails the next day, 0 drops and 0 duplicates.  9/10 has not caused any trouble in a long time, with plenty of usage.

SQS [Simple Queue Service] has been 100% perfectly reliable and intuitive. 10/10 exactly how an AWS service should run.

Secrets Manager has a lot of caching in front of it these days, because it seems to be subject to global limits.  We have observed throttling at rates that are 1% or possibly even less of our account’s stated quota.  The caching also helps with latency, because they are either overloaded (see previous) or doing Serious Crypto that takes time to run (in the vein of bcrypt or argon2i).  8/10 we have made it work, but we might actually want AWS KMS instead.

API Gateway has ended up as a fancy proxy service.  Our older APIs still have an ‘API Definition’ loaded in, complete with stub paths to return 404 instead of the default 403 (which had confused partners quite a bit.) Newer ones are all simple proxies.  We don’t gzip-encode responses to API Gateway because it failed badly in the past. 7/10 not entirely sure what value this provides to us at this point.  We didn’t end up integrating IAM Authentication or anything.

ACM [AWS Certificate Manager] provides all of our certificates in production.  The whole point of the service is to hide private keys, so the development systems (not behind the load balancer) use Let’s Encrypt certificates instead.  10/10 works perfectly and adds security (vs. having a certificate on-instance.)

Route53 Domains is somewhat expensive, as far as registering domains goes, but the API accessibility and integration with plain Route53 are nice.  It is one of the top-3 services on our AWS bill because we have a “vanity domain per client” architecture.  9/10 wish there was a bulk discount.

DynamoDB is perfect for workloads that suit non-queryable data, which is to say, we use it for sessions, and not much else.  It has become usable in far more scenarios with the additions of TTL (expiration times) and secondary indexes, but still remains niche in our architecture.  9/10 fills a clear need, just doesn’t match very closely to our needs.

CloudSearch has been quietly powering “search by name” for months now, without complaints from users.  10/10 this is just what the doctor ordered, plain search with no extra complexity like “you will use this to parse logs, so here are extra tools to manage!”

That’s it for today.  Tune in next time!

Thursday, January 26, 2023

FastCGI in Perl, but PHP Style [2012]

Editor's note: I found this in my drafts from 2012. By now, everything that can be reasonably converted to FastCGI has been, and a Perl-PHP bridge has been built to allow new code to be written for the site in PHP instead. However, the conclusion still seems relevant to designers working on frameworks, so without further ado, the original post follows...

The first conversions of CGI scripts to FastCGI have been launched into production. I have both the main login flow and six of the most popular pages converted, and nothing has run away with the CPU or memory in the first 50 hours. It’s been completely worry-free on the memory front, and I owe it to the PHP philosophy.

In PHP, users generally don’t have the option of persistence. Unless something has been carefully allocated in persistent storage in the PHP kernel (the C level code), everything gets cleaned up at the end of the request. Database connections are the famous example.

Perl is obviously different, since data can be trivially kept by using package level variables to stash data, but my handler-modules (e.g. Site::Entry::login) don’t use them. Such handler-modules define one well-known function, which returns an object instance that carries all the necessary state for the dispatch and optional post-dispatch phases. When this object is destroyed in the FastCGI request loop, so too are all its dependencies.

Furthermore, dispatching returns its response, WSGI style, so that if dispatch dies, the FastCGI loop can return a generic error for the browser. Dispatch isn’t allowed to write anything to the output stream directly, including headers, which guarantees a blank slate for the main loop’s error page. (I once wrote a one-pass renderer, then had to grapple with questions like “How do I know whether HTML has been sent?”, “How do I close half-sent HTML?”, and “What if it’s not HTML?” in the error handler.)
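The contract between the request loop and the handler-modules can be sketched as follows.  This is a transliteration into JavaScript, not the Perl from the site, and createHandler, dispatch, and the response shape are placeholder names.

```javascript
// One iteration of the request loop.  createHandler() builds a fresh object
// per request; dispatch() returns the whole response instead of writing it.
function handleRequest(createHandler, request) {
  try {
    // All per-request state hangs off this object, and it goes out of scope
    // when the function returns, along with everything it owns.
    const handler = createHandler(request);
    return handler.dispatch(); // { status, headers, body }
  } catch (err) {
    // Nothing has been written yet, so a clean generic error is always possible.
    return {
      status: 500,
      headers: { "Content-Type": "text/plain" },
      body: "Internal Server Error",
    };
  }
}

const okHandler = () => ({
  dispatch: () => ({ status: 200, headers: {}, body: "hello" }),
});
const dyingHandler = () => ({
  dispatch: () => { throw new Error("boom"); },
});

console.log(handleRequest(okHandler, {}).status);    // 200
console.log(handleRequest(dyingHandler, {}).status); // 500
```

Since only the loop writes output, the error path never has to guess how much of a page has already been sent.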

Sunday, January 22, 2023

PHP’s PDO, Single-Process Testing, and 'Too Many Connections'

Quite some time ago now, I ran into a problem with running a test suite: at some point, it would fail to connect to the database, due to too many connections in use.

Architecturally, each connection sent a PSR-7 Request through the HTTP layer, which caused the back-end code under test to connect to the database in order to fulfill the request.  All of these resources (statement handles and the database handle itself) should have been out of scope by the end of the request.

But every PDOStatement has a reference to its parent PDO object, and apparently each PDO keeps a reference to all of its PDOStatements.  There was no memory pressure (virtually all other allocations were being cleaned up between tests), so PHP wasn’t trying to collect cycles, and the PDO objects were keeping connections open the whole duration of the test suite.

Lowering the connection limit in the database engine (a local, anonymized copy of production data) caused the failure to occur much sooner in testing, proving that it was an environmental factor and not simply “unlucky test ordering” that caused the failure.

Using phpunit’s --process-isolation cured the problem entirely, but at the cost of a lot of time overhead.  This was also expected: with the PHP engine shut down entirely between tests, all of its resources (including open database connections) were cleaned up by the OS.

Fortunately, I already had a database connection helper for other reasons: loading credentials securely, setting up exceptions as the error mode, choosing the character set, and retrying on failure if AWS was in the middle of a failover event (“Host not found”, “connection refused”, etc.) It was a relatively small matter to detect “Too many connections” and, if it was the first such error, issue gc_collect_cycles() before trying again.
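The retry logic looks roughly like this, sketched in JavaScript rather than the PHP of the real helper.  connect and collectCycles are injected stand-ins (in PHP they would be creating the PDO object and calling gc_collect_cycles()), and matching on the error message text is an assumption about how the condition is detected.

```javascript
// Try to connect; on the first "Too many connections" error, collect cycles
// and try again.  Any other error, or a second failure, is passed along.
function connectWithRetry(connect, collectCycles) {
  let collected = false;
  for (;;) {
    try {
      return connect();
    } catch (err) {
      const tooMany = String(err.message).includes("Too many connections");
      if (!tooMany || collected) throw err;
      // Free unreachable PDO/PDOStatement cycles, which closes their connections.
      collectCycles();
      collected = true;
    }
  }
}
```

Collecting only once keeps a genuinely overloaded database from turning into an endless retry loop.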

(Despite the “phpunit” name, there are functional and integration test suites for the project which are also built on phpunit.  Then, the actual tests to run are chosen using phpunit --testsuite functional, or left at the default for the unit test suite.)