Saturday, October 13, 2018

Idea: Type Propagation for Gradual Typing

This paper, recently featured on Reddit, got me thinking.

Perhaps it’s best to add type information starting with the high-level modules. Intuitively, a low-level leaf function (especially a frequently called one) that checks and re-checks its argument types on every call will cost more than a higher-level function that is called only a few times over the course of a run.

For instance, for a program that does data processing, added type checks in “once per file” functions would have less effect on the execution time than type checks in “once per line” functions.
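
A minimal sketch of what I mean, with made-up function names and explicit isinstance checks standing in for whatever checks a gradual-typing runtime would insert:

def process_file(path: str) -> list:
    # called once per file: this check runs rarely
    if not isinstance(path, str):
        raise TypeError("path must be a str")
    with open(path) as fh:
        return [process_line(line) for line in fh]

def process_line(line: str) -> str:
    # called once per line: the same check runs millions of times,
    # so its cost dominates the run
    if not isinstance(line, str):
        raise TypeError("line must be a str")
    return line.strip().lower()

If the per-line check costs even a microsecond, a million-line file spends a full second on it, while the per-file check is paid once.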

But maybe we’re missing something here.  The paper adds complete type information to one module at a time, but does nothing about inter-module calls at each step.  That is, a module may declare that it accepts a string argument, but callers in other modules won’t be declaring that they pass strings until those modules have types added as well.
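
Sketching that boundary problem the same way (the module and function names are hypothetical):

# lowlevel.py -- fully annotated, so it checks its argument at runtime
def parse_count(value: str) -> int:
    if not isinstance(value, str):
        raise TypeError("value must be a str")
    return int(value)

# highlevel.py -- not yet annotated; nothing here declares what it passes,
# so lowlevel.py cannot safely skip its own check
def handle(row):
    return parse_count(row["count"])

Until the caller’s module is typed as well, the typed callee keeps paying for that defensive check on every call.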

Tuesday, September 25, 2018

asyncio not handling SIGCHLD? callback never called?

I wrote a process manager into the new memcache-dynamo.  Maybe I shouldn’t have, but it happened, and I’ve had to fix my bugs.

The problem is, the parent would never notice when the child exited.  Other signals were being handled fine, but the SIGCHLD handler was never called.

This is because, although the method is called “add” signal handler, it really behaves more like “set”: it replaces any handler already registered for that signal.  Also, the Unix event loop needs to know about exiting children in order to clean up subprocess resources, so it registers its own SIGCHLD handler.
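
A tiny illustration of that behavior (Unix only; the print lambdas are placeholders of my own):

import asyncio, signal

loop = asyncio.get_event_loop()
loop.add_signal_handler(signal.SIGCHLD, lambda: print("first handler"))
# despite the name, this call replaces the handler registered above
loop.add_signal_handler(signal.SIGCHLD, lambda: print("second handler"))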

As it turns out, the correct way to go about this is to use a “child watcher” to allow outside code to react to SIGCHLD.  One should call get_child_watcher and then, on the returned object, add_child_handler. This takes a PID argument, so it can only be done once the child has been created.  At minimum:

proc = await asyncio.create_subprocess_exec(…)
# register our callback for when this particular child exits
watcher = asyncio.get_child_watcher()
watcher.add_child_handler(proc.pid, onChildExit)

This “onChildExit” is the name of a callback function, which will be called with the PID and return code as arguments.  If additional positional arguments are given to add_child_handler, those will also be passed to the callback when it is called.
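
Putting it together, a minimal runnable sketch (Unix, Python 3.6+).  The “sleep” command, the done event, and the way the script waits are illustrative choices of mine, not from the original code:

import asyncio

def onChildExit(pid, returncode, done):
    # pid and return code come first; extra positional arguments given
    # to add_child_handler (here: done) are appended after them
    print("child", pid, "exited with status", returncode)
    done.set()

async def main():
    done = asyncio.Event()
    proc = await asyncio.create_subprocess_exec("sleep", "1")
    watcher = asyncio.get_child_watcher()
    watcher.add_child_handler(proc.pid, onChildExit, done)
    await done.wait()

asyncio.get_event_loop().run_until_complete(main())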

The other signals can be handled in the usual manner, but SIGCHLD is special this way.

(This applies to Unix/macOS only, as Windows doesn’t have POSIX signals. Maybe the shiny new subsystem does, but in general, it doesn’t.)

Thursday, September 6, 2018

I still don't understand Python packaging

Since we last talked about this subject, I've tried to use pipenv with PIPENV_VENV_IN_PROJECT=1 for the project in question. Everything was going pretty well, and then… updates!

I'm using a Homebrew-installed Python for testing, because it's easier and faster on localhost, and Homebrew's available Python version was upgraded from 3.6 to 3.7. As usual, I ran brew cleanup -s, so the Python 3.6 installation is gone.

It turns out that my python_version = "3.6" line doesn't do what I want (pipenv can no longer do anything at all, because that binary no longer exists), and I haven't been able to figure out a way to ask pipenv for "3.6 or above" that would both:
  1. Express the "minimum version: 3.6" requirement
  2. Allow 3.7 to satisfy the requirement
pipenv seems happy enough to use the system Python when given a version requirement of ">=3.6", but it also acts as though that requirement is only a warning. pipenv check doesn't like this solution either, and it's not clear that a system Python 3.5 would cause it to fail the way I'd want.
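
For context, the line in question lives in the Pipfile's [requires] section; the comments below just restate the behavior described above:

[requires]
# exact pin: breaks outright once Homebrew removes the python3.6 binary
python_version = "3.6"
# what I mean is ">=3.6", but that only seems to be treated as a warning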

In PHP, this is just not that hard. We put "php": "^7.1.3" in our composer.json file, and it installs on any PHP >= 7.1.3 and < 8.0.0. It fails on anything older than 7.1.3, on 8.x, and on 8.0 development versions. All of that is understood and supported by the tool.

So anyway: right now, we have a deployment process which is more or less "read the internet; build in place for production; swap symlink to make updated code available to the web server."

The end goal is to move the production deployment process to "extract a tarball; swap symlink." To do that, we need a build step that does "read the internet; build in place; roll into tarball" beforehand. And AFAICT, building a virtualenv into a tarball will package everything successfully, much like Composer does, but it will also bake in absolute paths to the build machine's Python installation.

Pipfile and Pipfile.lock look like what I want (deterministic dependency selection at build time and, with the environment variable, in-project vendoring of those dependencies), but pipenv seems to be fundamentally built on virtualenv, which is a thing I don't want. I obviously want dependencies like aiobotocore vendored, but I don't necessarily want "the python binary" and everything at that layer. I especially don't want the tarball to contain symlinks pointing outside the build root.

Overall, I think pipenv is trying to solve my problem? But it has dragged in virtualenv to do it, which "vendors too much stuff," and it has never been clear to me what benefit I'm supposed to get from a bloated virtualenv. On top of that, virtualenv doesn't fully support relocatable environments, which is yet another problem to overcome. In the past virtualenv has been fairly benign; now it has turned adversarial.

(We have the technical capability to make the build server and the production server match, file path for file path, exactly. But my devops-senses tell me that tightly coupling these things is a poor management decision, which seems to imply poor design on the part of virtualenv at least. And that contaminates everything built on top of it.)

Monday, August 27, 2018

Configuration as a Point of Failure

Recently, I happened to restore service for our /giphy Slack command.  I originally had things set up poorly, so we needed a ProxyPass configured in every virtual host that wanted to run PHP.  That changed at some point.

I updated all of the configurations in git, of course.

But our giphy endpoint is a separate project, meant to be shared with the world, which therefore doesn’t follow our standard layout.  Its config is not in its git dir; I missed it in the update pass.

I had also just dealt with all the config file merging and changes in the Ubuntu 18.04.1 LTS update.

It really got me to thinking: anything configurable is a point of failure. Whether it’s a process/knowledge failure like with our giphy endpoint, or merge conflicts in *.lock or webpack.config.js files, or a system package changing its configuration semantics or defaults between versions:

The presence of configuration allows broken configurations.

We should try to avoid unnecessary configurability in a system.  This is also very difficult to achieve—a system that only has defaults absolutely needs to have correct, working defaults.  The problem only gets harder in open source, where a project often serves many diverse users.

Also, one of my greatest weaknesses is throwing in easy-to-build options “just in case.”

Finally, it’s bad practice to bake credentials into a git repository, but how shall they be provided without configuration?  EC2 has IAM Roles, but they don’t work in VirtualBox… it really does seem like some configuration is a necessary evil.

Thursday, June 21, 2018

Version Locking

We have a few lovely workarounds, arrived at through a small measure of blood, sweat, tears, and research, embedded in our cpanfile these days.  Here are my two favorites:

requires 'Email::MIME', '<= 1.928';
    # 1.929+ screws up Content-Type sometimes
requires 'DateTime::Format::Strptime', '<= 1.57';
    # DynamoDB v0.1.16 (current) buggy with rewritten Strptime

Nobody wants to come back and check these, to see if later releases of the packages work again.  (It’s possible I’m the only one who remembers these hacks are here.)

Email::MIME was especially nasty because it only failed sometimes, and now, how do I prove that the fixes (which may have occurred in 1.931 or 1.933) actually solve the problem?  I can’t prove the negative in test, and I can’t trust the package in production.

As for the other fun bug, “DynamoDB v0.1.16” refers to the package whose full name is Net::Amazon::DynamoDB, which was released on November 6th, 2012.  I think this one was detectable at install/test time, but it was still No Fun.  A lot of work went into discovering that its dependency had changed, and I’m not excited to redo it all to find out whether Strptime 1.67 (or 1.63) fixed the issues.

Especially since use of Perl was deprecated company-wide, and we want to get it all ported to a new language.

Editor’s Notes: this was another old draft.

Since it was written, we accidentally introduced an updated version of Email::MIME into production, and it still failed.  We fixed the bug that allowed the update to occur; clearly, upstream’s process is broken.  I don’t think we could be the only people hit, across multiple years and multiple versions of the entire software stack.

I’m not entirely sure what happened with Net::Amazon::DynamoDB—but we may have been able to use a newer version with it.

Thursday, June 14, 2018

Python, virtualenv, pipenv

I heard (via LWN) about some discussion of Python and virtualenvs.  I'm bad at compressing thoughts enough to fit Twitter and still make sense, so I want to cover a bit about my recent experiences here.

I'm writing some in-house software (a new version of memcache-dynamo, targeting Python 3.6+ instead of Perl) and I would like to deploy this as, essentially, a tarball.  I want to build in advance and publish an artifact with a minimum amount of surrounding scripts at deploy time.

The thing is, the Python community seems to have drifted away from being able to run software without a complex installation system that involves running arbitrary Python code.  I can see the value in tools like tox and pipenv—for people who want to distribute code to others.  But that's not what I want to do; I want to distribute pre-built code to myself, and as such, "execute from source" has always been my approach.

[Update 2018-09-06: I published another post with further thoughts on this problem.]

Thursday, June 7, 2018

Stream Programming, Without Streams

Editor’s note: Following is a brief draft from two years ago.  I’m cleaning out the backlog and faking HUGE POST COUNTS!! for 2018, I guess.

I wrote before that I don't “use” stream programming, but I've come to realize that it's still how I think of algorithms and for loops.  Input enters at the top, gets transformed in the middle, and yields output at the bottom.

It's like I look at a foreach loop as an in-line lambda function.  The concept may not be explicitly named, and the composition of steps may be baked into lower-level control flow… but inside my head, it's still a “sequence of operations” on a “stream of data.”

There doesn't seem to be much benefit to building up “real” streams in a language that doesn't have them either built in or in the standard library.  It creates another layer in which the things that have been built can only interoperate with themselves, and a series of transformations can no longer share state.  Besides, a PHP array can be passed to any array_* function, whereas (last I checked) our handmade streams or Iterators cannot.