Monday, April 30, 2012

Uncontrolled Mutation: An Example of Inefficiency

I pruned /etc/apt/sources.list a bit tonight, because it annoys me when I have to wait for a lot of data just to answer the question, "So do I need to upgrade anything?"  Really, when a handful of updates are published, I should not have to get a fresh 4.6 MB copy of the entire multiverse package tree.

Since any part of the index file might have changed, the scope of potential mutation is the entire file, and clients only get file-level granularity for controlling the amount of data they download: all of it or none of it.

One obvious suggestion is zsync, which adds a metadata file describing the contents of the data.  This is subject to the same problems of mutability, but the metadata is much smaller.  (Essentially, zsync is rsync with the checksums pre-computed and stored explicitly, so that you don't need anything more than HTTP to serve the actual bits.)  zsync would allow the client to discover the true scope of mutation and download the changed fragments.
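To make the idea concrete, here is a rough sketch of how a client might use a published checksum list to work out which byte ranges to fetch. It is not zsync's actual format or algorithm: real zsync uses rolling checksums so it can also match data that has shifted position, while this toy version only compares blocks at fixed offsets. The block size, the choice of SHA-1, and the function names are all my own assumptions for illustration.

```python
import hashlib
from typing import List, Tuple

BLOCK_SIZE = 4096  # hypothetical block size, not zsync's default


def block_checksums(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Checksum each fixed-size block; this list stands in for the metadata file."""
    return [
        hashlib.sha1(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_ranges(local: bytes, remote_sums: List[str],
                   block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) byte ranges whose blocks differ from the
    local copy, merging adjacent blocks into one range.

    The end of the final range may point past the end of the remote file;
    an HTTP server simply returns what exists up to the end of the file.
    """
    local_sums = block_checksums(local, block_size)
    ranges: List[Tuple[int, int]] = []
    for idx, remote_sum in enumerate(remote_sums):
        if idx < len(local_sums) and local_sums[idx] == remote_sum:
            continue  # block unchanged, nothing to download
        start = idx * block_size
        end = start + block_size - 1
        if ranges and ranges[-1][1] + 1 == start:
            ranges[-1] = (ranges[-1][0], end)  # extend the previous range
        else:
            ranges.append((start, end))
    return ranges
```

Each resulting pair could then be sent as an ordinary `Range: bytes=start-end` request, which is why nothing beyond a plain HTTP server is needed on the mirror side.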

Perhaps even better would be a log-structured file, divided into blocks that each contain a small chunk of metadata stating their time of update, size, number of packages, and GPG signature.  The first block contains a complete index, of course.  This could be combined with HTTP partial downloads to repeatedly resume the file over time, allowing the client to collect the latest pack of updates (and nothing more) on each request.  If disk space actually became a concern, the file could be replaced with a new version periodically, and some simple mitigations (416 Range Not Satisfiable, checking for "new update block" magic) in clients could alert them that they needed to restart from the beginning.
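The client side of that scheme is small. Below is a sketch of just the decision logic, with the actual HTTP call left out: given how many bytes of the log the client already has, build the Range header, then decide what to do from the response status. The magic string, the action names, and the function names are made up for illustration; no such index format exists. One wrinkle it handles: a server also answers 416 when the client asks for bytes starting exactly at the end of the file, so the sketch reads the total length from the `Content-Range: bytes */N` header on the 416 response to tell "nothing new" apart from "the file was replaced".

```python
from typing import Dict, Optional

MAGIC = b"UPDBLK\n"  # hypothetical marker at the start of every update block


def range_header(local_size: int) -> Dict[str, str]:
    """Ask only for bytes we do not have yet (open-ended range)."""
    return {"Range": "bytes=%d-" % local_size}


def next_action(status: int, local_size: int,
                content_range: Optional[str] = None) -> str:
    """Map an HTTP response to what the client should do with its local copy."""
    if status == 206:
        return "append"      # partial content: new blocks to add to the log
    if status == 200:
        return "replace"     # server ignored Range and sent the whole file
    if status == 416:
        prefix = "bytes */"
        if content_range and content_range.startswith(prefix):
            total = content_range[len(prefix):]
            if total.isdigit() and int(total) == local_size:
                return "up-to-date"  # we already hold every byte
        return "restart"     # file is shorter than our copy: it was replaced
    raise ValueError("unexpected HTTP status: %d" % status)


def looks_like_update_block(chunk: bytes) -> bool:
    """Sanity check on appended data: a resumed download must begin on a
    block boundary, otherwise the file was replaced underneath us."""
    return chunk.startswith(MAGIC)
```

If a 206 response comes back but the data fails the magic check, the client would treat that the same as "restart" and fetch the file from byte zero.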

It's 2012.  Can we require HTTP/1.1 capable mirrors yet?  It seems like we're spending pounds on bandwidth to save pennies on disk space.
