Tuesday, May 15, 2018

MySQL data copying optimizations

Recently at work, I optimized a script.  It copies data from a live system to a demo system, anonymizing it along the way.  We used GUIDs as contract and customer identifiers, but in the pursuit of something shorter and more URL-friendly, we created a “V2 format.”

The V2 format secretly hides a client ID inside; the demo importer used to copy the IDs verbatim and not worry too much, but now it has to rewrite them all. That change took the script from running in about four minutes to 15 minutes against a local MySQL instance hosted in the same virtual machine.

Unfortunately, it took 75 minutes in production, and all that ID rewriting caused our infrastructure to raise alerts about data stability, response times, and IOPS rate limiting while a rebuild was running.

Update 2018-05-18: the production run takes around 9 minutes and 3 seconds after optimization, with far less disruption. I got the local instance rebuild down to 2 minutes and 15 seconds.  I want to share a bit about how I did it.



First, there’s some audit data that gets saved but never actually exposed in the UI, and it was still participating in the ID rewriting. I ended up leaving that log data out of the copy entirely.

I also restructured the ID rewriting drastically.  Before the need to handle the V2 format, we had simple INSERT ... SELECT statements to copy data between tables.  The naïve approach with V2 IDs was to do the same, and then later go back, read each ID, and update it in every dependent table.
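
In rough outline, the naïve flow looked something like this simplified PHP/PDO sketch (the schema names, columns, and the make_v2_id() helper are all invented for illustration):

    // Simplified sketch of the old "copy, then rewrite" flow.
    // $db is a PDO connection; schema and column names are made up.
    $db->exec("INSERT INTO demo.contracts SELECT * FROM live.contracts");

    // Afterwards, walk every copied ID and rewrite it everywhere it appears.
    foreach ($db->query("SELECT id FROM demo.contracts") as $row) {
        $newId = make_v2_id($demoClientId);   // hypothetical V2 ID helper
        $db->prepare("UPDATE demo.contracts SET id = ? WHERE id = ?")
           ->execute([$newId, $row['id']]);
        $db->prepare("UPDATE demo.invoices SET contract_id = ? WHERE contract_id = ?")
           ->execute([$newId, $row['id']]);
        // ...and so on for every dependent table.
    }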

By the end, I had added a new $idmap array that maps each incoming ID to the new ID generated for it.  We now select the source data, rewrite it on the client, and insert the final (rewritten) version.  We also gave this script a special ID generator that creates IDs in sequential order, so that records are inserted in primary key order and InnoDB doesn’t have to move anything around later.
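
Here’s a minimal sketch of the new flow, again with invented table names and a hypothetical next_sequential_v2_id() generator standing in for the real one:

    // Build the map while copying; never UPDATE afterwards.
    $idmap = [];   // old ID => new, sequentially generated V2 ID

    $insert = $db->prepare(
        "INSERT INTO demo.contracts (id, client_id, name) VALUES (?, ?, ?)");

    foreach ($db->query("SELECT id, name FROM live.contracts") as $row) {
        // IDs come out in ascending order, so rows land in primary key order.
        $idmap[$row['id']] = next_sequential_v2_id($demoClientId);
        $insert->execute([$idmap[$row['id']], $demoClientId, $row['name']]);
    }

    // Dependent tables are copied the same way, looking up $idmap for their
    // foreign keys instead of being rewritten later.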

Overall, instead of copying data and then rewriting the whole thing to new physical pages by updating key values one by one, we now “only” copy data as far as InnoDB is concerned.  All it sees is a stream of INSERT requests that land at the end of the table.

I was worried $idmap would take a lot of memory, but peak consumption was around 104 MB before the change, and 106 MB after it.

As for timing, I think at this point we had reached 8:35 for the total run, implying about 6.5 or 7 minutes had been saved between copying less data and avoiding the “copy-then-rewrite” approach.

The next thing that drastically accelerated insertion rates was wrapping the series of SQL INSERT statements in a single transaction. That shaved whole minutes off the execution time, bringing the copy phase from 3:54 to 1:06, and it was easily the biggest payoff-to-work change in the entire process.
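
With PDO, that’s only a few lines around the copy phase; copy_contracts() and copy_dependent_tables() below are hypothetical stand-ins for the insert loops sketched earlier:

    // One transaction around the whole copy phase: one commit (and its log
    // flush) instead of one per INSERT under autocommit.
    $db->beginTransaction();
    try {
        $idmap = copy_contracts($db);         // the insert loop from above
        copy_dependent_tables($db, $idmap);
        $db->commit();
    } catch (Throwable $e) {
        $db->rollBack();
        throw $e;
    }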

(This code actually dates from a time when everything was stored in MyISAM, which is why it didn’t use transactions to start with.)

I poked around with a few fruitless optimizations, and then I hit on the fast name randomizer.  For reasons, we have a couple of not-necessarily-users associated with contracts in the system, and we randomize those names during demo database creation to anonymize them as well.  The previous approach, which had been there since day 0, was to SELECT DISTINCT the names, then issue one UPDATE per distinct name, setting a random name wherever the column matched the name we were currently working on.
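
Roughly, that looked like this (same caveat as before: the contacts table, its columns, and random_name() are invented):

    // Old randomizer: one UPDATE per distinct name.  With no index on the
    // column, each UPDATE is a full pass over the table.
    foreach ($db->query("SELECT DISTINCT first_name FROM demo.contacts") as $row) {
        $db->prepare("UPDATE demo.contacts SET first_name = ? WHERE first_name = ?")
           ->execute([random_name(), $row['first_name']]);
    }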

In the new approach, we build random names into a temporary table, then issue an UPDATE with a subquery on that temp table. The update then writes all the rows with their new names in one pass.  (We still end up issuing two updates, one per column, because we can’t reopen the temp table in a single query.) The columns aren’t indexed, so this step moves from 642 passes over all rows of the table taking 3:17, down to 2 passes taking 0:19 in total… including the temp table management.
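
A sketch of the new randomizer for one column, with the same invented names:

    // New randomizer: generate the mapping once, then rewrite in one pass.
    $db->exec("CREATE TEMPORARY TABLE name_map (
                   old_name VARCHAR(100) PRIMARY KEY,
                   new_name VARCHAR(100) NOT NULL)");

    $fill = $db->prepare("INSERT INTO name_map (old_name, new_name) VALUES (?, ?)");
    foreach ($db->query("SELECT DISTINCT first_name FROM demo.contacts") as $row) {
        $fill->execute([$row['first_name'], random_name()]);
    }

    // One pass over contacts: each row picks up its new name via the subquery.
    $db->exec("UPDATE demo.contacts AS c
                  SET c.first_name = (SELECT m.new_name FROM name_map AS m
                                       WHERE m.old_name = c.first_name)");

    $db->exec("DROP TEMPORARY TABLE name_map");
    // last_name gets the same treatment with its own temp table and UPDATE.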

So that’s the story.  Save the IOPS: work in bulk!
