Racing with MySQL online schema changes

The gist of it

Even as of MySQL 5.7 there are ALTER TABLE operations that cannot be done on-line, that is without blocking writes to the table that has its schema changed. Altering a large table can take hours or days and blocking writes for longer than a few seconds is often unacceptable. Fortunately there exist great external tools that fix this gap, the Percona Toolkit for MySQL being one of them.

The principle of their operation is simple and roughly consists of the following steps:

– create a new table with the schema matching the old one
– quickly alter the new, empty table
– create a set of on delete/on update/on insert triggers on the original table that track changes on the original table and apply them on the new one
– copy data from the old table to the new one
– atomically switch the tables around (using rename)
– drop triggers and the old table

This works very well in practice, but in some cases there exists a very short time window when things can go funky if the altered table has an auto_increment column that’s meaningful to the application. Needless to say, auto-incremented integer primary keys that serve as unique object/entity identifiers are quite common.

The problematic sequence of events is as follows:

1) The online migration tool creates the new table
2) The application inserts and shortly thereafter deletes the newly inserted records from the database
3) The online migration tool creates triggers and copies data to the new table, switches the tables around
4) The application inserts new rows into the database

At time (1), the migration tool creates the new table with the auto_increment value matching the original table.

Before the change-capture triggers are created at time (3), new rows appear and are removed at (2). The auto_increment on the original table changes, reflecting that.

Once the migration is done at (3), inserts into the new table at (4) are going to assign previously seen IDs to new, different objects/rows.

It can be a serious problem if information about the short-lived rows from time (2) has ever reached any external systems – from there on there would be two different entities being referenced by the same ID! If our system performs data change capture at the MySQL binary log level and we hit this condition, we’re guaranteed to see both of the two different rows being inserted into “the same” table with the same id values.

Fortunately this race condition is unlikely in practice, even from the security point of view.


Fortunately it’s quite easy to reproduce this scenario in a sandbox environment, so let’s do just that! We’ll create a test table and modify the pt-online-schema-change utility to guarantee our ugly race. I used the 2.2.16-1 version for code references.

First, prepare the test database:

mysql> create database pt_test;
mysql> use pt_test;
mysql> create table pt_test (id int auto_increment primary key, data varchar(255));
mysql> insert into pt_test(data) values ("wat");

Next, modify the create_new_table subroutine in the tool’s source code. We’ll add the insert-delete sequence just after creating the new table:

$cxn->dbh()->do("insert into $orig_tbl->{name}(data) values (\"oops\");");
$cxn->dbh()->do("delete from $orig_tbl->{name} order by id desc limit 1;");

just before the following lines:

print $sql, "\n" if $o->get('print'); # the sql that work
print "Created new table $orig_tbl->{db}.$new_table_name OK.\n";

Run it!

$ pt-online-schema-change u=USERNAME,p=PASSWORD,D=pt_test,t=pt_test --alter='modify column data varchar(1024)' --execute

Let’s what happens:

mysql> select * from pt_test;
| id | data |
| 1  | wat  |

Let’s add one more row.

mysql> insert into pt_test(data) values ("one more");
mysql> select * from pt_test;
| id | data     |
| 1  | wat      |
| 2  | one more |

The newly added row will have the id = 2, but we know that a row with the id of 2 has already been inserted and deleted before! If we have binary logging enabled, the mysqlbinlog utility will confirm the short existence of the “oops” entry with id of 2. Oops…

What now?

I cannot think of any ways of fixing this problem other than locking the original table for the time it takes to create the new table *and* the change capture triggers. Transactions are obviously of no use, since they don’t apply to DDL operations.

An alternative solution is not to rely on identifiers obtained with auto increments on the database side and use GUIDs instead.

Database queries that took small multiples of 100ms

I had a curious troubleshooting task last year. A developer came up asking if I could explain why logs of some Ruby Sidekiq workers often showed simple MySQL queries taking almost exact multiples of 100ms to complete. The numbers would be like: 105ms, 201ms, 402ms, 205ms, and so on…

Half an hour of interrogation, strace, top and browsing through MRI source code led to the following answers:

  • each Sidekiq process has many threads
  • the purpose of having so many threads was to fully utilize the CPU and memory running tasks that spent most of their time waiting for remote systems to respond
  • most of the jobs were quick and simple, but some could take many seconds of CPU time to process

We also observed an obvious correlation between the Sidekiq worker CPU spiking to 100% for extended periods of time and the round SQL query times being logged.

But look, here’s our 100ms!

Minutes later we stumbled upon this clear explanation. I’m quoting the relevant part:

Every 100ms the timer thread sets an interrupt flag on the thread currently holding the GIL.

Deep in a file called vm_eval.c is the code for handling Ruby method calls. It’s responsible for setting up the context for a method call and calling the right function. At the end of a function called vm_call0_body, right before it returns the return value of the current method, these interrupts are checked.

If this thread’s interrupt flag has been set, then it stops execution on the spot, before returning its value. Before executing any more Ruby code, the current thread will release the GIL and callsched_yield. sched_yield is a system function prompting the thread scheduler to schedule another thread.

Aha! So it works as follows:

  1. Thread A takes a note of current time T1, sends a SQL query and waits for results. This releases the Global Interpreter Lock and lets the VM schedule a different thread.
  2. The VM schedules Thread B to run. Unfortunately Thread B is going to take seconds of CPU time.
  3. Milliseconds later the TCP stack has the answer from MySQL ready, but the requesting thread A is still waiting for its turn on the CPU.
  4. After 100ms Thread B is interrupted. Another CPU-heavy thread C gets scheduled.
  5. After another 100ms, Thread C is interrupted and Thread A gets its slice of the CPU time.
  6. Thread A takes the current time T2 and logs it took T2 – T1 = 205ms to get the results of its simple SQL query.

Having found the answer, we left the system mostly as it was. It was optimized for throughput, not response times, so long job pauses had no real impact.

The obvious lessons are:

  • Measure and log the right thing. In this case it wasn’t the SQL query that took hundreds of milliseconds.
  • Know your runtime environment. 🙂
  • Distribution of the times it takes to handle requests is important.
  • Global Interpreter Locks are not pretty.

The True Hackers

A quote of the day for me comes from a digest of the Gemini project software history (

“I’m beginning to learn things about the Gemini memory I wish I had known in the 1960s. For example, after many divide instructions were executed in a row (I don’t know how many), the memory cores heated up and you would get the wrong answer in the accumulator. This problem was controlled by inserting a no-op instruction between the divides, thus keeping the cores cool enough to always give the right answer. Wow!”

Wow! indeed..!

Getting Atom Zombie Smasher to run under Ubuntu 12.04

There’s a number of forum posts asking for help with Atom Zombie Smasher crashing after the loading screen with recent versions of Linux distros on the Internet, but I’ve been unable to find one with a solution – so here it is.

In short, there seems to be a race condition somewhere in the initialization code that makes the process terminate in xcb code. Not having the sources available, my solution was to set the game’s CPU affinity to just one core, thus greatly reducing the chances of the crash. It’s quite reliable (9/10 I’d say), but don’t be surprised if it does crash on some attempts!

To apply the “fix”, change the last line in the data/atomzombiesmasher script to:

MONO_LOG_LEVEL="debug" taskset 0x00000001 ./mono ./release.exe "$@"

Good luck!

“private” methods in Python

Python supports semi-private methods through a name mangling trick. The interpreter transforms method names starting with two underscores by prepending it with the class name and some additional underscores, so it becomes impossible to call the method by its original name. Here’s an example:

>>> class T(object):
... def __m(self): pass
>>> t = T()
>>> dir(t)
['_T__m', '__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__']
>>> t.__m()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
AttributeError: 'T' object has no attribute '__m'

It is still possible to call the method using its mangled name.

Fortunately, one can still use the original name from within other methods of the same class:

>>> class T(object):
... def __m(self): print "m"
... def public(self): self.__m()
>>> t = T()
>>> t.public()

It works because bytecode generated for this method uses the mangled name:

>>> import dis
>>> dis.dis(t.public)
 3 0 LOAD_FAST 0 (self)
 3 LOAD_ATTR 0 (_T__m)
 10 LOAD_CONST 0 (None)

Post prompted by a colleague’s question.