userfaultfd() is here

I noticed a new (as of Linux kernel 4.3+) userfaultfd() system call while poking around make menuconfig today and, despite initially dismissing it as complexity creep, found lots of good information about the really interesting use cases it enables.

The manpage is here; it includes a sample program and gives a good idea of how userfaultfd() works.

To get more context and an overview of application ideas, I recommend starting with this presentation by Andrea Arcangeli.

To sum it up:

  • userfaultfd() is key to enabling live VM migration and research-level features like memory externalization
  • memory-constrained environments, like Android devices, need a way to reclaim memory from processes without crashing them.
    userfaultfd() makes it possible for a process to notice that its volatile pages are gone. It seems similar to how weak references work in many programming languages, but with memory pages instead of objects.
  • apparently more efficient snapshotting (think Redis), but I haven’t thought it through enough to understand how!
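The weak-reference analogy can be sketched in a few lines of Python (a loose analogy only: userfaultfd() deals in memory pages, not objects, and here CPython’s immediate refcounting plays the part of the kernel reclaiming memory):

```python
import weakref

class Page:
    """Stands in for an expensive, reclaimable buffer."""
    def __init__(self, data):
        self.data = data

def make_page():
    return Page(b"expensive-to-recompute bytes")

page = make_page()
handle = weakref.ref(page)   # a "volatile" reference: does not keep the Page alive

del page                     # the "kernel" reclaims the memory
if handle() is None:         # the process notices its page is gone...
    page = make_page()       # ...and regenerates it instead of crashing
```

The interesting part userfaultfd() adds is the notification: the process gets told about the missing page at access time and can repopulate it on the spot.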

The presentation also mentions JIT improvements, but that’s a broader (and exciting) subject. This essay provides a deep look from the perspective of a LISP compiler/VM and touches upon the GC implications.

These awesome posts show how the HotSpot JVM currently uses memory protection/SIGSEGV:

  • More JVM Signal tricks – Thread control via mprotect
  • SIGSEGV as control flow – How the JVM optimizes your null checks

I wonder if userfaultfd() is coming to HotSpot?

Racing with MySQL online schema changes

The gist of it

Even as of MySQL 5.7 there are ALTER TABLE operations that cannot be done online, that is, without blocking writes to the table whose schema is being changed. Altering a large table can take hours or days, and blocking writes for longer than a few seconds is often unacceptable. Fortunately, there exist great external tools that fill this gap, the Percona Toolkit for MySQL being one of them.

The principle of their operation is simple and roughly consists of the following steps:

– create a new table with the schema matching the old one
– quickly alter the new, empty table
– create a set of on delete/on update/on insert triggers on the original table that track changes on the original table and apply them on the new one
– copy data from the old table to the new one
– atomically switch the tables around (using rename)
– drop triggers and the old table
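The steps above can be simulated in a few lines of Python, with SQLite standing in for MySQL (an illustration only: pt-online-schema-change issues different DDL, and MySQL can swap both tables atomically in a single RENAME TABLE statement, which SQLite cannot):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT)")
db.executemany("INSERT INTO t(data) VALUES (?)", [("a",), ("b",)])

# steps 1-2: create the new table (imagine it was just ALTERed while still empty)
db.execute("CREATE TABLE t_new (id INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT)")

# step 3: change-capture triggers mirror writes on t into t_new
db.execute("""CREATE TRIGGER cap_ins AFTER INSERT ON t BEGIN
                INSERT OR REPLACE INTO t_new(id, data) VALUES (NEW.id, NEW.data); END""")
db.execute("""CREATE TRIGGER cap_upd AFTER UPDATE ON t BEGIN
                INSERT OR REPLACE INTO t_new(id, data) VALUES (NEW.id, NEW.data); END""")
db.execute("""CREATE TRIGGER cap_del AFTER DELETE ON t BEGIN
                DELETE FROM t_new WHERE id = OLD.id; END""")

# step 4: copy the existing rows
db.execute("INSERT OR REPLACE INTO t_new(id, data) SELECT id, data FROM t")

# a concurrent application write, picked up by the insert trigger
db.execute("INSERT INTO t(data) VALUES ('c')")

# steps 5-6: switch the tables around and clean up (not atomic here, unlike MySQL)
for trg in ("cap_ins", "cap_upd", "cap_del"):
    db.execute(f"DROP TRIGGER {trg}")
db.execute("ALTER TABLE t RENAME TO t_old")
db.execute("ALTER TABLE t_new RENAME TO t")
db.execute("DROP TABLE t_old")

print(db.execute("SELECT id, data FROM t ORDER BY id").fetchall())
# → [(1, 'a'), (2, 'b'), (3, 'c')]
```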

This works very well in practice, but in some cases there exists a very short time window when things can go funky if the altered table has an auto_increment column that’s meaningful to the application. Needless to say, auto-incremented integer primary keys that serve as unique object/entity identifiers are quite common.

The problematic sequence of events is as follows:

1) The online migration tool creates the new table
2) The application inserts new records and shortly thereafter deletes them
3) The online migration tool creates triggers and copies data to the new table, switches the tables around
4) The application inserts new rows into the database

At time (1), the migration tool creates the new table with the auto_increment value matching the original table.

Before the change-capture triggers are created at time (3), new rows appear and are removed at (2). The auto_increment on the original table changes, reflecting that.

Once the migration is done at (3), inserts into the new table at (4) are going to assign previously seen IDs to new, different objects/rows.

It can be a problem if information about the short-lived rows from time (2) reaches any external systems – from there on there would exist two different entities being referenced by the same ID! If our system performs data change capture at the MySQL binary log level and we hit this condition, we’re guaranteed to see two different rows being inserted into the same table with the same IDs.
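The four steps can be reproduced in miniature with Python and SQLite (a stand-in: SQLite’s AUTOINCREMENT keeps a per-table counter much like MySQL’s AUTO_INCREMENT, and the freshly created copy never learns about the short-lived row):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT)")
db.execute("INSERT INTO t(data) VALUES ('wat')")       # id 1

# (1) the migration tool creates the new table
db.execute("CREATE TABLE t_new (id INTEGER PRIMARY KEY AUTOINCREMENT, data TEXT)")

# (2) the application inserts and quickly deletes a row; no triggers exist yet,
# and in MySQL id 2 would already have leaked into the binary log
db.execute("INSERT INTO t(data) VALUES ('oops')")      # id 2
db.execute("DELETE FROM t WHERE data = 'oops'")

# (3) rows are copied; t_new's counter only advances to 1
db.execute("INSERT INTO t_new(id, data) SELECT id, data FROM t")

# (4) after the switch, the next insert reuses id 2 for a different entity
db.execute("INSERT INTO t_new(data) VALUES ('one more')")
print(db.execute("SELECT id FROM t_new WHERE data = 'one more'").fetchone()[0])  # → 2
```

An insert into the original table at step (4) would have received id 3 instead; only the switched-in copy has the stale counter.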

Fortunately, this race condition is unlikely to be triggered in practice, even from a security point of view.

It’s quite easy to reproduce the scenario in a sandbox environment, though, so let’s do that! We’ll create a test table and modify the pt-online-schema-change utility to guarantee our ugly race. I used version 2.2.16-1 for code references.

First, prepare the test database:

mysql> create database pt_test;
mysql> use pt_test;
mysql> create table pt_test (id int auto_increment primary key, data varchar(255));
mysql> insert into pt_test(data) values ("wat");

Next, modify the create_new_table subroutine in the tool’s source code. We’ll add the insert-delete sequence just after creating the new table:

$cxn->dbh()->do("insert into $orig_tbl->{name}(data) values (\"oops\");");
$cxn->dbh()->do("delete from $orig_tbl->{name} order by id desc limit 1;");

just before the following lines:

print $sql, "\n" if $o->get('print'); # the sql that work
print "Created new table $orig_tbl->{db}.$new_table_name OK.\n";

Run it!

$ pt-online-schema-change u=USERNAME,p=PASSWORD,D=pt_test,t=pt_test --alter='modify column data varchar(1024)' --execute

Let’s see what happens:

mysql> select * from pt_test;
| id | data |
| 1  | wat  |

Let’s add one more row.

mysql> insert into pt_test(data) values ("one more");
mysql> select * from pt_test;
| id | data     |
| 1  | wat      |
| 2  | one more |

The newly added row will have the id = 2, but we know that a row with the id of 2 has already been inserted and deleted before! If we have binary logging enabled, the mysqlbinlog utility will confirm the short existence of the “oops” entry with id of 2. Oops…

What now?

I cannot think of any ways of fixing this problem other than locking the original table for the time it takes to create the new table *and* the change capture triggers. Transactions are obviously of no use, since they don’t apply to DDL operations.

An alternative solution is to not rely on database-side auto-increment identifiers at all and use GUIDs instead.
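As a sketch, application-generated identifiers (here Python’s random UUIDs, purely for illustration) sidestep the problem because they don’t depend on any table-level counter:

```python
import uuid

# ids generated by the application, not by the table's auto-increment counter,
# so recreating a table cannot cause an id to be reused
a, b = uuid.uuid4(), uuid.uuid4()
assert a != b  # 122 random bits make collisions negligible in practice
print(a)
```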

Database queries that took small multiples of 100ms

I had a curious troubleshooting task last year. A developer came up asking if I could explain why logs of some Ruby Sidekiq workers often showed simple MySQL queries taking almost exact multiples of 100ms to complete. The numbers would be like: 105ms, 201ms, 402ms, 205ms, and so on…

Half an hour of interrogation, strace, top and browsing through MRI source code led to the following answers:

  • each Sidekiq process has many threads
  • the purpose of having so many threads was to fully utilize the CPU and memory while running tasks that spent most of their time waiting for remote systems to respond
  • most of the jobs were quick and simple, but some could take many seconds of CPU time to process

We also observed a clear correlation between the Sidekiq worker CPU spiking to 100% for extended periods of time and the suspiciously round SQL query times being logged.

But look, here’s our 100ms!

Minutes later we stumbled upon this clear explanation. I’m quoting the relevant part:

Every 100ms the timer thread sets an interrupt flag on the thread currently holding the GIL.

Deep in a file called vm_eval.c is the code for handling Ruby method calls. It’s responsible for setting up the context for a method call and calling the right function. At the end of a function called vm_call0_body, right before it returns the return value of the current method, these interrupts are checked.

If this thread’s interrupt flag has been set, then it stops execution on the spot, before returning its value. Before executing any more Ruby code, the current thread will release the GIL and call sched_yield. sched_yield is a system function prompting the thread scheduler to schedule another thread.

Aha! So it works as follows:

  1. Thread A takes a note of current time T1, sends a SQL query and waits for results. This releases the Global Interpreter Lock and lets the VM schedule a different thread.
  2. The VM schedules Thread B to run. Unfortunately Thread B is going to take seconds of CPU time.
  3. Milliseconds later the TCP stack has the answer from MySQL ready, but the requesting thread A is still waiting for its turn on the CPU.
  4. After 100ms Thread B is interrupted. Another CPU-heavy thread C gets scheduled.
  5. After another 100ms, Thread C is interrupted and Thread A gets its slice of the CPU time.
  6. Thread A takes the current time T2 and logs it took T2 – T1 = 205ms to get the results of its simple SQL query.
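CPython’s GIL behaves much like MRI’s here, so the convoy effect above is easy to reproduce in Python; this sketch raises the switch interval to mimic Ruby’s 100ms quantum (an analogy on a standard GIL build of CPython, not the actual Sidekiq setup):

```python
import sys
import threading
import time

sys.setswitchinterval(0.1)  # mimic MRI's 100 ms GIL quantum (CPython's default is 5 ms)

stop = False

def busy():
    # CPU-bound work that never voluntarily releases the GIL
    while not stop:
        sum(range(1000))

t = threading.Thread(target=busy)
t.start()

start = time.monotonic()
time.sleep(0.005)   # the "SQL query": releases the GIL, result is ready after 5 ms
elapsed = time.monotonic() - start
stop = True
t.join()

print(f"a 5 ms wait took {elapsed * 1000:.0f} ms")
```

On my understanding of the scheduler, the waiting thread only gets the GIL back once the CPU-bound thread is asked to drop it, so the printed time lands near one full quantum rather than near 5ms.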

Having found the answer, we left the system mostly as it was. It was optimized for throughput, not response times, so long job pauses had no real impact.

The obvious lessons are:

  • Measure and log the right thing. In this case it wasn’t the SQL query that took hundreds of milliseconds.
  • Know your runtime environment. 🙂
  • Distribution of the times it takes to handle requests is important.
  • Global Interpreter Locks are not pretty.

The True Hackers

A quote of the day for me comes from a digest of the Gemini project software history:

“I’m beginning to learn things about the Gemini memory I wish I had known in the 1960s. For example, after many divide instructions were executed in a row (I don’t know how many), the memory cores heated up and you would get the wrong answer in the accumulator. This problem was controlled by inserting a no-op instruction between the divides, thus keeping the cores cool enough to always give the right answer. Wow!”

Wow! indeed..!

Testing equipment at Hackerspace Kraków

Hackerspace Kraków got two nice analog oscilloscopes with 20MHz bandwidth from our sponsor. Initial tests using the built-in 1kHz square wave generators with provided probes showed a gloomy picture of bumpy squares that could have been caused by Who Knows What, and the picture below shows the best we could get by tuning the probes.


Yesterday I brought my cheap Voltcraft/Hantek/Tekway DSO to see what’s going on. It turns out the main problem was our probes – with the ones from my oscilloscope we got nice squares from all three signal generators available!



However, testing revealed that the test signals provided by the generators in the HS oscilloscopes are no longer 1kHz squares:



And this one is quite disturbing as well – given the same signal source – I hope it’s just a matter of using two completely different probes!


R.I.P. PayPass (mBank)

We recently got issued a PayPass-enabled debit card. Not that we wanted one, no. But there seems to be a crazy push for wireless payments going on in Poland and it’s getting hard to get a card without it (or PayWave). Given the security concerns with these solutions (remote cloning), I decided to give it a go and try to disable PayPass while keeping the other functions working. Turns out there’s a cheap and fairly reliable way to do it, and it involves… x-rays. And drilling. 🙂

Here’s what the card I got looks like internally:

You can clearly see where the chip is, how the antenna is connected to it and where it runs inside the card. Since it’s basically an RFID chip, it requires an external power source to function – in this case, current is induced in the antenna. In theory, breaking the loop should be enough to disable wireless payments. Why not drill through it, then? 😀

Before any changes were made, I verified that the card worked in ATMs, in POS terminals, and wirelessly – to make sure it would be the changes that disabled it, not chance. I decided to drill two 3mm holes through it, just to be sure, and here’s what it looked like after the operation:



As you can see I messed up a little and drilled right through the magnetic stripe, but the card still works! ATMs and POS terminals do, PayPass… doesn’t. Mission accomplished!