Database queries that took small multiples of 100ms

I had a curious troubleshooting task last year. A developer came up asking if I could explain why logs of some Ruby Sidekiq workers often showed simple MySQL queries taking almost exact multiples of 100ms to complete. The numbers would be like: 105ms, 201ms, 402ms, 205ms, and so on…

Half an hour of interrogation, strace, top and browsing through MRI source code led to the following answers:

  • each Sidekiq process has many threads
  • the purpose of having so many threads was to fully utilize the CPU and memory running tasks that spent most of their time waiting for remote systems to respond
  • most of the jobs were quick and simple, but some could take many seconds of CPU time to process

We also observed an obvious correlation between the Sidekiq worker CPU spiking to 100% for extended periods of time and the round SQL query times being logged.

But look, here’s our 100ms!

Minutes later we stumbled upon this clear explanation. I’m quoting the relevant part:

Every 100ms the timer thread sets an interrupt flag on the thread currently holding the GIL.

Deep in a file called vm_eval.c is the code for handling Ruby method calls. It’s responsible for setting up the context for a method call and calling the right function. At the end of a function called vm_call0_body, right before it returns the return value of the current method, these interrupts are checked.

If this thread’s interrupt flag has been set, then it stops execution on the spot, before returning its value. Before executing any more Ruby code, the current thread will release the GIL and callsched_yield. sched_yield is a system function prompting the thread scheduler to schedule another thread.

Aha! So it works as follows:

  1. Thread A takes a note of current time T1, sends a SQL query and waits for results. This releases the Global Interpreter Lock and lets the VM schedule a different thread.
  2. The VM schedules Thread B to run. Unfortunately Thread B is going to take seconds of CPU time.
  3. Milliseconds later the TCP stack has the answer from MySQL ready, but the requesting thread A is still waiting for its turn on the CPU.
  4. After 100ms Thread B is interrupted. Another CPU-heavy thread C gets scheduled.
  5. After another 100ms, Thread C is interrupted and Thread A gets its slice of the CPU time.
  6. Thread A takes the current time T2 and logs it took T2 – T1 = 205ms to get the results of its simple SQL query.

Having found the answer, we left the system mostly as it was. It was optimized for throughput, not response times, so long job pauses had no real impact.

The obvious lessons are:

  • Measure and log the right thing. In this case it wasn’t the SQL query that took hundreds of milliseconds.
  • Know your runtime environment. :-)
  • Distribution of the times it takes to handle requests is important.
  • Global Interpreter Locks are not pretty.

The True Hackers

A quote of the day for me comes from a digest of the Gemini project software history (http://www.ibiblio.org/apollo/Gemini.html):

“I’m beginning to learn things about the Gemini memory I wish I had known in the 1960s. For example, after many divide instructions were executed in a row (I don’t know how many), the memory cores heated up and you would get the wrong answer in the accumulator. This problem was controlled by inserting a no-op instruction between the divides, thus keeping the cores cool enough to always give the right answer. Wow!”

Wow! indeed..!

Testing equipment at Hackerspace Kraków

Hackerspace Kraków got two nice analog oscilloscopes with 20MHz bandwidth from our sponsor. Initial tests using the built-in 1kHz square wave generators with provided probes showed a gloomy picture of bumpy squares that could have been caused by Who Knows What, and the picture below shows the best we could get by tuning the probes.

IMG_2575

Yesterday I brought my cheap Voltcraft/Hantek/Tekway DSO to see what’s going on. Turns out the main problem were our probes – using the ones from my oscilloscope we got nice squares using all three signal generators available!

IMG_2570

IMG_2574

However, testing revealed that test signals provided by generators in HS oscillosopes are not 1kHz squares anymore:

IMG_2584

IMG_2573

And this one is quite disturbing as well – given the same signal source – I hope it’s just the matter of using two completely different probes!

IMG_2577