I had a curious troubleshooting task last year. A developer came up asking if I could explain why logs of some Ruby Sidekiq workers often showed simple MySQL queries taking almost exact multiples of 100ms to complete. The numbers would be like: 105ms, 201ms, 402ms, 205ms, and so on…
Half an hour of interrogation, strace, top and browsing through MRI source code led to the following answers:
- each Sidekiq process has many threads
- the purpose of having so many threads was to fully utilize the CPU and memory running tasks that spent most of their time waiting for remote systems to respond
- most of the jobs were quick and simple, but some could take many seconds of CPU time to process
We also observed an obvious correlation between the Sidekiq worker CPU spiking to 100% for extended periods of time and the round SQL query times being logged.
But look, here’s our 100ms!
Minutes later we stumbled upon this clear explanation. I’m quoting the relevant part:
Every 100ms the timer thread sets an interrupt flag on the thread currently holding the GIL.
Deep in a file called
vm_eval.cis the code for handling Ruby method calls. It’s responsible for setting up the context for a method call and calling the right function. At the end of a function called
vm_call0_body, right before it returns the return value of the current method, these interrupts are checked.
If this thread’s interrupt flag has been set, then it stops execution on the spot, before returning its value. Before executing any more Ruby code, the current thread will release the GIL and call
sched_yieldis a system function prompting the thread scheduler to schedule another thread.
Aha! So it works as follows:
- Thread A takes a note of current time T1, sends a SQL query and waits for results. This releases the Global Interpreter Lock and lets the VM schedule a different thread.
- The VM schedules Thread B to run. Unfortunately Thread B is going to take seconds of CPU time.
- Milliseconds later the TCP stack has the answer from MySQL ready, but the requesting thread A is still waiting for its turn on the CPU.
- After 100ms Thread B is interrupted. Another CPU-heavy thread C gets scheduled.
- After another 100ms, Thread C is interrupted and Thread A gets its slice of the CPU time.
- Thread A takes the current time T2 and logs it took T2 – T1 = 205ms to get the results of its simple SQL query.
Having found the answer, we left the system mostly as it was. It was optimized for throughput, not response times, so long job pauses had no real impact.
The obvious lessons are:
- Measure and log the right thing. In this case it wasn’t the SQL query that took hundreds of milliseconds.
- Know your runtime environment. :-)
- Distribution of the times it takes to handle requests is important.
- Global Interpreter Locks are not pretty.