Blocking gevent's Hub Part 2: Finding Blocking

Last time we looked at the causes and consequences of greenlets that block and avoid switching to the hub. Many times such greenlets (really, the code they run) are obvious, but some times they are not.

This post will look at some new features that gevent 1.3 added to make finding greenlets that block easier. These can be used both during development and in production.

Contents

What It does

gevent 1.3 includes a couple features to help find code that's blocking. One of those we can turn on using just environment variables, no code changes needed (although you can configure it in code as well). This makes it especially handy for debugging on the fly.

Let's take a look at one of our blocking examples from last time:

import time
import gevent

def f():
    time.sleep(3)

now = time.time()
greenlet = gevent.spawn(f)
while not greenlet.dead:
    print('.', end='')
    gevent.sleep(1)
after = time.time()
print("\nElapsed time: %1.1f" % (after - now))

Now we'll tell gevent to let us know if it detects any greenlets blocking the hub and run the example (the duration of time that a greenlet is allowed to block before it gets reported is configurable; we'll go with the defaults here):

$ GEVENT_MONITOR_THREAD_ENABLE=true python /tmp/test.py
================================================================================

2018-06-22T19:20:10Z : Greenlet <Greenlet "Greenlet-0" at 0x1.: f> appears to be blocked
    Reported by <gevent.__tracer.GreenletTracer object at 0x...>
Blocked Stack (for thread id 0x...):
  File "/tmp/test.py", line 5, in f
    time.sleep(3)

Info:
********************************************************************************
* Threads
********************************************************************************
Thread 0x2 (MainThread)

  File "/tmp/test.py", line 5, in f
    time.sleep(3)

********************************************************************************
* Greenlets
********************************************************************************
<greenlet.greenlet object at 0x3>
 :    Parent: None
 :    Monitoring Thread:<PeriodicMonitoringThread >
<greenlet.greenlet object at 0x4>
 :    Parent: None
 +--- <Greenlet "Greenlet-0" at 0x1: f>
 :          Parent: <Hub '' at 0x5 select default pending=0 ref=1 thread_ident=0x2>
 :          Spawned at:
 :            File "/tmp/test.py", line 8, in <module>
 :              greenlet = gevent.spawn(f)
 +--- <Hub '' at 0x5 select default pending=0 ref=1 thread_ident=0x2>
            Parent: <greenlet.greenlet object at 0x4>

 .
 Elapsed time: 3.0

This time, gevent detected the hold up and printed a report about what the problem was. [1]

Testing And Asserting

Either in unittests, from particularly latency sensitive parts of your application, or when trying to further narrow down the cause of blocking, gevent provides the assert_switches() context manager. Wrap this manager around some code to be sure that it switches to another greenlet or back to the hub, optionally within a set amount of time. If nothing switches, then an exception will be raised with the same sort of information as above.

>>> from gevent.util import assert_switches
>>> # This will always raise an exception: nothing switches
>>> with assert_switches():
...     pass
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "//gevent/util.py", line 592, in __exit__
    raise _FailedToSwitch('\n'.join(report_lines))
gevent.util._FailedToSwitch: ...
>>> # This will never raise an exception; nothing switched,
... # but it happened very fast
... with assert_switches(max_blocking_time=1.0):
...     pass
...
>>>

How It Works

Credit

This technique did not originate with gevent. I've seen it in this blog post from 2012 by Ryan Kelly and I've seen it in this code from Mozilla. gevent 1.3 just generalized it, made it safe for multiple threads and hubs, and incorporated it into the core of gevent.

The main question we want to be able to answer is: did the hub get to run during a certain time period? Running the hub means switching into its greenlet, so what we really want to know is: was the greenlet hub switched into?

The greenlet library provides a hook well suited for this purpose: greenlet.settrace(). This functions similarly to sys.settrace() that's part of the core interpreter. The later calls a function for every "interesting" interpreter event such as a function call, and the former calls a function for every interesting event involving greenlets.

Since greenlets can really only do two interesting things, 'switch' and 'throw' are the only two defined events—for our purposes they are equivalent as they both involve a transfer of control from one greenlet to another, and both events tell us which greenlet is going to be taking control (the "target").

So to answer our question, we only need to keep track of whether or not the greenlet hub was the target of an event and thus assumes control. For the decorator assert_switches, which has a well-defined scope and only needs to answer that question at the exact time it is exited, that's enough [2], but for the general monitoring function that lets you know if (when) it ever found the hub blocked for longer than a certain period, we need to do a little more. To be able to answer that question meaningfully, we also need to know the timestamp of the last event entering the hub so we can compare it to the current timestamp: if it's been "too long," then we can raise an alert.

Performance

gevent is a performance critical library, so monitoring features that are too perfomance sucking to enable are a non-starter. For that reason, the implementation of these features was carefully optimized and a few different tradeoffs are made. Here are some examples.

Getting the current time (time.time()) is relatively expensive, so we try to avoid doing that too often. For example, instead of checking if we've exceeded the blocking threshold on every greenlet switch, we only do so periodically in a background thread.
We use some heuristics to detect if we switched into the hub instead of absolute timestamps.
The greenlet trace functions are compiled with Cython and strongly typed. This makes the relatively complex work that the blocking monitor trace function needs to do faster than the most trivial trace function implemented in Python.

As always, if you find bugs or can think of enhancements or additions to gevent, please feel free to open an issue or create a pull request!

Footnotes

[1]

I admit I cheated here in the output a little bit. I trimmed it: it detected, and reported, the same blockage several times. It should probably recognize that and just print something about "same blockage". I should fix that. Also, I fixed up the spawning line. Things spawned at the module level do not always get the right traceback.

[2]	As long as you don't use the `max_blocking_time` parameter.