How does the 'in' keyword work in Python?
A few days ago I played a bit with a naive implementation of Bloom filters in Python. I wanted to time them against just checking whether a field is in a set/collection. I found something slightly puzzling: it looked like `in` worked too fast for smaller lists. And I wondered: maybe small lists are special internally and allow for really fast lookups? Maybe they have some internal index? This raised the question: how does `in` find stuff in sequences?
Instead of reading the documentation I delved into the CPython source. Skipping the documentation was a bad idea, since it is not only pretty comprehensive but explains everything… if you know what you are looking for. But a dive into this big pile of C was more fun. I was also stupid enough not to Google the answer until I was editing this post… While writing it up, I found out that this excellent Stack Overflow answer covers more or less what I explain here.
The `in` keyword (defined here) is actually a shorthand for running `__contains__` on the target object; you probably know this already. Your class will be able to support `in` if you add this magic method; you can read about this in the first chapter of Fluent Python. But tracing this inside CPython is a bit cumbersome and got me diving into interesting pieces of the code. By the way, how the Python compiler works is documented here.
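To see the shorthand from the Python side before we dive into C, here is a tiny made-up class that only defines `__contains__`; the `in` operator simply defers to it:

```python
class Inventory:
    # Toy example: membership is delegated entirely to __contains__
    def __init__(self, items):
        self._items = set(items)

    def __contains__(self, item):
        print("__contains__ called")
        return item in self._items

bag = Inventory(["sword", "shield"])
print("sword" in bag)   # prints "__contains__ called", then True
print("bow" in bag)     # prints "__contains__ called", then False
```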
First, after we have parsed the code and generated a parse tree from our text, we move on to the abstract syntax tree. The string `in`, appearing as part of a comparison node of the Python source, gets converted in this tree into an `In` node here:
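You can peek at that `In` node without touching any C, using the `ast` module from the standard library; this is just a Python-level illustration of what the AST step produces:

```python
import ast

# "x in y" parses to a Compare node whose operator list holds a single In()
tree = ast.parse("x in y", mode="eval")
print(ast.dump(tree.body))
# Something like: Compare(left=Name(id='x', ...), ops=[In()], comparators=[Name(id='y', ...)])
```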
And now, what does `in` actually do? Well, we need to move forward in the compilation chain and check compile.c. The relevant bit is inside a function called `cmpop`, which maps the comparison to the argument of the COMPARE_OP opcode. This is the opcode we’d see by disassembling anything running `in` or `==` or any other comparison operator (all comparison operators are in the same basket, see here).
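A quick way to see this without reading compile.c is to disassemble a tiny function. Note that the exact opcode depends on the interpreter version: newer CPythons split membership tests into a dedicated CONTAINS_OP, while older ones use COMPARE_OP with an `in` argument.

```python
import dis

def check(x, items):
    return x in items

dis.dis(check)
# On CPython 3.8 and earlier: COMPARE_OP 6 (in)
# On CPython 3.9 and later:   CONTAINS_OP 0
```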
We can follow the route through ceval.c now:
So, we are calling this `PySequence_Contains` thingy. A bit more grep and we can find it defined in abstract.c:
And now we can see what it does:
- Get a pointer to the sequence’s base sequence methods in the C struct slot `tp_as_sequence`
- If there is a `sq_contains` method pointer there, invoke it and return
- Otherwise, use iterative search
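In rough Python pseudocode (this is my paraphrase of the C logic, not the actual CPython source), the whole function boils down to something like this:

```python
def sequence_contains(seq, value):
    # Python-level sketch of PySequence_Contains, names simplified.
    # If the type provides a sq_contains slot (i.e. __contains__), use it.
    contains = getattr(type(seq), "__contains__", None)
    if contains is not None:
        return contains(seq, value)
    # Otherwise fall back to iterative search, roughly what
    # PySequence_IterSearch does with PY_ITERSEARCH_CONTAINS.
    for item in seq:
        if item is value or item == value:
            return True
    return False
```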
Hey, where is my `__contains__`? The magic happens on new object/class/type definition, in typeobject.c:
The base object all classes extend from looks like the following struct (from object.h):
This means that, technically, all objects have a field in their defining struct for sequence methods: `tp_as_sequence`. This is populated when we define a new class (which internally is known as a type) in typeobject.c.
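For types that never fill in the sequence slots and are not iterable either, the whole chain simply bottoms out; you can see it from the REPL:

```python
# An int provides number methods but no sequence methods and no __iter__,
# so membership has nothing to dispatch to and raises TypeError.
try:
    3 in 5
except TypeError as exc:
    print(exc)
```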
Slots are populated from what is essentially a dictionary of methods by invoking fixup_slot_dispatchers. This maps the Python name `__contains__` to the corresponding slot in the struct, `sq_contains`, and defines which function sets it up, slot_sq_contains:
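One observable consequence of this slot wiring (and of the fact that CPython re-runs the fixups when you assign a special method on a class) is that you can bolt `__contains__` onto a class after it has been created and `in` starts working immediately. A small made-up example:

```python
class Late:
    pass

obj = Late()
try:
    1 in obj
except TypeError:
    print("not a container (yet)")

# Assigning the special method on the class updates the sq_contains slot,
# so existing instances pick it up too.
Late.__contains__ = lambda self, item: item == 1
print(1 in obj)  # True
```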
Built-in objects (and likely libraries with C extensions) implement these directly in C, and point their slot to the C method:
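You can tell the two cases apart from Python: for a built-in like list, `__contains__` shows up as a wrapper around the C slot, while for a pure-Python class it is an ordinary function (the exact repr text may differ between versions):

```python
class Mine:
    def __contains__(self, item):
        return False

print(type(list.__contains__))   # a slot/wrapper descriptor around the C sq_contains
print(type(Mine.__contains__))   # a plain Python function
```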
Finally, this method looks for a method defined in the class and called `__contains__`. If it is None (that is, it is defined and is None), the object is not a container, and that’s it. If it is not defined at all (hence the null; this one is actually puzzling… I think it reduces to the slot not being provided when defining the class), Python falls back to iterating to search, using `__iter__` (which is what eventually gets called under PySequence_IterSearch). If this is also not valid or available, an error is raised, following a chain of -1 return values in the method lookups.
If you have been paying attention you’ll have noticed that we actually defer to iterative search in two places: when defining the slot in `sq_contains`, but also when invoking `PySequence_Contains`. I’m not 100% sure why this is the case, and experimenting with the REPL does not get you very far, since you can never be sure whether you are hitting PySequence_Contains -> PY_ITERSEARCH_CONTAINS or PySequence_Contains -> sq_contains -> PY_ITERSEARCH_CONTAINS without changing the messages (and I don’t feel like recompiling CPython). Weirdly, the second case should be faster, since it goes straight for the method via the slot without needing an extra method lookup.
As expected, dictionary lookup is fancier. It is a common and well-known performance improvement in Python to swap lists or other sequence-like data types for dictionaries, since they show the best performance for most operations. Since internally in Python everything from methods/functions to classes is implemented in one way or another as dictionaries (or reuses the machinery built for dictionaries), anything that speeds up dictionaries speeds up Python code as a whole. Of course, dictionary lookup is usually fast no matter the language: hash table lookup in general (leaving aside how collision resolution might be implemented in the lookup) is O(1) fast. Note that below we have the macro: `#define PyUnicode_CheckExact(op) (Py_TYPE(op) == &PyUnicode_Type)`
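This also closes the loop on the timing experiment from the start of the post: membership tests on lists scan linearly, while sets and dicts hash. A rough, unscientific micro-benchmark (the numbers will obviously vary per machine):

```python
import timeit

setup = "items = list(range(10_000)); s = set(items); d = dict.fromkeys(items)"
# Worst case for the list: the element we look for is at the very end.
print(timeit.timeit("9999 in items", setup=setup, number=1000))  # linear scan
print(timeit.timeit("9999 in s", setup=setup, number=1000))      # hash lookup
print(timeit.timeit("9999 in d", setup=setup, number=1000))      # hash lookup
```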
There is even a specialised method for cases where the hash of the key is already known (I’m not sure of the use case; since magic methods are hardcoded in the object structs, maybe it’s used to optimise tighter loops?).
And if you wonder how `key in dict` works, it is of course by plugging the method into the struct of sequence methods. The snippet below is from dictobject.c, like the methods above.
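From the user’s point of view, the important detail is that dict membership goes through keys, not values; a one-liner reminder:

```python
d = {"a": 1}
print("a" in d)            # True: membership checks keys
print(1 in d)              # False: values are not searched
print(1 in d.values())     # True: use the values view for that
```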
And here ends my exploration of CPython to figure out how `in` contains. Not sure about you, but I had a lot of fun.