If you have lots of "small" objects in a Python program (objects which have few instance attributes), you may find that the object overhead starts to become considerable. The common wisdom says that to reduce this in CPython you need to re-define the classes to use __slots__, eliminating the attribute dictionary. But this comes with the downsides of limiting flexibility and eliminating the use of class defaults. Would it surprise you to learn that PyPy can significantly, and without any effort by the programmer, reduce that overhead automatically?
Let's take a look.
Contrary to advice, instead of starting at the very beginning, we'll jump right to the end. The following graph shows the peak memory usage of the example program we'll be talking about in this post across seven different Python implementations: PyPy2 v6.0, PyPy3 v6.0, CPython 2.7.15, 3.4.9, 3.5.6, 3.6.6, and 3.7.0 [1].
For regular objects ("Point3D"), PyPy needs less than 700MB to create 10,000,000, where CPython 2.7 needs almost 3.5 GB, and CPython 3.x needs between 1.5 and 2.1 GB [6]. Moving to __slots__ ("Point3DSlot") brings the CPython overhead closer to—but still higher than—that of PyPy. In particular, note that the PyPy memory usage is essentially the same whether or not slots are used.
The third group of data is the same as the second group, except instead of using small integers that should be in the CPython internal integer object cache [7], I used larger numbers that shouldn't be cached. This is just an interesting data point showing the allocation of three times as many objects, and won't be discussed further.
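As a quick aside, the integer cache itself is easy to observe on CPython (my own illustration, typed at the interactive prompt of the versions discussed here):

>>> a = 42
>>> b = 42
>>> a is b          # small ints come from the shared cache
True
>>> a = 100000
>>> b = 100000
>>> a is b          # larger values are generally allocated fresh each time
False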
What Is Being Measured
In the script I used to produce these numbers [2], I'm using the excellent psutil library's Process.memory_full_info to record the "unique set size" ("the memory which is unique to a process and which would be freed if the process was terminated right now") before and then after allocating a large number of objects.
def check(klass, x, y, z):
    before = get_memory().uss
    inst = klass(0, x, y, z)
    print("Size of", type(inst).__name__, sizeof_object(inst))
    del inst

    print("Count     AbsoluteUsage     Delta")
    print("=======================================")
    for count in 100, 1000, 10000, 100000, 1000000, 10000000:
        l = [None] * count
        for i in range(count):
            l[i] = klass(i, x, y, z)
        after = get_memory().uss
        print("%9d" % count,
              format_memory(after - global_starting_memory.uss),
              format_memory(after - before))
        l = None
    print()
This gives us a fairly accurate idea of how much memory the processes needed to allocate from the operating system to be able to create all the objects we asked for. (get_memory is a helper function that runs the garbage collector to be sure we have the most stable numbers.)
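For reference, here is a minimal sketch of what the get_memory and format_memory helpers might look like (my own reconstruction for illustration; the actual helpers live in the gist referenced in [2]):

import gc
import psutil

_process = psutil.Process()

def get_memory():
    # Collect a few times so pending cyclic garbage doesn't skew the numbers.
    for _ in range(3):
        gc.collect()
    # memory_full_info() exposes the "unique set size" as the `uss` field.
    return _process.memory_full_info()

def format_memory(nbytes):
    # Megabytes with two decimal places, matching the tables below.
    return "%10.2f" % (nbytes / (1024.0 * 1024.0))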
What Is Not Being Measured
In this example output from a run of PyPy, the AbsoluteUsage is the total growth from when the program started, while the Delta is the growth just within this function.
Memory for Point3D(1, 2, 3)
    Count AbsoluteUsage      Delta
=======================================
      100          0.02       0.02
     1000          0.03       0.03
    10000          0.51       0.51
   100000          7.20       7.20
  1000000         69.11      69.11
 10000000        691.90     691.90
This was the first of the test runs within this particular process. The second test run within the same process reports higher AbsoluteUsage numbers, since that column measures growth from the very beginning of the program, but its Deltas are smaller. The difference shows how much memory the program has allocated from the operating system but not returned to it, even though it may technically be free from the standpoint of the Python runtime; this accounts for things like internal caches or, in PyPy's case, jitted code.
Memory for Point3DSlot(1, 2, 3)
Size of Point3DSlot -1
    Count AbsoluteUsage      Delta
=======================================
      100         86.09       0.00
     1000         86.12       0.03
    10000         86.56       0.46
   100000         87.33       1.23
  1000000        138.70      52.60
 10000000        692.05     605.95
Although I captured the data, this post is not about the startup or initial memory allocation of the various interpreters, nor about how much can easily be shared between forked processes, nor about how much memory is returned to the operating system while the process is still running. We're only talking about the memory needed to allocate a given number of objects, i.e., the Delta column.
Object Internals
To understand what's happening, let's look at the two types of objects we're comparing:
class Point3D(object):
    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z

class Point3DSlot(object):
    __slots__ = ('x', 'y', 'z')

    def __init__(self, x, y, z):
        self.x = x
        self.y = y
        self.z = z
These are both small classes with three instance attributes. One is a standard, default object, and one specifies its instance attributes in __slots__.
Objects with Dictionaries
Standard objects, like Point3D, have a special attribute __dict__, a normal Python dictionary object that is used to hold all the instance attributes of the object. We previously looked at how __getattribute__ can be used to customize all attribute reads for an object; likewise, __setattr__ can customize all attribute writes. The default __getattribute__ and __setattr__ that a class inherits from object behave roughly as if they were written to access the __dict__:
class Object:

    def __getattribute__(self, name):
        if name in self.__dict__:
            return self.__dict__[name]
        return getattr(type(self), name)

    def __setattr__(self, name, value):
        self.__dict__[name] = value
One advantage of having a __dict__ underlying an object is the flexibility it provides: you don't have to pre-declare your attributes for every object, and any object can have any attribute. This facilitates subclasses adding new attributes, or even other libraries adding new, specialized attributes to implement things like caching of expensive computed properties.
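For example, using the classes defined above (a quick illustration of my own; the slotted behavior is covered in the next section):

p = Point3D(1, 2, 3)
p.color = 'red'            # fine: lands in p.__dict__

ps = Point3DSlot(1, 2, 3)
try:
    ps.color = 'red'       # no 'color' slot and no __dict__ to fall back on
except AttributeError as e:
    print(e)               # 'Point3DSlot' object has no attribute 'color'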
One disadvantage is that a __dict__ is a generic Python dictionary, not specialized at all [3], and as such it has overhead.
On CPython, we can ask the interpreter how much memory any given object uses with sys.getsizeof. On my machine under a 64-bit CPython 2.7.15, a bare object takes 16 bytes, while a trivial subclass takes a full 64 bytes (due to the overhead of being tracked by the garbage collector):
>>> import sys
>>> sys.getsizeof(object())
16
>>> class TrivialSubclass(object):
...     pass
...
>>> sys.getsizeof(TrivialSubclass())
64
An empty dict occupies 280 bytes:
>>> sys.getsizeof({})
280
And so when you combine the size of the trivial subclass with the size of its __dict__, you arrive at a minimum object size of 344 bytes:
>>> sys.getsizeof(TrivialSubclass().__dict__)
280
>>> sys.getsizeof(TrivialSubclass()) + sys.getsizeof(TrivialSubclass().__dict__)
344
A fully occupied Point3D object is also 344 bytes:
>>> pd = Point3D(1, 2, 3)
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
344
Because of the way dictionaries are implemented [8], there's always a little spare room for extra attributes. We don't find a jump in size until we've added three more attributes:
>>> pd.a = 1
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
344
>>> pd.b = 1
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
344
>>> pd.c = 1
>>> sys.getsizeof(pd) + sys.getsizeof(pd.__dict__)
1112
Note
These values can change quite a bit across Python versions, typically improving over time. In CPython 3.4 and 3.5, getsizeof({}) returns 288, while it returns 240 in both 3.6 and 3.7. In addition, getsizeof(pd.__dict__) returns 96 or 112, depending on the version [4]. The answer to getsizeof(pd) is 56 in all four versions.
Objects With Slots
Objects with a __slots__ declaration, like Point3DSlot, do not have a __dict__ by default. The documentation notes that this can be a space savings. Indeed, on CPython 2.7, a Point3DSlot has a size of only 72 bytes, only one full pointer larger than a trivial subclass (when we do not factor in the __dict__):
>>> pds = Point3DSlot(1, 2, 3)
>>> sys.getsizeof(pds)
72
If they don't have an instance dictionary, where do they store their attributes? And why, if Point3DSlot has three defined attributes, is it only one pointer larger than Point3D?
Slots, like @property, @classmethod and @staticmethod, are implemented using descriptors. For our purposes, descriptors are a way to extend the workings of __getattribute__ and friends. A descriptor is an object whose type implements a __get__ method; when such an object is found in a type's dictionary, its __get__ method is called instead of checking the instance's __dict__. Something like this [5]:
class Object:

    def __getattribute__(self, name):
        if (name in dir(type(self))
                and hasattr(getattr(type(self), name), '__get__')):
            return getattr(type(self), name).__get__(self, type(self))
        if name in self.__dict__:
            return self.__dict__[name]
        return getattr(type(self), name)

    def __setattr__(self, name, value):
        if (name in dir(type(self))
                and hasattr(getattr(type(self), name), '__set__')):
            getattr(type(self), name).__set__(self, value)
            return
        self.__dict__[name] = value
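To make the protocol a little more tangible, here is a small hand-written data descriptor (my own toy example; the real slot descriptors are implemented in C):

class UpperCase(object):
    """A toy descriptor: stores a string on the instance, returns it upper-cased."""

    def __get__(self, obj, objtype=None):
        if obj is None:
            # Accessed on the class itself (e.g. Shouty.word): return the descriptor.
            return self
        return obj._value.upper()

    def __set__(self, obj, value):
        obj._value = value

class Shouty(object):
    word = UpperCase()   # lives in Shouty.__dict__, so it intercepts attribute access

shouter = Shouty()
shouter.word = 'hello'   # goes through UpperCase.__set__
print(shouter.word)      # goes through UpperCase.__get__ and prints 'HELLO'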
When the class statement (indeed, when the type metaclass) finds __slots__ in the class body (the class dictionary), it takes special steps. Most importantly, it creates a descriptor for each mentioned slot and places it in the class's __dict__. So our Point3DSlot class gets three such descriptors:
>>> dict(Point3DSlot.__dict__)
{'__doc__': None,
 '__init__': <function __main__.__init__>,
 '__module__': '__main__',
 '__slots__': ('x', 'y', 'z'),
 'x': <member 'x' of 'Point3DSlot' objects>,
 'y': <member 'y' of 'Point3DSlot' objects>,
 'z': <member 'z' of 'Point3DSlot' objects>}
>>> pds.x
1
>>> Point3DSlot.x.__get__
<method-wrapper '__get__' of member_descriptor object at 0x10b6fc2d8>
>>> Point3DSlot.x.__get__(pds, Point3DSlot)
1
Variable Storage
We've established how we can access these magic, hidden slotted attributes (through the descriptor protocol). (We've also established why we can't have defaults for slotted attributes in the class.) But we still haven't found out where they are stored. If they're not in a dictionary, where are they?
The answer is that they're stored directly in the object itself. Every type has a member called tp_basicsize, exposed to Python as __basicsize__. When the interpreter allocates an object, it allocates __basicsize__ bytes for it (every object has a minimum basic size, the size of object). The type metaclass arranges for __basicsize__ to be big enough to hold (a pointer to) each of the slotted attributes, which are kept in memory immediately after the data for the basic object. The descriptor for each attribute, then, just does some pointer arithmetic off of self to read and write the value. In a way, it's very similar to how collections.namedtuple works, except using pointers instead of indices.
That may be hard to follow, so here's an example.
The basic size of object exactly matches the reported size of its instances:
>>> object.__basicsize__
16
>>> sys.getsizeof(object())
16
We get the same when we create an object that cannot have any instance variables, and hence does not need to be tracked by the garbage collector:
>>> class NoSlots(object):
...     __slots__ = ()
...
>>> NoSlots.__basicsize__
16
>>> sys.getsizeof(NoSlots())
16
When we add one slot to a class, its basic size increases by one pointer (8 bytes), and because such an object can now hold a reference to another object, it needs to be tracked by the garbage collector, so getsizeof reports some extra overhead:
>>> class OneSlot(object):
...     __slots__ = ('a',)
...
>>> OneSlot.__basicsize__
24
>>> sys.getsizeof(OneSlot())
56
The basic size for an object with 3 slots is 16 (the size of object) + 3 pointers, or 40. What's the basic size for an object that has a __dict__?
>>> Point3DSlot.__basicsize__
40
>>> Point3D.__basicsize__
32
Hmm, it's 16 + 2 pointers. What could those two pointers be? Documentation to the rescue:
__slots__ allow us to explicitly declare data members (like properties) and deny the creation of __dict__ and __weakref__ (unless explicitly declared in __slots__...)
So those two pointers are for __dict__ and __weakref__, things that standard objects get automatically, but which we have to opt-in to if we want them with __slots__. Thus, an object with three slots is one pointer size bigger than a standard object.
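We can check that arithmetic by declaring them explicitly (my own example, on the same 64-bit CPython; the class name is hypothetical). With three slots plus __dict__ and __weakref__, the basic size should be 16 + 5 pointers:

>>> class Point3DSlotPlus(object):
...     __slots__ = ('x', 'y', 'z', '__dict__', '__weakref__')
...
>>> Point3DSlotPlus.__basicsize__
56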
How PyPy Does Better
By now we should understand why the memory usage dropped significantly when we added __slots__ to our objects on CPython (although that comes with a cost). That leaves the question: how does PyPy get such good memory performance with a __dict__ that __slots__ doesn't even matter?
Earlier I wrote that the __dict__ of an instance is just a standard dictionary, not specialized at all. That's basically true on CPython, but it's not at all true on PyPy. PyPy basically fakes __dict__ by using __slots__ for all objects.
A given set of attributes (such as our "x", "y", "z" attributes for Point3DSlot) is called a "map". Each instance refers to its map, which tells PyPy how to efficiently access a given attribute. When an attribute is added or deleted, a new map is created (or re-used from an existing object; objects of completely unrelated types, but with common attributes, can share the same maps) and assigned to the object, re-arranging things as needed. It's as if __slots__ were assigned to each instance, with descriptors added and removed for the instance on the fly.
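To make the idea concrete, here is a highly simplified sketch of the technique in pure Python (my own illustration, not PyPy's actual implementation, which also specializes attribute lookups in the JIT):

class Map(object):
    """A shared description of a fixed set of attribute names."""
    _cache = {}

    def __init__(self, attrs):
        self.attrs = attrs                                       # tuple of attribute names
        self.index = {name: i for i, name in enumerate(attrs)}   # name -> storage position

    @classmethod
    def for_attrs(cls, attrs):
        attrs = tuple(attrs)
        # Unrelated objects with the same attribute names share a single map.
        if attrs not in cls._cache:
            cls._cache[attrs] = cls(attrs)
        return cls._cache[attrs]

class MappedInstance(object):
    __slots__ = ('map', 'storage')

    def __init__(self):
        self.map = Map.for_attrs(())
        self.storage = []

    def set_attr(self, name, value):
        if name in self.map.index:
            self.storage[self.map.index[name]] = value
        else:
            # A new attribute name switches the instance to a (possibly shared) bigger map.
            self.map = Map.for_attrs(self.map.attrs + (name,))
            self.storage.append(value)

    def get_attr(self, name):
        return self.storage[self.map.index[name]]

p = MappedInstance()
p.set_attr('x', 1); p.set_attr('y', 2); p.set_attr('z', 3)
q = MappedInstance()
q.set_attr('x', 4); q.set_attr('y', 5); q.set_attr('z', 6)
print(p.get_attr('y'))    # 2
print(p.map is q.map)     # True: both instances share the ('x', 'y', 'z') map

The per-instance cost is just the compact storage list plus a pointer to the shared map, which is why adding __slots__ buys PyPy essentially nothing extra.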
If the program ever directly accesses an instance's __dict__, PyPy creates a thin wrapper object that operates on the object's map.
So for a program that has many similar-looking objects, even if they're unrelated, PyPy's approach can save a lot of memory. On the other hand, if the program creates objects with very diverse sets of attributes, and frequently accesses their __dict__ directly, it's theoretically possible that PyPy could use more memory than CPython.
You can read more about this approach in this PyPy blog post.
Footnotes
[1] All 64-bit builds, all tested on macOS. The results on Linux were very similar.
[2] Available at this gist.
[3] In CPython. But I'm getting ahead of myself.
[4] The CPython dict implementation was completely overhauled in CPython 3.6. And based on the sizes of {} versus pd.__dict__, we can see some sort of specialization for instance dictionaries, at least in terms of their fill factor.
[5] This is very rough, and actually inaccurate in some small but important details. Refer to the documentation for the full protocol.
[6] No, I'm not totally sure why Python 3.7 is such an outlier and uses more memory than the other Python 3.x versions.
[7] See PyLong_FromLong.
[8] With a particular desired load factor.