How often does Python allocate?

Original link: https://zackoverflow.dev/writing/how-often-does-python-allocate/

## Python Integer Allocation: A Deep Dive

A recent investigation explored how often Python allocates memory for integers. The initial surprise was discovering that CPython represents integers as heap-allocated PyLongObject* structures, which raises performance concerns: frequent heap allocation could significantly slow down even basic arithmetic. Testing showed that printing integers triggers a large number of allocations (roughly 100k over a 100k-iteration loop), mostly because print() allocates a temporary object. Merely adding numbers in a loop, however, caused far fewer allocations (about 900), suggesting optimizations are already in place. Further analysis showed that Python reuses many PyLongObject instances via a freelist, minimizing new allocations. Small integers (-5 to 1025) are served from a pre-allocated table. Larger integers go through a pool allocator that divides memory into fixed-size blocks for faster allocation/deallocation and reduced fragmentation. The pools are backed by arenas (1 MiB or 256 KiB) that use mmap() so physical memory is committed on demand. Despite these optimizations, the author notes room for improvement, in particular the absence of the tagged-pointer optimization common in other dynamic languages, which would eliminate heap allocation for small integers entirely. Ultimately, while Python allocates memory "very often," it employs several strategies to mitigate the performance impact.


The answer is "very often"


I recently saw this tweet which I thought was funny:

[Image: tweet about Python allocating]

How true is it?

I was surprised to learn that CPython represents each integer as a heap-allocated PyLongObject*.

Does that mean Python heap-allocates every integer you create? Integers are likely the most-used data type in any program, so that would mean a lot of heap allocations.

Beyond the allocations themselves, this could make basic arithmetic operations like adding two numbers 200-500x slower.

Or maybe it uses the well-known, industry-standard pointer tagging technique to avoid allocating small integers?

To find out, I modified CPython to print whenever it allocates a PyLongObject*:

static PyLongObject *
long_alloc(Py_ssize_t size)
{
    assert(size >= 0);
    PyLongObject *result = NULL;

    /* ... */

    Py_ssize_t ndigits = size ? size : 1;

    if (ndigits == 1) {
        result = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    }
    
    if (result == NULL) {
        printf("ALLOCATING a new number object\n");
        
        result = PyObject_Malloc(offsetof(PyLongObject, long_value.ob_digit) +
                                ndigits*sizeof(digit));
        if (!result) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)result, &PyLong_Type);
    }

    /* ...omitted code... */
    return result;
}

And wrote a script to add some numbers 100k times:

# lol.py
for i in range(0, 100_000):
  print(i + 1)

And then counted the number of times it allocated:

echo "Allocating number object $(
    ./python.exe lol.py \
    | rg -o ALLOCATING \
    | wc -l
) times"
Allocating number object 100904 times

Huh, that seems like a lot!

It looks like it could be allocating once per iteration of the loop.

Let’s take out the print statement and see if it’s just the addition:

# lol.py
for i in range(0, 100_000):
  a = i + 1

echo "Allocating number object $(
    ./python.exe lol.py \
    | rg -o ALLOCATING \
    | wc -l
) times"

Allocating number object 905 times

That’s almost 100k fewer, so it does look like the printing was responsible, and CPython has some optimization in place to avoid allocations for the addition itself.

Let’s look at the internal function to add two integers, long_add(a, b).

Just by looking at its signature, the return type is PyLongObject*, which could mean that it’s allocating a new object every time you add two integers:

static PyLongObject *
long_add(PyLongObject *a, PyLongObject *b)
{
    // An integer is "compact" if it is <= 2^30
    if (_PyLong_BothAreCompact(a, b)) {
        // Unwrap the values of `a` and `b` and add them
        stwodigits z = medium_value(a) + medium_value(b);
        return _PyLong_FromSTwoDigits(z);
    }
    
    /* ... more code ... */
}

The _PyLong_BothAreCompact(...) function checks a field on PyLongObject to see whether both values are “compact” (the integer fits in a single 30-bit digit), which will always be the case for our script.

The code appears to be unwrapping the integer values for a and b, which are represented inside of PyLongObject structs, and adding them together.

The reason the result of adding them together has the type stwodigits is that PyLongObject stores the number in base 2^30: essentially, each digit is a 30-bit integer.
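To make that concrete, here’s a minimal sketch (my own illustration, not CPython code; PYLONG_SHIFT mirrors CPython’s 30-bit digit size) of how a value too big for one digit decomposes into base-2^30 digits:

#include <stdio.h>
#include <stdint.h>

#define PYLONG_SHIFT 30                          /* bits per digit, as in CPython */
#define PYLONG_MASK  ((1u << PYLONG_SHIFT) - 1)  /* low 30 bits */

int main(void) {
    uint64_t x = (1ull << 40) + 7;   /* needs two 30-bit digits */
    /* Peel off base-2^30 digits, least significant first. */
    while (x) {
        printf("digit: %u\n", (uint32_t)(x & PYLONG_MASK));
        x >>= PYLONG_SHIFT;
    }
    return 0;   /* prints 7, then 1024 (since 2^40 + 7 = 1024 * 2^30 + 7) */
}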

Let’s look at _PyLong_FromSTwoDigits:

/* Create a new int object from a C word-sized int */
static inline PyLongObject *
_PyLong_FromSTwoDigits(stwodigits x)
{
    if (IS_SMALL_INT(x)) {
        return (PyLongObject*)get_small_int((sdigit)x);
    }
    assert(x != 0);
    /* "medium" means x still fits in a single 30-bit digit */
    if (is_medium_int(x)) {
        return (PyLongObject*)_PyLong_FromMedium((sdigit)x);
    }
    return (PyLongObject*)_PyLong_FromLarge(x);
}

There are three separate code paths:

  1. when x is a “small int”: call get_small_int(x).

  2. when x is a “medium int”: call _PyLong_FromMedium(x).

  3. when x is a “large int”: call _PyLong_FromLarge(x).

The call to get_small_int() is catching my eye. Is it like the tagged pointer trick I mentioned earlier?

static PyObject *
get_small_int(sdigit ival)
{
    assert(IS_SMALL_INT(ival));
    return (PyObject *)&_PyLong_SMALL_INTS[_PY_NSMALLNEGINTS + ival];
}

Nope, it seems there is a pre-allocated table of objects for integers in the range -5 to 1025. That would account for about 1025 iterations of our loop, but not the rest.
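You can observe the sharing through the C API. Here’s a minimal sketch (an embedding program I wrote for illustration): two independently created small ints come back as the same object:

#include <Python.h>

int main(void) {
    Py_Initialize();
    PyObject *a = PyLong_FromLong(42);
    PyObject *b = PyLong_FromLong(42);
    /* Both point into the pre-allocated small-int table,
       so no heap allocation happens and the pointers match. */
    printf("same object: %s\n", a == b ? "yes" : "no");  /* yes */
    Py_DECREF(a);
    Py_DECREF(b);
    Py_FinalizeEx();
    return 0;
}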

Let’s look at the medium case (x fits in a single 30-bit digit but isn’t one of the cached small ints) and the _PyLong_FromMedium() function:

static PyObject *
_PyLong_FromMedium(sdigit x)
{
    assert(!IS_SMALL_INT(x));
    assert(is_medium_int(x));

    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v == NULL) {
        v = PyObject_Malloc(sizeof(PyLongObject));
        if (v == NULL) {
            PyErr_NoMemory();
            return NULL;
        }
        _PyObject_Init((PyObject*)v, &PyLong_Type);
    }
    digit abs_x = x < 0 ? -x : x;
    _PyLong_SetSignAndDigitCount(v, x<0?-1:1, 1);
    v->long_value.ob_digit[0] = abs_x;
    return (PyObject*)v;
}

Okay, it’s not using the long_alloc() function from earlier; instead it tries to grab a PyLongObject* from a freelist (using _Py_FREELIST_POP()), and only allocates fresh memory if that fails.

A “freelist” is a common technique used in memory allocation strategies to keep track of freed allocations so they can be reused later, avoiding allocating more memory if possible.
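Here’s what the idea looks like in miniature (my own sketch, not CPython’s freelist): freed objects are threaded into a singly linked list, and the next allocation pops from it before falling back to malloc():

#include <stdio.h>
#include <stdlib.h>

/* Each freed object doubles as a list node while it sits on the freelist. */
typedef struct Node { struct Node *next; } Node;

#define OBJ_SIZE 64              /* fixed object size, >= sizeof(Node) */

static Node *freelist = NULL;

static void *obj_alloc(void) {
    if (freelist != NULL) {      /* reuse a previously freed object */
        Node *n = freelist;
        freelist = n->next;
        return n;
    }
    return malloc(OBJ_SIZE);     /* freelist empty: really allocate */
}

static void obj_free(void *p) {
    Node *n = p;                 /* push onto the freelist; no free() */
    n->next = freelist;
    freelist = n;
}

int main(void) {
    void *a = obj_alloc();       /* malloc */
    obj_free(a);                 /* goes on the freelist */
    void *b = obj_alloc();       /* pops a's memory back off */
    printf("reused: %s\n", a == b ? "yes" : "no");  /* yes */
    free(b);
    return 0;
}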

If we look at the long_dealloc() function, there’s a codepath specifically for pushing the allocation onto the freelist instead of calling free on it, so it can be reused:

static void
long_dealloc(PyObject *self)
{
    /* ... */
       
    if (PyLong_CheckExact(self) && _PyLong_IsCompact((PyLongObject *)self)) {
        _Py_FREELIST_FREE(ints, self, PyObject_Free);
        return;
    }
    Py_TYPE(self)->tp_free(self);
}

I deleted the printf() I added to long_alloc() earlier, and added two print statements to _PyLong_FromMedium() to report whether it allocated or reused memory.
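The instrumentation looked roughly like this (a sketch; the exact placement and wording of the prints are mine):

static PyObject *
_PyLong_FromMedium(sdigit x)
{
    /* ... */
    PyLongObject *v = (PyLongObject *)_Py_FREELIST_POP(PyLongObject, ints);
    if (v != NULL) {
        printf("REUSING\n");     /* got an object back from the freelist */
    }
    else {
        printf("ALLOCATING\n");  /* freelist was empty: fresh allocation */
        v = PyObject_Malloc(sizeof(PyLongObject));
        /* ... */
    }
    /* ... */
}

Running the script again and tallying the output: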

> ./python.exe lol.py | rg 'ALLOCATING|REUSING' | sort | uniq -c
  102 ALLOCATING
99193 REUSING

Our script appears to actually be reusing most of the PyLongObject objects!

So where did the allocation come from when our script was print()’ing the integers? It turns out the implementation of print() allocates a scratch PyLongObject* for conversion purposes, even though this could be avoided in most cases.

It’s nice to see that simply adding two numbers in a loop does not allocate memory every time.

We just saw an optimization to reduce the frequency of allocations (by reusing them), but I also want to poke around and see if Python is doing anything to amortize the actual cost of an allocation.

Here’s the internal allocation function used for objects:

static inline void*
pymalloc_alloc(OMState *state, void *Py_UNUSED(ctx), size_t nbytes)
{
    if (UNLIKELY(nbytes == 0)) {
        return NULL;
    }
    if (UNLIKELY(nbytes > SMALL_REQUEST_THRESHOLD)) {
        return NULL;
    }

    uint size = (uint)(nbytes - 1) >> ALIGNMENT_SHIFT;
    poolp pool = usedpools[size + size];
    pymem_block *bp;

    if (LIKELY(pool != pool->nextpool)) {
        /*
         * There is a used pool for this size class.
         * Pick up the head block of its free list.
         */
        ++pool->ref.count;
        bp = pool->freeblock;
        assert(bp != NULL);

        if (UNLIKELY((pool->freeblock = *(pymem_block **)bp) == NULL)) {
            // Reached the end of the free list, try to extend it.
            pymalloc_pool_extend(pool, size);
        }
    }
    else {
        /* There isn't a pool of the right size class immediately
         * available:  use a free pool.
         */
        bp = allocate_from_new_pool(state, size);
    }

    return (void *)bp;
}

It looks like the memory allocator for objects is a pool allocator with multiple different pools based on the size of the object.

The pool allocator splits its backing buffer into fixed-size chunks and is typically implemented with a freelist. This comes with some pretty nice advantages.

First, because everything is fixed in size, allocating and freeing become very simple (and thus fast!), especially since the allocator doesn’t need any of the complicated bookkeeping malloc(...) must do to support arbitrary allocation sizes.

There’s also less fragmentation (fragmentation arises from variable allocation sizes), which means freed memory can be reused optimally since the pool never fragments.
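To illustrate (again my own sketch, not pymalloc itself): a pool for one size class carves its buffer into blocks once, threads them into a freelist, and then both alloc and free are O(1) pointer swaps:

#include <stdio.h>

#define BLOCK_SIZE  32     /* one size class */
#define POOL_BLOCKS 128

typedef struct Block { struct Block *next; } Block;

typedef struct {
    unsigned char buf[BLOCK_SIZE * POOL_BLOCKS];
    Block *free;           /* freelist threaded through buf */
} Pool;

static void pool_init(Pool *p) {
    p->free = NULL;
    for (int i = 0; i < POOL_BLOCKS; i++) {   /* carve buf into fixed blocks */
        Block *b = (Block *)&p->buf[i * BLOCK_SIZE];
        b->next = p->free;
        p->free = b;
    }
}

static void *pool_alloc(Pool *p) {   /* O(1): pop the freelist head */
    Block *b = p->free;
    if (b != NULL) p->free = b->next;
    return b;
}

static void pool_free(Pool *p, void *ptr) {  /* O(1): push it back */
    Block *b = ptr;
    b->next = p->free;
    p->free = b;
}

int main(void) {
    Pool pool;
    pool_init(&pool);
    void *a = pool_alloc(&pool);
    pool_free(&pool, a);
    void *b = pool_alloc(&pool);     /* same block comes straight back */
    printf("reused: %s\n", a == b ? "yes" : "no");
    return 0;
}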

As with an arena, you pay most of the cost upfront by pre-allocating a backing buffer at initialization. In fact, in CPython’s case, the pool’s backing buffer is itself allocated from an arena whose capacity is 1 MiB or 256 KiB:

#ifdef USE_LARGE_ARENAS
    #define ARENA_BITS              20                    /* 1 MiB */
#else
    #define ARENA_BITS              18                    /* 256 KiB */
#endif

#define ARENA_SIZE              (1 << ARENA_BITS)
#define ARENA_SIZE_MASK         (ARENA_SIZE - 1)

Also, CPython uses mmap() with MAP_ANONYMOUS, so it reserves 1 MiB of virtual memory upfront and physical memory is only allocated on demand, via page faults.
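Here’s the trick in isolation (a Unix sketch, not CPython’s arena code): mmap() hands back 1 MiB of address space immediately, but the kernel only commits physical pages when they’re first touched:

#include <stdio.h>
#include <sys/mman.h>

#define ARENA_SIZE (1 << 20)   /* 1 MiB, matching ARENA_BITS = 20 */

int main(void) {
    /* Reserve 1 MiB of virtual address space; no physical pages yet. */
    void *arena = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (arena == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    /* The first write to a page faults, and the kernel backs that
       page with physical memory on demand. */
    ((unsigned char *)arena)[0] = 42;
    munmap(arena, ARENA_SIZE);
    return 0;
}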

With all these tricks, our script (the one that prints, and therefore does allocate in the loop) is likely reusing memory from the pool allocator more often than it’s actually allocating new memory, at least for PyLongObject integers.

So to finally answer the title question “How often does Python allocate?”, the answer is what everyone expected: very often. There are some good optimizations, but it could definitely do better :)

Despite the optimizations, there’s obviously still a lot of overhead in executing all this allocation code, when in theory all the CPU needs to add two integers is a single ADD instruction.

In addition, the representation of PyLongObject is designed to handle small and really big integers alike, meaning the code for the slow, unlikely path (using a really big integer in your Python script) pessimizes the fast path (using regularly sized integers).

To be honest, I’m a little surprised Python is missing some kind of tagged-pointer optimization here, which would let it avoid heap allocations and directly encode the value of some integers in the pointer itself. I consider this a very well-known optimization that many interpreters of dynamically typed languages use (e.g. V8, JSC, LuaJIT), and it has been around since Smalltalk in the 80s!
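For the curious, here’s the core of the trick (my own minimal sketch, not any particular interpreter’s scheme): steal the low bit of a pointer-sized word as a tag, which is free because real object pointers are aligned and always have a zero low bit:

#include <stdio.h>
#include <stdint.h>

typedef uintptr_t Value;   /* either a tagged small int or an object pointer */

/* Low bit 1 = small int; low bit 0 = aligned heap pointer. */
static Value    make_int(intptr_t i)  { return ((uintptr_t)i << 1) | 1; }
static int      is_int(Value v)       { return (int)(v & 1); }
static intptr_t int_value(Value v)    { return (intptr_t)v >> 1; }  /* arithmetic shift assumed */

int main(void) {
    Value a = make_int(20);
    Value b = make_int(22);
    /* Adding two tagged ints never touches the heap. */
    Value sum = make_int(int_value(a) + int_value(b));
    if (is_int(sum)) {
        printf("%ld\n", (long)int_value(sum));   /* prints 42 */
    }
    return 0;
}

V8’s Smis and LuaJIT’s NaN-boxed values are variations on this same idea.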

But we do see a good example of how using specialized memory allocators lets us greatly reduce some of the costs of memory allocation.

This is why Zig is one of my favorite languages. The answer to the question “I wonder if this is allocating?” is “does this function take an allocator?”. And the std.mem.Allocator interface lets you easily swap out allocators; the entire standard library is designed to accept a std.mem.Allocator whenever it needs to allocate memory.
