Prospero challenge, now with more garbage collection

Original link: https://bernsteinbear.com/blog/prospero/

Matt Keeter's Prospero Challenge is about optimizing a slow Python/NumPy program. The original code takes 40 seconds to run and consumes far too much memory. The first improvement implements liveness analysis and garbage collection: by determining when intermediate variables are no longer needed and deleting them, the runtime drops to 10 seconds and memory use falls sharply. The second improvement uses GPU acceleration: replacing NumPy with CuPy, a NumPy substitute that runs on the GPU, cuts the runtime to 1.5 seconds. The GPU-accelerated version can handle larger image sizes, but eventually runs out of memory at 4096x4096. Further optimization, possibly via meta-JIT techniques, could potentially yield even faster performance.


Original article

Matt Keeter put up The Prospero Challenge, which is like catnip for me. It’s a well-scoped project: we have a slow program. Make it faster within these constraints. In this post, I will describe two very small changes that can speed up his sample program with minimal effort.

His sample tiny implementation, which I will reproduce here, uses Python and NumPy to compute pixels in parallel while parsing the input:

import numpy as np

with open('prospero.vm') as f:
    text = f.read().strip()

image_size = 1024
space = np.linspace(-1, 1, image_size)
(x, y) = np.meshgrid(space, -space)
v = {}

for line in text.split('\n'):
    if line.startswith('#'):
        continue
    [out, op, *args] = line.split()
    match op:
        case "var-x": v[out] = x
        case "var-y": v[out] = y
        case "const": v[out] = float(args[0])
        case "add": v[out] = v[args[0]] + v[args[1]]
        case "sub": v[out] = v[args[0]] - v[args[1]]
        case "mul": v[out] = v[args[0]] * v[args[1]]
        case "max": v[out] = np.maximum(v[args[0]], v[args[1]])
        case "min": v[out] = np.minimum(v[args[0]], v[args[1]])
        case "neg": v[out] = -v[args[0]]
        case "square": v[out] = v[args[0]] * v[args[0]]
        case "sqrt": v[out] = np.sqrt(v[args[0]])
        case _: raise Exception(f"unknown opcode '{op}'")
out = v[out]

with open('out.ppm', 'wb') as f: # write the image out
    f.write(f'P5\n{image_size} {image_size}\n255\n'.encode())
    f.write(((out < 0) * 255).astype(np.uint8).tobytes())
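As an aside on that write-out step: the same header-plus-bytes logic can be checked at a made-up 2x2 size (P5 is the magic number for binary grayscale PGM, where the maxval 255 means one byte per pixel):

```python
import numpy as np

# A made-up 2x2 "image": negative values render as white (255).
out = np.array([[-1.0, 1.0], [1.0, -1.0]])
header = f'P5\n{out.shape[0]} {out.shape[1]}\n255\n'.encode()
pixels = ((out < 0) * 255).astype(np.uint8).tobytes()
print(pixels)  # b'\xff\x00\x00\xff'
```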

I saw that, and while I didn’t know what linspace or meshgrid did, I got the general idea. This program runs in 40 seconds on my machine.
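For anyone else in the same boat, here is a quick sketch of what those two calls produce, at a tiny size of 3 instead of 1024:

```python
import numpy as np

# linspace: n evenly spaced samples over [-1, 1]
space = np.linspace(-1, 1, 3)
print(space)  # [-1.  0.  1.]

# meshgrid: two matrices where x varies along columns and y along rows,
# so (x[i, j], y[i, j]) gives one pixel's coordinates. Negating the
# second argument puts y = 1 at the top row, like image coordinates.
x, y = np.meshgrid(space, -space)
print(x[0])     # [-1.  0.  1.]
print(y[:, 0])  # [ 1.  0. -1.]
```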

Matt made the observation that oops, it’s storing every single frame and that takes up… uh…

>>> (8 * 1024 * 1024 * 7000) / 1000 / 1000 / 1000
58.720256
>>>

Oops, nearly 60 gigabytes of matrices. Fortunately for us, each intermediate variable is not needed for the entirety of the render: we can find out when each variable is used for the last time and insert a call to delete that frame from v afterward. This is called liveness analysis.

Garbage collection

To do that, we’ll need to load the program into memory first so we can analyze it:

prog = []

with open('prospero.vm') as f:
    for line in f:
        if line.startswith('#'):
            continue
        prog.append(line.split())

# ...

for (out, op, *args) in prog:
    match op:
        case "var-x": v[out] = x
        # ...

Now, in order to find out when a frame is used for the last time, we can seek backwards on the program to find out when it is used for the “first” time:

with_gc = []
seen = set()

for (out, op, *args) in reversed(prog):
    # Don't try to GC constants
    # Also don't add GC to the beginning (end) of the program
    if op != "const" and with_gc:
        for arg in args:
            # Delete variable at first (last) use
            if arg not in seen:
                with_gc.append(("_", "delete", arg))
        seen.update(args)
    with_gc.append((out, op, *args))

prog = with_gc[::-1]

We just have to be careful not to insert calls to delete constant values like 3.14 – they are not names that will be in v. We also don’t want to insert any instructions to delete data at the end of the program (in backwards mode, before we have added any instructions to with_gc).
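To see the pass in action, here is the same backward sweep applied to a toy four-instruction program (the variable names are invented for illustration):

```python
# Toy program: names a-d are hypothetical, not from prospero.vm.
prog = [
    ("a", "var-x"),
    ("b", "var-y"),
    ("c", "mul", "a", "b"),
    ("d", "neg", "c"),
]

with_gc = []
seen = set()
for (out, op, *args) in reversed(prog):
    if op != "const" and with_gc:
        for arg in args:
            if arg not in seen:
                with_gc.append(("_", "delete", arg))
        seen.update(args)
    with_gc.append((out, op, *args))

prog = with_gc[::-1]
# "a" and "b" get delete instructions right after their last use in "c";
# "c" itself survives because the final instruction is skipped.
print(prog)
```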

Now we edit the interpreter to handle the new delete instruction:

for (out, op, *args) in prog:
    match op:
        case "delete": del v[args[0]]
        # ...

After that modification, the program takes only 10 seconds and no more than one gigabyte of RAM. Nice.

Matt also mentions the GPU. I have a 1080ti in my desktop, so let’s see if we can make use of it.

Enter the GPU

A quick internet search reveals that CuPy should be a drop-in replacement for NumPy that runs on the GPU. Cool. I installed it with uv pip install cupy-cuda11x and replaced the first line with:
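The replacement line itself didn't survive in this copy of the post. CuPy's documented drop-in usage is to import it under the same name the rest of the script expects, so presumably something like the following (the CPU fallback is my addition so the sketch still runs without a GPU):

```python
try:
    import cupy as np  # drop-in NumPy replacement that runs on the GPU
except ImportError:
    import numpy as np  # CPU fallback if CuPy/CUDA aren't available
```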

Incredibly, the whole program ran in one and a half seconds. It can also do a 2048x2048 in three seconds. I tried 4096x4096 and ran out of GPU RAM, I think.

Traceback (most recent call last):
  File "/home/max/Documents/code/minifidget/test.py", line 38, in <module>
    case "add": v[out] = v[args[0]] + v[args[1]]
  File "cupy/_core/core.pyx", line 1318, in cupy._core.core._ndarray_base.__add__
  File "cupy/_core/_kernel.pyx", line 1349, in cupy._core._kernel.ufunc.__call__
  File "cupy/_core/_kernel.pyx", line 645, in cupy._core._kernel._get_out_args_from_optionals
  File "cupy/_core/core.pyx", line 2884, in cupy._core.core._ndarray_init
  File "cupy/_core/core.pyx", line 257, in cupy._core.core._ndarray_base._init_fast
  File "cupy/cuda/memory.pyx", line 738, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1424, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1445, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1116, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1137, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 1382, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
  File "cupy/cuda/memory.pyx", line 1385, in cupy.cuda.memory.SingleDeviceMemoryPool._try_malloc
cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 134,217,728 bytes (allocated so far: 5,771,395,072 bytes).
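A quick sanity check on that number: assuming float64 frames, 134,217,728 bytes is exactly one 4096x4096 matrix of 8-byte floats.

```python
# One image-sized frame at 4096x4096 in float64:
frame_bytes = 4096 * 4096 * 8
print(frame_bytes)  # 134217728, matching the failed allocation above
```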

Can it be made faster still? Almost certainly yes. I know some folks working on some meta-JIT hackery that might make it go even faster…
