Writing into Uninitialized Buffers in Rust

Original link: https://blog.sunfishcode.online/writingintouninitializedbuffersinrust/

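This blog post introduces the `Buffer` trait, a new approach to handling uninitialized buffers in Rust, now available in rustix 1.0 and in the standalone `buffer-trait` library. The `Buffer` trait aims to provide a safe and efficient way to read data into possibly uninitialized memory, replacing older approaches such as `read_uninit`. The trait defines a function for accessing the underlying buffer (`parts_mut`) and a function for marking the buffer as initialized after writing (`assume_init`). It is implemented for `&mut [T]`, `&mut [MaybeUninit<T>]`, and the `SpareCapacity` wrapper around `Vec<T>`, which allow reading into an initialized slice, an uninitialized slice, and the spare capacity of a vector, respectively. The `Buffer` trait also handles reading data into non-byte buffers. Although simpler than Rust's experimental `BorrowedBuf` (it avoids the "double cursor"), `Buffer` has a required method, `assume_init`, that is `unsafe`. The author explores making the `Buffer` trait safer by adding a `Cursor` API similar to `BorrowedCursor`. If the `Buffer` trait proves successful in practice, the hope is that it will be considered for inclusion in Rust's standard library.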

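A Hacker News commenter, 90s_dev, pointed out the challenges of handling uninitialized memory in Rust, noting that the creator of a new open-source "edit" program developed for Windows had expressed similar frustration. The core issue is that Rust treats uninitialized buffers strictly, requiring `MaybeUninit` or `mem::uninit()`. The commenter argues that this complexity exposes compiler-engineering minutiae to the programmer. They wish for a simpler approach, as in C, where declaring an uninitialized array, reading data into it, and then safely accessing it feels more natural. While acknowledging that C has undefined behavior in similar situations, the commenter finds Rust's stricter approach more mentally taxing than their experience with C. They link to an earlier Hacker News comment for more context.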

Original text


Uninitialized buffers in Rust are a long-standing question, for example:

Recently, John Nunley and Alex Saveau came up with an idea for a new approach using a Buffer trait. It is now in rustix 1.0, and I'll describe it in this post.

Update: This idea is now available in a standalone published library: buffer-trait.

Introducing the Buffer trait

The POSIX read function reads bytes from a file descriptor into a buffer, and it can read fewer bytes than requested. Using Buffer, read in rustix looks like this:

pub fn read<Fd: AsFd, Buf: Buffer<u8>>(fd: Fd, buf: Buf) -> Result<Buf::Output>

This uses the Buffer trait to describe the buffer argument. The Buffer trait looks like this:

pub trait Buffer<T> {
    /// The type of the value returned by functions with `Buffer` arguments.
    type Output;

    /// Return a raw pointer and length to the underlying buffer.
    fn parts_mut(&mut self) -> (*mut T, usize);

    /// Assert that `len` elements were written to, and provide a return value.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}

(And thanks to Yoshua Wuyts for feedback on this trait and encouragement for the overall idea!)

(Rustix's own Buffer trait is sealed and its functions are private. For now, rustix is reserving the ability to evolve the trait without breaking compatibility, at the expense of not letting users use Buffer to define their own I/O functions.)

Buffer is implemented for &mut [T], so users can pass read a &mut [u8] buffer to write into, and it'll return a Result<usize>, where the usize indicates how many bytes were actually read, on success. This matches how read in rustix used to work. Using this looks like:

let mut buf = [0_u8; 16];
let num_read = read(fd, &mut buf)?;
consume(&buf[..num_read]); // `use` is a Rust keyword, so a placeholder name is used here

Buffer is also implemented for &mut [MaybeUninit<T>], so users can pass read a &mut [MaybeUninit<u8>], and in that case, they'll get back a Result<(&mut [u8], &mut [MaybeUninit<u8>])>. On success, that provides a pair of slices which are subslices of the original buffer, containing the range of bytes that data was read into, and the remaining bytes that remain uninitialized. Rustix previously had a function called read_uninit that worked this way, and in rustix 1.0 it's replaced by this new Buffer-enabled read function. Using this looks like:

let mut buf = [MaybeUninit::<u8>::uninit(); 16];
let (init, uninit) = read(fd, &mut buf)?;
consume(init); // `use` is a Rust keyword, so a placeholder name is used here

This allows reading into uninitialized buffers with a safe API.
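To make the two slice cases concrete, here is a minimal, std-only sketch of how such implementations could look. It is not rustix's actual code: the trait is a local copy of the one shown above, and `copy_into` is a hypothetical stand-in for a system call like read that simply copies from a source slice.

```rust
use std::mem::MaybeUninit;
use std::slice;

/// Local copy of the trait from the post (rustix's real trait is sealed).
pub trait Buffer<T> {
    type Output;
    fn parts_mut(&mut self) -> (*mut T, usize);
    /// # Safety
    /// The first `len` elements of the buffer must have been written.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}

/// Already-initialized slice: the output is just the element count.
impl<T> Buffer<T> for &mut [T] {
    type Output = usize;

    fn parts_mut(&mut self) -> (*mut T, usize) {
        (self.as_mut_ptr(), self.len())
    }

    unsafe fn assume_init(self, len: usize) -> usize {
        len
    }
}

/// Possibly uninitialized slice: the output splits the buffer into the
/// written prefix and the still-uninitialized remainder.
impl<'a, T> Buffer<T> for &'a mut [MaybeUninit<T>] {
    type Output = (&'a mut [T], &'a mut [MaybeUninit<T>]);

    fn parts_mut(&mut self) -> (*mut T, usize) {
        (self.as_mut_ptr().cast(), self.len())
    }

    unsafe fn assume_init(self, len: usize) -> Self::Output {
        // Panics if `len` exceeds the buffer, rather than going out of bounds.
        let (init, uninit) = self.split_at_mut(len);
        // SAFETY: the caller promises these `len` elements are initialized,
        // and `MaybeUninit<T>` has the same layout as `T`.
        let init = unsafe { slice::from_raw_parts_mut(init.as_mut_ptr().cast::<T>(), len) };
        (init, uninit)
    }
}

/// Hypothetical stand-in for a syscall: copies as many elements of `src`
/// as fit into `buf` and reports the result through `Buffer::Output`.
pub fn copy_into<T: Copy, B: Buffer<T>>(src: &[T], mut buf: B) -> B::Output {
    let (ptr, cap) = buf.parts_mut();
    let n = src.len().min(cap);
    // SAFETY: `ptr` is valid for `cap` writes and `n <= cap`; `src` cannot
    // overlap the exclusively borrowed buffer; exactly `n` elements are written.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), ptr, n);
        buf.assume_init(n)
    }
}
```

Because this sketch only covers slices, a caller with an array would pass `&mut buf[..]`; the same `copy_into` call then returns a `usize` for an initialized slice and a pair of subslices for a `MaybeUninit` slice.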

And, Buffer also supports a way to read into the spare capacity of a Vec. The spare_capacity function takes a &mut Vec<T> and returns a SpareCapacity newtype which implements Buffer, and it automatically sets the length of the vector to include the number of initialized elements after the read, encapsulating the unsafety of Vec::set_len. Using this looks like:

let mut buf = Vec::<u8>::with_capacity(1024);
let num_read = read(fd, spare_capacity(&mut buf))?;
consume(&buf); // `use` is a Rust keyword, so a placeholder name is used here

In rustix, all functions that previously took &mut [u8] buffers to write into now take impl Buffer<u8> buffers, so they support writing into uninitialized buffers.
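As a rough illustration of how a spare-capacity wrapper can hide `Vec::set_len`, the sketch below reimplements the idea with std only. The `SpareCapacity` and `spare_capacity` names mirror the post, but the bodies are assumptions rather than rustix's implementation, and `copy_into` is again a hypothetical stand-in for a system call.

```rust
/// Local copy of the trait from the post (rustix's real trait is sealed).
pub trait Buffer<T> {
    type Output;
    fn parts_mut(&mut self) -> (*mut T, usize);
    /// # Safety
    /// The first `len` elements of the buffer must have been written.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}

/// Newtype meaning "append into the spare capacity of this `Vec`".
pub struct SpareCapacity<'a, T>(&'a mut Vec<T>);

/// Wrap a vector so it can be passed where a `Buffer` is expected.
pub fn spare_capacity<T>(v: &mut Vec<T>) -> SpareCapacity<'_, T> {
    SpareCapacity(v)
}

impl<'a, T> Buffer<T> for SpareCapacity<'a, T> {
    type Output = usize;

    fn parts_mut(&mut self) -> (*mut T, usize) {
        // The region between `len` and `capacity`; no allocation happens here.
        let spare = self.0.spare_capacity_mut();
        (spare.as_mut_ptr().cast(), spare.len())
    }

    unsafe fn assume_init(self, len: usize) -> usize {
        let new_len = self.0.len() + len;
        // SAFETY: the caller promises that `len` elements past the old
        // length were written, and they fit within the capacity.
        unsafe { self.0.set_len(new_len) };
        len
    }
}

/// Hypothetical stand-in for a syscall: copies as many elements of `src`
/// as fit into `buf` and reports the result through `Buffer::Output`.
pub fn copy_into<T: Copy, B: Buffer<T>>(src: &[T], mut buf: B) -> B::Output {
    let (ptr, cap) = buf.parts_mut();
    let n = src.len().min(cap);
    // SAFETY: `ptr` is valid for `cap` writes, `n <= cap`, and `src` cannot
    // overlap the exclusively borrowed vector.
    unsafe {
        std::ptr::copy_nonoverlapping(src.as_ptr(), ptr, n);
        buf.assume_init(n)
    }
}
```

In this sketch, repeated calls append to the vector and leave its capacity unchanged, so a caller never touches `set_len` directly.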

Under the covers

read is implemented like this:

let len = unsafe { backend::io::syscalls::read(fd.as_fd(), buf.parts_mut())? };
unsafe { Ok(buf.assume_init(len)) }

First we call the underlying system call, and it returns the number of bytes it read. We then pass that to assume_init, which computes the Buffer::Output to return. The output may be just that number, or may be a pair of slices reflecting that number.

What if T is not u8?

Buffer uses a type parameter T rather than hard-coding u8, so that it can be used by functions like epoll::wait, kevent, and port::get to return event records instead of bytes. Using this can look like this:

let mut event_list = Vec::<epoll::Event>::with_capacity(16);
loop {
    let _num = epoll::wait(&epoll, spare_capacity(&mut event_list), None)?;
    for event in event_list.drain(..) {
        handle(event);
    }
}

This drains the Vec with drain so that it's empty before each wait, because spare_capacity appends to the Vec rather than overwriting any elements.

There are no dynamic allocations inside the loop; SpareCapacity only uses the existing spare capacity and only calls set_len, and not resize.

Alternatively, because Buffer also works on slices, this code can be written without using Vec at all:

let mut event_list = [MaybeUninit::<epoll::Event>::uninit(); 16];
loop {
    let (init, _uninit) = epoll::wait(&epoll, &mut event_list, None)?;
    for event in init {
        handle(event);
    }
}

Error messages

One downside of the Buffer trait approach is that it sometimes evokes error messages from rustc which aren't obvious. This happened often enough that rustix's documentation now has a section about them, along with an example showing where they come up.

Using Buffer safely

Rust's std currently contains an experimental API based on BorrowedBuf, which has the nice property of allowing users to use it without using unsafe, and without doing anything hugely inefficient, such as initializing the full buffer. To achieve this, BorrowedBuf uses a "double cursor" design to avoid re-initializing memory that has already been initialized.

The Buffer trait described here is simpler, avoiding the need for a "double cursor"; however, it does have an unsafe required method. Is there a way we could modify it to support safe use?

A Cursor API like BorrowedCursor could do it. That supports safely and incrementally writing into an uninitialized buffer. And a key feature of BorrowedCursor is that it never requires the full buffer to be eagerly initialized.

With that, the Buffer trait might look like:

pub trait Buffer<T> {
    // ... existing contents

    /// An alternative to `parts_mut` for use with `init`.
    ///
    /// Return a `Cursor`.
    fn cursor(self) -> Cursor<T, Self> where Self: Sized {
        Cursor::new(self)
    }
}

impl<T, B: Buffer<T>> Cursor<T, B> {
    /// ... cursor API

    fn finish(self) -> B::Output {
        // SAFETY: `Cursor` ensures that exactly `pos` elements have been written.
        unsafe { self.b.assume_init(self.pos) }
    }
}

This way, a user could write their own functions that take Buffer arguments and implement them using cursor, without using unsafe.
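One way such a cursor could be fleshed out is sketched below. This is a guess at a possible design, not an API that exists in rustix or buffer-trait: `Cursor::push` and `fill_from_iter` are hypothetical names, and the cursor assumes that `parts_mut` reports the same region on every call. It re-queries `parts_mut` on each write instead of caching the raw pointer, so no pointer is held across a move of the buffer.

```rust
use std::marker::PhantomData;
use std::mem::MaybeUninit;
use std::slice;

/// Local copy of the trait from the post (rustix's real trait is sealed).
pub trait Buffer<T> {
    type Output;
    fn parts_mut(&mut self) -> (*mut T, usize);
    /// # Safety
    /// The first `len` elements of the buffer must have been written.
    unsafe fn assume_init(self, len: usize) -> Self::Output;
}

impl<'a, T> Buffer<T> for &'a mut [MaybeUninit<T>] {
    type Output = (&'a mut [T], &'a mut [MaybeUninit<T>]);

    fn parts_mut(&mut self) -> (*mut T, usize) {
        (self.as_mut_ptr().cast(), self.len())
    }

    unsafe fn assume_init(self, len: usize) -> Self::Output {
        let (init, uninit) = self.split_at_mut(len);
        // SAFETY: the caller promises these `len` elements are initialized.
        let init = unsafe { slice::from_raw_parts_mut(init.as_mut_ptr().cast::<T>(), len) };
        (init, uninit)
    }
}

/// Hypothetical cursor: owns the buffer and counts how many elements
/// have been written from the front.
pub struct Cursor<T, B: Buffer<T>> {
    b: B,
    pos: usize,
    _marker: PhantomData<T>,
}

impl<T, B: Buffer<T>> Cursor<T, B> {
    pub fn new(b: B) -> Self {
        Cursor { b, pos: 0, _marker: PhantomData }
    }

    /// Write one element, or hand it back if the buffer is full.
    pub fn push(&mut self, value: T) -> Result<(), T> {
        let (ptr, cap) = self.b.parts_mut();
        if self.pos >= cap {
            return Err(value);
        }
        // SAFETY: `pos < cap`, so the slot is inside the reported buffer.
        unsafe { ptr.add(self.pos).write(value) };
        self.pos += 1;
        Ok(())
    }

    pub fn finish(self) -> B::Output {
        // SAFETY: `push` wrote exactly `pos` elements, starting at index 0.
        unsafe { self.b.assume_init(self.pos) }
    }
}

/// A user-written function with no `unsafe`: fills the buffer from an
/// iterator and stops when either runs out.
pub fn fill_from_iter<T, B: Buffer<T>>(items: impl IntoIterator<Item = T>, buf: B) -> B::Output {
    let mut cursor = Cursor::new(buf);
    for item in items {
        if cursor.push(item).is_err() {
            break;
        }
    }
    cursor.finish()
}
```

All of the `unsafe` stays inside the `Buffer` implementation and the cursor; `fill_from_iter` itself is written entirely in safe code, which is the property the cursor idea is after.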

Why parts_mut and a raw pointer?

The parts_mut function in the Buffer trait looks like this:

fn parts_mut(&mut self) -> (*mut T, usize);

Why return a raw pointer and length, instead of a &mut [MaybeUninit<T>]? Because a &mut [MaybeUninit<T>] would be unsound in a subtle way. We implement Buffer for &mut [T], which cannot contain any uninitialized elements, and exposing it as a &mut [MaybeUninit<T>] would allow uninitialized elements to be written into it.

With a raw pointer, we put the burden on the assume_init call to guarantee that the buffer has been written to properly.

Looking forward

A limited version of this Buffer trait is now in rustix 1.0, so we'll see how it goes in practice.

This idea is now also available in a standalone published library: buffer-trait.

If it works out well, I think this Buffer design is worth considering for Rust's std, as a replacement for BorrowedBuf (which is currently unstable). It's simpler, as it avoids the "double cursor" pattern, and it has the fun feature of supporting the Vec spare capacity use case and encapsulating the unsafe Vec::set_len call.
