We Have A Vulkan Memory Allocator At Home
Vulkan’s built-in allocation functionality is extremely limited. Implementations are:
- Only required to support up to 4096 individual allocations, with the concrete limit reported dynamically.
- Required to align returned memory conservatively for the worst-case alignment of all resource types.
- Responsible for synchronizing concurrent allocations.
These issues typically require the user to implement some form of memory allocator to sub-allocate resources from larger memory blocks. In this adventure, rather than doing the smart thing and using Vulkan Memory Allocator, we’ll be rolling our own from scratch!
Our goal is to allocate a smaller number of large super-blocks directly from the driver, and from those super-blocks, sub-allocate to service individual allocation requests from the application. There’s a variety of practical options here, but one very common choice is TLSF.
TLSF
TLSF is a simple, fast, constant-time allocator algorithm designed primarily for embedded use-cases. Empty blocks are stored in a series of free-lists, using a linear-log sequence of bucket sizes to bound internal fragmentation and minimize external fragmentation. A two-level bitmap acceleration structure tracks which free-lists contain blocks, allowing good-fit searches to complete using a couple of bit instructions widely supported by modern CPUs.
Memory is passed to TLSF as super-blocks representing large chunks of contiguous memory. The allocator then operates by subdividing those super-blocks into blocks representing both individual allocations and free regions.
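To make the acceleration structure concrete, the two-level bitmap can be sketched as a first-level word with one bit per bin, plus one word of sub-bin bits per bin. The field names and widths here are illustrative, not the allocator's actual types:

```rust
/// Illustrative two-level bitmap: bit i of `bitmap_0` is set when any
/// free-list in bin i is non-empty; bit j of `bitmap_1[i]` is set when
/// the free-list for sub-bin (i, j) is non-empty.
struct Bitmaps {
    bitmap_0: u32,
    bitmap_1: [u32; 32],
}

impl Bitmaps {
    /// Mark (bin, sub_bin)'s free-list as non-empty.
    fn set(&mut self, bin: u32, sub_bin: u32) {
        self.bitmap_0 |= 1 << bin;
        self.bitmap_1[bin as usize] |= 1 << sub_bin;
    }

    /// Mark (bin, sub_bin)'s free-list as empty, clearing the first-level
    /// bit only once every sub-bin in that bin is empty.
    fn clear(&mut self, bin: u32, sub_bin: u32) {
        self.bitmap_1[bin as usize] &= !(1 << sub_bin);
        if self.bitmap_1[bin as usize] == 0 {
            self.bitmap_0 &= !(1 << bin);
        }
    }
}
```

Insertion and extraction of free blocks keep these bits in sync with the free-lists, which is what lets the allocation path below find a suitable list with two `trailing_zeros` operations instead of a scan.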
Binning
Searching all unused blocks to find the most suitable block to allocate from would be prohibitively expensive, so unused blocks are stored in a set of free-lists segregated by size. The strategy we use to segregate sizes into bins must strike a delicate balance:
- It must be fast to calculate the bin for a given size.
- Bin sizes must approximate allocation sizes to minimize fragmentation.
- The number of bins should ideally be small and bounded.
To achieve these goals we’ll use a common technique, linear-log bucketing.
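As a sketch of how linear-log bucketing might map sizes to bins: small sizes are binned linearly at a fixed granule, while larger sizes use the position of the top set bit to pick the bin and the next few bits to pick a sub-bin within it. The constants here (a 16-way sub-division, a 128-byte linear region) are illustrative choices, not necessarily those of the real allocator:

```rust
const SUB_BITS: u32 = 4; // 16 sub-bins per power-of-two bin (illustrative)
const SUB_BINS: u32 = 1 << SUB_BITS;
const LINEAR_BITS: u32 = 7; // sizes below 128 are binned linearly (illustrative)

/// Map a size to (bin, sub_bin), rounding down.
fn bin_from_size(size: u32) -> (u32, u32) {
    if size < (1 << LINEAR_BITS) {
        // Linear region: one sub-bin per 8-byte granule.
        (0, size >> (LINEAR_BITS - SUB_BITS))
    } else {
        // Log region: the top set bit selects the bin, the next SUB_BITS
        // bits select the sub-bin within it.
        let log2 = 31 - size.leading_zeros();
        let sub = (size >> (log2 - SUB_BITS)) & (SUB_BINS - 1);
        (log2 - LINEAR_BITS + 1, sub)
    }
}

/// Round a size up to the next sub-bin boundary, returning the rounded size
/// and its (bin, sub_bin). Allocation searches must round up so that any
/// block found in the returned bin is guaranteed large enough.
fn bin_from_size_round_up(size: u32) -> (u32, (u32, u32)) {
    let rounded = if size < (1 << LINEAR_BITS) {
        (size + 7) & !7
    } else {
        let log2 = 31 - size.leading_zeros();
        let granule: u32 = 1 << (log2 - SUB_BITS);
        (size + granule - 1) & !(granule - 1)
    };
    (rounded, bin_from_size(rounded))
}
```

Note the two directions: free blocks are inserted with the round-down mapping, while allocation requests use the round-up mapping, so the first block pulled from a candidate list is always large enough.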
Allocation
Allocation proceeds by scanning the bitmap acceleration structure. If a suitable block is found, it is immediately used to service the allocation. If there would be significant leftover space, the block is split and the leftover memory inserted back into the free-lists.
/// Search the acceleration structure for a non-empty list suitable for an
/// allocation of the given size.
///
/// Returns the rounded size and bin index if a non-empty list is found, or
/// None otherwise.
fn search_non_empty_bin(&self, size: u32) -> Option<(u32, Bin)> {
    // We need to find the bin which contains only empty-blocks large enough
    // for the given size because we unconditionally use the first empty block
    // found. So this must round up.
    let (rounded_size, starting_bin) = Bin::from_size_round_up(size);

    let mut bin = starting_bin.bin();
    let sub_bin = starting_bin.sub_bin();

    // First we scan the second-level bitmap from sub_bin, masking out the
    // earlier sub-bins so we don't end up returning a bin that's too small
    // for the allocation.
    let mut second_level = self.bitmap_1[bin.widen()] & (!0 << sub_bin);

    // If that search failed, then we must scan the first-level bitmap from
    // the next bin forward. If we find anything here it cannot possibly be
    // smaller than the requested allocation.
    if second_level == 0 {
        let first_level = self.bitmap_0 & (!0 << (bin + 1));

        // If that search also failed, there are no suitable blocks.
        if first_level == 0 {
            return None;
        }

        // Recalculate the bin from the first-level bitmap.
        bin = first_level.trailing_zeros();
        second_level = self.bitmap_1[bin.widen()];
    }

    // Find the sub-bin from the second-level bitmap.
    let sub_bin = second_level.trailing_zeros();

    Some((rounded_size, Bin::new(bin, sub_bin)))
}

Freeing allocated memory does some cursory validation of the allocation to prevent trivial double-frees, then returns the block to the free-lists. Adjacent free physical blocks are merged immediately.
pub fn free(&mut self, allocation: Allocation<T>) {
    let mut block_index = allocation.block_index;
    let generation = self.blocks[block_index].generation;
    assert_eq!(generation, allocation.generation, "double-free");
    self.blocks[block_index].generation = generation.wrapping_add(1);

    // Merge next block into the current block.
    {
        let into_block_index = block_index;
        let from_block_index = self.blocks[block_index].phys_link.next;
        if self.can_merge_block_left(into_block_index, from_block_index) {
            let from_size = self.blocks[from_block_index].size;
            self.extract_block(from_block_index);
            list_unlink!(self.blocks, phys_link, from_block_index);
            self.recycle_block(from_block_index);
            self.blocks[into_block_index].size += from_size;
        }
    }

    // Merge current block into the prev block.
    {
        let into_block_index = self.blocks[block_index].phys_link.prev;
        let from_block_index = block_index;
        if self.can_merge_block_left(into_block_index, from_block_index) {
            let from_size = self.blocks[from_block_index].size;
            self.extract_block(into_block_index);
            list_unlink!(self.blocks, phys_link, from_block_index);
            self.recycle_block(from_block_index);
            self.blocks[into_block_index].size += from_size;
            block_index = into_block_index;
        }
    }

    // Insert the merged free block.
    self.insert_block(block_index);
}

Once per frame we scan the super-blocks of all TLSF instances, immediately releasing any which have become entirely unused. In a more production-ready setup this work would be amortized across a longer time period to avoid blocking the frame.
Thread Local Bump Allocator
In order to reduce pressure on the general purpose allocator we implement a double-buffered thread-local bump allocator for transient allocations which last only a single frame. This allows the calling thread to avoid needing to take a lock, and reduces the complex allocation bookkeeping to a few branches and a subtraction.
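The core of such an allocator is tiny. The sketch below models one thread's double-buffered region as byte offsets into a pre-allocated buffer; the type names, sizes, and `begin_frame` flip are illustrative, not the engine's actual API:

```rust
/// One frame's bump region: allocation is an align, a bounds check, and an add.
struct BumpRegion {
    buf: Vec<u8>,
    head: usize,
}

impl BumpRegion {
    fn new(capacity: usize) -> Self {
        Self { buf: vec![0; capacity], head: 0 }
    }

    /// Returns the offset of `size` bytes aligned to `align` (a power of
    /// two), or None if the region is exhausted, in which case the caller
    /// falls back to the general allocator.
    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        let offset = (self.head + align - 1) & !(align - 1);
        let end = offset.checked_add(size)?;
        if end > self.buf.len() {
            return None;
        }
        self.head = end;
        Some(offset)
    }

    fn reset(&mut self) {
        self.head = 0;
    }
}

/// Double buffering: while the GPU consumes the previous frame's region,
/// the CPU writes into the current one. Flipping at frame start resets
/// whichever region the GPU has finished with.
struct TransientAllocator {
    regions: [BumpRegion; 2],
    frame_parity: usize,
}

impl TransientAllocator {
    fn begin_frame(&mut self) {
        self.frame_parity ^= 1;
        self.regions[self.frame_parity].reset();
    }

    fn alloc(&mut self, size: usize, align: usize) -> Option<usize> {
        self.regions[self.frame_parity].alloc(size, align)
    }
}
```

Because each thread owns its own `TransientAllocator`, no lock is taken on the hot path; only refilling a region from the general allocator needs synchronization.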
If a transient buffer is requested that exceeds the block size of the transient allocator, we fall back to the general allocator and immediately queue the buffer for deferred destruction.
let texture: &[u8] = _;
let buffer = device.request_transient_buffer_with_data(
    &frame,
    &thread_token,
    BufferUsageFlags::TRANSFER,
    texture,
);
device.cmd_copy_buffer_to_image(
    &mut cmd_buffer,
    buffer.into(),
    image,
    ImageLayout::Optimal,
    &[BufferImageCopy {
        buffer_offset: 0,
        buffer_row_length: 0,
        buffer_image_height: 0,
        image_subresource: default(),
        image_offset: Offset3d { x: 0, y: 0, z: 0 },
        image_extent: Extent3d {
            width,
            height,
            depth: 1,
        },
    }],
);

The request_transient_buffer_with_data function returns a handle bound to the lifetime of the frame parameter, which prevents the common error of persisting references to frame-allocated data after the frame has been submitted. At the beginning of a frame the linear allocators are reset, and the underlying buffers are recycled.
For this scheme to be production-ready it would also be important that the system track the amount of memory used each frame, and release excess memory that is unused over a longer period of time to the general allocator. As it is currently implemented, any buffers allocated by the transient memory pools will remain attached to those pools until shutdown.
Zeroed Memory
When memory is provided from the kernel to a desktop application, it’s well-defined that the contents of that memory are zeroed. While the mechanics differ between operating systems, this is safe for an application to rely on, and is important for security as un-cleared memory from other processes could contain secrets. With video memory allocations it’s… less straightforward.
On Windows, fresh video memory from the OS should always be zeroed. DX12 specifies that newly allocated video memory is always zeroed, and then provides an opt-out via D3D12_HEAP_FLAG_CREATE_NOT_ZEROED, where drivers may recycle memory within a single process without clearing it. Practically speaking, this means that Vulkan drivers on Windows also default to zeroing video memory before handing it off to an application.
On Linux the situation is the opposite. Not only do drivers avoid clearing memory within a process, but until recently amdgpu did not even clear video memory from other processes. This means it’s vitally important that you manually clear video memory after receiving it, wherever that matters to your application’s semantics. It’s a common enough source of problems that Mesa has a frequently applied driconf quirk which forces the driver to zero allocated video memory.
To bring order to the chaos, VK_EXT_zero_initialize_device_memory is a new extension which introduces a flag, VK_MEMORY_ALLOCATE_ZERO_INITIALIZE_BIT_EXT, enabling applications to request that the driver perform zero initialization on a per-allocation basis. Note, though, that the default is still backwards compared to DX12 and Windows Vulkan.