Vulkan's built-in allocation functionality is extremely limited, with the specification only requiring that implementations support as few as 4096 individual allocations. There is no control over alignment, so all allocated memory will be conservatively aligned for the worst case of all resource types. These limitations typically require the user to implement some form of memory allocator to sub-allocate resources from larger memory blocks.
In this adventure, rather than doing the smart thing and using VMA1, we'll be rolling our own from scratch!
1 The Actual Allocator
1.1 TLSF
1.2 Per-frame Bump Allocator
2 Vulkan Messing
2.1 Memory Requirements
2.2 Dedicated Allocations
2.3 Allocation Size Limit
2.4 Memory Allocation Limit
2.5 Alignment 2: Buffer Image Granularity
2.6 Memory Mapping
3 Advanced Vulkan Messing
3.1 Aliasing
3.2 Defragmentation
3.3 Resident Set Management
4 Conclusion
5 Links
6 Changelog
In the simplest sense, an allocator needs to keep track of which regions of memory are available for use, and which regions are unavailable. It needs to do so without significant per-allocation overhead, while avoiding expensive searches during allocation and deallocation. There are lots of approaches here, but we'll be using TLSF23.
TLSF (Two-Level Segregated Fit) is a simple, fast, constant-time allocation algorithm designed primarily for embedded use-cases. Empty blocks are stored in a series of free-lists bucketed by their size, using a linear-log45 sequence to minimise loss due to fragmentation. A two-level bitmap acceleration structure tracks which free-lists contain blocks, allowing searches to complete with a couple of bit instructions widely supported on modern CPUs, rather than requiring a linear scan over the free-list heads.
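To make the bucketing concrete, here's a minimal sketch of a linear-log size-to-bin mapping in the style TLSF uses. The constants and names are illustrative rather than this allocator's actual code, and this variant rounds down; allocation needs a round-up twin like the Bin::from_size_round_up used below.

// Illustrative linear-log binning: sizes below 2^LINEAR_LOG2 map linearly into
// bin zero, larger sizes use floor(log2(size)) for the bin and the next few
// bits below the top bit for the sub-bin.
const SUB_BINS_LOG2: u32 = 5; // 32 sub-bins per bin
const SUB_BINS: u32 = 1 << SUB_BINS_LOG2;
const LINEAR_LOG2: u32 = SUB_BINS_LOG2;

fn bin_from_size_round_down(size: u32) -> (u32, u32) {
    if size < (1 << LINEAR_LOG2) {
        // Small sizes get one sub-bin each.
        (0, size)
    } else {
        // The position of the top set bit selects the bin...
        let log2 = 31 - size.leading_zeros();
        // ...and the SUB_BINS_LOG2 bits below it select the sub-bin.
        let sub_bin = (size >> (log2 - SUB_BINS_LOG2)) & (SUB_BINS - 1);
        (log2 - LINEAR_LOG2 + 1, sub_bin)
    }
}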
Allocation first scans the bitmap structure, and if a suitable block is found, immediately uses it for the allocation. When there would be significant left-over space, allocation splits the block and inserts the remainder back into the TLSF structure.
/// Search the acceleration structure for a non-empty list suitable for an
/// allocation of the given size.
///
/// Returns the rounded size and bin index if a non-empty list is found, or
/// None.
fn search_non_empty_bin(&self, size: u32) -> Option<(u32, Bin)> {
    // We need to find the bin which contains only empty-blocks large enough for the
    // given size because we unconditionally use the first empty block found. So
    // this must round up.
    let (rounded_size, starting_bin) = Bin::from_size_round_up(size);

    let mut bin = starting_bin.bin();
    let sub_bin = starting_bin.sub_bin();

    // First we scan the second-level bitmap from sub_bin, masking out the earlier
    // sub-bins so we don't end up returning a bin that's too small for the
    // allocation.
    let mut second_level = self.bitmap_1[bin.widen()] & (!0 << sub_bin);

    // If that search failed, then we must scan the first-level bitmap from the next
    // bin forward. If we find anything here it cannot possibly be smaller than the
    // requested allocation.
    if second_level == 0 {
        let first_level = self.bitmap_0 & (!0 << (bin + 1));

        // If that search also failed, there's no suitable blocks.
        if first_level == 0 {
            return None;
        }

        // Recalculate the bin from the first level bitmap.
        bin = first_level.trailing_zeros();
        second_level = self.bitmap_1[bin.widen()];
    }

    // Find the sub-bin from the second level bitmap.
    let sub_bin = second_level.trailing_zeros();

    Some((rounded_size, Bin::new(bin, sub_bin)))
}
Freeing allocated memory does some cursory validation of the allocation to prevent obvious double-free issues, and then returns the block to the TLSF structure. Adjacent free physical blocks are merged immediately.
pub fn free(&mut self, allocation: Allocation<T>) {
    let mut block_index = allocation.block_index;
    let generation = self.blocks[block_index].generation;
    assert_eq!(generation, allocation.generation, "double-free");
    self.blocks[block_index].generation = generation.wrapping_add(1);

    // Merge next block into the current block.
    {
        let into_block_index = block_index;
        let from_block_index = self.blocks[block_index].phys_link.next;
        if self.can_merge_block_left(into_block_index, from_block_index) {
            let from_size = self.blocks[from_block_index].size;
            self.extract_block(from_block_index);
            list_unlink!(self.blocks, phys_link, from_block_index);
            self.recycle_block(from_block_index);
            self.blocks[into_block_index].size += from_size;
        }
    }

    // Merge current block into the prev block.
    {
        let into_block_index = self.blocks[block_index].phys_link.prev;
        let from_block_index = block_index;
        if self.can_merge_block_left(into_block_index, from_block_index) {
            let from_size = self.blocks[from_block_index].size;
            self.extract_block(into_block_index);
            list_unlink!(self.blocks, phys_link, from_block_index);
            self.recycle_block(from_block_index);
            self.blocks[into_block_index].size += from_size;
            block_index = into_block_index;
        }
    }

    // Insert the merged free block.
    self.insert_block(block_index);
}
Once per frame we scan the super-blocks of all TLSF instances, freeing any empty super-blocks immediately. In a more production-ready setup this work would be amortized across a longer time period to avoid blocking the frame.
This doesn't strictly fall within the scope of the memory allocator, but we're also going to support automatically managed, per-frame, thread-local bump allocators for uniforms and other transient data. This should remove a lot of pressure from the general allocator, and avoids needing to take a lock to allocate.
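Each thread-local block can then be managed with a plain offset bump. A minimal sketch, with hypothetical names rather than the library's actual types:

// Hypothetical per-frame bump allocator over one persistently mapped block of
// `size` bytes; the real implementation's details will differ.
struct BumpAllocator {
    offset: u64,
    size: u64,
}

impl BumpAllocator {
    /// Try to reserve `size` bytes with the given power-of-two alignment,
    /// returning the offset into the block, or None if the block is exhausted.
    fn allocate(&mut self, size: u64, align: u64) -> Option<u64> {
        debug_assert!(align.is_power_of_two());
        let offset = (self.offset + align - 1) & !(align - 1);
        if offset + size > self.size {
            return None; // caller falls back to the regular TLSF path
        }
        self.offset = offset + size;
        Some(offset)
    }

    /// Called once per frame, after the frame's fence has signalled.
    fn reset(&mut self) {
        self.offset = 0;
    }
}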
If an allocation request is too large for the bump allocator's block size, then we instead allocate via the regular TLSF allocation path, and immediately queue the object for destruction with the frame.
let texture: &[u8] = _;

let buffer = device.request_transient_buffer_with_data(
    &frame,
    &thread_token,
    BufferUsageFlags::TRANSFER,
    texture,
);

device.cmd_copy_buffer_to_image(
    &mut cmd_buffer,
    buffer.into(),
    image,
    ImageLayout::Optimal,
    &[BufferImageCopy {
        buffer_offset: 0,
        buffer_row_length: 0,
        buffer_image_height: 0,
        image_subresource: default(),
        image_offset: Offset3d { x: 0, y: 0, z: 0 },
        image_extent: Extent3d {
            width,
            height,
            depth: 1,
        },
    }],
);
Rust lifetimes prevent misuse by stopping transient buffer handles from living beyond a single frame. Each frame, the linear allocators are reset and the buffers recycled (after the GPU fence for that frame has been signalled).
In the future the library will track how much memory the linear allocators use each frame over some longer window, and release any excess buffers. Otherwise a single frame that uses a large amount of memory (say, during a load screen) would leave that memory stuck in the transient allocator pools forever, unable to be used for other purposes.
Most of the complexity of a Vulkan memory allocator is, perhaps unsurprisingly, the Vulkan part, which, in fairness, is largely due to the inherently complex range of system topologies and access types that Vulkan must expose coherently.
Vulkan uses a two-level abstraction for memory. First, memory heaps represent areas of memory that are individually budgeted; second, memory types define specific sets of properties with which memory can be allocated from a given heap.
At a basic level there are a few different memory properties you need to care about.
DEVICE_LOCAL
: Is this memory “fast” to access from the GPU.
HOST_VISIBLE
: Is this memory accessible from the CPU.
HOST_CACHED
: Are CPU accesses to this memory cached on the host side. Uncached memory has
very poor read performance, and is typically write-combined. Cached memory interacts with host
caches normally.
It's possible for memory to be HOST_VISIBLE but not HOST_COHERENT, where it is necessary to manually call vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges. As far as I know, all relevant desktop devices expose host-coherent host-visible memory types.

From an application perspective, this gives you a few concrete options for where to place resources.
DEVICE_LOCAL
: Fast GPU memory, but you need to copy data into place as it cannot be mapped.
HOST_VISIBLE
: The GPU reads this memory directly from the host over PCIe. This can be a good
choice for direct placement of data, relying on GPU side caching for performance. Or as a place
from which you copy resources into device-local memory. Write-combined.
HOST_VISIBLE | HOST_CACHED
: With the addition of host-cached, mapped memory utilizes the CPU
cache hierarchy and is not write-combined. This means CPU random access reads are practical, but
at the cost of write performance.
DEVICE_LOCAL | HOST_VISIBLE
: For integrated graphics, or dedicated devices which support PCIe
resizable BAR, this represents a CPU mappable view of the entire GPU memory. A small 256 MiB
region may be available even in the absence of ReBAR. Write-combined.
DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED
: Same as above, but instead of uncached reads and
write combining, we go through the CPU cache hierarchy for CPU side reads and writes, and the GPU
snoops CPU caches for reads.
Some other integrated GPUs' HOST_VISIBLE memory types will work similarly to Intel integrated, with fast GPU access, but they might not actually put the DEVICE_LOCAL flag on the memory.

With ReBAR or integrated GPUs, staging copies can often be skipped entirely, instead placing resources directly into their final location. Usage of the smaller 256 MiB BAR memory region must be managed more carefully, as the driver will use it for internal allocations and over-subscription can be problematic.
Care must be taken when directly mapping dedicated device memory using the DEVICE_LOCAL | HOST_VISIBLE memory types. Not only is it vitally important to avoid reading from uncached memory on the CPU, it's also necessary to avoid random write patterns, partial writes of cache lines, large strides, and anything else that would close write-combining buffers during a write. Otherwise it's likely you will end up with very low bandwidth. It may be better in some cases to use regular host-visible memory, or to introduce extra staging copies to keep accesses linear. It's also possible that different CPU microarchitectures will perform significantly differently, so testing your specific workloads is going to be important for maximum performance. For AMD guidelines, see this article about GPU_UPLOAD, which is the way DX12 exposes the same feature.
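As a rule of thumb, treat such mappings as write-only and stream data in with a single forward copy. A tiny, hypothetical example:

// Hypothetical upload into a mapped DEVICE_LOCAL | HOST_VISIBLE (write-combined)
// region: one forward copy, and never read `dst` back on the CPU.
fn upload_to_bar(dst: &mut [u8], src: &[u8]) {
    assert!(src.len() <= dst.len());
    // A single sequential memcpy keeps the write-combining buffers streaming;
    // scattered or partial cache-line writes would flush them constantly.
    dst[..src.len()].copy_from_slice(src);
}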
For our purposes we're mostly going to avoid being clever and pick the first suitable memory type. However, it might be worth investigating using ReBAR memory for transient allocations, for example.
In order to allocate a Vulkan object, it's necessary to query the Vulkan driver for the object's memory requirements. The functions vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements yield a VkMemoryRequirements structure which contains the resource's size and alignment needs, as well as a memory type mask indicating which memory types are usable.
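For illustration, using the ash crate rather than this library's own wrappers, the query for a buffer looks roughly like this:

// Illustrative only, via the `ash` crate: query what the driver needs for
// this buffer before choosing a memory type and sub-allocating.
let requirements = unsafe { device.get_buffer_memory_requirements(buffer) };
let size = requirements.size; // bytes, including any driver padding
let alignment = requirements.alignment; // required offset alignment
let memory_type_bits = requirements.memory_type_bits; // usable memory types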
We'll create a TLSF instance for each memory type, then scan for a suitable memory type using the memory type mask to filter out invalid options.
We do this in three passes:
If everything fails in all passes, allocation will fail. A production-ready implementation would return failure to the API user, as it's possible the app itself could release some memory before retrying.
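As an illustration of the general shape of such a search, here is a hedged sketch using a simple required/preferred fallback over VkMemoryPropertyFlags bits; it is not the allocator's actual three-pass logic, and the names are hypothetical:

// Hypothetical memory-type selection: `memory_types` is flattened from
// vkGetPhysicalDeviceMemoryProperties into (property flags, heap index) pairs,
// and `memory_type_bits` comes from VkMemoryRequirements.
fn choose_memory_type(
    memory_types: &[(u32, u32)],
    memory_type_bits: u32,
    required: u32,
    preferred: u32,
) -> Option<usize> {
    // First pass wants required and preferred flags, second pass required only.
    for wanted in [required | preferred, required] {
        for (index, &(flags, _heap)) in memory_types.iter().enumerate() {
            let allowed = (memory_type_bits & (1 << index)) != 0;
            if allowed && (flags & wanted) == wanted {
                return Some(index);
            }
        }
    }
    None
}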
Some Vulkan implementations prefer to control the placement of certain objects internally, avoiding sub-allocation entirely. Typically this is used for things like framebuffers, and may improve performance. When querying memory requirements, it's possible to also retrieve a VkMemoryDedicatedRequirements structure indicating whether the resource should be allocated separately or not.
When allocating via the dedicated allocation path we obviously don't need the TLSF sub-allocator, but we do still need to track allocations so we can clean up and do error checking. Dedicated allocations should be a relatively rare code-path, so for this we'll maintain a simple hash set of VkDeviceMemory allocations per memory type.
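A minimal sketch of that bookkeeping, storing the VkDeviceMemory handle as a raw u64 and using hypothetical names:

use std::collections::HashSet;

// Hypothetical per-memory-type state; dedicated allocations bypass TLSF but
// are still recorded for leak checks and clean-up at device destruction.
struct MemoryTypeState {
    dedicated: HashSet<u64>, // raw VkDeviceMemory handles
}

impl MemoryTypeState {
    fn track_dedicated(&mut self, memory: u64) {
        let inserted = self.dedicated.insert(memory);
        debug_assert!(inserted, "dedicated allocation tracked twice");
    }

    fn untrack_dedicated(&mut self, memory: u64) {
        let removed = self.dedicated.remove(&memory);
        debug_assert!(removed, "freeing an unknown dedicated allocation");
    }
}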
Vulkan defines a limit, maxMemoryAllocationSize. This one is pretty easy to deal with: we just need to ensure that the block size we use when allocating backing memory for the TLSF allocators is less than this value.
A conforming Vulkan implementation must support allocations of at least 1 GiB.
Vulkan defines a limit, maxMemoryAllocationCount: we count the total number of device memory allocations and fail to allocate if the limit would be exceeded.
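Keeping that count correct across threads only needs an atomic counter checked against the device limit. A small sketch, assuming the limit was read from VkPhysicalDeviceLimits at device creation:

use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical guard against maxMemoryAllocationCount.
struct AllocationCounter {
    count: AtomicU32,
    max: u32, // VkPhysicalDeviceLimits::maxMemoryAllocationCount
}

impl AllocationCounter {
    /// Returns false if another vkAllocateMemory call would exceed the device
    /// limit; the caller should fail the allocation instead.
    fn try_acquire(&self) -> bool {
        let prev = self.count.fetch_add(1, Ordering::Relaxed);
        if prev >= self.max {
            self.count.fetch_sub(1, Ordering::Relaxed);
            return false;
        }
        true
    }

    /// Called when a VkDeviceMemory allocation is freed.
    fn release(&self) {
        self.count.fetch_sub(1, Ordering::Relaxed);
    }
}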
Vulkan defines a limit, bufferImageGranularity. This one is very confusing. Firstly, it does not apply to buffers and images as such, but to linear and non-linear resources; that is, between images with optimal tiling and everything else. The function of this limit is to account for certain GPUs associating compression metadata directly with memory addresses at some larger page granularity. An uncompressed resource partially allocated in memory that has compression metadata associated with it could cause lots of issues. Other GPUs don't maintain compression information in this way, so they have no such issues and report 1 for bufferImageGranularity.
Our approach is simple. If bufferImageGranularity exceeds the guaranteed alignment of our TLSF instances, we move non-linear allocations into their own separate TLSF instance, plus a bit of bookkeeping to make sure we can free allocations correctly.
// If the allocation is smaller than the TLSF super-block size for this
// allocation type, we should attempt sub-allocation.
if size <= memory_heap.tlsf_super_block_size {
    let (allocated_in_non_linear, mut tlsf) = if (VULKAN_CONSTANTS
        .tlsf_force_segregated_non_linear_allocator
        || self.allocator.use_segregated_non_linear_allocator)
        && non_linear
    {
        (true, memory_type.tlsf_non_linear.lock())
    } else {
        (false, memory_type.tlsf.lock())
    };

    // ...
}
Vulkan defines a limit, minMemoryMapAlignment. We avoid dealing with this by persistently mapping all host-visible memory, rather than exposing memory mapping functions directly.
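For illustration with the ash crate: map each VkDeviceMemory block once, right after allocation, and keep the pointer for the block's lifetime.

// Illustrative only (ash crate): the mapping lives as long as the block, and
// sub-allocations are handed out as offsets from this base pointer, so
// vkMapMemory is never called with arbitrary offsets or sizes.
let base_ptr = unsafe {
    device
        .map_memory(memory, 0, vk::WHOLE_SIZE, vk::MemoryMapFlags::empty())
        .expect("vkMapMemory failed")
} as *mut u8;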
It's possible to create resources with the flag VK_IMAGE_CREATE_ALIAS_BIT and then bind a region of memory simultaneously to multiple resources, so long as you pinkie promise not to access those resources concurrently in a way that would make Vulkan sad. Please check your friendly Vulkan specification for the exact requirements here.
Aliased memory can help reduce your memory footprint by sharing memory between resources that you want to exist concurrently (avoiding the overhead of more dynamic allocation), but won't ever access concurrently. For example, if you want to pre-allocate all attachments used by a complex rendering pipeline, but don't want to dedicate unique memory blocks to attachments that are never used at the same time.
Left as an exercise for the reader.
It's possible to end up in a situation where you have a lot of super-blocks assigned to the TLSF sub-allocators, but which are under-utilised because all but a small number of the allocations that were serviced by that super-block have been freed.
Defragmentation looks at statistics for super-blocks, and tries to move the objects associated with under-utilised blocks into a smaller number of densely packed blocks to allow the allocator to return more unused memory to the system.
It requires that the defragmenter be able to map from allocated regions back to their resources, and then re-create those resources in a new location. It usually runs incrementally to avoid large stalls.
See the Vulkan Memory Allocator documentation on the subject: Vulkan Memory Allocator Defragmentation.
Left as an exercise for the reader.
DirectX 12 has a set of APIs, SetResidencyPriority, MakeResident and Evict, plus some DXGI APIs, QueryVideoMemoryInfo and RegisterVideoMemoryBudgetChangeNotificationEvent, and a sample library, that together allow some level of coordination between applications and vidmm, Windows' Video Memory Manager. Without feedback from the application, it's hard for the OS to know which regions of memory are most important to an application at any given time (you really don't want to spill your render targets to system memory on a PC, for example).
Vulkan's support here is far more limited. Out of the box you get a per-heap size, and need to guess a suitable budget from that. There's an extension, VK_EXT_memory_priority, which provides static residency priorities per allocation, and a more sophisticated, less widely supported extension, VK_EXT_pageable_device_local_memory, which allows an application to query a dynamic budget for each heap and to dynamically adjust allocation residency priorities.
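As an illustration (using the ash crate; block_size and memory_type_index are placeholders), the static priority from VK_EXT_memory_priority is just an extra struct chained onto the allocation:

use std::ffi::c_void;

// Illustrative only: 0.0 is the lowest priority, 1.0 the highest. Requires the
// VK_EXT_memory_priority device extension to be enabled.
let priority_info = vk::MemoryPriorityAllocateInfoEXT {
    priority: 0.75,
    ..Default::default()
};
let alloc_info = vk::MemoryAllocateInfo {
    p_next: &priority_info as *const _ as *const c_void,
    allocation_size: block_size,
    memory_type_index,
    ..Default::default()
};
let memory = unsafe { device.allocate_memory(&alloc_info, None) }
    .expect("vkAllocateMemory failed");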
Good luck.
This was an attempt at going over the set of information required to get a functional Vulkan Memory Allocator At Home. I'll hopefully expand this over time, if I shave enough yaks to have a meaningful example of aliasing and defragmentation. But failing that, scoping out the real VMA source is a good way to go.
The source code for the allocator discussed here is available on github6.