Vulkan's built-in allocation functionality is extremely limited, with the specification only requiring that implementations support as few as 4096 individual allocations. There is no control over alignment, so all allocated memory will be conservatively aligned for the worst case of all resource types. These limitations typically require the user to implement some form of memory allocator to sub-allocate resources from larger memory blocks.
In this adventure, rather than doing the smart thing and using VMA1, we'll be rolling our own from scratch!
1 The Actual Allocator
1.1 TLSF
1.2 Per-frame Bump Allocator
2 Vulkan Messing
2.1 Memory Requirements
2.2 Dedicated Allocations
2.3 Allocation Size Limit
2.4 Memory Allocation Limit
2.5 Alignment 2: Buffer Image Granularity
2.6 Memory Mapping
3 Advanced Vulkan Messing
3.1 Aliasing
3.2 Defragmentation
3.3 Resident Set Management
4 Conclusion
5 Links
6 Changelog
In the simplest sense, an allocator needs to keep track of which regions of memory are available for use, and which regions are unavailable. It needs to do so without significant per-allocation overhead, while avoiding expensive searches during allocation and deallocation. There are lots of approaches here, but we'll be using TLSF23.
TLSF (Two-Level Segregated Fit) is a simple, fast, constant-time allocation algorithm designed primarily for embedded use-cases. Empty blocks are stored in a series of free-lists bucketed by their size, using a linear-log45 sequence to minimise loss due to fragmentation. A two-level bitmap acceleration structure tracks which free-lists contain blocks, allowing searches to complete with a couple of bit instructions widely supported on modern CPUs, rather than requiring a linear scan over the free-list heads.
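To make the bucketing concrete, here's a minimal sketch of a linear-log size-to-bin mapping in the style TLSF uses. The constants and names are illustrative rather than this allocator's actual code, and this variant rounds down; allocation needs a round-up twin like the Bin::from_size_round_up used below.

// Illustrative linear-log binning: sizes below 2^LINEAR_LOG2 map linearly into
// bin zero, larger sizes use floor(log2(size)) for the bin and the next few
// bits below the top bit for the sub-bin.
const SUB_BINS_LOG2: u32 = 5; // 32 sub-bins per bin
const SUB_BINS: u32 = 1 << SUB_BINS_LOG2;
const LINEAR_LOG2: u32 = SUB_BINS_LOG2;

fn bin_from_size_round_down(size: u32) -> (u32, u32) {
    if size < (1 << LINEAR_LOG2) {
        // Small sizes get one sub-bin each.
        (0, size)
    } else {
        // The position of the top set bit selects the bin...
        let log2 = 31 - size.leading_zeros();
        // ...and the SUB_BINS_LOG2 bits below it select the sub-bin.
        let sub_bin = (size >> (log2 - SUB_BINS_LOG2)) & (SUB_BINS - 1);
        (log2 - LINEAR_LOG2 + 1, sub_bin)
    }
}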
Allocation first scans the bitmap structure, and if a suitable block is found, immediately uses it for the allocation. When there would be significant left-over space, allocation splits the block and inserts the remainder back into the TLSF structure.
/// Search the acceleration structure for a non-empty list suitable for an
/// allocation of the given size.
///
/// Returns the rounded size and bin index if a non-empty list is found, or
/// None.
fn search_non_empty_bin(&self, size: u32) -> Option<(u32, Bin)> {
    // We need to find the bin which contains only empty-blocks large enough for the
    // given size because we unconditionally use the first empty block found. So
    // this must round up.
    let (rounded_size, starting_bin) = Bin::from_size_round_up(size);

    let mut bin = starting_bin.bin();
    let sub_bin = starting_bin.sub_bin();

    // First we scan the second-level bitmap from sub_bin, masking out the earlier
    // sub-bins so we don't end up returning a bin that's too small for the
    // allocation.
    let mut second_level = self.bitmap_1[bin.widen()] & (!0 << sub_bin);

    // If that search failed, then we must scan the first-level bitmap from the next
    // bin forward. If we find anything here it cannot possibly be smaller than the
    // requested allocation.
    if second_level == 0 {
        let first_level = self.bitmap_0 & (!0 << (bin + 1));

        // If that search also failed, there's no suitable blocks.
        if first_level == 0 {
            return None;
        }

        // Recalculate the bin from the first level bitmap.
        bin = first_level.trailing_zeros();
        second_level = self.bitmap_1[bin.widen()];
    }

    // Find the sub-bin from the second level bitmap.
    let sub_bin = second_level.trailing_zeros();

    Some((rounded_size, Bin::new(bin, sub_bin)))
}
Freeing allocated memory does some cursory validation of the allocation to prevent obvious double-free issues, and then returns the block to the TLSF structure. Adjacent free physical blocks are merged immediately.
pub fn free(&mut self, allocation: Allocation<T>) {
    let mut block_index = allocation.block_index;
    let generation = self.blocks[block_index].generation;
    assert_eq!(generation, allocation.generation, "double-free");
    self.blocks[block_index].generation = generation.wrapping_add(1);

    // Merge next block into the current block.
    {
        let into_block_index = block_index;
        let from_block_index = self.blocks[block_index].phys_link.next;
        if self.can_merge_block_left(into_block_index, from_block_index) {
            let from_size = self.blocks[from_block_index].size;
            self.extract_block(from_block_index);
            list_unlink!(self.blocks, phys_link, from_block_index);
            self.recycle_block(from_block_index);
            self.blocks[into_block_index].size += from_size;
        }
    }

    // Merge current block into the prev block.
    {
        let into_block_index = self.blocks[block_index].phys_link.prev;
        let from_block_index = block_index;
        if self.can_merge_block_left(into_block_index, from_block_index) {
            let from_size = self.blocks[from_block_index].size;
            self.extract_block(into_block_index);
            list_unlink!(self.blocks, phys_link, from_block_index);
            self.recycle_block(from_block_index);
            self.blocks[into_block_index].size += from_size;
            block_index = into_block_index;
        }
    }

    // Insert the merged free block.
    self.insert_block(block_index);
}
Once per frame we scan the super-blocks of all TLSF instances, freeing any empty super-blocks immediately. In a more production-ready setup this work would be amortized across a longer time period to avoid blocking the frame.
This doesn't strictly fall within the scope of the memory allocator, but we're also going to support automatically managed, per-frame, thread-local bump allocators for uniforms and other transient data. This should remove a lot of pressure from the general allocator, and avoids needing to take a lock to allocate.
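Each thread-local block can then be managed with a plain offset bump. A minimal sketch, with hypothetical names rather than the library's actual types:

// Hypothetical per-frame bump allocator over one persistently mapped block of
// `size` bytes; the real implementation's details will differ.
struct BumpAllocator {
    offset: u64,
    size: u64,
}

impl BumpAllocator {
    /// Try to reserve `size` bytes with the given power-of-two alignment,
    /// returning the offset into the block, or None if the block is exhausted.
    fn allocate(&mut self, size: u64, align: u64) -> Option<u64> {
        debug_assert!(align.is_power_of_two());
        let offset = (self.offset + align - 1) & !(align - 1);
        if offset + size > self.size {
            return None; // caller falls back to the regular TLSF path
        }
        self.offset = offset + size;
        Some(offset)
    }

    /// Called once per frame, after the frame's fence has signalled.
    fn reset(&mut self) {
        self.offset = 0;
    }
}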
If an allocation request is too large for the bump allocator's block size, then we instead allocate via the regular TLSF allocation path, and immediately queue the object for destruction with the frame.
let texture: &[u8] = _;

let buffer = device.request_transient_buffer_with_data(
    &frame,
    &thread_token,
    BufferUsageFlags::TRANSFER,
    texture,
);

device.cmd_copy_buffer_to_image(
    &mut cmd_buffer,
    buffer.into(),
    image,
    ImageLayout::Optimal,
    &[BufferImageCopy {
        buffer_offset: 0,
        buffer_row_length: 0,
        buffer_image_height: 0,
        image_subresource: default(),
        image_offset: Offset3d { x: 0, y: 0, z: 0 },
        image_extent: Extent3d {
            width,
            height,
            depth: 1,
        },
    }],
);
Rust lifetimes prevent misuse by stopping transient buffer handles from living beyond a single frame. Each frame, the linear allocators are reset and the buffers recycled (after the GPU fence for that frame has been signalled).
In the future the library will track how much memory the linear allocators use each frame over some longer window, and release any excess buffers. Otherwise a single frame that uses a large amount of memory (say, during a load screen) would leave that memory stuck in the transient allocator pools forever, unable to be used for other purposes.
Most of the complexity of a Vulkan memory allocator is, perhaps unsurprisingly, the Vulkan part, which, in fairness, is largely due to the inherently complex range of system topologies and access types that Vulkan must expose coherently.
Vulkan uses a two-level abstraction for memory. First, memory heaps represent areas of memory that are individually budgeted; second, memory types define specific sets of properties with which memory can be allocated from a given heap.
At a basic level there are a few different memory properties you need to care about.
DEVICE_LOCAL
: Is this memory “fast” to access from the GPU.
HOST_VISIBLE
: Is this memory accessible from the CPU.
HOST_CACHED
: Are CPU accesses to this memory cached on the host side. Uncached memory has
very poor read performance, and is typically write-combined. Cached memory interacts with host
caches normally.
It's possible for memory to be HOST_VISIBLE but not HOST_COHERENT, where it is necessary to manually call vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges. As far as I know, all relevant desktop devices expose host-coherent host-visible memory types.

From an application perspective, this gives you a few concrete options for where to place resources.
DEVICE_LOCAL
: Fast GPU memory, but you need to copy data into place as it cannot be mapped.
HOST_VISIBLE
: The GPU reads this memory directly from the host over PCIe. This can be a good
choice for direct placement of data, relying on GPU side caching for performance. Or as a place
from which you copy resources into device-local memory. Write-combined.
HOST_VISIBLE | HOST_CACHED
: With the addition of host-cached, mapped memory utilizes the CPU
cache hierarchy and is not write-combined. This means CPU random access reads are practical, but
at the cost of write performance.
DEVICE_LOCAL | HOST_VISIBLE
: For integrated graphics, or dedicated devices which support PCIe
resizable BAR, this represents a CPU mappable view of the entire GPU memory. A small 256 MiB
region may be available even in the absence of ReBAR. Write-combined.
DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED
: Same as above, but instead of uncached reads and
write combining, we go through the CPU cache hierarchy for CPU side reads and writes, and the GPU
snoops CPU caches for reads.
Some other integrated GPUs' HOST_VISIBLE memory types will work similarly to Intel integrated, with fast GPU access, but they might not actually put the DEVICE_LOCAL flag on the memory.

With ReBAR or integrated GPUs, staging copies can often be skipped entirely, instead placing resources directly into their final location. Usage of the smaller 256 MiB BAR memory region must be managed more carefully, as the driver will use it for internal allocations and over-subscription can be problematic.
Care must be taken when directly mapping dedicated device memory using the DEVICE_LOCAL | HOST_VISIBLE memory types. Not only is it vitally important to avoid reading from uncached memory on the CPU, it's also necessary to avoid random write patterns, partial writes of cache lines, large strides, and anything else that would close write-combining buffers during a write. Otherwise it's likely you will end up with very low bandwidth. It may be better in some cases to use regular host-visible memory, or to introduce extra staging copies to keep accesses linear. It's also possible that different CPU microarchitectures will perform significantly differently, so testing your specific workloads is going to be important for maximum performance. For AMD guidelines, see this article about GPU_UPLOAD, which is the way DX12 exposes the same feature.
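As a rule of thumb, treat such mappings as write-only and stream data in with a single forward copy. A tiny, hypothetical example:

// Hypothetical upload into a mapped DEVICE_LOCAL | HOST_VISIBLE (write-combined)
// region: one forward copy, and never read `dst` back on the CPU.
fn upload_to_bar(dst: &mut [u8], src: &[u8]) {
    assert!(src.len() <= dst.len());
    // A single sequential memcpy keeps the write-combining buffers streaming;
    // scattered or partial cache-line writes would flush them constantly.
    dst[..src.len()].copy_from_slice(src);
}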
For our purposes we're mostly going to avoid being clever and pick the first suitable memory type. However, it might be worth investigating using ReBAR memory for transient allocations, for example.
In order to allocate a Vulkan object, it's necessary to query the Vulkan driver for the object's memory requirements. The functions vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements yield a VkMemoryRequirements structure which contains the resource's size and alignment needs, as well as a memory type mask indicating which memory types are usable.
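For illustration, using the ash crate rather than this library's own wrappers, the query for a buffer looks roughly like this:

// Illustrative only, via the `ash` crate: query what the driver needs for
// this buffer before choosing a memory type and sub-allocating.
let requirements = unsafe { device.get_buffer_memory_requirements(buffer) };
let size = requirements.size; // bytes, including any driver padding
let alignment = requirements.alignment; // required offset alignment
let memory_type_bits = requirements.memory_type_bits; // usable memory types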
We'll create a TLSF instance for each memory type, then scan for a suitable memory type using the memory type mask to filter out invalid options.
We do this in three passes:
If everything fails in all passes, allocation will fail. A production-ready implementation would return failure to the API user, as it's possible the app itself could release some memory before retrying.
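As an illustration of the general shape of such a search, here is a hedged sketch using a simple required/preferred fallback over VkMemoryPropertyFlags bits; it is not the allocator's actual three-pass logic, and the names are hypothetical:

// Hypothetical memory-type selection: `memory_types` is flattened from
// vkGetPhysicalDeviceMemoryProperties into (property flags, heap index) pairs,
// and `memory_type_bits` comes from VkMemoryRequirements.
fn choose_memory_type(
    memory_types: &[(u32, u32)],
    memory_type_bits: u32,
    required: u32,
    preferred: u32,
) -> Option<usize> {
    // First pass wants required and preferred flags, second pass required only.
    for wanted in [required | preferred, required] {
        for (index, &(flags, _heap)) in memory_types.iter().enumerate() {
            let allowed = (memory_type_bits & (1 << index)) != 0;
            if allowed && (flags & wanted) == wanted {
                return Some(index);
            }
        }
    }
    None
}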
Some Vulkan implementations prefer to control the placement of certain objects internally, avoiding sub-allocation entirely. Typically this is used for things like framebuffers, and may improve performance. When querying memory requirements, it's possible to also retrieve a VkMemoryDedicatedRequirements structure indicating whether the resource should be allocated separately or not.
When allocating via the dedicated allocation path we obviously don't need the TLSF sub-allocator, but we do still need to track allocations so we can clean up and do error checking. Dedicated allocations should be a relatively rare code-path, so for this we'll maintain a simple hash set of VkDeviceMemory allocations per memory type.
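A minimal sketch of that bookkeeping, storing the VkDeviceMemory handle as a raw u64 and using hypothetical names:

use std::collections::HashSet;

// Hypothetical per-memory-type state; dedicated allocations bypass TLSF but
// are still recorded for leak checks and clean-up at device destruction.
struct MemoryTypeState {
    dedicated: HashSet<u64>, // raw VkDeviceMemory handles
}

impl MemoryTypeState {
    fn track_dedicated(&mut self, memory: u64) {
        let inserted = self.dedicated.insert(memory);
        debug_assert!(inserted, "dedicated allocation tracked twice");
    }

    fn untrack_dedicated(&mut self, memory: u64) {
        let removed = self.dedicated.remove(&memory);
        debug_assert!(removed, "freeing an unknown dedicated allocation");
    }
}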
Vulkan defines a limit, maxMemoryAllocationSize. This one is pretty easy to deal with: we just need to ensure that the block size we use when allocating backing memory for the TLSF allocators is less than this value.
A conforming Vulkan implementation must support allocations of at least 1 GiB.
Vulkan defines a limit, maxMemoryAllocationCount: we count the total number of device memory allocations and fail to allocate if the limit would be exceeded.
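Keeping that count correct across threads only needs an atomic counter checked against the device limit. A small sketch, assuming the limit was read from VkPhysicalDeviceLimits at device creation:

use std::sync::atomic::{AtomicU32, Ordering};

// Hypothetical guard against maxMemoryAllocationCount.
struct AllocationCounter {
    count: AtomicU32,
    max: u32, // VkPhysicalDeviceLimits::maxMemoryAllocationCount
}

impl AllocationCounter {
    /// Returns false if another vkAllocateMemory call would exceed the device
    /// limit; the caller should fail the allocation instead.
    fn try_acquire(&self) -> bool {
        let prev = self.count.fetch_add(1, Ordering::Relaxed);
        if prev >= self.max {
            self.count.fetch_sub(1, Ordering::Relaxed);
            return false;
        }
        true
    }

    /// Called when a VkDeviceMemory allocation is freed.
    fn release(&self) {
        self.count.fetch_sub(1, Ordering::Relaxed);
    }
}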
Vulkan defines a limit, bufferImageGranularity. This one is very confusing. Firstly, it does not apply to buffers and images as such, but to linear and non-linear resources; that is, between images with optimal tiling and everything else. The function of this limit is to account for certain GPUs associating compression metadata directly with memory addresses at some larger page granularity. An uncompressed resource partially allocated in memory that has compression metadata associated with it could cause lots of issues. Other GPUs don't maintain compression information in this way, so they have no such issues and report 1 for bufferImageGranularity.
Our approach is simple. If bufferImageGranularity exceeds the guaranteed alignment of our TLSF instances, we move non-linear allocations into their own separate TLSF instance, plus a bit of bookkeeping to make sure we can free allocations correctly.
// If the allocation is smaller than the TLSF super-block size for this
// allocation type, we should attempt sub-allocation.
if size <= memory_heap.tlsf_super_block_size {
    let (allocated_in_non_linear, mut tlsf) = if (VULKAN_CONSTANTS
        .tlsf_force_segregated_non_linear_allocator
        || self.allocator.use_segregated_non_linear_allocator)
        && non_linear
    {
        (true, memory_type.tlsf_non_linear.lock())
    } else {
        (false, memory_type.tlsf.lock())
    };

    // ...
}
Vulkan defines a limit, minMemoryMapAlignment. We avoid dealing with this by persistently mapping all host-visible memory, rather than exposing memory mapping functions directly.
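For illustration with the ash crate: map each VkDeviceMemory block once, right after allocation, and keep the pointer for the block's lifetime.

// Illustrative only (ash crate): the mapping lives as long as the block, and
// sub-allocations are handed out as offsets from this base pointer, so
// vkMapMemory is never called with arbitrary offsets or sizes.
let base_ptr = unsafe {
    device
        .map_memory(memory, 0, vk::WHOLE_SIZE, vk::MemoryMapFlags::empty())
        .expect("vkMapMemory failed")
} as *mut u8;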
It's possible to create resources with the flag VK_IMAGE_CREATE_ALIAS_BIT and then bind a region of memory simultaneously to multiple resources, so long as you pinkie promise not to access those resources concurrently in a way that would make Vulkan sad. Please check your friendly Vulkan specification for the exact requirements here.
Aliased memory can help reduce your memory footprint by sharing memory between resources that you want to exist concurrently (avoiding the overhead of more dynamic allocation), but won't ever access concurrently. For example, if you want to pre-allocate all attachments used by a complex rendering pipeline, but don't want to dedicate unique memory blocks to attachments that are never used at the same time.
Left as an exercise for the reader.
It's possible to end up in a situation where you have a lot of super-blocks assigned to the TLSF sub-allocators, but which are under-utilised because all but a small number of the allocations that were serviced by that super-block have been freed.
Defragmentation looks at statistics for super-blocks, and tries to move the objects associated with under-utilised blocks into a smaller number of densely packed blocks to allow the allocator to return more unused memory to the system.
It requires that the defragmenter be able to map from allocated regions back to their resources, and then re-create those resources in a new location. It usually runs incrementally to avoid large stalls.
See the Vulkan Memory Allocator documentation on the subject: Vulkan Memory Allocator Defragmentation.
Left as an exercise for the reader.
DirectX 12 has a set of APIs, SetResidencyPriority, MakeResident and Evict, plus some DXGI APIs, QueryVideoMemoryInfo and RegisterVideoMemoryBudgetChangeNotificationEvent, and a sample library, that together allow some level of coordination between applications and vidmm, Windows' Video Memory Manager. Without feedback from the application, it's hard for the OS to know which regions of memory are most important to an application at any given time (you really don't want to spill your render targets to system memory on a PC, for example).
Vulkan's support here is far more limited. Out of the box you get a per-heap size, and need to guess a suitable budget from that. There's an extension, VK_EXT_memory_priority, which provides static residency priorities per allocation, and a more sophisticated, less widely supported extension, VK_EXT_pageable_device_local_memory, which allows an application to query a dynamic budget for each heap and to dynamically adjust allocation residency priorities.
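As an illustration (using the ash crate; block_size and memory_type_index are placeholders), the static priority from VK_EXT_memory_priority is just an extra struct chained onto the allocation:

use std::ffi::c_void;

// Illustrative only: 0.0 is the lowest priority, 1.0 the highest. Requires the
// VK_EXT_memory_priority device extension to be enabled.
let priority_info = vk::MemoryPriorityAllocateInfoEXT {
    priority: 0.75,
    ..Default::default()
};
let alloc_info = vk::MemoryAllocateInfo {
    p_next: &priority_info as *const _ as *const c_void,
    allocation_size: block_size,
    memory_type_index,
    ..Default::default()
};
let memory = unsafe { device.allocate_memory(&alloc_info, None) }
    .expect("vkAllocateMemory failed");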
Good luck.
This was an attempt at going over the set of information required to get a functional Vulkan Memory Allocator At Home. I'll hopefully expand this over time, if I shave enough yaks to have a meaningful example of aliasing and defragmentation. But failing that, scoping out the real VMA source is a good way to go.
The source code for the allocator discussed here is available on github6.