**We've Got A Vulkan Memory Allocator At Home**

Vulkan's built-in allocation functionality is extremely limited, with the specification only requiring an implementation to support up to 4096 individual allocations. There is also no control over alignment, so all allocated memory will be conservatively aligned for the worst case of all resource types. These limitations typically require the user to implement some form of memory allocator to sub-allocate resources from larger memory blocks. In this adventure, rather than doing the smart thing and using VMA [^vulkan-memory-allocator], we'll be rolling our own from scratch!

The Actual Allocator
===

In the simplest sense, an allocator needs to keep track of which regions of memory are available for use, and which regions are unavailable. It needs to do so without significant per-allocation overhead, and while avoiding expensive searches during allocation and deallocation. There are lots of approaches here, but we'll be using TLSF [^tlsf-paper][^narcissus-tlsf].

!!! Most allocators for system memory store metadata in-situ with the allocated memory, but this doesn't fit well when sub-allocating memory that's potentially sitting on a dedicated GPU where CPU access is impossible or very slow. Moving metadata out-of-line is usually a straightforward process.

TLSF
---

TLSF (Two-Level Segregated Fit) is a simple, fast, constant-time allocation algorithm designed primarily for embedded use-cases. Empty blocks are stored in a series of free-lists bucketed by their size, using a linear-log [^linear-log-bucketing][^narcissus-linear-log-binning] sequence to minimise loss due to fragmentation. A two-level bitmap acceleration structure tracks which free-lists contain blocks, allowing searches to complete using a couple of bit instructions widely supported by modern CPUs, rather than requiring a linear scan over the free-list heads.

![Figure [tlsf-bitmap]: Visualization of the TLSF acceleration bitmaps. The first column shows the first-level bitmap, with 1-bits as orange squares. The larger grid similarly shows the second-level bitmap. Bins, represented by a row, have power-of-two sizes, excluding the first "linear" bin. Each bin is further sub-divided into sub-bins.](tlsf-bitmap.svg)

Allocation first scans the bitmap structure and, if a suitable block is found, immediately uses that block for the allocation. When there would be significant left-over space, allocation splits the block and inserts the remainder back into the TLSF structure.

[Listing [search_non_empty_bin]: Function for scanning the TLSF acceleration bitmaps for a non-empty bin suitable for allocation.]
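Since the listing isn't reproduced inline here, the following is a minimal sketch of the idea, assuming a 32×32 bin layout and a `Bitmaps` struct invented for illustration; the requested size is assumed to already be rounded up to its bin, so any block in a matching bin is guaranteed to fit.

```rust
/// A sketch only: `Bitmaps` and the 32x32 bin layout are invented for
/// illustration, not taken from the engine source.
struct Bitmaps {
    /// Bit `i` is set when any sub-bin of first-level bin `i` is non-empty.
    first_level: u32,
    /// Bit `j` of word `i` is set when sub-bin `(i, j)` is non-empty.
    second_level: [u32; 32],
}

impl Bitmaps {
    /// Returns the indices of the first non-empty bin no smaller than the
    /// requested `(fl, sl)` bin, or `None` if no suitable block exists.
    fn search_non_empty_bin(&self, fl: u32, sl: u32) -> Option<(u32, u32)> {
        // First look for a non-empty sub-bin at or above `sl` within the
        // requested first-level bin.
        let sl_map = self.second_level[fl as usize] & u32::MAX.checked_shl(sl).unwrap_or(0);
        if sl_map != 0 {
            return Some((fl, sl_map.trailing_zeros()));
        }
        // Otherwise take the smallest sub-bin of the next non-empty
        // first-level bin strictly above `fl`.
        let fl_map = self.first_level & u32::MAX.checked_shl(fl + 1).unwrap_or(0);
        if fl_map == 0 {
            return None;
        }
        let fl = fl_map.trailing_zeros();
        Some((fl, self.second_level[fl as usize].trailing_zeros()))
    }
}
```

The `trailing_zeros` calls are the "couple of bit instructions" doing the work: each lowers to a single count-trailing-zeros instruction on modern CPUs.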
Freeing allocated memory performs some cursory validation of the allocation to catch obvious double-free issues, and then returns the block to the TLSF structure. Adjacent free physical blocks are merged immediately.

[Listing [tlsf_free]: Freeing an allocation merges physically adjacent free blocks immediately.]

Once per frame we scan the super-blocks of all TLSF instances, freeing any empty super-blocks immediately. In a more production-ready setup this work would be amortized across a longer time period to avoid blocking the frame.

Per-frame Bump Allocator
---

It doesn't strictly fall within the scope of the memory allocator, but we're also going to support automatically managed per-frame thread-local bump allocators for uniform and other transient data. This should remove a lot of pressure from the general allocator and avoid taking a lock on every allocation. If an allocation request is too large for the bump allocator's block size, we instead allocate via the regular TLSF allocation path and immediately queue the object for destruction with the frame.

[Listing [allocate_transient_data]: Users can request transient buffers from a thread-local bump allocator. These are valid for the duration of the frame, and automatically manage memory behind the scenes.]

Rust lifetimes prevent misuse by stopping transient buffer handles from living beyond a single frame. Each frame the linear allocators are reset and the buffers recycled (after the GPU fence for that frame has been signalled).

In the future the library will track how much of the linear allocators' memory is actually used each frame over some longer period, and release any excess buffers. Otherwise a single frame could use a large amount of memory (say, during a load screen), and that memory would then be stuck in the transient allocator pools forever, unable to be used for other purposes.
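To make the bump allocation path concrete, here's a minimal sketch; `BumpBlock` is a name invented for this post, and the per-frame buffer recycling and the TLSF fall-back described above are elided.

```rust
use std::cell::Cell;

/// A sketch only: `BumpBlock` is invented for this post; block recycling
/// and the fall-back to the general allocator are elided.
struct BumpBlock {
    /// Size of the persistently-mapped chunk of buffer memory backing
    /// this thread's transient allocations.
    size: usize,
    /// Current bump offset; a plain `Cell` suffices because each block is
    /// owned by exactly one thread, so no atomics or locks are needed.
    offset: Cell<usize>,
}

impl BumpBlock {
    /// Reserve `size` bytes at the requested power-of-two alignment,
    /// returning the offset into the block, or `None` when the block is
    /// exhausted and the caller should take the general TLSF path.
    fn allocate(&self, size: usize, align: usize) -> Option<usize> {
        debug_assert!(align.is_power_of_two());
        let offset = self.offset.get().checked_add(align - 1)? & !(align - 1);
        let end = offset.checked_add(size)?;
        if end > self.size {
            return None;
        }
        self.offset.set(end);
        Some(offset)
    }

    /// Reset once the frame's GPU fence has signalled, invalidating every
    /// allocation made from this block in one step.
    fn reset(&self) {
        self.offset.set(0);
    }
}
```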
Vulkan Messing
===

Most of the complexity of a Vulkan memory allocator is, perhaps unsurprisingly, the Vulkan part, which, in fairness, is largely due to the inherently complex range of system topologies and access types that Vulkan must expose coherently.

Vulkan uses a two-level abstraction for memory. First, memory heaps represent areas of memory that are individually budgeted; second, memory types define the specific sets of properties with which memory can be allocated from a heap. At a basic level there are a few different memory properties you need to care about.

* **`DEVICE_LOCAL`**: Is this memory "fast" to access from the GPU.
* **`HOST_VISIBLE`**: Is this memory accessible from the CPU.
* **`HOST_CACHED`**: Are CPU accesses to this memory cached on the host side. Uncached memory has very poor read performance, and is typically write-combined. Cached memory interacts with host caches normally.

!!! If you care about mobile devices, you might also have to deal with memory types that are **`HOST_VISIBLE`** but not **`HOST_COHERENT`**, where it is necessary to manually call `vkFlushMappedMemoryRanges` and `vkInvalidateMappedMemoryRanges`. As far as I know, all relevant desktop devices expose host-coherent, host-visible memory types.

From an application perspective, this gives you a few concrete options for where to place resources.

* **`DEVICE_LOCAL`**: Fast GPU memory, but you need to copy data into place as it cannot be mapped.
* **`HOST_VISIBLE`**: The GPU reads this memory directly from the host over PCIe. This can be a good choice for direct placement of data, relying on GPU-side caching for performance, or as a place from which you copy resources into device-local memory. Write-combined.
* **`HOST_VISIBLE | HOST_CACHED`**: With the addition of `HOST_CACHED`, mapped memory utilizes the CPU cache hierarchy and is not write-combined. This means CPU random-access reads are practical, but at the cost of write performance.
* **`DEVICE_LOCAL | HOST_VISIBLE`**: For integrated graphics, or dedicated devices which support PCIe resizable BAR, this represents a CPU-mappable view of the entire GPU memory. A small 256 MiB region may be available even in the absence of ReBAR. Write-combined.
* **`DEVICE_LOCAL | HOST_VISIBLE | HOST_CACHED`**: Same as above, but instead of uncached reads and write combining, CPU-side reads and writes go through the CPU cache hierarchy, and the GPU snoops CPU caches for reads.

!!! For AMD APUs, all **`HOST_VISIBLE`** memory types will work similarly to Intel integrated graphics, with fast GPU access, but the driver might not actually put the **`DEVICE_LOCAL`** flag on the memory.

With ReBAR or integrated GPUs, staging copies can often be skipped entirely, instead placing resources directly into their final location. Usage of the smaller 256 MiB BAR memory region must be more carefully managed, as the driver will use it for internal allocations and over-subscription can be problematic.

Care must be taken when directly mapping dedicated device memory using the **`DEVICE_LOCAL | HOST_VISIBLE`** memory types. Not only is it vitally important to avoid reading from uncached memory on the CPU, it's also necessary to avoid random write patterns, partial writes of cache lines, large strides, and anything else that would close write-combining buffers during a write. Otherwise it's likely you will end up with very low bandwidth. In some cases it may be better to use regular host-visible memory, or to introduce extra staging copies to keep accesses linear. It's also possible that different CPU microarchitectures will perform significantly differently, so testing your specific workloads is going to be important for maximum performance. For AMD guidelines see [this][d3d12-heap-type-gpu-upload] article about GPU_UPLOAD, which is how DX12 exposes the same feature.

[d3d12-heap-type-gpu-upload]: https://gpuopen.com/learn/using-d3d12-heap-type-gpu-upload/

For our purposes we're mostly going to avoid being clever and pick the first suitable memory type. However, it might be worth investigating using ReBAR memory for transient allocations, for example.

Memory Requirements
---

In order to allocate a Vulkan object, it's necessary to query the Vulkan driver for the object's memory requirements. The functions `vkGetBufferMemoryRequirements` and `vkGetImageMemoryRequirements` yield a `VkMemoryRequirements` structure which contains the resource's size and alignment needs, as well as a memory type mask indicating which memory types are usable. We'll create a TLSF instance for each memory type, then scan for a suitable memory type using the memory type mask to filter out invalid options. We do this in three passes:

* Pass 1 - Tries all preferred memory types in order.
* Pass 2 - Clears the preferred memory types mask and retries.
* Pass 3 - Performs an emergency free of all empty super-blocks and retries.

If all passes fail, allocation fails. A production-ready implementation would return that failure to the API user, as it's possible the app itself could release some memory before retrying.
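Sketched below is one way that three-pass scan might be structured; `Allocator`, `Tlsf`, `try_allocate`, and `free_empty_super_blocks` are invented names standing in for the real allocator internals.

```rust
/// Illustrative only: these names stand in for the engine's internals.
struct Tlsf; // stand-in for a real TLSF instance

struct Allocator {
    /// One TLSF instance per Vulkan memory type. Vulkan guarantees at most
    /// 32 memory types (VK_MAX_MEMORY_TYPES), so u32 masks suffice.
    tlsf: Vec<Tlsf>,
}

impl Allocator {
    /// Sub-allocate from one memory type's TLSF instance, growing it with a
    /// new super-block from vkAllocateMemory if required. Elided here.
    fn try_allocate(&mut self, _memory_type: u32, _size: u64) -> Option<u64> {
        None
    }

    /// Emergency release of all empty super-blocks. Elided here.
    fn free_empty_super_blocks(&mut self) {}

    /// The three-pass scan: preferred types first, then any permitted type,
    /// then one more attempt after purging empty super-blocks.
    fn allocate(&mut self, memory_type_bits: u32, preferred_types: u32, size: u64) -> Option<u64> {
        for pass in 1..=3 {
            if pass == 3 {
                self.free_empty_super_blocks();
            }
            // Pass 1 restricts the scan to preferred memory types; passes 2
            // and 3 accept anything the resource's mask permits.
            let mask = if pass == 1 {
                memory_type_bits & preferred_types
            } else {
                memory_type_bits
            };
            for memory_type in 0..self.tlsf.len() as u32 {
                if mask & (1 << memory_type) != 0 {
                    if let Some(offset) = self.try_allocate(memory_type, size) {
                        return Some(offset);
                    }
                }
            }
        }
        None // surfaced to the caller in a production-ready implementation
    }
}
```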
Dedicated Allocations
---

Some Vulkan implementations would prefer to control the placement of certain objects internally, avoiding sub-allocation entirely. Typically this is used for things like framebuffers and may improve performance. When querying memory requirements, it's possible to also retrieve a `VkMemoryDedicatedRequirements` structure indicating whether the resource should be allocated separately. When allocating via the dedicated allocation path we obviously don't need the TLSF sub-allocator, but we do still need to track allocations so we can clean up and do error checking. Dedicated allocations should be a relatively rare code path, so for this we'll maintain a simple hash set of `VkDeviceMemory` allocations per memory type.

Allocation Size Limit
---

Vulkan defines a limit, `maxMemoryAllocationSize`. This one is pretty easy to deal with: we just need to ensure that the block size we use when allocating backing memory for the TLSF allocators is less than this value. A conforming Vulkan implementation must support allocations of at least 1 GiB.

Memory Allocation Limit
---

Vulkan defines a limit, `maxMemoryAllocationCount`. We count the total number of memory allocations and fail to allocate if the limit would be exceeded.

Alignment 2: Buffer Image Granularity
---

Vulkan defines a limit, `bufferImageGranularity`. This one is very confusing. Firstly, despite the name, it does *not* apply to buffers versus images, but to *linear* versus *non-linear* resources; that is, images with optimal tiling versus everything else. The function of this limit is to account for certain GPUs associating compression metadata directly with memory addresses at some larger page granularity. An uncompressed resource partially allocated in memory that has compression metadata associated with it could cause lots of issues. Other GPUs don't maintain compression information in this way, so they have no such issues and report 1 for `bufferImageGranularity`.

Our approach is simple. If `bufferImageGranularity` exceeds the guaranteed alignment of our TLSF instances, we move non-linear allocations into their own special TLSF instance, plus a bit of bookkeeping to make sure we can free allocations correctly.

[Listing [segregate_non_linear_allocations]: An excerpt from the allocate implementation where we decide per-allocation whether to use the non_linear or general TLSF instance.]
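A small sketch of that per-allocation decision, with invented names rather than the engine's actual fields:

```rust
// Invented names for illustration; the engine's actual fields differ.
enum ResourceKind {
    Linear,    // buffers and linearly-tiled images
    NonLinear, // optimally-tiled images
}

struct AllocatorLimits {
    buffer_image_granularity: u64,
    /// The minimum alignment our TLSF instances already guarantee.
    tlsf_guaranteed_alignment: u64,
}

impl AllocatorLimits {
    /// Decide whether an allocation must go to the segregated non-linear
    /// TLSF instance rather than the general one.
    fn use_non_linear_tlsf(&self, kind: &ResourceKind) -> bool {
        matches!(kind, ResourceKind::NonLinear)
            && self.buffer_image_granularity > self.tlsf_guaranteed_alignment
    }
}
```

When the TLSF alignment already meets or exceeds `bufferImageGranularity`, everything shares the general instance and no segregation is needed.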
Memory Mapping
---

Vulkan defines a limit, `minMemoryMapAlignment`. We avoid dealing with this by persistently mapping all host-visible memory, rather than exposing memory mapping functions directly.

Advanced Vulkan Messing
===

Aliasing
---

It's possible to create resources with the flag `VK_IMAGE_CREATE_ALIAS_BIT` and then have a region of memory simultaneously bound to multiple resources, so long as you pinkie promise not to access those resources concurrently in a way that would make Vulkan sad. Please check your friendly Vulkan specification for the exact requirements here.

Aliased memory can help reduce your memory footprint by sharing memory between resources that you want to exist concurrently (avoiding the overhead of more dynamic allocation), but won't ever access concurrently. For example, you might want to pre-allocate all attachments used by a complex rendering pipeline without dedicating unique memory blocks to attachments that are never used at the same time.

Left as an exercise for the reader.

Defragmentation
---

It's possible to end up in a situation where you have a lot of super-blocks assigned to the TLSF sub-allocators, but which are under-utilised because all but a small number of the allocations serviced by each super-block have been freed. Defragmentation looks at statistics for super-blocks, and tries to move the objects associated with under-utilised blocks into a smaller number of densely packed blocks, allowing the allocator to return more unused memory to the system. It requires that the defragmenter be able to map from allocated regions to resources, and then re-create those resources in a new location. It usually runs incrementally to avoid large stalls. See the Vulkan Memory Allocator documentation on the subject: [Vulkan Memory Allocator Defragmentation][vma-defragmentation].

[vma-defragmentation]: https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/defragmentation.html

Left as an exercise for the reader.

Resident Set Management
---

DirectX 12 has a set of APIs, [SetResidencyPriority][d3d12-set-residency-priority], [MakeResident][d3d12-make-resident] and [Evict][d3d12-evict], plus some DXGI APIs, [QueryVideoMemoryInfo][dxgi-query-video-memory-info] and [RegisterVideoMemoryBudgetChangeNotificationEvent][dxgi-register-budget-change-notification], and a [sample library][d3d12-residency-sample], that together allow some level of coordination between applications and VidMM, Windows' Video Memory Manager. Without feedback from the application, it's hard for the OS to know which regions of memory are most important to an application at any given time (you really don't want to spill your render targets to system memory on a PC, for example).

Vulkan's support here is far more limited. Out of the box you get a per-heap size, and need to guess a suitable budget from that. There's an extension, `VK_EXT_memory_priority`, which provides static residency priorities per allocation, and a more sophisticated, less widely supported extension, `VK_EXT_pageable_device_local_memory`, which allows an application to query a dynamic budget for each heap and to dynamically adjust allocation residency priorities. Good luck.

[d3d12-set-residency-priority]: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device1-setresidencypriority
[d3d12-make-resident]: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-makeresident
[d3d12-evict]: https://learn.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device-evict
[dxgi-query-video-memory-info]: https://learn.microsoft.com/en-us/windows/win32/api/dxgi1_4/nf-dxgi1_4-idxgiadapter3-queryvideomemoryinfo
[dxgi-register-budget-change-notification]: https://learn.microsoft.com/en-us/windows/win32/api/dxgi1_4/nf-dxgi1_4-idxgiadapter3-registervideomemorybudgetchangenotificationevent
[d3d12-residency-sample]: https://github.com/microsoft/DirectX-Graphics-Samples/blob/master/Samples/UWP/D3D12Residency/readme.md

Conclusion
===

This was an attempt at going over the set of information required to get a functional Vulkan Memory Allocator At Home. I'll hopefully expand this over time, if I shave enough yaks to have a meaningful example of aliasing and defragmentation. Failing that, scoping out the real VMA source is a good way to go. The source code for the allocator discussed here is available on GitHub [^narcissus-gpu-allocator].

Links
===

[^vulkan-memory-allocator]: GPUOpen - Vulkan Memory Allocator https://gpuopen.com/vulkan-memory-allocator/
[^tlsf-paper]: Masmano et al - Implementation of a constant-time dynamic storage allocator http://www.gii.upv.es/tlsf/files/spe_2008.pdf
[^narcissus-tlsf]: Narcissus source code - TLSF https://github.com/jsimmons/narcissus/blob/e4745ebd5e5824ec9dbfca8a66f9c4bdb0dd0cec/engine/narcissus-gpu/src/tlsf.rs
[^linear-log-bucketing]: Paul Khuong - Linear-log bucketing https://pvk.ca/Blog/2015/06/27/linear-log-bucketing-fast-versatile-simple/
[^narcissus-linear-log-binning]: Narcissus source code - Linear-log binning https://github.com/jsimmons/narcissus/blob/e4745ebd5e5824ec9dbfca8a66f9c4bdb0dd0cec/engine/narcissus-core/src/linear_log_binning.rs
[^narcissus-gpu-allocator]: Narcissus source code - Vulkan Allocator https://github.com/jsimmons/narcissus/blob/e4745ebd5e5824ec9dbfca8a66f9c4bdb0dd0cec/engine/narcissus-gpu/src/backend/vulkan/allocator.rs

Changelog
===

* Small fixes: 25-07-2023
* Initial Post: 20-07-2023