Vulkan Memory Types
Picking up the discussion on memory and caching: how does this all interact with an external processor such as a GPU? GPUs are an interesting addition to this discussion because their operation is very memory-bound and very, very multi-core. They also run alongside CPUs. This makes things ...complicated.
A quick overview:
- GPUs usually have lots and lots of their own main RAM, which we call VRAM (for video-RAM).
- GPUs can usually read data from main RAM as well. This is generally a little slower than VRAM since the accesses have to be coordinated such that they don't overlap with the CPU's.
- CPUs may or may not be able to directly access VRAM (again, because this shared access to the resource poses a coordination problem). Often a CPU will be limited to just a small piece of VRAM which is controlled by the driver and used to transfer instructions to the GPU which then accesses other memory on the CPU's behalf.
- But this isn't true of all GPUs - some of them, like mobile and integrated GPUs, have no VRAM and just share main RAM with a CPU.
- In addition to their internal caches, GPUs can have special regions of extremely fast (and, like CPU cache: power-hungry, hot, and thus limited) memory.
- The CPU and GPU might share not only main RAM but also a CPU package. They might sit right next to one another in one chip and have super secret best-friends-only special handshakes that they can do with one another to make coordinating over memory access fast.
- GPUs may or may not virtualize their memory. (Modern ones generally do, for robustness and security reasons.)
- There might be rules about certain types of resources having to go in certain parts of memory, or about having to have that memory mapped a certain way.
And this list could go on...
How do we deal with this?
The old answer to this question was "we don't, the graphics driver is made of tiny demons who figure it all out for us". But, as it turned out, tiny demons require special care and feeding (and sometimes they have to be paid!), so let's look at the current state of things.
Modern APIs like Vulkan expose this stuff to applications as a set of different memory types.
A memory type contains three bits of information:
- The memory region, or heap, where it can be allocated.
- Information about how the application is allowed to interact with this memory.
- Secret configuration flags that only the graphics driver knows about. These might affect the resource types that the memory type is compatible with, and that's all that concerns the application about this stuff.
The Vulkan memory API
In Vulkanese, that's all wrapped up in this:
typedef struct VkMemoryType {
    VkMemoryPropertyFlags    propertyFlags;
    uint32_t                 heapIndex;
} VkMemoryType;
heapIndex tells us which heap this is associated with. propertyFlags tells us the memory's overall performance characteristics and how the CPU may and may not interact with it.
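To see what a given device actually offers, you can query the table of memory types and heaps and dump it. A minimal sketch, assuming physicalDevice has already been picked:

#include <stdio.h>
#include <vulkan/vulkan.h>

void dump_memory_types(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceMemoryProperties props;
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

    for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const VkMemoryType t = props.memoryTypes[i];
        const VkMemoryHeap h = props.memoryHeaps[t.heapIndex];
        printf("type %u: heap %u (%llu MiB)%s%s%s%s\n",
               i, t.heapIndex,
               (unsigned long long)(h.size >> 20),
               (t.propertyFlags & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT)  ? " DEVICE_LOCAL"  : "",
               (t.propertyFlags & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT)  ? " HOST_VISIBLE"  : "",
               (t.propertyFlags & VK_MEMORY_PROPERTY_HOST_COHERENT_BIT) ? " HOST_COHERENT" : "",
               (t.propertyFlags & VK_MEMORY_PROPERTY_HOST_CACHED_BIT)   ? " HOST_CACHED"   : "");
    }
}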
Memory type flags
These tell us (well, these strongly suggest) the location of the memory:
DEVICE_LOCAL means that the GPU can efficiently access this memory. If this bit is missing, then the GPU's access to this memory will be slow(er) than otherwise.
HOST_VISIBLE means that the CPU can map a pointer to this memory and access it. If it's paired with DEVICE_LOCAL, that usually indicates that the CPU and GPU can efficiently deconflict access to some shared bit of RAM or VRAM and that the CPU is therefore allowed to map and access it directly. That makes this an excellent candidate for uniform buffers, dynamic mesh vertex buffers, things like that. This combination is typically seen on integrated GPUs and on mobile devices where main RAM and VRAM are the same thing, but it's also present on newer discrete desktop GPUs, where the trend is to make more and more of VRAM visible to the CPU over the PCIe bus. (Look up the terms BAR and ReBAR for more info.)
If it's present but DEVICE_LOCAL is absent, then this is probably part of system RAM which the GPU can nevertheless access (if more slowly). That makes it well-suited (in the absence of a better option) for read-once data, such as staging buffers, small uniform buffers which won't fall out of cache, and unindexed vertex buffers (or those used with very cache-friendly indexing). If HOST_VISIBLE is absent entirely, then that's probably a part of VRAM which the CPU is not allowed to touch, because the cost of deconflicting external CPU access from internal GPU access to (that part of) VRAM is just too high.
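As a concrete illustration of why the DEVICE_LOCAL + HOST_VISIBLE combination is so handy for per-frame data, here's a hedged sketch (not the only way to do it; device, memoryTypeIndex, and frameUniforms are assumed to already exist): allocate from such a type, map it once, and copy fresh uniform data into it every frame.

// Hedged sketch: persistently map an allocation made from a
// DEVICE_LOCAL | HOST_VISIBLE memory type and stream uniform data into it.
VkMemoryAllocateInfo allocInfo = {
    .sType           = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO,
    .allocationSize  = sizeof frameUniforms,
    .memoryTypeIndex = memoryTypeIndex,   // a DEVICE_LOCAL | HOST_VISIBLE type
};
VkDeviceMemory memory;
vkAllocateMemory(device, &allocInfo, NULL, &memory);

void* mapped = NULL;
vkMapMemory(device, memory, 0, VK_WHOLE_SIZE, 0, &mapped);  // map once, keep it mapped
memcpy(mapped, &frameUniforms, sizeof frameUniforms);       // repeat each frame
// (If the chosen type isn't HOST_COHERENT, a flush is also needed; see the
// caching flags below.)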
LAZILY_ALLOCATED is interesting. It represents those special small regions of ultrafast VRAM mentioned above. Those generally have to be carefully managed by the graphics driver, so this flag is incompatible with HOST_VISIBLE; the CPU is not allowed to directly touch this memory. Assume the driver is deploying tiny demons to do magic on your behalf. (Okay, fine, I'll be serious: this has to do with how framebuffers are allocated on a GPU which uses a tiled rendering approach, and those approaches require the aforementioned special small fast memories.)
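If you're wondering how you'd ever end up in such memory: transient attachments are the usual route. A hedged sketch, with image creation trimmed down and width/height assumed:

// Hedged sketch: a transient depth attachment is the classic candidate for a
// LAZILY_ALLOCATED memory type on tiled GPUs.
VkImageCreateInfo depthInfo = {
    .sType         = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO,
    .imageType     = VK_IMAGE_TYPE_2D,
    .format        = VK_FORMAT_D32_SFLOAT,
    .extent        = { width, height, 1 },
    .mipLevels     = 1,
    .arrayLayers   = 1,
    .samples       = VK_SAMPLE_COUNT_1_BIT,
    .tiling        = VK_IMAGE_TILING_OPTIMAL,
    .usage         = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT |
                     VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT,   // never leaves the tile
    .initialLayout = VK_IMAGE_LAYOUT_UNDEFINED,
};
// ...then vkCreateImage, vkGetImageMemoryRequirements, and prefer a memory
// type with VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT when binding it (falling
// back to plain DEVICE_LOCAL if no such type exists).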
Caching flags
The rest of the flags have to do with how this memory will interact with the CPU's memory caches:
HOST_CACHED means that this memory works like normal RAM as far as the CPU's memory cache is concerned. If this flag is missing, then the CPU must access this memory contiguously and sequentially, because any other access pattern is going to hurt bad, and reading from this memory will hurt extra bad. No cache means each CPU instruction which touches memory is exposed to the full cost of dealing with the slowness of (V)RAM.
HOST_COHERENT means that the CPU and GPU have a secret special handshake they can do to make sure they don't trip up on differences between the version of some data stored in the CPU cache and the underlying RAM. If this flag is missing, then the application must explicitly ask the CPU and GPU to shake hands using vkFlushMappedMemoryRanges after sending data to the GPU and vkInvalidateMappedMemoryRanges before receiving data. Obviously don't call these functions one byte (well, one cache line) at a time: that's slow and bad. An ideal use case is writing an entire buffer full of uniform data and then flushing the whole thing at once for the whole frame before submitting commands which will read that data. If this flag is present, then those functions don't need to be called, but they don't hurt much if called regardless.
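Here's a minimal sketch of that flush pattern for a non-coherent mapping (device, memory, uniformData, and uniformSize are assumed to already exist):

// Sketch: write uniform data through a mapping of memory that is
// HOST_VISIBLE but not HOST_COHERENT, then flush the whole range at once.
void* mapped = NULL;
vkMapMemory(device, memory, 0, uniformSize, 0, &mapped);
memcpy(mapped, uniformData, uniformSize);

VkMappedMemoryRange range = {
    .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
    .memory = memory,
    .offset = 0,
    .size   = VK_WHOLE_SIZE,   // one flush for the whole frame's worth of data
};
vkFlushMappedMemoryRanges(device, 1, &range);   // CPU cache -> memory
// ...submit the GPU work that reads this buffer...
// For readbacks, call vkInvalidateMappedMemoryRanges(device, 1, &range)
// before the CPU reads through the mapped pointer.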
Sometimes there will be only one option in this regard. Sometimes drivers will offer multiple options, and the software chooses whichever it prefers. In such a case (a selection sketch follows this list):
- When writing data contiguously and sequentially and doing no reading whatsoever: avoid HOST_CACHED (but no big deal if you can't). In all other cases, insist on HOST_CACHED (and be prepared for things to run sloooow if it isn't available and you proceed without it).
- Whenever it's not burdensome to appropriately call vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges, call those functions and avoid HOST_COHERENT (but no big deal if you get stuck with it anyway). If you can't call those functions, then HOST_COHERENT is absolutely required.
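Putting those rules together, here's a minimal sketch of the usual selection helper. The function name and the two-pass required/preferred scheme are just one way to do it, not anything blessed by the API; the memoryTypeBits mask comes from the requirements queries discussed in the next section.

#include <stdint.h>
#include <vulkan/vulkan.h>

// Return the index of the first memory type that the resource can live in
// (per memoryTypeBits) and that has all the required flags, preferring types
// that also have the preferred flags. Returns UINT32_MAX if nothing fits.
uint32_t find_memory_type(const VkPhysicalDeviceMemoryProperties* props,
                          uint32_t memoryTypeBits,
                          VkMemoryPropertyFlags required,
                          VkMemoryPropertyFlags preferred)
{
    for (int pass = 0; pass < 2; ++pass) {
        VkMemoryPropertyFlags wanted = required | (pass == 0 ? preferred : 0);
        for (uint32_t i = 0; i < props->memoryTypeCount; ++i) {
            if (!(memoryTypeBits & (1u << i)))
                continue;   // the resource can't be bound to this type at all
            if ((props->memoryTypes[i].propertyFlags & wanted) == wanted)
                return i;
        }
    }
    return UINT32_MAX;      // caller must handle "no suitable type"
}

For example, a readback buffer might ask for required = HOST_VISIBLE with preferred = HOST_CACHED, while code that can't flush and invalidate would make HOST_COHERENT part of required instead.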
Secret driver sauce
As mentioned, a memory type may also contain hidden data. This manifests to applications as a cluster of memory types which share the same heapIndex and have exactly the same propertyFlags, but which report different compatibility with Vulkan's resource types via functions like vkGetBufferMemoryRequirements and vkGetImageMemoryRequirements (specifically the memoryTypeBits field in their output structures).
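For concreteness, here's a hedged sketch of where that memoryTypeBits mask shows up (device and size are assumed to exist):

// Sketch: create a buffer and ask which memory types it may be bound to.
VkBufferCreateInfo bufferInfo = {
    .sType       = VK_STRUCTURE_TYPE_BUFFER_CREATE_INFO,
    .size        = size,
    .usage       = VK_BUFFER_USAGE_UNIFORM_BUFFER_BIT,
    .sharingMode = VK_SHARING_MODE_EXCLUSIVE,
};
VkBuffer buffer;
vkCreateBuffer(device, &bufferInfo, NULL, &buffer);

VkMemoryRequirements reqs;
vkGetBufferMemoryRequirements(device, buffer, &reqs);
// reqs.memoryTypeBits: bit i is set if memory type i is compatible with this
// buffer. Two types with identical propertyFlags may still differ here; that's
// the secret driver sauce at work.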
Other flags
There are a few more property flags which aren't often used:
PROTECTED is for DRM stuff.
DEVICE_COHERENT and DEVICE_UNCACHED are provided by an AMD extension and are similar in meaning to HOST_COHERENT and (the absence of) HOST_CACHED, except they apply to the GPU. There can be a big performance hit for using these. They are not intended for regular use; they're intended for debugging tools. DEVICE_UNCACHED means all reads and writes to this memory go straight to VRAM, bypassing the GPU's internal caches. DEVICE_COHERENT means that while the GPU's caches might still be involved in memory access, the GPU does special secret handshakes among its various memory and caching subsystems to make that invisible to the application. That means that barriers are not required between different stages using this memory.
RDMA_CAPABLE is an NV extension that indicates that the memory is directly accessible from other system devices (besides the CPU).