Tags: Linux, CUDA, NVIDIA, Systems

7GB of VRAM as Swap, No Kernel Module Required

Sean Lobjoit · 4 min read

Built something this week that scratched a real itch.

I have an RTX 3070 laptop with soldered RAM and no upgrade path. It's hybrid graphics - the display runs off the integrated AMD/ATI GPU. The NVIDIA card sits completely idle the majority of the time, 8GB of VRAM doing nothing while the system thrashes to SSD swap.

The Dead End First

The obvious approach is NVIDIA's P2P API, which lets you map GPU memory directly into system address space. I tried both API variants, every relevant flag, and multiple driver versions. Every call returns EINVAL with no explanation.

After weeks of elimination I'm satisfied it's not a bug. It looks to be a SKU restriction: consumer GeForce GPUs have P2P gated at the driver level, silently, regardless of what the documentation implies. Direct BAR1 mapping has the same problem. You get approximately 16 MiB of mappable space, which is not useful as swap.

The Way Around It

A small daemon allocates VRAM using the CUDA driver API and exposes that allocation as a block device over a Unix socket using the Network Block Device protocol. The Linux kernel's built-in nbd driver connects to it and creates /dev/nbd0. From there it's just like any normal swap device - and for the workload that actually matters, it's significantly faster.

The data path is:

kernel swap subsystem >> /dev/nbd0 >> nbd kernel driver >> Unix socket >> daemon >> cuMemcpyHtoD/DtoH >> GPU VRAM
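To make the daemon's half of that path concrete, here is a minimal sketch of handling one NBD request in the transmission phase. It is not the nbd-vram code: the `handle_request` function is hypothetical, and a plain `bytearray` stands in for the VRAM allocation so it runs without a GPU. The magic numbers and header layouts are the ones from the NBD protocol.

```python
import struct

# Wire constants from the NBD protocol's transmission phase.
NBD_REQUEST_MAGIC = 0x25609513
NBD_SIMPLE_REPLY_MAGIC = 0x67446698
NBD_CMD_READ, NBD_CMD_WRITE = 0, 1
NBD_EINVAL = 22

# Big-endian headers: request is 28 bytes, simple reply is 16 bytes.
REQUEST = struct.Struct(">IHHQQI")  # magic, flags, type, cookie, offset, length
REPLY = struct.Struct(">IIQ")       # magic, error, cookie


def handle_request(header: bytes, payload: bytes, store: bytearray) -> bytes:
    """Serve one request against `store` (a stand-in for the VRAM buffer)."""
    magic, _flags, cmd, cookie, offset, length = REQUEST.unpack(header)
    if magic != NBD_REQUEST_MAGIC or offset + length > len(store):
        return REPLY.pack(NBD_SIMPLE_REPLY_MAGIC, NBD_EINVAL, cookie)
    if cmd == NBD_CMD_READ:
        # A real daemon would copy device-to-host here (cuMemcpyDtoH).
        data = bytes(store[offset:offset + length])
        return REPLY.pack(NBD_SIMPLE_REPLY_MAGIC, 0, cookie) + data
    if cmd == NBD_CMD_WRITE and len(payload) == length:
        # A real daemon would copy host-to-device here (cuMemcpyHtoD).
        store[offset:offset + length] = payload
        return REPLY.pack(NBD_SIMPLE_REPLY_MAGIC, 0, cookie)
    return REPLY.pack(NBD_SIMPLE_REPLY_MAGIC, NBD_EINVAL, cookie)
```

The cookie is echoed back unchanged so the kernel can match each reply to its request. Handshake, flush, trim and disconnect handling are left out of the sketch.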

No custom kernel module, no patched driver. The only kernel piece is the stock nbd driver, so it survives every kernel and driver update because everything I wrote sits entirely in userspace.

Results From Testing

7GB of VRAM as swap at priority 1500. Overflow order: RAM fills, then VRAM absorbs the spill, then zram compresses the rest (CPU-side), then SSD only if everything else is exhausted. Total addressable memory comes out at ~46GB from a 16GB machine.
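The overflow order falls out of swap priorities: the kernel fills the highest-priority device first and moves down only when it is full. Here is a toy model of that rule. The 7 GB size and priority 1500 for VRAM come from my setup; the zram and SSD sizes and priorities below are hypothetical placeholders, and the model ignores the round-robin the kernel does between devices of equal priority.

```python
def place(spill_gb, tiers):
    """Distribute `spill_gb` of swapped pages across tiers, highest priority first.

    `tiers` is a list of (name, priority, size_gb) tuples.
    """
    placed = {}
    for name, _prio, size_gb in sorted(tiers, key=lambda t: t[1], reverse=True):
        take = min(spill_gb, size_gb)
        if take > 0:
            placed[name] = take
        spill_gb -= take
    return placed


TIERS = [
    ("ssd", -2, 16),    # hypothetical priority and size
    ("vram", 1500, 7),  # from the post
    ("zram", 100, 8),   # hypothetical priority and size
]

print(place(10, TIERS))  # {'vram': 7, 'zram': 3}
```

With 10 GB spilling out of RAM, VRAM takes the first 7 GB, zram takes the rest, and the SSD is never touched.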

I ran three benchmarks against NVMe (dm-crypt cryptswap, PCIe 4.0) to get an honest picture.

Sequential throughput - dd, 2 GiB, O_DIRECT

| Device | Write | Read |
| --- | --- | --- |
| NVMe | 2.7 GB/s | 2.9 GB/s |
| VRAM (nbd) | 1.1 GB/s | 2.3 GB/s |

NVMe wins here. The NBD + CUDA userspace round-trip adds overhead that NVMe's direct kernel block path doesn't pay. Every block crosses a Unix socket and a cuMemcpy call. Sequential throughput is not what swap actually does though.

4K random IOPS - fio, libaio, iodepth=32

| Device | Read IOPS | Write IOPS | Avg latency |
| --- | --- | --- | --- |
| NVMe | 45.4k | 45.3k | 343 us |
| VRAM (nbd) | 28.7k | 28.7k | 550 us |

NVMe wins again. At iodepth=32, NVMe has 32 requests genuinely in flight; the NBD daemon serialises them.
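A quick sanity check on those numbers using Little's law: with a fixed queue depth, average latency is roughly queue depth divided by total IOPS. This sketch assumes the read and write figures come from one mixed run, so the total is their sum. That is my reading of the table, not something fio reports directly, and the function name is just for illustration.

```python
def mean_latency_us(iodepth, total_iops):
    """Little's law: average time in the queue = requests in flight / throughput."""
    return iodepth / total_iops * 1e6


nvme = mean_latency_us(32, 45_400 + 45_300)  # about 353 us, measured 343 us
vram = mean_latency_us(32, 28_700 + 28_700)  # about 557 us, measured 550 us
print(f"NVMe {nvme:.0f} us, VRAM {vram:.0f} us")
```

Both estimates land within a few percent of the measured averages, which suggests the latency column reflects queueing at depth 32 rather than the cost of a single isolated request.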

Per-operation latency - ioping, 4K reads, 1 request/sec

| Device | Min | Avg | Max |
| --- | --- | --- | --- |
| NVMe | 120 us | 9.05 ms | 10.1 ms |
| VRAM (nbd) | 134 us | 335 us | 490 us |

This is the one that matters. VRAM is 27x faster on average. The NVMe drive is physically capable of ~112 us - you can see it on the warmup request - but APST (Autonomous Power State Transitions) puts it to sleep between requests. At one request per second, which matches how page faults actually arrive under normal memory pressure, it wakes cold almost every time and pays a ~9 ms penalty. VRAM has no power states.

Memory pressure on a laptop is not a sustained GB/s flood. It is individual 4K page faults arriving seconds apart. Every one of those faults stalls the process waiting for the swap device. At 9 ms per fault, NVMe swap is something you feel. At 335 us, VRAM swap is not.
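The arithmetic behind that, as a small sketch. The two averages are the ioping results above; the fault count is a made-up figure purely to show how the stalls add up.

```python
NVME_AVG_US = 9050  # ioping average for NVMe, 9.05 ms
VRAM_AVG_US = 335   # ioping average for VRAM over nbd


def stall_ms(faults, avg_us):
    """Total time a process spends blocked on `faults` sequential page-ins."""
    return faults * avg_us / 1000


ratio = NVME_AVG_US / VRAM_AVG_US
print(f"{ratio:.1f}x")                          # 27.0x
print(stall_ms(100, NVME_AVG_US), "ms on NVMe")  # 905.0 ms on NVMe
print(stall_ms(100, VRAM_AVG_US), "ms on VRAM")  # 33.5 ms on VRAM
```

A hundred cold faults cost close to a second of stalls on NVMe and about a thirtieth of a second on VRAM.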

There is also the wear angle. GDDR6 is DRAM - no write endurance limit. NVMe NAND has a finite TBW rating and swap is one of the worst workloads for it.

Getting Started

Requirements are a CUDA-capable NVIDIA GPU, libcuda.so.1 from any modern driver, and the nbd-client package. The stock nbd kernel module ships with Linux 3.0+, so there is nothing extra to build.

git clone https://github.com/c0dejedi/nbd-vram
cd nbd-vram
sudo ./install.sh
sudo systemctl start vram-swap-nbd

Two parameters in the systemd unit file control the allocation size (VRAM_SETUP_SIZE_MB) and swap priority (VRAM_SWAP_PRIORITY). If you're on a laptop with soldered memory and an NVIDIA GPU, this works today.
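Once the service is up, /proc/swaps shows whether the device is active and at which priority. A small sketch of reading it follows. The `parse_swaps` helper is hypothetical and the sample text is illustrative, not output from my machine; on a real system you would pass in the contents of /proc/swaps instead.

```python
# Illustrative sample in the /proc/swaps layout (sizes are in KiB).
SAMPLE = """\
Filename        Type        Size        Used    Priority
/dev/dm-1       partition   16777212    0       -2
/dev/nbd0       partition   7340028     0       1500
/dev/zram0      partition   8388604     0       100
"""


def parse_swaps(text):
    """Parse /proc/swaps text into dicts, sorted in the order the kernel fills them."""
    rows = []
    for line in text.strip().splitlines()[1:]:  # skip the header row
        if not line.strip():
            continue
        device, kind, size_kib, used_kib, priority = line.split()
        rows.append({
            "device": device,
            "type": kind,
            "size_kib": int(size_kib),
            "used_kib": int(used_kib),
            "priority": int(priority),
        })
    return sorted(rows, key=lambda r: r["priority"], reverse=True)


for row in parse_swaps(SAMPLE):
    print(row["device"], row["priority"])
```

If /dev/nbd0 is not at the top of that list, the priority in the unit file is lower than another swap device's and VRAM will not be used first.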

Feel free to check out nbd-vram, star the project and share it with anyone who could use the extra headroom.