VRAM Swap, Two Weeks In: Multithreading and Killing the Deadlock
The nbd-vram daemon got a thread pool, ~4x concurrent 4K IOPS, and the swap-pressure freeze is finally gone. Fresh benchmarks against NVMe.
26 posts on cloud, AI, and everything in between.
The nbd-vram daemon got a thread pool, ~4x concurrent 4K IOPS, and the swap-pressure freeze is finally gone. Fresh benchmarks against NVMe.
Most RAG accuracy problems are retrieval misses, not model failures. Here is how to measure and fix the part that actually breaks.
CUDA Unified Memory can extend effective VRAM for LLM inference. The catch is PCIe bandwidth, which makes it slower than just splitting layers to CPU for any meaningful overflow.
NVIDIA's P2P API is silently blocked on consumer GPUs. I've found a path around it in the form of an NBD server over a Unix socket that turns VRAM into a real swap device.
Zero-downtime deployments don't protect you from schema migrations. Here's how table locks bring down production, and how to fix it.