kvcached: a machine learning library that enables virtualized, elastic KV caching for LLM serving on shared GPUs

Serving large language models often wastes GPU memory because engines pre-reserve large, static KV cache regions for each model, even when requests are bursty or the model sits idle. Meet kvcached, a library that provides a virtualized, elastic KV cache for LLM serving on shared GPUs. kvcached grew out of research at the Sky Computing Lab at UC Berkeley, in close collaboration with Rice University and UCLA, and with valuable input from collaborators and colleagues at NVIDIA, Intel, and Stanford University. It introduces an operating-system-style virtual memory abstraction for the KV cache: the serving engine first reserves contiguous virtual address space, and only the active portion is backed by physical GPU pages on demand. This decoupling improves memory utilization, reduces cold starts, and lets multiple models share a device in time and space without extensive engine rewrites.

How does kvcached work?

With kvcached, the engine creates a contiguous KV cache pool in virtual address space. As tokens arrive, the library uses the CUDA virtual memory APIs to back the pool with physical GPU pages at fine granularity. When requests complete or a model goes idle, those pages are unmapped and returned to a shared pool, where they can be immediately reused by other co-located models. This preserves simple pointer arithmetic in the kernels and avoids per-engine, user-level paging schemes. The project targets SGLang and vLLM integration and is released under the Apache 2.0 license. Installation and single-command quickstarts are documented in the GitHub repository.
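To make the pattern concrete, below is a minimal, self-contained sketch of the reserve-then-map-on-demand idea using the CUDA driver's virtual memory API (cuMemAddressReserve, cuMemCreate, cuMemMap, cuMemUnmap). This is not kvcached's code: the pool size, the four-page "active" region, and the build line are placeholder assumptions chosen only to illustrate the mechanism the article describes.

```c
// vmm_kv_sketch.c -- illustrative reserve-now, map-on-demand pattern using
// the CUDA driver virtual memory API. NOT kvcached's implementation; sizes
// such as KV_VIRTUAL_BYTES are made up for this sketch.
// Assumed build line: gcc vmm_kv_sketch.c -lcuda -o vmm_kv_sketch
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                                   \
    do {                                                              \
        CUresult st = (call);                                         \
        if (st != CUDA_SUCCESS) {                                     \
            const char *msg; cuGetErrorString(st, &msg);              \
            fprintf(stderr, "%s failed: %s\n", #call, msg);           \
            exit(1);                                                  \
        }                                                             \
    } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    // Describe the physical allocations: pinned device memory on GPU 0.
    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;

    // Physical backings must be multiples of the allocation granularity.
    size_t gran = 0;
    CHECK(cuMemGetAllocationGranularity(&gran, &prop,
          CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // 1) Reserve a large contiguous VIRTUAL range up front (no GPU memory
    //    is consumed yet). An engine would size this for the worst-case KV.
    const size_t KV_VIRTUAL_BYTES = 64 * gran;   // placeholder size
    CUdeviceptr kv_base = 0;
    CHECK(cuMemAddressReserve(&kv_base, KV_VIRTUAL_BYTES, 0, 0, 0));

    // 2) As tokens arrive, back only the active prefix with physical pages.
    const size_t active_bytes = 4 * gran;        // pretend 4 pages are in use
    CUmemGenericAllocationHandle handle;
    CHECK(cuMemCreate(&handle, active_bytes, &prop, 0));
    CHECK(cuMemMap(kv_base, active_bytes, 0, handle, 0));

    CUmemAccessDesc access = {0};
    access.location = prop.location;
    access.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(kv_base, active_bytes, &access, 1));

    // Kernels can index kv_base with plain pointer arithmetic even though
    // only the first pages are physically backed.
    CHECK(cuMemsetD8(kv_base, 0, active_bytes));

    // 3) When requests finish or the model goes idle, unmap and release the
    //    physical pages so a co-located model can reuse them immediately;
    //    the virtual reservation stays intact for the next burst.
    CHECK(cuMemUnmap(kv_base, active_bytes));
    CHECK(cuMemRelease(handle));

    CHECK(cuMemAddressFree(kv_base, KV_VIRTUAL_BYTES));
    CHECK(cuCtxDestroy(ctx));
    printf("reserved %zu bytes virtually, backed %zu bytes physically\n",
           KV_VIRTUAL_BYTES, active_bytes);
    return 0;
}
```

In a real engine, step 2 would repeat as decode steps consume new KV blocks, and the backings released in step 3 would go back into a pool shared by all co-located models rather than being freed outright.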

Why does it matter at scale?

Production workloads host many models with long-tail, bursty traffic. With static reservations, memory stays tied up when models must be activated or swapped, which slows time to first token (TTFT). The Prism research paper shows that multi-LLM serving needs cross-model memory coordination at runtime, not just compute scheduling. Prism implements on-demand mapping of physical to virtual pages together with a two-level scheduler, and reports more than 2x cost savings and 3.3x higher TTFT SLO attainment on real traces compared with prior systems. kvcached focuses on the memory coordination primitive and packages it as a reusable component for mainstream engines.

Performance signals

The kvcached team reports 1.2x to 28x faster time to first token in multi-model serving, thanks to immediate reuse of freed pages and the removal of large static allocations. These numbers come from multi-LLM scenarios where activation latency and memory headroom dominate tail latency. The team notes kvcached's compatibility with SGLang and vLLM and describes elastic KV allocation as the core mechanism.

Recent work has shifted from fixed partitioning to virtual-memory-based KV management. Prism extends VMM-based allocation to multi-LLM settings through cross-model coordination and scheduling. Earlier efforts such as vAttention explored CUDA VMM for single-model serving, avoiding fragmentation without PagedAttention. The arc is clear: use virtual memory to keep the KV cache contiguous in virtual space, then elastically map physical pages as the workload changes. kvcached packages this idea as a library, simplifying adoption inside existing engines.

Practical applications for developers

Cross-model hosting: The engine can co-locate multiple small or medium-sized models on one device. When one model goes idle, its KV pages are quickly freed, and another model can expand its working set without restarting. This reduces head-of-line blocking during bursts and improves TTFT SLO attainment, as the sketch below illustrates at the CUDA level.
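The following is a hypothetical standalone demo built on raw CUDA virtual memory calls, not kvcached's API: it keeps two models' contiguous virtual KV reservations intact while handing one physical page from an idle "model A" to a busy "model B". The two-reservation layout and page counts are invented for the example.

```c
// kv_handoff_sketch.c -- moving a freed physical KV page from an idle model
// to a busy one without restarting either. Illustrative only.
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { CUresult s = (call); if (s != CUDA_SUCCESS) {  \
    const char *m; cuGetErrorString(s, &m);                             \
    fprintf(stderr, "%s: %s\n", #call, m); exit(1); } } while (0)

int main(void) {
    CHECK(cuInit(0));
    CUdevice dev;  CHECK(cuDeviceGet(&dev, 0));
    CUcontext ctx; CHECK(cuCtxCreate(&ctx, 0, dev));

    CUmemAllocationProp prop = {0};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = dev;
    size_t page = 0;
    CHECK(cuMemGetAllocationGranularity(&page, &prop,
          CU_MEM_ALLOC_GRANULARITY_MINIMUM));

    // Each co-located model keeps its own contiguous virtual KV reservation.
    CUdeviceptr kv_model_a = 0, kv_model_b = 0;
    CHECK(cuMemAddressReserve(&kv_model_a, 16 * page, 0, 0, 0));
    CHECK(cuMemAddressReserve(&kv_model_b, 16 * page, 0, 0, 0));

    // One physical page currently backs model A's KV cache.
    CUmemGenericAllocationHandle phys;
    CHECK(cuMemCreate(&phys, page, &prop, 0));
    CHECK(cuMemMap(kv_model_a, page, 0, phys, 0));
    CUmemAccessDesc rw = {0};
    rw.location = prop.location;
    rw.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    CHECK(cuMemSetAccess(kv_model_a, page, &rw, 1));

    // Model A goes idle: unmap its page and map the same physical backing
    // into model B's reservation. Neither model restarts, and both keep
    // their contiguous virtual layout.
    CHECK(cuMemUnmap(kv_model_a, page));
    CHECK(cuMemMap(kv_model_b, page, 0, phys, 0));
    CHECK(cuMemSetAccess(kv_model_b, page, &rw, 1));

    // Cleanup.
    CHECK(cuMemUnmap(kv_model_b, page));
    CHECK(cuMemRelease(phys));
    CHECK(cuMemAddressFree(kv_model_a, 16 * page));
    CHECK(cuMemAddressFree(kv_model_b, 16 * page));
    CHECK(cuCtxDestroy(ctx));
    printf("moved one %zu-byte physical page from model A to model B\n", page);
    return 0;
}
```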

Activation behavior: Prism reports activation times of roughly 0.7 seconds for an 8B model and roughly 1.5 seconds for a 70B model with streaming activation. kvcached benefits from the same principle: virtual reservations let the engine prepare address ranges ahead of time and map pages only when tokens arrive.

Serverless LLM autoscaling: Fine-grained page mapping makes it practical to scale replicas up and down more often and to keep cold models warm with a minimal memory footprint. This enables tighter autoscaling loops and shrinks the blast radius of hotspots.

Offloading and future work: Virtual memory opens the door to offloading KV to host memory or NVMe when access patterns allow. NVIDIA's recent managed-memory guidance for KV offloading on GH200-class systems shows how a unified address space can scale capacity with acceptable overhead. The kvcached maintainers also discuss offloading and compaction directions in public threads. Verify throughput and latency in your own pipelines, since access locality and PCIe topology have a strong impact.
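As one hedged illustration of the offloading direction, the sketch below uses CUDA managed memory to keep a cold KV region preferentially resident in host RAM and prefetch it back to the GPU before decoding resumes. This is not a kvcached feature; the 256 MB buffer and the specific advice calls are assumptions chosen only to show how a unified address space can trade capacity for transfer overhead.

```c
// managed_kv_offload_sketch.cu -- unified (managed) memory as one possible
// route for KV offload to host RAM. Illustrative only; kvcached's offloading
// design may differ. Assumes a single GPU and a made-up buffer size.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) {  \
    fprintf(stderr, "%s: %s\n", #call, cudaGetErrorString(e)); exit(1); } \
  } while (0)

int main(void) {
    const size_t kv_bytes = 256u << 20;   // hypothetical 256 MB KV region
    int gpu = 0;

    // One allocation visible from both CPU and GPU through a single pointer.
    char *kv = NULL;
    CHECK(cudaMallocManaged((void **)&kv, kv_bytes, cudaMemAttachGlobal));

    // Cold KV: prefer host residency, but keep GPU access mappings so a
    // reactivated request does not fault the whole region at once.
    CHECK(cudaMemAdvise(kv, kv_bytes, cudaMemAdviseSetPreferredLocation,
                        cudaCpuDeviceId));
    CHECK(cudaMemAdvise(kv, kv_bytes, cudaMemAdviseSetAccessedBy, gpu));
    CHECK(cudaMemPrefetchAsync(kv, kv_bytes, cudaCpuDeviceId, 0));

    // Hot again: stream the region (or just the needed slice) back to the
    // GPU before decoding resumes.
    CHECK(cudaMemPrefetchAsync(kv, kv_bytes, gpu, 0));
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaFree(kv));
    printf("cycled %zu MB of managed KV between host and GPU\n",
           kv_bytes >> 20);
    return 0;
}
```

Whether the trade-off pays off depends on how much of the region a reactivated request actually touches and on the host-GPU interconnect, which is exactly why the benchmarking caveat above applies.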

Main points

  1. kvcached uses GPU virtual memory to virtualize the KV cache. The engine reserves contiguous virtual space and maps physical pages on demand, enabling elastic allocation and reclamation under dynamic load.
  2. It integrates with mainstream inference engines, notably SGLang and vLLM, and is released under Apache 2.0, making it straightforward to adopt and modify in production serving stacks.
  3. Public benchmarks report 1.2x to 28x faster time to first token in multi-model serving, driven by immediate reuse of freed KV pages and the removal of large static reservations.
  4. Prism shows that cross-model memory coordination, via on-demand mapping and two-level scheduling, delivers more than 2x cost savings and 3.3x higher TTFT SLO attainment on real traces; kvcached provides the memory primitive in a form mainstream engines can reuse.
  5. For clusters hosting many models with bursty, long-tail traffic, a virtualized KV cache enables safer co-location, faster activation, and tighter autoscaling, with Prism evaluations reporting activation times of roughly 0.7 seconds for an 8B model and roughly 1.5 seconds for a 70B model.

kvcached is a focused piece of GPU memory virtualization for LLM serving, not a full operating system, and that clarity matters. The library reserves virtual address space for the KV cache and maps physical pages on demand, enabling elastic sharing across models with minimal engine changes. This aligns with the evidence that cross-model memory coordination is critical for multi-model workloads, improving SLO attainment and cost on real traces. Overall, kvcached advances practical GPU memory coordination for LLM serving; its production value still depends on validation in each cluster.


Check out the GitHub repository, Paper 1, and Paper 2 for technical details, tutorials, code, and notebooks.

