Comparing the top 6 inference runtimes for LLM services in 2025
Large language models are now limited less by training and more by the speed and cost at which we can deliver tokens under real traffic. This comes down to three implementation details: how the runtime batches requests, how prefill and decode overlap, and how the KV cache is stored and reused. Different engines make different tradeoffs on these axes, and those tradeoffs show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.
This article compares six runtimes that appear repeatedly in production stacks:
- vLLM
- TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy
- SGLang
- DeepSpeed Inference / ZeRO-Inference

1. vLLM
Design
vLLM is built around PagedAttention. Instead of storing the KV cache for each sequence in one large contiguous buffer, it splits the KV cache into fixed-size blocks and adds a layer of indirection, so each sequence points to a list of blocks.
This gives:
- Very low KV fragmentation (reportedly a few percent of waste, versus 60-80% with naive contiguous allocation)
- High GPU utilization with continuous batching
- Native support for block-level prefix sharing and KV reuse
Recent versions add FP8 KV cache quantization and integrate FlashAttention-style kernels. A toy sketch of the block-table idea follows.
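To make the indirection concrete, here is a toy Python sketch of a block table under a hypothetical allocator. It is not vLLM’s actual code, only the allocation idea behind PagedAttention.

```python
# Toy sketch of the PagedAttention allocation idea (not vLLM's implementation):
# KV for each sequence lives in fixed-size blocks, and a per-sequence block
# table maps logical block positions to physical block slots.
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()  # hand out any free physical block

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the current one is full, so at most
        # BLOCK_SIZE - 1 KV slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # 40 decoded tokens occupy 3 blocks, not one big buffer
    seq.append_token()
print(seq.block_table)     # e.g. [1023, 1022, 1021]
```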
Performance
From the vLLM evaluation:
- vLLM reached 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.
KV and memory behavior
- PagedAttention provides a KV layout that is both GPU-friendly and fragmentation-resistant.
- When compute is not the bottleneck, FP8 KV quantization shrinks the KV cache and improves decoding throughput (see the sketch below).
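The sketch below shows how an FP8 KV cache is typically enabled through vLLM’s Python API. The model name is a placeholder and exact flag behavior can vary by vLLM version and GPU, so treat this as a starting point rather than a definitive configuration.

```python
# Minimal vLLM sketch with an FP8 KV cache (requires a GPU and a vLLM build
# that supports kv_cache_dtype="fp8"; model name is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in your own model
    kv_cache_dtype="fp8",                      # quantize the KV cache to FP8
    gpu_memory_utilization=0.90,               # fraction of GPU memory vLLM may claim
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```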
Where it fits
- The default high-performance engine when you need a general-purpose LLM service backend with good throughput, good TTFT, and hardware flexibility.
2. TensorRT-LLM
Design
TensorRT-LLM is a compilation-based engine built on NVIDIA TensorRT. It generates fused kernels for each model and shape and exposes an Executor API used by frameworks such as Triton.
Its KV subsystem is explicit and feature-rich:
- Paged KV cache
- Quantized KV cache (INT8, FP8; some combinations are still maturing)
- Circular buffer KV cache
- KV cache reuse, including offloading KV to the CPU and reusing it across prompts to reduce TTFT
NVIDIA reports that CPU-based KV reuse can reduce time to first token by up to 14× on H100, and by even more on GH200 in certain scenarios.
Performance
TensorRT-LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:
- Very low single-request latency on NVIDIA GPUs when the engine is compiled for the exact model and configuration.
- At moderate concurrency, it can be tuned for low TTFT or high throughput; at very high concurrency, the throughput-optimized profile drives P99 higher due to aggressive batching.
KV and memory behavior
- Paged KV plus quantized KV give fine-grained control over memory usage and bandwidth.
- The Executor and memory APIs let you build cache-aware routing strategies at the application layer (sketched below).
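As an illustration of that application-layer idea, here is a small routing sketch. It does not use any TensorRT-LLM API; the replica names and prefix-hashing scheme are assumptions, and a real router would match on token-level prefixes and live cache state.

```python
# Illustrative cache-aware routing sketch (application layer, not a TensorRT-LLM
# API): send requests that share a prompt prefix to the same engine replica so
# KV blocks already cached (or offloaded) there can be reused and TTFT stays low.
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical engine workers

def pick_replica(shared_prefix: str) -> str:
    # Hash the shared prefix (e.g. the system prompt or few-shot header) so every
    # request that starts with it lands on the same replica.
    digest = hashlib.sha256(shared_prefix.encode("utf-8")).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

system_prompt = "You are a contract-review assistant.\n"
print(pick_replica(system_prompt))  # summarization request
print(pick_replica(system_prompt))  # follow-up request hits the same replica's cached KV
```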
Where it fits
- Latency-critical workloads and NVIDIA-only environments where teams can invest in engine builds and per-model tuning.
3. Hugging Face TGI v3
Design
Text Generation Inference (TGI) is a server-centric stack with:
- Rust-based HTTP and gRPC server
- Continuous batching, token streaming, and safety hooks
- PyTorch and TensorRT backends and tight Hugging Face Hub integration
TGI v3 adds a new long-context pipeline:
- Chunked prefill for long inputs
- Prefix KV caching, so long conversation histories are not recomputed on every request (see the client sketch below)
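A minimal client sketch, assuming a TGI server is already running locally on its default port; with prefix caching enabled, consecutive turns that share the same history should skip most of the prefill.

```python
# Minimal client sketch against a locally running TGI server (assumed at
# http://localhost:8080). With prefix caching enabled in TGI v3, the shared
# conversation history should not be re-prefilled on every turn.
import requests

TGI_URL = "http://localhost:8080/generate"  # assumption: default TGI port
history = "User: ...long conversation so far...\nAssistant: ...\n"  # shared prefix

def ask(question: str) -> str:
    payload = {
        "inputs": history + f"User: {question}\nAssistant:",
        "parameters": {"max_new_tokens": 128},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

print(ask("Can you summarize what we agreed on?"))
print(ask("What should we do next?"))  # second call reuses the cached prefix
```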
Performance
For generic prompts, recent third-party work shows:
- Thanks to PagedAttention, vLLM often outperforms TGI on raw tokens per second under high concurrency, but in many settings the gap is not large.
- TGI v3 processes roughly 3x more tokens than vLLM on long prompts and is reported to be up to 13x faster in settings with long histories and prefix caching enabled.
Latency overview:
- P50 latency for short and medium prompts is similar to vLLM when both are tuned with continuous batching.
- For longer chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives it a large win in TTFT and P50.
KV and memory behavior
- TGI uses a paged KV cache with PagedAttention-style kernels and reduces memory footprint through chunked prefill and other runtime changes.
- It integrates quantization via bitsandbytes and GPTQ, and runs on multiple hardware backends.
Where it fits
- Production stacks already on Hugging Face, especially chat-style workloads with long histories where prefix caching brings large real-world gains.
4. LMDeploy
Design
LMDeploy is the compression and deployment toolkit for the InternLM ecosystem. It exposes two engines:
- TurboMind: high-performance CUDA kernels for NVIDIA GPUs
- PyTorch engine: a flexible fallback
Main runtime features:
- Persistent (continuous) batching
- Blocked KV cache with a manager for allocation and reuse
- Dynamic split-and-fuse of attention blocks
- Tensor parallelism
- Weight-only and KV quantization (including AWQ and online INT8/INT4 KV quantization)
LMDeploy reports request throughput up to 1.8x higher than vLLM, attributing this to persistent batching, blocked KV, and optimized kernels. A usage sketch follows.
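A hedged sketch of LMDeploy’s Python pipeline with the TurboMind backend; the model name is a placeholder, and option names such as quant_policy and cache_max_entry_count follow the documented API but may differ across versions.

```python
# Sketch of LMDeploy's pipeline API with the TurboMind backend (model name is a
# placeholder; verify option names against your installed LMDeploy version).
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free GPU memory given to the KV block pool
    quant_policy=8,             # online INT8 KV cache quantization (4 selects INT4 KV)
    tp=1,                       # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["What is persistent batching?"])
print(responses[0].text)
```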
Performance
Evaluations show:
- For a 4-bit Llama-style model on the A100, LMDeploy can achieve higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.
- 4-bit inference is reported to be about 2.4x faster than FP16 for supported models.
Latency:
- When configured without extreme batch limits, single-request TTFT is on par with other optimized GPU engines.
- Under high concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT degrading sharply.
KV and memory behavior
- The blocked KV cache replaces contiguous per-sequence buffers with a pool of KV blocks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.
- Supports weight and KV quantization, targeting large models on constrained GPUs.
Where it fits
- NVIDIA-centric deployments that need maximum throughput and are comfortable adopting TurboMind and LMDeploy-specific tooling.
5. SGLang
Design
SGLang is both:
- A DSL for building structured LLM programs such as agents, RAG workflows, and tool pipelines
- A runtime implementing RadixAttention, a KV reuse mechanism that shares prefixes through a radix tree keyed on tokens rather than a simple block hash.
RadixAttention:
- Stores KV for many requests in a prefix tree keyed by tokens
- Achieves higher KV hit rates when multiple calls share a prefix, such as few-shot prompts, multi-turn chats, or tool chains (see the DSL sketch below)
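A sketch using SGLang’s frontend DSL, assuming a server is already running at the address shown; the function, endpoint, and generation parameters are placeholders drawn from the project’s published examples. Every call shares the same system prompt and few-shot header, so their KV hits the same radix-tree prefix.

```python
# Sketch of SGLang's frontend DSL (assumes an SGLang server is running at the
# endpoint below). All calls share the same system prompt and few-shot header,
# so the radix tree serves their prefix KV from cache.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.system("You are a concise assistant.")
    s += sgl.user("Q: What is 2 + 2?\nA: 4\n\n" + question)  # shared few-shot prefix
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

for q in ["Q: What is the capital of France?", "Q: What is 10 * 7?"]:
    state = qa.run(question=q)
    print(state["answer"])
```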
Performance
Key insights:
- The SGLang paper reports up to 6.4x higher throughput and up to 3.7x lower latency than baseline systems such as vLLM and LMQL on structured workloads.
- The improvements are greatest when there is heavy prefix reuse, such as multi-turn chat or evaluation workloads with repeated contexts.
Reported KV cache hit rates range from roughly 50% to 99%, and the cache-aware scheduler comes close to the optimal hit rate on the measured benchmarks.
KV and memory behavior
- RadixAttention sits on top of paged-attention-style kernels and focuses on reuse, not just allocation (a toy sketch follows this list).
- SGLang integrates well with hierarchical context cache systems that can move KV between GPU and CPU when sequences are long, although these systems are usually implemented as separate projects.
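To show the reuse mechanics, here is a toy prefix-tree sketch keyed by token IDs. It is an illustration only; a real radix tree compresses runs of tokens into single edges and evicts nodes under memory pressure.

```python
# Toy prefix-tree illustration of the RadixAttention idea (not SGLang's code):
# cached KV is keyed by token-ID prefixes, and a new request reuses KV for the
# longest cached prefix it shares with earlier requests.
class Node:
    def __init__(self):
        self.children: dict[int, "Node"] = {}  # next token id -> child node

class PrefixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())  # every prefix is now cached

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched  # number of leading tokens whose KV can be reused

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])                      # e.g. system prompt + few-shot examples
print(cache.longest_cached_prefix([1, 2, 3, 9]))   # -> 3 tokens of KV reused
```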
Where it fits
- Agent systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.
6. DeepSpeed Inference / ZeRO-Inference
Design
DeepSpeed provides two inference-related pieces:
- DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism
- ZeRO-Inference / ZeRO-Offload: techniques for offloading model weights, and in some setups the KV cache, to CPU or NVMe in order to run very large models on limited GPU memory
ZeRO-Inference focuses on:
- Keeping little or none of the model weights resident on the GPU
- Streaming tensors from CPU or NVMe as needed
- Targeting throughput and model size rather than low latency (a config sketch follows this list)
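A hedged sketch of a ZeRO-Inference style configuration with parameter offload; the keys mirror DeepSpeed’s documented zero_optimization config, but exact names and accepted values should be checked against your DeepSpeed version.

```python
# Sketch of a ZeRO stage-3 config with parameter offload to CPU (or NVMe).
# Key names follow DeepSpeed's documented config schema; verify them against
# the version you run before relying on this.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                        # partition and offload parameters
        "offload_param": {
            "device": "cpu",               # or "nvme" for SSD offload
            # "nvme_path": "/local_nvme",  # required when device is "nvme"
            "pin_memory": True,
        },
    },
    "fp16": {"enabled": True},
}

# Typical usage (assumes `model` is an already constructed Hugging Face model):
# import deepspeed
# engine = deepspeed.initialize(model=model, config=ds_config)[0]
# engine.module.eval()  # then run generation with weights streamed in on demand
```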
Performance
In the published ZeRO-Inference example of OPT-30B on a single 32 GB V100:
- Full CPU offload reaches roughly 43 tokens per second
- Full NVMe offload reaches roughly 30 tokens per second
- Both are 1.3–2.4× faster than partial-offload configurations, because full offload enables larger batch sizes
These numbers are small compared to GPU-resident LLM runtimes on an A100 or H100, but they apply to a model that does not natively fit in 32 GB.
A recent I/O characterization of DeepSpeed and FlexGen confirms that offload-based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.
KV and memory behavior
- Model weights (and sometimes KV blocks) are offloaded to the CPU or SSD to accommodate models that exceed GPU capacity.
- TTFT and P99 are higher compared to a pure GPU engine, but the trade-off is the ability to run very large models that would otherwise not fit.
Where it fits
- Offline or batch inference, or low QPS services where model size is more important than latency and GPU count.
Comparison table
This table qualitatively summarizes the main trade-offs:
| Runtime | Main design idea | Relative strength | KV strategy | Typical use cases |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General GPU serving, multiple hardware backends |
| TensorRT-LLM | Compiled kernels for NVIDIA plus KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV; reuse and offload | NVIDIA-only, latency-sensitive |
| TGI v3 | Hugging Face serving layer plus long-prompt path | Strong long-prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF-centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quantization | Up to 1.8x vLLM throughput in vendor tests | Blocked KV cache, weight and KV quantization | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4x throughput and 3.7x lower latency on structured workloads | Radix-tree KV reuse across prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for very large models | Enables large models on small GPUs; throughput-oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |
Runtime selection in practice
For production systems, the choices tend to break down into a few simple patterns:
- You want a strong default engine with minimal customization effort: start with vLLM. It gives you good throughput, reasonable TTFT, and robust KV handling on general-purpose hardware.
- You are committed to NVIDIA and want fine-grained control over latency and KV: use TensorRT-LLM, likely behind Triton or TGI. Plan for per-model engine builds and tuning.
- Your stack is on Hugging Face and you care about long chats: use TGI v3. Its long-prompt pipeline and prefix caching are very effective for conversational traffic.
- You want maximum throughput per GPU with quantized models: use LMDeploy with TurboMind and blocked KV, especially for 4-bit Llama-family models.
- You are building an agent, toolchain, or heavy RAG system: use SGLang and design your prompts so the KV reuse rate through RadixAttention stays high.
- You have to run very large models on limited GPUs: use DeepSpeed Inference / ZeRO-Inference, accept the higher latency, and treat the GPU as a throughput engine with an SSD in the loop.
Overall, all these engines converge on the same idea: the KV cache is the real bottleneck resource. The runtimes that win treat KV as a first-class data structure that can be paged, quantized, reused, and offloaded, rather than just a large tensor that has to fit into GPU memory.

Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.