Comparing the top 6 inference runtimes for LLM services in 2025
Large language models are now limited less by training and more by the speed and cost at which we can deliver tokens under real traffic. This comes down to three implementation details: how the runtime batches requests, how prefill and decode overlap, and how the KV cache is stored and reused. Different engines make different tradeoffs on these axes, and those tradeoffs show up directly as differences in tokens per second, P50/P99 latency, and GPU memory usage.
This article compares six runtimes that appear repeatedly in production stacks:
- vLLM
- TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy
- SGLang
- DeepSpeed Inference / ZeRO-Inference

1. vLLM
Design
vLLM is built around PagedAttention. Instead of storing the KV cache for each sequence in one large contiguous buffer, it splits the KV cache into fixed-size blocks and adds a layer of indirection, so each sequence points to a list of blocks.
This gives:
- Very low KV fragmentation (reportedly a few percent of waste, versus 60-80% with naive contiguous allocation)
- High GPU utilization with continuous batching
- Native support for block-level prefix sharing and KV reuse
Recent versions add FP8 KV cache quantization and integrate FlashAttention-style kernels. A toy sketch of the block-table idea follows.
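To make the indirection concrete, here is a toy Python sketch of a block table under a hypothetical allocator. It is not vLLM’s actual code, only the allocation idea behind PagedAttention.

```python
# Toy sketch of the PagedAttention allocation idea (not vLLM's implementation):
# KV for each sequence lives in fixed-size blocks, and a per-sequence block
# table maps logical block positions to physical block slots.
BLOCK_SIZE = 16  # tokens per KV block (illustrative)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free_blocks.pop()  # hand out any free physical block

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []  # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        # A new block is allocated only when the current one is full, so at most
        # BLOCK_SIZE - 1 KV slots are ever wasted per sequence.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):        # 40 decoded tokens occupy 3 blocks, not one big buffer
    seq.append_token()
print(seq.block_table)     # e.g. [1023, 1022, 1021]
```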
Performance
From the vLLM evaluation:
- vLLM reached 14–24× higher throughput than Hugging Face Transformers and 2.2–3.5× higher than early TGI for LLaMA models on NVIDIA GPUs.
KV and memory behavior
- PagedAttention provides a KV layout that is both GPU-friendly and fragmentation-resistant.
- When compute is not the bottleneck, FP8 KV quantization shrinks the KV cache and improves decoding throughput (see the sketch below).
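The sketch below shows how an FP8 KV cache is typically enabled through vLLM’s Python API. The model name is a placeholder and exact flag behavior can vary by vLLM version and GPU, so treat this as a starting point rather than a definitive configuration.

```python
# Minimal vLLM sketch with an FP8 KV cache (requires a GPU and a vLLM build
# that supports kv_cache_dtype="fp8"; model name is a placeholder).
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # swap in your own model
    kv_cache_dtype="fp8",                      # quantize the KV cache to FP8
    gpu_memory_utilization=0.90,               # fraction of GPU memory vLLM may claim
)

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain PagedAttention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```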
Where it fits
- The default high-performance engine when you need a general-purpose LLM service backend with good throughput, good TTFT, and hardware flexibility.
2. TensorRT-LLM
Design
TensorRT-LLM is a compilation-based engine built on NVIDIA TensorRT. It generates fused kernels for each model and shape and exposes an Executor API used by frameworks such as Triton.
Its KV subsystem is explicit and feature-rich:
- Paged KV cache
- Quantized KV cache (INT8, FP8; some combinations are still maturing)
- Circular buffer KV cache
- KV cache reuse, including offloading KV to the CPU and reusing it across prompts to reduce TTFT
NVIDIA reports that CPU-based KV reuse can reduce time to first token by up to 14× on H100, and by even more on GH200 in certain scenarios.
Performance
TensorRT-LLM is highly tunable, so results vary. Common patterns from public comparisons and vendor benchmarks:
- Very low single-request latency on NVIDIA GPUs when the engine is compiled for the exact model and configuration.
- At moderate concurrency, it can be tuned for low TTFT or high throughput; at very high concurrency, the throughput-optimized profile drives P99 higher due to aggressive batching.
KV and memory behavior
- Paged KV plus quantized KV give fine-grained control over memory usage and bandwidth.
- The Executor and memory APIs let you build cache-aware routing strategies at the application layer (sketched below).
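As an illustration of that application-layer idea, here is a small routing sketch. It does not use any TensorRT-LLM API; the replica names and prefix-hashing scheme are assumptions, and a real router would match on token-level prefixes and live cache state.

```python
# Illustrative cache-aware routing sketch (application layer, not a TensorRT-LLM
# API): send requests that share a prompt prefix to the same engine replica so
# KV blocks already cached (or offloaded) there can be reused and TTFT stays low.
import hashlib

REPLICAS = ["replica-0", "replica-1", "replica-2"]  # hypothetical engine workers

def pick_replica(shared_prefix: str) -> str:
    # Hash the shared prefix (e.g. the system prompt or few-shot header) so every
    # request that starts with it lands on the same replica.
    digest = hashlib.sha256(shared_prefix.encode("utf-8")).digest()
    return REPLICAS[int.from_bytes(digest[:4], "big") % len(REPLICAS)]

system_prompt = "You are a contract-review assistant.\n"
print(pick_replica(system_prompt))  # summarization request
print(pick_replica(system_prompt))  # follow-up request hits the same replica's cached KV
```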
Where it fits
- Latency-critical workloads and NVIDIA-only environments where teams can invest in engine builds and per-model tuning.
3. Hugging Face TGI v3
Design
Text Generation Inference (TGI) is a server-centric stack with:
- Rust-based HTTP and gRPC server
- Continuous batching, token streaming, and safety hooks
- PyTorch and TensorRT backends and tight Hugging Face Hub integration
TGI v3 adds a new long-context pipeline:
- Chunked prefill for long inputs
- Prefix KV caching, so long conversation histories are not recomputed on every request (see the client sketch below)
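A minimal client sketch, assuming a TGI server is already running locally on its default port; with prefix caching enabled, consecutive turns that share the same history should skip most of the prefill.

```python
# Minimal client sketch against a locally running TGI server (assumed at
# http://localhost:8080). With prefix caching enabled in TGI v3, the shared
# conversation history should not be re-prefilled on every turn.
import requests

TGI_URL = "http://localhost:8080/generate"  # assumption: default TGI port
history = "User: ...long conversation so far...\nAssistant: ...\n"  # shared prefix

def ask(question: str) -> str:
    payload = {
        "inputs": history + f"User: {question}\nAssistant:",
        "parameters": {"max_new_tokens": 128},
    }
    resp = requests.post(TGI_URL, json=payload, timeout=60)
    resp.raise_for_status()
    return resp.json()["generated_text"]

print(ask("Can you summarize what we agreed on?"))
print(ask("What should we do next?"))  # second call reuses the cached prefix
```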
Performance
For generic prompts, recent third-party work shows:
- Thanks to PagedAttention, vLLM often outperforms TGI on raw tokens per second under high concurrency, but in many settings the gap is not large.
- TGI v3 processes roughly 3x more tokens than vLLM on long prompts and is reported to be up to 13x faster in settings with long histories and prefix caching enabled.
Latency overview:
- P50 latency for short and medium prompts is similar to vLLM when both are tuned with continuous batching.
- For longer chat histories, prefill dominates in naive pipelines; TGI v3’s reuse of earlier tokens gives it a large win in TTFT and P50.
KV and memory behavior
- TGI uses a paged KV cache with PagedAttention-style kernels and reduces memory footprint through chunked prefill and other runtime changes.
- It integrates quantization via bitsandbytes and GPTQ, and runs on multiple hardware backends.
Where it fits
- Production stacks already on Hugging Face, especially chat-style workloads with long histories where prefix caching brings large real-world gains.
4. LMDeploy
Design
LMDeploy is the compression and deployment toolkit for the InternLM ecosystem. It exposes two engines:
- TurboMind: high-performance CUDA kernels for NVIDIA GPUs
- PyTorch engine: a flexible fallback
Main runtime features:
- Persistent (continuous) batching
- Blocked KV cache with a manager for allocation and reuse
- Dynamic split-and-fuse of attention blocks
- Tensor parallelism
- Weight-only and KV quantization (including AWQ and online INT8/INT4 KV quantization)
LMDeploy reports request throughput up to 1.8x higher than vLLM, attributing this to persistent batching, blocked KV, and optimized kernels. A usage sketch follows.
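A hedged sketch of LMDeploy’s Python pipeline with the TurboMind backend; the model name is a placeholder, and option names such as quant_policy and cache_max_entry_count follow the documented API but may differ across versions.

```python
# Sketch of LMDeploy's pipeline API with the TurboMind backend (model name is a
# placeholder; verify option names against your installed LMDeploy version).
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    cache_max_entry_count=0.8,  # fraction of free GPU memory given to the KV block pool
    quant_policy=8,             # online INT8 KV cache quantization (4 selects INT4 KV)
    tp=1,                       # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["What is persistent batching?"])
print(responses[0].text)
```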
Performance
Evaluations show:
- For a 4-bit Llama-style model on the A100, LMDeploy can achieve higher tokens per second than vLLM under comparable latency constraints, especially at high concurrency.
- 4-bit inference is reported to be about 2.4x faster than FP16 for supported models.
Latency:
- When configured without extreme batch limits, single-request TTFT is on par with other optimized GPU engines.
- Under high concurrency, persistent batching plus blocked KV let LMDeploy sustain high throughput without TTFT degrading sharply.
KV and memory behavior
- The blocked KV cache replaces contiguous per-sequence buffers with a pool of KV blocks managed by the runtime, similar in spirit to vLLM’s PagedAttention but with a different internal layout.
- Supports weight and KV quantization, targeting large models on constrained GPUs.
Where it fits
- NVIDIA-centric deployments that need maximum throughput and are comfortable adopting TurboMind and LMDeploy-specific tooling.
5. SGLang
Design
SGLang is both:
- A DSL for building structured LLM programs such as agents, RAG workflows, and tool pipelines
- A runtime implementing RadixAttention, a KV reuse mechanism that shares prefixes through a radix tree keyed on tokens rather than a simple block hash.
RadixAttention:
- Stores KV for many requests in a prefix tree keyed by tokens
- Achieves higher KV hit rates when multiple calls share a prefix, such as few-shot prompts, multi-turn chats, or tool chains (see the DSL sketch below)
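A sketch using SGLang’s frontend DSL, assuming a server is already running at the address shown; the function, endpoint, and generation parameters are placeholders drawn from the project’s published examples. Every call shares the same system prompt and few-shot header, so their KV hits the same radix-tree prefix.

```python
# Sketch of SGLang's frontend DSL (assumes an SGLang server is running at the
# endpoint below). All calls share the same system prompt and few-shot header,
# so the radix tree serves their prefix KV from cache.
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.system("You are a concise assistant.")
    s += sgl.user("Q: What is 2 + 2?\nA: 4\n\n" + question)  # shared few-shot prefix
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

for q in ["Q: What is the capital of France?", "Q: What is 10 * 7?"]:
    state = qa.run(question=q)
    print(state["answer"])
```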
Performance
Key insights:
- The SGLang paper reports up to 6.4x higher throughput and up to 3.7x lower latency than baseline systems such as vLLM and LMQL on structured workloads.
- The improvements are greatest when there is heavy prefix reuse, such as multi-turn chat or evaluation workloads with repeated contexts.
Reported KV cache hit rates range from roughly 50% to 99%, and the cache-aware scheduler comes close to the optimal hit rate on the measured benchmarks.
KV and memory behavior
- RadixAttention sits on top of paged-attention-style kernels and focuses on reuse, not just allocation (a toy sketch follows this list).
- SGLang integrates well with hierarchical context cache systems that can move KV between GPU and CPU when sequences are long, although these systems are usually implemented as separate projects.
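To show the reuse mechanics, here is a toy prefix-tree sketch keyed by token IDs. It is an illustration only; a real radix tree compresses runs of tokens into single edges and evicts nodes under memory pressure.

```python
# Toy prefix-tree illustration of the RadixAttention idea (not SGLang's code):
# cached KV is keyed by token-ID prefixes, and a new request reuses KV for the
# longest cached prefix it shares with earlier requests.
class Node:
    def __init__(self):
        self.children: dict[int, "Node"] = {}  # next token id -> child node

class PrefixCache:
    def __init__(self):
        self.root = Node()

    def insert(self, tokens: list[int]) -> None:
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, Node())  # every prefix is now cached

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        node, matched = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            matched += 1
        return matched  # number of leading tokens whose KV can be reused

cache = PrefixCache()
cache.insert([1, 2, 3, 4, 5])                      # e.g. system prompt + few-shot examples
print(cache.longest_cached_prefix([1, 2, 3, 9]))   # -> 3 tokens of KV reused
```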
Where it fits
- Agent systems, tool pipelines, and heavy RAG applications where many calls share large prompt prefixes and KV reuse matters at the application level.
6. DeepSpeed Inference / ZeRO-Inference
Design
DeepSpeed provides two inference-related pieces:
- DeepSpeed Inference: optimized transformer kernels plus tensor and pipeline parallelism
- ZeRO-Inference / ZeRO-Offload: techniques for offloading model weights, and in some setups the KV cache, to CPU or NVMe in order to run very large models on limited GPU memory
ZeRO-Inference focuses on:
- Keeping little or none of the model weights resident on the GPU
- Streaming tensors from CPU or NVMe as needed
- Targeting throughput and model size rather than low latency (a config sketch follows this list)
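A hedged sketch of a ZeRO-Inference style configuration with parameter offload; the keys mirror DeepSpeed’s documented zero_optimization config, but exact names and accepted values should be checked against your DeepSpeed version.

```python
# Sketch of a ZeRO stage-3 config with parameter offload to CPU (or NVMe).
# Key names follow DeepSpeed's documented config schema; verify them against
# the version you run before relying on this.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,                        # partition and offload parameters
        "offload_param": {
            "device": "cpu",               # or "nvme" for SSD offload
            # "nvme_path": "/local_nvme",  # required when device is "nvme"
            "pin_memory": True,
        },
    },
    "fp16": {"enabled": True},
}

# Typical usage (assumes `model` is an already constructed Hugging Face model):
# import deepspeed
# engine = deepspeed.initialize(model=model, config=ds_config)[0]
# engine.module.eval()  # then run generation with weights streamed in on demand
```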
Performance
In the published ZeRO-Inference example of OPT-30B on a single 32 GB V100:
- Full CPU offload reaches roughly 43 tokens per second
- Full NVMe offload reaches roughly 30 tokens per second
- Both are 1.3–2.4× faster than partial-offload configurations, because full offload enables larger batch sizes
These numbers are small compared to GPU-resident LLM runtimes on an A100 or H100, but they apply to a model that does not natively fit in 32 GB.
A recent I/O characterization of DeepSpeed and FlexGen confirms that offload-based systems are dominated by small 128 KiB reads and that I/O behavior becomes the main bottleneck.
KV and memory behavior
- Model weights (and sometimes KV blocks) are offloaded to the CPU or SSD to accommodate models that exceed GPU capacity.
- TTFT and P99 are higher compared to a pure GPU engine, but the trade-off is the ability to run very large models that would otherwise not fit.
Where it fits
- Offline or batch inference, or low QPS services where model size is more important than latency and GPU count.
Comparison table
This table qualitatively summarizes the main trade-offs:
| Runtime | Main design idea | Relative strength | KV strategy | Typical use cases |
|---|---|---|---|---|
| vLLM | PagedAttention, continuous batching | High tokens per second at a given TTFT | Paged KV blocks, FP8 KV support | General GPU serving, multiple hardware backends |
| TensorRT-LLM | Compiled kernels for NVIDIA plus KV reuse | Very low latency and high throughput on NVIDIA | Paged, quantized KV; reuse and offload | NVIDIA-only, latency-sensitive |
| TGI v3 | Hugging Face serving layer plus long-prompt path | Strong long-prompt performance, integrated stack | Paged KV, chunked prefill, prefix caching | HF-centric APIs, long chat histories |
| LMDeploy | TurboMind kernels, blocked KV, quantization | Up to 1.8x vLLM throughput in vendor tests | Blocked KV cache, weight and KV quantization | NVIDIA deployments focused on raw throughput |
| SGLang | RadixAttention and structured programs | Up to 6.4x throughput and 3.7x lower latency on structured workloads | Radix-tree KV reuse across prefixes | Agents, RAG, high prefix reuse |
| DeepSpeed | GPU/CPU/NVMe offload for very large models | Enables large models on small GPUs; throughput-oriented | Offloaded weights and sometimes KV | Very large models, offline or low QPS |
Runtime selection in practice
For production systems, the choices tend to break down into a few simple patterns:
- You want a strong default engine with minimal customization effort: start with vLLM. It gives you good throughput, reasonable TTFT, and robust KV handling on general-purpose hardware.
- You are committed to NVIDIA and want fine-grained control over latency and KV: use TensorRT-LLM, likely behind Triton or TGI. Plan for per-model engine builds and tuning.
- Your stack is on Hugging Face and you care about long chats: use TGI v3. Its long-prompt pipeline and prefix caching are very effective for conversational traffic.
- You want maximum throughput per GPU with quantized models: use LMDeploy with TurboMind and blocked KV, especially for 4-bit Llama-family models.
- You are building an agent, toolchain, or heavy RAG system: use SGLang and design your prompts so the KV reuse rate through RadixAttention stays high.
- You have to run very large models on limited GPUs: use DeepSpeed Inference / ZeRO-Inference, accept the higher latency, and treat the GPU as a throughput engine with an SSD in the loop.
Overall, all these engines converge on the same idea: the KV cache is the real bottleneck resource. The runtimes that win treat KV as a first-class data structure that can be paged, quantized, reused, and offloaded, rather than just a large tensor that has to fit into GPU memory.

Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.