vLLM vs. TensorRT-LLM vs. HF TGI vs. LMDeploy: an in-depth technical comparison of production LLM inference
Serving LLMs in production is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack determines tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet.
This comparison focuses on four widely used stacks:
- vLLM
- NVIDIA TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy

1. vLLM and PagedAttention as open baselines
Core concept
vLLM is built around PagedAttention, an attention implementation that treats the KV cache as paged virtual memory rather than a single contiguous buffer per sequence.
Instead of allocating one large contiguous KV region per request, vLLM:
- divides the KV cache into fixed-size blocks
- maintains a block table that maps logical token positions to physical blocks
- shares blocks between sequences with overlapping prefixes
This reduces external fragmentation and lets the scheduler pack more concurrent sequences into the same VRAM.
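To make the block-table idea concrete, here is a minimal sketch in plain Python of logical-to-physical block mapping with prefix sharing. It illustrates the concept only; the names (`BlockAllocator`, `BLOCK_SIZE`) are invented for this example and are not vLLM's actual data structures.

```python
# Minimal illustration of paged KV allocation: logical token blocks map to
# physical blocks, and sequences with a shared prefix reuse the same blocks.
# Conceptual sketch only, not vLLM's real implementation.
BLOCK_SIZE = 16  # tokens per KV block (hypothetical)

class BlockAllocator:
    def __init__(self, num_physical_blocks: int):
        self.free = list(range(num_physical_blocks))

    def allocate(self) -> int:
        return self.free.pop()

class Sequence:
    def __init__(self, allocator: BlockAllocator, shared_prefix_blocks=None):
        self.allocator = allocator
        # Prefix sharing: reuse the physical blocks of a common prompt prefix.
        self.block_table = list(shared_prefix_blocks or [])
        self.num_tokens = len(self.block_table) * BLOCK_SIZE

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:
            # Current block is full (or sequence is empty): grab a new one.
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_physical_blocks=1024)
prefix = Sequence(allocator)
for _ in range(48):            # a 48-token shared system prompt -> 3 blocks
    prefix.append_token()

a = Sequence(allocator, shared_prefix_blocks=prefix.block_table)
b = Sequence(allocator, shared_prefix_blocks=prefix.block_table)
assert a.block_table[:3] == b.block_table[:3]  # same physical blocks reused
```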
Throughput and latency
In its published benchmarks, vLLM improves throughput by 2–4× compared to systems such as FasterTransformer and Orca at similar latency, with larger gains for longer sequences.
Key operational properties:
- Continuous batching (also known as in-flight batching) merges incoming requests into the running GPU batch instead of waiting for a fixed batching window; a minimal scheduling sketch follows this list.
- On a typical chat workload, throughput scales nearly linearly with concurrency until KV memory or compute saturates.
- At moderate concurrency, P50 latency stays low, but once queues grow long or KV memory gets tight, especially with large prefills, P99 latency can degrade sharply.
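The sketch below captures the continuous-batching pattern in plain Python: finished sequences leave the running batch and queued requests are admitted at every decode step. The names (`step`, `waiting_queue`, `MAX_BATCH`) are invented for illustration and do not correspond to vLLM's internals.

```python
from collections import deque

# Conceptual sketch of continuous (in-flight) batching: finished sequences
# leave the batch and queued requests are admitted every decode step,
# rather than waiting for a whole batch to drain.
waiting_queue = deque()   # incoming requests
running_batch = []        # sequences currently being decoded
MAX_BATCH = 64            # illustrative cap; in practice KV memory is the limit

def step(decode_one_token, is_finished):
    # 1) Admit new requests while there is batch/KV capacity.
    while waiting_queue and len(running_batch) < MAX_BATCH:
        running_batch.append(waiting_queue.popleft())
    # 2) Decode one token for every running sequence in a single GPU pass.
    for seq in running_batch:
        decode_one_token(seq)
    # 3) Retire finished sequences immediately, freeing their KV blocks.
    running_batch[:] = [s for s in running_batch if not is_finished(s)]
```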
vLLM exposes an OpenAI-compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is so widely used as an open baseline.
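Because the server speaks the OpenAI protocol, existing OpenAI clients work unchanged. A minimal sketch, assuming a vLLM OpenAI-compatible server is already running locally on the default port 8000 and serving the model named below (both the model and the key are placeholders):

```python
# Query a locally running vLLM OpenAI-compatible server with the standard
# openai client. Model name, port, and API key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```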
KV and multi-tenancy
- PagedAttention brings KV waste close to zero and enables flexible prefix sharing within and across requests.
- Each vLLM process serves one model; multi-tenant and multi-model setups are typically built with external routers or API gateways that fan out to multiple vLLM instances.
2. TensorRT-LLM: extracting the hardware maximum on NVIDIA GPUs
Core concept
TensorRT-LLM is NVIDIA’s GPU-optimized inference library. It provides custom attention kernels, in-flight batching, a paged KV cache, quantization down to FP4 and INT4, and speculative decoding.
It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.
Measured performance
NVIDIA’s published H100 vs. A100 comparison is the most specific public reference:
- On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/second at peak throughput with 64 concurrent requests, and a time to first token (TTFT) of roughly 100 ms.
- H100 with FP8 delivers up to 4.6× higher maximum throughput and 4.4× faster first-token latency than the same model on A100.
In latency-sensitive configurations:
- TensorRT-LLM on H100 can drive TTFT below 10 ms at batch size 1, at the cost of lower overall throughput.
These figures are model- and shape-specific, but they give a realistic sense of the proportions.
Prefill and decode
TensorRT-LLM optimizes both stages:
- Prefill benefits from high-throughput FP8 attention kernels and tensor parallelism
- Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV cache, and kernel fusion
The result is very high tokens/second over a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.
KV and multi-tenancy
TensorRT-LLM provides:
- A paged KV cache with configurable layout
- Long-sequence support, KV reuse, and KV offloading
- In-flight batching and priority-aware scheduling primitives
NVIDIA pairs this with orchestration through Triton Inference Server or Ray for multi-tenant clusters; multi-model support lives at the orchestrator level, not inside a single TensorRT-LLM engine instance.
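For a sense of the developer surface, the sketch below uses the high-level Python `LLM` API that recent TensorRT-LLM releases expose; the model name is a placeholder and argument names can differ between versions, so treat this as an assumption-laden sketch rather than canonical usage.

```python
# Hedged sketch of TensorRT-LLM's high-level Python API; engine building,
# quantization, and parallelism are normally configured per deployment.
from tensorrt_llm import LLM, SamplingParams  # high-level API in recent releases

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds/loads an engine

params = SamplingParams(max_tokens=128, temperature=0.7)
outputs = llm.generate(["Explain in-flight batching in one paragraph."], params)
for out in outputs:
    print(out.outputs[0].text)
```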
3. Hugging Face TGI v3: long-prompt specialist and multi-backend gateway
Core concept
Text Generation Inference (TGI) is a Rust- and Python-based serving stack that adds:
- HTTP and gRPC APIs
- A continuous batching scheduler
- Observability and autoscaling hooks
- Pluggable backends, including vLLM-style engines, TensorRT-LLM, and other runtimes
Version 3 focuses on long-prompt processing through chunking and prefix caching.
Long prompt benchmark vs. vLLM
The TGI v3 documentation gives clear benchmarks:
- On a long prompt exceeding 200,000 tokens, a conversation reply that takes vLLM 27.5 seconds can be delivered in roughly 2 seconds by TGI v3.
- That is reported as a 13× speedup on that workload.
- TGI v3 can also process roughly 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching.
The mechanism:
- TGI retains the earlier conversation context in a prefix cache, so only the incremental tokens are paid for in subsequent turns
- The cache lookup cost is on the order of microseconds, which is negligible relative to prefill compute
This is a targeted optimization for workloads where very long prompts are reused across turns, such as RAG pipelines and analytic summarization.
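A minimal sketch of exercising this behavior against a running TGI v3 server (for example, launched from the official Docker image with `--model-id` set and mapped to port 8080; the port, document, and timings here are assumptions): two turns share the same long document prefix, so the second request should hit the prefix cache and skip most of the prefill.

```python
# Send two turns that share the same very long document prefix. On TGI v3,
# the second request should reuse the cached prefix instead of re-prefilling.
import requests

TGI_URL = "http://localhost:8080/generate"   # assumed local TGI endpoint
long_document = open("contract.txt").read()  # placeholder long context

def ask(question: str) -> str:
    payload = {
        "inputs": f"{long_document}\n\nQuestion: {question}\nAnswer:",
        "parameters": {"max_new_tokens": 200},
    }
    return requests.post(TGI_URL, json=payload, timeout=300).json()["generated_text"]

print(ask("What is the termination clause?"))   # pays the full prefill once
print(ask("Who are the contracting parties?"))  # reuses the cached prefix
```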
Architecture and latency behavior
Key components:
- Chunking: very long prompts are split into manageable segments for KV management and scheduling
- Prefix cache: a data structure that shares long context across turns
- Continuous batching: incoming requests join the batch of already running sequences
- PagedAttention and fused kernels on the GPU backends
For short chat-style workloads, throughput and latency are roughly on par with vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill; a conceptual sketch of the prefix-cache lookup follows.
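Conceptually, the prefix cache can be pictured as a lookup from a hash of the prompt's leading token blocks to KV state that has already been computed. The sketch below is illustrative only and does not reflect TGI's actual data structures; the block size and function names are invented, and it stores only the full prompt's key for simplicity.

```python
import hashlib

BLOCK = 256  # tokens per cached chunk (illustrative granularity)
prefix_cache: dict[str, object] = {}  # hash of a block-aligned prefix -> KV state

def _key(token_ids):
    return hashlib.sha256(repr(token_ids).encode()).hexdigest()

def prefill_with_cache(token_ids, compute_kv_from):
    """Reuse KV for the longest cached block-aligned prefix, prefill only the rest."""
    cached_len, cached_kv = 0, None
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        key = _key(token_ids[:end])
        if key in prefix_cache:
            cached_len, cached_kv = end, prefix_cache[key]
        else:
            break
    # Only the uncached suffix pays prefill compute; lookups above are cheap.
    kv = compute_kv_from(cached_kv, token_ids[cached_len:])
    prefix_cache[_key(token_ids)] = kv  # simplified: cache the full prompt only
    return kv
```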
Multiple backends, multiple models
TGI is designed around a router + model-server architecture. It can:
- Route requests across multiple models and replicas
- Target different backends, such as TensorRT-LLM on H100 alongside CPU or smaller GPUs for low-priority traffic
This makes it suitable as a central service layer in a multi-tenant environment.
4. LMDeploy and TurboMind: blocked KV cache and aggressive quantization
Core concept
LMDeploy, from the InternLM ecosystem, is a toolkit for compressing and serving LLMs on top of the TurboMind engine. Its focus is:
- High-throughput request serving
- A blocked KV cache
- Persistent (continuous) batching
- Weight and KV-cache quantization
Relative throughput versus vLLM
The project states:
- "LMDeploy’s request throughput is 1.8× higher than vLLM", supported by persistent batching, a blocked KV cache, dynamic split-and-fuse, tensor parallelism, and optimized CUDA kernels.
KV cache, quantization, and latency
LMDeploy includes:
- A blocked KV cache, similar to a paged KV cache, which helps pack many sequences into VRAM
- KV cache quantization, typically int8 or int4, to cut KV memory and bandwidth
- Weight-only quantization paths, e.g. 4-bit AWQ
- A benchmarking tool that reports token throughput, request throughput, and first-token latency
This makes LMDeploy attractive when you want to run larger open models (such as InternLM or Qwen) on moderate GPUs with aggressive compression while still keeping good tokens/second.
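A minimal sketch of that setup using LMDeploy's Python `pipeline` API with the TurboMind backend; the model ID, cache sizing, and the int8 KV setting are illustrative assumptions and may need adjusting across LMDeploy versions.

```python
# Serve a quantized open model with LMDeploy's TurboMind backend and an
# int8-quantized KV cache. Model name and config values are placeholders.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,             # int8 KV cache (4 requests int4 KV)
    cache_max_entry_count=0.8,  # fraction of free VRAM reserved for KV blocks
    tp=1,                       # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)
responses = pipe(["Explain blocked KV caching in two sentences."])
print(responses[0].text)
```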
Multi-model deployment
LMDeploy also provides a proxy server that can handle:
- Multi-model deployment
- Multi-machine, multi-GPU setup
- Routing logic for model selection based on request metadata
So architecturally, it is closer to TGI’s router-plus-engine design than to a bare single-engine library.
When to use what?
- If you want maximum throughput and extremely low TTFT on NVIDIA GPUs
  - TensorRT-LLM is the first choice
  - It uses FP8 and lower precision, custom kernels, and speculative decoding to push tokens/second while keeping TTFT around 100 ms at high concurrency and under 10 ms at low concurrency
- If your workload is dominated by long, reused prompts, such as RAG over large contexts
  - TGI v3 is a strong default
  - Its prefix caching and chunking deliver roughly 3× token capacity and a 13× latency reduction over vLLM in the published long-prompt benchmark, with no additional configuration required
- If you want an open, simple engine with strong benchmark results and an OpenAI-style API
  - vLLM remains the standard baseline
  - PagedAttention and continuous batching give it 2–4× higher throughput than older stacks at similar latency, and it integrates cleanly with Ray and Kubernetes
- If you target open models (e.g. InternLM or Qwen) and value aggressive quantization plus multi-model serving
  - LMDeploy is a strong fit
  - Its blocked KV cache, persistent batching, and int8 or int4 KV quantization are reported to deliver 1.8× higher request throughput than vLLM on supported models, and it ships a router layer
In practice, many teams mix these systems: TensorRT-LLM for high-volume proprietary chat, TGI v3 for long-context analysis, and vLLM or LMDeploy for experimental and open-model workloads. The key is to match throughput, latency tails, and KV behavior to the actual token distribution of your traffic, then work out cost per million tokens from tokens/second measured on your own hardware.
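As a concrete illustration of that last step, the arithmetic below converts a measured aggregate tokens/second figure and an hourly GPU price into cost per million output tokens; both input numbers are made up for the example.

```python
# Cost per million tokens = GPU $/hour / (tokens/sec * 3600 sec/hour) * 1e6.
# Both inputs are placeholders; plug in your own measurements and prices.
gpu_cost_per_hour = 4.00        # e.g. an on-demand H100 price in $/hour
measured_tokens_per_sec = 2500  # aggregate output tokens/sec across the batch

tokens_per_hour = measured_tokens_per_sec * 3600
cost_per_million_tokens = gpu_cost_per_hour / tokens_per_hour * 1_000_000
print(f"${cost_per_million_tokens:.3f} per 1M output tokens")
# -> $0.444 per 1M output tokens with these example numbers
```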
References
- vLLM / PagedAttention
- TensorRT-LLM performance overview
- Hugging Face Text Generation Inference (TGI v3) long-prompt benchmarks
- LMDeploy / TurboMind

Michal Sutter is a data science professional with a master’s degree in data science from the University of Padua. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex data sets into actionable insights.