Tencent Hunyuan releases HunyuanOCR: 1B parameter end-to-end OCR expert VLM
Tencent Hunyuan released HunyuanOCR, a 1B-parameter vision-language model specialized for OCR and document understanding. The model is built on Hunyuan's native multimodal architecture and runs text spotting, parsing, information extraction, visual question answering, and text-image translation through a single end-to-end pipeline.
HunyuanOCR is positioned as a lightweight alternative to general-purpose VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR-centric tasks. It targets production use cases such as document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.

Architecture, native-resolution ViT plus lightweight LLM
HunyuanOCR uses 3 main modules: a native-resolution visual encoder called Hunyuan ViT, an adaptive MLP connector, and a lightweight language model. The encoder is based on SigLIP-v2-400M and extended to support arbitrary input resolutions with adaptive patching that preserves the original aspect ratio. Images are segmented into patches based on their original proportions and processed with global attention, which improves recognition of long text lines, long documents, and low-quality scans.
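As a rough illustration of the adaptive patching idea, the sketch below derives a patch grid that preserves aspect ratio under a token budget. The 14-pixel patch size and 4096-token budget are hypothetical values for illustration, not published hyperparameters:

```python
import math

def native_patch_grid(width, height, patch=14, max_tokens=4096):
    """Derive a patch grid that keeps the image's aspect ratio.

    Scales the image down only if the token budget is exceeded,
    then snaps each side to a multiple of the patch size.
    """
    tokens = math.ceil(width / patch) * math.ceil(height / patch)
    scale = min(1.0, math.sqrt(max_tokens / tokens))  # shrink only when needed
    w = max(patch, round(width * scale / patch) * patch)
    h = max(patch, round(height * scale / patch) * patch)
    return w // patch, h // patch  # (cols, rows) of patches

# A wide receipt keeps its elongated shape instead of being squashed:
print(native_patch_grid(2000, 400))  # -> (142, 28)
```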
The adaptive MLP connector performs learnable pooling in the spatial dimension. It compresses dense visual tokens into shorter sequences while preserving information from text-dense regions. This shortens the sequence passed to the language model and cuts compute cost while preserving OCR-relevant detail.
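The article does not fully specify the connector design, but a minimal sketch of the idea, spatially merging neighboring visual tokens before a learnable MLP projection, might look like this (the merge factor and LLM width are assumptions; 1152 is the SigLIP-so400m feature width):

```python
import torch
import torch.nn as nn

class AdaptiveMLPConnector(nn.Module):
    """Sketch of a connector that pools visual tokens spatially,
    then projects them into the language model's embedding space."""

    def __init__(self, vit_dim=1152, llm_dim=1024, merge=2):
        super().__init__()
        self.merge = merge  # merge a (merge x merge) window into one token
        self.proj = nn.Sequential(
            nn.Linear(vit_dim * merge * merge, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x, rows, cols):
        # x: (batch, rows*cols, vit_dim) patch features from the ViT
        b, _, d = x.shape
        x = x.view(b, rows // self.merge, self.merge,
                   cols // self.merge, self.merge, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(
            b, (rows // self.merge) * (cols // self.merge), -1)
        return self.proj(x)  # 4x fewer tokens for merge=2

connector = AdaptiveMLPConnector()
feats = torch.randn(1, 28 * 142, 1152)            # grid from the sketch above
print(connector(feats, rows=28, cols=142).shape)  # torch.Size([1, 994, 1024])
```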
The language model is based on the dense Hunyuan 0.5B model and uses XD RoPE. XD RoPE splits the rotary position embedding into 4 subspaces: text, height, width, and time. This gives the model a native way to align the 1D token order with 2D layout and 3D spatiotemporal structure, so the same stack can handle multi-column pages, cross-page flows, and video frame sequences.
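A toy version of the subspace split clarifies the mechanism: the head dimension is cut into four chunks, and each chunk is rotated by its own position index. The split sizes and grid mapping below are hypothetical, not the published configuration:

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Standard 1D rotary embedding on the last dimension of x."""
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2).float() / d)
    ang = positions[:, None].float() * inv_freq          # (seq, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)

def xd_rope(x, text_pos, h_pos, w_pos, t_pos, split=(16, 16, 16, 16)):
    """Sketch of an XD-RoPE-style embedding: the head dimension is split
    into text / height / width / time subspaces, each rotated by its own
    position index (the split sizes here are hypothetical)."""
    parts, start = [], 0
    for size, pos in zip(split, (text_pos, h_pos, w_pos, t_pos)):
        parts.append(apply_rope(x[..., start:start + size], pos))
        start += size
    return torch.cat(parts, dim=-1)

q = torch.randn(994, 64)                 # (seq, head_dim) for a 14x71 grid
idx = torch.arange(994)
out = xd_rope(q, idx, idx // 71, idx % 71, torch.zeros(994, dtype=torch.long))
print(out.shape)  # torch.Size([994, 64])
```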
Training and inference follow a completely end-to-end paradigm. There are no external layout analysis or post-processing models in the loop. All tasks are expressed in the form of natural language prompts and processed in a single forward pass. This design eliminates error propagation across pipeline stages and simplifies deployment.
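In practice that means task switching is just prompt switching. The prompts below are illustrative paraphrases of the task types, not the model's official templates:

```python
# Illustrative task prompts (paraphrased; not the official templates).
PROMPTS = {
    "spotting":    "Detect all text in the image and output each line with its bounding box.",
    "parsing":     "Parse the document into markdown, keeping tables and formulas.",
    "extraction":  "Extract the fields {fields} from this receipt and return JSON.",
    "vqa":         "Answer the question about the image: {question}",
    "translation": "Translate all text in the image into {language}.",
}

def build_request(task, image_path, **kwargs):
    """One forward pass per task; no detector or layout model in front."""
    return {"image": image_path, "prompt": PROMPTS[task].format(**kwargs)}

print(build_request("extraction", "receipt.jpg", fields="total, date"))
```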


Data and pre-training recipes
The data pipeline produced over 200 million image-text pairs covering nine real-world scenarios: street scenes, documents, advertisements, handwritten text, screenshots, cards, certificates and invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.
Synthetic data comes from a multilingual generator that supports right-to-left scripts and paragraph-level rendering. The pipeline controls font, language, rotation, and RGB values, and applies distortion, blur, and local lighting changes to simulate real-world capture and other hard conditions.
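A minimal sketch of this kind of synthetic rendering, using Pillow, with the default font standing in for the per-script .ttf files a real generator would use:

```python
from PIL import Image, ImageDraw, ImageFilter, ImageFont

def render_sample(text, angle=8, blur=1.2, fg=(20, 20, 20), bg=(245, 240, 230)):
    """Render a synthetic OCR training sample: draw text, then apply the
    kinds of degradations described above (rotation, blur)."""
    img = Image.new("RGB", (480, 96), bg)
    draw = ImageDraw.Draw(img)
    font = ImageFont.load_default()   # swap in a .ttf per target script
    draw.text((12, 36), text, fill=fg, font=font)
    img = img.rotate(angle, expand=True, fillcolor=bg)   # simulate skew
    img = img.filter(ImageFilter.GaussianBlur(blur))     # simulate defocus
    return img

render_sample("Invoice No. 2024-0917").save("synthetic_sample.png")
```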


Pre-training is divided into 4 stages. Stage 1 uses 50B tokens and an 8k context for vision-language alignment, using plain text, synthetic parsing and recognition data, and general caption data. Stage 2 runs multimodal pre-training on 300B tokens, mixing plain text with synthetic spotting, parsing, translation, and VQA samples. Stage 3 extends the context length to 32k, with 80B tokens focused on long documents and long text. Stage 4 is application-oriented supervised fine-tuning on 24B tokens of human annotations and hard-negative data, preserving the 32k context and unified instruction templates.
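Collected as a schedule table, the recipe looks like this (the stage-2 context length is not stated in the article and is assumed here to stay at 8k):

```python
# The four-stage recipe as a schedule table (numbers from the article).
STAGES = [
    {"stage": 1, "tokens": "50B",  "context": 8_192,
     "focus": "vision-language alignment: plain text, synthetic parsing/recognition, captions"},
    {"stage": 2, "tokens": "300B", "context": 8_192,   # assumed, not stated
     "focus": "multimodal pre-training: spotting, parsing, translation, VQA"},
    {"stage": 3, "tokens": "80B",  "context": 32_768,
     "focus": "long documents and long text"},
    {"stage": 4, "tokens": "24B",  "context": 32_768,
     "focus": "application-oriented SFT: human annotations, hard negatives"},
]

for s in STAGES:
    print(f"stage {s['stage']}: {s['tokens']:>4} tokens @ {s['context']} ctx - {s['focus']}")
```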
Reinforcement learning with verifiable rewards
After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) in a reinforcement-learning-with-verifiable-rewards setup for structured tasks. For text spotting, rewards are based on Intersection-over-Union (IoU) box matching and the normalized edit distance on the text. For document parsing, the reward uses the normalized edit distance between the generated structure and the reference.
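A verifiable reward of this shape is straightforward to reproduce. The sketch below scores a predicted box-text pair against a reference with IoU matching and 1 minus the normalized edit distance; the 0.5 IoU threshold is an assumption, not a published value:

```python
def iou(a, b):
    """Intersection-over-Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def norm_edit_distance(pred, ref):
    """Levenshtein distance normalized by the longer string's length."""
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    row = list(range(n + 1))
    for i in range(1, m + 1):
        prev, row[0] = row[0], i
        for j in range(1, n + 1):
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1,
                                       prev + (pred[i - 1] != ref[j - 1]))
    return row[n] / max(m, n)

def spotting_reward(pred_box, pred_text, ref_box, ref_text, iou_thresh=0.5):
    """Verifiable reward: the box must match; the text score is 1 - NED."""
    if iou(pred_box, ref_box) < iou_thresh:
        return 0.0
    return 1.0 - norm_edit_distance(pred_text, ref_text)

print(spotting_reward((10, 10, 100, 40), "Tota1: $42",
                      (12, 9, 102, 41), "Total: $42"))  # 0.9
```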
For VQA and translation, the system uses an LLM as the judge. VQA uses binary rewards that check for semantic matches. Translation is scored COMET-style by the LLM judge on a [0, 5] scale, normalized to [0, 1]. The training framework enforces length limits and strict formatting, and assigns zero reward when the output overflows or breaks the expected schema, which stabilizes optimization and encourages valid JSON or structured output.
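The format guards compose naturally with the judge score. A minimal sketch, with a hypothetical length limit and assuming the task expects JSON output:

```python
import json

MAX_OUTPUT_CHARS = 4096  # hypothetical length limit

def format_guarded_reward(output, judge_score_0_to_5):
    """Sketch of the reward guards described above: overlong or malformed
    output gets zero reward; otherwise the judge's [0, 5] score is
    normalized to [0, 1]."""
    if len(output) > MAX_OUTPUT_CHARS:
        return 0.0
    try:
        json.loads(output)          # task expects structured JSON output
    except json.JSONDecodeError:
        return 0.0
    return max(0.0, min(judge_score_0_to_5 / 5.0, 1.0))

print(format_guarded_reward('{"translation": "Hello"}', 4.2))  # 0.84
print(format_guarded_reward('not json at all', 5.0))           # 0.0
```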
Benchmark results, 1B model competes with larger VLMs
On an in-house text recognition benchmark of 900 images across 9 categories, HunyuanOCR achieves an overall score of 70.92. Despite using far fewer parameters, it outperforms traditional pipeline methods such as PaddleOCR and Baidu OCR, as well as general-purpose VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision.
On OmniDocBench, HunyuanOCR scores 94.10 overall, 94.73 on formulas, and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents with folds and lighting changes, it reaches an overall score of 85.21. On DocML, a multilingual parsing benchmark covering 14 non-Chinese, non-English languages, it achieves 91.03, and the paper reports state-of-the-art results across all 14 languages.
On information extraction and VQA, HunyuanOCR reaches an accuracy of 92.29 on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than the similarly sized DeepSeek OCR and close to larger general-purpose VLMs such as Qwen3 VL 2B Instruct and Gemini 2.5 Pro.
For text-image translation, HunyuanOCR is evaluated on the DoTA benchmark and a DocML-based internal set. The model achieves high COMET scores on DoTA English-to-Chinese document translation and took first place in Track 2.2 (OCR-free, small model) of the ICDAR 2025 DIMT competition.


Main points
- Compact end-to-end OCR VLM: HunyuanOCR is a 1B-parameter OCR-centric vision-language model that connects a 0.4B native-resolution ViT to a 0.5B Hunyuan language model via an MLP adapter and runs text spotting, parsing, information extraction, VQA, and translation in an end-to-end, instruction-driven pipeline with no external layout or detection modules.
- Unified support for multiple OCR scenarios: The model is trained on more than 200 million image-text pairs across 9 scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards, certificates and invoices, game interfaces, and video frames. Training covers more than 130 languages, and deployment supports more than 100 languages.
- Staged training plus verifiable-reward RL: The model is trained with a 4-stage recipe, vision-language alignment, multimodal pre-training, long-context pre-training, and application-oriented supervised fine-tuning, followed by reinforcement learning with Group Relative Policy Optimization (GRPO) and verifiable rewards for spotting, parsing, VQA, and translation.
- Strong benchmark results below 3B: HunyuanOCR achieves 94.1 on OmniDocBench and 860 on OCRBench, reportedly state-of-the-art among vision-language models under 3B parameters, while also outperforming multiple commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.
Editor’s Note
HunyuanOCR is a strong signal that OCR-specific VLMs are maturing into practical infrastructure, not just benchmark entries. Tencent combines a 1B-parameter end-to-end architecture, a native-resolution vision transformer, an adaptive MLP connector, and RL with verifiable rewards to deliver a single model covering recognition, parsing, information extraction, VQA, and translation across more than 100 languages, while posting the leading sub-3B score on OCRBench and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction-driven OCR engines that are realistic for production deployment.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for the benefit of society. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easy to understand for a broad audience. The platform has more than 2 million monthly views, reflecting its popularity among readers.