Hugging Face Open-Sources FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs)
Hugging Face just released FineVision, an open multimodal dataset designed to set a new standard for training vision-language models (VLMs). With 17.3 million images, 24.3 million samples, 88.9 million Q&A turns, and almost 10 billion answer tokens, FineVision positions itself as one of the largest structured, publicly available VLM training datasets.
FineVision aggregates more than 200 sources into a unified format, strictly filtered for duplication and benchmark contamination. The dataset is systematically rated on multiple quality dimensions, allowing researchers and developers to build powerful training mixtures while minimizing data leakage.
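For readers who want to inspect the data directly, here is a minimal sketch of streaming a few samples with the Hugging Face `datasets` library. The repository identifier and the field layout below are assumptions; verify them against the dataset card on the Hub.

```python
# Minimal sketch: stream a few FineVision samples without downloading the
# full ~5 TB corpus. The repo id is an assumption; check the dataset card
# on the Hugging Face Hub for the exact identifier and available configs.
from datasets import load_dataset

ds = load_dataset("HuggingFaceM4/FineVision", split="train", streaming=True)

for sample in ds.take(3):
    # Each sample is expected to bundle an image with one or more Q&A turns.
    print(sample.keys())
```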
Why is FineVision important for VLM training?
Most state-of-the-art VLMs rely on proprietary datasets, limiting reproducibility and accessibility for the wider research community. FineVision delivers:
- Scale and coverage: 5 TB of curated data across 9 categories, including general VQA, OCR QA, chart and table reasoning, science, captioning, grounding and counting, and GUI navigation.
- Benchmark gains: Across 11 widely used benchmarks (e.g., AI2D, ChartQA, DocVQA, ScienceQA, OCRBench), models trained on FineVision outperform the alternatives: by 46.3% over LLaVA, 40.7% over Cauldron, and 12.1% over Cambrian.
- New skill areas: FineVision introduces data for emerging tasks such as GUI navigation, pointing, and counting, extending VLM capabilities beyond standard captioning and VQA.

How was FineVision built?
The curation pipeline follows a three-step process:
- Collect and augment: More than 200 publicly available image-text datasets were gathered. Missing modalities (e.g., text-only data) were reformatted into QA pairs, and underrepresented domains, such as GUI data, were supplemented through targeted collection.
- Clean (a minimal sketch of these rules follows the sub-list below):
- Removed oversized QA pairs (>8192 tokens).
- Resized large images to a maximum of 2048 px while preserving aspect ratios.
- Discarded corrupted samples.
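The sketch below illustrates how these three cleaning rules could be applied. It is not the team's actual pipeline code; `tokenizer` stands for any object with an `encode` method (e.g., a Hugging Face tokenizer).

```python
# Illustrative implementation of the three cleaning rules described above;
# the FineVision team's actual pipeline may differ.
from io import BytesIO
from PIL import Image

MAX_TOKENS = 8192  # QA pairs above this length are dropped
MAX_SIDE = 2048    # longest image side after resizing

def clean_sample(image_bytes: bytes, qa_text: str, tokenizer):
    # Rule 1: drop oversized QA pairs (>8192 tokens).
    if len(tokenizer.encode(qa_text)) > MAX_TOKENS:
        return None
    # Rule 3: discard corrupted samples that fail to decode.
    try:
        img = Image.open(BytesIO(image_bytes))
        img.load()
    except OSError:
        return None
    # Rule 2: resize large images to at most 2048 px, keeping the aspect ratio.
    if max(img.size) > MAX_SIDE:
        scale = MAX_SIDE / max(img.size)
        img = img.resize((round(img.width * scale), round(img.height * scale)),
                         Image.LANCZOS)
    return img, qa_text
```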
- Quality rating: Using Qwen3-32B and Qwen2.5-VL-32B-Instruct as judges, each QA pair was rated on four axes:
- Text formatting quality
- Question-answer relevance
- Visual dependency
- Image-question correspondence
These ratings enable selective training mixtures, although ablations indicate that retaining all samples yields the best performance, even when lower-rated samples are included.
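As an illustration, here is a hedged sketch of building such a selective mixture by filtering on the rating columns. The column names are hypothetical, not the dataset's real schema; check the dataset card for the actual field names.

```python
# Hypothetical sketch of a rating-based mixture. The rating column names
# below are assumptions, not the dataset's confirmed schema.
from datasets import load_dataset

RATING_AXES = ["formatting", "relevance", "visual_dependency", "image_correspondence"]

def meets_threshold(sample, threshold=3):
    # Keep a sample only if every rating axis reaches the threshold.
    return all(sample.get(axis, 0) >= threshold for axis in RATING_AXES)

ds = load_dataset("HuggingFaceM4/FineVision", split="train", streaming=True)
high_quality = ds.filter(meets_threshold)
```

Note that, per the ablations above, training on everything tends to beat aggressive filtering, so thresholds like this are mainly useful for building targeted mixtures.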
Comparative Analysis: FineVision vs. Existing Open Datasets
| Dataset | Images | Samples | Turns | Tokens | Leakage | Perf. after dedup |
|---|---|---|---|---|---|---|
| Cauldron | 2.0M | 1.8M | 27.8M | 0.3B | 3.05% | -2.39% |
| LLaVA-Vision | 2.5M | 3.9M | 9.1M | 1.0B | 2.15% | -2.72% |
| Cambrian-7M | 5.4M | 7.0M | 12.2M | 0.8B | 2.29% | -2.78% |
| FineVision | 17.3M | 24.3M | 88.9M | 9.5B | 1.02% | -1.45% |
FineVision is not only the largest but also the least contaminated of these datasets, with only about 1% overlap with benchmark sets. This ensures minimal data leakage and reliable performance evaluation.
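To make the leakage numbers concrete, below is a minimal sketch of one common way to measure train/benchmark overlap, using perceptual hashes. The FineVision team's exact deduplication method may differ; this only illustrates the idea.

```python
# Sketch of a contamination check via perceptual hashing (pip install imagehash).
# Illustrative only; the actual FineVision deduplication procedure may use a
# different similarity measure.
import imagehash
from PIL import Image

def build_benchmark_hashes(benchmark_image_paths):
    return [imagehash.phash(Image.open(p)) for p in benchmark_image_paths]

def is_leaked(train_image_path, benchmark_hashes, max_distance=4):
    h = imagehash.phash(Image.open(train_image_path))
    # Hash difference is a Hamming distance; small distances mean near-duplicates.
    return any(h - bh <= max_distance for bh in benchmark_hashes)
```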
Performance insights
- Model setup: uses nanoVLM (460M parameters), combining SmolLM2-360M-Instruct as the language backbone and SigLIP2-Base-512 as the vision encoder (a minimal architecture sketch follows this list).
- Training efficiency: on 32 NVIDIA H100 GPUs, a full epoch (12k steps) takes about 20 hours.
- Performance Trends:
- After about 12k steps, models trained on FineVision steadily pull ahead of the baselines, thanks to exposure to more diverse data.
- Deduplication experiments confirm FineVision's low leakage compared with Cauldron, LLaVA, and Cambrian.
- Even though the language backbone is monolingual, keeping multilingual subsets in the mixture yields a slight performance increase, indicating that data diversity outweighs strict language alignment.
- Multi-stage training experiments (two or 2.5 stages) did not yield consistent benefits, reinforcing that scale and diversity matter more than training heuristics.
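For orientation, here is a heavily simplified sketch of what a nanoVLM-style model looks like: a vision encoder, a linear projector, and a language backbone. The checkpoint names and the fusion logic are illustrative assumptions; the real implementation lives in the nanoVLM repository.

```python
# Simplified nanoVLM-style wiring: SigLIP2 vision features are projected into
# the SmolLM2 embedding space and prepended to the text tokens. Checkpoint
# names and fusion details are illustrative; see the nanoVLM repo for the
# actual implementation.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoModelForCausalLM

class TinyVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = AutoModel.from_pretrained("google/siglip2-base-patch16-512")
        self.lm = AutoModelForCausalLM.from_pretrained(
            "HuggingFaceTB/SmolLM2-360M-Instruct")
        # Map vision hidden states into the language model's embedding space.
        self.projector = nn.Linear(
            self.vision.config.vision_config.hidden_size,
            self.lm.config.hidden_size,
        )

    def forward(self, pixel_values, input_ids):
        image_feats = self.vision.vision_model(pixel_values=pixel_values).last_hidden_state
        image_embeds = self.projector(image_feats)
        text_embeds = self.lm.get_input_embeddings()(input_ids)
        # Prepend image tokens to the text sequence (simplified fusion).
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        return self.lm(inputs_embeds=inputs_embeds)
```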
Why does FineVision set a new standard?
- +20% average performance improvement: outperforms all existing open datasets across more than 10 benchmarks.
- Unprecedented scale: 17M+ images, 24M+ samples, ~10B answer tokens.
- Skill extension: GUI navigation, counting, pointing and document reasoning.
- Minimal data leakage: about 1% contamination, versus 2-3% for other datasets.
- Fully open source: immediately accessible on the Hugging Face Hub via the `datasets` library.
Conclusion
FineVision marks a significant advancement in open multimodal datasets. Its large scale, systematic curation, and transparent quality assessment create a reproducible and scalable foundation for training state-of-the-art vision-language models. By reducing reliance on proprietary resources, it enables researchers and developers to build competitive systems and accelerates progress in areas such as document analysis, visual reasoning, and agentic multimodal tasks.
Check out the dataset and technical details. Feel free to check out our GitHub page for tutorials, code, and notebooks. Also, follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.