
VLM2Vec-V2: A unified vision-language framework for multimodal embedding learning across images, videos, and visual documents

Embedding models serve as a bridge between different data modalities by encoding diverse multimodal information into a shared dense representation space. In recent years, these models have advanced rapidly, driven by progress in large foundation models. However, existing multimodal embedding models are trained on datasets such as MMEB and M-BEIR, which focus mostly on natural images and photographs from the MSCOCO, Flickr, and ImageNet datasets. These datasets do not cover broader forms of visual information, including documents, PDFs, websites, videos, and slides. As a result, existing embedding models perform poorly on practical tasks such as article search, website search, and YouTube video search.

Multimodal embedding benchmarks such as MSCOCO, Flickr30K, and Conceptual Captions initially focused on static image-text pairs for tasks such as image captioning and retrieval. More recent benchmarks such as M-BEIR and MMEB introduced multi-task evaluation, but they remain limited to static images and short contexts. Video representation learning has evolved through models such as VideoCLIP and VideoCoCa, which combine contrastive learning with captioning objectives. Visual document representation learning has advanced through models such as ColPali and VisRAG, which use VLMs for document retrieval. Unified modality retrieval methods such as GME and Uni-Retrieval achieve strong performance on general benchmarks. However, none of them unifies image, video, and visual document retrieval in a single framework.

Researchers at Salesforce Research, UC Santa Barbara, the University of Waterloo, and Tsinghua University propose unifying image, video, and visual document retrieval within a single framework. First, they developed MMEB-V2, a benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. Second, they introduce VLM2Vec-V2, a general-purpose embedding model that supports multiple input modalities while demonstrating strong performance on both the newly introduced tasks and the original image benchmarks. This lays the foundation for more scalable and flexible representation learning in both research and practical applications.

VLM2Vec-V2 uses Qwen2-VL as its backbone, selected for its capabilities specialized for multimodal processing. Qwen2-VL provides three key features that support unified embedding learning: naive dynamic resolution, Multimodal Rotary Position Embedding (M-RoPE), and a unified framework combining 2D and 3D convolutions. To enable effective multi-task training across diverse data sources, VLM2Vec-V2 introduces a flexible data sampling pipeline with two key components: (a) on-the-fly batch mixing based on predefined sampling weight tables that control the relative probability of each dataset, and (b) an interleaved sub-batching strategy that splits full batches into independently sampled sub-batches, improving the stability of contrastive learning.
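The article does not spell out the actual weight table or dataset mix, but a minimal sketch of such a sampling pipeline might look like the following. The dataset names, weights, and the `mixed_batches` helper are illustrative assumptions, not the authors' implementation.

```python
import random

# Hypothetical sampling-weight table; the real datasets and weights
# used by VLM2Vec-V2 are not given in this article.
SAMPLING_WEIGHTS = {
    "image_retrieval": 0.35,
    "visual_doc_retrieval": 0.30,
    "video_retrieval": 0.20,
    "video_qa": 0.15,
}

def mixed_batches(datasets, weights, batch_size, num_sub_batches, seed=0):
    """Yield training batches mixed on the fly from several datasets.

    Each full batch is split into `num_sub_batches` interleaved sub-batches.
    Every sub-batch is drawn from a single dataset chosen according to the
    weight table, so in-batch negatives inside a sub-batch come from one
    source while the full batch still mixes tasks and modalities.
    """
    rng = random.Random(seed)
    names = list(weights)
    probs = [weights[n] for n in names]
    sub_size = batch_size // num_sub_batches

    while True:
        batch = []
        for _ in range(num_sub_batches):
            # Pick the source dataset for this sub-batch by sampling weight.
            source = rng.choices(names, weights=probs, k=1)[0]
            # Sample the sub-batch independently from that dataset.
            sub_batch = [rng.choice(datasets[source]) for _ in range(sub_size)]
            batch.append((source, sub_batch))
        yield batch

if __name__ == "__main__":
    # Toy usage: four tiny "datasets" of (query, target) pairs.
    toy = {name: [(f"{name}-q{i}", f"{name}-t{i}") for i in range(100)]
           for name in SAMPLING_WEIGHTS}
    stream = mixed_batches(toy, SAMPLING_WEIGHTS, batch_size=64, num_sub_batches=4)
    first = next(stream)
    print([(src, len(items)) for src, items in first])
```

Keeping each contrastive sub-batch single-source while mixing sources across the full batch is one plausible reading of why the interleaved strategy stabilizes training: negatives within a sub-batch stay comparable, while gradient updates still see all task types.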

VLM2Vec-V2 achieves the highest overall average score of 58.0 across 78 datasets covering image, video, and visual document tasks, outperforming strong baselines such as GME, LamRA, and VLM2Vec built on the same Qwen2-VL backbone. On image tasks, VLM2Vec-V2 outperforms most baselines and matches VLM2Vec-7B despite having only 2B parameters. On video tasks, the model achieves competitive performance despite being trained on a relatively small amount of video data. In visual document retrieval, VLM2Vec-V2 outperforms all VLM2Vec variants but still lags behind ColPali, which is specifically optimized for visual document tasks.

In summary, the researchers introduced VLM2Vec-V2, a strong baseline model trained with contrastive learning across diverse tasks and modality combinations. VLM2Vec-V2 is built on MMEB-V2 and uses Qwen2-VL as its backbone model. MMEB-V2 is a benchmark designed by the researchers to evaluate multimodal embedding models across diverse modalities, including text, images, videos, and visual documents. Experimental evaluation shows that VLM2Vec-V2 achieves balanced, effective performance across modalities while highlighting the diagnostic value of MMEB-V2 for future research.


Check out the Paper, GitHub page, and model on Hugging Face. All credit for this research goes to the researchers of this project.


Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.