Google AI Introduces Gemini Embedding: A Novel Embedding Model

Recent advances in embedding models focus on producing general-purpose text representations for tasks such as semantic similarity, clustering, and classification. Traditional embedding models, such as the Universal Sentence Encoder and Sentence-T5, were designed to provide universal text representations, but recent research highlights their limitations in generalization. The integration of LLMs has since reshaped embedding model development through two main approaches: improving training datasets via synthetic data generation and hard negative mining, and initializing embedding models from pre-trained LLM parameters. These methods significantly enhance embedding quality and downstream task performance, but they also increase computational cost.
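To make the hard-negative-mining idea concrete, here is a minimal sketch (not the paper's exact procedure): an existing embedding model scores candidate passages against a query, and the most similar passages that are not labeled relevant are kept as "hard" negatives. The data and vectors below are random placeholders.

```python
import numpy as np

def mine_hard_negatives(query_vec, passage_vecs, positive_ids, k=5):
    """Return indices of the k most query-similar passages that are NOT positives.

    query_vec:    (d,) embedding of the query
    passage_vecs: (n, d) embeddings of candidate passages
    positive_ids: set of indices known to be relevant (excluded from negatives)
    """
    # Cosine similarity between the query and every candidate passage.
    q = query_vec / np.linalg.norm(query_vec)
    p = passage_vecs / np.linalg.norm(passage_vecs, axis=1, keepdims=True)
    sims = p @ q

    # Rank candidates from most to least similar, skip true positives,
    # keep the top-k remaining: similar but wrong, hence "hard" negatives.
    order = np.argsort(-sims)
    return [int(i) for i in order if int(i) not in positive_ids][:k]

# Toy usage with random vectors standing in for real embeddings.
rng = np.random.default_rng(0)
query = rng.normal(size=64)
passages = rng.normal(size=(100, 64))
print(mine_hard_negatives(query, passages, positive_ids={3, 17}))
```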
Recent research has also explored adapting pre-trained LLMs to embedding tasks. Sentence-BERT, DPR, and Contriever demonstrated the benefits of contrastive learning and task-specific training for embedding quality. More recently, models such as E5-Mistral and LaBSE, initialized from LLM backbones such as GPT-3 and Mistral, have outperformed traditional BERT- and T5-based embeddings. Despite their success, these models often require large in-domain datasets, which makes them prone to overfitting. Efforts such as MTEB benchmark models across a wide variety of tasks and domains, encouraging stronger generalization in future research.
Google’s Gemini Embedding team has introduced Gemini Embedding, a state-of-the-art model that generates highly general text representations. It builds on Google’s powerful Gemini large language model, leveraging its multilingual and code-comprehension capabilities to improve embedding quality across tasks such as retrieval and semantic similarity. The model is trained on a high-quality, heterogeneous dataset curated with the help of Gemini itself, which is used to filter data, select positive and negative passages, and generate synthetic examples. Through contrastive learning and fine-tuning, Gemini Embedding achieves state-of-the-art results on the Massive Multilingual Text Embedding Benchmark (MMTEB), surpassing previous models on multilingual, English, and code benchmarks.
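For readers who want to try an embedding model from Python, here is a minimal usage sketch with the google-generativeai SDK. The model id shown is an assumption for illustration; the exact id, availability, and output dimensionality of Gemini Embedding may differ.

```python
# Minimal sketch, assuming the google-generativeai SDK is installed and a
# Gemini embedding model is exposed under an id like the one below.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

result = genai.embed_content(
    model="models/gemini-embedding-exp-03-07",  # assumed/placeholder model id
    content="What is the capital of France?",
    task_type="retrieval_query",                # e.g. retrieval, similarity, clustering
)
vector = result["embedding"]                    # list of floats
print(len(vector))
```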
The Gemini Embedding model builds on Gemini’s broad knowledge to generate representations for tasks such as retrieval, classification, and ranking. It is initialized from Gemini’s parameters and applies a pooling strategy to produce compact embeddings. Training uses a noise-contrastive estimation (NCE) loss with in-batch negatives, while a multi-loss approach adapts the embeddings across sub-dimensions. The training process follows a two-stage pipeline: pre-finetuning on large datasets followed by fine-tuning on diverse tasks. In addition, model ensembling improves generalization. Gemini itself also assists with synthetic data generation, data filtering, and hard negative mining to boost performance on multilingual and retrieval tasks.
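As a rough illustration of this kind of objective, the PyTorch sketch below combines an in-batch contrastive (InfoNCE-style) loss with a multi-loss term applied to nested embedding sub-dimensions, so that truncated embeddings remain useful. The dimensions, temperature, and batch size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def info_nce(q, p, temperature=0.05):
    """In-batch contrastive loss: each query's positive is the passage at the
    same batch index; all other passages in the batch act as negatives."""
    q = F.normalize(q, dim=-1)
    p = F.normalize(p, dim=-1)
    logits = q @ p.T / temperature              # (B, B) similarity matrix
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

def multi_dim_loss(q, p, dims=(256, 768, 1536, 3072)):
    """Average the contrastive loss over nested prefixes of the embedding,
    so sub-dimensional (truncated) embeddings are also trained to work."""
    return sum(info_nce(q[:, :d], p[:, :d]) for d in dims) / len(dims)

# Toy usage: a batch of 8 query/passage embedding pairs of width 3072.
q = torch.randn(8, 3072)
p = torch.randn(8, 3072)
print(multi_dim_loss(q, p).item())
```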
Gemini Embedding was evaluated across multiple benchmarks spanning multilingual, English, and code-based tasks covering more than 250 languages. It delivers excellent classification, clustering, and retrieval performance, consistently surpassing other leading models. The model achieved the highest overall ranking by Borda score and performed strongly on cross-lingual retrieval tasks. Moreover, even when certain tasks are excluded, it outperforms competitors on code-related evaluations. These results establish Gemini Embedding as a highly capable multilingual embedding model delivering state-of-the-art performance across a wide range of linguistic and technical challenges.
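Since rankings are aggregated by Borda score, here is a small sketch of how a Borda count turns per-task rankings into an overall model ranking. The model names and scores below are made-up placeholders, not actual benchmark numbers.

```python
from collections import defaultdict

def borda_ranking(per_task_scores):
    """per_task_scores: {task: {model: score}}. For each task, a model earns
    (n_models - 1 - rank) points; points are summed across all tasks."""
    totals = defaultdict(int)
    for scores in per_task_scores.values():
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for rank, model in enumerate(ranked):
            totals[model] += n - 1 - rank
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical scores for three models on three tasks (placeholder values).
tasks = {
    "classification": {"model_a": 71.2, "model_b": 69.8, "model_c": 65.0},
    "retrieval":      {"model_a": 58.4, "model_b": 60.1, "model_c": 55.3},
    "clustering":     {"model_a": 49.9, "model_b": 47.2, "model_c": 48.8},
}
print(borda_ranking(tasks))
```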
In summary, the Gemini Embedding model is a robust multilingual embedding solution that excels across tasks including classification, retrieval, clustering, and ranking. Even when trained on English-only data, it generalizes strongly, outperforming other models on multilingual benchmarks. Its quality benefits from synthetic data generation, dataset filtering, and hard negative mining. Future work aims to extend its capabilities to multimodal embeddings that integrate text, images, video, and audio. Evaluation on large-scale multilingual benchmarks confirms its strengths, making it a powerful tool for researchers and developers seeking efficient, high-performance embeddings.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 80k+ ML SubReddit.
The post Google AI Introduces Gemini Embedding: A Novel Embedding Model Initialized from the Powerful Gemini Large Language Model appeared first on Marktechpost.