Google launches Speech-to-Retrieval (S2R), an approach that maps spoken queries directly to embeddings and retrieves information without first converting speech to text

Google's AI research team has introduced Speech-to-Retrieval (S2R). S2R maps spoken queries directly to embeddings and retrieves information without first converting speech to text. The team positions S2R as an architectural and philosophical shift that targets error propagation in the classic cascade approach and focuses the system on retrieval intent rather than transcription fidelity. Google reports that Voice Search is now powered by S2R.

From cascade modeling to intent-consistent retrieval

In the traditional cascade approach, Automatic Speech Recognition (ASR) first generates a single text string, which is then passed to retrieval. Small transcription errors can change the meaning of a query and produce incorrect results. S2R reframes the problem around "What information is being sought?" and bypasses the fragile intermediate transcript entirely.

Assessing the potential of S2R

Google's research team analyzed the disconnect between Word Error Rate (WER), a measure of ASR quality, and Mean Reciprocal Rank (MRR), a measure of search quality. Using human-validated transcripts to simulate a "perfect ASR" condition, the team compared (i) a cascade with real ASR output (the real-world baseline) and (ii) a cascade with ground-truth transcripts (the upper bound), and observed that lower WER does not reliably predict higher MRR across languages. The persistent MRR gap between the baseline and the ground-truth cascade indicates headroom for a model optimized to retrieve intent directly from audio.
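For reference, MRR averages the reciprocal rank of the first relevant result across queries. A minimal sketch of the metric (the query and document identifiers below are toy examples, not from the study):

```python
def mean_reciprocal_rank(ranked_results, relevant_sets):
    """MRR: mean over queries of 1/rank of the first relevant document
    (0 if no relevant document appears in the ranking)."""
    total = 0.0
    for results, relevant in zip(ranked_results, relevant_sets):
        for rank, doc in enumerate(results, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_results)

# Query 1: relevant doc at rank 2; query 2: relevant doc at rank 3.
mrr = mean_reciprocal_rank(
    [["d3", "d1"], ["d7", "d2", "d9"]],
    [{"d1"}, {"d9"}],
)
print(round(mrr, 3))  # prints 0.417, i.e. (1/2 + 1/3) / 2
```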

Architecture: Jointly trained dual encoders

At its core, S2R is a dual-encoder architecture. An audio encoder converts the spoken query into a rich audio embedding that captures its semantic meaning, while a document encoder generates the corresponding vector representation of documents. The system is trained on paired (audio query, relevant document) data so that the vector for an audio query lies geometrically close to the vectors of its corresponding documents in representation space. This training objective aligns speech directly with the retrieval target and removes the fragile dependence on an exact word sequence.
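A common way to realize this objective in dual encoders is an in-batch softmax contrastive loss, where each query's paired document is the positive and the other documents in the batch serve as negatives. The sketch below assumes that formulation (the post does not specify Google's exact loss), and the linear "encoders", feature sizes, and data are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, weights):
    """Toy encoder: linear projection followed by L2 normalization."""
    z = x @ weights
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def in_batch_contrastive_loss(audio_emb, doc_emb, temperature=0.05):
    """Cross-entropy over the similarity matrix: the diagonal entries
    (paired audio/doc) are the positives; off-diagonal docs are negatives."""
    logits = audio_emb @ doc_emb.T / temperature        # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

# Hypothetical batch: 4 audio queries paired with 4 documents.
audio_feats = rng.normal(size=(4, 32))
doc_feats = rng.normal(size=(4, 64))
w_audio = rng.normal(size=(32, 16))
w_doc = rng.normal(size=(64, 16))

loss = in_batch_contrastive_loss(encode(audio_feats, w_audio),
                                 encode(doc_feats, w_doc))
print(float(loss))  # positive scalar; training would minimize this
```

Minimizing this loss pulls each audio embedding toward its paired document embedding and pushes it away from the others, which is the geometric closeness the article describes.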

Service Path: Streaming Audio, Similarity Search and Ranking

At inference time, audio is streamed to the pre-trained audio encoder, which generates a query vector. This vector is used to efficiently identify a set of highly relevant candidate results from the Google index; the search ranking system, which integrates hundreds of signals, then computes the final ordering. This design retains the mature ranking stack while swapping in speech-semantic embeddings as the query representation.
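The candidate-identification step can be illustrated as a nearest-neighbor search over normalized embeddings. The helper and toy index below are hypothetical; a production system would use approximate nearest-neighbor search over a much larger index before handing candidates to the ranker:

```python
import numpy as np

def top_k_candidates(query_vec, doc_matrix, k=3):
    """Return indices of the k documents whose normalized embeddings
    have the highest cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = docs @ q
    return np.argsort(scores)[::-1][:k]

# Hypothetical index of 5 document embeddings; the query embedding is a
# slightly perturbed copy of document 2, so document 2 should rank first.
rng = np.random.default_rng(1)
index = rng.normal(size=(5, 8))
query = index[2] + 0.01 * rng.normal(size=8)
print(top_k_candidates(query, index)[0])  # prints 2
```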

Evaluating S2R on SVQ

On the Simple Voice Questions (SVQ) evaluation, the post compares three systems: cascade ASR (blue), cascade with ground-truth transcripts (green), and S2R (orange). The S2R bar significantly outperforms the cascade ASR baseline and approaches the upper bound set by the ground-truth cascade on MRR; the authors describe the remaining gap as room for future research.

Open source: SVQ and the Massive Sound Embedding Benchmark (MSEB)

To support community progress, Google open-sourced Simple Voice Questions (SVQ) on Hugging Face: short audio questions recorded across 26 locales covering 17 languages and under a variety of audio conditions (clean, background speech, traffic noise, media noise). The dataset is released as an unsplit evaluation set under a CC-BY-4.0 license. SVQ is part of the Massive Sound Embedding Benchmark (MSEB), an open framework for evaluating sound embedding methods across tasks.

Main points

  • Google has moved Voice Search to Speech-to-Retrieval (S2R), which maps spoken queries to embeddings and skips transcription.
  • A dual-encoder design (audio encoder + document encoder) aligns audio query vectors with document embeddings for direct semantic retrieval.
  • In evaluation, S2R outperforms the production ASR-then-retrieve cascade and approaches the ground-truth-transcript upper bound on MRR.
  • S2R is live in production, serving multiple languages, and integrates with Google's existing ranking stack.
  • Google released Simple Voice Questions (SVQ) (17 languages, 26 locales) as part of MSEB to standardize speech-retrieval benchmarking.

Speech-to-Retrieval (S2R) is a meaningful architectural revision, not a cosmetic upgrade: by replacing the ASR-to-text hinge with a speech-native embedding interface, Google aligns the optimization objective with search quality and eliminates a major source of cascading errors. Production deployment and multilingual coverage are important, but the interesting work now is operational: calibrating audio-derived relevance scores, stress-testing diverse acoustic and noise conditions, and quantifying the privacy tradeoffs when speech embeddings become query keys.


Check out the technical details here.


Max is an Artificial Intelligence Analyst at MarkTechPost in Silicon Valley, actively shaping the future of technology. He teaches robotics at Brainvyne, fights spam with ComplyEmail, and uses AI every day to turn complex technological advances into clear, understandable insights.

