Meet Yambda: The world’s largest activity dataset to speed up recommendation systems

Yandex recently made a significant contribution to the recommendation systems community by releasing Yambda, the world’s largest publicly available dataset for recommendation system research and development. The dataset aims to bridge the gap between academic research and industry-scale applications, providing nearly 5 billion anonymized user interaction events from Yandex Music, one of the company’s flagship streaming services with over 28 million users per month.

Why Yambda is important: Solving critical data gaps in recommendation systems

Recommendation systems power the personalized experiences of many digital services today, from e-commerce and social networks to streaming platforms. These systems rely heavily on large quantities of behavioral data, such as clicks, likes, and listens, to infer user preferences and deliver tailored content.

However, the recommendation systems field lags behind other AI domains, such as natural language processing, mainly because of the lack of large, publicly accessible datasets. Unlike large language models (LLMs), which learn from publicly available text sources, recommendation systems require sensitive behavioral data, which is commercially valuable and difficult to anonymize. As a result, companies have traditionally guarded this data closely, limiting researchers’ access to datasets of realistic scale.

Existing datasets, such as Spotify’s Million Playlist Dataset, the Netflix Prize data, and Criteo’s click logs, are either too small, lack temporal detail, or are too poorly documented for developing production-grade recommendation models. Yandex’s release of Yambda addresses these challenges by providing a rich set of features together with anonymization safeguards.

What Yambda contains: Scale, Richness, and Privacy

The Yambda dataset includes 4.79 billion anonymized user interactions collected over 10 months. These events come from approximately one million users interacting with nearly 9.4 million tracks on Yandex Music. The dataset includes:

  • User interactions: Implicit feedback (listens) and explicit feedback (likes, dislikes, and their removal).
  • Anonymized audio embeddings: Vector representations of tracks derived from convolutional neural networks, which let models take advantage of audio content similarity.
  • Organic interaction flag: The “is_organic” flag indicates whether the user discovered a track independently or through a recommendation, which facilitates behavioral analysis.
  • Precise timestamps: Each event carries a timestamp that preserves temporal order, which is essential for modeling sequential user behavior.

All user and track identifiers are anonymized as numeric IDs to comply with privacy standards, ensuring that no personally identifiable information is disclosed.
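To make the structure described above concrete, here is a minimal sketch of how a single event could be modeled and filtered. Only the `is_organic` flag is named in the dataset description; the other field names, the event-type strings, and the timestamp unit are assumptions chosen for illustration and may not match the published schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    # All field names except is_organic are hypothetical.
    user_id: int      # anonymized numeric user ID
    item_id: int      # anonymized numeric track ID
    timestamp: int    # event time; the unit is an assumption
    event_type: str   # e.g. "listen", "like", "dislike" (assumed labels)
    is_organic: bool  # True if the user found the track independently


def organic_listens(events: List[Event]) -> List[Event]:
    """Return organic listen events in chronological order."""
    selected = [e for e in events if e.event_type == "listen" and e.is_organic]
    return sorted(selected, key=lambda e: e.timestamp)


events = [
    Event(1, 10, 300, "listen", True),
    Event(1, 11, 100, "listen", False),  # recommended, not organic
    Event(2, 10, 200, "like", True),     # explicit feedback, not a listen
    Event(2, 12, 50, "listen", True),
]
result = organic_listens(events)
```

Separating organic from recommendation-driven events in this way is one possible use of the flag, for example to study how much listening is shaped by the recommender itself.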

The dataset is available in the Apache Parquet format, which is optimized for big data processing frameworks such as Apache Spark and Hadoop and is also compatible with analytics libraries such as Pandas and Polars. This makes Yambda accessible to researchers and developers working in different environments.

Evaluation method: Global Temporal Split

A key innovation in Yandex’s dataset is the adoption of the Global Temporal Split (GTS) evaluation strategy. Recommendation system studies typically use the Leave-One-Out method, which holds out each user’s last interaction for testing. However, this approach breaks the temporal continuity of user interactions and creates unrealistic training conditions.

GTS, on the other hand, splits the data by timestamp and preserves the entire sequence of events. This approach more closely mimics real-world recommendation scenarios: it prevents any future data from leaking into training and lets models be tested on interactions that are truly unseen and chronologically later.

This time-aware evaluation is crucial for benchmarking algorithms under realistic constraints and understanding their actual effectiveness.
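The difference between the two splitting strategies can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not Yandex’s evaluation code: the `(user_id, item_id, timestamp)` tuple layout, the function names, and the cutoff value are all hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

Interaction = Tuple[int, int, int]  # (user_id, item_id, timestamp), assumed layout


def leave_one_out_split(data: List[Interaction]):
    """Hold out each user's last interaction as the test set."""
    by_user = defaultdict(list)
    for row in data:
        by_user[row[0]].append(row)
    train, test = [], []
    for rows in by_user.values():
        rows = sorted(rows, key=lambda r: r[2])
        train.extend(rows[:-1])
        test.append(rows[-1])
    return train, test


def global_temporal_split(data: List[Interaction], cutoff: int):
    """Everything before the cutoff is training data; the rest is test data."""
    train = [r for r in data if r[2] < cutoff]
    test = [r for r in data if r[2] >= cutoff]
    return train, test


data = [
    (1, 10, 1), (1, 11, 7), (1, 12, 9),  # user 1 is active until t=9
    (2, 10, 3), (2, 13, 4),              # user 2 stops at t=4
]
loo_train, loo_test = leave_one_out_split(data)
gts_train, gts_test = global_temporal_split(data, cutoff=6)

# Leave-one-out: user 1's event at t=7 is in training, yet user 2's
# test event happened at t=4, so training sees "the future".
loo_leaks = max(r[2] for r in loo_train) > min(r[2] for r in loo_test)
# Global temporal split: every training event precedes every test event.
gts_leaks = max(r[2] for r in gts_train) > min(r[2] for r in gts_test)
```

In this toy data, `loo_leaks` is true and `gts_leaks` is false, which is the leakage property the text describes.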

Baseline models and metrics included

To support benchmarking and accelerate innovation, Yandex provides baseline recommendation models evaluated on the dataset, including:

  • MostPop: A popularity-based model that recommends the most popular items.
  • DecayPop: A time-decayed popularity model that gives more weight to recent interactions.
  • ItemKNN: A neighborhood-based collaborative filtering method.
  • iALS: Implicit Alternating Least Squares matrix factorization.
  • BPR: Bayesian Personalized Ranking, a pairwise ranking method.
  • SANSA and SASRec: A sparse autoencoder-based model and a sequence-aware model built on self-attention, respectively.
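As a rough illustration of the two simplest baselines in the list, the sketch below contrasts plain popularity with a time-decayed variant. The exponential half-life weighting is an assumption made for this example; the actual decay function and parameters used in the Yambda baselines may differ, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

Interaction = Tuple[int, int, int]  # (user_id, item_id, timestamp), assumed layout


def most_pop(interactions: List[Interaction], k: int) -> List[int]:
    """Rank items by raw interaction count."""
    counts = Counter(item for _, item, _ in interactions)
    return [item for item, _ in counts.most_common(k)]


def decay_pop(interactions: List[Interaction], now: int,
              half_life: float, k: int) -> List[int]:
    """Rank items by counts weighted with exponential time decay.

    An interaction that is `half_life` time units old counts half as
    much as one happening right now (an assumed weighting scheme).
    """
    scores = defaultdict(float)
    for _, item, ts in interactions:
        age = now - ts
        scores[item] += 0.5 ** (age / half_life)
    return sorted(scores, key=scores.get, reverse=True)[:k]


interactions = [
    (1, 10, 0), (2, 10, 0), (3, 10, 0),  # item 10: three old plays
    (1, 20, 100), (2, 20, 100),          # item 20: two recent plays
]
top_by_count = most_pop(interactions, k=2)
top_by_decay = decay_pop(interactions, now=100, half_life=10, k=2)
```

Here item 10 wins on raw counts, while item 20 wins once old interactions are discounted, which is the behavioral difference between the two baselines.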

These baselines are evaluated using standard recommendation metrics, for example:

  • NDCG@k (Normalized Discounted Cumulative Gain): Measures ranking quality, emphasizing the position of relevant items.
  • Recall@k: Evaluates the proportion of relevant items retrieved.
  • Coverage@k: Indicates the diversity of recommendations across the catalog.
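The three metrics above can be sketched with common textbook definitions using binary relevance. Normalization conventions vary between papers (for example, some divide recall by `min(k, number of relevant items)`), so treat these as assumed definitions rather than the exact formulas used in the Yambda benchmark.

```python
import math
from typing import List, Set


def recall_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = len(set(recommended[:k]) & relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended: List[int], relevant: Set[int], k: int) -> float:
    """DCG of the top k divided by the DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so position 1 gets log2(2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(all_recommendations: List[List[int]],
                  catalog_size: int, k: int) -> float:
    """Share of the catalog that appears in at least one top-k list."""
    shown = set()
    for recs in all_recommendations:
        shown.update(recs[:k])
    return len(shown) / catalog_size


recall = recall_at_k([1, 2, 3], {1, 3}, k=3)
ndcg = ndcg_at_k([1, 2, 3], {1, 3}, k=3)
coverage = coverage_at_k([[1, 2, 3], [1, 2, 4]], catalog_size=10, k=3)
```

In this example both relevant items are retrieved, but NDCG stays below 1 because the second relevant item sits at position 3 rather than position 2.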

These baselines and their benchmark results help researchers quickly evaluate the performance of new algorithms relative to established methods.

Broad applicability beyond music streaming

Although the dataset originates from a music streaming service, its value extends far beyond that domain. The interaction types, user behavior dynamics, and large scale make Yambda a general-purpose benchmark for recommendation systems in fields such as e-commerce, video platforms, and social networks. Algorithms validated on this dataset can be generalized or adapted to a variety of recommendation tasks.

Benefits for different stakeholders

  • Academia: Can rigorously test theories and new algorithms at an industry-relevant scale.
  • Startups and SMEs: Gain access to a resource comparable to what technology giants own, which levels the playing field and accelerates the development of advanced recommendation engines.
  • End users: Benefit indirectly from smarter recommendation algorithms that improve content discovery, reduce search time, and increase engagement.

My Wave: Yandex’s Personalized Recommendation System

Yandex Music uses a proprietary recommendation system called My Wave, which combines deep neural networks and AI to personalize music suggestions. My Wave analyzes thousands of factors, including:

  • User interaction sequence and listening history.
  • Customizable preferences such as mood and language.
  • Real-time music analysis of spectrograms, rhythm, vocal tone, frequency ranges, and genres.

The system dynamically adapts to individual tastes by identifying audio similarities and predicting preferences, and it illustrates the kind of complex recommendation pipeline that benefits from large datasets such as Yambda.

Ensuring privacy and ethical use

The release of Yambda underscores the importance of privacy in recommendation system research. Yandex anonymizes all data with numeric IDs and omits personally identifiable information. The dataset contains only interaction signals and does not reveal exact user identities or sensitive attributes.

This balance between openness and privacy allows for robust research while protecting individual user data, a key consideration for the ethical advancement of AI technology.

Access and Versions

Yandex offers the Yambda dataset in three sizes to suit different research needs and computing capacities:

  • Full version: ~5 billion events.
  • Medium version: ~500 million events.
  • Small version: ~50 million events.

All versions are available on Hugging Face, a popular platform for hosting datasets and machine learning models, so they can be easily integrated into research workflows.

Conclusion

Yandex’s release of the Yambda dataset marks a pivotal moment in recommendation system research. By providing anonymized interaction data at an unprecedented scale, paired with time-aware evaluation and baseline models, it sets a new standard for benchmarking and accelerating innovation. Researchers, startups, and enterprises can now explore and develop recommendation systems that better reflect real-world usage and deliver enhanced personalization.

As recommendation systems continue to shape countless online experiences, datasets like Yambda play a fundamental role in pushing the boundaries of AI-driven personalization.

Check out the Yambda dataset on Hugging Face.


Note: Thanks to the Yandex team for the thought leadership and resources for this article. The Yandex team has supported and sponsored this content.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million views per month, demonstrating its popularity among its audience.
