Yandex Releases Alchemist: A Compact Supervised Fine-Tuning Dataset for Enhanced Text-to-Image (T2I) Model Quality

Despite great progress in text-to-image (T2I) generation brought by models such as DALL-E 3, Imagen 3, and Stable Diffusion 3, achieving consistent output quality in both aesthetics and alignment remains a persistent challenge. Although large-scale pretraining provides general knowledge, it is not sufficient to achieve high aesthetic quality and consistency. Supervised fine-tuning (SFT) is a critical post-training step, but its effectiveness depends heavily on the quality of the fine-tuning dataset.
Existing public SFT datasets target narrow visual domains (such as anime or specific art genres) or rely on basic heuristic filters applied to web-scale data. Human-led curation is expensive, does not scale, and often fails to identify the samples that most improve the model. Furthermore, recent T2I models rely on internal proprietary datasets with minimal transparency, limiting the reproducibility of results and slowing collective progress in the field.
Method: Model-guided dataset curation
To address these problems, Yandex has released Alchemist, a general-purpose SFT dataset available for public use, consisting of 3,350 carefully selected image–text pairs. Unlike conventional datasets, Alchemist is constructed using a novel approach that employs a pretrained diffusion model as a sample quality estimator. This allows the selection of training data with a high impact on generative model performance, without relying on subjective human labels or simple aesthetic scores.
Alchemist aims to improve the output quality of T2I models through targeted fine-tuning. The release also includes five fine-tuned versions of publicly available Stable Diffusion models. The dataset and models are openly licensed and accessible on Hugging Face. More details on the methodology and experiments are available in the preprint.
Technical design: Filtering pipeline and dataset characteristics
Alchemist's construction involves a multi-stage filtering pipeline that starts from approximately 10 billion web-sourced images. The pipeline proceeds as follows:
- Initial filtering: Remove NSFW content and low-resolution images (below 1024×1024 pixels).
- Coarse quality filtering: Apply classifiers to exclude images with compression artifacts, motion blur, watermarks, and other defects. These classifiers are trained on standard image quality assessment datasets such as KonIQ-10k and PIPAL.
- Deduplication and IQA-based pruning: SIFT-like features are used to cluster similar images, retaining only the highest-quality image per cluster. Images are further scored with the TOPIQ model to ensure that only clean samples are kept.
- Diffusion-based selection: A key contribution is ranking images using the cross-attention activations of a pretrained diffusion model. The scoring function identifies samples with features associated with visual complexity, aesthetic appeal, and stylistic richness, selecting those most likely to improve downstream model performance.
- Caption rewriting: Captions for the final selected images are rewritten using a vision-language model to produce prompt-style text descriptions. This step ensures better consistency and usability in SFT workflows.
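The staged pipeline above can be sketched in code. This is a toy illustration only: all thresholds, field names, and scoring functions below are hypothetical stand-ins (the actual pipeline uses trained defect classifiers, SIFT-like feature clustering, TOPIQ, and diffusion cross-attention scores, none of which are reproduced here).

```python
# Hypothetical sketch of an Alchemist-style multi-stage filtering pipeline.
# Images are represented as dicts with precomputed stand-in scores; in the
# real pipeline each score comes from a trained model, not metadata.

def passes_initial_filter(img):
    """Stage 1: drop NSFW and low-resolution images (below 1024x1024)."""
    return not img["nsfw"] and img["width"] >= 1024 and img["height"] >= 1024

def passes_defect_filter(img):
    """Stage 2: a defect classifier (trained on IQA data such as KonIQ-10k
    or PIPAL) would flag artifacts, blur, watermarks; here a stub score."""
    return img["defect_score"] < 0.5  # hypothetical threshold

def dedup_and_prune(images):
    """Stage 3: cluster near-duplicates (SIFT-like features in the paper)
    and keep the best image per cluster by a TOPIQ-style IQA score."""
    best = {}
    for img in images:
        c = img["cluster_id"]  # stand-in for feature-based clustering
        if c not in best or img["iqa_score"] > best[c]["iqa_score"]:
            best[c] = img
    return list(best.values())

def diffusion_rank(images, k):
    """Stage 4: rank by a diffusion-model-derived score (cross-attention
    activations in the paper); here a precomputed stand-in value."""
    return sorted(images, key=lambda i: i["diffusion_score"], reverse=True)[:k]

def build_dataset(raw_images, k):
    """Run all stages; stage 5 (caption rewriting with a vision-language
    model) would happen after selection and is omitted here."""
    stage1 = [i for i in raw_images if passes_initial_filter(i)]
    stage2 = [i for i in stage1 if passes_defect_filter(i)]
    stage3 = dedup_and_prune(stage2)
    return diffusion_rank(stage3, k)
```

The key design point the sketch preserves is that each stage only shrinks the candidate pool, so the expensive diffusion-based scoring runs last, on the fewest images.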
Through ablation studies, the authors found that increasing the dataset size beyond 3,350 samples (e.g., to 7K or 19K) resulted in lower-quality fine-tuned models, reinforcing that the value of high-quality data outweighs raw volume.
Results across multiple T2I models
Alchemist's effectiveness was evaluated across five Stable Diffusion variants: SD1.5, SD2.1, SDXL, SD3.5 Medium, and SD3.5 Large. Each model was compared in three configurations: (i) fine-tuned on the Alchemist dataset, (ii) fine-tuned on a size-matched subset of LAION-Aesthetics v2, and (iii) the untuned baseline.
Human evaluation: Expert annotators performed side-by-side evaluations across four criteria: text–image relevance, aesthetic quality, image complexity, and fidelity. Alchemist-tuned models showed statistically significant improvements in aesthetic and complexity scores, typically outperforming both the baselines and the LAION-Aesthetics-tuned versions by margins of 12–20%. Importantly, text–image relevance remained stable, indicating that prompt alignment was not negatively affected.
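Side-by-side human evaluations like the one described above are typically aggregated into per-criterion win rates. The sketch below assumes a simple verdict format ("A" wins, "B" wins, or "tie") and a plain win-rate aggregation; the paper's exact annotation protocol and statistical testing are not reproduced here.

```python
from collections import Counter

def win_rates(judgments):
    """Aggregate side-by-side verdicts into per-criterion win rates for
    model A. Each judgment is a (criterion, verdict) pair where the
    verdict is "A", "B", or "tie"; ties count against the win rate."""
    wins, totals = Counter(), Counter()
    for criterion, verdict in judgments:
        totals[criterion] += 1
        if verdict == "A":
            wins[criterion] += 1
    return {c: wins[c] / totals[c] for c in totals}
```

A margin like the 12–20% reported in the article would correspond to model A's win rate exceeding model B's by that amount on a given criterion.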
Automatic metrics: On metrics such as FD-DINOv2, CLIP score, ImageReward, and HPS-v2, the Alchemist-tuned models generally scored higher than their counterparts. Notably, the improvements over the baseline models were more consistent than those achieved with the size-matched LAION subset.
Dataset size ablation: Fine-tuning with larger Alchemist variants (7K and 19K samples) resulted in lower performance, underscoring that stricter filtering and higher per-sample quality matter more than dataset size.
Yandex has used the dataset to train its proprietary text-to-image generation model YandexART v2.5 and plans to continue leveraging it for future model updates.
Conclusion
Alchemist provides a clear, empirically validated path to improving text-to-image generation quality through supervised fine-tuning. The approach emphasizes sample quality over dataset size and introduces a replicable dataset construction method that does not rely on proprietary tools.
While the improvements are most pronounced in perceptual attributes such as aesthetics and image complexity, the framework also highlights the tradeoffs that emerge in practice, especially for newer foundation models that have already been optimized through internal SFT. Nevertheless, Alchemist sets a new standard for general-purpose SFT datasets and offers researchers and developers a valuable resource for improving the output quality of generative vision models.
Check out the paper and the Alchemist dataset on Hugging Face. Thanks to the Yandex team for the thought leadership and resources behind this article.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform receives over 2 million views per month, demonstrating its popularity among readers.
