DeepSeek-AI releases DeepSeek-R1-Zero and DeepSeek-R1: first-generation reasoning models that incentivize the reasoning capabilities of LLMs through reinforcement learning

Large language models (LLMs) have made significant progress in natural language processing, performing well in tasks such as understanding, generation, and reasoning. However, challenges remain. Achieving robust reasoning often requires extensive supervised fine-tuning, which limits scalability and generalization. In addition, issues such as poor readability and the trade-off between computational cost and reasoning depth persist, prompting researchers to explore new methods.
DeepSeek-R1: A new approach to LLM reasoning
DeepSeek-AI’s recent work introduces DeepSeek-R1, a model designed to enhance reasoning capabilities through reinforcement learning (RL). The work produced two models:
- DeepSeek-R1-Zero, which is trained using only RL and exhibits emergent reasoning behaviors such as chain-of-thought (CoT) reasoning.
- DeepSeek-R1, which builds on its predecessor with a multi-stage training pipeline that addresses challenges such as readability and language mixing while maintaining strong reasoning performance.
These models are designed to overcome existing limitations, combining innovative reinforcement learning techniques with a structured training process to achieve scalability and usability.

Technical innovations and advantages
1. Reinforcement learning for reasoning tasks: DeepSeek-R1-Zero applies reinforcement learning without relying on supervised data. It uses Group Relative Policy Optimization (GRPO), which scores groups of sampled outputs against each other rather than against a learned value model, to optimize reasoning directly (a minimal sketch of this group-relative scoring appears after this list). For example, its AIME 2024 pass@1 score rose from 15.6% to 71.0% over the course of training.
2. Multi-stage training in DeepSeek-R1: DeepSeek-R1 incorporates cold-start data (thousands of curated CoT examples) to fine-tune its base model before proceeding to reasoning-focused RL. A language-consistency reward during this stage keeps the output coherent and user-friendly.
3. Distillation of smaller models: To address computational constraints, DeepSeek-AI distilled six smaller models (1.5B to 70B parameters) from DeepSeek-R1 onto the Qwen and Llama architectures; the second sketch below illustrates the recipe. These models retain strong reasoning capabilities, with the 14B distilled model achieving a pass@1 score of 69.7% on AIME 2024, outperforming some larger models.
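To make the group-relative idea concrete, here is a minimal Python sketch of GRPO’s advantage computation. The reward values are hypothetical and the helper function is an illustration under my own assumptions, not DeepSeek-AI’s implementation:

```python
# Sketch of GRPO's group-relative advantage: sample G outputs per prompt,
# score each with a rule-based reward, and normalize within the group so
# that no learned value model (critic) is needed.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Advantage of each sampled output relative to its own group (shape (G,))."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for G = 4 sampled answers to one math prompt:
# 1.0 if the final answer is correct and well formatted, 0.0 otherwise,
# plus a small language-consistency bonus like the one DeepSeek-R1 adds during RL.
accuracy = np.array([1.0, 0.0, 1.0, 0.0])
language_bonus = np.array([0.1, 0.1, 0.0, 0.1])
advantages = group_relative_advantages(accuracy + language_bonus)
print(advantages)  # above-mean outputs are reinforced; below-mean ones are discouraged
```

Because each output is judged against its own group’s mean and standard deviation, GRPO avoids training a separate critic network, which is what makes large-scale RL on reasoning traces tractable.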
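The distillation step itself is plain supervised fine-tuning of a student model on reasoning traces generated by DeepSeek-R1. Below is a minimal sketch using Hugging Face transformers; the data file name and hyperparameters are my assumptions, not DeepSeek-AI’s released recipe:

```python
# Sketch: distill reasoning by fine-tuning a smaller student on teacher traces.
import torch
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

student = "Qwen/Qwen2.5-14B"  # one of the base architectures named in the paper
tok = AutoTokenizer.from_pretrained(student)
model = AutoModelForCausalLM.from_pretrained(student, torch_dtype=torch.bfloat16)

# Hypothetical JSONL of R1-generated samples, each {"text": prompt + CoT + answer}.
ds = load_dataset("json", data_files="r1_traces.jsonl", split="train")
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=4096),
            remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="r1-distill-qwen-14b",
                           per_device_train_batch_size=1, num_train_epochs=2),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # next-token loss
).train()
```

Notably, the distilled models are trained with supervised fine-tuning alone, with no RL stage of their own, yet inherit much of the teacher’s reasoning ability.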
Results: Performance Insights
DeepSeek-R1’s performance is supported by benchmark results:
- Reasoning benchmarks:
  - AIME 2024: 79.8% pass@1, exceeding OpenAI’s o1-mini.
  - MATH-500: 97.3% pass@1, comparable to OpenAI-o1-1217.
  - GPQA Diamond: 71.5% pass@1, strong on fact-based reasoning.
- Coding and STEM tasks:
  - Codeforces: Elo rating of 2029, outperforming 96.3% of human participants.
  - SWE-bench Verified: 49.2% of issues resolved, competitive with other leading models.
- General capabilities:
  - On the ArenaHard and AlpacaEval 2.0 benchmarks, DeepSeek-R1 achieves win rates of 92.3% and 87.6% respectively, demonstrating strong generalization.
Distillation highlights: smaller models such as DeepSeek-R1-Distill-Qwen-32B reached a pass@1 score of 72.6% on AIME 2024, showing that strong reasoning scales down to practical model sizes.

Conclusion: Improving AI Reasoning
DeepSeek-AI’s DeepSeek-R1 and DeepSeek-R1-Zero represent meaningful advances in LLM reasoning capabilities. By combining reinforcement learning, cold-start data, and distillation, these models address key limitations while promoting accessibility through open-source availability under the MIT license. The API (‘model=deepseek-reasoner’) further enhances usability for developers and researchers.
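For developers, querying the model looks like a standard chat-completion request. The sketch below assumes DeepSeek’s OpenAI-compatible endpoint and the openai Python SDK; the prompt and environment-variable name are illustrative:

```python
# Minimal sketch of querying DeepSeek-R1 through the OpenAI-compatible API.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # your DeepSeek API key
    base_url="https://api.deepseek.com",     # DeepSeek's documented endpoint
)

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user", "content": "How many primes are there below 50?"}],
)

message = response.choices[0].message
# Per DeepSeek's docs, the reasoner returns its chain of thought in a separate
# `reasoning_content` field alongside the final answer.
print(getattr(message, "reasoning_content", None))  # reasoning trace
print(message.content)                              # final answer
```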
Looking ahead, DeepSeek-AI plans to improve multi-language support, strengthen software engineering capabilities, and reduce prompt sensitivity. These efforts aim to further establish DeepSeek-R1 as a powerful solution for reasoning-focused AI applications, and they illustrate how carefully staged training can help models tackle increasingly complex challenges.
Check out the paper, DeepSeek-R1, and DeepSeek-R1-Zero. All credit for this research goes to the researchers on this project.