The rise of small reasoning models: Can compact AI match GPT-level reasoning?

In recent years, the success of large language models (LLMs) has captivated the AI field. Originally designed for natural language processing, these models have evolved into powerful reasoning tools capable of solving complex problems through step-by-step thinking that resembles human reasoning. However, despite their impressive reasoning capabilities, LLMs have significant drawbacks, including high computational costs and slow inference, which make them impractical for deployment in resource-constrained environments such as mobile devices or edge computing. This has led to growing interest in developing smaller, more efficient models that can offer similar reasoning capabilities while minimizing costs and resource demands. This article explores the rise of these small reasoning models, their potential, their challenges, and their implications for the future of AI.
A shift in perspective
For most of AI's recent history, the field has followed the principle of scaling laws, which hold that model performance improves predictably as data, compute, and model size increase. While this approach has produced powerful models, it has also introduced significant trade-offs, including high infrastructure costs, environmental impact, and latency issues. Not all applications require the full capabilities of a model with hundreds of billions of parameters. In many practical settings, such as on-device assistants, healthcare, and education, smaller models can achieve comparable results if they can reason effectively.
Understanding reasoning in AI
Reasoning in AI refers to a model's ability to follow a logical chain, understand cause and effect, infer implications, plan the steps of a solution, and identify contradictions. For language models, this often means not only retrieving information but also manipulating and inferring from it through a structured, step-by-step approach. Typically, this level of reasoning is achieved by fine-tuning LLMs to perform multi-step reasoning before arriving at an answer. While effective, these methods demand significant computational resources and can be slow and costly to deploy, raising concerns about their accessibility and environmental impact.
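To make the idea concrete, the following sketch shows how step-by-step reasoning is commonly elicited from an instruction-tuned model simply by asking for it in the prompt. The model name, prompt, and generation settings are illustrative choices, not details from any specific system discussed here.

```python
# Minimal sketch: eliciting step-by-step reasoning from an instruction-tuned model.
# The model name and generation settings below are illustrative assumptions.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="Qwen/Qwen2.5-1.5B-Instruct",  # any small instruction-tuned model works here
)

prompt = (
    "A train travels 120 km in 2 hours, then 180 km in 3 hours. "
    "What is its average speed for the whole trip? "
    "Think through the problem step by step before giving the final answer."
)

# The instruction to reason step by step encourages the model to produce
# intermediate calculations before the final answer.
output = generator(prompt, max_new_tokens=256, do_sample=False)
print(output[0]["generated_text"])
```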
Understanding small reasoning models
Small reasoning models aim to replicate the reasoning capabilities of large models while being far more efficient in terms of compute, memory usage, and latency. These models often employ a technique called knowledge distillation, in which a smaller model (the "student") learns from a larger, pre-trained model (the "teacher"). The distillation process involves training the student on data generated by the teacher, transferring the larger model's reasoning ability. The student model is then fine-tuned to improve its performance. In some cases, reinforcement learning with specialized, domain-specific reward functions is applied to further sharpen the model's task-specific reasoning.
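As a rough illustration of the distillation idea, the sketch below fine-tunes a small "student" model on reasoning traces that would, in practice, be sampled from a much larger "teacher". The model name, toy traces, and hyperparameters are assumptions for demonstration only.

```python
# Minimal sketch of sequence-level knowledge distillation: the "student" is
# fine-tuned on reasoning traces generated by a larger "teacher" model.
# Model name, toy dataset, and hyperparameters are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

teacher_traces = [
    "Q: What is 17 * 6?\nA: 17 * 6 = 17 * 5 + 17 = 85 + 17 = 102. The answer is 102.",
    "Q: Is 91 prime?\nA: 91 = 7 * 13, so it is not prime. The answer is no.",
]  # in practice, a large corpus of traces sampled from the teacher

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")
student = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

student.train()
for trace in teacher_traces:
    batch = tokenizer(trace, return_tensors="pt")
    # Standard next-token (causal LM) loss computed on the teacher's output text.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```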
The rise and advancements of small reasoning models
A notable milestone in the development of small reasoning models was the release of DeepSeek-R1. Despite being trained on a relatively modest cluster of older GPUs, DeepSeek-R1 achieved performance comparable to OpenAI's o1 on benchmarks such as MMLU and GSM-8K. This achievement has prompted a reconsideration of the conventional scaling approach, which assumed that larger models are inherently superior.
The success of DeepSeek-R1 can be attributed to its innovative training process, which applied large-scale reinforcement learning without relying on supervised fine-tuning in the early phases. This approach led to the creation of DeepSeek-R1-Zero, a model that demonstrated impressive reasoning abilities compared with large reasoning models. Further refinements, such as the use of cold-start data, improved the model's coherence and task execution, particularly in areas like mathematics and code.
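One way to picture this style of reinforcement learning is through rule-based rewards: sampled completions are scored automatically for answer correctness and output format, and those scores drive the policy update. The tag format and reward weights below are illustrative assumptions, not DeepSeek's exact recipe.

```python
# Sketch of a rule-based reward for R1-style reinforcement learning on reasoning
# tasks. The <think>/<answer> tags and the reward weights are assumptions made
# for illustration; they are not taken from DeepSeek's published training setup.
import re

def reward(completion: str, reference_answer: str) -> float:
    score = 0.0
    # Format reward: the completion should wrap reasoning and answer in tags.
    if re.search(r"<think>.*</think>", completion, re.DOTALL) and "<answer>" in completion:
        score += 0.5
    # Accuracy reward: the extracted answer must match the reference exactly.
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == reference_answer.strip():
        score += 1.0
    return score

sample = "<think>3x + 7 = 22, so 3x = 15 and x = 5.</think><answer>5</answer>"
print(reward(sample, "5"))  # -> 1.5 (correct format plus correct answer)
```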
Furthermore, distillation has proven crucial for developing smaller, more efficient models from larger ones. DeepSeek, for example, released distilled versions of its model ranging from 1.5 billion to 70 billion parameters. Using this approach, researchers trained a comparatively small model, DeepSeek-R1-Distill-Qwen-32B, which outperforms OpenAI's o1-mini across various benchmarks. These models can now be deployed on standard hardware, making them viable for a much wider range of applications.
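As a sketch of what deployment on standard hardware can look like, the snippet below loads one of the smaller distilled checkpoints with the Hugging Face transformers library and asks it a simple question. The model ID, prompt, and generation settings are illustrative assumptions; the larger distilled variants still require substantial GPU memory.

```python
# Sketch: running a small distilled reasoning model locally with transformers.
# The model ID and settings here are assumptions for illustration; requires the
# `accelerate` package for device_map="auto".
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Solve: if 3x + 7 = 22, what is x?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
# Print only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```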
Can small models match GPT-level reasoning?
To evaluate whether small reasoning models (SRMs) can match the reasoning power of large reasoning models (LRMs) like GPT, it is important to assess their performance on standard benchmarks. For example, the DeepSeek-R1 model scored around 0.844 on the MMLU test, comparable to larger models such as o1. On the GSM-8K dataset, which focuses on grade-school math, DeepSeek-R1's distilled models achieved top-tier performance, surpassing both o1 and o1-mini.
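For context, GSM-8K-style accuracy is typically scored by extracting the final number from the model's reasoning and comparing it with the reference answer. The sketch below shows that scoring idea with made-up predictions; it is not the official evaluation harness.

```python
# Illustrative sketch of GSM-8K-style scoring: take the last number in the
# model's output as its answer and compare against the reference.
# The regex and sample data are assumptions, not the official harness.
import re

def extract_final_number(text: str):
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

predictions = [
    "Each box holds 12 eggs, so 4 boxes hold 4 * 12 = 48 eggs. The answer is 48.",
    "She saves 5 dollars a week for 6 weeks: 5 * 6 = 30. The answer is 35.",  # wrong
]
references = ["48", "30"]

correct = sum(extract_final_number(p) == r for p, r in zip(predictions, references))
print(f"accuracy = {correct / len(references):.2f}")  # -> accuracy = 0.50
```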
In coding tasks, such as those on LiveCodeBench and CodeForces, DeepSeek-R1's distilled models performed on par with o1-mini and GPT-4o, demonstrating strong reasoning capabilities in programming. However, larger models still hold an edge in tasks that require broader language understanding or long context windows, since smaller models tend to be more task-specific.
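Coding benchmarks are scored differently: a problem counts as solved only if the generated program passes a set of hidden test cases. The toy sketch below illustrates that pass/fail scoring with a made-up completion and tests; real harnesses run the code in a sandbox with time and memory limits.

```python
# Toy sketch of how coding benchmarks score a completion: run the generated
# function against hidden test cases and count it as solved only if all pass.
# The generated snippet and tests below are made-up examples.
generated_code = """
def add(a, b):
    return a + b
"""

tests = [((2, 3), 5), ((-1, 1), 0)]

namespace: dict = {}
exec(generated_code, namespace)  # real harnesses execute this in a sandbox
solved = all(namespace["add"](*args) == expected for args, expected in tests)
print("solved" if solved else "failed")
```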
Despite their strengths, small models can struggle with extended reasoning tasks or when faced with out-of-distribution data. In LLM chess simulations, for instance, DeepSeek-R1 made more mistakes than larger models, suggesting limits in its ability to maintain focus and accuracy over long stretches of reasoning.
Trade-offs and practical significance
The trade-off between model size and performance is critical when comparing SRMs with GPT-level LRMs. Smaller models require less memory and computing power, making them ideal for edge devices, mobile applications, or situations where offline inference is required. This efficiency translates into lower operating costs: models such as DeepSeek-R1 are up to 96% cheaper to run than larger models like o1.
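A back-of-the-envelope estimate helps explain the memory side of this trade-off: weight memory is roughly the parameter count times the bytes stored per parameter. The figures below are rough illustrations, not measured numbers for any particular model.

```python
# Rough weight-memory estimate: parameters * bytes per parameter.
# These are illustrative ballpark figures, not measured footprints.
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weight_memory_gb(1.5, 2))   # 1.5B params in fp16  -> ~3 GB
print(weight_memory_gb(7, 0.5))   # 7B params at 4-bit   -> ~3.5 GB
print(weight_memory_gb(70, 2))    # 70B params in fp16   -> ~140 GB
```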
However, these efficiency gains come with compromises. Smaller models are typically fine-tuned for specific tasks, which can limit their versatility compared with larger models. For example, while DeepSeek-R1 excels at mathematics and coding, it lacks multimodal capabilities, such as interpreting images, that larger models like GPT-4o can handle.
Despite these limitations, the practical applications of small reasoning models are vast. In healthcare, they can power tools that analyze medical data on standard hospital servers. In education, they can drive personalized tutoring systems that give students step-by-step feedback. In scientific research, they can assist with data analysis and hypothesis testing in fields like mathematics and physics. The open-source nature of models such as DeepSeek-R1 also fosters collaboration and democratizes access to AI, enabling smaller organizations to benefit from advanced technologies.
Bottom line
The evolution of language models into smaller reasoning models marks a significant advance in AI. Although these models may not yet fully match the broad capabilities of large language models, they offer key advantages in efficiency, cost-effectiveness, and accessibility. By striking a balance between reasoning power and resource efficiency, smaller models are poised to play a central role across many applications, making AI more practical and sustainable.