Salesforce AI Research introduces Reward-Guided Speculative Decoding (RSD): a novel framework that improves the inference efficiency of large language models (LLMs) by up to 4.4×

In recent years, the rapid scaling of large language models (LLMs) has brought remarkable improvements in natural language understanding and reasoning. However, this progress comes with a significant caveat: the inference process, which generates one token at a time, creates a computational bottleneck. As LLMs grow in size and complexity, the latency and energy demands of sequential token generation grow with them. These challenges are particularly acute in real-world deployments, where cost, speed, and scalability are critical. Traditional decoding approaches, such as greedy or beam search, often require repeated evaluations of large models, leading to high computational overhead. Moreover, even with parallel decoding techniques, maintaining both efficiency and output quality can be elusive. This situation has spurred the search for new techniques that reduce inference cost without sacrificing accuracy. Researchers have therefore been exploring hybrid approaches that pair lightweight models with stronger counterparts, striving for an optimal balance between speed and performance, which is crucial for real-time applications, interactive systems, and large-scale deployments in cloud environments.
Salesforce AI Research introduces Reward-Guided Speculative Decoding (RSD), a novel framework designed to improve the efficiency of inference in large language models (LLMs). At its core, RSD leverages a dual-model strategy: a fast, lightweight "draft" model works in tandem with a more powerful "target" model. The draft model quickly generates preliminary candidate outputs, while a Process Reward Model (PRM) evaluates the quality of these outputs in real time. Unlike traditional speculative decoding, which insists on strict, unbiased token matching between the draft and target models, RSD introduces a controlled bias. This bias is carefully engineered to favor high-reward outputs (those deemed more likely to be correct or contextually relevant), thereby greatly reducing unnecessary computation. The method is grounded in a mathematically derived threshold strategy that determines when the target model should intervene. By dynamically mixing outputs from the two models based on a reward function, RSD not only accelerates inference but also improves the overall quality of the generated responses. This approach, detailed in the accompanying paper, represents a significant step forward in addressing the inherent inefficiency of sequential token generation in LLMs.
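As a rough formalization consistent with the description above (the notation here is illustrative and may differ from the paper's exact definition), the mixed output distribution can be written as a reward-weighted combination of the draft distribution and the target distribution:

$$
P_{\text{RSD}}(z) \;=\; \omega\bigl(r(z)\bigr)\,P_{m}(z) \;+\; \bigl(1-\omega(r(z))\bigr)\,P_{M}(z),
\qquad \omega(r) \;=\; \mathbf{1}\{\,r \ge \delta\,\}
$$

Here $P_{m}$ is the draft model, $P_{M}$ is the target model, $r(z)$ is the PRM reward assigned to candidate $z$, and $\delta$ is the acceptance threshold. With the binary step weighting $\omega$, a draft candidate is kept whenever its reward clears $\delta$, and the target model is sampled otherwise.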

Technical details and benefits of RSD
Delving into the technical details, RSD operates by integrating the two models in a sequential yet collaborative manner. First, the draft model generates candidate tokens or reasoning steps at low computational cost. Each candidate is then evaluated using a reward function, which acts as a quality gate. If a candidate's reward exceeds a predetermined threshold, the output is accepted; if not, the system calls upon the more compute-intensive target model to generate a refined token. This process is governed by a weighting function, typically a binary step function, that adjusts the reliance on the draft versus the target model. The dynamic quality control provided by the Process Reward Model (PRM) ensures that only the most promising outputs bypass the target model, saving computation. One of the standout benefits of this approach is "biased acceleration," in which the controlled bias is not a flaw but a strategic choice that prioritizes high-reward outcomes. This yields two key advantages: first, the overall inference process can be up to 4.4× faster than running the target model alone; second, it typically delivers a +3.5 average accuracy improvement over conventional parallel decoding baselines. In essence, RSD aligns efficiency with accuracy, enabling a substantial reduction in floating-point operations (FLOPs) while still producing outputs that meet or exceed the performance of the target model. The theoretical foundations and algorithmic details, such as the mixture distribution defined by RSD and its adaptive acceptance criterion, provide a robust framework for practical deployment across diverse reasoning tasks.
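A minimal sketch of the acceptance loop described above, in Python. The callables `draft_step`, `target_step`, `prm_score`, and `is_final`, as well as the threshold value, are hypothetical placeholders based on the description in this article, not the released implementation or its API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class RSDConfig:
    delta: float = 0.7   # acceptance threshold on the PRM reward (assumed value)
    max_steps: int = 32  # maximum number of reasoning steps to generate

def reward_guided_speculative_decode(
    prompt: str,
    draft_step: Callable[[str], str],        # cheap draft model: proposes the next step
    target_step: Callable[[str], str],       # expensive target model: refines a step when needed
    prm_score: Callable[[str, str], float],  # process reward model: scores (context, step)
    is_final: Callable[[str], bool],         # detects an end-of-solution marker
    cfg: RSDConfig = RSDConfig(),
) -> str:
    """Reward-guided decoding with a binary step weighting function: keep the
    draft step if its reward clears the threshold, otherwise fall back to the
    target model for that step."""
    context = prompt
    for _ in range(cfg.max_steps):
        candidate = draft_step(context)          # cheap proposal
        reward = prm_score(context, candidate)   # quality gate via the PRM
        if reward >= cfg.delta:
            step = candidate                     # accept draft output, skip the target model
        else:
            step = target_step(context)          # target model intervenes on low-reward steps
        context += step
        if is_final(context):
            break
    return context
```

In this sketch, the binary step function corresponds to the `reward >= cfg.delta` test; softer weighting functions would instead blend or sample between the two models probabilistically rather than making a hard accept/reject decision.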
Results
The empirical validation of RSD is compelling. Experiments detailed in the paper show that RSD consistently delivers strong performance on challenging benchmarks such as GSM8K, MATH500, OlympiadBench, and GPQA. For example, on the MATH500 benchmark, a dataset designed to test mathematical reasoning, RSD achieves an accuracy of 88.0 when configured with a 72B target model and a 7B PRM, compared to 85.6 for the target model running alone. This configuration not only reduces the computational load by nearly 4.4× fewer FLOPs but also improves reasoning accuracy. The results highlight RSD's advantages over traditional methods such as speculative decoding (SD) and advanced search-based techniques such as beam search or Best-of-N strategies.

Conclusion: A new paradigm for efficient LLM reasoning
In summary, Reward-Guided Speculative Decoding (RSD) marks an important milestone in the quest for more efficient LLM inference. By combining a lightweight draft model with a powerful target model, and by introducing a reward-based acceptance criterion, RSD effectively addresses the dual challenges of computational cost and output quality. The biased-acceleration approach enables the system to selectively bypass expensive computation for high-reward outputs, streamlining the inference process. The dynamic quality-control mechanism, anchored by the process reward model, ensures that computational resources are allocated judiciously, engaging the target model only when necessary. With empirical results showing up to 4.4× faster inference and a +3.5 improvement in average accuracy over traditional methods, RSD not only paves the way for more scalable LLM deployments but also sets a new standard for the design of hybrid decoding frameworks.
Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 75K+ ML SubReddit.