Researchers at Apple and Duke University propose a reinforcement learning approach that lets LLMs provide intermediate answers, improving both speed and accuracy

Long chain-of-thought (CoT) reasoning improves the performance of large language models on complex tasks, but it comes with drawbacks. The typical "think-then-answer" approach slows response times, disrupting real-time interactions such as chatbots. It also risks inaccuracy, since errors in early reasoning steps can propagate into a misleading final answer. Unlike people, who often share partial thoughts or conclusions during a conversation, LLMs delay the answer until all reasoning is complete. While RL is commonly used to train reasoning models, it mainly rewards final answers and ignores useful intermediate insights. There is growing interest in teaching models to alternate between thinking and answering, but this remains a challenge.
RL has become a popular method for strengthening reasoning in LLMs, building on its success in aligning models with human preferences. Two common reward types guide RL training: outcome-based rewards (ORMs), which focus on the final answer, and process-based rewards (PRMs), which give feedback on intermediate reasoning steps. Although PRMs offer more detailed supervision, they often rely on human annotation and additional models, making them complex and prone to reward hacking. Separately, efforts to improve LLM reasoning have explored prompting strategies, structured reasoning, tool integration, and ways to reduce latency and improve efficiency.
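To make the ORM/PRM distinction concrete, here is a minimal Python sketch of the two reward styles; the function names, the exact-match rule, and the `score_step` helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch contrasting outcome- and process-based rewards.
# `score_step` is a hypothetical stand-in for a learned or annotated
# step scorer; neither function is taken from the paper.

def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """ORM-style: a single reward based only on the final answer."""
    return 1.0 if final_answer.strip() == gold_answer.strip() else 0.0

def process_rewards(reasoning_steps: list[str], score_step) -> list[float]:
    """PRM-style: per-step feedback, typically from annotations or a judge model."""
    return [score_step(step) for step in reasoning_steps]
```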
Researchers at Apple and Duke University introduced interleaved reasoning, a new RL approach that lets language models alternate between thinking and answering while solving complex, multi-step problems. Instead of waiting until the end of a response, the model shares informative intermediate answers, which gives users earlier feedback and helps guide its own reasoning. Trained with straightforward rule-based rewards to produce useful intermediate steps, the model reduces response delays by more than 80% and improves accuracy by up to 19.3%. Trained only on QA and logical reasoning datasets, the approach generalizes well to more challenging benchmarks such as MATH, GPQA, and MMLU.
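The article does not spell out the exact reward rules, but a rule-based intermediate reward for multi-hop QA could plausibly look like the sketch below, where sub-answers are matched against known intermediate facts; the matching rule and the `alpha` weighting are assumptions for illustration.

```python
# Hedged sketch of a rule-based reward that credits useful intermediate answers
# in multi-hop QA. The string-matching rule and the `alpha` weighting are
# illustrative assumptions, not the authors' exact formulation.

def intermediate_reward(sub_answers: list[str], gold_facts: list[str]) -> float:
    """Fraction of ground-truth intermediate facts recovered by the sub-answers."""
    if not gold_facts:
        return 0.0
    hits = sum(
        any(fact.lower() in ans.lower() for ans in sub_answers)
        for fact in gold_facts
    )
    return hits / len(gold_facts)

def total_reward(sub_answers, final_answer, gold_facts, gold_answer, alpha=0.5):
    """Final-answer correctness plus weighted credit for intermediate facts."""
    final_ok = 1.0 if final_answer.strip() == gold_answer.strip() else 0.0
    return final_ok + alpha * intermediate_reward(sub_answers, gold_facts)
```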
The study proposes a reinforcement learning framework that trains LLMs for interleaved reasoning, in which the model alternates between internal thinking and user-facing intermediate answers. Whenever the model reaches a meaningful milestone in its reasoning, it shares that intermediate step as a "sub-answer." Special training templates with separate thinking and answering segments structure this behavior, and rule-based rewards encourage the model to surface sub-answers along the way.
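As an illustration, such a template might instruct the model to wrap private reasoning and user-facing sub-answers in dedicated tags; the tag names and wording below are assumptions, not the authors' exact prompt.

```python
# Hypothetical interleaved-reasoning training template. The <think>/<answer>
# tags and the instruction wording are assumptions, not the paper's exact prompt.
INTERLEAVED_TEMPLATE = (
    "You are a helpful assistant. Solve the problem step by step. "
    "Keep private reasoning inside <think>...</think> tags, and whenever you reach "
    "a meaningful milestone, share it with the user inside <answer>...</answer> tags. "
    "Finish with a final <answer> that contains the complete solution.\n\n"
    "Question: {question}"
)

prompt = INTERLEAVED_TEMPLATE.format(
    question="Which country is the director of the 1997 film Titanic from?"
)
print(prompt)
```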
The interleaved reasoning method was evaluated on both familiar and unseen datasets using Qwen2.5 models (1.5B and 7B). Unlike traditional think-then-answer approaches, interleaving delivers answers step by step, improving both speed and usefulness. Combined with intermediate rewards, it significantly improves model performance while cutting response delays by more than 80%. Even without exposure to new domains during training, the model adapts well, showing strong generalization. These results highlight the value of interleaved reasoning for making AI systems more responsive and effective on real-world multi-step reasoning tasks.
In summary, the study shows how interleaved reasoning, in which the model alternates between reasoning and producing intermediate answers, can significantly improve both performance and responsiveness. Using the Qwen2.5-1.5B model, the authors show that providing timely intermediate rewards during training improves accuracy and accelerates response generation. Among the RL strategies tested, PPO gave stable results, and conditional, time-discounted rewards proved the most effective. The approach scales well to complex tasks and outperforms traditional think-then-answer baselines. Unlike token-level reward models, the method applies simple rule-based rewards after complete reasoning steps, which helps avoid reward hacking. Ultimately, interleaved reasoning improves the quality and efficiency of reasoning without relying on external tools.
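A hedged sketch of what a "conditional, time-discounted" reward could look like: intermediate credit is granted only when the output is well-formed and the final answer is correct, and earlier correct sub-answers earn more than later ones. The discount schedule and weighting are assumptions, not the authors' exact formula.

```python
def conditional_time_discounted_reward(
    sub_ok: list[bool],   # correctness of each intermediate sub-answer, in order
    final_ok: bool,       # whether the final answer matches the ground truth
    format_ok: bool,      # whether the thinking/answering structure is well-formed
    gamma: float = 0.9,   # assumed discount factor favoring earlier sub-answers
) -> float:
    """Outcome reward plus conditional, time-discounted intermediate credit (sketch)."""
    if not format_ok:
        return 0.0                       # malformed output earns nothing
    reward = 1.0 if final_ok else 0.0    # standard outcome reward on the final answer
    if not final_ok:
        return reward                    # condition: no partial credit without a correct outcome
    # Earlier correct sub-answers earn more credit than later ones.
    discounted = sum(gamma ** i for i, ok in enumerate(sub_ok) if ok)
    max_possible = sum(gamma ** i for i in range(len(sub_ok))) or 1.0
    return reward + discounted / max_possible
```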
Check out the paper. All credit for this research goes to the researchers on the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
