LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

Language models trained on huge Internet-scale datasets have become prominent tools for language understanding and generation. Their potential extends beyond language tasks: they can also operate as decision-making agents in interactive environments. When applied to settings that require choosing actions, these models are expected to act effectively using their internal knowledge and reasoning. Their ability to weigh context, trade off options, and select actions opens up new possibilities for integrating them into agentic systems that interact with dynamic environments.
Despite this promise, these models show critical limitations in decision-making. Although they can form accurate chains of reasoning, they often fail to act on them. This problem is known as the knowing-doing gap: the model identifies the correct strategy but does not implement it in practice. Another important issue is greediness, where the model prematurely commits to options that yielded high rewards early on and ignores alternative strategies that might lead to better outcomes. Furthermore, smaller models exhibit frequency bias, favoring the actions that appear most often in the context regardless of their reward, which impairs exploration and hinders learning from varied situations.
To address these challenges, researchers have tried various strategies. Classical reinforcement learning methods, including bandit algorithms such as the Upper Confidence Bound (UCB), are designed to manage the exploration-exploitation trade-off. In contrast, in-context learning and behavioral cloning imitate expert trajectories but often reinforce the same decision-making biases. Although some exploration strategies bring modest improvements, these approaches lack a mechanism that reliably translates internal reasoning into optimal actions, especially in complex or stochastic environments.
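For context, here is a minimal sketch of the classic UCB1 rule referenced above; the arm count, Bernoulli reward model, and incremental-mean bookkeeping are standard textbook choices rather than details from the paper.

```python
import math
import random

class UCB1:
    """Classic UCB1 bandit: pick the arm with the highest mean reward
    plus an exploration bonus that shrinks as the arm is tried more often."""

    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self) -> int:
        # Try every arm once before applying the UCB formula.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean update.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Toy usage: a 10-armed Bernoulli bandit with random payoff probabilities.
probs = [random.random() for _ in range(10)]
agent = UCB1(n_arms=10)
for _ in range(1000):
    arm = agent.select_arm()
    agent.update(arm, float(random.random() < probs[arm]))
```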
Researchers at Google DeepMind and the LIT AI Lab at JKU Linz focus on refining language model behavior through reinforcement learning fine-tuning (RLFT). Their method uses the model's self-generated chain-of-thought (CoT) rationales as the training signal. By evaluating the rewards of the actions that follow specific reasoning steps, the model learns to favor reasoning that not only sounds logical but also produces high rewards in practice. This reinforcement ties the model's reasoning to environmental feedback, promoting more consistent decisions and narrowing the gap between thought and action.
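To make the interaction format concrete, the sketch below shows how a prompt might be assembled in the bandit setting; the template wording, the "Action: &lt;index&gt;" convention, and the function name build_prompt are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical prompt construction for one interaction step: the model sees the
# task instruction plus the recent action-reward history and is asked to produce
# a chain-of-thought rationale followed by an action. Template wording is
# illustrative only.

def build_prompt(instruction: str, history: list[tuple[int, float]]) -> str:
    lines = [instruction, "", "Recent interactions:"]
    for step, (action, reward) in enumerate(history):
        lines.append(f"  step {step}: pressed button {action}, received reward {reward}")
    lines.append("")
    lines.append("Think step by step, then answer with 'Action: <button index>'.")
    return "\n".join(lines)

prompt = build_prompt(
    "You are playing a 10-armed bandit. Maximize your total reward.",
    [(3, 0.0), (7, 1.0), (7, 1.0)],
)
print(prompt)
```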
The method fine-tunes the model on tokens generated during environment interactions. At each step, the model receives an instruction prompt along with the recent history of actions and rewards, and generates a sequence containing a rationale followed by the selected action. These outputs are evaluated on the environment reward and on whether the action conforms to the required format; a penalty is applied when the model fails to produce a valid action. Over time, this reward shaping encourages consistent output formats while preserving exploration. The process uses Monte Carlo baseline estimates and generalized advantage estimation for variable-length tasks such as Tic-tac-toe, allowing the model to learn from decision sequences of differing lengths.
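The reward shaping and baseline described above can be sketched as follows; the parsing rule, the size of the format penalty, and the simple batch-mean baseline are simplifying assumptions (the paper's generalized advantage estimation is omitted here for brevity).

```python
import re
from statistics import mean

# Hypothetical reward shaping and advantage computation for one RLFT batch.
# The output format, penalty value, and parsing rule are illustrative guesses,
# not the exact choices made in the paper.

ACTION_PATTERN = re.compile(r"Action:\s*(\d+)")  # expects "... Action: <arm index>"
FORMAT_PENALTY = -5.0                            # penalty for an invalid or missing action

def shaped_reward(model_output: str, env_reward_fn) -> float:
    """Environment reward for the parsed action, plus a penalty if the
    output does not contain a well-formed action."""
    match = ACTION_PATTERN.search(model_output)
    if match is None:
        return FORMAT_PENALTY
    return env_reward_fn(int(match.group(1)))

def advantages(episode_rewards: list[float]) -> list[float]:
    """Monte Carlo baseline: subtract the batch-mean return from each
    episode's return. The resulting advantages weight the policy-gradient
    update on the rationale and action tokens."""
    baseline = mean(episode_rewards)
    return [r - baseline for r in episode_rewards]

# Toy usage: a 3-armed bandit where arm 2 pays 1.0 and the others pay 0.
def bandit_reward(arm: int) -> float:
    return 1.0 if arm == 2 else 0.0

outputs = [
    "The best arm so far is 2. Action: 2",
    "I will explore a new arm. Action: 0",
    "Unsure which arm to pick.",            # malformed: no action, so penalized
]
rewards = [shaped_reward(o, bandit_reward) for o in outputs]
print(rewards)              # [1.0, 0.0, -5.0]
print(advantages(rewards))  # advantages relative to the batch mean
```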
Performance results show that RLFT substantially improves the models' decision-making. In a button-based multi-armed bandit setup with 10 arms, the action coverage of the 2B-parameter model increased from 40% to over 52% after 30,000 gradient updates. In the 20-arm setting, coverage remained below optimal but still improved meaningfully. The 2B model's frequency bias, its tendency to repeat early actions, dropped to 35% after RLFT. In Tic-tac-toe, the 2B model's win rate against a random opponent rose from 15% to 75%, and it reached draws against an optimal Monte Carlo Tree Search agent, with its average return improving from -0.95 to 0.0. Moreover, larger models such as the 27B variant generated correct rationales 87% of the time yet selected the optimal action only 21% of the time without RLFT; after fine-tuning, this gap narrowed significantly.
The research shows that fine-tuning large language models with reinforcement applied to their own reasoning traces enhances their ability to act on what they know. This connection between thinking and acting is essential for building reliable decision-making agents. By directly addressing common decision errors and reinforcing successful behaviors, the proposed method offers a practical path toward more capable and autonomous LLM-based agents.
Check out the paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.