LLMs Struggle to Act on What They Know: Google DeepMind Researchers Use Reinforcement Learning Fine-Tuning to Bridge the Knowing-Doing Gap

Language models trained on huge Internet-scale datasets have become prominent tools for language understanding and generation. Their potential extends beyond language tasks: they can also operate as decision-making agents in interactive environments. When applied to settings that require choosing actions, these models are expected to act effectively using their internal knowledge and reasoning. Their ability to weigh context, trade off options, and select actions opens up new possibilities for integrating them into agentic systems that interact with dynamic environments.
Despite this promise, these models show critical limitations in decision-making. Although they can form accurate chains of reasoning, they often fail to act on them. This problem is known as the knowing-doing gap: the model identifies the correct strategy but does not implement it in practice. Another important issue is greediness, where the model prematurely commits to options that yielded high rewards early on and ignores alternative strategies that might lead to better outcomes. Furthermore, smaller models exhibit frequency bias, favoring the actions that appear most often in the context regardless of their reward, which impairs exploration and hinders learning from varied situations.
To address these challenges, researchers have tried various strategies. Classical reinforcement learning methods, including bandit algorithms such as the Upper Confidence Bound (UCB), are designed to manage the exploration-exploitation trade-off. In contrast, in-context learning and behavioral cloning imitate expert trajectories but often reinforce the same decision-making biases. Although some exploration strategies bring modest improvements, these approaches lack a mechanism that reliably translates internal reasoning into optimal actions, especially in complex or stochastic environments.
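For context, here is a minimal sketch of the classic UCB1 rule referenced above; the arm count, Bernoulli reward model, and incremental-mean bookkeeping are standard textbook choices rather than details from the paper.

```python
import math
import random

class UCB1:
    """Classic UCB1 bandit: pick the arm with the highest mean reward
    plus an exploration bonus that shrinks as the arm is tried more often."""

    def __init__(self, n_arms: int):
        self.counts = [0] * n_arms    # pulls per arm
        self.values = [0.0] * n_arms  # running mean reward per arm

    def select_arm(self) -> int:
        # Try every arm once before applying the UCB formula.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            self.values[a] + math.sqrt(2 * math.log(total) / self.counts[a])
            for a in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.counts[arm] += 1
        # Incremental mean update.
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

# Toy usage: a 10-armed Bernoulli bandit with random payoff probabilities.
probs = [random.random() for _ in range(10)]
agent = UCB1(n_arms=10)
for _ in range(1000):
    arm = agent.select_arm()
    agent.update(arm, float(random.random() < probs[arm]))
```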
Researchers at Google DeepMind and the LIT AI Lab at JKU Linz focus on refining language model behavior through reinforcement learning fine-tuning (RLFT). Their method uses the model's self-generated chain-of-thought (CoT) rationales as the training signal. By evaluating the rewards of the actions that follow specific reasoning steps, the model learns to favor reasoning that not only sounds logical but also produces high rewards in practice. This reinforcement ties the model's reasoning to environmental feedback, promoting more consistent decisions and narrowing the gap between thought and action.
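To make the interaction format concrete, the sketch below shows how a prompt might be assembled in the bandit setting; the template wording, the "Action: &lt;index&gt;" convention, and the function name build_prompt are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical prompt construction for one interaction step: the model sees the
# task instruction plus the recent action-reward history and is asked to produce
# a chain-of-thought rationale followed by an action. Template wording is
# illustrative only.

def build_prompt(instruction: str, history: list[tuple[int, float]]) -> str:
    lines = [instruction, "", "Recent interactions:"]
    for step, (action, reward) in enumerate(history):
        lines.append(f"  step {step}: pressed button {action}, received reward {reward}")
    lines.append("")
    lines.append("Think step by step, then answer with 'Action: <button index>'.")
    return "\n".join(lines)

prompt = build_prompt(
    "You are playing a 10-armed bandit. Maximize your total reward.",
    [(3, 0.0), (7, 1.0), (7, 1.0)],
)
print(prompt)
```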
The method fine-tunes the model on tokens generated during environment interactions. At each step, the model receives an instruction prompt along with the recent history of actions and rewards, and generates a sequence containing a rationale followed by the selected action. These outputs are evaluated on the environment reward and on whether the action conforms to the required format; a penalty is applied when the model fails to produce a valid action. Over time, this reward shaping encourages consistent output formats while preserving exploration. The process uses Monte Carlo baseline estimates and generalized advantage estimation for variable-length tasks such as Tic-tac-toe, allowing the model to learn from decision sequences of differing lengths.
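The reward shaping and baseline described above can be sketched as follows; the parsing rule, the size of the format penalty, and the simple batch-mean baseline are simplifying assumptions (the paper's generalized advantage estimation is omitted here for brevity).

```python
import re
from statistics import mean

# Hypothetical reward shaping and advantage computation for one RLFT batch.
# The output format, penalty value, and parsing rule are illustrative guesses,
# not the exact choices made in the paper.

ACTION_PATTERN = re.compile(r"Action:\s*(\d+)")  # expects "... Action: <arm index>"
FORMAT_PENALTY = -5.0                            # penalty for an invalid or missing action

def shaped_reward(model_output: str, env_reward_fn) -> float:
    """Environment reward for the parsed action, plus a penalty if the
    output does not contain a well-formed action."""
    match = ACTION_PATTERN.search(model_output)
    if match is None:
        return FORMAT_PENALTY
    return env_reward_fn(int(match.group(1)))

def advantages(episode_rewards: list[float]) -> list[float]:
    """Monte Carlo baseline: subtract the batch-mean return from each
    episode's return. The resulting advantages weight the policy-gradient
    update on the rationale and action tokens."""
    baseline = mean(episode_rewards)
    return [r - baseline for r in episode_rewards]

# Toy usage: a 3-armed bandit where arm 2 pays 1.0 and the others pay 0.
def bandit_reward(arm: int) -> float:
    return 1.0 if arm == 2 else 0.0

outputs = [
    "The best arm so far is 2. Action: 2",
    "I will explore a new arm. Action: 0",
    "Unsure which arm to pick.",            # malformed: no action, so penalized
]
rewards = [shaped_reward(o, bandit_reward) for o in outputs]
print(rewards)              # [1.0, 0.0, -5.0]
print(advantages(rewards))  # advantages relative to the batch mean
```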
Performance results show that RLFT substantially improves the models' decision-making. In a button-based multi-armed bandit setup with 10 arms, the action coverage of the 2B-parameter model increased from 40% to over 52% after 30,000 gradient updates. In the 20-arm setting, coverage remained below optimal but still improved meaningfully. The 2B model's frequency bias, its tendency to repeat early actions, dropped to 35% after RLFT. In Tic-tac-toe, the 2B model's win rate against a random opponent rose from 15% to 75%, and it reached draws against an optimal Monte Carlo Tree Search agent, with its average return improving from -0.95 to 0.0. Moreover, larger models such as the 27B variant generated correct rationales 87% of the time yet selected the optimal action only 21% of the time without RLFT; after fine-tuning, this gap narrowed significantly.
The research shows that fine-tuning large language models with reinforcement applied to their own reasoning traces enhances their ability to act on what they know. This connection between thinking and acting is essential for building reliable decision-making agents. By directly addressing common decision errors and reinforcing successful behaviors, the proposed method offers a practical path toward more capable and autonomous LLM-based agents.
Check out the paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast researching applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he explores new advancements and creates opportunities to contribute.