How Do LLMs Really Reason? A Framework That Separates Logic from Knowledge

Unraveling reasoning in modern LLMs: why the final answer is not enough
Recent advances in reasoning-focused LLMs such as OpenAI's o1/o3 and DeepSeek-R1 have led to significant improvements on complex tasks. However, the step-by-step reasoning these models produce remains poorly understood. Most evaluations focus on final-answer accuracy, which hides the reasoning process and does not reveal how a model combines knowledge with logic. Some earlier methods tried to measure reasoning by comparing intermediate steps to the original question, but this approach is flawed because models often rely on prior inferences or internal knowledge rather than the question alone. Moreover, the demands on reasoning differ across fields such as mathematics and medicine, underscoring the need for better, domain-aware evaluation methods for building trustworthy AI.
The limitations of final-answer evaluation in math and medicine
Recent LLMs have made impressive progress on reasoning tasks, especially in mathematics and medicine, thanks to better training data and reward strategies. However, most of this progress centers on improving final-answer accuracy rather than understanding how the model reasons step by step. Past work has flagged factual errors in reasoning chains or measured the similarity between each reasoning step and the original problem. But such similarity guarantees neither logical soundness nor factual correctness, since LLMs often draw on internal knowledge or earlier inferences rather than the question itself.
A new framework for separating knowledge and logic in LLM reasoning
Researchers at UC Santa Cruz, Stanford, and Tongji University moved beyond final-answer evaluation by splitting LLM reasoning into two key parts: factual knowledge and logical steps. They introduced a fine-grained framework built on two metrics: the Knowledge Index (KI) for factual accuracy and Information Gain (InfoGain) for reasoning quality. Their analysis of Qwen models across math and medical tasks shows that reasoning skills do not transfer easily between domains. While supervised fine-tuning (SFT) improves accuracy, it often compromises reasoning depth; reinforcement learning (RL), in contrast, helps refine reasoning by removing irrelevant information. The work underscores the importance of evaluating and training LLMs more thoughtfully.
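One plausible way to formalize the two indicators, for a reasoning chain s_1, ..., s_n produced in response to a question q with ground-truth answer a, is sketched below; the paper's exact notation may differ, so treat this as an illustrative reading of the description above rather than the authors' definitions.

```latex
% Hedged formalization; not necessarily the paper's exact definitions.
% InfoGain_i: how much step s_i reduces the model's uncertainty about the answer a.
\mathrm{InfoGain}_i = \log P_\theta\!\left(a \mid q, s_{1:i}\right) - \log P_\theta\!\left(a \mid q, s_{1:i-1}\right)

% KI: the fraction of steps whose factual content is verified against a trusted source.
\mathrm{KI} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\!\left[\text{step } s_i \text{ is factually correct}\right]
```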
Evaluating reasoning with Qwen2.5-7B and its DeepSeek-R1-distilled variant
The researchers evaluated reasoning in SFT- and RL-trained LLMs by analyzing Qwen2.5-7B and its DeepSeek-R1-distilled variant. Using tasks from the math and medical domains, they decomposed each response into logical steps and scored it with the two key metrics: Information Gain (how much each reasoning step reduces uncertainty about the final answer) and the Knowledge Index (whether the facts stated in each step are accurate, verified against expert sources). While InfoGain tracks how informative each step is, KI checks whether that knowledge aligns with real-world facts. This approach reveals where models reason well and where they falter, in either accuracy or logic.
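As a minimal sketch of how such step-level scoring might be computed once a response has been decomposed into steps, the snippet below uses assumed function names and toy inputs; it is an illustration of the idea, not the authors' released evaluation code.

```python
# Hypothetical sketch: per-step Information Gain and a chain-level Knowledge
# Index, given (a) the model's probability of the ground-truth answer after
# each prefix of the reasoning chain and (b) per-step fact-check verdicts.
import math


def info_gain(answer_probs: list[float]) -> list[float]:
    """answer_probs[i] is the model's probability of the ground-truth answer
    given the question plus the first i reasoning steps (index 0 = question
    only). The gain of step i is the drop in surprisal it contributes."""
    return [
        math.log(answer_probs[i]) - math.log(answer_probs[i - 1])
        for i in range(1, len(answer_probs))
    ]


def knowledge_index(step_verified: list[bool]) -> float:
    """Fraction of reasoning steps whose factual claims were verified against
    a trusted source (e.g., expert-curated references)."""
    return sum(step_verified) / len(step_verified)


if __name__ == "__main__":
    # Toy numbers for a 4-step chain: most steps raise the answer probability;
    # step 3 adds no information and its facts failed verification.
    probs = [0.10, 0.25, 0.55, 0.55, 0.90]   # p(answer | question + first i steps)
    verified = [True, True, False, True]      # per-step fact-check verdicts
    print("Per-step InfoGain:", [round(g, 3) for g in info_gain(probs)])
    print("Knowledge Index:", knowledge_index(verified))
```

In the actual study, the prefix probabilities would come from the model under evaluation and the verification verdicts from an external knowledge source; the toy values above only show how the two metrics disentangle informativeness from factual accuracy.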
Supervised fine-tuning versus reinforcement learning on domain-specific tasks
The study evaluated two variants of Qwen2.5-7B, Qwen-Base and the distilled Qwen-R1, on medical tasks. The results show that Qwen-Base consistently outperforms Qwen-R1 in accuracy, knowledge retention, and reasoning, especially after SFT and RL. The distilled model likely struggles because of a mismatch between its prior training and the medical domain. Interestingly, SFT enhances medical knowledge more effectively than RL, although it may slightly reduce reasoning efficiency; RL, on the other hand, improves both reasoning and knowledge when applied after SFT. Unlike math-centric tasks, medical benchmarks rely more on factual knowledge than on abstract reasoning.
Conclusion: toward more interpretable and trustworthy LLMs
In summary, the study introduces a framework that separates knowledge from reasoning in order to evaluate how LLMs think, particularly in high-stakes domains such as medicine and mathematics. Using SFT- and RL-trained Qwen models, the researchers find that while SFT improves factual accuracy, which is essential in medicine, it often weakens reasoning; RL, by contrast, strengthens reasoning by pruning out incorrect information. The framework could be extended to areas such as law and finance, where structured thinking is crucial. Overall, this approach helps clarify how LLMs make decisions and suggests ways to tailor their training to specific domains.
Check out the Paper, Code, and Project Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, join our 99k+ ML SubReddit, and subscribe to our newsletter.
Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He is very interested in solving practical problems, and he brings a new perspective to the intersection of AI and real-life solutions.
