
LLMs Struggle with Real Dialogue: Microsoft and Salesforce Researchers Reveal a 39% Performance Drop on Multi-Turn Tasks

Conversational artificial intelligence aims to let large language models (LLMs) handle dynamic interactions in which user needs are revealed gradually. These systems are widely deployed in tools that aid coding, writing, and research by interpreting and responding to natural-language instructions. The expectation is that such models can flexibly adapt to user input across multiple turns, adjusting their understanding with each new piece of information. This contrasts sharply with static, single-turn responses and highlights a major design goal: maintaining contextual coherence and delivering accurate results in extended conversations.

A persistent problem in conversational AI is that models struggle with user instructions distributed across multiple conversation turns. Instead of receiving all the necessary information at once, the LLM must gradually extract and integrate key details. When the task is not fully specified up front, however, models tend to make early assumptions about what is required and attempt a final solution too soon. Because these models typically stick to their early interpretations, such errors persist through the dialogue. As a result, once an LLM misunderstands the task, it struggles to recover, producing incomplete or misleading answers.

Most current evaluations use single-turn, fully specified prompts, in which the entire task is delivered in a single message. Even studies that advocate multi-turn analysis often treat the dialogue as episodic, viewing each exchange as an isolated subtask rather than an evolving conversation. Such evaluations cannot account for settings in which information is fragmented and the model must actively construct context from multiple exchanges. Consequently, they often miss the core difficulty models face: integrating underspecified inputs over several conversation turns without clear direction.

Researchers at Microsoft Research and Salesforce Research have introduced a simulation setup that mimics how users reveal information in real conversations. Their "sharded simulation" approach takes complete instructions from high-quality benchmarks and divides them into smaller, logically connected parts, or "shards." Each shard conveys a single element of the original instruction and is revealed sequentially over multiple turns, simulating the gradual disclosure of information that occurs in practice. The setup includes a simulated user, powered by an LLM, that decides which shard to reveal next and rephrases it naturally to fit the ongoing conversation. It also uses a classification mechanism to judge whether the assistant's response is an answer attempt or a request for clarification, further refining the simulation of real interaction.
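To make the setup concrete, the sketch below shows what such a sharded-simulation loop could look like in Python. Everything here is illustrative: `ShardedInstruction`, `user_llm`, `assistant_llm`, and `classify_response` are hypothetical stand-ins (the simulated user and classifier are LLM-backed in the actual study), and the termination rule is simplified.

```python
# Minimal sketch of a sharded-simulation loop. All names and the stubbed
# logic are hypothetical placeholders, not the researchers' actual code.

from dataclasses import dataclass, field

@dataclass
class ShardedInstruction:
    shards: list[str]                       # logically connected pieces of one full instruction
    revealed: list[str] = field(default_factory=list)

def user_llm(state: ShardedInstruction) -> str:
    """Simulated user: reveal the next shard (the study uses an LLM to
    choose and naturally rephrase it; this stub reveals shards in order)."""
    shard = state.shards[len(state.revealed)]
    state.revealed.append(shard)
    return shard

def assistant_llm(history: list[dict]) -> str:
    """Placeholder for the model under evaluation."""
    return "Could you clarify?" if len(history) < 4 else "FINAL ANSWER: ..."

def classify_response(reply: str) -> str:
    """Stub for the classifier that labels each assistant turn as an
    answer attempt or a clarification request."""
    return "attempt" if reply.startswith("FINAL ANSWER") else "clarification"

def run_sharded_episode(instr: ShardedInstruction, max_turns: int = 8) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_turns):
        if len(instr.revealed) < len(instr.shards):
            history.append({"role": "user", "content": user_llm(instr)})
        reply = assistant_llm(history)
        history.append({"role": "assistant", "content": reply})
        # Simplified stop rule: end once every shard is out and an answer
        # was attempted (the study also scores premature attempts).
        if classify_response(reply) == "attempt" and len(instr.revealed) == len(instr.shards):
            break
    return history

episode = run_sharded_episode(ShardedInstruction(shards=[
    "Write a SQL query over the `orders` table.",
    "Only include orders from 2024.",
    "Return total revenue per customer.",
]))
```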

The developed framework simulates five conversation types, including fully specified single-turn instructions and several multi-turn settings. In the sharded simulation, the LLM receives the instruction one shard at a time, forcing it to wait across turns before it has enough information to produce a complete answer. The setup evaluates 15 LLMs on six generation tasks: coding, SQL queries, API calls, math problems, data-to-text descriptions, and document summarization. Each task is drawn from established datasets such as GSM8K, Spider, and ToTTo. For each LLM and instruction, ten simulations were performed, for a total of more than 200,000 simulations. A percentile-based scoring system was used to compute aptitude, unreliability, and average performance, allowing direct comparison of each model's best and worst outcomes.
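As a rough sketch of how such percentile-based metrics can be computed, the snippet below assumes aptitude is taken as the 90th-percentile score and unreliability as the gap between the 90th and 10th percentiles over the ten runs; the exact definitions in the paper may differ in detail.

```python
# Percentile-based scoring sketch (assumed definitions, see lead-in above).
import numpy as np

def aptitude_and_unreliability(scores: list[float]) -> tuple[float, float]:
    """scores: per-simulation scores (0-100) for one model on one instruction."""
    p90 = np.percentile(scores, 90)   # best-case behaviour (aptitude)
    p10 = np.percentile(scores, 10)   # worst-case behaviour
    return p90, p90 - p10             # aptitude, unreliability

# Example: 10 sharded-simulation runs with highly variable outcomes.
runs = [100, 100, 90, 80, 60, 50, 40, 30, 20, 0]
apt, unrel = aptitude_and_unreliability(runs)
print(f"aptitude={apt:.1f}, unreliability={unrel:.1f}, mean={np.mean(runs):.1f}")
```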

A consistent decline in performance was observed in the sharded setting across all tasks and models. On average, performance dropped from about 90% in the single-turn setting to 65% in multi-turn scenarios, a 25-point decline. The main cause is not a loss of capability but a sharp increase in unreliability: while aptitude declined by 16%, unreliability rose by 112%, indicating that how a model performs varies greatly when information is presented gradually. Even the best-performing models, such as GPT-4.1 and Gemini 2.5 Pro, showed average degradations of 30-40%. Additional computation at generation time or reduced randomness (lower temperature settings) yielded only minor improvements in consistency.
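For readers who want the arithmetic spelled out, the short computation below reproduces the figures above; the unreliability values are hypothetical, chosen only to illustrate what a 112% increase means.

```python
# Illustrative arithmetic only; the unreliability values are hypothetical.
single_turn, multi_turn = 90.0, 65.0
absolute_drop = single_turn - multi_turn            # 25 points
relative_drop = 100 * absolute_drop / single_turn   # ~27.8% relative

# A 112% increase means unreliability more than doubles, e.g.:
unreliability_single, unreliability_multi = 16.0, 33.9   # hypothetical scores
increase = 100 * (unreliability_multi / unreliability_single - 1)  # ~112%

print(f"drop: {absolute_drop:.0f} points ({relative_drop:.1f}% relative); "
      f"unreliability: +{increase:.0f}%")
```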

This study shows that even state-of-the-art LLMs are not yet equipped to manage tasks that unfold gradually through complex dialogue. The sharded simulation method effectively reveals how models falter when adapting to evolving instructions, underscoring the urgent need to improve reliability in multi-turn settings. Strengthening LLMs' ability to process incomplete specifications over time is crucial for real-world applications, where dialogue is naturally unstructured and incremental.


Check out the paper and GitHub page. All credit for this research goes to the researchers on this project.


Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who is always researching applications in fields like biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
