This AI Paper Introduces Group Think: A Token-Level Multi-Agent Reasoning Paradigm for Faster and More Collaborative LLM Inference

A prominent area of exploration involves enabling large language models (LLMs) to operate collaboratively. Multi-agent systems powered by LLMs are now being examined for their ability to coordinate on challenging problems by splitting tasks and working in parallel. This direction has attracted attention because of its potential to increase efficiency and reduce latency in real-time applications.
A common problem in collaborative LLM systems is turn-based communication: each agent must wait for the others to complete their inference steps before proceeding. This slows down processing, especially when a quick response is required. Agents also often duplicate effort or produce inconsistent outputs because they cannot see their peers' evolving partial thoughts during generation. This latency and redundancy reduce the practicality of deploying multi-agent LLMs, especially under tight time and compute constraints, such as on edge devices.
Most current solutions rely on sequential or independently parallel sampling to improve reasoning. Methods such as Chain-of-Thought prompting help models solve problems in a structured way but often increase reasoning time. Extensions such as Tree-of-Thoughts and Graph-of-Thoughts branch the reasoning path, yet these methods still do not allow agents to adapt to one another in real time. Multi-agent setups have explored collaboration, but mostly through alternating message exchanges, which again introduces latency. Some advanced systems propose complex dynamic scheduling or role-based configurations, but these are not optimized for inference efficiency.
Researchers at MediaTek Research introduce a new approach called Group Think. It enables multiple reasoning agents within a single LLM to run concurrently while observing each other's partial outputs at the token level. Each reasoning thread adapts to the evolving thoughts of the others, which reduces duplication and lets an agent change direction mid-generation if another thread is better positioned to continue a particular line of reasoning. Group Think is implemented through a token-level attention mechanism that lets each agent attend to the tokens previously generated by all agents, enabling real-time collaboration.
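To make that attention pattern concrete, below is a minimal PyTorch sketch of the kind of cross-agent causal mask this description implies, assuming agents decode in lockstep and their tokens are interleaved by generation step. The function name and layout are illustrative assumptions, not the paper's actual code.

```python
import torch

def group_think_mask(num_agents: int, steps: int) -> torch.Tensor:
    """Hypothetical cross-agent causal mask for token-level Group Think.

    Assumes tokens are interleaved by step: position p = step * num_agents + agent.
    A token from agent `a` at step `s` may attend to:
      - every agent's tokens from earlier steps (s' < s), and
      - its own tokens up to and including step s.
    Other agents' tokens from the *same* step are not yet visible, since all
    agents generate their step-s tokens simultaneously.
    """
    total = num_agents * steps
    mask = torch.zeros(total, total, dtype=torch.bool)
    for p in range(total):
        s, a = divmod(p, num_agents)
        for q in range(total):
            s2, a2 = divmod(q, num_agents)
            mask[p, q] = (s2 < s) or (a2 == a and s2 <= s)
    return mask

# Example: 4 thinkers decoding 3 steps yields a 12x12 visibility mask.
print(group_think_mask(num_agents=4, steps=3).int())
```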
The method works by allocating each agent its own sequence of token indices so that the agents' outputs are interleaved in memory. These interleaved tokens are stored in a shared cache accessible to all agents during generation. This design allows efficient attention across reasoning threads without any architectural changes to the Transformer model. The implementation suits both personal devices and data centers. On a local device, it exploits otherwise idle compute by batching multiple agents' outputs even when the batch size is 1. In data centers, Group Think allows multiple requests to be processed together, interleaving their tokens while maintaining the correct attention dynamics.
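A rough sketch of the index bookkeeping this implies: assuming lockstep decoding, agent a's t-th token could be assigned slot t * num_agents + a in the shared cache, so every decoding step appends one contiguous block of num_agents tokens. The helper names below are hypothetical, chosen only for illustration.

```python
def interleaved_position(agent_id: int, step: int, num_agents: int) -> int:
    # Agent a's t-th token occupies slot t * num_agents + a in the shared cache,
    # so each decoding step fills one contiguous block of num_agents slots.
    return step * num_agents + agent_id

def owner_of(position: int, num_agents: int) -> tuple:
    # Inverse mapping: which agent wrote the token at this slot, and at which step.
    step, agent_id = divmod(position, num_agents)
    return agent_id, step

# With 4 thinkers, step 0 fills slots 0-3, step 1 fills slots 4-7, and so on.
assert interleaved_position(agent_id=2, step=1, num_agents=4) == 6
assert owner_of(6, num_agents=4) == (2, 1)
```

Because the layout is just an indexing convention over an ordinary KV cache, each agent's query can attend to all cached slots permitted by the mask above, with no change to the attention kernel itself.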
Performance tests show that Group Think significantly improves both latency and output quality. In enumeration tasks (such as listing 100 distinct names), it reaches near-complete results faster than conventional chain-of-thought reasoning, with speedup roughly proportional to the number of thinkers: four thinkers reduced latency by about a factor of four. In a divide-and-conquer problem, computing all-pairs shortest paths with the Floyd-Warshall algorithm on a five-node graph, four thinkers cut completion time to about half that of a single agent. In code-generation tasks, configurations with four or more thinkers produced correct code snippets where the conventional reasoning baseline fell short.
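For context on that benchmark, Floyd-Warshall computes all-pairs shortest paths with three nested loops, the kind of regular, decomposable structure that parallel thinkers can split among themselves. Here is a standard reference implementation of the task itself (the five-node graph's weights are invented for illustration and are not from the paper):

```python
import math

def floyd_warshall(weights):
    """All-pairs shortest paths. weights[i][j] is the edge weight i -> j,
    math.inf if there is no edge; weights[i][i] should be 0."""
    n = len(weights)
    dist = [row[:] for row in weights]  # copy so the input is not mutated
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

INF = math.inf
g = [
    [0,   3,   INF, 7,   INF],
    [3,   0,   1,   INF, INF],
    [INF, 1,   0,   2,   5  ],
    [7,   INF, 2,   0,   1  ],
    [INF, INF, 5,   1,   0  ],
]
print(floyd_warshall(g)[0][4])  # shortest 0 -> 4 distance: 7 (via 1, 2, 3)
```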
This study shows that existing LLMs, even without explicit training for collaboration, already exhibit emergent group reasoning behavior under the Group Think setup. In experiments, agents naturally diversified their work to avoid redundancy, often dividing a task by topic or focus area. These findings suggest that dedicated training on collaborative data could further improve the efficiency and sophistication of Group Think.
Check out the paper. All credit for this research goes to the researchers of this project.

Nikhil is an intern consultant at Marktechpost. He is pursuing an integrated dual degree in Materials at the Indian Institute of Technology, Kharagpur. Nikhil is an AI/ML enthusiast who researches applications in fields such as biomaterials and biomedical science. With a strong background in materials science, he is exploring new advancements and creating opportunities to contribute.
