
LLMs can now reason beyond language: Researchers introduce Soft Thinking to replace discrete tokens with continuous concept embeddings

Human reasoning naturally operates through abstract, nonverbal concepts rather than strictly through discrete language tokens. Current LLMs, however, are limited to reasoning within the boundaries of natural language, generating one token at a time from a predefined vocabulary. This token-by-token approach not only constrains the expressive power of the model but also limits the breadth of reasoning paths it can explore, especially in ambiguous or complex situations. Standard Chain-of-Thought (CoT) methods exemplify this limitation, forcing the model to commit to a single path at each step. Human cognition, by contrast, is more flexible and parallel, allowing multiple ideas to be considered simultaneously and deferring verbal expression until a concept is fully formed. This makes human reasoning more adaptable and robust in the face of uncertainty.

To address these limitations, the researchers propose a shift from token-based inference to reasoning in a continuous concept space, representing each reasoning step as a weighted combination of token embeddings. This approach allows the model to explore multiple reasoning trajectories in parallel and integrate richer conceptual representations. Previous research has shown the potential of manipulating hidden states to influence reasoning outcomes or to inject latent plans. However, applying continuous-space reasoning to larger models presents challenges. In models below 7B parameters, shared weights between the input and output layers keep hidden states aligned with token embeddings, facilitating continuous reasoning. In larger models, however, the input and output spaces are decoupled, and using hidden states directly as inputs causes distribution mismatches that are difficult to resolve. Attempts to bridge these gaps through retraining often lead to overfitting or degraded performance, highlighting the difficulty of achieving effective continuous reasoning at scale.
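As a rough illustration (not the authors' implementation), the weighted combination of token embeddings can be sketched as follows; the vocabulary size, embedding dimension, and variable names are all illustrative:

```python
import numpy as np

# Toy setup: vocabulary of 5 tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
V, d = 5, 4
embedding_table = rng.normal(size=(V, d))  # input embedding matrix E

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

logits = rng.normal(size=V)
p = softmax(logits)  # probability distribution over the vocabulary

# Standard CoT step: commit to one discrete token and feed back its embedding.
discrete_next = embedding_table[p.argmax()]

# Continuous-space step: feed back the probability-weighted mixture of ALL
# token embeddings (the expected embedding under p), so no single path is
# committed to and uncertainty is preserved.
concept_token = p @ embedding_table  # shape (d,)

print(concept_token.shape)
```

The mixture lives in the same space as ordinary input embeddings, which is why weight sharing between input and output layers (common in smaller models) makes this feedback loop straightforward, while decoupled spaces in larger models introduce the mismatch described above.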

Researchers at the University of California, Santa Barbara; the University of California, Santa Cruz; the University of California, Los Angeles; Purdue University; LMSYS Org; and Microsoft have proposed Soft Thinking, a training-free approach that enhances the reasoning of large language models by operating in a continuous concept space. Rather than selecting a discrete token at each step, the model generates a concept token, a probability-weighted mixture of all token embeddings, enabling parallel reasoning over multiple paths. This yields richer, more abstract representations. The method also includes a Cold Stop mechanism to improve efficiency. Evaluations on mathematical and coding tasks showed accuracy improvements of up to 2.48 percentage points (Pass@1), while using up to 22.4% fewer tokens than standard Chain-of-Thought.

The Soft Thinking approach modifies standard CoT inference by replacing discrete token sampling with concept tokens: full probability distributions over the vocabulary. Each distribution is used to compute a weighted mixture of token embeddings, allowing the model to reason in a continuous concept space. This preserves uncertainty and lets multiple reasoning paths be explored in parallel. To improve efficiency and prevent collapse, a Cold Stop mechanism monitors the entropy of these distributions and halts soft reasoning once the model becomes consistently confident. Theoretical analysis shows that Soft Thinking offers a more expressive yet computationally tractable alternative to discrete CoT, acting as a linear approximation to full marginalization over all reasoning paths.
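The entropy-based Cold Stop check can be sketched as below; the threshold `tau` and patience `k` are illustrative hyperparameters, not values from the paper:

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (in nats) of a probability distribution."""
    p = np.asarray(p, dtype=float)
    return float(-(p * np.log(p + eps)).sum())

def cold_stop(step_distributions, tau=0.5, k=3):
    """Return the index of the step at which soft reasoning should halt:
    the first step where entropy has stayed below tau for k consecutive
    steps (the model is consistently confident), or None otherwise."""
    run = 0
    for t, p in enumerate(step_distributions):
        run = run + 1 if entropy(p) < tau else 0
        if run >= k:
            return t
    return None

# Example: entropy falls as the model grows confident over reasoning steps.
dists = [
    [0.40, 0.30, 0.30],  # high entropy: still exploring
    [0.70, 0.20, 0.10],  # still above threshold
    [0.90, 0.05, 0.05],  # low entropy (1st consecutive)
    [0.95, 0.03, 0.02],  # low entropy (2nd consecutive)
    [0.98, 0.01, 0.01],  # low entropy (3rd consecutive) -> stop here
]
print(cold_stop(dists))  # → 4
```

Requiring several consecutive low-entropy steps, rather than a single one, guards against halting on a momentary dip in uncertainty.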

The study evaluated Soft Thinking on three open-source LLMs of different sizes and architectures across eight mathematics and programming benchmarks. Compared with standard and greedy CoT decoding, Soft Thinking consistently improved Pass@1 accuracy while substantially reducing the number of tokens generated, indicating more efficient reasoning. The method relies only on concept tokens and the Cold Stop controller, without modifying model weights or requiring additional training. Experiments show that Soft Thinking balances higher accuracy with lower computational cost, outperforming the baselines by enabling richer, more abstract reasoning in fewer steps across diverse tasks and models.

In short, Soft Thinking is a training-free approach that enables large language models to reason with continuous concept tokens instead of traditional discrete tokens. By mixing token embeddings according to their probabilities, Soft Thinking lets the model explore multiple reasoning paths simultaneously, improving both accuracy and efficiency. Across mathematical and coding benchmarks, it consistently improves Pass@1 accuracy while reducing the number of tokens generated, with no additional training or architectural changes. The method also preserves interpretability and concise reasoning. Future research may explore training adaptations to improve robustness, especially for out-of-distribution inputs. The code is publicly available.


Check out the paper and GitHub page. All credit for this research goes to the researchers on the project.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
