LightThinker: Dynamic Compression of Intermediate Thoughts to Improve LLM Reasoning Efficiency

LLM reasoning can be enhanced by breaking complex problems into sequential sub-steps, as in Chain-of-Thought (CoT) prompting. More recent o1-style approaches add behaviors such as trial and error, backtracking, correction, and iteration, which improve model performance on difficult problems. However, these gains come at significant computational cost. Because of the Transformer architecture, generating more tokens creates substantial memory overhead: attention complexity grows quadratically with context length, while KV-cache memory grows linearly. For example, when the context of Qwen-32B reaches 10,000 tokens, its KV cache consumes memory comparable to the model itself.
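To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV-cache size scales linearly with context length. The layer, head, and dimension values are illustrative assumptions for a generic large decoder, not the published Qwen-32B configuration.

```python
# Back-of-the-envelope KV-cache sizing. The architecture numbers below
# (64 layers, 64 KV heads, head_dim 128, fp16 storage) are illustrative
# assumptions, not an actual model configuration.
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Bytes needed to store keys and values for one sequence."""
    # 2x for keys + values, stored at every layer for every cached position.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

for seq_len in (1_000, 10_000, 100_000):
    gib = kv_cache_bytes(64, 64, 128, seq_len) / 2**30
    print(f"{seq_len:>7} tokens -> {gib:6.1f} GiB of KV cache")
```

At these assumed settings the cache grows from roughly 2 GiB at 1,000 tokens to roughly 20 GiB at 10,000 tokens, which is why long reasoning chains quickly become memory-bound.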

Current methods for accelerating LLM inference fall into three broad categories: model quantization, generating fewer tokens, and reducing the KV cache. Quantization covers both parameter quantization and KV-cache quantization. Within the KV-cache category, the key strategies are pruning-based selection in discrete space and merging-based compression in continuous space. Pruning-based strategies apply an eviction policy during inference to retain only the important tokens. Merging-based strategies introduce anchor tokens that compress historically important information. The trade-off between the two is that pruning-based approaches are training-free but must apply the eviction policy for every generated token, whereas merging-based approaches require model training.
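For intuition, the sketch below shows what a pruning-based eviction policy can look like: always keep the most recent tokens plus the historically "heaviest" tokens by accumulated attention, in the spirit of H2O. This is an illustrative simplification with assumed parameters, not the authors' implementation.

```python
import numpy as np

def heavy_hitter_keep(attn_mass: np.ndarray, budget: int, recent: int) -> np.ndarray:
    """Return indices of cached tokens to keep under a fixed cache budget.

    attn_mass: accumulated attention each cached token has received, shape (seq_len,).
    budget:    total number of cache slots to keep (budget > recent assumed).
    recent:    number of most-recent tokens that are always kept.
    """
    seq_len = attn_mass.shape[0]
    if seq_len <= budget:
        return np.arange(seq_len)
    recent_idx = np.arange(seq_len - recent, seq_len)   # local window, always kept
    older_idx = np.arange(seq_len - recent)
    # Keep the older tokens with the largest accumulated attention mass.
    heavy_idx = older_idx[np.argsort(attn_mass[older_idx])[-(budget - recent):]]
    return np.sort(np.concatenate([heavy_idx, recent_idx]))

# Example: keep 8 of 20 cached tokens (4 recent + 4 heavy hitters).
rng = np.random.default_rng(0)
print(heavy_hitter_keep(rng.random(20), budget=8, recent=4))
```

Because a policy like this runs at every decoding step, it trades memory savings for per-token selection overhead, which is the cost LightThinker aims to avoid.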

Researchers from Zhejiang University and Ant Group, working in the Zhejiang University – Ant Group Joint Laboratory of Knowledge Graph, have proposed LightThinker, which enables LLMs to compress intermediate thoughts dynamically during reasoning. Inspired by human cognition, LightThinker condenses verbose reasoning steps into compact representations and discards the original reasoning chains, greatly reducing the number of tokens kept in the context window. The researchers also introduce the Dependency (Dep) metric, which quantifies the degree of compression by measuring reliance on historical tokens during generation. LightThinker reduces peak memory usage and inference time while maintaining competitive accuracy, offering a promising direction for improving LLM efficiency on complex reasoning tasks.
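As a rough illustration of what the Dependency (Dep) metric captures, the toy sketch below counts, for every generated token, how many historical tokens it can attend to and sums these counts. This is a simplified reading of the metric with made-up numbers, not the paper's exact formulation.

```python
def dep(context_sizes: list[int]) -> int:
    """Toy proxy for the Dep metric: total number of historical tokens
    visible across all generation steps. context_sizes[i] is the number
    of cached tokens attended to when the i-th output token is generated."""
    return sum(context_sizes)

prompt_len, gen_len = 100, 50
vanilla = dep([prompt_len + i for i in range(gen_len)])      # full history kept
# Hypothetical compressed run: history collapsed to ~20 tokens, refilled
# with up to 10 new tokens before each compression step.
compressed = dep([20 + (i % 10) for i in range(gen_len)])
print(vanilla, compressed)  # 6225 vs. 1225
```

A lower Dep value indicates that generation depends on fewer stored tokens, which is the effect dynamic compression is meant to produce.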

LightThinker was evaluated with the Qwen2.5-7B and Llama3.1-8B models. The researchers performed full-parameter instruction tuning on the Bespoke-Stratos-17k dataset and designate the resulting model as Vanilla. Five baselines were compared: two training-free acceleration methods (H2O and SepLLM), one training-based method (AnLLM), and CoT prompting applied to the instruction-tuned and R1-Distill models. Evaluation covers four datasets (GSM8K, MMLU, GPQA, and BBH) and measures both effectiveness and efficiency (via inference time, peak token count, and the Dependency metric). The implementation explores two compression granularities: token-level compression (compressing every 6 tokens into 2) and thought-level compression (segmenting thoughts with "\n\n" as the delimiter).
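The two granularities can be pictured as different rules for deciding when to compress. The sketch below only illustrates the segmentation logic with assumed helper names; in LightThinker the actual compression is carried out inside the trained model via special tokens and attention masks.

```python
from typing import Iterator, List

def token_level_segments(tokens: List[str], window: int = 6) -> Iterator[List[str]]:
    """Token-level granularity: trigger compression after every fixed window
    of tokens (here 6 tokens, to be condensed into 2 compact slots)."""
    for i in range(0, len(tokens), window):
        yield tokens[i:i + window]

def thought_level_segments(reasoning: str) -> Iterator[str]:
    """Thought-level granularity: trigger compression after each complete
    thought, using a blank line as the delimiter."""
    for thought in reasoning.split("\n\n"):
        if thought.strip():
            yield thought.strip()
```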

The results across the four metrics for both models on all datasets reveal several findings. Distill-R1 consistently underperforms CoT on all datasets, a gap attributed to repetition issues caused by greedy decoding. H2O preserves model performance while reducing memory usage, validating its greedy eviction policy for long-text generation. However, H2O substantially increases inference time (by 51% on Qwen and 72% on Llama) because its token eviction policy adds overhead for every generated token. LightThinker matches H2O's accuracy at a similar compression rate while reducing inference time by 52% on Qwen and 41% on Llama.

In this paper, the researchers introduce LightThinker, a new approach for improving LLM efficiency on complex reasoning tasks by dynamically compressing intermediate thoughts during generation. By training models to learn when and how to compress verbose reasoning steps into compact representations, LightThinker substantially reduces memory overhead and computational cost while maintaining competitive accuracy. Some limitations remain: compatibility with parameter-efficient fine-tuning methods such as LoRA or QLoRA has not been explored, the potential benefits of larger training datasets are unknown, and a noticeable performance drop is observed on the Llama-series models.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don't forget to join our 80k+ ML SubReddit.

Sajjad Ansari is a final year undergraduate student from IIT Kharagpur. As a technology enthusiast, he delves into the practical application of AI, focusing on understanding AI technology and its real-world impact. He aims to express complex AI concepts in a clear and easy way.

