LifelongAgentBench: A Benchmark for Evaluating the Lifelong Learning of LLM-Based Agents

Lifelong learning is crucial for intelligent agents operating in changing environments, yet current LLM-based agents fall short: they lack persistent memory and treat each task as a fresh start. Although LLMs have transformed language tasks and inspired agent-based systems, these agents remain stateless and cannot learn from past experience. Real progress toward general intelligence requires agents that can retain, adapt, and reuse knowledge over time. Unfortunately, current benchmarks focus mainly on isolated tasks, neglecting skill reuse and knowledge retention. Without a standardized assessment of lifelong learning, it is difficult to measure real progress, and issues such as label errors and poor reproducibility further hinder development.
Lifelong learning, also known as continual learning, aims to help AI systems build and retain knowledge across tasks while avoiding catastrophic forgetting. Most previous work in this area has focused on non-interactive tasks such as image classification or sequential fine-tuning, where models process static inputs and outputs without responding to changing environments. Applying lifelong learning to LLM-based agents operating in dynamic, interactive settings, however, remains largely underexplored. Existing benchmarks such as WebArena, AgentBench, and VisualWebArena evaluate one-shot task performance but do not support learning over time. Even interactive research involving games or tool use lacks a standard framework for evaluating lifelong learning in agents.
Researchers from South China University of Technology, MBZUAI, the Chinese Academy of Sciences, and East China Normal University have introduced LifelongAgentBench, the first comprehensive benchmark for evaluating lifelong learning in LLM-based agents. It features built-in label verification, reproducibility, and a modular design, and spans three environments (Database, Operating System, and Knowledge Graph) with interdependent, skill-grounded tasks. The study finds that conventional experience replay is often ineffective because it pulls in irrelevant information and runs up against context-length limits. To address this, the team proposes a group self-consistency mechanism that clusters past experiences and applies a voting strategy, substantially improving lifelong learning performance across a range of LLM architectures.
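The group self-consistency mechanism is only described at a high level here, so the sketch below shows one plausible way the idea could look in Python. The names (`group_self_consistency`, `generate_response`) are illustrative assumptions rather than the benchmark's actual API: past experiences are split into small groups, the model answers the current task once per group, and the final answer is chosen by majority vote.

```python
from collections import Counter
from typing import Callable, List, Sequence


def group_self_consistency(
    task_prompt: str,
    experiences: Sequence[str],               # past trajectories rendered as text
    generate_response: Callable[[str], str],  # wraps a single LLM call
    group_size: int = 4,
) -> str:
    """Split past experiences into groups, query the model once per group, then vote."""
    # 1. Partition experiences into fixed-size groups so each prompt stays
    #    within the model's context window.
    groups: List[Sequence[str]] = [
        experiences[i:i + group_size] for i in range(0, len(experiences), group_size)
    ] or [[]]

    # 2. Ask the model to solve the current task once per experience group.
    candidates: List[str] = []
    for group in groups:
        prefix = "\n\n".join(group)
        prompt = f"{prefix}\n\n{task_prompt}" if prefix else task_prompt
        candidates.append(generate_response(prompt).strip())

    # 3. Majority vote across the candidate answers/actions.
    winner, _ = Counter(candidates).most_common(1)[0]
    return winner
```

Grouping keeps each individual prompt short, while the vote reduces the variance introduced by any single group of (possibly irrelevant) experiences.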
LifelongAgentBench tests how effectively language-model-based agents learn and adapt over time across a sequence of tasks. It frames learning as a goal-conditioned, sequential decision problem, formulated as a POMDP, in three environments: Database, Operating System, and Knowledge Graph. Tasks are structured around core skills and are carefully designed to reflect real-world complexity, accounting for factors such as task difficulty, overlapping skills, and environmental noise. Task generation combines automation with manual verification to ensure quality and diversity. The benchmark thus assesses whether agents can build on past knowledge and improve continuously in dynamic, skill-driven environments.
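To make the goal-conditioned, POMDP-style setup concrete, here is a minimal interaction-loop sketch under assumed interfaces (`Task`, `Environment`, `Agent`, and `run_episode` are illustrative, not the benchmark's actual code): the agent sees only partial observations from the environment and acts step by step until the goal is reached or a step budget runs out.

```python
from dataclasses import dataclass, field
from typing import List, Protocol, Tuple


@dataclass
class Task:
    goal: str                                        # e.g. a natural-language goal for a DB query
    skills: List[str] = field(default_factory=list)  # core skills the task exercises


class Environment(Protocol):
    def reset(self, task: Task) -> str: ...               # returns the initial observation
    def step(self, action: str) -> Tuple[str, bool]: ...  # returns (observation, done)


class Agent(Protocol):
    def act(self, goal: str, observation: str,
            history: List[Tuple[str, str]]) -> str: ...


def run_episode(agent: Agent, env: Environment, task: Task, max_steps: int = 20) -> bool:
    """Roll out one goal-conditioned task; return True if the goal is reached."""
    observation = env.reset(task)
    history: List[Tuple[str, str]] = []  # (action, observation) pairs seen so far
    for _ in range(max_steps):
        action = agent.act(task.goal, observation, history)
        observation, done = env.step(action)
        history.append((action, observation))
        if done:
            return True
    return False
```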
Unlike earlier benchmarks that focus on isolated or parallel tasks, LifelongAgentBench is an evaluation framework built to test how LLM-based agents learn over time by handling tasks sequentially. Its modular architecture includes components such as agents, environments, and controllers that run independently and communicate via RPC. The framework prioritizes reproducibility and flexibility, supporting a variety of environments and models. Experiments show that experience replay, which feeds an agent its past successful trajectories, can significantly improve performance, especially on complex tasks. However, larger replay buffers can cause memory problems, underscoring the need for more efficient replay and memory-management strategies.
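The experience-replay setup is only summarized above, so the snippet below is a hedged illustration rather than the framework's actual implementation. It assumes trajectories are plain text and uses a crude character-based token estimate (`tokens_per_char`) in place of a real tokenizer; its point is simply to show how prepending past trajectories to a new task's prompt runs into a context budget.

```python
from typing import List


def build_replay_prompt(
    task_prompt: str,
    past_trajectories: List[str],   # successful trajectories rendered as text
    max_context_tokens: int = 8000,
    tokens_per_char: float = 0.25,  # rough heuristic standing in for a real tokenizer
) -> str:
    """Prepend as many recent past trajectories as fit within a context budget."""
    budget = max_context_tokens - int(len(task_prompt) * tokens_per_char)
    selected: List[str] = []
    # Walk backwards from the most recent trajectory and stop once the budget
    # would be exceeded -- naive replay quickly hits this limit, which is the
    # memory/context problem noted above.
    for trajectory in reversed(past_trajectories):
        cost = int(len(trajectory) * tokens_per_char)
        if cost > budget:
            break
        selected.append(trajectory)
        budget -= cost
    return "\n\n".join(list(reversed(selected)) + [task_prompt])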
In short, LifelongAgentBench is a groundbreaking benchmark for evaluating the ability of LLM-based agents to learn over time. Unlike earlier benchmarks that treat agents as static, it tests their ability to build knowledge across interconnected tasks in dynamic environments such as databases, operating systems, and knowledge graphs. It offers a modular design, reproducibility, and automated evaluation. Although experience replay and group self-consistency show promise for enhancing learning, problems such as memory overload and inconsistent gains across models persist. This work lays the foundation for developing more adaptable, memory-efficient agents, with future directions focusing on smarter memory use and real-world multimodal tasks.
View the paper for full details. All credit for this research goes to the researchers of the project.

Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.