Meet M3-Agent: A Multimodal Agent with Long-Term Memory and Enhanced Reasoning Capabilities
Imagine a future in which a home robot handles everyday chores and gradually learns its family's routines from ongoing experience. Over time, it might serve the morning coffee without being asked. For multimodal agents, this kind of intelligence depends on (a) continually observing the world through multimodal sensors, (b) storing those experiences in long-term memory, and (c) reasoning over that memory to guide actions. Current research focuses largely on LLM-based agents, but multimodal agents must process diverse inputs and store far richer multimodal content, which raises new challenges for maintaining consistency in long-term memory. Like humans, multimodal agents need to build internal world knowledge rather than simply store descriptive records of their experiences.
Existing approaches attach the agent's raw trajectory, such as conversations or execution history, to memory. Some methods enhance this by adding summaries, latent embeddings, or structured knowledge representations. For multimodal agents, memory formation is closely related to online video understanding; techniques such as extending context windows or compressing visual tokens often fail to scale to long video streams. Memory-based approaches that store encoded visual features improve scalability but struggle to maintain long-term consistency. Frameworks in the style of Socratic Models generate language-based memory to describe videos, which scales well but has difficulty tracking evolving entities and events.
Researchers from ByteDance Seed, Zhejiang University, and Shanghai Jiao Tong University proposed M3-Agent, a multimodal agent framework with long-term memory. M3-Agent processes real-time visual and auditory input to build and update its memory, much as humans do. Beyond standard episodic memory, it also develops semantic memory, allowing it to accumulate world knowledge over time. Its memory is organized in an entity-centric, multimodal structure, enabling a deeper and more consistent understanding of the environment. Given an instruction, M3-Agent performs multi-turn reasoning and autonomously retrieves relevant information from memory. The researchers also developed M3-Bench, a new benchmark, to evaluate M3-Agent's effectiveness.
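To make the entity-centric memory idea concrete, here is a minimal, hypothetical sketch of such a structure in Python. The class and field names (`MemoryNode`, `MemoryGraph`, `kind`, and the example IDs) are illustrative assumptions, not the paper's actual implementation; the point is only that episodic events and semantic facts are stored as nodes linked around entities.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    """One memory item: an entity, an episodic event, or a semantic fact.
    (Hypothetical schema for illustration only.)"""
    node_id: str
    modality: str   # e.g. "text", "image", "audio"
    content: str    # raw or summarized content
    kind: str       # "episodic" or "semantic"

@dataclass
class MemoryGraph:
    """Entity-centric store: nodes keyed by ID, edges linking related items."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (src_id, dst_id, relation)

    def add(self, node: MemoryNode) -> None:
        self.nodes[node.node_id] = node

    def link(self, src: str, dst: str, relation: str) -> None:
        self.edges.append((src, dst, relation))

    def neighbors(self, node_id: str) -> list:
        return [dst for src, dst, _ in self.edges if src == node_id]

# An entity node accumulates both episodic and semantic memory around it.
graph = MemoryGraph()
graph.add(MemoryNode("person:alice", "image", "face crop of Alice", "episodic"))
graph.add(MemoryNode("fact:001", "text", "Alice prefers black coffee", "semantic"))
graph.link("person:alice", "fact:001", "has_fact")
print(graph.neighbors("person:alice"))  # ['fact:001']
```

Centering the graph on entities, rather than on timestamps alone, is what lets later observations of the same person or object attach to the same node and stay consistent over time.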
M3-Agent combines a multimodal LLM with a long-term memory module and runs two parallel processes: memorization and control. Long-term memory is an external database that stores structured multimodal data as a memory graph, where nodes represent memory items with unique IDs, modalities, raw content, embeddings, and metadata. During memorization, M3-Agent processes the video stream clip by clip, generating episodic memory from the raw content and semantic memory that captures abstract knowledge such as identities and relationships. During control, the agent performs multi-turn reasoning, calling a search function to retrieve relevant memory over up to H turns. The framework is optimized with reinforcement learning, training separate models for memorization and control to reach peak performance.
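The control process described above can be sketched as a retrieval loop. This is a toy approximation under stated assumptions: the `search` function here ranks items by simple word overlap (standing in for embedding-based retrieval), and the loop structure, memory entries, and stopping rule are all hypothetical, not taken from the paper.

```python
def search(memory, query, k=2):
    """Toy retrieval stand-in: rank memory items by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(memory,
                    key=lambda m: -len(q & set(m["content"].lower().split())))
    return scored[:k]

def control_loop(memory, question, max_turns=3):
    """Multi-turn reasoning sketch: retrieve, refine the query, repeat up to
    max_turns (the paper's H), then answer from what was gathered."""
    retrieved = []
    query = question
    for _ in range(max_turns):
        hits = search(memory, query)
        new = [h for h in hits if h not in retrieved]
        if not new:          # nothing new found -> stop and answer
            break
        retrieved.extend(new)
        # Refine the next query with what has been retrieved so far.
        query = question + " " + " ".join(m["content"] for m in retrieved)
    return retrieved

# Hypothetical memory items mixing episodic events and semantic facts.
memory = [
    {"id": "e1", "content": "Alice entered the kitchen at 8am"},
    {"id": "s1", "content": "Alice prefers black coffee"},
    {"id": "e2", "content": "the dog barked in the yard"},
]
result = control_loop(memory, "what coffee does Alice like")
print([m["id"] for m in result])  # ['s1', 'e1']
```

The key design point the loop illustrates is that retrieval is iterative: each turn's findings reshape the next query, rather than answering from a single one-shot lookup.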
M3-Agent was evaluated against all baselines on M3-Bench-robot and M3-Bench-web. On M3-Bench-robot, M3-Agent improved accuracy by 6.3% over the strongest baseline, MA-LMM, while on M3-Bench-web and VideoMME-long it outperformed Gemini-GPT4o-Hybrid by 7.7% and 5.3%, respectively. Furthermore, on M3-Bench-robot, M3-Agent outperformed MA-LMM by 4.2% in human understanding and by 8.5% in cross-modal reasoning. On M3-Bench-web, it outperformed Gemini-GPT4o-Hybrid by 15.5% and 6.7% in these same categories. These results underscore M3-Agent's ability to maintain character consistency, strengthen human understanding, and effectively integrate multimodal information.
In summary, the researchers introduced M3-Agent, a multimodal framework with long-term memory that processes real-time video and audio streams to build episodic and semantic memory. This enables the agent to accumulate world knowledge and maintain consistent, context-rich memory over time. Experiments show that M3-Agent outperforms all baselines across multiple benchmarks. Detailed case studies highlight current limitations and suggest future directions, such as improving attention mechanisms for semantic memory and developing more effective visual memory systems. These advances pave the way for AI agents that behave more like humans in practical applications.
Check out the Paper and GitHub page.

Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he explores practical applications of AI, focusing on understanding AI technologies and their real-world impact. He aims to articulate complex AI concepts in a clear and accessible way.