Nested Learning: A New Machine Learning Approach for Continual Learning that Treats Models as Nested Optimization Problems to Improve Long-Context Handling
How do we build AI systems that can learn new information over time without forgetting what they learned previously and without retraining from scratch? Google researchers introduced nested learning, a machine learning approach that treats a model as a collection of smaller, nested optimization problems rather than as a single network trained by one outer loop. The goal is to attack catastrophic forgetting and move large models toward continual learning, closer to the way biological brains manage memory and adapt over time.

What is nested learning?
The Google research paper "Nested Learning: The Illusion of Deep Learning Architectures" models a complex neural network as a set of coherent optimization problems that are nested or run in parallel and are optimized together. Each internal problem has its own context flow, the sequence of inputs, gradients, or states observed by that component, and its own update frequency.
Instead of treating training as a stack of layers plus an optimizer, nested learning orders these components by update frequency. Frequently updated parameters sit at the inner levels, while slowly updated parameters form the outer levels. This hierarchy defines a neural learning module, where each level compresses its own context flow into its parameters. The research team shows that this view covers standard backpropagation on an MLP, linear attention, and common optimizers, all of which become instances of associative memory.
In this framework, an associative memory is any operator that maps keys to values and is trained on an internal objective. The research team formalizes associative memory and then shows that backpropagation itself can be written as a one-step gradient descent update that learns a mapping from the input to the local surprise signal, that is, the gradient of the loss with respect to the output.
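To make the associative memory view concrete, here is a minimal sketch, assuming a linear memory and an L2 internal objective; the shapes, learning rate, and function names are illustrative choices, not the paper's exact formulation. A single gradient step writes a key-value pair into the memory, driven by the local surprise signal.

```python
import numpy as np

def associative_memory_update(M, key, value, lr=0.1):
    """One gradient step on ||M @ key - value||^2 with respect to the memory matrix M."""
    pred = M @ key                  # current retrieval for this key
    surprise = pred - value         # local error signal, the "surprise"
    grad = np.outer(surprise, key)  # gradient of the L2 objective w.r.t. M (up to a constant)
    return M - lr * grad            # compress the (key, value) pair into the memory

d_k, d_v = 8, 4
M = np.zeros((d_v, d_k))                                  # empty linear memory
key, value = np.random.randn(d_k), np.random.randn(d_v)
M = associative_memory_update(M, key, value)
```

Under this reading, each component of a network, whether a layer, an attention map, or an optimizer state, runs some version of this loop on its own context flow and at its own frequency.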


Deep optimizers as associative memory
Once optimizers are viewed as learning modules, nested learning suggests redesigning them with richer internal objectives. Standard momentum can be written as a linear associative memory over past gradients, trained with a dot-product similarity objective. This internal objective produces Hebbian-like update rules that do not model dependencies between data samples.
The research team replaces this similarity objective with an L2 regression loss on gradient features, which yields an update rule that manages limited memory capacity better and remembers gradient sequences better. They then generalize the momentum memory from a linear map to an MLP and define deep momentum gradient descent, where the momentum state is produced by a neural memory and can be passed through a nonlinear function such as a Newton-Schulz iteration. This perspective also recovers the Muon optimizer as a special case.
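The contrast between the two internal objectives can be sketched as follows. This is an illustrative toy that assumes a linear momentum memory M and a hypothetical feature vector for each gradient; the paper's deep momentum gradient descent further replaces the linear memory with an MLP and can pass its output through a nonlinearity such as Newton-Schulz.

```python
import numpy as np

def hebbian_momentum_update(M, feat, grad, beta=0.9):
    # Dot-product similarity objective -> Hebbian-like outer-product update.
    # It ignores what M already stores, so data samples do not interact.
    return beta * M + np.outer(grad, feat)

def l2_momentum_update(M, feat, grad, beta=0.9, lr=0.5):
    # L2 regression objective ||M @ feat - grad||^2 -> delta-rule update.
    # The error term accounts for the memory's current contents, which helps
    # manage limited capacity and remember the gradient sequence.
    err = M @ feat - grad
    return beta * M - lr * np.outer(err, feat)

d = 8
M = np.zeros((d, d))
feat, grad = np.random.randn(d), np.random.randn(d)   # gradient features and the gradient
M = l2_momentum_update(M, feat, grad)
```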


Continuum memory system
In a typical sequence model, attention acts as working memory over the current context window, while feedforward blocks store pre-trained knowledge as long-term memory that is rarely updated after training. The nested learning researchers extend this binary view to a continuum memory system (CMS).
A CMS is defined as a chain of MLP blocks, MLP(f₁) through MLP(fₖ), where each block has its own update frequency and chunk size. For an input sequence, the output is obtained by applying these blocks sequentially. The parameters of each block are updated only every C^(ℓ) steps, so each block compresses context at a different time scale into its parameters. The standard Transformer feedforward block is recovered as the special case k = 1.
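A structural sketch of this idea, with shapes, update rule, and frequencies chosen only for illustration, could look like the following: every level applies its MLP on each step, but a level commits a parameter update only once per its own period.

```python
import numpy as np

class CMSBlock:
    """One level of a continuum memory system: an MLP whose parameters
    are committed only every `update_period` steps (the C^(l) of the text)."""
    def __init__(self, dim, update_period, lr=0.01):
        self.W = 0.01 * np.random.randn(dim, dim)
        self.update_period = update_period
        self.lr = lr
        self._accum = np.zeros_like(self.W)   # gradient accumulated within a chunk

    def forward(self, x):
        return np.tanh(self.W @ x)

    def maybe_update(self, grad_W, step):
        self._accum += grad_W
        if step % self.update_period == 0:    # update only every C^(l) steps
            self.W -= self.lr * self._accum
            self._accum = np.zeros_like(self.W)

# Chain MLP(f1) -> MLP(f2) -> MLP(f3) with increasingly slow update rates.
# A single block that never updates after training recovers the standard
# Transformer feedforward layer (the k = 1 case).
blocks = [CMSBlock(dim=16, update_period=c) for c in (1, 4, 16)]

def cms_forward(x):
    for block in blocks:
        x = block.forward(x)
    return x

y = cms_forward(np.random.randn(16))
```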
This structure turns long-term memory into a spectrum of levels across frequencies, rather than a single static feedforward layer. The research team directly links this to multi-timescale synaptic and systems consolidation processes in the brain, where different parts of the system learn at different rates while sharing a common architecture.
HOPE, a self-modifying architecture built on Titans
To demonstrate that nested learning is practical, the research team designed HOPE, a self-referential sequence model that applies the paradigm to a recurrent architecture. HOPE is a variant of Titans, a long-term memory architecture in which a neural memory module learns to memorize surprising events at test time and helps attention focus on tokens from far in the past.
Titans has only two levels of parameter update, which yields first-order in-context learning. HOPE extends Titans in two ways. First, it is self-modifying: it can optimize its own memory through a self-referential process and can in principle support unbounded levels of in-context learning. Second, it integrates continuum memory system blocks, so memory updates occur at multiple frequencies and scale to longer context windows.
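At a high level, one step of a HOPE-style block can be pictured as below. This is a simplified illustration based only on the description above, with placeholder names and shapes; it shows the interplay between a memory written at test time and CMS levels that would update at different frequencies, not the actual architecture.

```python
import numpy as np

class NeuralMemory:
    """Titans-style long-term memory, written at test time from a surprise signal."""
    def __init__(self, dim, lr=0.05):
        self.M = np.zeros((dim, dim))
        self.lr = lr

    def read(self, q):
        return self.M @ q

    def write(self, k, v):
        surprise = self.M @ k - v              # how wrong the current memory is
        self.M -= self.lr * np.outer(surprise, k)

def hope_like_step(x, memory, cms_weights):
    h = x + memory.read(x)                     # retrieve long-range context
    memory.write(x, x)                         # in-context, test-time memory update
    for W in cms_weights:
        h = np.tanh(W @ h)                     # CMS level; in a full model each W would
                                               # itself be updated at its own frequency
    return h

dim = 16
memory = NeuralMemory(dim)
cms_weights = [0.01 * np.random.randn(dim, dim) for _ in range(3)]
out = hope_like_step(np.random.randn(dim), memory, cms_weights)
```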


Understanding the results
The research team evaluated HOPE and baselines on language modeling and common-sense reasoning tasks at three parameter scales: 340M, 760M, and 1.3B parameters. Benchmarks include WikiText and LAMBADA perplexity for language modeling, plus PIQA, HellaSwag, WinoGrande, ARC-Easy, ARC-Challenge, Social IQa, and BoolQ accuracy for reasoning. Table 1 in the paper reports results for HOPE against Transformer++, RetNet, Gated DeltaNet, TTT, Samba, and Titans.


Main points
- Nested learning treats a model as multiple nested optimization problems with different update frequencies, directly targeting catastrophic forgetting in continual learning.
- The framework reinterprets backpropagation, attention, and optimizers as associative memory modules that compress their own context streams, providing a unified view of architecture and optimization.
- Deep optimizers in nested learning replace simple dot product similarity with a richer objective (e.g., L2 regression) and use neural memory, resulting in more expressive and context-aware update rules.
- Continuum memory systems model memory as a chain of MLP blocks that are updated at different rates, creating short-, medium-, and long-range memory rather than a single static feedforward layer.
- The HOPE architecture is a self-modifying variant of Titans built on nested learning principles, and it shows improved language modeling, long-context reasoning, and continual learning performance compared with strong Transformer and recurrent baselines.
Nested learning reframes a deep network as a set of nested neural learning modules, integrating architecture and optimization into a single system. The introduction of deep momentum gradient descent, the continuum memory system, and the HOPE architecture provides concrete routes to richer associative memory and better continual learning. Overall, this work moves continual learning from an afterthought to a primary design axis.
Check out the paper for full technical details.