Building a Context-Folding LLM Agent with Memory Compression and Tool Use for Long-Horizon Reasoning

In this tutorial, we explore how to build a context-folding LLM agent that efficiently solves long, complex tasks by intelligently managing limited context. We design the agent to break a large task into smaller subtasks, perform reasoning or computation as needed, and then fold each completed subtrajectory into a concise summary. In this way, it retains the essential knowledge while keeping its active memory small.

import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple
try:
   import transformers
except ImportError:
   # Install the required libraries on first use (e.g., in a fresh Colab session).
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")
def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
   out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature>0.0, temperature=temperature)[0]["generated_text"]
   return out.strip()

We first set up the environment and load a lightweight Hugging Face model (FLAN-T5-small by default). We use this model to generate and process text locally, so the agent runs smoothly on Google Colab without any API dependencies.
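
As a quick sanity check (a small optional snippet, not part of the original walkthrough), we can call llm_gen once to confirm the local model responds before building the agent:

# Optional smoke test: verify that the local pipeline generates text.
demo_out = llm_gen("In one sentence, explain why summarizing old context helps an agent.", max_new_tokens=48)
print("Smoke test:", demo_out)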

import ast, operator as op
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}
def _eval_node(n):
   if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)): return n.value
   if isinstance(n, ast.UnaryOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.operand))
   if isinstance(n, ast.BinOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
   raise ValueError("Unsafe expression")
def calc(expr: str):
   node = ast.parse(expr, mode="eval").body
   return _eval_node(node)
class FoldingMemory:
   def __init__(self, max_chars:int=800):
       self.active=[]; self.folds=[]; self.max_chars=max_chars
   def add(self,text:str):
       self.active.append(text.strip())
       while len(self.active_text())>self.max_chars and len(self.active)>1:
           popped=self.active.pop(0)
           fold=f"- Folded: {popped[:120]}..."
           self.folds.append(fold)
   def fold_in(self,summary:str): self.folds.append(summary.strip())
   def active_text(self)->str: return "\n".join(self.active)
   def folded_text(self)->str: return "\n".join(self.folds)
   def snapshot(self)->Dict: return {"active_chars":len(self.active_text()),"n_folds":len(self.folds)}

We define a simple calculator tool for safe arithmetic and create a memory system that dynamically folds past context into concise summaries. This keeps the active memory manageable while retaining the important information.
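
To illustrate how these pieces behave (hypothetical values, not part of the agent itself), we can evaluate an expression with calc and watch FoldingMemory start folding once the active buffer exceeds max_chars:

# Illustrative usage of the calculator tool and the folding memory.
print(calc("799.99 + 149.5 + 23.75"))         # 973.24, up to float rounding
mem = FoldingMemory(max_chars=60)
mem.add("TASK: compute the project budget with tax and buffer")
mem.add("SUBTASK: sum the three line items")  # pushes the buffer past 60 chars, so the oldest entry folds
print(mem.snapshot())                         # e.g. {'active_chars': 33, 'n_folds': 1}
print(mem.folded_text())                      # e.g. '- Folded: TASK: compute the project budget ...'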

SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
SUBTASK_SOLVER_PROMPT="""You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <your final answer>'.
Think briefly; avoid chit-chat.


Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}


Now respond with either CALC(...) or ANSWER: ..."""
# Prompt wording below is reconstructed to match how these templates are used later in the script.
SUBTASK_SUMMARY_PROMPT="""Summarize the subtask outcome in 1-3 short bullets, each starting with '- '.
Subtask: {name}
Trace:
{trace}
Final: {final}"""
FINAL_SYNTH_PROMPT="""Using the folded summaries below, produce the final deliverable for the task.
Be concise, structured, and complete.

Task: {task}
Folded summaries:
{folds}

Final answer:"""
def parse_bullets(text: str) -> List[str]:
   return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]

We design prompt templates that guide the agent to break down tasks, solve subtasks, and summarize results. These structured prompts keep the exchange between reasoning steps and model responses clear and easy to parse.
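
For instance, given a hypothetical planner response, parse_bullets extracts the subtasks as a plain Python list:

# Hypothetical planner output, used only to show the expected bullet format.
sample_plan = "- Outline the 3-day schedule\n- Assign daily study blocks\n- Add workouts and simple meals"
print(parse_bullets(sample_plan))
# ['Outline the 3-day schedule', 'Assign daily study blocks', 'Add workouts and simple meals']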

def run_subtask(task:str, subtask:str, memory:FoldingMemory, max_tool_iters:int=3)->Tuple[str,str,List[str]]:
   notes=(memory.folded_text() or "(none)")
   trace=[]; final=""
   for _ in range(max_tool_iters):
       prompt=SUBTASK_SOLVER_PROMPT.format(task=task,subtask=subtask,notes=notes)
       out=llm_gen(prompt,max_new_tokens=96); trace.append(out)
       m=re.search(r"CALC\((.+?)\)",out)
       if m:
           try:
               val=calc(m.group(1))
               trace.append(f"TOOL:CALC -> {val}")
               out2=llm_gen(prompt+f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.",max_new_tokens=64)
               trace.append(out2)
               if out2.strip().startswith("ANSWER:"):
                   final=out2.split("ANSWER:",1)[1].strip(); break
           except Exception as e:
               trace.append(f"TOOL:CALC ERROR -> {e}")
       if out.strip().startswith("ANSWER:"):
           final=out.split("ANSWER:",1)[1].strip(); break
   if not final:
       final="No definitive answer; partial reasoning:n"+"n".join(trace[-2:])
   summ=llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask,trace="\n".join(trace),final=final),max_new_tokens=80)
   summary_bullets="\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
   return final, summary_bullets, trace
class ContextFoldingAgent:
   def __init__(self,max_active_chars:int=800):
       self.memory=FoldingMemory(max_chars=max_active_chars)
       self.metrics={"subtasks":0,"tool_calls":0,"chars_saved_est":0}
   def decompose(self,task:str)->List[str]:
       plan=llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task),max_new_tokens=96)
       subs=parse_bullets(plan)
       return subs[:4] if subs else ["Main solution"]
   def run(self,task:str)->Dict:
       t0=time.time()
       self.memory.add(f"TASK: {task}")
       subtasks=self.decompose(task)
       self.metrics["subtasks"]=len(subtasks)
       folded=[]
       for st in subtasks:
           self.memory.add(f"SUBTASK: {st}")
           final,fold_summary,trace=run_subtask(task,st,self.memory)
           self.memory.fold_in(fold_summary)
           folded.append(f"- {st}: {final}")
           self.memory.add(f"SUBTASK_DONE: {st}")
       final=llm_gen(FINAL_SYNTH_PROMPT.format(task=task,folds=self.memory.folded_text()),max_new_tokens=200)
       t1=time.time()
       return {"task":task,"final":final.strip(),"folded_summaries":self.memory.folded_text(),
               "active_context_chars":len(self.memory.active_text()),
               "subtask_finals":folded,"runtime_sec":round(t1-t0,2)}

We implement the core logic of the agent: each subtask is executed, summarized, and folded back into memory. This step demonstrates how context folding lets the agent reason iteratively without losing track of earlier results.
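
To observe a single fold in isolation (an illustrative run with made-up task text, reusing the components defined above), we can call run_subtask directly and inspect the summary that would be folded back into memory:

# Illustrative single-subtask run; the returned summary is what gets folded into memory.
memory = FoldingMemory(max_chars=400)
final, fold_summary, trace = run_subtask(
    task="Compute 8% tax on a 973.24 subtotal",
    subtask="Calculate the tax amount",
    memory=memory,
)
memory.fold_in(fold_summary)
print("Final:", final)
print("Folded so far:\n" + memory.folded_text())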

DEMO_TASKS=[
   "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
   "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]
def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)
if __name__=="__main__":
   agent=ContextFoldingAgent(max_active_chars=700)
   for i,task in enumerate(DEMO_TASKS,1):
       print("="*70)
       print(f"DEMO #{i}: {task}")
       res=agent.run(task)
       print("n--- Folded Summaries ---n"+(res["folded_summaries"] or "(none)"))
       print("n--- Final Answer ---n"+res["final"])
       print("n--- Diagnostics ---")
       diag={k:res[k] for k in ["active_context_chars","runtime_sec"]}
       diag["n_subtasks"]=len(agent.decompose(task))
       print(pretty(diag))

We run the agent on the sample tasks to observe how it plans, executes, and synthesizes the final results. Through these examples, we see the complete context-folding process produce concise and coherent output.
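
If we want to keep these runs for later comparison (an optional addition; the results.json filename is arbitrary), each dictionary returned by agent.run is already JSON-serializable:

# Optional: persist demo results for later inspection.
def save_results(results: List[Dict], path: str = "results.json") -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

# Example: save_results([agent.run(t) for t in DEMO_TASKS])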

In summary, we demonstrate how context folding enables long-horizon reasoning while avoiding memory overload. We see how each subtask is planned, executed, summarized, and folded into compact knowledge, mimicking how intelligent agents handle complex workflows over time. By combining decomposition, tool use, and context compression, we create a lightweight yet powerful agent system that scales inference efficiently.

