
How to Build an Advanced AI Agent with Summarized Short-Term and Vector-Based Long-Term Memory

In this tutorial, we build an advanced AI agent that not only chats but also remembers. Starting from scratch, we show how to combine a lightweight LLM, FAISS vector search, and a summarization mechanism to give the agent both short-term and long-term memory. By embedding important facts and automatically compressing older conversation turns, we create an agent that adapts to our instructions, recalls key details in later exchanges, and keeps every interaction smooth and efficient. Check out the full code here.

!pip -q install transformers accelerate bitsandbytes sentence-transformers faiss-cpu


import os, json, time, uuid, math, re
from datetime import datetime
import torch, faiss
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

We first install the required libraries and import all the modules the agent needs. We also detect whether a GPU is available and set the device accordingly, so the model runs efficiently on whatever hardware we have. Check out the full code here.

def load_llm(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
   try:
       if DEVICE=="cuda":
           bnb=BitsAndBytesConfig(load_in_4bit=True,bnb_4bit_compute_dtype=torch.bfloat16,bnb_4bit_quant_type="nf4")
           tok=AutoTokenizer.from_pretrained(model_name, use_fast=True)
           mdl=AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
       else:
           tok=AutoTokenizer.from_pretrained(model_name, use_fast=True)
           mdl=AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, low_cpu_mem_usage=True)
       return pipeline("text-generation", model=mdl, tokenizer=tok, device=0 if DEVICE=="cuda" else -1, do_sample=True)
   except Exception as e:
       raise RuntimeError(f"Failed to load LLM: {e}")

We define a function to load our language model. If a GPU is available, we load the model with 4-bit quantization for efficiency; otherwise, we fall back to a CPU-friendly configuration with low memory usage. Either way, the function returns a text-generation pipeline we can call from the rest of the code. Check out the full code here.
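
As an optional sanity check, we can call the returned pipeline directly with a single prompt before building anything on top of it; the prompt text below is only an illustration.

llm = load_llm()
out = llm("In one sentence, why would an AI agent need long-term memory?",
          max_new_tokens=60, temperature=0.7, return_full_text=False)
print(out[0]["generated_text"].strip())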

class VectorMemory:
   def __init__(self, path="/content/agent_memory.json", dim=384):
       self.path=path; self.dim=dim; self.items=[]
       self.embedder=SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)
       self.index=faiss.IndexFlatIP(dim)
       if os.path.exists(path):
           data=json.load(open(path))
           self.items=data.get("items",[])
           if self.items:
               X=torch.tensor([x["emb"] for x in self.items], dtype=torch.float32).numpy()
               self.index.add(X)
   def _emb(self, text):
       v=self.embedder.encode([text], normalize_embeddings=True)[0]
       return v.tolist()
   def add(self, text, meta=None):
       e=self._emb(text); self.index.add(torch.tensor([e]).numpy())
       rec={"id":str(uuid.uuid4()),"text":text,"meta":meta or {}, "emb":e}
       self.items.append(rec); self._save(); return rec["id"]
   def search(self, query, k=5, thresh=0.25):
       if len(self.items)==0: return []
       q=self.embedder.encode([query], normalize_embeddings=True)
       D,I=self.index.search(q, min(k, len(self.items)))
       out=[]
       for d,i in zip(D[0],I[0]):
           if i==-1: continue
           if d>=thresh: out.append((d,self.items[i]))
       return out
   def _save(self):
       slim=[{k:v for k,v in it.items()} for it in self.items]
       json.dump({"items":slim}, open(self.path,"w"), indent=2)

We create a VectorMemory class that gives our agent long-term memory. We store past facts as MiniLM embeddings and index them with FAISS so we can search for and recall relevant information later. Each memory is also saved to disk, so the agent retains what it has learned across sessions. Check out the full code here.
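
Before wiring the memory into an agent, we can exercise the class on its own with a couple of toy facts; the file path and the example strings below are placeholders for this quick check.

mem = VectorMemory(path="/content/demo_memory.json")   # throwaway path so we don't touch the agent's store
mem.add("User prefers concise answers.", meta={"kind": "preference"})
mem.add("User is preparing for an exam in 2027.", meta={"kind": "fact"})
for score, item in mem.search("How should I phrase my replies?", k=2):
    print(f"{score:.2f}  {item['text']}")   # hits below the 0.25 threshold are dropped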

def now_iso(): return datetime.now().isoformat(timespec="seconds")
def clamp(txt, n=1600): return txt if len(txt)<=n else txt[:n]
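
With these helpers in place, we define the MemoryAgent itself. The sketch below covers the summarization prompt and the first half of the class: the constructor, the generation helper, the chat-prompt builder, and the fact-distillation step that decides what gets written into long-term memory. Treat the prompt wording and the keyword heuristic in _distill_and_store as illustrative assumptions; only the method names and signatures are dictated by how the rest of the class calls them.

def SUMMARIZE_PROMPT(convo):
    # Assumed prompt wording: compress the dialogue into a short rolling summary.
    return ("Summarize the key facts, user preferences, and open tasks from this conversation "
            "in under 120 words.\n\n" + convo + "\n\nSummary:")

class MemoryAgent:
    def __init__(self, max_turns=10):
        self.llm = load_llm()               # text-generation pipeline defined above
        self.mem = VectorMemory()           # long-term vector memory defined above
        self.turns = []                     # short-term memory: list of (role, text) pairs
        self.summary = ""                   # rolling summary of older turns
        self.max_turns = max_turns          # summarize once the buffer grows past this
    def _gen(self, prompt, max_new_tokens=256, temp=0.7):
        out = self.llm(prompt, max_new_tokens=max_new_tokens, temperature=temp,
                       return_full_text=False)[0]["generated_text"]
        return out.strip()
    def _chat_prompt(self, user, mem_ctx):
        # Recent turns (excluding the message we are about to answer) provide short-term context.
        recent = "\n".join(f"{r}: {t}" for r, t in self.turns[:-1][-8:])
        return (f"You are a helpful, concise assistant with memory.\n"
                f"Conversation summary so far: {self.summary or '(none)'}\n"
                f"Relevant long-term memories:\n{mem_ctx or '- none'}\n"
                f"{recent}\nuser: {user}\nassistant:")
    def _distill_and_store(self, user):
        # Assumed heuristic: persist messages that look like durable facts or preferences.
        if re.search(r"\b(my name is|call me|i prefer|i like|i work|remember|reminder|exam)\b", user, re.I):
            memline = clamp(user, 300)
            self.mem.add(memline, meta={"ts": now_iso(), "kind": "user_fact"})
            return True, memline
        return False, ""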
           convo="n".join([f"{r}: {t}" for r,t in self.turns])
           s=self._gen(SUMMARIZE_PROMPT(clamp(convo, 3500)), max_new_tokens=180, temp=0.2)
           self.summary=s; self.turns=self.turns[-4:]
   def recall(self, query, k=5):
       hits=self.mem.search(query, k=k)
       return "n".join([f"- ({d:.2f}) {h['text']} [meta={h['meta']}]" for d,h in hits])
   def ask(self, user):
       self.turns.append(("user", user))
       saved, memline = self._distill_and_store(user)
       mem_ctx=self.recall(user, k=6)
       prompt=self._chat_prompt(user, mem_ctx)
       reply=self._gen(prompt)
       self.turns.append(("assistant", reply))
       self._maybe_summarize()
       status=f"💾 memory_saved: {saved}; " + (f"note: {memline}" if saved else "note: -")
       print(f"nUSER: {user}nASSISTANT: {reply}n{status}")
       return reply

We now bring everything together in the MemoryAgent. The agent generates responses with the retrieved context, distills important facts into long-term memory, and periodically summarizes the conversation to keep the short-term context compact. With this setup, we get an assistant that remembers, recalls, and adapts to how we interact with it. Check out the full code here.

agent=MemoryAgent()


print("✅ Agent ready. Try these:n")
agent.ask("Hi! My name is Nicolaus, I prefer being called Nik. I'm preparing for UPSC in 2027.")
agent.ask("Also, I work at  Visa in analytics and love concise answers.")
agent.ask("What's my exam year and how should you address me next time?")
agent.ask("Reminder: I like agentic RAG tutorials with single-file Colab code.")
agent.ask("Given my prefs, suggest a study focus for this week in one paragraph.")

We instantiate the agent and immediately exercise it with a few messages to seed long-term memory and verify recall. We confirm that it remembers our preferred name and exam year, adapts to our liking for concise answers, and uses past preferences (agentic RAG, single-file Colab code) to tailor its study suggestion.
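
Since every distilled fact is also written to the JSON file backing VectorMemory (/content/agent_memory.json by default), we can optionally open that file to see exactly what was persisted, assuming at least one message triggered the fact-distillation step.

with open("/content/agent_memory.json") as f:
    saved = json.load(f)
for item in saved["items"]:
    print(item["meta"], "->", item["text"][:80])   # embeddings are stored too, but we only print text and metadata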

All in all, we see how much more capable our AI agent becomes once we give it the ability to remember. We now have an agent that stores key details, recalls them when they are relevant, and summarizes the conversation to stay efficient. This approach keeps our interactions contextual and evolving, making the agent more personal and smarter with every exchange. With this foundation in place, we are ready to expand the memory further, explore richer retrieval patterns, and experiment with more advanced memory architectures.




Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent effort is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and accessible to a wide audience. The platform receives over 2 million views per month, reflecting its popularity among readers.
