A coding guide for building brain-inspired hierarchical reasoning with embrace facial models

by admin · August 30, 2025

In this tutorial, we set out to reproduce the spirit of a hierarchical reasoning model (HRM) using a locally run free embrace facial model. We introduce the design of lightweight but structured reasoning agents, where we are both architects and experimenters. By breaking down the problems into sub-objectives, solving them in Python, criticizing the results and synthesizing the final answers, we can experience how hierarchical planning and execution improves inference performance. This process allows us to see in real-time how to implement brain-inspired workflows without the need for a large number of model sizes or expensive APIs. Check Paper and complete code.

!pip -q install -U transformers accelerate bitsandbytes rich


import os, re, json, textwrap, traceback
from typing import Dict, Any, List
from rich import print as rprint
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline


MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"
DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

We first need to install the required libraries and then load the QWEN2.5-1.5B-Inscrut model from the hug surface. We set the data type based on the availability of the GPU to ensure efficient model execution in COLAB.

tok = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
   MODEL_NAME,
   device_map="auto",
   torch_dtype=DTYPE,
   load_in_4bit=True
)
gen = pipeline(
   "text-generation",
   model=model,
   tokenizer=tok,
   return_full_text=False
)

We load the token and model, configure it to 4 bits for efficiency, and wrap everything in a text generation pipeline so we can easily interact with the models in Colab. Check Paper and complete code.

def chat(prompt: str, system: str = "", max_new_tokens: int = 512, temperature: float = 0.3) -> str:
   msgs = []
   if system:
       msgs.append({"role":"system","content":system})
   msgs.append({"role":"user","content":prompt})
   inputs = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
   out = gen(inputs, max_new_tokens=max_new_tokens, do_sample=(temperature>0), temperature=temperature, top_p=0.9)
   return out[0]["generated_text"].strip()


def extract_json(txt: str) -> Dict[str, Any]:
   m = re.search(r"{[sS]*}$", txt.strip())
   if not m:
       m = re.search(r"{[sS]*?}", txt)
   try:
       return json.loads(m.group(0)) if m else {}
   except Exception:
       # fallback: strip code fences
       s = re.sub(r"^```.*?n|n```$", "", txt, flags=re.S)
       try:
           return json.loads(s)
       except Exception:
           return {}

We define the assistant function: the chat function allows us to send prompts to the model through optional system instructions and sampling controls, while Extract_json reliably parses structured JSON output from the model, even if the response includes a code fence or other text. Check Paper and complete code.

def extract_code(txt: str) -> str:
   m = re.search(r"```(?:python)?s*([sS]*?)```", txt, flags=re.I)
   return (m.group(1) if m else txt).strip()


def run_python(code: str, env: Dict[str, Any] | None = None) -> Dict[str, Any]:
   import io, contextlib
   g = {"__name__": "__main__"}; l = {}
   if env: g.update(env)
   buf = io.StringIO()
   try:
       with contextlib.redirect_stdout(buf):
           exec(code, g, l)
       out = l.get("RESULT", g.get("RESULT"))
       return {"ok": True, "result": out, "stdout": buf.getvalue()}
   except Exception as e:
       return {"ok": False, "error": str(e), "trace": traceback.format_exc(), "stdout": buf.getvalue()}


PLANNER_SYS = """You are the HRM Planner.
Decompose the TASK into 2–4 atomic, code-solvable subgoals.
Return compact JSON only: {"subgoals":[...], "final_format":""}."""


SOLVER_SYS = """You are the HRM Solver.
Given SUBGOAL and CONTEXT vars, output a single Python snippet.
Rules:
- Compute deterministically.
- Set a variable RESULT to the answer.
- Keep code short; stdlib only.
Return only a Python code block."""


CRITIC_SYS = """You are the HRM Critic.
Given TASK and LOGS (subgoal results), decide if final answer is ready.
Return JSON only: {"action":"submit"|"revise","critique":"...", "fix_hint":""}."""


SYNTH_SYS = """You are the HRM Synthesizer.
Given TASK, LOGS, and final_format, output only the final answer (no steps).
Follow final_format exactly."""

We have added two important parts: utility features and system prompts. The Extract_Code function pulls Python fragments from the output of the model, and Run_python safely executes these fragments and captures their results. In addition to this, we define four role prompts, planners, solvers, critics, and synthesizers, which guide the model to break down tasks into sub-objections, solve them with code, verify correctness and ultimately produce a clean answer. Check Paper and complete code.

def plan(task: str) -> Dict[str, Any]:
   p = f"TASK:n{task}nReturn JSON only."
   return extract_json(chat(p, PLANNER_SYS, temperature=0.2, max_new_tokens=300))


def solve_subgoal(subgoal: str, context: Dict[str, Any]) -> Dict[str, Any]:
   prompt = f"SUBGOAL:n{subgoal}nCONTEXT vars: {list(context.keys())}nReturn Python code only."
   code = extract_code(chat(prompt, SOLVER_SYS, temperature=0.2, max_new_tokens=400))
   res = run_python(code, env=context)
   return {"subgoal": subgoal, "code": code, "run": res}


def critic(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   pl = [{"subgoal": L["subgoal"], "result": L["run"].get("result"), "ok": L["run"]["ok"]} for L in logs]
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(pl, ensure_ascii=False, indent=2)+"nReturn JSON only.",
              CRITIC_SYS, temperature=0.1, max_new_tokens=250)
   return extract_json(out)


def refine(task: str, logs: List[Dict[str, Any]]) -> Dict[str, Any]:
   sys = "Refine subgoals minimally to fix issues. Return same JSON schema as planner."
   out = chat("TASK:n"+task+"nLOGS:n"+json.dumps(logs, ensure_ascii=False)+"nReturn JSON only.",
              sys, temperature=0.2, max_new_tokens=250)
   j = extract_json(out)
   return j if j.get("subgoals") else {}


def synthesize(task: str, logs: List[Dict[str, Any]], final_format: str) -> str:
   packed = [{"subgoal": L["subgoal"], "result": L["run"].get("result")} for L in logs]
   return chat("TASK:n"+task+"nLOGS:n"+json.dumps(packed, ensure_ascii=False)+
               f"nfinal_format: {final_format}nOnly the final answer.",
               SYNTH_SYS, temperature=0.0, max_new_tokens=120).strip()


def hrm_agent(task: str, context: Dict[str, Any] | None = None, budget: int = 2) -> Dict[str, Any]:
   ctx = dict(context or {})
   trace, plan_json = [], plan(task)
   for round_id in range(1, budget+1):
       logs = [solve_subgoal(sg, ctx) for sg in plan_json.get("subgoals", [])]
       for L in logs:
           ctx_key = f"g{len(trace)}_{abs(hash(L['subgoal']))%9999}"
           ctx[ctx_key] = L["run"].get("result")
       verdict = critic(task, logs)
       trace.append({"round": round_id, "plan": plan_json, "logs": logs, "verdict": verdict})
       if verdict.get("action") == "submit": break
       plan_json = refine(task, logs) or plan_json
   final = synthesize(task, trace[-1]["logs"], plan_json.get("final_format", "Answer: "))
   return {"final": final, "trace": trace}

We implement a complete HR management loop: we plan sub-objectives, solve each loop by generating and running python (capture results), then we then make a critical, optionally refine the plan, and synthesize a clean final answer. We curated these rounds in HRM_AGENT, with the intermediate results as context, so we said in the commentator that “submit” once iterative improvements and stops. Check Paper and complete code.

ARC_TASK = textwrap.dedent("""
Infer the transformation rule from train examples and apply to test.
Return exactly: "Answer: ", where  is a Python list of lists of ints.
""").strip()
ARC_DATA = {
   "train": [
       {"inp": [[0,0],[1,0]], "out": [[1,1],[0,1]]},
       {"inp": [[0,1],[0,0]], "out": [[1,0],[1,1]]}
   ],
   "test": [[0,0],[0,1]]
}
res1 = hrm_agent(ARC_TASK, context={"TRAIN": ARC_DATA["train"], "TEST": ARC_DATA["test"]}, budget=2)
rprint("n[bold]Demo 1 — ARC-like Toy[/bold]")
rprint(res1["final"])


WM_TASK = "A tank holds 1200 L. It leaks 2% per hour for 3 hours, then is refilled by 150 L. Return exactly: 'Answer: '."
res2 = hrm_agent(WM_TASK, context={}, budget=2)
rprint("n[bold]Demo 2 — Word Math[/bold]")
rprint(res2["final"])


rprint("n[dim]Rounds executed (Demo 1):[/dim]", len(res1["trace"]))

We run two demonstrations to validate the agent: an arc-form task in which we infer the transformation from train pairs and apply it to the test grid, as well as the problem of word methodology for examining numerical inference. We call HRM_AGENT in each task, print the final answer, and show the number of inferences required for the ARC to run.

In short, we realize that what we are building is more than just a simple demonstration. This is a window explaining how to make a smaller model exceed its weight. Through layered planning, resolution and criticism, we enable a free embrace facial model to perform tasks with surprisingly robustness. We have a deeper understanding of how brain-inspired structures paired with practical open source tools that allow us to explore reasoning benchmarks and creative experiments without incurring high costs. This hands-on trip shows us that anyone willing to tinker, iterate, and learn has access to advanced cognitive workflows.

Check Paper and complete code. Check out ours anytime Tutorials, codes and notebooks for github pages. Also, please stay tuned for us twitter And don’t forget to join us 100K+ ml reddit And subscribe Our newsletter.

Asif Razzaq is CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, ASIF is committed to harnessing the potential of artificial intelligence to achieve social benefits. His recent effort is to launch Marktechpost, an artificial intelligence media platform that has an in-depth coverage of machine learning and deep learning news that can sound both technically, both through technical voices and be understood by a wide audience. The platform has over 2 million views per month, demonstrating its popularity among its audience.