Coding Guide for End-to-End Robot Learning, Using Lerobot: Training, Evaluation and Visualization of Behavioral Cloning Policy on Pusht

by admin · September 20, 2025

In this tutorial, we use the hugging face to move step by step Lerobot Library training and evaluation of behavioral cloning policies on Pusht datasets. We first set up the environment in Google Colab, install the required dependencies, and load the dataset through Lerobot’s unified API. We then designed a compact visual motion strategy that combines the convolutional main chain with small MLP header groups, allowing us to map image and state observations directly into robotic actions. By training on a subset of datasets to increase speed, we can quickly demonstrate how Lerobot enables a reproducible dataset-driven robot learning pipeline. Check The complete code is here.

!pip -q install --upgrade lerobot torch torchvision timm imageio[ffmpeg]


import os, math, random, io, sys, json, pathlib, time
import torch, torch.nn as nn, torch.nn.functional as F
from torch.utils.data import DataLoader, Subset
from torchvision.utils import make_grid, save_image
import numpy as np
import imageio.v2 as imageio


try:
   from lerobot.common.datasets.lerobot_dataset import LeRobotDataset
except Exception:
   from lerobot.datasets.lerobot_dataset import LeRobotDataset


DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
SEED = 42
random.seed(SEED); np.random.seed(SEED); torch.manual_seed(SEED)

We first install the required libraries and set up our training environment. We import all required modules, configure the dataset loader and fix random seeds to ensure repeatability. We also detected whether we were running on the GPU or CPU, allowing our experiments to run efficiently. Check The complete code is here.

REPO_ID = "lerobot/pusht" 
ds = LeRobotDataset(REPO_ID) 
print("Dataset length:", len(ds))


s0 = ds[0]
keys = list(s0.keys())
print("Sample keys:", keys)


def key_with(prefixes):
   for k in keys:
       for p in prefixes:
           if k.startswith(p): return k
   return None


K_IMG = key_with(["observation.image", "observation.images", "observation.rgb"])
K_STATE = key_with(["observation.state"])
K_ACT = "action"
assert K_ACT in s0, f"No 'action' key found in sample. Found: {keys}"
print("Using keys -> IMG:", K_IMG, "STATE:", K_STATE, "ACT:", K_ACT)

We load the Pusht dataset and check its structure. We check available keys to determine which keys correspond to images, states, and operations, and plot them as consistent access throughout the training pipeline. Check The complete code is here.

class PushTWrapper(torch.utils.data.Dataset):
   def __init__(self, base):
       self.base = base
   def __len__(self): return len(self.base)
   def __getitem__(self, i):
       x = self.base[i]
       img = x[K_IMG]
       if img.ndim == 4: img = img[-1]
       img = img.float() / 255.0 if img.dtype==torch.uint8 else img.float()
       state = x.get(K_STATE, torch.zeros(2))
       state = state.float().reshape(-1)
       act = x[K_ACT].float().reshape(-1)
       if img.shape[-2:] != (96,96):
           img = F.interpolate(img.unsqueeze(0), size=(96,96), mode="bilinear", align_corners=False)[0]
       return {"image": img, "state": state, "action": act}


wrapped = PushTWrapper(ds)
N = len(wrapped)
idx = list(range(N))
random.shuffle(idx)
n_train = int(0.9*N)
train_idx, val_idx = idx[:n_train], idx[n_train:]


train_ds = Subset(wrapped, train_idx[:12000])
val_ds   = Subset(wrapped, val_idx[:2000])


BATCH = 128
train_loader = DataLoader(train_ds, batch_size=BATCH, shuffle=True, num_workers=2, pin_memory=True)
val_loader   = DataLoader(val_ds,   batch_size=BATCH, shuffle=False, num_workers=2, pin_memory=True)

We package each sample so that we always get a standardized 96×96 image, a flat state and action, and if a time stack is present, the last frame is selected. Then we shuffled the cards, split them into trains/vals, and did a block size quick run. Finally, we create valid data loaders through batch processing, shuffling and fixed memory to keep training smooth. Check The complete code is here.

class SmallBackbone(nn.Module):
   def __init__(self, out=256):
       super().__init__()
       self.conv = nn.Sequential(
           nn.Conv2d(3, 32, 5, 2, 2), nn.ReLU(inplace=True),
           nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(inplace=True),
           nn.Conv2d(64,128, 3, 2, 1), nn.ReLU(inplace=True),
           nn.Conv2d(128,128,3, 1, 1), nn.ReLU(inplace=True),
       )
       self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, out), nn.ReLU(inplace=True))
   def forward(self, x): return self.head(self.conv(x))


class BCPolicy(nn.Module):
   def __init__(self, img_dim=256, state_dim=2, hidden=256, act_dim=2):
       super().__init__()
       self.backbone = SmallBackbone(img_dim)
       self.mlp = nn.Sequential(
           nn.Linear(img_dim + state_dim, hidden), nn.ReLU(inplace=True),
           nn.Linear(hidden, hidden//2), nn.ReLU(inplace=True),
           nn.Linear(hidden//2, act_dim)
       )
   def forward(self, img, state):
       z = self.backbone(img)
       if state.ndim==1: state = state.unsqueeze(0)
       z = torch.cat([z, state], dim=-1)
       return self.mlp(z)


policy = BCPolicy().to(DEVICE)
opt = torch.optim.AdamW(policy.parameters(), lr=3e-4, weight_decay=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=(DEVICE=="cuda"))


@torch.no_grad()
def evaluate():
   policy.eval()
   mse, n = 0.0, 0
   for batch in val_loader:
       img = batch["image"].to(DEVICE, non_blocking=True)
       st  = batch["state"].to(DEVICE, non_blocking=True)
       act = batch["action"].to(DEVICE, non_blocking=True)
       pred = policy(img, st)
       mse += F.mse_loss(pred, act, reduction="sum").item()
       n += act.numel()
   return mse / n


def cosine_lr(step, total, base=3e-4, min_lr=3e-5):
   if step>=total: return min_lr
   cos = 0.5*(1+math.cos(math.pi*step/total))
   return min_lr + (base-min_lr)*cos


EPOCHS = 4 
steps_total = EPOCHS*len(train_loader)
step = 0
best = float("inf")
ckpt = "/content/lerobot_pusht_bc.pt"


for epoch in range(EPOCHS):
   policy.train()
   for batch in train_loader:
       lr = cosine_lr(step, steps_total); step += 1
       for g in opt.param_groups: g["lr"] = lr


       img = batch["image"].to(DEVICE, non_blocking=True)
       st  = batch["state"].to(DEVICE, non_blocking=True)
       act = batch["action"].to(DEVICE, non_blocking=True)


       opt.zero_grad(set_to_none=True)
       with torch.cuda.amp.autocast(enabled=(DEVICE=="cuda")):
           pred = policy(img, st)
           loss = F.smooth_l1_loss(pred, act)
       scaler.scale(loss).backward()
       nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
       scaler.step(opt); scaler.update()


   val_mse = evaluate()
   print(f"Epoch {epoch+1}/{EPOCHS} | Val MSE: {val_mse:.6f}")
   if val_mse

We define a compact visual motion strategy: CNN skeleton extracts images Our image functionality fused with the robot state to predict two-dimensional operations. We trained using ADAMW (Cosine Learning Evaluation Timeline, Mixed Accuracy, and Gradient Clips) while evaluating on the validation set by MSE. We check the best model by validating loss so that the most powerful policies can be reloaded later. Check The complete code is here.

policy.load_state_dict(torch.load(ckpt)["state_dict"]); policy.eval()
os.makedirs("/content/vis", exist_ok=True)


def draw_arrow(imgCHW, action_xy, scale=40):
   import PIL.Image, PIL.ImageDraw
   C,H,W = imgCHW.shape
   arr = (imgCHW.clamp(0,1).permute(1,2,0).cpu().numpy()*255).astype(np.uint8)
   im = PIL.Image.fromarray(arr)
   dr = PIL.ImageDraw.Draw(im)
   cx, cy = W//2, H//2
   dx, dy = float(action_xy[0])*scale, float(-action_xy[1])*scale
   dr.line((cx, cy, cx+dx, cy+dy), width=3, fill=(0,255,0))
   return np.array(im)


frames = []
with torch.no_grad():
   for i in range(60):
       b = wrapped[i]
       img = b["image"].unsqueeze(0).to(DEVICE)
       st  = b["state"].unsqueeze(0).to(DEVICE)
       pred = policy(img, st)[0].cpu()
       frames.append(draw_arrow(b["image"], pred))
video_path = "/content/vis/pusht_pred.mp4"
imageio.mimsave(video_path, frames, fps=10)
print("Wrote", video_path)


grid = make_grid(torch.stack([wrapped[i]["image"] for i in range(16)]), nrow=8)
save_image(grid, "/content/vis/grid.png")
print("Saved grid:", "/content/vis/grid.png")

We reload the best checkpoint and then switch the policy to evaluation so that we can see its behavior. We overlay the action arrows on the frame, stitch them into a short MP4, and save the fast image grid to get a snapshot view of the dataset. This allows us to confirm at a glance the action of our model’s real Pusht observations.

In short, we see how Lerobot can easily integrate data processing, policy definition and evaluation into one framework. By training our lightweight policies and visualizing predictive actions on the Pusht framework, we confirm that the library provides us with a practical entry point for robot learning without the need for actual hardware. Now we have the ability to extend the pipeline to more advanced models such as diffusion or ACT policies, try different datasets, and even share our trained policies on the Hug Horizon.

Check The complete code is here. Check out ours anytime Tutorials, codes and notebooks for github pages. Also, please stay tuned for us twitter And don’t forget to join us 100K+ ml reddit And subscribe Our newsletter.

Asif Razzaq is CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, ASIF is committed to harnessing the potential of artificial intelligence to achieve social benefits. His recent effort is to launch Marktechpost, an artificial intelligence media platform that has an in-depth coverage of machine learning and deep learning news that can sound both technically, both through technical voices and be understood by a wide audience. The platform has over 2 million views per month, demonstrating its popularity among its audience.

🔥[Recommended Read] NVIDIA AI Open Source VIPE (Video Pose Engine): A powerful and universal 3D video annotation tool for spatial AI

Coding Guide for End-to-End Robot Learning, Using Lerobot: Training, Evaluation and Visualization of Behavioral Cloning Policy on Pusht

You may also like...

live chat

Recent Posts

Coding Guide for End-to-End Robot Learning, Using Lerobot: Training, Evaluation and Visualization of Behavioral Cloning Policy on Pusht

You may also like...

Can current flow without electrons?

Amazon Nova Foundation Model: Redefining price and performance for generative AI

Recommended aspect ratio by meta creative placement

live chat

Recent Posts