How to build an agent-based deep reinforcement learning system with curriculum progression, adaptive exploration, and meta-level UCB planning

In this tutorial, we build an advanced agentic deep reinforcement learning system in which the agent not only learns actions in its environment, but also learns how to choose its own training strategy. We design a Dueling Double DQN learner, introduce a curriculum of increasing difficulty, and integrate multiple exploration modes that adapt as training progresses. Most importantly, we build a meta-agent that plans, evaluates, and regulates the entire learning process, letting us see how an agent can turn reinforcement learning into a self-directed strategic workflow.

!pip install -q gymnasium[classic-control] torch matplotlib


import gymnasium as gym
import numpy as np
import torch, torch.nn as nn, torch.optim as optim
from collections import deque, defaultdict
import math, random, matplotlib.pyplot as plt


random.seed(0); np.random.seed(0); torch.manual_seed(0)
# Device used by the replay buffer and networks below
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class DuelingQNet(nn.Module):
   def __init__(self, obs_dim, act_dim):
       super().__init__()
       hidden = 128
       self.feature = nn.Sequential(
           nn.Linear(obs_dim, hidden),
           nn.ReLU(),
       )
       self.value_head = nn.Sequential(
           nn.Linear(hidden, hidden),
           nn.ReLU(),
           nn.Linear(hidden, 1),
       )
       self.adv_head = nn.Sequential(
           nn.Linear(hidden, hidden),
           nn.ReLU(),
           nn.Linear(hidden, act_dim),
       )


   def forward(self, x):
       h = self.feature(x)
       v = self.value_head(h)
       a = self.adv_head(h)
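       # Dueling aggregation: Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)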
       return v + (a - a.mean(dim=1, keepdim=True))


class ReplayBuffer:
   def __init__(self, capacity=100000):
       self.buffer = deque(maxlen=capacity)
   def push(self, s,a,r,ns,d):
       self.buffer.append((s,a,r,ns,d))
   def sample(self, batch_size):
       batch = random.sample(self.buffer, batch_size)
       s,a,r,ns,d = zip(*batch)
       def to_t(x, dt): return torch.tensor(x, dtype=dt, device=device)
       return to_t(s,torch.float32), to_t(a,torch.long), to_t(r,torch.float32), to_t(ns,torch.float32), to_t(d,torch.float32)
   def __len__(self): return len(self.buffer)

We establish the core structure of the deep reinforcement learning system. We set the random seeds, define the Dueling Q-network, and prepare a replay buffer to store transitions efficiently. With these foundations in place, the agent has everything it needs to start learning.
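
As an optional sanity check (not part of the original walkthrough), we can instantiate the network and buffer for CartPole's 4-dimensional observation space and 2 actions, push a single dummy transition, and confirm the output shapes line up:

# Optional sanity check of the building blocks (illustrative, not from the article)
net = DuelingQNet(obs_dim=4, act_dim=2).to(device)    # CartPole-v1: 4 observations, 2 actions
buf = ReplayBuffer(capacity=10)

dummy_state = np.zeros(4, dtype=np.float32)
buf.push(dummy_state, 0, 1.0, dummy_state, 0.0)       # (state, action, reward, next_state, done)

with torch.no_grad():
    q = net(torch.tensor(dummy_state, device=device).unsqueeze(0))
print(q.shape)                                         # torch.Size([1, 2]): one Q-value per action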

class DQNAgent:
   def __init__(self, obs_dim, act_dim, gamma=0.99, lr=1e-3, batch_size=64):
       self.q = DuelingQNet(obs_dim, act_dim).to(device)
       self.tgt = DuelingQNet(obs_dim, act_dim).to(device)
       self.tgt.load_state_dict(self.q.state_dict())
       self.buf = ReplayBuffer()
       self.opt = optim.Adam(self.q.parameters(), lr=lr)
       self.gamma = gamma
       self.batch_size = batch_size
       self.global_step = 0


   def _eps_value(self, step, start=1.0, end=0.05, decay=8000):
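       # Epsilon decays exponentially from `start` toward `end` over roughly `decay` steps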
       return end + (start - end) * math.exp(-step/decay)


   def select_action(self, state, mode, strategy, softmax_temp=1.0):
       s = torch.tensor(state, dtype=torch.float32, device=device).unsqueeze(0)
       with torch.no_grad():
           q_vals = self.q(s).cpu().numpy()[0]
       if mode == "eval":
           return int(np.argmax(q_vals)), None
       if strategy == "epsilon":
           eps = self._eps_value(self.global_step)
           if random.random() < eps:
               return random.randrange(len(q_vals)), eps
           return int(np.argmax(q_vals)), eps
       # Softmax (Boltzmann) exploration over the Q-values
       logits = q_vals / max(softmax_temp, 1e-6)
       logits = logits - logits.max()
       probs = np.exp(logits) / np.exp(logits).sum()
       return int(np.random.choice(len(probs), p=probs)), None

We define how the agent observes the environment, chooses actions, and updates its neural network. We implement Double DQN logic, gradient updates, and exploration strategies so the agent can balance learning and discovery. Once this snippet is complete, our agent has full low-level learning capabilities.
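
The snippet above, as published, stops short of the remaining DQNAgent methods that the later cells rely on: the Double DQN gradient step, update_target, run_episodes, and evaluate_across_levels. Below is a minimal sketch of those missing pieces, written to match the call signatures used further down; the method bodies are our reconstruction under standard Double DQN assumptions rather than the article's original code, and they belong inside the DQNAgent class.

   # Reconstructed sketch: methods called later but not shown above (inside class DQNAgent)
   def train_step(self):
       if len(self.buf) < self.batch_size:
           return
       s, a, r, ns, d = self.buf.sample(self.batch_size)
       q_sa = self.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
       with torch.no_grad():
           # Double DQN: the online network picks the next action, the target network evaluates it
           next_a = self.q(ns).argmax(dim=1, keepdim=True)
           next_q = self.tgt(ns).gather(1, next_a).squeeze(1)
           target = r + self.gamma * (1.0 - d) * next_q
       loss = nn.functional.smooth_l1_loss(q_sa, target)
       self.opt.zero_grad()
       loss.backward()
       self.opt.step()

   def update_target(self):
       # Hard update of the target network
       self.tgt.load_state_dict(self.q.state_dict())

   def run_episodes(self, env, n_episodes, mode, strategy):
       returns = []
       for _ in range(n_episodes):
           state, _ = env.reset()
           done, ep_ret = False, 0.0
           while not done:
               action, _ = self.select_action(state, mode, strategy)
               next_state, reward, terminated, truncated, _ = env.step(action)
               done = terminated or truncated
               if mode == "train":
                   self.buf.push(state, action, reward, next_state, float(done))
                   self.global_step += 1
                   self.train_step()
               state, ep_ret = next_state, ep_ret + reward
           returns.append(ep_ret)
       return float(np.mean(returns))

   def evaluate_across_levels(self, levels, episodes=5):
       scores = {}
       for name, max_steps in levels.items():
           env = gym.make("CartPole-v1", max_episode_steps=max_steps)
           scores[name] = self.run_episodes(env, episodes, "eval", "epsilon")
           env.close()
       return scores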

class MetaAgent:
   def __init__(self, agent):
       self.agent = agent
       self.levels = {
           "EASY": 100,
           "MEDIUM": 300,
           "HARD": 500,
       }
       self.plans = []
       for diff in self.levels.keys():
           for mode in ["train", "eval"]:
               for expl in ["epsilon", "softmax"]:
                   self.plans.append((diff, mode, expl))
       self.counts = defaultdict(int)
       self.values = defaultdict(float)
       self.t = 0
       self.history = []


   def _ucb_score(self, plan, c=2.0):
       n = self.counts[plan]
       if n == 0:
           return float("inf")
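       # UCB1-style score: average meta-reward plus an exploration bonus that shrinks with visits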
       return self.values[plan] + c * math.sqrt(math.log(self.t+1) / n)


   def select_plan(self):
       self.t += 1
       scores = [self._ucb_score(p) for p in self.plans]
       return self.plans[int(np.argmax(scores))]


   def make_env(self, diff):
       max_steps = self.levels[diff]
       return gym.make("CartPole-v1", max_episode_steps=max_steps)


   def meta_reward_fn(self, diff, mode, avg_return):
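       # Shaped meta-reward: bonus for harder curricula, extra bonus for strong evaluation on HARD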
       r = avg_return
       if diff == "MEDIUM": r += 20
       if diff == "HARD": r += 50
       if mode == "eval" and diff == "HARD": r += 50
       return r


   def update_plan_value(self, plan, meta_reward):
       self.counts[plan] += 1
       n = self.counts[plan]
       mu = self.values[plan]
       self.values[plan] = mu + (meta_reward - mu) / n


   def run(self, meta_rounds=30):
       eval_log = {"EASY":[], "MEDIUM":[], "HARD":[]}
       for k in range(1, meta_rounds+1):
           diff, mode, expl = self.select_plan()
           env = self.make_env(diff)
           avg_ret = self.agent.run_episodes(env, 5 if mode=="train" else 3, mode, expl if mode=="train" else "epsilon")
           env.close()
           if k % 3 == 0:
               self.agent.update_target()
           meta_r = self.meta_reward_fn(diff, mode, avg_ret)
           self.update_plan_value((diff,mode,expl), meta_r)
           self.history.append((k, diff, mode, expl, avg_ret, meta_r))
           if mode == "eval":
               eval_log[diff].append((k, avg_ret))
           print(f"{k} {diff} {mode} {expl} {avg_ret:.1f} {meta_r:.1f}")
       return eval_log

We design the meta-agent layer that decides how the agent should be trained. We use a UCB bandit to select difficulty levels, modes, and exploration styles based on past performance. As we repeatedly run these selections, the meta-agent strategically guides the entire training process.
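
To see what the UCB rule in _ucb_score is doing, here is a small standalone illustration with made-up statistics (the plan names, counts, and values are hypothetical):

import math

# Hypothetical visit counts and average meta-rewards after a few meta-rounds
counts = {"EASY/train/epsilon": 5, "MEDIUM/train/softmax": 2, "HARD/eval/epsilon": 1}
values = {"EASY/train/epsilon": 80.0, "MEDIUM/train/softmax": 120.0, "HARD/eval/epsilon": 60.0}
t, c = 8, 2.0

for plan in counts:
    bonus = c * math.sqrt(math.log(t + 1) / counts[plan])   # exploration bonus shrinks with visits
    print(plan, round(values[plan] + bonus, 2))

# Untried plans score float("inf"), so every plan is sampled at least once; after that,
# rarely tried plans keep a larger bonus on top of their average meta-reward.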

tmp_env = gym.make("CartPole-v1", max_episode_steps=100)
obs_dim, act_dim = tmp_env.observation_space.shape[0], tmp_env.action_space.n
tmp_env.close()


agent = DQNAgent(obs_dim, act_dim)
meta = MetaAgent(agent)


eval_log = meta.run(meta_rounds=36)


final_scores = agent.evaluate_across_levels(meta.levels, episodes=10)
print("Final Evaluation")
for k, v in final_scores.items():
   print(k, v)

We bring everything together by launching the meta-rounds, in which the meta-agent selects plans and the DQN agent executes them. We track how performance develops and how the agent adapts to increasingly difficult tasks. As this code runs, we see long-term autonomous learning emerge.

plt.figure(figsize=(9,4))
for diff, color in [("EASY","tab:blue"), ("MEDIUM","tab:orange"), ("HARD","tab:red")]:
   if eval_log[diff]:
       x, y = zip(*eval_log[diff])
       plt.plot(x, y, marker="o", label=f"{diff}")
plt.xlabel("Meta-Round")
plt.ylabel("Avg Return")
plt.title("Agentic Meta-Control Evaluation")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

We visualize how the agent performs over time on the easy, medium, and hard tasks. We observe learning trends, improvements, and the effects of meta-level planning reflected in the curves. Analyzing these plots gives us insight into how strategic decisions affect the agent's overall progress.

In summary, we watch our agent evolve into a system that learns at multiple levels: it refines its strategy, adapts its exploration, and strategically chooses how to train itself. The meta-agent sharpens its decisions through UCB-based planning, guiding the low-level learner toward more challenging tasks while improving stability. By understanding how this agentic structure amplifies reinforcement learning, we can create systems that plan, adapt, and optimize their own improvement over time.




