Q-Learning, UCB, and MCTS: How Exploration Agents Learn Intelligent Problem-Solving Strategies in a Dynamic Grid Environment
In this tutorial, we explore how exploration strategies shape intelligent decision-making in agent-based problem solving. We build and train three agents, namely Q-Learning with epsilon-greedy exploration, Upper Confidence Bound (UCB), and Monte Carlo Tree Search (MCTS), to navigate a grid world and reach a goal efficiently while avoiding obstacles. Along the way, we experiment with different ways of balancing exploration and exploitation, visualize learning curves, and compare how each agent adapts and performs under uncertainty. Check out the full code here.
import numpy as np
import random
from collections import defaultdict, deque
import math
import matplotlib.pyplot as plt
from typing import List, Tuple, Dict
class GridWorld:
    def __init__(self, size=10, n_obstacles=15):
        self.size = size
        self.grid = np.zeros((size, size))
        self.start = (0, 0)
        self.goal = (size - 1, size - 1)
        obstacles = set()
        while len(obstacles) < n_obstacles:
            pos = (random.randint(0, size - 1), random.randint(0, size - 1))
            if pos not in (self.start, self.goal):
                obstacles.add(pos)
        for r, c in obstacles:
            self.grid[r, c] = 1  # mark obstacle cells
        self.agent_pos = self.start
We first create a grid world environment that challenges our agent to reach a goal while avoiding obstacles. We design its structure, define movement rules, and enforce navigation boundaries to simulate an interactive problem-solving space. This forms the basis on which our exploration agents operate and learn. Check out the full code here.
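The excerpt above omits the environment methods that the rest of the tutorial calls (reset, get_valid_actions, and step). Below is a minimal sketch of those methods under stated assumptions: a four-move action encoding, a small per-step penalty, and a goal reward of +1; the original implementation may differ.

# Plausible completion of GridWorld, matching the API used later by
# train_agent and MCTSAgent. Move encoding and reward values are assumptions.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right (assumed)

def reset(self):
    self.agent_pos = self.start
    return self.agent_pos

def get_valid_actions(self, state):
    valid = []
    for a, (dr, dc) in enumerate(MOVES):
        r, c = state[0] + dr, state[1] + dc
        if 0 <= r < self.size and 0 <= c < self.size and self.grid[r, c] == 0:
            valid.append(a)
    return valid

def step(self, action):
    dr, dc = MOVES[action]
    r, c = self.agent_pos[0] + dr, self.agent_pos[1] + dc
    if 0 <= r < self.size and 0 <= c < self.size and self.grid[r, c] == 0:
        self.agent_pos = (r, c)
    done = self.agent_pos == self.goal
    reward = 1.0 if done else -0.01  # assumed shaping: small step cost, goal bonus
    return self.agent_pos, reward, done

# Attach the sketched methods so the rest of the tutorial runs as written.
GridWorld.reset = reset
GridWorld.get_valid_actions = get_valid_actions
GridWorld.step = step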
class QLearningAgent:
    def __init__(self, n_actions=4, alpha=0.1, gamma=0.95, epsilon=1.0):
        self.n_actions = n_actions
        self.alpha = alpha      # learning rate
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability
        self.q_table = defaultdict(lambda: np.zeros(n_actions))

    def get_action(self, state, valid_actions):
        if random.random() < self.epsilon:
            return random.choice(valid_actions)  # explore
        return max(valid_actions, key=lambda a: self.q_table[state][a])  # exploit
We implement a Q-Learning agent that learns from experience under an epsilon-greedy policy. Early on it explores with random actions, and it gradually focuses on the most valuable paths; through iterative updates, it learns to balance exploration and exploitation effectively.
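The excerpt also omits the Q-table update and the epsilon decay that the training loop expects (agent.update(...) and decay_epsilon()). Here is a minimal sketch using the standard tabular rule Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)); the decay rate and floor are our assumptions.

def q_update(self, state, action, reward, next_state, valid_next_actions):
    current_q = self.q_table[state][action]
    if valid_next_actions:
        max_next_q = max(self.q_table[next_state][a] for a in valid_next_actions)
    else:
        max_next_q = 0.0
    target = reward + self.gamma * max_next_q
    self.q_table[state][action] += self.alpha * (target - current_q)

def decay_epsilon(self, rate=0.995, min_epsilon=0.01):  # assumed schedule
    self.epsilon = max(min_epsilon, self.epsilon * rate)

QLearningAgent.update = q_update
QLearningAgent.decay_epsilon = decay_epsilon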
class UCBAgent:
    def __init__(self, n_actions=4, c=2.0, gamma=0.95):
        self.n_actions = n_actions
        self.c = c  # exploration coefficient
        self.gamma = gamma
        self.q_values = defaultdict(lambda: np.zeros(n_actions))
        self.action_counts = defaultdict(lambda: np.zeros(n_actions))
        self.total_counts = defaultdict(int)

    def get_action(self, state, valid_actions):
        self.total_counts[state] += 1
        ucb_values = []
        for action in valid_actions:
            q = self.q_values[state][action]
            count = self.action_counts[state][action]
            if count == 0:
                return action  # try every action at least once
            exploration_bonus = self.c * math.sqrt(math.log(self.total_counts[state]) / count)
            ucb_values.append((action, q + exploration_bonus))
        return max(ucb_values, key=lambda x: x[1])[0]

    def update(self, state, action, reward, next_state, valid_next_actions):
        self.action_counts[state][action] += 1
        count = self.action_counts[state][action]
        current_q = self.q_values[state][action]
        if valid_next_actions:
            max_next_q = max([self.q_values[next_state][a] for a in valid_next_actions])
        else:
            max_next_q = 0
        target = reward + self.gamma * max_next_q
        self.q_values[state][action] += (target - current_q) / count
We develop the UCB agent, which uses confidence bounds to guide its exploration decisions. It strategically tries less-visited actions while prioritizing those that have yielded higher returns, illustrating a more mathematically grounded approach to exploration. Check out the full code here.
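As a quick sanity check on the bonus term c * sqrt(ln N(s) / n(s, a)), the illustrative snippet below (not part of the agent) prints how the bonus shrinks as an action's count grows, assuming c = 2 and 100 total visits to the state:

for count in (1, 5, 25, 100):
    bonus = 2.0 * math.sqrt(math.log(100) / count)
    print(f"count={count:3d}  exploration bonus={bonus:.3f}")
# count=1 gives a bonus of about 4.29; count=100 gives about 0.43:
# the incentive to revisit an action fades as its uncertainty shrinks.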
class MCTSNode:
    def __init__(self, state, parent=None):
        self.state = state
        self.parent = parent
        self.children = {}
        self.visits = 0
        self.value = 0.0

    def is_fully_expanded(self, valid_actions):
        return len(self.children) == len(valid_actions)

    def best_child(self, c=1.4):
        choices = [(action, child.value / child.visits +
                    c * math.sqrt(2 * math.log(self.visits) / child.visits))
                   for action, child in self.children.items()]
        return max(choices, key=lambda x: x[1])

class MCTSAgent:
    def __init__(self, env, n_simulations=50):
        self.env = env
        self.n_simulations = n_simulations

    def search(self, state):
        root = MCTSNode(state)
        for _ in range(self.n_simulations):
            node = root
            sim_env = GridWorld(size=self.env.size)
            sim_env.grid = self.env.grid.copy()
            sim_env.agent_pos = state
            # Selection: descend via UCT while nodes are fully expanded
            while node.is_fully_expanded(sim_env.get_valid_actions(node.state)) and node.children:
                action, _ = node.best_child()
                node = node.children[action]
                sim_env.agent_pos = node.state
            # Expansion: add one untried child
            valid_actions = sim_env.get_valid_actions(node.state)
            if valid_actions and not node.is_fully_expanded(valid_actions):
                untried = [a for a in valid_actions if a not in node.children]
                action = random.choice(untried)
                next_state, _, _ = sim_env.step(action)
                child = MCTSNode(next_state, parent=node)
                node.children[action] = child
                node = child
            # Rollout: random playout for a bounded number of steps
            total_reward = 0
            depth = 0
            while depth < 20:
                valid = sim_env.get_valid_actions(sim_env.agent_pos)
                if not valid:
                    break
                _, reward, done = sim_env.step(random.choice(valid))
                total_reward += reward
                depth += 1
                if done:
                    break
            # Backpropagation: push the rollout return up the tree
            while node is not None:
                node.visits += 1
                node.value += total_reward
                node = node.parent
        # Act with the most-visited root child (fall back to a random valid move)
        if root.children:
            return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
        return random.choice(self.env.get_valid_actions(state))
We build a Monte Carlo Tree Search (MCTS) agent that simulates and plans over many potential futures. It grows a search tree, expands promising branches, and backpropagates rollout results to refine its decisions, which lets it plan ahead before acting. Check out the full code here.
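A short usage sketch (our own addition, not from the original): run a single MCTS search from the start state and inspect the recommended move. It assumes the GridWorld sketch shown earlier.

demo_env = GridWorld(size=8, n_obstacles=10)
demo_state = demo_env.reset()
demo_mcts = MCTSAgent(demo_env, n_simulations=50)
demo_action = demo_mcts.search(demo_state)
print(f"From {demo_state}, MCTS recommends action {demo_action} "
      f"out of valid actions {demo_env.get_valid_actions(demo_state)}")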
def train_agent(agent, env, episodes=500, max_steps=100, agent_type="standard"):
    rewards_history = []
    for episode in range(episodes):
        state = env.reset()
        total_reward = 0
        for step in range(max_steps):
            valid_actions = env.get_valid_actions(state)
            if agent_type == "mcts":
                action = agent.search(state)
            else:
                action = agent.get_action(state, valid_actions)
            next_state, reward, done = env.step(action)
            total_reward += reward
            if agent_type != "mcts":
                valid_next = env.get_valid_actions(next_state)
                agent.update(state, action, reward, next_state, valid_next)
            state = next_state
            if done:
                break
        rewards_history.append(total_reward)
        if hasattr(agent, 'decay_epsilon'):
            agent.decay_epsilon()
        if (episode + 1) % 100 == 0:
            avg_reward = np.mean(rewards_history[-100:])
            print(f"Episode {episode+1}/{episodes}, Avg Reward: {avg_reward:.2f}")
    return rewards_history
if __name__ == "__main__":
    print("=" * 70)
    print("Problem Solving via Exploration Agents Tutorial")
    print("=" * 70)
    env = GridWorld(size=8, n_obstacles=10)
    agents_config = {
        'Q-Learning (ε-greedy)': (QLearningAgent(), 'standard'),
        'UCB Agent': (UCBAgent(), 'standard'),
        'MCTS Agent': (MCTSAgent(env, n_simulations=30), 'mcts')
    }
    results = {}
    for name, (agent, agent_type) in agents_config.items():
        print(f"\nTraining {name}...")
        # The MCTS agent plans on the grid it was constructed with, so train it
        # on that same env; the learning agents each get a fresh grid.
        train_env = env if agent_type == 'mcts' else GridWorld(size=8, n_obstacles=10)
        rewards = train_agent(agent, train_env, episodes=300, agent_type=agent_type)
        results[name] = rewards
    plt.figure(figsize=(12, 5))
    plt.subplot(1, 2, 1)
    for name, rewards in results.items():
        smoothed = np.convolve(rewards, np.ones(20) / 20, mode="valid")
        plt.plot(smoothed, label=name, linewidth=2)
    plt.xlabel('Episode')
    plt.ylabel('Reward (smoothed)')
    plt.title('Agent Performance Comparison')
    plt.legend()
    plt.grid(alpha=0.3)
    plt.subplot(1, 2, 2)
    for name, rewards in results.items():
        avg_last_100 = np.mean(rewards[-100:])
        plt.bar(name, avg_last_100, alpha=0.7)
    plt.ylabel('Average Reward (Last 100 Episodes)')
    plt.title('Final Performance')
    plt.xticks(rotation=15, ha="right")
    plt.grid(axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()
    print("=" * 70)
    print("Tutorial Complete!")
    print("Key Concepts Demonstrated:")
    print("1. Epsilon-Greedy exploration")
    print("2. UCB strategy")
    print("3. MCTS-based planning")
    print("=" * 70)
We train all three agents in the grid world and visualize their learning progress and final performance. We analyze how each strategy (Q-Learning, UCB, and MCTS) adapts to the environment over time, then compare the results to see which exploration method solves the task faster and more reliably.
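One simple way to probe the learned policies beyond reward curves is a greedy rollout. The helper below is our own addition, not part of the original tutorial, and assumes a trained QLearningAgent (it reads q_table directly):

def greedy_rollout(agent, env, max_steps=100):
    # Follow the greedy (epsilon = 0) policy once; return steps to goal or None.
    state = env.reset()
    for step in range(max_steps):
        valid = env.get_valid_actions(state)
        if not valid:
            break
        action = max(valid, key=lambda a: agent.q_table[state][a])
        state, _, done = env.step(action)
        if done:
            return step + 1
    return None

# Example (assumes a trained QLearningAgent named q_agent):
# print(greedy_rollout(q_agent, GridWorld(size=8, n_obstacles=10)))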
In summary, we implemented and compared three exploration-driven agents, each demonstrating a distinct strategy for the same navigation challenge. We saw how epsilon-greedy enables incremental learning through randomness, how UCB balances confidence with curiosity, and how MCTS leverages simulated rollouts for foresight and planning. The exercise clarifies how different exploration mechanisms affect the convergence, adaptability, and efficiency of reinforcement learning.