
A Coding Implementation of a Complete Self-Hosted LLM Workflow with Ollama, the REST API, and a Gradio Chat Interface

In this tutorial, we implement a fully functional Ollama environment inside Google Colab to replicate a self-hosted LLM workflow. We first install Ollama directly on the Colab VM using the official Linux installer and start the Ollama server in the background to expose its HTTP API on localhost:11434. After validating the service, we pull a lightweight instruction-tuned model such as qwen2.5:0.5b-instruct or llama3.2:1b, balancing resource constraints with availability in a CPU-only environment. To interact with the model programmatically, we use the /api/chat endpoint through Python's requests module with streaming enabled, which lets us capture token-level output incrementally. Finally, we layer a Gradio-based UI on top of this client so we can issue prompts, maintain multi-turn history, configure parameters such as temperature and context size, and view results in real time. Check out the full code here.
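
As a quick preview of the core interaction layer, here is a minimal sketch of a single non-streaming request to the /api/chat endpoint; it assumes the Ollama server is already running on localhost:11434 and that the model has been pulled, both of which the steps below take care of.

import requests

# A single, non-streaming request to Ollama's /api/chat endpoint.
# Assumes the server is already running on localhost:11434 and the model
# has been pulled (both handled by the steps in this tutorial).
resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "qwen2.5:0.5b-instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])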

import os, sys, subprocess, time, json, requests, textwrap
from pathlib import Path


def sh(cmd, check=True):
   """Run a shell command, stream output."""
   p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for line in p.stdout:
       print(line, end="")
   p.wait()
   if check and p.returncode != 0:
       raise RuntimeError(f"Command failed: {cmd}")


if not Path("/usr/local/bin/ollama").exists() and not Path("/usr/bin/ollama").exists():
   print("🔧 Installing Ollama ...")
   sh("curl -fsSL  | sh")
else:
   print("✅ Ollama already installed.")


try:
   import gradio 
except Exception:
   print("🔧 Installing Gradio ...")
   sh("pip -q install gradio==4.44.0")

We first check whether Ollama is already installed on the system and, if not, install it using the official script. We also make sure Gradio is available by importing it, or installing the required version if the import fails. In this way, we prepare the Colab environment to run the chat interface smoothly. Check out the full code here.
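
As an optional sanity check that is not part of the original walkthrough, we can confirm that the ollama binary is on the PATH and print its version:

import shutil, subprocess

# Optional sanity check: confirm the ollama binary is discoverable and report its version.
ollama_path = shutil.which("ollama")
print("ollama binary:", ollama_path or "not found")
if ollama_path:
    subprocess.run(["ollama", "--version"], check=False)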

def start_ollama():
   try:
       requests.get(" timeout=1)
       print("✅ Ollama server already running.")
       return None
   except Exception:
       pass
   print("🚀 Starting Ollama server ...")
   proc = subprocess.Popen(["ollama", "serve"], stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
   for _ in range(60):
       time.sleep(1)
       try:
           r = requests.get(" timeout=1)
           if r.ok:
               print("✅ Ollama server is up.")
               break
       except Exception:
           pass
   else:
       raise RuntimeError("Ollama did not start in time.")
   return proc


server_proc = start_ollama()

We start the Ollama server in the background and keep polling its health endpoint until it responds successfully. By doing this, we make sure the server is running and ready before sending any API requests. Check out the full code here.
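
Because start_ollama() returns the subprocess.Popen handle (or None when a server was already running), we can add a small optional cleanup helper, sketched below, to stop the background process at the end of a session:

import subprocess

def stop_ollama(proc, timeout=10):
    """Stop the background Ollama server started by start_ollama(), if any."""
    if proc is None:
        print("ℹ️ No server process to stop (it was already running).")
        return
    proc.terminate()                 # ask the server to shut down gracefully
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()                  # force-kill if it does not exit in time
    print("🛑 Ollama server stopped.")

# Example: call stop_ollama(server_proc) at the end of the session.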

MODEL = os.environ.get("OLLAMA_MODEL", "qwen2.5:0.5b-instruct")
print(f"🧠 Using model: {MODEL}")
try:
   tags = requests.get(" timeout=5).json()
   have = any(m.get("name")==MODEL for m in tags.get("models", []))
except Exception:
   have = False


if not have:
   print(f"⬇️  Pulling model {MODEL} (first time only) ...")
   sh(f"ollama pull {MODEL}")

We define the default model to use, check whether it is already available on the Ollama server, and pull it automatically if it is not. This ensures the selected model is ready before any chat session starts. Check out the full code here.
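
If we want to see which models are already cached locally, for instance when experimenting with several of them, we can query the same /api/tags endpoint; a small optional sketch, assuming the server is reachable on localhost:11434:

import requests

# List the models already present in the local Ollama store via /api/tags.
tags = requests.get("http://localhost:11434/api/tags", timeout=5).json()
for m in tags.get("models", []):
    size_gb = m.get("size", 0) / 1e9      # size is reported in bytes
    print(f"{m['name']:<30} {size_gb:.2f} GB")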

OLLAMA_URL = "


def ollama_chat_stream(messages, model=MODEL, temperature=0.2, num_ctx=None):
   """Yield streaming text chunks from Ollama /api/chat."""
   payload = {
       "model": model,
       "messages": messages,
       "stream": True,
       "options": {"temperature": float(temperature)}
   }
   if num_ctx:
       payload["options"]["num_ctx"] = int(num_ctx)
   with requests.post(OLLAMA_URL, json=payload, stream=True) as r:
       r.raise_for_status()
       for line in r.iter_lines():
           if not line:
               continue
           data = json.loads(line.decode("utf-8"))
           if "message" in data and "content" in data["message"]:
               yield data["message"]["content"]
           if data.get("done"):
               break

We create a streaming client for the Ollama /api/chat endpoint: we send the messages as a JSON payload and yield tokens as they arrive. This allows us to process the response incrementally, so we can see the model's output in real time instead of waiting for the full completion. Check out the full code here.
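
When we only need the final text, for example in scripts or tests, a thin optional wrapper can join the streamed chunks into a single string; this is a convenience sketch built on top of ollama_chat_stream, not part of the original code:

def ollama_chat(messages, **kwargs):
    """Blocking convenience wrapper: join the streamed chunks into one reply."""
    return "".join(ollama_chat_stream(messages, **kwargs))

# Example:
# reply = ollama_chat([{"role": "user", "content": "Summarize HTTP in one line."}], temperature=0.1)
# print(reply)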

def smoke_test():
   print("n🧪 Smoke test:")
   sys_msg = {"role":"system","content":"You are concise. Use short bullets."}
   user_msg = {"role":"user","content":"Give 3 quick tips to sleep better."}
   out = []
   for chunk in ollama_chat_stream([sys_msg, user_msg], temperature=0.3):
       print(chunk, end="")
       out.append(chunk)
   print("n🧪 Done.n")
try:
   smoke_test()
except Exception as e:
   print("⚠️ Smoke test skipped:", e)

We perform a quick smoke test by sending a simple prompt through the streaming client to confirm that the model responds correctly. This helps us verify that Ollama is installed, the server is running, and the selected model works before we build the full chat UI. Check out the full code here.
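
To confirm that multi-turn context behaves the way the chat UI will use it, we can also replay a short two-turn conversation through the streaming client; this optional sketch simply appends the assistant's first reply to the message list before asking a follow-up:

# Optional: a quick two-turn check that conversation history is respected.
history = [
    {"role": "system", "content": "You are concise."},
    {"role": "user", "content": "Pick a fruit and name it."},
]
first = "".join(ollama_chat_stream(history, temperature=0.3))
print("Turn 1:", first)

# Feed the assistant's reply back as context, then ask a follow-up about it.
history.append({"role": "assistant", "content": first})
history.append({"role": "user", "content": "Give one nutrition fact about that fruit."})
second = "".join(ollama_chat_stream(history, temperature=0.3))
print("Turn 2:", second)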

import gradio as gr


SYSTEM_PROMPT = "You are a helpful, crisp assistant. Prefer bullets when helpful."


def chat_fn(message, history, temperature, num_ctx):
   msgs = [{"role":"system","content":SYSTEM_PROMPT}]
   for u, a in history:
       if u: msgs.append({"role":"user","content":u})
       if a: msgs.append({"role":"assistant","content":a})
   msgs.append({"role":"user","content": message})
   acc = ""
   try:
       for part in ollama_chat_stream(msgs, model=MODEL, temperature=temperature, num_ctx=num_ctx or None):
           acc += part
           yield acc
   except Exception as e:
       yield f"⚠️ Error: {e}"


with gr.Blocks(title="Ollama Chat (Colab)", fill_height=True) as demo:
   gr.Markdown("# 🦙 Ollama Chat (Colab)nSmall local-ish LLM via Ollama + Gradio.n")
   with gr.Row():
       temp = gr.Slider(0.0, 1.0, value=0.3, step=0.1, label="Temperature")
       num_ctx = gr.Slider(512, 8192, value=2048, step=256, label="Context Tokens (num_ctx)")
   chat = gr.Chatbot(height=460)
   msg = gr.Textbox(label="Your message", placeholder="Ask anything…", lines=3)
   clear = gr.Button("Clear")


   def user_send(m, h):
       m = (m or "").strip()
       if not m: return "", h
       return "", h + [[m, None]]


   def bot_reply(h, temperature, num_ctx):
       u = h[-1][0]
       stream = chat_fn(u, h[:-1], temperature, int(num_ctx))
       acc = ""
       for partial in stream:
           acc = partial
           h[-1][1] = acc
           yield h


   msg.submit(user_send, [msg, chat], [msg, chat]).then(bot_reply, [chat, temp, num_ctx], [chat])
   clear.click(lambda: None, None, chat)


print("🌐 Launching Gradio ...")
demo.launch(share=True)

We integrate Gradio to build an interactive chat UI on top of the Ollama server: user input and conversation history are converted into the correct message format, and the model's response streams back into the chat window. The sliders let us adjust parameters such as temperature and context length, while the chat box and clear button provide a simple real-time interface for testing different prompts.
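
Because the model name is read from the OLLAMA_MODEL environment variable, trying a different model is a matter of setting that variable and re-running the pull and UI cells; a small hypothetical sketch using the llama3.2:1b alternative mentioned earlier:

import os

# Hypothetical example: switch to the other small model mentioned above, then
# re-run the model-pull and UI cells so MODEL picks up the new value.
os.environ["OLLAMA_MODEL"] = "llama3.2:1b"
# sh("ollama pull llama3.2:1b")   # or let the pull cell download it on re-run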

In short, we built a reproducible pipeline for running Ollama in Colab: installation, server startup, model management, API access, and UI integration. The system uses Ollama's REST API as the core interaction layer, providing both command-line and Python streaming access, while Gradio handles session persistence and chat rendering. This approach retains the "self-hosted" design described in the original guide but adapts it to the limitations of Colab, where Docker and GPU-backed Ollama images are impractical. The result is a compact, technically complete framework that lets us try multiple LLMs, dynamically adjust generation parameters, and test conversational AI locally in a notebook environment.




Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. He is very interested in solving practical problems, and he brings a new perspective to the intersection of AI and real-life solutions.
