How to build a fully functional custom GPT-style conversational AI locally using Hugging Face Transformers
In this tutorial, we build our own custom GPT-style chat system from scratch using a native Hugging Face model. We first load a lightweight instruction-tuned model that understands conversational prompts, then wrap it in a structured chat framework with system roles, conversational memory, and assistant responses. We define how the agent interprets context, constructs messages, and optionally calls small built-in tools to fetch local data or simulate search results. By the end, we have a fully functional conversational model that behaves like a personalized, locally run GPT.
!pip install transformers accelerate sentencepiece --quiet
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Tuple, Optional
import textwrap, json, os
We first install the necessary libraries and import the required modules. This ensures that the environment has every dependency we need, such as transformers, torch, and sentencepiece, available for use, and lets us work seamlessly with Hugging Face models inside Google Colab.
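Before loading anything heavy, a quick optional check (not part of the original walkthrough) can confirm the installed transformers version and whether Colab has assigned us a GPU:
import torch
import transformers

# Optional sanity check before loading the model.
print("transformers version:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))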
MODEL_NAME = "microsoft/Phi-3-mini-4k-instruct"
BASE_SYSTEM_PROMPT = (
    "You are a custom GPT running locally. "
    "Follow user instructions carefully. "
    "Be concise and structured. "
    "If something is unclear, say it is unclear. "
    "Prefer practical examples over corporate examples unless explicitly asked. "
    "When asked for code, give runnable code."
)
MAX_NEW_TOKENS = 256
We configure the model name, define the system prompt that controls the assistant’s behavior, and set a limit on newly generated tokens. This is where we decide how our custom GPT should respond: concise, clean, structured, and practical. It establishes the foundation of our model’s identity and instruction style.
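Because the assistant’s behavior lives entirely in this string, changing the persona is just a matter of swapping the prompt. The variable name and wording below are purely illustrative, not part of the tutorial code:
# Hypothetical variant: same skeleton, different persona.
CODE_REVIEWER_PROMPT = (
    "You are a strict but friendly code reviewer running locally. "
    "Point out bugs first and style issues second. "
    "Always suggest a concrete fix."
)
# To try it, use this text as the system role when the conversation history is created later:
# history = [("system", CODE_REVIEWER_PROMPT)]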
print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token_id is None:
tokenizer.pad_token_id = tokenizer.eos_token_id
model = AutoModelForCausalLM.from_pretrained(
MODEL_NAME,
torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
device_map="auto"
)
model.eval()
print("Model loaded.")
We load the tokenizer and model from Hugging Face into memory and prepare them for inference. Device mapping is adjusted automatically based on the available hardware, so the model uses GPU acceleration whenever possible and falls back to CPU otherwise. Once loaded, the model is ready to generate responses.
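As a side note, instruct-model tokenizers on the Hub usually ship a built-in chat template, so a sketch like the following could format the conversation instead of the manual prompt builder we define next; we build the prompt by hand here to keep the format explicit.
# Optional alternative: let the tokenizer's chat template do the formatting.
messages = [
    {"role": "system", "content": BASE_SYSTEM_PROMPT},
    {"role": "user", "content": "Give me one sentence about local LLMs."},
]
templated = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(templated)  # shows the role markup the model expects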
ConversationHistory = List[Tuple[str, str]]
history: ConversationHistory = [("system", BASE_SYSTEM_PROMPT)]

def wrap_text(s: str, w: int = 100) -> str:
    return "\n".join(textwrap.wrap(s, width=w))

def build_chat_prompt(history: ConversationHistory, user_msg: str) -> str:
    # Assemble the conversation using the <|role|> ... <|end|> markers Phi-3 expects.
    prompt_parts = []
    for role, content in history:
        if role == "system":
            prompt_parts.append(f"<|system|>\n{content}<|end|>\n")
        elif role == "user":
            prompt_parts.append(f"<|user|>\n{content}<|end|>\n")
        elif role == "assistant":
            prompt_parts.append(f"<|assistant|>\n{content}<|end|>\n")
    prompt_parts.append(f"<|user|>\n{user_msg}<|end|>\n")
    prompt_parts.append("<|assistant|>\n")
    return "".join(prompt_parts)
We initialize the conversation history with the system role and write a prompt builder that formats every message. It arranges system, user, and assistant turns into one consistent conversational structure, so the model always receives the full context in the format it expects.
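To see exactly what the model receives, we can feed the builder a tiny sample conversation (the sample content below is only for illustration):
# Inspect the raw prompt string produced for a short conversation.
sample_history: ConversationHistory = [
    ("system", BASE_SYSTEM_PROMPT),
    ("user", "Hi"),
    ("assistant", "Hello! How can I help?"),
]
print(build_chat_prompt(sample_history, "What can you do?"))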
def local_tool_router(user_msg: str) -> Optional[str]:
    msg = user_msg.strip().lower()
    if msg.startswith("search:"):
        query = user_msg.split(":", 1)[-1].strip()
        return f"Search results about '{query}':\n- Key point 1\n- Key point 2\n- Key point 3"
    if msg.startswith("docs:"):
        topic = user_msg.split(":", 1)[-1].strip()
        return f"Documentation extract on '{topic}':\n1. The agent orchestrates tools.\n2. The model consumes output.\n3. Responses become memory."
    return None
We add a lightweight tool router that extends our GPT with simulated capabilities such as search or document retrieval. It detects special prefixes in user queries, such as "search:" or "docs:", and returns stub output that is later injected into the prompt. This simple agentic pattern gives our assistant extra situational context without any external services.
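The router is easy to extend with further prefixes. As an example, a hypothetical "calc:" tool could evaluate simple arithmetic locally in the same spirit as the search and docs stubs (the helpers below are an illustration, not part of the original tutorial):
import ast, operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def _safe_eval(node):
    # Recursively evaluate a restricted arithmetic expression tree.
    if isinstance(node, ast.Expression):
        return _safe_eval(node.body)
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_safe_eval(node.left), _safe_eval(node.right))
    raise ValueError("unsupported expression")

def calc_tool(user_msg: str) -> Optional[str]:
    # Hypothetical extra tool: handles messages like "calc: 2 + 3 * 4".
    if user_msg.strip().lower().startswith("calc:"):
        expr = user_msg.split(":", 1)[-1].strip()
        return f"Calculation result: {expr} = {_safe_eval(ast.parse(expr, mode='eval'))}"
    return None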
def generate_reply(history: ConversationHistory, user_msg: str) -> str:
    tool_context = local_tool_router(user_msg)
    if tool_context:
        user_msg = user_msg + "\n\nUseful context:\n" + tool_context
    prompt = build_chat_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            top_p=0.9,
            temperature=0.6,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Decode only the newly generated tokens so the prompt is not echoed back.
    new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
    reply = tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
    history.append(("user", user_msg))
    history.append(("assistant", reply))
    return reply
def save_history(history: ConversationHistory, path: str = "chat_history.json") -> None:
    data = [{"role": r, "content": c} for (r, c) in history]
    with open(path, "w") as f:
        json.dump(data, f, indent=2)

def load_history(path: str = "chat_history.json") -> ConversationHistory:
    if not os.path.exists(path):
        return [("system", BASE_SYSTEM_PROMPT)]
    with open(path, "r") as f:
        data = json.load(f)
    return [(item["role"], item["content"]) for item in data]
We define the main reply-generation function, which combines the history, any tool context, and the model’s sampling settings to produce a consistent answer and append the new turns to memory. We also add helpers to save and load past conversations, giving the chat persistence across sessions. This snippet is the operational core of our custom GPT.
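For example, persisting the current session and restoring it later is a two-call round trip (the file name is the default used above):
# Save the running conversation, then reload it as a fresh history object.
save_history(history)                              # writes chat_history.json
restored = load_history("chat_history.json")
print(f"Restored {len(restored)} turns; first role is '{restored[0][0]}'")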
print("n--- Demo turn 1 ---")
demo_reply_1 = generate_reply(history, "Explain what this custom GPT setup is doing in 5 bullet points.")
print(wrap_text(demo_reply_1))
print("n--- Demo turn 2 ---")
demo_reply_2 = generate_reply(history, "search: agentic ai with local models")
print(wrap_text(demo_reply_2))
def interactive_chat():
print("nChat ready. Type 'exit' to stop.")
while True:
try:
user_msg = input("nUser: ").strip()
except EOFError:
break
if user_msg.lower() in ("exit", "quit", "q"):
break
reply = generate_reply(history, user_msg)
print("nAssistant:n" + wrap_text(reply))
# interactive_chat()
print("nCustom GPT initialized successfully.")
We test the entire setup by running two demo prompts and printing the generated responses, and we provide an optional interactive chat loop for talking to the assistant directly. This confirms that our custom GPT runs locally and responds in real time.
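One optional refinement not covered above is streaming the reply token by token instead of waiting for the full generation; Transformers provides a TextStreamer that plugs into generate. A minimal sketch, reusing the objects defined earlier (the function name is ours):
from transformers import TextStreamer

def generate_reply_streaming(history: ConversationHistory, user_msg: str) -> None:
    # Same prompt construction as generate_reply, but tokens print as they arrive.
    prompt = build_chat_prompt(history, user_msg)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.no_grad():
        model.generate(
            **inputs,
            max_new_tokens=MAX_NEW_TOKENS,
            do_sample=True,
            top_p=0.9,
            temperature=0.6,
            pad_token_id=tokenizer.eos_token_id,
            streamer=streamer,
        )

# generate_reply_streaming(history, "Summarize what we built in two sentences.")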
In summary, we designed and implemented a custom conversational agent that reproduces GPT-style interaction without relying on any external services. We saw how to make a local model interactive through prompt orchestration, lightweight tool routing, and conversational memory management. This approach helps us understand the internal logic behind commercial GPT systems and lets us experiment with our own rules, behaviors, and integrations in a transparent, fully offline way.