A Coding Implementation to Build a Transformer-Based Regression Language Model That Predicts Continuous Values from Text
In this coding implementation, we build a Regression Language Model (RLM) that predicts continuous numerical values directly from text sequences. Rather than classifying or generating text, we focus on training a transformer-based architecture that learns the quantitative relationships hidden in natural-language descriptions. We first generate synthetic text-to-number data, then train a lightweight transformer encoder that maps language prompts to real-valued targets. By the end, we not only understand how to implement an RLM from scratch, but also observe its learning behavior and test how well it generalizes to unseen examples.
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from collections import Counter
import re
torch.manual_seed(42)
np.random.seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # used for training and inference below
print("🚀 Regression Language Model (RLM) Tutorial")
print("=" * 60)
We first import the core libraries, PyTorch, NumPy, and Matplotlib, that we need to build and visualize our regression language model. We set the random seeds (and select a CUDA device when one is available) so that the environment is reproducible and the tutorial produces consistent results on every run.
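If you need stricter reproducibility across CPU and GPU runs, you can extend the seeding above. The following is an optional sketch, not part of the original tutorial; the exact flags that matter depend on your PyTorch version and hardware.

# Optional (assumption: you may be running on CUDA): stricter reproducibility settings
import random

def seed_everything(seed: int = 42):
    random.seed(seed)                          # Python's built-in RNG
    np.random.seed(seed)                       # NumPy RNG
    torch.manual_seed(seed)                    # CPU RNG
    torch.cuda.manual_seed_all(seed)           # all GPU RNGs (no-op if CUDA is absent)
    torch.backends.cudnn.deterministic = True  # prefer deterministic cuDNN kernels
    torch.backends.cudnn.benchmark = False     # disable autotuning that can break determinism

seed_everything(42)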
def generate_synthetic_data(n_samples=2000):
"""Generate synthetic text-to-number regression data"""
templates = [
("The temperature is {} degrees", lambda x: x),
("I rate this {} out of ten", lambda x: x),
("The price is {} dollars", lambda x: x),
("Confidence level: {}", lambda x: x / 100),
("Speed of {} kilometers per hour", lambda x: x / 10),
("{} percent complete", lambda x: x / 100),
("Scored {} points in the game", lambda x: x / 10),
("The distance is {} meters", lambda x: x),
]
data = []
for _ in range(n_samples):
template, transform = templates[np.random.randint(len(templates))]
value = np.random.uniform(0, 100)
text = template.format(round(value, 1))
target = transform(value)
data.append((text, target))
return data
We create a synthetic dataset that pairs natural-language sentences with corresponding numerical targets. By using multiple templates, covering temperatures, ratings, prices, and percentages, we ensure that the model learns several different text-to-number relationships. This controlled setup lets us simulate a realistic regression task without relying on external data.
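As a quick sanity check (my addition, not in the original tutorial), you can print a handful of generated pairs to confirm that each template produces the expected sentence and scaled target:

# Inspect a few synthetic (text, target) pairs
for text, target in generate_synthetic_data(5):
    print(f"{text!r:45s} -> target = {target:.3f}")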
class SimpleTokenizer:
def __init__(self):
self.word2idx = {"<PAD>": 0, "<UNK>": 1}
self.idx2word = {0: "<PAD>", 1: "<UNK>"}
self.vocab_size = 2
def fit(self, texts):
"""Build vocabulary from texts"""
words = []
for text in texts:
words.extend(re.findall(r'\w+|[^\w\s]', text.lower()))
word_counts = Counter(words)
for word, _ in word_counts.most_common():
if word not in self.word2idx:
self.word2idx[word] = self.vocab_size
self.idx2word[self.vocab_size] = word
self.vocab_size += 1
def encode(self, text, max_len=20):
"""Convert text to token indices"""
words = re.findall(r'\w+|[^\w\s]', text.lower())
indices = [self.word2idx.get(w, 1) for w in words]
if len(indices) < max_len:
    indices = indices + [0] * (max_len - len(indices))  # pad with <PAD> (index 0)
else:
    indices = indices[:max_len]  # truncate to max_len
return indices
We design a simple tokenizer that converts raw text into numeric tokens the model can process. It builds a vocabulary from all unique words, maps each word to an index, and automatically handles unknown words and padding. This step guarantees that every text input becomes a consistent, machine-readable sequence for training.
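To see what the tokenizer produces, here is a small usage sketch (assuming the class above) that fits the vocabulary on two sentences and encodes a third, with unknown words mapped to index 1 and short sequences padded with 0:

# Usage sketch: fit on a tiny corpus, then encode an unseen sentence
demo_tokenizer = SimpleTokenizer()
demo_tokenizer.fit(["The temperature is 25.5 degrees", "75.0 percent complete"])
print("Vocab size:", demo_tokenizer.vocab_size)
print("Encoded:", demo_tokenizer.encode("The temperature is 99.9 degrees", max_len=10))
# Number tokens never seen during fit (e.g. '99') map to <UNK> (1); the rest is padded with 0s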
class RLMDataset(Dataset):
def __init__(self, data, tokenizer, max_len=20):
self.data = data
self.tokenizer = tokenizer
self.max_len = max_len
def __len__(self):
return len(self.data)
def __getitem__(self, idx):
text, target = self.data[idx]
tokens = self.tokenizer.encode(text, self.max_len)
return torch.tensor(tokens), torch.tensor([target], dtype=torch.float32)
class RegressionLanguageModel(nn.Module):
def __init__(self, vocab_size, embed_dim=128, num_heads=4, num_layers=2,
dropout=0.1, max_len=20):
super().__init__()
self.token_embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
self.position_embedding = nn.Embedding(max_len, embed_dim)
encoder_layer = nn.TransformerEncoderLayer(
d_model=embed_dim,
nhead=num_heads,
dim_feedforward=embed_dim * 4,
dropout=dropout,
batch_first=True
)
self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
self.fc1 = nn.Linear(embed_dim, 64)
self.relu = nn.ReLU()
self.dropout = nn.Dropout(dropout)
self.fc2 = nn.Linear(64, 1)
self.max_len = max_len
def forward(self, x):
batch_size, seq_len = x.shape
positions = torch.arange(0, seq_len, device=x.device).unsqueeze(0).expand(batch_size, -1)
token_embed = self.token_embedding(x)
pos_embed = self.position_embedding(positions)
embeddings = token_embed + pos_embed
padding_mask = (x == 0)
encoded = self.transformer(embeddings, src_key_padding_mask=padding_mask)
mask_expanded = (~padding_mask).unsqueeze(-1).float()
summed = (encoded * mask_expanded).sum(dim=1)
pooled = summed / mask_expanded.sum(dim=1)
x = self.fc1(pooled)
x = self.relu(x)
x = self.dropout(x)
output = self.fc2(x)
return output
We wrap the text–target pairs in a PyTorch Dataset so each sentence can be batched, and then build the transformer-based RLM: token and position embeddings flow through a multi-layer encoder, we mean-pool over the non-padding tokens, and feed the result to a small MLP head for regression. In effect, the encoder learns numerical cues from the language while the head maps them to a single continuous value.
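Before training, it helps to confirm the tensor shapes end to end. The short check below is my addition: it builds a throwaway tokenizer and dataset, then pushes one dummy batch through an untrained model.

# Shape check: one dummy batch through the dataset and an untrained model
demo_data = generate_synthetic_data(8)
demo_tok = SimpleTokenizer()
demo_tok.fit([t for t, _ in demo_data])
demo_loader = DataLoader(RLMDataset(demo_data, demo_tok, max_len=20), batch_size=4)
tokens_batch, targets_batch = next(iter(demo_loader))
print(tokens_batch.shape, targets_batch.shape)   # torch.Size([4, 20]) torch.Size([4, 1])
demo_model = RegressionLanguageModel(vocab_size=demo_tok.vocab_size)
with torch.no_grad():
    print(demo_model(tokens_batch).shape)        # torch.Size([4, 1])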
def train_rlm(model, train_loader, val_loader, epochs=15, lr=0.001):
criterion = nn.MSELoss()
optimizer = optim.Adam(model.parameters(), lr=lr)
train_losses, val_losses = [], []
print(f"\n📊 Training on {device}")
print("-" * 60)
for epoch in range(epochs):
model.train()
train_loss = 0
for tokens, targets in train_loader:
tokens, targets = tokens.to(device), targets.to(device)
optimizer.zero_grad()
outputs = model(tokens)
loss = criterion(outputs, targets)
loss.backward()
optimizer.step()
train_loss += loss.item()
train_loss /= len(train_loader)
train_losses.append(train_loss)
model.eval()
val_loss = 0
with torch.no_grad():
for tokens, targets in val_loader:
tokens, targets = tokens.to(device), targets.to(device)
outputs = model(tokens)
loss = criterion(outputs, targets)
val_loss += loss.item()
val_loss /= len(val_loader)
val_losses.append(val_loss)
print(f"Epoch {epoch+1:2d}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
return train_losses, val_losses
We train the model with Adam and an MSE loss, on the GPU when one is available, iterating over mini-batches to backpropagate and update the weights. At the end of each epoch we switch to evaluation mode for validation, track both training and validation losses, and print the progress so we can watch the learning dynamics in real time.
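MSE is the training objective, but a mean absolute error is often easier to interpret. This optional helper is my addition and assumes the same `model`, `val_loader`, and `device` used elsewhere in the tutorial:

# Optional helper: mean absolute error (MAE) over a data loader
def evaluate_mae(model, loader, device):
    model.eval()
    abs_err, n = 0.0, 0
    with torch.no_grad():
        for tokens, targets in loader:
            tokens, targets = tokens.to(device), targets.to(device)
            preds = model(tokens)
            abs_err += (preds - targets).abs().sum().item()
            n += targets.numel()
    return abs_err / n

# After training: print(f"Val MAE: {evaluate_mae(model, val_loader, device):.4f}")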
print("\n📝 Generating synthetic data...")
data = generate_synthetic_data(2000)
split_idx = int(0.8 * len(data))
train_data, val_data = data[:split_idx], data[split_idx:]
print(f"Train samples: {len(train_data)}, Val samples: {len(val_data)}")
print("\n🔤 Building tokenizer...")
tokenizer = SimpleTokenizer()
tokenizer.fit([text for text, _ in train_data])
print(f"Vocabulary size: {tokenizer.vocab_size}")
train_dataset = RLMDataset(train_data, tokenizer)
val_dataset = RLMDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
print("\n🏗️ Building Regression Language Model...")
model = RegressionLanguageModel(vocab_size=tokenizer.vocab_size).to(device)
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")
train_losses, val_losses = train_rlm(model, train_loader, val_loader)
plt.figure(figsize=(10, 4))
plt.plot(train_losses, label="Train Loss", linewidth=2)
plt.plot(val_losses, label="Val Loss", linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Loss (MSE)')
plt.title('RLM Training Progress')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print("\n🎯 Testing Predictions:")
print("-" * 60)
test_examples = [
"The temperature is 25.5 degrees",
"I rate this 8.0 out of ten",
"The price is 45.0 dollars",
"75.0 percent complete"
]
model.eval()
with torch.no_grad():
for text in test_examples:
tokens = torch.tensor([tokenizer.encode(text)]).to(device)
prediction = model(tokens).item()
print(f"Input: {text}")
print(f"Predicted value: {prediction:.4f}\n")
print("✅ RLM Tutorial Complete!")
We generate and split the synthetic data, fit our tokenizer, wrap everything in PyTorch datasets and loaders, and build the transformer-based RLM. We train the model, visualize the loss curves to confirm that it is learning, and then run a few natural-language test prompts to see the predicted continuous values. With that, we complete the end-to-end RLM pipeline.
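If you want to reuse the trained model later, one minimal approach (a sketch of my own; the checkpoint filename is arbitrary) is to save the model weights together with the tokenizer vocabulary and rebuild both at load time:

# Sketch: persist the trained weights and the tokenizer vocabulary
checkpoint = {
    "model_state": model.state_dict(),
    "word2idx": tokenizer.word2idx,
    "max_len": 20,
}
torch.save(checkpoint, "rlm_checkpoint.pt")

# Later: rebuild the tokenizer and model from the checkpoint
ckpt = torch.load("rlm_checkpoint.pt", map_location="cpu")
restored_tokenizer = SimpleTokenizer()
restored_tokenizer.word2idx = ckpt["word2idx"]
restored_tokenizer.idx2word = {i: w for w, i in ckpt["word2idx"].items()}
restored_tokenizer.vocab_size = len(ckpt["word2idx"])
restored_model = RegressionLanguageModel(vocab_size=restored_tokenizer.vocab_size)
restored_model.load_state_dict(ckpt["model_state"])
restored_model.eval()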
In summary, we successfully design, train, and evaluate a Regression Language Model that predicts continuous values from text input. By combining positional embeddings, a transformer encoder, and a simple regression head, we see how the model captures the numerical semantics embedded in language. Generating synthetic data, visualizing the training progress, and testing predictions together demonstrate how RLMs bridge the gap between language understanding and numerical reasoning.