How to build supervised AI models when there is no annotated data
One of the biggest challenges in real-world machine learning is that supervised models require labeled data, yet the data you start with is almost always unlabeled. Manually annotating thousands of samples is not only slow; it is also expensive, tedious, and often impractical.
This is where active learning becomes a game changer.
Active learning is a branch of machine learning in which the algorithm is not a passive consumer of data but an active participant. Rather than pre-labeling the entire dataset, the model intelligently selects which data points to label next: it interactively queries a human (or another oracle) for labels on the most informative examples, allowing it to learn faster with fewer annotations.
The workflow typically looks like this:
- Label a small seed portion of the dataset and train an initial weak model on it.
- Use this model to generate predictions and confidence scores on the unlabeled data.
- Compute an uncertainty measure for each prediction (e.g., least confidence or the probability gap between the top two classes; see the sketch after this list).
- Select only the samples with the lowest confidence, i.e., those the model is most uncertain about.
- Label these uncertain samples manually and add them to the training set.
- Retrain the model and repeat the cycle: Predict → Rank by confidence → Label → Retrain.
- After several iterations, the model can approach fully supervised performance while requiring far fewer manually labeled samples.
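The loop we build below scores uncertainty with least confidence (1 minus the highest predicted probability). The probability gap mentioned above corresponds to margin sampling, and entropy is another common choice. The following is a minimal sketch of all three, assuming a proba array shaped like the output of scikit-learn's predict_proba; the helper names are ours, not part of the tutorial code.
import numpy as np
def least_confidence(proba):
    # Uncertainty = 1 - highest predicted class probability (the score used in this tutorial)
    return 1.0 - proba.max(axis=1)
def margin_uncertainty(proba):
    # Uncertainty from the probability gap between the top two classes
    sorted_p = np.sort(proba, axis=1)
    return 1.0 - (sorted_p[:, -1] - sorted_p[:, -2])
def entropy_uncertainty(proba):
    # Uncertainty = entropy of the predicted class distribution
    return -(proba * np.log(proba + 1e-12)).sum(axis=1)
# Example: proba would normally come from model.predict_proba(X_unlabeled)
proba = np.array([[0.9, 0.1], [0.55, 0.45]])
print(least_confidence(proba))  # the second sample is the more uncertain one
For binary problems all three scores rank samples the same way; they diverge more in multi-class settings.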
 
In this article, we walk through how to apply this strategy and show how active learning can help you build high-quality supervised models with minimal labeling effort.
Install and import libraries
pip install numpy pandas scikit-learn matplotlib
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
In this tutorial, we generate a synthetic dataset with make_classification from scikit-learn.
SEED = 42 # For reproducibility
N_SAMPLES = 1000 # Total number of data points
INITIAL_LABELED_PERCENTAGE = 0.10 # Constraint: start with only 10% labeled data
NUM_QUERIES = 20 # Number of times we ask the "human" to label a confusing sample
NUM_QUERIES = 20 represents the annotation budget in an active learning setting. In a real-world workflow, this means the model selects the 20 most puzzling samples and sends them to human annotators for labeling—each annotation costs time and money. In our simulations, we replicate this process automatically: during each iteration, the model selects an uncertain sample, the code immediately retrieves its true label (acting as a human oracle), and retrains the model using this new information.
Therefore, setting NUM_QUERIES = 20 means that we are simulating the benefits of labeling only 20 strategically chosen samples and observing how much the model improves with limited but valuable human effort.
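To make the distinction concrete, here is a minimal sketch of the two kinds of oracle. Both helper names (label_from_simulation, label_from_human) are hypothetical and not part of the tutorial's code: the simulated oracle simply looks up the ground-truth label we hold back, which is exactly what the loop below does, while the interactive variant shows where a real annotator would plug in.
def label_from_simulation(index, y_hidden):
    # Simulated oracle: look up the ground-truth label we kept aside
    return y_hidden[index]
def label_from_human(index, X_data):
    # Hypothetical interactive oracle: ask a person to label one sample
    print(f"Please label sample {index}: {X_data[index]}")
    return int(input("Enter class (0 or 1): "))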
Data generation and splitting strategy for active learning
This block handles the data generation and initial splits that power the entire active learning experiment. It first uses make_classification to create 1,000 synthetic samples for a binary classification problem. The dataset is then divided into a 10% holdout test set for final evaluation and a 90% pool for training. Of this pool, only 10% is kept as a small initial labeled set (matching our constraint of starting with very limited annotations), while the remaining 90% becomes the unlabeled pool. This setup creates the realistic low-label scenario active learning is designed for, with a large amount of unlabeled data available for strategic querying.
X, y = make_classification(
    n_samples=N_SAMPLES, n_features=10, n_informative=5, n_redundant=0,
    n_classes=2, n_clusters_per_class=1, flip_y=0.1, random_state=SEED
)
# 1. Split into 90% Pool (samples to be queried) and 10% Test (final evaluation)
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, test_size=0.10, random_state=SEED, stratify=y
)
# 2. Split the 90% Pool into Initial Labeled (10% of the pool) and Unlabeled (90% of the pool)
X_labeled_current, X_unlabeled_full, y_labeled_current, y_unlabeled_full = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)
# A set to track indices in the unlabeled pool for efficient querying and removal
unlabeled_indices_set = set(range(X_unlabeled_full.shape[0]))
print(f"Initial Labeled Samples (STARTING N): {len(y_labeled_current)}")
print(f"Unlabeled Pool Samples: {len(unlabeled_indices_set)}")
Initial training and baseline assessment
This block trains an initial logistic regression model using only a small labeled seed set and evaluates its accuracy on a holdout test set. The labeled sample count and baseline accuracy are then stored as the first point in the performance history, establishing a starting baseline before active learning begins.
labeled_size_history = []
accuracy_history = []
# Train the baseline model on the small initial labeled set
baseline_model = LogisticRegression(random_state=SEED, max_iter=2000)
baseline_model.fit(X_labeled_current, y_labeled_current)
# Evaluate performance on the held-out test set
y_pred_init = baseline_model.predict(X_test)
accuracy_init = accuracy_score(y_test, y_pred_init)
# Record the baseline point (labeled-set size, baseline accuracy)
labeled_size_history.append(len(y_labeled_current))
accuracy_history.append(accuracy_init)
print(f"INITIAL BASELINE (N={labeled_size_history[0]}): Test Accuracy: {accuracy_history[0]:.4f}")
Active learning loop
This block contains the core of the active learning process, where the model iteratively selects the most uncertain sample, receives its true label, retrains, and evaluates performance. At each iteration, the current model predicts probabilities for all unlabeled samples, identifies the sample with the highest uncertainty (lowest confidence), and "queries" its true label, simulating a human annotator. The newly labeled sample is added to the training set, the model is retrained, and the accuracy is recorded. Repeating this loop for 20 queries demonstrates how targeted labeling can quickly improve model performance with minimal annotation effort.
current_model = baseline_model # Start the loop with the baseline model
print(f"nStarting Active Learning Loop ({NUM_QUERIES} Queries)...")
# -----------------------------------------------
# The Active Learning Loop (Query, Annotate, Retrain, Evaluate)
# Purpose: Run 20 iterations to demonstrate strategic labeling gains.
# -----------------------------------------------
for i in range(NUM_QUERIES):
    if not unlabeled_indices_set:
        print("Unlabeled pool is empty. Stopping.")
        break
    
    # --- A. QUERY STRATEGY: Find the Least Confident Sample ---
    # 1. Get probability predictions from the CURRENT model for all unlabeled samples
    probabilities = current_model.predict_proba(X_unlabeled_full)
    max_probabilities = np.max(probabilities, axis=1)
    # 2. Calculate Uncertainty Score (1 - Max Confidence)
    uncertainty_scores = 1 - max_probabilities
    # 3. Identify the index of the sample with the MAXIMUM uncertainty score
    current_indices_list = list(unlabeled_indices_set)
    current_uncertainty = uncertainty_scores[current_indices_list]
    most_uncertain_idx_in_subset = np.argmax(current_uncertainty)
    query_index_full = current_indices_list[most_uncertain_idx_in_subset]
    query_uncertainty_score = uncertainty_scores[query_index_full]
    # --- B. HUMAN ANNOTATION SIMULATION ---
    # This is the single critical step where the human annotator intervenes.
    # We look up the true label (y_unlabeled_full) for the sample the model asked for.
    X_query = X_unlabeled_full[query_index_full].reshape(1, -1)
    y_query = np.array([y_unlabeled_full[query_index_full]])
    
    # Update the Labeled Set: Add the new annotated sample (N becomes N+1)
    X_labeled_current = np.vstack([X_labeled_current, X_query])
    y_labeled_current = np.hstack([y_labeled_current, y_query])
    # Remove the sample from the unlabeled pool
    unlabeled_indices_set.remove(query_index_full)
    
    # --- C. RETRAIN and EVALUATE ---
    # Train the NEW model on the larger, improved labeled set
    current_model = LogisticRegression(random_state=SEED, max_iter=2000)
    current_model.fit(X_labeled_current, y_labeled_current)
    # Evaluate the new model on the held-out test set
    y_pred = current_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Record results for plotting
    labeled_size_history.append(len(y_labeled_current))
    accuracy_history.append(accuracy)
    # Output status
    print(f"nQUERY {i+1}: Labeled Samples: {len(y_labeled_current)}")
    print(f"  > Test Accuracy: {accuracy:.4f}")
    print(f"  > Uncertainty Score: {query_uncertainty_score:.4f}")
final_accuracy = accuracy_history[-1]
Final result

The experiment confirms the efficiency of active learning. By focusing the annotation effort on 20 strategically chosen samples (growing the labeled set from 90 to 110), the model's performance on the unseen test set improved from 0.8800 (88%) to 0.9100 (91%).
Accuracy increased by 3 percentage points while the labeled training set grew by only about 22% (20 extra labels on a base of 90), a measurable and meaningful improvement for minimal annotation effort.
In essence, the active learner acts as an intelligent manager, ensuring that every dollar or minute spent on manual labeling delivers the greatest possible benefit and showing that smart labeling is more valuable than random or batch labeling; you can check this claim against the random-sampling baseline sketched below.
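The sketch below runs a random-sampling baseline for comparison: same initial split, same budget of NUM_QUERIES labels, but the samples are picked at random instead of by uncertainty. It reuses the variables defined earlier (X_pool, y_pool, X_test, y_test, SEED, INITIAL_LABELED_PERCENTAGE, NUM_QUERIES); on any given run the random curve typically lags the least-confidence curve, though the exact gap will vary.
rng = np.random.default_rng(SEED)
# Recreate the same initial labeled/unlabeled split used above
X_lab, X_unlab, y_lab, y_unlab = train_test_split(
    X_pool, y_pool, test_size=1.0 - INITIAL_LABELED_PERCENTAGE,
    random_state=SEED, stratify=y_pool
)
random_accuracy_history = []
unlabeled_idx = list(range(X_unlab.shape[0]))
for i in range(NUM_QUERIES):
    # Pick a random sample instead of the most uncertain one
    pick = int(rng.integers(len(unlabeled_idx)))
    idx = unlabeled_idx.pop(pick)
    X_lab = np.vstack([X_lab, X_unlab[idx].reshape(1, -1)])
    y_lab = np.hstack([y_lab, [y_unlab[idx]]])
    model = LogisticRegression(random_state=SEED, max_iter=2000)
    model.fit(X_lab, y_lab)
    random_accuracy_history.append(accuracy_score(y_test, model.predict(X_test)))
print(f"Random-sampling baseline after {NUM_QUERIES} queries: {random_accuracy_history[-1]:.4f}")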
Plot the results
plt.figure(figsize=(10, 6))
plt.plot(labeled_size_history, accuracy_history, marker="o", linestyle="-", color="#00796b", label="Active Learning (Least Confidence)")
plt.axhline(y=final_accuracy, color="red", linestyle="--", alpha=0.5, label="Final Accuracy")
plt.title('Active Learning: Accuracy vs. Number of Labeled Samples')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Set Accuracy')
plt.grid(True, linestyle="--", alpha=0.7)
plt.legend()
plt.tight_layout()
plt.show()


