An End-to-End Implementation of Transformer Model Optimization with Hugging Face Optimum, ONNX Runtime, and Quantization
In this tutorial, we walk through how to use Hugging Face Optimum to optimize Transformer models and make them faster while keeping them accurate. We first set up DistilBERT on the SST-2 dataset and then compare different execution engines: plain PyTorch (eager mode), torch.compile, ONNX Runtime, and quantized ONNX. Working through these steps, we gain practical experience with model export, optimization, quantization, and benchmarking in a Google Colab environment.
!pip -q install "transformers>=4.49" "optimum[onnxruntime]>=1.20.0" "datasets>=2.20" "evaluate>=0.4" accelerate
from pathlib import Path
import os, time, numpy as np, torch
from datasets import load_dataset
import evaluate
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import QuantizationConfig
os.environ.setdefault("OMP_NUM_THREADS", "1")
os.environ.setdefault("MKL_NUM_THREADS", "1")
MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
ORT_DIR = Path("onnx-distilbert")
Q_DIR = Path("onnx-distilbert-quant")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
BATCH = 16
MAXLEN = 128
N_WARM = 3
N_ITERS = 8
print(f"Device: {DEVICE} | torch={torch.__version__}")
We begin by installing the required libraries and setting up the environment for Hugging Face Optimum with ONNX Runtime. We configure the paths, batch size, and iteration settings, and confirm whether we are running on CPU or GPU.
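One detail worth checking at this stage: torch may report a GPU while the installed ONNX Runtime build only supports CPU (the optimum[onnxruntime] extra may pull in the CPU-only package, with the GPU build coming from a separate extra). The sketch below shows one way to pick an execution provider defensively. The helper names pick_provider and available_providers are hypothetical and not part of the original notebook; onnxruntime.get_available_providers() is the real call they wrap.

```python
def pick_provider(available, want_gpu):
    """Return the execution provider to request, falling back to CPU.

    `available` is a list of provider names, such as the list returned by
    onnxruntime.get_available_providers().
    """
    if want_gpu and "CUDAExecutionProvider" in available:
        return "CUDAExecutionProvider"
    return "CPUExecutionProvider"


def available_providers():
    """Ask ONNX Runtime which providers this install supports.

    Falls back to a CPU-only list when onnxruntime is not installed,
    so the sketch still runs in a bare environment.
    """
    try:
        import onnxruntime as ort  # optional dependency for this sketch
    except ImportError:
        return ["CPUExecutionProvider"]
    return ort.get_available_providers()


# Hypothetical usage: request CPU explicitly, regardless of what is installed.
chosen = pick_provider(available_providers(), want_gpu=False)
print(chosen)
```

In the notebook, want_gpu would be DEVICE == "cuda", and the returned name would be passed as the provider argument when loading the ONNX model.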
ds = load_dataset("glue", "sst2", split="validation[:20%]")
texts, labels = list(ds["sentence"]), list(ds["label"])  # plain lists so slices tokenize cleanly
metric = evaluate.load("accuracy")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
def make_batches(texts, max_len=MAXLEN, batch=BATCH):
    # Yield tokenized batches, padded to the longest sequence in each batch.
    for i in range(0, len(texts), batch):
        yield tokenizer(texts[i:i+batch], padding=True, truncation=True,
                        max_length=max_len, return_tensors="pt")

def run_eval(predict_fn, texts, labels):
    # Accuracy of any predictor that maps a tokenized batch to a list of labels.
    preds = []
    for toks in make_batches(texts):
        preds.extend(predict_fn(toks))
    return metric.compute(predictions=preds, references=labels)["accuracy"]

def bench(predict_fn, texts, n_warm=N_WARM, n_iters=N_ITERS):
    # Warm up on two batches, then time full passes over the data (in ms).
    for _ in range(n_warm):
        for toks in make_batches(texts[:BATCH*2]):
            predict_fn(toks)
    times = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        for toks in make_batches(texts):
            predict_fn(toks)
        times.append((time.perf_counter() - t0) * 1000)
    return float(np.mean(times)), float(np.std(times))
We load a slice of the SST-2 validation set and prepare tokenization, the accuracy metric, and batching. We define run_eval to compute accuracy for any predictor and bench to warm up and then time end-to-end inference. With these helpers, we compare the different engines fairly, using identical data and batching.
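The warm-up-then-measure pattern does not depend on any particular model. As a rough, dependency-free sketch of the same idea, the snippet below times a toy predictor using only the standard library. The names bench_callable and toy_predict are hypothetical stand-ins, and unlike bench above, this version warms up on all batches rather than only the first two.

```python
import statistics
import time


def bench_callable(fn, batches, n_warm=3, n_iters=8):
    """Time full passes of `fn` over `batches`; return (mean_ms, std_ms)."""
    for _ in range(n_warm):  # warm-up passes are executed but not timed
        for b in batches:
            fn(b)
    times_ms = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        for b in batches:
            fn(b)
        times_ms.append((time.perf_counter() - t0) * 1000.0)
    # pstdev is the population standard deviation, like np.std's default.
    return statistics.mean(times_ms), statistics.pstdev(times_ms)


def toy_predict(batch):
    """Stand-in predictor: label 1 if the text has an even length, else 0."""
    return [int(len(s) % 2 == 0) for s in batch]


batches = [["good film", "bad"], ["fine", "awful plot"]]
mean_ms, std_ms = bench_callable(toy_predict, batches)
print(f"{mean_ms:.4f} ± {std_ms:.4f} ms")
```

Reporting the spread alongside the mean matters on shared hardware such as Colab, where a single noisy iteration can otherwise dominate the comparison.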
torch_model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID).to(DEVICE).eval()

@torch.no_grad()
def pt_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = torch_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

pt_ms, pt_sd = bench(pt_predict, texts)
pt_acc = run_eval(pt_predict, texts, labels)
print(f"[PyTorch eager] {pt_ms:.1f}±{pt_sd:.1f} ms | acc={pt_acc:.4f}")

compiled_model = torch_model
compile_ok = False
try:
    compiled_model = torch.compile(torch_model, mode="reduce-overhead", fullgraph=False)
    compile_ok = True
except Exception as e:
    print("torch.compile unavailable or failed -> skipping:", repr(e))

@torch.no_grad()
def ptc_predict(toks):
    toks = {k: v.to(DEVICE) for k, v in toks.items()}
    logits = compiled_model(**toks).logits
    return logits.argmax(-1).detach().cpu().tolist()

if compile_ok:
    ptc_ms, ptc_sd = bench(ptc_predict, texts)
    ptc_acc = run_eval(ptc_predict, texts, labels)
    print(f"[torch.compile] {ptc_ms:.1f}±{ptc_sd:.1f} ms | acc={ptc_acc:.4f}")
We load the baseline PyTorch classifier, define the pt_predict helper, and benchmark and evaluate it on SST-2. We then try torch.compile for just-in-time graph optimization and, if it succeeds, run the same benchmark to compare speed and accuracy under identical settings.
provider = "CUDAExecutionProvider" if DEVICE == "cuda" else "CPUExecutionProvider"
# Export the checkpoint to ONNX, then save it so ORT_DIR actually contains model.onnx.
ort_model = ORTModelForSequenceClassification.from_pretrained(
    MODEL_ID, export=True, provider=provider
)
ort_model.save_pretrained(ORT_DIR)

@torch.no_grad()
def ort_predict(toks):
    logits = ort_model(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

ort_ms, ort_sd = bench(ort_predict, texts)
ort_acc = run_eval(ort_predict, texts, labels)
print(f"[ONNX Runtime] {ort_ms:.1f}±{ort_sd:.1f} ms | acc={ort_acc:.4f}")

from optimum.onnxruntime.configuration import AutoQuantizationConfig

Q_DIR.mkdir(parents=True, exist_ok=True)
quantizer = ORTQuantizer.from_pretrained(ORT_DIR)
# Dynamic quantization (is_static=False) needs no calibration data.
# The avx2 preset is one CPU-oriented choice; other presets exist for other targets.
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer.quantize(save_dir=Q_DIR, quantization_config=qconfig)
# Dynamic INT8 models mainly target CPU; expect limited benefit on a GPU provider.
ort_quant = ORTModelForSequenceClassification.from_pretrained(
    Q_DIR, file_name="model_quantized.onnx", provider=provider
)

@torch.no_grad()
def ortq_predict(toks):
    logits = ort_quant(**{k: v.cpu() for k, v in toks.items()}).logits
    return logits.argmax(-1).cpu().tolist()

oq_ms, oq_sd = bench(ortq_predict, texts)
oq_acc = run_eval(ortq_predict, texts, labels)
print(f"[ORT Quantized] {oq_ms:.1f}±{oq_sd:.1f} ms | acc={oq_acc:.4f}")
We export the model to ONNX, run it with ONNX Runtime, and then apply dynamic quantization using Optimum's ORTQuantizer. We benchmark both variants to see how latency improves while accuracy remains comparable.
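To build intuition for why quantization can leave accuracy nearly unchanged, it helps to look at the arithmetic on a handful of numbers. The snippet below is a simplified, pure-Python illustration of 8-bit affine quantization (a scale plus a zero point). It is not how ONNX Runtime implements its kernels, and the function names quantize_uint8 and dequantize, as well as the example weights, are made up for this sketch. In dynamic quantization, weights are quantized ahead of time while activation ranges are computed at inference time.

```python
def quantize_uint8(values):
    """Affine-quantize floats to the range 0..255.

    Returns (quantized_ints, scale, zero_point).
    """
    # The representable range must include 0.0 so that zero maps exactly.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for all-zero input
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(x - zero_point) * scale for x in q]


# Illustrative weights only; not taken from the DistilBERT model.
weights = [-1.2, -0.3, 0.0, 0.45, 2.1]
q, scale, zp = quantize_uint8(weights)
restored = dequantize(q, scale, zp)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, zp, max_err)
```

When no value is clipped, the round-trip error of each element is at most half of scale, which is why a tensor with a narrow value range loses little information at 8 bits.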
pt_pipe = pipeline("sentiment-analysis", model=torch_model, tokenizer=tokenizer,
                   device=0 if DEVICE=="cuda" else -1)
ort_pipe = pipeline("sentiment-analysis", model=ort_model, tokenizer=tokenizer, device=-1)
samples = [
"What a fantastic movie—performed brilliantly!",
"This was a complete waste of time.",
"I’m not sure how I feel about this one."
]
print("\nSample predictions (PT | ORT):")
for s in samples:
    a = pt_pipe(s)[0]["label"]
    b = ort_pipe(s)[0]["label"]
    print(f"- {s}\n  PT={a} | ORT={b}")
import pandas as pd
rows = [["PyTorch eager", pt_ms, pt_sd, pt_acc],
        ["ONNX Runtime", ort_ms, ort_sd, ort_acc],
        ["ORT Quantized", oq_ms, oq_sd, oq_acc]]
if compile_ok:
    rows.insert(1, ["torch.compile", ptc_ms, ptc_sd, ptc_acc])
df = pd.DataFrame(rows, columns=["Engine", "Mean ms (↓)", "Std ms", "Accuracy"])
display(df)
print("""
Notes:
- BetterTransformer is deprecated on transformers>=4.49, hence omitted.
- For larger gains on GPU, also try FlashAttention2 models or FP8 with TensorRT-LLM.
- For CPU, tune threads: set OMP_NUM_THREADS/MKL_NUM_THREADS; try NUMA pinning.
- For static (calibrated) quantization, use a static config (e.g., an AutoQuantizationConfig preset with is_static=True) together with a calibration set.
""")
We run sanity-check predictions with quick sentiment-analysis pipelines and print the PyTorch and ONNX labels side by side. We then assemble a summary table comparing latency and accuracy across engines, inserting the torch.compile row when it is available. We close with practical notes that help us extend the workflow to other backends and quantization modes.
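Raw milliseconds are easier to interpret as speedups relative to the eager baseline, and side-by-side labels are easier to judge as an agreement rate. The sketch below shows both calculations with plain Python. The helpers speedups and agreement are hypothetical, and the numbers are placeholders chosen for easy arithmetic, not measured results.

```python
def speedups(rows, baseline="PyTorch eager"):
    """Map engine name -> baseline_ms / engine_ms (values above 1 mean faster)."""
    base_ms = next(ms for name, ms, _acc in rows if name == baseline)
    return {name: base_ms / ms for name, ms, _acc in rows}


def agreement(labels_a, labels_b):
    """Fraction of positions where two engines predict the same label."""
    same = sum(a == b for a, b in zip(labels_a, labels_b))
    return same / len(labels_a)


# Placeholder values for illustration only; not measured results.
rows = [
    ("PyTorch eager", 100.0, 0.90),
    ("ONNX Runtime", 50.0, 0.90),
    ("ORT Quantized", 25.0, 0.89),
]
ratios = speedups(rows)
match = agreement(
    ["POSITIVE", "NEGATIVE", "NEGATIVE"],
    ["POSITIVE", "NEGATIVE", "POSITIVE"],
)
print(ratios, match)
```

With the real table, the same rows can be built from df[["Engine", "Mean ms (↓)", "Accuracy"]], and the label lists can come from running both pipelines over the same sentences.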
In conclusion, we see how Optimum helps us bridge the gap between standard PyTorch models and optimized, production-ready deployments. We gain speedups through ONNX Runtime and quantization while maintaining accuracy, and we also explore how torch.compile delivers gains directly within PyTorch. This workflow demonstrates a practical approach to balancing performance and efficiency for Transformer models, providing a foundation that can be extended with advanced backends such as OpenVINO or TensorRT.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform has over 2 million monthly views, illustrating its popularity among its audience.