Meet oLLM: A Lightweight Python Library that Brings 100K-Context LLM Inference to 8 GB Consumer GPUs via SSD Offload, No Quantization Required

oLLM is a lightweight Python library built on Hugging Face Transformers and PyTorch that runs large-context Transformers on NVIDIA GPUs by aggressively offloading weights and the KV cache to fast local SSDs. The project targets offline, single-GPU workloads, uses FlashAttention-2 with FP16/BF16 weights and a disk-backed KV cache, and explicitly avoids quantization, keeping VRAM within roughly 8–10 GB while processing ~100K context tokens.

What's new in the latest release?

(1) KV cache reads/writes bypass mmap to reduce host-RAM usage; (2) DiskCache support for Qwen3-Next-80B; (3) Llama-3 models use FlashAttention-2 for stability; (4) GPT-OSS memory is reduced via a “flash-attention-like” kernel and chunked MLP. The maintainer publishes end-to-end memory/I/O footprints on an RTX 3060 Ti (8 GB):

  • Qwen3-Next-80B (BF16, 160 GB of weights, 50K ctx) → ~7.5 GB VRAM + ~180 GB SSD; reported throughput of ≈1 tok/2 s.
  • GPT-OSS-20B (packed BF16, 10K ctx) → ~7.3 GB VRAM + 15 GB SSD.
  • Llama-3.1-8B (FP16, 100K ctx) → ~6.6 GB VRAM + 69 GB SSD.

How it works

oLLM streams layer weights directly from the SSD to the GPU, offloads the attention KV cache to the SSD, and optionally offloads layers to the CPU. It uses FlashAttention-2 with online softmax, so the full attention matrix is never materialized, and it chunks MLP projections to bound peak memory. This shifts the bottleneck from VRAM to storage bandwidth and latency, which is why the project emphasizes NVMe-class SSDs and KvikIO/cuFile (GPUDirect Storage) for high-throughput file I/O.
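To make the memory-bounding idea concrete, here is a minimal, illustrative PyTorch sketch of chunking an MLP projection along the sequence dimension so that only one chunk's intermediate activations are live at a time. This is not oLLM's code; the function name, chunk size, and plain GELU MLP (real models such as Llama use gated variants) are simplifying assumptions that only show the general technique.

```python
import torch
import torch.nn.functional as F

def chunked_mlp(x, w_up, w_down, chunk_tokens=2048):
    """Apply an up-projection, GELU, and down-projection over a long sequence
    in fixed-size token chunks so peak activation memory stays bounded."""
    outputs = []
    for start in range(0, x.shape[0], chunk_tokens):
        chunk = x[start:start + chunk_tokens]     # (<=chunk_tokens, hidden)
        h = F.gelu(chunk @ w_up)                  # intermediate activations exist only for this chunk
        outputs.append(h @ w_down)                # (<=chunk_tokens, hidden)
    return torch.cat(outputs, dim=0)              # (seq_len, hidden)
```

For a 100K-token sequence with a 14,336-wide intermediate layer in BF16, materializing every intermediate activation at once would cost roughly 2.9 GB, while a 2,048-token chunk keeps the live buffer under ~60 MB; combined with FlashAttention-2's online softmax, this is the kind of bounding that keeps peak VRAM from scaling with context length.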

Supported models and GPUs

Out-of-the-box examples cover Llama-3 (1B/3B/8B), GPT-OSS-20B, and Qwen3-Next-80B. The library targets NVIDIA Ampere (RTX 30xx, A-series), Ada (RTX 40xx, L4), and Hopper GPUs; Qwen3-Next requires a development build of Transformers (≥4.57.0.dev). Notably, Qwen3-Next-80B is a sparse MoE (80B total parameters, ~3B active) that the vendor typically positions for multi-GPU A100/H100 deployments; oLLM's claim is that you can run it offline by paying the SSD penalty and accepting low throughput. This contrasts with the vLLM documentation, which assumes multi-GPU servers for the same model family.

Installation and minimal usage

The project is MIT-licensed and available on PyPI (pip install ollm), with an additional kvikio-cu{cuda_version} dependency for high-speed disk I/O. For Qwen3-Next models, install Transformers from GitHub. A short example in the README shows the Inference(...) and DiskCache(...) wiring plus generate(...) with a streaming text callback. (PyPI currently lists 0.4.1; the README changelog references 0.4.2.)
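The README-style flow looks roughly like the sketch below. Treat it as an illustration of the wiring described above rather than a verbatim copy of the library's API: the model identifier, method names, cache directory, and keyword arguments are assumptions that may differ between ollm versions, so check the repository's current example before running it.

```python
# Minimal sketch of the README-style flow (names and arguments are assumed; see the repo).
from ollm import Inference                  # core entry point mentioned in the README
from transformers import TextStreamer       # standard HF streamer for the text callback

o = Inference("llama3-8B-chat", device="cuda:0")        # model id is an assumption
o.ini_model(models_dir="./models/")                     # load/stream weights from local disk
past_key_values = o.DiskCache(cache_dir="./kv_cache/")  # KV cache spills to SSD, not VRAM

streamer = TextStreamer(o.tokenizer, skip_prompt=True)  # print tokens as they are generated
messages = [{"role": "user", "content": "Summarize this very long document..."}]
input_ids = o.tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(o.device)

outputs = o.model.generate(
    input_ids=input_ids,
    past_key_values=past_key_values,   # disk-backed cache passed like any HF cache object
    max_new_tokens=256,
    streamer=streamer,
)
print(o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```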

Performance expectations and trade-offs

  • Throughput: On an RTX 3060 Ti, the maintainer reports ~0.5 tok/s at a 50K context, which suits batch/offline analytics rather than interactive chat; SSD latency dominates.
  • Storage pressure: Long contexts require a large KV cache; oLLM writes it to the SSD to keep VRAM flat (see the sizing sketch after this list). This mirrors wider industry work on KV offloading (e.g., NVIDIA Dynamo/NIXL and community discussions), but the approach remains storage-bound and workload-specific.
  • Hardware reality check: Running Qwen3-Next-80B “on consumer hardware” is feasible thanks to oLLM's disk-centric design, yet the typical high-throughput recommendations for this model still assume multi-GPU servers. Treat oLLM as an execution path for large-context, offline passes rather than a replacement for production serving stacks (e.g., vLLM/TGI).
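For a rough sense of why storage pressure grows with context, the snippet below estimates raw KV-cache size from standard Transformer dimensions. The Llama-3.1-8B figures (32 layers, 8 grouped KV heads, head dimension 128) come from the public model card; the result is only a lower bound on the on-disk footprint reported above, which also includes the streamed weights and whatever layout the disk cache actually uses.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Rough KV-cache size: keys + values for every layer, KV head, and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Llama-3.1-8B at a 100K-token context in FP16 (GQA: 8 KV heads of dim 128)
est = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=100_000)
print(f"~{est / 1e9:.1f} GB of pure KV data")  # ~13.1 GB before weights and storage overhead
```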

Bottom line

oLLM stakes out a clear design point: keep precision high, push memory to the SSD, and make extra-long contexts viable on a single 8 GB NVIDIA GPU. It won't match data-center throughput, but for offline document/log analysis, compliance review, or long-document summarization, it is a pragmatic way to run 8B–20B models comfortably, and even step up to an 80B MoE, if you can tolerate ~100–200 GB of fast local storage and sub-1 tok/s generation.


Check out the GitHub repo for tutorials, code, and notebooks. Also, feel free to follow us on Twitter, join our 100K+ ML SubReddit, and subscribe to our newsletter.


Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an AI media platform that stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a broad audience. The platform draws over 2 million monthly views, demonstrating its popularity among readers.
