Liquid AI releases LFM2-8B-A1B: an on-device mixture-of-experts model with 8.3B total parameters and ~1.5B active parameters per token
How far can sparsity go? Can an 8.3B-parameter MoE with ~1.5B active parameters per token serve on a phone without blowing up latency or memory? Liquid AI has released LFM2-8B-A1B, a small Mixture-of-Experts (MoE) model built for on-device execution under tight memory, latency, and energy budgets. Unlike most MoE efforts, which are optimized for cloud batch serving, LFM2-8B-A1B targets phones, laptops, and embedded systems. It carries 8.3B total parameters but activates only ~1.5B per token, using sparse expert routing to keep the per-token compute path small while increasing representational capacity. The model is released under the LFM Open License v1.0 (lfm1.0).
Understanding the architecture
LFM2-8B-A1B retains the LFM2 “fast backbone” and inserts sparse MoE feed-forward blocks to add capacity without significantly increasing active compute. The backbone uses 18 gated short-convolution blocks and 6 grouped-query attention (GQA) blocks. Every layer except the first two includes an MoE block; the first two are kept dense for stability. Each MoE block defines 32 experts; the router selects the top-4 experts per token with a normalized sigmoid gate and an adaptive routing bias that balances load and stabilizes training. Context length is 32,768 tokens, the vocabulary is 65,536, and the reported pre-training budget is ~12T tokens.
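To make the routing scheme concrete, here is a minimal sketch of a sparse MoE feed-forward block with 32 experts, top-4 selection per token, a normalized sigmoid gate, and an adaptive routing bias that only influences which experts are selected. The dimensions, class names, and bias handling are illustrative assumptions, not Liquid AI's implementation.

```python
import torch
import torch.nn as nn

class SparseMoEBlock(nn.Module):
    """Illustrative top-k MoE feed-forward block: 32 experts, 4 active per token."""
    def __init__(self, d_model=1024, d_ff=2048, n_experts=32, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Adaptive routing bias: adjusted during training to balance expert load;
        # it shifts which experts get selected but does not change the gate value.
        self.routing_bias = nn.Parameter(torch.zeros(n_experts), requires_grad=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                                     # x: (batch, seq, d_model)
        scores = torch.sigmoid(self.router(x))                # per-expert affinity in (0, 1)
        _, idx = torch.topk(scores + self.routing_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, idx)                 # gate uses the unbiased scores
        gates = gates / gates.sum(dim=-1, keepdim=True)       # normalized sigmoid gate
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                        # only top-k experts run per token
            for e, expert in enumerate(self.experts):
                routed = idx[..., slot] == e                  # tokens sent to expert e in this slot
                if routed.any():
                    out[routed] += gates[..., slot][routed].unsqueeze(-1) * expert(x[routed])
        return out
```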
This approach bounds per-token FLOPs and cache growth by the active path (attention plus four expert MLPs), while the total capacity allows cross-domain specialization, such as multilingual knowledge, mathematics, and code, use cases that typically regress on very small dense models. A rough arithmetic check of the compute savings is sketched below.
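A back-of-envelope check of those numbers, using the standard ~2·N FLOPs-per-token approximation for a decoder forward pass (the 2× factor and the resulting figures are rough estimates, not Liquid AI's measurements):

```python
# Only the routed slice of the network is touched per token, so decode-time
# compute tracks the ~1.5B active path rather than the 8.3B total.
total_params = 8.3e9
active_params = 1.5e9

active_fraction = active_params / total_params
dense_flops_per_token = 2 * total_params     # if all 8.3B parameters ran densely
moe_flops_per_token = 2 * active_params      # only the active path runs

print(f"active fraction per token: {active_fraction:.1%}")           # ~18%
print(f"dense-equivalent FLOPs/token: {dense_flops_per_token:.2e}")  # ~1.7e10
print(f"MoE FLOPs/token:              {moe_flops_per_token:.2e}")    # ~3.0e9
```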

Performance signals
Liquid AI reports that LFM2-8B-A1B runs significantly faster than Qwen3-1.7B in CPU tests using an internal XNNPACK-based stack and a custom CPU MoE kernel. The published results cover int4 quantization with int8 dynamic activations on an AMD Ryzen AI 9 HX370 and a Samsung Galaxy S24 Ultra. The Liquid AI team positions quality as comparable to 3–4B dense models while keeping active compute near 1.5B parameters. No cross-vendor “×-faster” headline multipliers are published; the claims rest on per-device comparisons against models of similar active size.
On accuracy, the model card lists results across 16 benchmarks, including MMLU/MMLU-Pro/GPQA (knowledge), IFEval/IFBench/Multi-IF (instruction following), GSM8K/GSMPlus/MATH500/MATH-Lvl-5 (mathematics), and MGSM/MMMLU (multilingual). The numbers show competitive instruction following and math performance at small-model scale, plus improved knowledge capacity relative to LFM2-2.6B, consistent with the larger total parameter budget.

Deployment and tools
LFM2-8B-A1B ships with Transformers/vLLM support for GPU inference and a GGUF build for llama.cpp; the official GGUF repository lists Q4_0 at about 4.7 GB and F16 at about 16.7 GB for local runs. llama.cpp requires a recent build with lfm2moe support (b6709+) to avoid “unknown model architecture” errors. Liquid’s CPU validation uses Q4_0 with int8 dynamic activations on the AMD Ryzen AI 9 HX370 and Samsung Galaxy S24 Ultra, where LFM2-8B-A1B shows higher throughput than Qwen3-1.7B in the similar-active-parameter class; ExecuTorch is referenced for mobile/embedded CPU deployment.
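For the Transformers path, a minimal GPU inference sketch might look like the following; the repo id "LiquidAI/LFM2-8B-A1B", dtype, and generation settings are assumptions, so check the model card for the usage Liquid AI actually documents (a recent transformers release with LFM2-MoE support is likely required).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2-8B-A1B"  # assumed Hugging Face repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # full-precision weights; quantized GGUF runs go through llama.cpp instead
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize what a sparse MoE layer does."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```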

Main points
- Architecture and routing: LFM2-8B-A1B pairs the LFM2 fast backbone (18 gated short convolutional blocks + 6 GQA blocks) with sparse MoE FFN per layer (all but the first two layers), using 32 experts, with top-4 routing via normalized sigmoid gating and adaptive bias; 8.3B total parameters, approximately 1.5B active per token.
- On-device target: designed for phones, laptops, and embedded CPUs/GPUs; quantized variants fit “comfortably” on high-end consumer hardware for private, low-latency use.
- Performance positioning: Liquid reports that LFM2-8B-A1B is significantly faster than Qwen3-1.7B in CPU tests, aiming for 3–4B dense-model quality while keeping roughly 1.5B parameters active per token.
LFM2-8B-A1B demonstrates that sparse MoE can be practical well below the usual server-scale regime. The model pairs the LFM2 convolution-attention backbone with per-layer expert MLPs (except the first two layers), keeping per-token compute near 1.5B active parameters while targeting the quality of 3–4B dense models. With standard and GGUF weights, llama.cpp/ExecuTorch/vLLM paths, and a clear on-device focus, LFM2-8B-A1B is a concrete option for building low-latency, private assistants and application-embedded copilots on consumer and edge hardware.
Check out the model on Hugging Face and the technical details. Feel free to check out our GitHub page for tutorials, code, and notebooks. Also follow us on Twitter, join our 100k+ ML SubReddit, subscribe to our newsletter, and join us on Telegram.

Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for the benefit of society. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform that stands out for its in-depth coverage of machine learning and deep learning news that is technically sound and easy to understand for a broad audience. The platform has more than 2 million monthly views, reflecting its popularity among readers.