
Researchers from the National University of Singapore introduce Dimple: a discrete diffusion multimodal model that enables efficient and controllable text generation

In recent months, there has been growing interest in applying diffusion models, originally designed for continuous data such as images, to text. This has led to discrete diffusion language models (DLMs) that treat text generation as a denoising process over discrete tokens. Unlike traditional autoregressive models, DLMs can decode tokens in parallel and offer better control over structure, enabling flexible initialization of the whole sequence, explicit control of the output format, and improved infilling through bidirectional attention. Their non-sequential nature also opens the door to faster generation. Despite these benefits, most current multimodal large language models (MLLMs), such as LLaMA, Qwen-VL, and InternVL, still rely solely on autoregressive decoding.
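
To make the parallel, mask-based decoding idea concrete, here is a minimal sketch of a single denoising step in a generic discrete diffusion language model. It is not Dimple's released code: the `model` callable, the `MASK_ID` constant, and the fixed unmasking ratio are illustrative assumptions.

```python
import torch

MASK_ID = 0  # placeholder mask-token id, not an actual vocabulary value

def denoise_step(model, tokens, unmask_ratio=0.25):
    """One parallel unmasking step over a single sequence of token ids.

    Assumes `model` is a bidirectional transformer that maps a (1, seq_len)
    batch of ids to (1, seq_len, vocab) logits.
    """
    logits = model(tokens.unsqueeze(0))[0]      # (seq_len, vocab)
    probs = logits.softmax(dim=-1)
    conf, pred = probs.max(dim=-1)              # per-position confidence / argmax
    masked = tokens.eq(MASK_ID)
    conf = conf.masked_fill(~masked, -1.0)      # only still-masked slots compete
    k = max(1, int(unmask_ratio * int(masked.sum())))
    commit = conf.topk(k).indices               # the k most confident masked slots
    out = tokens.clone()
    out[commit] = pred[commit]                  # fill several tokens in one pass
    return out
```

Starting from a fully masked sequence and calling this step repeatedly until no `MASK_ID` remains is the basic loop that replaces left-to-right decoding.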

Work on diffusion-based language models spans continuous and discrete diffusion spaces. Continuous approaches, such as DiffuSeq and SED, operate in embedding or relaxed categorical spaces to obtain a smooth diffusion process. In contrast, discrete models such as SDDM and RDM tailor the diffusion process to the structure of language. Training techniques vary, but masked language modeling losses or entropy-based score matching are commonly used. Some hybrid models, such as AR-Diffusion and SSD-LM, combine autoregressive and diffusion strategies to exploit the strengths of both. Meanwhile, open-source MLLMs such as LLaVA and InternVL have advanced through visual instruction tuning and joint pretraining, yet they still follow the autoregressive generation scheme.

Researchers from the National University of Singapore present Dimple, the first Discrete Diffusion Multimodal LLM (DMLLM), which integrates a vision encoder with a discrete diffusion-based language model. To overcome the instability and performance problems of purely diffusion-based training, they introduce a two-phase training recipe, autoregressive-then-diffusion, which combines initial autoregressive alignment with subsequent diffusion-based masked language modeling. Dimple-7B surpasses LLaVA-NEXT by 3.9% on benchmarks. The team also introduces confident decoding for dynamic token generation and explores structure priors for precise control over the output. These innovations significantly improve inference efficiency, generation flexibility, and structural controllability without sacrificing performance.
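
The autoregressive-then-diffusion recipe can be summarized with a small training sketch. Everything here is a hedged illustration rather than the paper's exact code: the `model(..., causal=...)` interface, the `MASK_ID`/`PAD_ID` constants, and the uniform mask-rate sampling are assumptions.

```python
import torch
import torch.nn.functional as F

MASK_ID, PAD_ID = 0, 1  # placeholder special-token ids

def train_step(model, input_ids, phase):
    """Phase 1: causal next-token prediction; Phase 2: diffusion-style masked LM."""
    if phase == "autoregressive":
        # Causal attention, next-token prediction for vision-language alignment.
        logits = model(input_ids[:, :-1], causal=True)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               input_ids[:, 1:].reshape(-1), ignore_index=PAD_ID)
    else:
        # Bidirectional attention: corrupt a random fraction of tokens with
        # MASK_ID and train the model to recover only the corrupted positions.
        mask_rate = torch.rand(())
        noise = torch.rand_like(input_ids, dtype=torch.float) < mask_rate
        corrupted = input_ids.masked_fill(noise, MASK_ID)
        logits = model(corrupted, causal=False)
        targets = input_ids.masked_fill(~noise, PAD_ID)   # ignore unmasked slots
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1), ignore_index=PAD_ID)
    return loss
```

Running the autoregressive phase first stabilizes alignment between the vision encoder and the language model; the diffusion phase then restores the parallel, any-order generation ability that diffusion decoding relies on.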

Dimple is a discrete diffusion multimodal LLM that integrates a vision encoder with a diffusion-based language model. To address inefficiencies in diffusion training, such as sparse supervision and limited generation coverage, training is divided into two phases: first, autoregressive training with a causal attention mask for vision-language alignment, and then diffusion training to restore generation capability. During inference, a dynamic "confident decoding" strategy adjusts the number of tokens updated at each step based on prediction confidence. Despite using far fewer training samples, Dimple shows competitive performance across multiple benchmarks, outperforming autoregressive models trained on similar data, although it lags behind state-of-the-art systems trained at a larger scale.
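
The idea behind confident decoding, committing a variable number of tokens per step, can be sketched as a simple confidence-gated loop. The threshold value, the `model` interface, and `MASK_ID` are assumptions for illustration, not Dimple's published hyperparameters.

```python
import torch

MASK_ID = 0  # placeholder mask-token id

def confident_decode(model, tokens, threshold=0.9, max_steps=64):
    """Commit every masked position whose confidence clears the threshold.

    Assumes `model` maps (1, seq_len) ids to (1, seq_len, vocab) logits.
    Easy steps may finalize many tokens at once; hard steps fall back to
    committing only the single most confident prediction.
    """
    for _ in range(max_steps):
        masked = tokens.eq(MASK_ID)
        if not masked.any():
            break                                   # nothing left to denoise
        probs = model(tokens.unsqueeze(0))[0].softmax(dim=-1)
        conf, pred = probs.max(dim=-1)
        conf = conf.masked_fill(~masked, -1.0)
        commit = conf >= threshold                  # variable number per step
        if not commit.any():
            commit = torch.zeros_like(masked)
            commit[conf.argmax()] = True            # guarantee progress
        tokens = torch.where(commit, pred, tokens)
    return tokens
```

Because the number of committed tokens adapts to how certain the model is, the total number of forward passes can be much smaller than the sequence length, which is the intuition behind the reported reduction in inference steps.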

The experiments evaluate the DMLLM Dimple against autoregressive models on instruction-following tasks. Trained with the hybrid strategy that combines autoregressive and diffusion tuning, Dimple shows strong performance on most benchmarks against models trained on similar amounts of data. Although it trails models trained on much larger datasets, Dimple benefits from a stronger base language model. Ablation studies show that combining autoregressive and diffusion tuning mitigates issues such as length bias and improves consistency. Prefilling further boosts inference speed significantly with only a small performance drop, making the model both efficient and competitive on multimodal understanding tasks.

In summary, Dimple, the first DMLLM, is designed to overcome the limitations of purely discrete diffusion training, such as instability and length bias. Dimple adopts a hybrid training approach that starts with autoregressive learning and then performs diffusion tuning, yielding the Dimple-7B model, which performs 3.9% better than LLaVA-NEXT. Its decoding strategy, confident decoding, greatly reduces the number of inference steps, while prefilling increases speed with minimal performance trade-offs. Dimple also enables structured, controllable output through structure priors, providing fine-grained control over format and length that autoregressive models struggle to offer.
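
As a final illustration, a structure prior can be thought of as a response template in which formatting tokens are frozen and only designated slots are denoised. The helper below is hypothetical: the `[MASK]` marker convention, the tokenizer interface, and `MASK_ID` are assumptions, not Dimple's actual API.

```python
import torch

MASK_ID = 0  # placeholder mask-token id

def build_structure_prior(tokenizer, template):
    """Turn a format template into an initial sequence for diffusion decoding.

    Literal pieces of the template are tokenized and frozen; each "[MASK]"
    marker becomes one fillable slot. Assumes a Hugging Face-style
    `tokenizer.encode` method.
    """
    ids, frozen = [], []
    for piece in template.split("[MASK]"):
        piece_ids = tokenizer.encode(piece, add_special_tokens=False)
        ids.extend(piece_ids)
        frozen.extend([True] * len(piece_ids))
        ids.append(MASK_ID)                 # one fillable slot per marker
        frozen.append(False)
    ids, frozen = ids[:-1], frozen[:-1]     # drop the trailing extra slot
    return torch.tensor(ids), torch.tensor(frozen)

# Example: a JSON answer whose keys, punctuation, and field length are fixed.
# tokens, frozen = build_structure_prior(tok, '{"answer": [MASK] [MASK] [MASK]}')
# During decoding, only positions where `frozen` is False are ever updated,
# so the output format and length are guaranteed by construction.
```

A left-to-right autoregressive decoder cannot be constrained this directly, since it must produce the sequence in order and cannot be told ahead of time which positions are fixed.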


Check out the Paper, the Model on Hugging Face, and the GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 95k+ ML SubReddit and subscribe to our Newsletter.


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.
