
The Qwen team introduced Qwen-Image-Edit: an image-editing version of Qwen-Image with advanced features for semantic and appearance editing

In the domain of multimodal AI, instruction-based image editing models are changing the way users interact with visual content. Released by Alibaba’s Qwen team in August 2025, Qwen-Image-Edit is built on the 20B-parameter Qwen-Image foundation model to provide advanced editing capabilities. The model excels at semantic editing (e.g., style transfer and novel view synthesis) and appearance editing (e.g., precise object modification), while retaining Qwen-Image’s strength in rendering complex English and Chinese text. Integrated into Qwen Chat and available on Hugging Face, it lowers the barriers to professional content creation, from IP design to error correction in generated artwork.

Architecture and key innovations

Qwen-Image-Edit extends Qwen-Image’s Multimodal Diffusion Transformer (MMDiT) architecture, which comprises a Qwen2.5-VL multimodal large language model (MLLM) for text conditioning, a variational autoencoder (VAE) for image tokenization, and the MMDiT backbone for diffusion. For editing, it introduces dual encoding: Qwen2.5-VL extracts high-level semantic features from the input image, while the VAE supplies low-level reconstruction detail that is added to the MMDiT image stream. This enables balanced semantic coherence (e.g., maintaining object identity during pose changes) and visual fidelity (e.g., preserving unmodified areas).
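To make the dual-encoding idea concrete, here is a minimal sketch (not Qwen’s actual implementation; the module names, dimensions, and two-layer backbone are illustrative assumptions) of how semantic tokens and reconstruction latents might be fused in an MMDiT-style forward pass:

import torch
import torch.nn as nn

class DualEncodingSketch(nn.Module):
    # Illustrative only: condition a joint-attention backbone on both
    # high-level MLLM features ("what to change") and low-level VAE
    # latents of the input image ("what to keep").
    def __init__(self, sem_dim=3584, lat_dim=16, hidden=1024):
        super().__init__()
        self.sem_proj = nn.Linear(sem_dim, hidden)   # project MLLM tokens
        self.lat_proj = nn.Linear(lat_dim, hidden)   # project VAE latent patches
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(hidden, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, sem_tokens, input_latents, noisy_latents):
        sem = self.sem_proj(sem_tokens)
        # Input-image latents ride alongside the noisy latents being denoised
        img = self.lat_proj(torch.cat([input_latents, noisy_latents], dim=1))
        # Joint attention over the concatenated text and image streams
        return self.backbone(torch.cat([sem, img], dim=1))

model = DualEncodingSketch()
sem = torch.randn(1, 32, 3584)    # pseudo Qwen2.5-VL features
inp = torch.randn(1, 64, 16)      # pseudo VAE latents of the input image
noisy = torch.randn(1, 64, 16)    # pseudo noisy target latents
print(model(sem, inp, noisy).shape)  # torch.Size([1, 160, 1024])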

Multimodal Scalable RoPE (MSRoPE) positional encoding adds a frame dimension to distinguish the pre-edit image from the image being generated, supporting tasks such as text-image-to-image (TI2I) editing. The VAE was fine-tuned on text-rich data, achieving strong reconstruction of 33.42 PSNR on general images and even better results on text-heavy images, outperforming FLUX-VAE and SD-3.5-VAE. These enhancements allow Qwen-Image-Edit to handle bilingual text editing while preserving the original font, size, and style.
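A rough illustration of how a frame dimension can separate the two images in positional IDs (the indexing scheme below is a hypothetical simplification, not the published MSRoPE formulation):

# Hypothetical sketch: tag each latent patch with (frame, y, x) so the
# model can tell the conditioning (pre-edit) image apart from the target.
def build_position_ids(height, width, frame):
    return [(frame, y, x) for y in range(height) for x in range(width)]

pre_edit_ids = build_position_ids(4, 4, frame=0)  # input-image tokens
target_ids = build_position_ids(4, 4, frame=1)    # tokens being denoised
print(pre_edit_ids[:2], target_ids[:2])  # [(0, 0, 0), (0, 0, 1)] [(1, 0, 0), (1, 0, 1)]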

The main features of Qwen-Image-Edit

  • Semantic and appearance editing: Supports low-level visual appearance editing (e.g., adding, removing, or modifying elements while keeping other regions unchanged) and high-level visual semantic editing (e.g., IP creation, object rotation, and style transfer), allowing pixel-level changes with semantic consistency.
  • Precise text editing: Enables bilingual (Chinese and English) text editing, including direct addition, deletion, and modification of text in images, while preserving the original font, size, and style.
  • Strong benchmark performance: Achieves state-of-the-art results on multiple public benchmarks for image editing tasks, positioning it as a powerful foundation model for both generation and manipulation.

Training and data pipelines

Reusing Qwen-Image’s curated corpus of image-text pairs spanning nature (55%), design (27%), people (13%), and synthetic (5%) domains, Qwen-Image-Edit applies a multi-task training paradigm that unifies text-to-image (T2I), image-to-image (I2I), and text-image-to-image (TI2I) objectives. A seven-stage filtering pipeline refines the data for quality and balance, and synthetic text-rendering strategies (pure, compositional, complex) address the long-tail problem of rare Chinese characters, as sketched below.
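A multi-stage filtering pipeline can be pictured as successive passes that each discard failing samples (the stage names and thresholds below are hypothetical placeholders, not the published seven stages):

# Hypothetical sketch of a staged data-filtering pipeline
def resolution_ok(sample):
    return sample.get("width", 0) >= 512 and sample.get("height", 0) >= 512

def caption_ok(sample):
    return len(sample.get("caption", "")) > 10

STAGES = [resolution_ok, caption_ok]  # ...further quality/balance stages

def run_pipeline(samples):
    for stage in STAGES:
        samples = [s for s in samples if stage(s)]
    return samples

data = [{"width": 1024, "height": 768, "caption": "a red sign over a shop door"}]
print(len(run_pipeline(data)))  # 1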

Training uses flow matching with a scalable producer-consumer framework, followed by supervised fine-tuning and reinforcement learning (DPO and GRPO) for preference alignment. For editing-specific tasks, it incorporates novel view synthesis and depth estimation, using DepthPro as a teacher model. This yields capabilities such as correcting calligraphy errors through chained editing.
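As a rough illustration of the flow-matching objective (a generic rectified-flow sketch under common assumptions, not Qwen’s training code), the model learns to predict the velocity between data and noise along a linear interpolation path:

import torch

def flow_matching_loss(model, x0, cond):
    # Generic flow-matching step: sample a timestep, interpolate between
    # data and noise, and regress the model's output onto the velocity.
    noise = torch.randn_like(x0)                   # x1 ~ N(0, I)
    t = torch.rand(x0.shape[0], device=x0.device)  # t ~ U[0, 1]
    t_ = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast over dims
    xt = (1 - t_) * x0 + t_ * noise                # point on the path
    v_target = noise - x0                          # ground-truth velocity
    v_pred = model(xt, t, cond)                    # model predicts velocity
    return torch.mean((v_pred - v_target) ** 2)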

Advanced editing features

Qwen-Image-Edit shines at semantic editing, enabling IP creation such as generating MBTI-themed emojis from a mascot (e.g., the Capybara mascot) while maintaining character consistency. It supports novel view synthesis, rotating objects or scenes by up to 180 degrees with high fidelity, reaching 15.11 PSNR on GSO and surpassing specialized models such as CRM. Style transfer turns portraits into art styles such as Studio Ghibli while maintaining semantic integrity.

For appearance editing, it adds elements such as signboards with realistic reflections, or removes fine details like stray strands of hair without altering the surroundings. Bilingual text editing is precise: change “hope” to “Qwen” on a poster, or correct Chinese characters in calligraphy via bounding boxes. Chained editing allows iterative corrections, fixing errors step by step until the result is accurate.

Benchmark results and evaluation

Qwen-Image-Edit leads editing benchmarks, scoring 7.56 on GEdit-Bench-EN and 7.52 on GEdit-Bench-CN, outperforming GPT Image 1 (7.53 EN, 7.30 CN) and FLUX.1 Kontext [Pro] (6.56 EN, 1.23 CN). On ImgEdit, it reaches 4.27 overall, excelling in tasks like object replacement (4.66) and style change (4.81). Its depth estimation achieves 0.078 AbsRel on KITTI, competitive with Depth Anything v2.

Human evaluation on AI Arena places its base model among the top models available via API, with a notable text-rendering advantage. These metrics highlight its strengths in instruction following and multilingual fidelity.

Deployment and practical usage

Qwen-Image-Edit can be deployed through the Hugging Face Diffusers library:

from diffusers import QwenImageEditPipeline
import torch
from PIL import Image

# Load the pipeline in bfloat16 and move it to the GPU
pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
)
pipeline.to("cuda")

image = Image.open("input.png").convert("RGB")
prompt = "Change the rabbit's color to purple, with a flash light background."

# The pipeline returns a list of images; take the first result
output = pipeline(
    image=image,
    prompt=prompt,
    num_inference_steps=50,
    true_cfg_scale=4.0,
).images[0]
output.save("output.png")
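The chained editing described earlier can be approximated by feeding each output back in as the next input (a usage sketch reusing the pipeline above; the file name and prompts are hypothetical):

# Sketch: chained editing loops the previous output back in as input
edits = [
    "Fix the incorrect character in the top-left region.",
    "Fix the incorrect character in the bottom-right region.",
]
current = Image.open("calligraphy.png").convert("RGB")
for i, step_prompt in enumerate(edits):
    current = pipeline(
        image=current,
        prompt=step_prompt,
        num_inference_steps=50,
        true_cfg_scale=4.0,
    ).images[0]
    current.save(f"step_{i}.png")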

Alibaba Cloud’s Model Studio provides API access for scalable inference. The GitHub repository is licensed under Apache 2.0 and includes training code.

What this means for the future

Qwen-Image-Edit advances the visual-language interface, offering seamless content manipulation for creators. Its unified understanding-and-generation approach suggests potential extensions to video and 3D, enabling innovative applications in AI-driven design.




Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of artificial intelligence for social good. His most recent endeavor is the launch of Marktechpost, an artificial intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform draws over 2 million views per month, demonstrating its popularity among readers.
