TokenSet: A Dynamic Set-Based Framework for Semantic-Aware Visual Representation

Visual generation frameworks typically follow a two-stage approach: first compressing the visual signal into a latent representation, then modeling the distribution in that low-dimensional space. However, conventional tokenization methods apply a uniform spatial compression ratio regardless of the semantic complexity of different image regions. In a beach photo, for example, a simple sky region receives the same representational capacity as the semantically rich foreground. Pooling-based methods extract low-dimensional features but lack direct supervision on individual elements, often yielding suboptimal results. Correspondence-based approaches that rely on bipartite matching suffer from inherent instability: the supervision signal varies across training iterations, leading to inefficient convergence.
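To make the matching-instability point concrete, here is a minimal sketch (ours, not from the paper) using SciPy's Hungarian solver: when two predictions are nearly identical, a tiny perturbation flips the optimal assignment, so the regression target seen by each prediction slot changes between otherwise similar training iterations.

```python
# Minimal sketch (not from the paper): bipartite matching between predicted
# and target set elements. A small perturbation of near-identical predictions
# flips the optimal assignment, so the per-slot supervision target changes.
import numpy as np
from scipy.optimize import linear_sum_assignment

targets = np.array([[0.0, 0.0], [1.0, 1.0]])      # two ground-truth set elements
preds   = np.array([[0.49, 0.49], [0.51, 0.51]])  # two nearly identical predictions

def match(preds, targets):
    cost = np.linalg.norm(preds[:, None] - targets[None, :], axis=-1)  # pairwise L2 cost
    rows, cols = linear_sum_assignment(cost)                           # Hungarian matching
    return cols

print(match(preds, targets))                                    # [0 1]
print(match(preds + [[0.03, 0.03], [-0.03, -0.03]], targets))   # flips to [1 0]
```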

Image tokenization has evolved considerably to address these compression challenges. Variational autoencoders (VAEs) pioneered mapping images into a low-dimensional continuous distribution. VQVAE and VQGAN project images into discrete token sequences, while VQVAE-2, RQVAE, and MoVQ introduce hierarchical latent representations through residual quantization. FSQ, SimVQ, and VQGAN-LC address representation collapse when scaling up the codebook size. Set modeling, meanwhile, has moved from traditional bag-of-words (BoW) representations to more sophisticated techniques: DSPN uses Chamfer loss, while TSPN and DETR use Hungarian matching, although these procedures often produce inconsistent training signals.

Researchers from the University of Science and Technology of China and Tencent Hunyuan Research have proposed a fundamentally new paradigm for image generation through set-based tokenization and distribution modeling. Their TokenSet method dynamically allocates coding capacity according to regional semantic complexity. The resulting unordered token-set representation enhances global context aggregation and improves robustness against local perturbations. They also introduce Fixed-Sum Discrete Diffusion (FSDD), the first framework to simultaneously handle discrete values, fixed sequence lengths, and summation invariance, enabling effective modeling of set distributions. Experiments demonstrate the method's advantages in semantic-aware representation and generation quality.
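The set-to-sequence idea can be sketched as follows; the helper names and toy sizes here are ours, not the paper's API. Counting how often each codebook index appears turns an unordered token multiset into a fixed-length integer sequence whose entries always sum to the number of tokens, which is exactly the fixed-sum structure FSDD is built to model.

```python
# Hedged sketch of mapping an unordered token set to a fixed-length integer
# sequence (names and sizes are assumptions, not the paper's code): per-index
# counts have fixed length (codebook size) and a fixed sum (token count).
import numpy as np

CODEBOOK_SIZE = 8   # toy value; real codebooks are far larger
NUM_TOKENS = 6      # toy value

def set_to_counts(token_set):
    """Map an unordered token multiset to a fixed-length count sequence."""
    return np.bincount(token_set, minlength=CODEBOOK_SIZE)

def counts_to_set(counts):
    """Invert the mapping; any ordering of the recovered multiset is equivalent."""
    return np.repeat(np.arange(len(counts)), counts)

tokens = np.array([3, 1, 3, 7, 1, 3])   # order is irrelevant
counts = set_to_counts(tokens)          # [0 2 0 3 0 0 0 1]
assert counts.sum() == NUM_TOKENS       # summation invariance
assert set_to_counts(counts_to_set(counts)).tolist() == counts.tolist()
print(counts)
```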

The experiments were performed on the ImageNet dataset at 256 × 256 resolution, with results evaluated on the 50,000-image validation set using the Fréchet Inception Distance (FID) metric. Tokenizer training follows TiTok's strategy and applies data augmentation including random cropping and horizontal flipping. The model was trained on ImageNet for 1M steps with a batch size of 256, equivalent to 200 epochs. Training combines a learning-rate warmup phase followed by cosine decay, gradient clipping at 1.0, and an exponential moving average of the weights with a decay rate of 0.999. Discriminator losses are included to improve quality and stabilize training, with the decoder trained only during the last 500K steps. MaskGiT's proxy codes facilitate the training process.
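For readers who want the optimization recipe in code, the following is a hedged sketch; the warmup length and base learning rate are assumptions, since only the cosine decay, the 1.0 gradient clip, and the 0.999 EMA decay are specified above.

```python
# Hedged sketch of the reported schedule: linear warmup then cosine decay.
# WARMUP_STEPS and BASE_LR are assumed values, not stated in the article.
import math

TOTAL_STEPS = 1_000_000
WARMUP_STEPS = 10_000   # assumed warmup length
BASE_LR = 1e-4          # assumed base learning rate

def lr_at(step):
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS                     # linear warmup
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * BASE_LR * (1.0 + math.cos(math.pi * progress))  # cosine decay

# In a PyTorch training loop, the remaining reported pieces would be:
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip at 1.0
#   ema_param.mul_(0.999).add_(param.detach(), alpha=0.001)           # EMA, decay 0.999
print(lr_at(0), lr_at(WARMUP_STEPS), lr_at(TOTAL_STEPS))  # 0 -> BASE_LR -> 0
```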

The results demonstrate the key advantages of the TokenSet approach. Permutation invariance was confirmed both visually and quantitatively: reconstructed images are visually identical regardless of token order, with consistent quantitative results across different permutations. This confirms that the network successfully learns permutation invariance even though it is trained on only a small subset of the possible orderings. By decoupling positional relationships and eliminating the spatial bias induced by serialization, each token integrates global contextual information, with a theoretical receptive field spanning the entire feature space. Moreover, FSDD is the only method that satisfies all the required properties simultaneously, yielding the best performance metrics.
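The permutation-invariance check can be sketched as below. The decoder here is a toy stand-in that is invariant by construction (order-insensitive pooling), whereas the actual network learns this invariance from training on shuffled token orders; the test itself mirrors the evaluation described above.

```python
# Minimal sketch (hypothetical decoder, not the paper's code): decode the same
# token set under different orderings and verify the outputs are identical.
import numpy as np

rng = np.random.default_rng(0)

def decode(token_set):
    """Toy stand-in for a permutation-invariant decoder: a sum over per-token
    features gives the same output for any ordering of the input tokens."""
    embeddings = np.sin(np.outer(token_set, np.arange(1, 5)))  # toy per-token features
    return embeddings.sum(axis=0)                              # order-invariant pooling

tokens = rng.integers(0, 8, size=6)
out_a = decode(tokens)
out_b = decode(rng.permutation(tokens))
print("permutation invariant:", np.allclose(out_a, out_b))    # True
```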

In summary, the TokenSet framework represents a paradigm shift in visual representation, moving from serialized tokens to a set-based method that dynamically allocates representational capacity according to semantic complexity. Its dual transformation mechanism establishes a bijective mapping between unordered token sets and structured integer sequences, allowing FSDD to effectively model the set distribution. The set-based tokenization approach thus offers distinct advantages and opens new possibilities for image representation and generation. This direction suggests new perspectives for next-generation generative models, with further analysis needed to unlock the full potential of this representation and modeling approach.


Check out the Paper and GitHub page. All credit for this research goes to the researchers of this project.
