Autoregressive (AR) models have fundamentally changed the landscape of generative AI, forming the backbone of state-of-the-art language and image generation models. Their central principle is predicting the next token, be it a word in a sentence or a patch in an image. A recent research paper, however, challenges the traditional definition of a token and presents an innovative approach to autoregressive image generation.
Until now, "tokens" have been considered the smallest units of prediction, often discrete symbols in natural language processing or quantized image patches in image generation. However, the optimal definition of a token for two-dimensional image structures has long remained an unsolved problem. Additionally, AR models suffer from so-called "exposure bias." This arises from "teacher forcing" during training, where the model is always fed the correct previous tokens. In the inference phase, when the model has to use its own predictions, this can lead to an accumulation of errors.
The new framework "xAR" expands the concept of the token to an entity "X," which can take various forms: a single patch, a cell (a k x k grouping of neighboring patches), a sample (a non-local grouping of distant patches), a scale (from coarse to fine), or even the entire image. Instead of classifying discrete tokens, xAR uses continuous entity regression, based on flow-matching methods in each AR step.
Another important aspect of xAR is Noisy Context Learning (NCL). During training, the model is deliberately exposed to noisy contexts instead of conditioning exclusively on ground-truth previous entities. This reduces exposure bias: the model learns to handle imperfect inputs, which improves the robustness and accuracy of its predictions at inference time.
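A minimal sketch of the idea follows; the additive Gaussian perturbation and the noise_std value are illustrative assumptions rather than the paper's exact recipe. The conditioning entities are corrupted during training, so at inference the model's own imperfect outputs resemble the contexts it was trained on.

```python
import torch

def noisy_context(context_entities, noise_std=0.1):
    """Perturb the conditioning entities so training no longer assumes a
    perfectly clean (teacher-forced) context. noise_std is an illustrative knob."""
    return context_entities + noise_std * torch.randn_like(context_entities)

clean_context = torch.randn(4, 3, 16)   # 4 samples, each with 3 previously generated entities
corrupted = noisy_context(clean_context)
# The entity predictor would then condition on `corrupted` instead of `clean_context`.
```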
The results of xAR are promising. The base model xAR-B (172M parameters) outperforms larger models such as DiT-XL/SiT-XL (675M parameters) on ImageNet-256 image generation while achieving 20 times faster inference. The largest model, xAR-H (1.1B parameters), sets a new state of the art with an FID of 1.24, without relying on vision foundation models (e.g., DINOv2) or advanced guidance interval sampling, and is 2.2 times faster than the previously best-performing model.
xAR represents a significant advance in autoregressive image generation. The flexible definition of prediction units and the reduction of exposure bias through NCL open up new possibilities for the development of even more powerful and efficient generative models. The research results suggest that xAR has the potential to revolutionize image generation in various application areas, from art to medical imaging.
Bibliography:
- https://arxiv.org/abs/2502.20388
- https://huggingface.co/papers/2502.20388
- https://arxiv.org/html/2502.20388v1
- https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey
- https://neurips.cc/virtual/2024/poster/94115
- https://openreview.net/forum?id=gojL67CfS8
- https://paperswithcode.com/paper/next-patch-prediction-for-autoregressive
- http://paperreading.club/page?id=287866
- https://huggingface.co/papers/2412.15119
- https://www.researchgate.net/publication/386093859_High-Resolution_Image_Synthesis_via_Next_Token_Prediction