arxiv: 2604.21450 · v1 · submitted 2026-04-23 · cs.CV · cs.AI · cs.LG

VARestorer: One-Step VAR Distillation for Real-World Image Super-Resolution

Yixuan Zhu, Shilin Ma, Haolin Wang, Ao Li, Yanzhe Jing, Yansong Tang, Lei Chen, Jiwen Lu, Jie Zhou

Pith reviewed 2026-05-09 22:00 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.LG

keywords visual autoregressive models · image super-resolution · model distillation · one-step inference · real-world restoration · pyramid conditioning · parameter-efficient adaptation

The pith

Distribution matching distills iterative visual autoregressive models into single-pass real-world image super-resolution predictors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to adapt visual autoregressive models, originally designed for step-by-step image generation from text, to the problem of restoring real-world low-quality photographs to high quality. Iterative next-scale prediction creates slow inference and accumulating errors that blur or distort outputs, so the authors replace it with a direct one-step process. This matters for applications needing fast, coherent restoration such as mobile photography or video enhancement, where repeated passes are impractical. They achieve the change through distribution matching that aligns one-step outputs with the original model's behavior, plus pyramid conditioning that lets the model attend bidirectionally across scales. Experiments on standard benchmarks show the resulting model reaches top perceptual scores while running ten times faster and updating only a small fraction of parameters.
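To make the cost of iteration concrete, the following minimal sketch counts sequential transformer passes for next-scale prediction versus a single pass. The scale schedule `SCALES` and both function names are illustrative assumptions, not values or APIs from the paper. The pass ratio is only a rough proxy for wall-clock time and does not by itself reproduce the reported tenfold speedup.

```python
# Hypothetical scale schedule (side length of the token map at each scale).
# Illustrative only; not taken from the paper.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)


def tokens_per_scale(scales):
    """Number of tokens in each square token map."""
    return [s * s for s in scales]


def sequential_passes(scales, one_step):
    """Iterative next-scale prediction needs one pass per scale;
    a one-step model emits every scale in a single pass."""
    return 1 if one_step else len(scales)


if __name__ == "__main__":
    print("total tokens:", sum(tokens_per_scale(SCALES)))
    print("iterative passes:", sequential_passes(SCALES, one_step=False))
    print("one-step passes:", sequential_passes(SCALES, one_step=True))
```

Actual latency also depends on caching, batch size, and hardware, so any measured speedup should come from timing the real models.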

Core claim

VARestorer is a distillation framework that converts a pre-trained text-to-image visual autoregressive model into a one-step image super-resolution system. Distribution matching removes the need for iterative refinement and thereby cuts error propagation. Pyramid image conditioning with cross-scale attention supplies bidirectional information flow so that low-quality input tokens are not overlooked later in the sequence. Fine-tuning occurs through parameter-efficient adapters on just 1.2 percent of the weights. On the DIV2K dataset the method records 72.32 MUSIQ and 0.7669 CLIPIQA while delivering tenfold faster inference than standard VAR iteration.
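The Figure 3 caption mentions a token-level KL divergence between teacher and student. A minimal sketch of such a loss is given here in plain Python. The direction KL(teacher || student), the averaging over positions, and all function names are assumptions for illustration; the summary does not state the exact form the authors use.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's vocabulary logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def token_level_kl(teacher_logits, student_logits):
    """Mean over token positions of KL(teacher || student).

    Both arguments are lists of per-token logit lists with matching shapes.
    The KL direction is an assumption, not confirmed by the source.
    """
    per_token = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


if __name__ == "__main__":
    teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.0]]
    student = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.0]]
    print(token_level_kl(teacher, student))
```

The loss is zero when the student's per-token distributions equal the teacher's and grows as they diverge.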

What carries the argument

Distribution matching distillation, combined with pyramid conditioning and cross-scale attention, replaces iterative next-scale prediction while preserving the autoregressive transformer's structure.
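The contrast between causal next-scale attention and bidirectional cross-scale attention can be sketched as two boolean masks. This is a simplified illustration: the token counts per scale are made up, the function names are hypothetical, and treating cross-scale attention as fully unrestricted is an assumption rather than the paper's confirmed design.

```python
def scale_ids(tokens_per_scale):
    """Map each token position to the index of the scale it belongs to."""
    ids = []
    for k, n in enumerate(tokens_per_scale):
        ids.extend([k] * n)
    return ids


def causal_scale_mask(tokens_per_scale):
    """mask[i][j] is True when query i may attend to key j.

    Block-wise causal: a token sees its own scale and coarser scales only.
    """
    ids = scale_ids(tokens_per_scale)
    return [[key_scale <= query_scale for key_scale in ids] for query_scale in ids]


def cross_scale_mask(tokens_per_scale):
    """Bidirectional variant (assumed): every token may attend to every scale."""
    n = sum(tokens_per_scale)
    return [[True] * n for _ in range(n)]


if __name__ == "__main__":
    sizes = (1, 4, 9)  # hypothetical token counts for three scales
    causal = causal_scale_mask(sizes)
    full = cross_scale_mask(sizes)
    # Keys visible to the coarsest token under each mask.
    print(sum(causal[0]), sum(full[0]))
```

Under the causal mask the coarsest token never sees finer-scale tokens, which is the restriction the bidirectional variant lifts.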

Load-bearing premise

Distribution matching plus pyramid conditioning can substitute for the full iterative next-scale process without creating new artifacts or losing global coherence on varied real-world low-quality inputs.

What would settle it

If the one-step model produces visibly more artifacts, lower MUSIQ scores, or less coherent structures than the original iterative VAR when both are tested on the same diverse set of real-world degraded images, the claim that the substitution works would be refuted.
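One way to set up such a test is a paired comparison over the same images. The sketch below assumes per-image quality scores (for example MUSIQ) have already been computed for both models; the function name and the toy numbers are placeholders, not results from the paper.

```python
from statistics import mean


def paired_comparison(one_step_scores, iterative_scores):
    """Compare two models scored on the same images, in the same order.

    Returns the mean per-image difference (one-step minus iterative) and the
    fraction of images on which the one-step model scores strictly higher.
    """
    if len(one_step_scores) != len(iterative_scores):
        raise ValueError("score lists must cover the same images")
    if not one_step_scores:
        raise ValueError("need at least one image")
    diffs = [a - b for a, b in zip(one_step_scores, iterative_scores)]
    wins = sum(d > 0 for d in diffs)
    return {"n": len(diffs), "mean_diff": mean(diffs), "win_rate": wins / len(diffs)}


if __name__ == "__main__":
    # Toy placeholder scores, NOT measurements.
    print(paired_comparison([3, 1, 4], [2, 2, 2]))
```

A consistently negative mean difference or a low win rate on diverse degraded inputs would count against the substitution claim; a significance test on the differences would be a natural next step.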

Figures

Figures reproduced from arXiv: 2604.21450 by Ao Li, Haolin Wang, Jie Zhou, Jiwen Lu, Lei Chen, Shilin Ma, Yansong Tang, Yanzhe Jing, Yixuan Zhu.

**Figure 2.** Comparison of VAR-based ISR approaches. (a) Zero-shot upsampling uses LQ tokens at …

**Figure 3.** The overall framework of VARestorer. (a) VARestorer utilizes a VAR distillation framework for real-ISR. During training, we employ the pre-trained text-to-image VAR model as the teacher to predict the high-quality tokens and calculate the token-level KL divergence for distribution alignment. (b) To fully exploit the LQ input, we introduce cross-scale pyramid conditioning, which allows the student model to …

**Figure 4.** Qualitative comparisons on real-world datasets. Our VARestorer delivers exceptional details with just one-step inference. The numbers following each method indicate the corresponding inference steps. Please zoom in for a better view.

**Figure 5.** Visual results of the ablations. Our distillation method, cross-scale attention, and distribution matching collectively enhance the visual quality of the generated images by reducing artifacts, preserving fine details, and ensuring better structural consistency.

Original abstract

Recent advancements in visual autoregressive models (VAR) have demonstrated their effectiveness in image generation, highlighting their potential for real-world image super-resolution (Real-ISR). However, adapting VAR for ISR presents critical challenges. The next-scale prediction mechanism, constrained by causal attention, fails to fully exploit global low-quality (LQ) context, resulting in blurry and inconsistent high-quality (HQ) outputs. Additionally, error accumulation in the iterative prediction severely degrades coherence in the ISR task. To address these issues, we propose VARestorer, a simple yet effective distillation framework that transforms a pre-trained text-to-image VAR model into a one-step ISR model. By leveraging distribution matching, our method eliminates the need for iterative refinement, significantly reducing error propagation and inference time. Furthermore, we introduce pyramid image conditioning with cross-scale attention, which enables bidirectional scale-wise interactions and fully utilizes the input image information while adapting to the autoregressive mechanism. This prevents later LQ tokens from being overlooked in the transformer. By fine-tuning only 1.2% of the model parameters through parameter-efficient adapters, our method maintains the expressive power of the original VAR model while significantly enhancing efficiency. Extensive experiments show that VARestorer achieves state-of-the-art performance with 72.32 MUSIQ and 0.7669 CLIPIQA on the DIV2K dataset, while accelerating inference by 10 times compared to conventional VAR inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VARestorer shows a workable distillation route from iterative VAR to one-step real-world ISR, but the quality evidence stays thin without ablations or direct coherence checks.

The main thing here is a distillation setup that converts a pre-trained text-to-image VAR into a single-pass model for real-world super-resolution. Distribution matching removes the iterative next-scale loop, and pyramid conditioning with cross-scale attention tries to restore bidirectional information flow from the low-quality input that causal attention normally blocks. They also keep changes small by training only adapters on 1.2 percent of the parameters, which is a practical move for preserving the base model's capacity while cutting inference time by roughly 10x.
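For a sense of how a trainable fraction near one percent can arise, here is a minimal parameter-count sketch. It assumes LoRA-style low-rank adapters on frozen linear layers; the source only says "parameter-efficient adapters", so the adapter type, layer shapes, and rank are all hypothetical and the output is not meant to reproduce the reported 1.2%.

```python
def lora_params(d_in, d_out, rank):
    """Parameters added by one low-rank adapter: A (d_in x rank) plus B (rank x d_out)."""
    return rank * (d_in + d_out)


def trainable_fraction(layers, rank):
    """Share of all parameters that are trainable when only adapters are updated.

    layers: list of (d_in, d_out) shapes of frozen linear layers, each of
    which receives one adapter. Biases and other modules are ignored.
    """
    frozen = sum(d_in * d_out for d_in, d_out in layers)
    added = sum(lora_params(d_in, d_out, rank) for d_in, d_out in layers)
    return added / (frozen + added)


if __name__ == "__main__":
    # Hypothetical: four 1024x1024 projections, rank-8 adapters.
    layers = [(1024, 1024)] * 4
    print(f"{trainable_fraction(layers, rank=8):.4%}")
```

The fraction scales roughly with `rank / d` for square layers, which is why small ranks keep the trainable share low even on large models.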

Referee Report

3 major / 2 minor

Summary. The paper proposes VARestorer, a distillation framework that converts a pre-trained text-to-image visual autoregressive (VAR) model into a one-step real-world image super-resolution (ISR) model. It uses distribution matching to eliminate iterative next-scale prediction and error accumulation, combined with pyramid image conditioning and cross-scale attention to enable bidirectional scale interactions and better exploit global low-quality context. Only 1.2% of parameters are fine-tuned via adapters. The work claims state-of-the-art no-reference perceptual metrics (72.32 MUSIQ, 0.7669 CLIPIQA on DIV2K) and a 10x inference speedup over standard VAR.

Significance. If the empirical results hold under rigorous verification, the contribution would be significant: it shows how distribution matching can distill iterative autoregressive generative models into efficient one-step predictors for restoration tasks, preserving perceptual quality while achieving substantial speedups. The parameter-efficient adaptation and pyramid conditioning approach could generalize to other conditional generation problems and enable practical deployment of large VAR models in real-time ISR applications.

major comments (3)

Abstract and §4 (Experiments): The central performance claims (SOTA MUSIQ/CLIPIQA scores and 10x speedup) are stated without any description of the experimental protocol, training details, baseline methods, ablation studies, or error analysis. This prevents verification of whether distribution matching plus pyramid conditioning actually reproduces the multi-step VAR output distribution without coherence loss or new artifacts on real-world degradations.
§3 (Method): The claim that bidirectional cross-scale attention in pyramid conditioning fully compensates for the removal of causal iterative refinement (and thereby avoids error propagation) is load-bearing for the one-step efficiency argument, yet no analysis, visualization, or comparison of global structure preservation (e.g., via LPIPS, FID, or qualitative examples on out-of-distribution degradations) is provided to support it.
§4 (Experiments): Reliance solely on no-reference metrics (MUSIQ, CLIPIQA) on DIV2K does not directly test the weakest assumption that the distilled model avoids mode collapse or hallucination where iterative VAR was stable; reference-based or human preference evaluations on diverse real-world inputs are needed to substantiate the quality claim.

minor comments (2)

[§3.2] Clarify the exact form of the distribution matching loss and how it is optimized in the one-step setting (e.g., which divergence is used and on which features).
[§4] The abstract states 'extensive experiments' but the provided text supplies none; ensure all tables and figures include standard deviations, number of runs, and exact baseline implementations.
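The reporting request in the last comment (standard deviations and number of runs) could be met with a small helper like the one sketched here. The function names and the example values are illustrative only and do not come from the paper.

```python
from statistics import mean, stdev


def summarize_runs(scores):
    """Mean and sample standard deviation of one metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return {"n": len(scores), "mean": mean(scores), "std": stdev(scores)}


def format_cell(summary, digits=2):
    """Render a table cell such as 'mean ± std (n=runs)'."""
    return (
        f"{summary['mean']:.{digits}f} ± {summary['std']:.{digits}f} "
        f"(n={summary['n']})"
    )


if __name__ == "__main__":
    # Placeholder values, NOT results from the paper.
    print(format_cell(summarize_runs([1.0, 2.0, 3.0])))
```

Reporting the run count alongside the spread lets readers judge whether a gap between methods exceeds run-to-run noise.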

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address each major comment below and have revised the manuscript to provide the requested details, analyses, and evaluations.

Referee: Abstract and §4 (Experiments): The central performance claims (SOTA MUSIQ/CLIPIQA scores and 10x speedup) are stated without any description of the experimental protocol, training details, baseline methods, ablation studies, or error analysis. This prevents verification of whether distribution matching plus pyramid conditioning actually reproduces the multi-step VAR output distribution without coherence loss or new artifacts on real-world degradations.

Authors: We agree that the original submission provided insufficient detail on the experimental protocol, which limits independent verification. In the revised manuscript, the abstract has been updated to summarize the evaluation protocol, and Section 4 now includes a dedicated subsection with full training details, baseline implementations, ablation studies, and error analysis. We add quantitative distribution-matching comparisons (via FID and perceptual metrics) on real-world degradations to show that the one-step model reproduces the multi-step VAR output without coherence loss or new artifacts.

revision: yes

Referee: §3 (Method): The claim that bidirectional cross-scale attention in pyramid conditioning fully compensates for the removal of causal iterative refinement (and thereby avoids error propagation) is load-bearing for the one-step efficiency argument, yet no analysis, visualization, or comparison of global structure preservation (e.g., via LPIPS, FID, or qualitative examples on out-of-distribution degradations) is provided to support it.

Authors: We acknowledge that the claim regarding compensation for iterative refinement requires stronger empirical support. The revised Section 3 now incorporates visualizations of attention maps, quantitative comparisons using LPIPS and FID on out-of-distribution degradations, and qualitative examples contrasting the one-step outputs with multi-step VAR. These additions demonstrate that pyramid conditioning with cross-scale attention preserves global structure and mitigates error propagation.

revision: yes

Referee: §4 (Experiments): Reliance solely on no-reference metrics (MUSIQ, CLIPIQA) on DIV2K does not directly test the weakest assumption that the distilled model avoids mode collapse or hallucination where iterative VAR was stable; reference-based or human preference evaluations on diverse real-world inputs are needed to substantiate the quality claim.

Authors: The referee is correct that no-reference metrics alone are insufficient to fully substantiate the absence of mode collapse or hallucination. In the revision, Section 4 has been expanded to include reference-based metrics (LPIPS, FID) where ground-truth is available, plus a human preference study on diverse real-world inputs drawn from multiple datasets beyond DIV2K. These results confirm that the distilled model maintains perceptual quality without introducing hallucinations relative to the original iterative VAR.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external metrics

The paper's central proposal is a distillation framework that applies distribution matching and pyramid conditioning (with cross-scale attention) to convert an iterative VAR model into a one-step ISR model, fine-tuning only adapters. No derivation chain is presented that reduces by construction to its own inputs: there are no equations shown where a fitted parameter is renamed as a prediction, no self-definitional loops, and no load-bearing self-citations or imported uniqueness theorems. Performance is asserted via concrete external metrics (72.32 MUSIQ, 0.7669 CLIPIQA on DIV2K) and a 10x inference speedup, which are falsifiable against held-out data rather than tautological. The method description remains self-contained against benchmarks without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard assumptions from knowledge-distillation and attention literature; no new free parameters, axioms, or invented entities are explicitly introduced beyond the described adapters and conditioning mechanism.

pith-pipeline@v0.9.0 · 5581 in / 1141 out tokens · 36698 ms · 2026-05-09T22:00:37.911885+00:00 · methodology

Published as a conference paper at ICLR 2026.