pith. machine review for the scientific record.

arxiv: 2505.18600 · v3 · submitted 2025-05-24 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Chain-of-Zoom: Extreme Super-Resolution via Scale Autoregression and Preference Alignment

classification 💻 cs.CV cs.AI cs.LG
keywords chain-of-zoom, model, beyond, extreme, high, multi-scale-aware, preference, prompts

Modern single-image super-resolution (SISR) models deliver photo-realistic results at the scale factors on which they are trained, but collapse when asked to magnify far beyond that regime. We address this scalability bottleneck with Chain-of-Zoom (CoZ), a model-agnostic framework that factorizes SISR into an autoregressive chain of intermediate scale-states with multi-scale-aware prompts. CoZ repeatedly re-uses a backbone SR model, decomposing the conditional probability into tractable sub-problems to achieve extreme resolutions without additional training. Because visual cues diminish at high magnifications, we augment each zoom step with multi-scale-aware text prompts generated by a vision-language model (VLM). The prompt extractor itself is fine-tuned using Generalized Reward Policy Optimization (GRPO) with a critic VLM, aligning text guidance towards human preference. Experiments show that a standard 4x diffusion SR model wrapped in CoZ attains beyond 256x enlargement with high perceptual quality and fidelity. Project Page: https://bryanswkim.github.io/chain-of-zoom/.
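The scale arithmetic behind the abstract's claim can be checked directly: four passes of a 4x model give 4^4 = 256x. The sketch below illustrates the general shape of such an autoregressive chain under stated assumptions; it is not the authors' implementation. The names `chain_of_zoom`, `steps_needed`, `nearest_upscale`, `sr_step`, and `make_prompt` are hypothetical. A nearest-neighbour upscaler stands in for the diffusion SR backbone, a trivial string stands in for the VLM-generated prompt, and passing every earlier state to `make_prompt` is one guess at what "multi-scale-aware" means in practice.

```python
from typing import Callable, List

Image = List[List[float]]  # toy grayscale image stored as nested lists


def steps_needed(target_scale: int, step_scale: int) -> int:
    """Count how many passes of a step_scale-x model reach at least target_scale-x."""
    steps, total = 0, 1
    while total < target_scale:
        total *= step_scale
        steps += 1
    return steps


def nearest_upscale(img: Image, factor: int) -> Image:
    """Hypothetical stand-in for a backbone SR model: nearest-neighbour upscaling."""
    return [
        [px for px in row for _ in range(factor)]
        for row in img
        for _ in range(factor)
    ]


def chain_of_zoom(
    image: Image,
    sr_step: Callable[[Image, str], Image],
    make_prompt: Callable[[List[Image]], str],
    num_steps: int,
) -> List[Image]:
    """Reuse one SR model num_steps times, keeping every intermediate scale-state."""
    states = [image]
    for _ in range(num_steps):
        # Assumption: the prompt extractor may look at all earlier scale-states.
        prompt = make_prompt(states)
        # Each state is produced from the previous one, so the chain is autoregressive.
        states.append(sr_step(states[-1], prompt))
    return states


# Toy run: a 2x2 image pushed through four 4x passes.
toy = [[0.0, 1.0], [1.0, 0.0]]
result = chain_of_zoom(
    toy,
    sr_step=lambda img, prompt: nearest_upscale(img, 4),
    make_prompt=lambda states: "zoom level %d" % len(states),
    num_steps=steps_needed(256, 4),
)
```

In a real pipeline `sr_step` would call the pretrained SR model with the text prompt as conditioning, and `make_prompt` would query the VLM; only the loop structure carries over from this toy version.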

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Tiled Prompts: Overcoming Prompt Misguidance in Image and Video Super-Resolution

    cs.CV · 2026-02 · unverdicted · novelty 7.0

    Tiled Prompts generates tile-specific text prompts for each latent tile in diffusion super-resolution to reduce errors from global prompts and improve perceptual quality.