DetailFlow: 1D Coarse-to-Fine Autoregressive Image Generation via Next-Detail Prediction. arXiv preprint arXiv:2505.21473, 2025b

Yiheng Liu, Liao Qu, Huichao Zhang, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Xian Li, Shuai Wang, Daniel K Du, et al. · 2025 · arXiv 2505.21473

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2

citation-polarity summary: background 2

representative citing papers

ChannelTok: Efficient Flexible-Length Vision Tokenization

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

ChannelTok introduces channel-wise tokenization with stochastic tail-dropping to achieve rFID 2.92 on ImageNet at 8.6x faster decoding and 2.1x smaller size than prior flexible tokenizers.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

Autoregressive Visual Generation Needs a Prologue

cs.CV · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Prologue adds a small set of learnable tokens trained exclusively with AR cross-entropy loss to decouple generation from reconstruction in autoregressive visual models, yielding lower gFID on ImageNet 256x256.

Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry

cs.CV · 2026-05-25 · unverdicted · novelty 6.0 · 2 refs

MIND integrates discrete patch tokenization into diffusion score functions via soft top-k and dual-branch layers, achieving FID 22.73 (no guidance) and 2.06 (with guidance) on ImageNet-256 after 80 epochs, outperforming DiT and larger LlamaGen models.

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

cs.CV · 2026-06-08 · unverdicted · novelty 5.0

MilliVid compresses video frames into multi-scale token hierarchies and uses coarse-to-fine rollout in a diffusion model to maintain long-range geometric and object consistency on Minecraft videos.

Reward-Forcing: Autoregressive Video Generation with Reward Feedback

cs.CV · 2026-01-23 · unverdicted · novelty 5.0

Reward-Forcing guides autoregressive video generation with reward feedback to achieve performance comparable to teacher-dependent methods on benchmarks like VBench without relying on distillation.

citing papers explorer

Showing 7 of 7 citing papers after filters.

ChannelTok: Efficient Flexible-Length Vision Tokenization · cs.CV · 2026-06-03 · unverdicted · none · ref 23
Diffusing in the Right Space: A Systematic Study of Latent Diffusability · cs.CV · 2026-06-02 · unverdicted · none · ref 54
Autoregressive Visual Generation Needs a Prologue · cs.CV · 2026-05-07 · unverdicted · none · ref 21 · 2 links
Diffusion Image Generation with Explicit Modeling of Data Manifold Geometry · cs.CV · 2026-05-25 · unverdicted · none · ref 32 · 2 links
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations · cs.CV · 2026-04-27 · unverdicted · none · ref 21
MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation · cs.CV · 2026-06-08 · unverdicted · none · ref 21
Reward-Forcing: Autoregressive Video Generation with Reward Feedback · cs.CV · 2026-01-23 · unverdicted · none · ref 15
