hub

Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711

NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al · 2025 · arXiv 2508.10711

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 baseline 1 other 1

citation-polarity summary

background 2 baseline 1 unclear 1

representative citing papers

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing

cs.CR · 2026-05-11 · unverdicted · novelty 7.0

Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.

Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

cs.CV · 2026-03-30 · unverdicted · novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks

cs.CV · 2026-02-06 · unverdicted · novelty 7.0

PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.

FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation

cs.CV · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.

VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

Generative Refinement Networks for Visual Synthesis

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.

Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

cs.CV · 2026-04-09 · unverdicted · novelty 6.0

Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.

MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

cs.CV · 2026-04-08 · unverdicted · novelty 6.0

MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.

From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation

cs.LG · 2026-03-12 · unverdicted · novelty 6.0

EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding state-of-the-art results on standard benchmarks.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens

cs.CV · 2026-02-12 · unverdicted · novelty 6.0

LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.

Steering Visual Generation in Unified Multimodal Models with Understanding Supervision

cs.CV · 2026-05-07 · unverdicted · novelty 5.0

Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

citing papers explorer

Showing 12 of 12 citing papers.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning cs.CV · 2026-05-20 · unverdicted · none · ref 26 · 2 links
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
Generate "Normal", Edit Poisoned: Branding Injection via Hint Embedding in Image Editing cs.CR · 2026-05-11 · unverdicted · none · ref 8
Invisible hints such as logos embedded in images are re-rendered by diffusion models during text-guided editing, enabling phishing and model-poisoning attacks with average success rates of 44.4% and 32.2%.
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting cs.CV · 2026-03-30 · unverdicted · none · ref 31
Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks cs.CV · 2026-02-06 · unverdicted · none · ref 38
PlanViz is a new benchmark with three sub-tasks and PlanScore metric to evaluate planning-oriented image generation and editing by unified multimodal models for computer-use tasks.
FlashAR: Efficient Post-Training Acceleration for Autoregressive Image Generation cs.CV · 2026-05-10 · unverdicted · none · ref 17 · 2 links
FlashAR accelerates autoregressive image generation up to 22.9x by post-training a pre-trained raster-scan model with a complementary vertical head and dynamic fusion for two-way next-token prediction.
VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations cs.CV · 2026-04-27 · unverdicted · none · ref 38
VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
Generative Refinement Networks for Visual Synthesis cs.CV · 2026-04-14 · unverdicted · none · ref 51
GRN uses hierarchical binary quantization and entropy-guided refinement to set new ImageNet records of 0.56 rFID for reconstruction and 1.81 gFID for class-conditional generation while releasing code and models.
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator cs.CV · 2026-04-09 · unverdicted · none · ref 12
Uni-ViGU unifies video generation and understanding by extending a diffusion video generator with unified continuous-discrete flow matching, modality-driven MoE layers, and bidirectional training stages that repurpose generative knowledge for discriminative tasks.
MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation cs.CV · 2026-04-08 · unverdicted · none · ref 28
MAR-GRPO stabilizes GRPO for AR-diffusion hybrids via multi-trajectory expectation and uncertainty-based token selection, yielding better visual quality, stability, and spatial understanding than baselines.
From Broad Exploration to Stable Synthesis: Entropy-Guided Optimization for Autoregressive Image Generation cs.LG · 2026-03-12 · unverdicted · none · ref 24
EG-GRPO improves autoregressive text-to-image models by reallocating RL updates according to token entropy, excluding low-entropy tokens from reward signals while adding entropy bonuses to high-entropy ones, yielding state-of-the-art results on standard benchmarks.
LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens cs.CV · 2026-02-12 · unverdicted · none · ref 62
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
Steering Visual Generation in Unified Multimodal Models with Understanding Supervision cs.CV · 2026-05-07 · unverdicted · none · ref 47
Using understanding tasks as direct supervision during post-training improves image generation and editing in unified multimodal models.

Nextstep-1: Toward autoregressive image generation with continuous tokens at scale.arXiv preprint arXiv:2508.10711

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer