super hub Mixed citations

Emu3: Next-Token Prediction is All You Need

Jinsheng Wang, Quan Sun, Xiaosong Zhang, Xinlong Wang, Yufeng Cui, Zhengxiong Luo · 2024 · cs.CV · arXiv 2409.18869

Mixed citation behavior. Most common role is baseline (46%).

138 Pith papers citing it

Baseline 46% of classified citations

open full Pith review browse 138 citing papers more from Jinsheng Wang arXiv PDF

abstract

While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship models such as SDXL and LLaVA-1.6, while eliminating the need for diffusion or compositional architectures. Emu3 is also capable of generating high-fidelity video via predicting the next token in a video sequence. We simplify complex multimodal model designs by converging on a singular focus: tokens, unlocking great potential for scaling both during training and inference. Our results demonstrate that next-token prediction is a promising path towards building general multimodal intelligence beyond language. We open-source key techniques and models to support further research in this direction.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 19 background 18 other 2 dataset 1 method 1

citation-polarity summary

baseline 19 background 18 unclear 2 use dataset 1 use method 1

claims ledger

abstract While next-token prediction is considered a promising path towards artificial general intelligence, it has struggled to excel in multimodal tasks, which are still dominated by diffusion models (e.g., Stable Diffusion) and compositional approaches (e.g., CLIP combined with LLMs). In this paper, we introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction. By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences. Emu3 outperforms several well-established task-spe

authors

Jinsheng Wang Quan Sun Xiaosong Zhang Xinlong Wang Yufeng Cui Zhengxiong Luo

co-cited works

representative citing papers

Flow-GRPO: Training Flow Matching Models via Online RL

cs.CV · 2025-05-08 · unverdicted · novelty 8.0

Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.

MemLearner: Learning to Query Context memory for Video World Models

cs.CV · 2026-06-30 · unverdicted · novelty 7.0

MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.

Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

DLR creates discrete latent tokens from rendered CoT images via clustering, enabling up to 20x compression and interpretable trajectories that outperform continuous latent baselines on reasoning tasks.

Obliviate: Erasing Concepts from Autoregressive Image Generation Models

cs.CV · 2026-06-26 · unverdicted · novelty 7.0

Obliviate erases targeted concepts from autoregressive image generators via KL supervision on visual tokens over full trajectories, cutting nudity rates sharply on benchmarks while keeping general performance.

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

cs.CV · 2026-06-05 · unverdicted · novelty 7.0

AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.

Diffusing in the Right Space: A Systematic Study of Latent Diffusability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.

OctoT2I: A Self-Evolving Agentic Text-to-Image Router

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.

Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

cs.CV · 2026-05-20 · unverdicted · novelty 7.0 · 2 refs

Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.

RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution

cs.CV · 2026-05-20 · conditional · novelty 7.0

RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

cs.CR · 2026-05-19 · conditional · novelty 7.0

ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.

Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.

UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning

cs.MM · 2026-05-12 · unverdicted · novelty 7.0

UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.

ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.

Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models

cs.CV · 2026-04-27 · unverdicted · novelty 7.0

XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.

Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.

Probing Visual Planning in Image Editing Models

cs.CV · 2026-04-23 · unverdicted · novelty 7.0

Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.

Exploring Spatial Intelligence from a Generative Perspective

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K dataset of 59,916 images.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

cs.CV · 2026-04-06 · unverdicted · novelty 7.0

NeuroQuant is a modality-aware 3D VQ-VAE that uses dual-stream encoding, a shared anatomical codebook, and FiLM to achieve superior multi-modal brain MRI reconstruction.

Gen-Searcher: Reinforcing Agentic Search for Image Generation

cs.CV · 2026-03-30 · unverdicted · novelty 7.0 · 2 refs

Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

citing papers explorer

Showing 50 of 138 citing papers.

Flow-GRPO: Training Flow Matching Models via Online RL cs.CV · 2025-05-08 · unverdicted · none · ref 67 · internal anchor
Flow-GRPO is the first online RL method for flow matching models, raising GenEval accuracy from 63% to 95% and text-rendering accuracy from 59% to 92% with little reward hacking.
MemLearner: Learning to Query Context memory for Video World Models cs.CV · 2026-06-30 · unverdicted · none · ref 54 · internal anchor
MemLearner introduces a learning-based adaptive context query method using query tokens in video world models to improve long-term scene consistency over rule-based retrieval.
Why Struggle with Continuous Latents? Interpretable Discrete Latent Reasoning via Rendered Compression cs.CL · 2026-06-29 · unverdicted · none · ref 42 · internal anchor
DLR creates discrete latent tokens from rendered CoT images via clustering, enabling up to 20x compression and interpretable trajectories that outperform continuous latent baselines on reasoning tasks.
Obliviate: Erasing Concepts from Autoregressive Image Generation Models cs.CV · 2026-06-26 · unverdicted · none · ref 45 · internal anchor
Obliviate erases targeted concepts from autoregressive image generators via KL supervision on visual tokens over full trajectories, cutting nudity rates sharply on benchmarks while keeping general performance.
AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens cs.CV · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
AdaTok learns content-dependent token budgets for discrete 1D image tokenization via prioritized representation learning and a GRPO allocation policy, achieving rFID 1.50 at ~118 tokens average versus fixed 256-token baselines.
Imagine Before You Draw: Visual Prompt Engineering for Image Generation cs.CV · 2026-06-03 · unverdicted · none · ref 3 · internal anchor
VPE inserts an internal autoregressive visual semantic token generation step to guide image token production in unified models, reporting faster convergence, higher quality, and superior editing preservation (PSNR 26.76 vs 19.92) versus external alternatives.
Diffusing in the Right Space: A Systematic Study of Latent Diffusability cs.CV · 2026-06-02 · unverdicted · none · ref 71 · internal anchor
A large-scale empirical study across tokenizers and diffusion backbones identifies Velocity Irreducible Variance (VIV) as one of the most stable predictors of latent diffusion generation quality.
OctoT2I: A Self-Evolving Agentic Text-to-Image Router cs.AI · 2026-06-01 · unverdicted · none · ref 46 · internal anchor
OctoT2I uses a no-supervision PSEL loop to discover model capability frontiers and route T2I tasks, reaching 0.96 GenEval score with 90.3% speedup over Flow-GRPO.
Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion cs.AI · 2026-05-28 · unverdicted · none · ref 87 · internal anchor
Mind-Omni unifies seven brain-vision-language tasks in one discrete-diffusion framework with a brain tokenizer and a new BQA dataset, claiming SOTA multi-task performance competitive with larger single-task models.
Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning cs.CV · 2026-05-20 · unverdicted · none · ref 12 · 2 links · internal anchor
Uni-Edit introduces a data synthesis pipeline turning VQA data into reasoning-intensive editing instructions, enabling single-task tuning that boosts all three capabilities in models like BAGEL and Janus-Pro.
RankE: End-to-End Post-Training for Discrete Text-to-Image Generation with Decoder Co-Evolution cs.CV · 2026-05-20 · conditional · none · ref 63 · internal anchor
RankE co-evolves AR policy and decoder via alternating ranking optimization, improving both FID and CLIP scores on LlamaGen-XL and Janus-Pro where policy-only RL degrades FID.
Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models cs.CR · 2026-05-19 · conditional · none · ref 73 · internal anchor
ToBAC is the first backdoor attack on unified autoregressive models, using data or model poisoning to make triggers elicit cross-modal malicious behavior in text and image generation.
Images in Sentences: Scaling Interleaved Instructions for Unified Visual Generation cs.CV · 2026-05-12 · unverdicted · none · ref 37 · internal anchor
INSET embeds images as native tokens in interleaved instructions, outperforming prior methods on multi-image consistency and text alignment as complexity grows.
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm cs.CV · 2026-05-12 · unverdicted · none · ref 54 · 2 links · internal anchor
Proposes V2V-Zero, a training-free framework replacing text conditioning with VLM final-layer hidden states from visual pages, achieving 0.85 on GenEval and 32.7/100 on new Simple-V2V Bench across models including video extension.
UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning cs.MM · 2026-05-12 · unverdicted · none · ref 20 · internal anchor
UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixed strategies.
ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models cs.CV · 2026-05-11 · unverdicted · none · ref 37 · internal anchor
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Beyond Accuracy: Benchmarking Cross-Task Consistency in Unified Multimodal Models cs.CV · 2026-04-27 · unverdicted · none · ref 35 · internal anchor
XTC-Bench reveals that strong performance on generation or understanding tasks in unified multimodal models does not guarantee cross-task semantic consistency, which instead depends on how tightly coupled the learning objectives are across modalities.
Hallo-Live: Real-Time Streaming Joint Audio-Video Avatar Generation with Asynchronous Dual-Stream and Human-Centric Preference Distillation cs.CV · 2026-04-26 · unverdicted · none · ref 44 · internal anchor
Hallo-Live achieves 20.38 FPS real-time text-to-audio-video avatar generation with 0.94s latency using asynchronous dual-stream diffusion and HP-DMD preference distillation, matching teacher model quality at 16x higher throughput.
Probing Visual Planning in Image Editing Models cs.CV · 2026-04-23 · unverdicted · none · ref 29 · internal anchor
Image editing models fail zero-shot visual planning on abstract mazes and queen puzzles but generalize after finetuning, yet still cannot match human zero-shot efficiency.
Exploring Spatial Intelligence from a Generative Perspective cs.CV · 2026-04-22 · unverdicted · none · ref 29 · internal anchor
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.
IAD-Unify: A Region-Grounded Unified Model for Industrial Anomaly Segmentation, Understanding, and Generation cs.CV · 2026-04-14 · unverdicted · none · ref 34 · internal anchor
IAD-Unify unifies industrial anomaly segmentation, region-grounded language understanding, and mask-guided generation in one framework using DINOv2 token injection into Qwen3.5, supported by the new Anomaly-56K dataset of 59,916 images.
Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models cs.CV · 2026-04-13 · unverdicted · none · ref 63 · internal anchor
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI cs.CV · 2026-04-06 · unverdicted · none · ref 32 · internal anchor
NeuroQuant is a modality-aware 3D VQ-VAE that uses dual-stream encoding, a shared anatomical codebook, and FiLM to achieve superior multi-modal brain MRI reconstruction.
Gen-Searcher: Reinforcing Agentic Search for Image Generation cs.CV · 2026-03-30 · unverdicted · none · ref 47 · 2 links · internal anchor
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
A Unified and Controllable Framework for Layered Image Generation with Visual Effects cs.CV · 2026-01-21 · unverdicted · none · ref 50 · internal anchor
LASAGNA produces layered images with integrated visual effects in a single pass, enabling drift-free edits via alpha compositing while releasing a 48K dataset and a 242-sample benchmark.
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation cs.CV · 2026-01-04 · unverdicted · none · ref 61 · internal anchor
GAR-Font is a global-aware autoregressive framework for multimodal few-shot font generation that adds global tokenization, a language-style adapter, and post-refinement to improve style coherence over patch-based methods.
dMLLM-TTS: Self-Verified and Efficient Test-Time Scaling for Diffusion Multi-Modal Large Language Models cs.CV · 2025-12-22 · conditional · none · ref 26 · internal anchor
dMLLM-TTS delivers up to 6x more efficient test-time scaling for diffusion MLLMs via O(N+T) hierarchical search and self-verified feedback, improving generation quality on GenEval across three models.
MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection cs.CV · 2025-11-29 · conditional · none · ref 49 · internal anchor
MVAD is the first comprehensive benchmark dataset for AI-generated multimodal video-audio detection, with three realistic forgery patterns, high-quality outputs from state-of-the-art models, and diversity across visual styles and content categories.
AIA: Rethinking Architecture Decoupling Strategy In Unified Multimodal Model cs.CV · 2025-11-27 · unverdicted · none · ref 39 · internal anchor
AIA loss teaches unified multimodal models task-specific cross-modal attention patterns to reduce conflicts between image understanding and generation without architecture decoupling.
Discrete Guidance Matching: Exact Guidance for Discrete Flow Matching cs.LG · 2025-09-26 · conditional · none · ref 79 · internal anchor
Derives exact guidance transition rates for discrete flow matching models that require only one model evaluation per sampling step and unify prior approximation-based methods.
FluentAvatar: Flicker-Free Talking-Head Animation via Phoneme-Guided Autoregressive Modeling cs.CV · 2025-09-15 · unverdicted · none · ref 24 · internal anchor
Phoneme-guided autoregressive framework for talking-head animation that reduces inter-frame flicker via causal keyframe generation and timestamp-aware interpolation, outperforming diffusion baselines on FVD and a new BG-Flicker metric.
Transfer between Modalities with MetaQueries cs.CV · 2025-04-08 · unverdicted · none · ref 16 · internal anchor
MetaQueries act as an efficient bridge allowing multimodal LLMs to augment diffusion-based image generation and editing without complex training or unfreezing the LLM backbone.
DualToken: Towards Unifying Visual Understanding and Generation with Dual Visual Vocabularies cs.CV · 2025-03-18 · unverdicted · none · ref 45 · internal anchor
DualToken disentangles semantics and appearance via separate codebooks in one tokenizer, reporting 0.25 rFID, 82% ImageNet zero-shot accuracy, and gains over VILA-U on understanding and generation benchmarks.
WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs cs.CV · 2025-02-06 · unverdicted · none · ref 71 · internal anchor
WorldSense provides the first benchmark requiring synergistic audio-video-text understanding on 1,662 real-world videos and 3,172 QA pairs, where the best current multimodal LLM reaches only 65.1% accuracy.
T2I-FactualBench: Benchmarking the Factuality of Text-to-Image Models with Knowledge-Intensive Concepts cs.CV · 2024-12-05 · unverdicted · none · ref 53 · internal anchor
T2I-FactualBench is a new three-tier benchmark for factuality of knowledge-intensive concepts in T2I models, using multi-round VQA evaluation to show SOTA models need improvement.
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation cs.CV · 2024-10-17 · unverdicted · none · ref 83 · internal anchor
Janus decouples visual encoding into task-specific pathways inside a single autoregressive transformer to unify multimodal understanding and generation while outperforming earlier unified models.
UniTac: A Unified Multimodal Model for Cross-Sensor Tactile Understanding and Generation cs.RO · 2026-06-30 · unverdicted · none · ref 39 · internal anchor
UniTac is the first unified multimodal model for cross-sensor tactile understanding and generation, using dual-level representations, two new understanding tasks, and a two-stage training paradigm with sensor-prior sampling to achieve SOTA understanding and realistic cross-sensor generation.
SciIR: A Large-scale Training Dataset and Benchmark for Scientific Image Reasoning Generation cs.CV · 2026-06-29 · unverdicted · none · ref 40 · internal anchor
Introduces SciIR-82k dataset and SciIR-Bench for scientific image reasoning generation organized by Peirce's semiotic triad, with fine-tuning raising model score from 35% to 43%.
Nemotron-Labs-Diffusion-Image: Advancing Masked Discrete Diffusion for High-Resolution Image Synthesis cs.CV · 2026-06-29 · unverdicted · none · ref 9 · internal anchor
A masked discrete diffusion model adds token editing at inference and grouped cross-entropy training to reach 0.90 GenEval, 86.9 DPG, and 10.76 HPSv3 scores.
Mural: Transferring LLM knowledge to image generation via Mixture-of-Transformers cs.CV · 2026-06-27 · unverdicted · none · ref 30 · internal anchor
Mural transfers knowledge from a frozen LLM to text-to-image synthesis via MoT shared attention, achieving 0.85 GenEval, 86.75 DPG-Bench, and 0.66 WISE while exhibiting emergent behaviors without multimodal or reasoning supervision.
COMPASS: Grounding Composition-Intent Guidance in Unified Multimodal Models cs.AI · 2026-06-27 · unverdicted · none · ref 29 · internal anchor
COMPASS is a unified multimodal framework using a shared expert token τ_c to ground composition-intent for both perception and generation, backed by the new Comp-11 dataset.
Lighting-Consistent Object Transfer Across Radiance Fields cs.GR · 2026-06-21 · unverdicted · none · ref 257 · internal anchor
Diffusion-based per-view harmonization for lighting-consistent object transfer between 3DGS scenes, using heterogeneous training data and final 3D consolidation.
OmniGen-AR: AutoRegressive Any-to-Image Generation cs.CV · 2026-06-08 · unverdicted · none · ref 86 · internal anchor
OmniGen-AR is a unified autoregressive framework for any-to-image generation that tokenizes text and visual conditions together and uses disentangled causal attention to support tasks like text-to-image, depth-to-image, image editing, and text-to-video while reporting 0.63 on GenEval and 80.02 on VB
Knowledge Distillation for Visual Autoregressive Models cs.CV · 2026-06-04 · unverdicted · none · ref 46 · internal anchor
VarKD is a distillation framework for visual AR models that uses student samples and selective teacher supervision to reduce token ambiguity, outperforming prior baselines on ImageNet.
ProductWebGen: Benchmarking Multimodal Product Webpage Generation cs.CV · 2026-05-31 · unverdicted · none · ref 30 · internal anchor
Introduces ProductWebGen benchmark for multimodal product webpage generation, compares editing-based vs unified-model workflows on 500 samples, and releases ProductWebGen-1k SFT dataset.
Representation Forcing for Bottleneck-Free Unified Multimodal Models cs.CV · 2026-05-29 · unverdicted · none · ref 48 · internal anchor
Representation Forcing enables end-to-end pixel-space unified multimodal models by making visual representation prediction a native autoregressive generation target that guides subsequent pixel diffusion in the same backbone.
Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models cs.CV · 2026-05-29 · unverdicted · none · ref 46 · internal anchor
Lumos-Nexus is a training-efficient video generation framework using two-stage alignment of a lightweight model followed by progressive frequency bridging to a high-fidelity generator in homogeneous latent space, plus the new VR-Bench for reasoning evaluation.
Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization cs.CV · 2026-05-28 · unverdicted · none · ref 38 · internal anchor
GCPO performs per-token credit assignment in discrete policy optimization by setting token advantages proportional to the difference in model predictions under positive versus negative prompts, outperforming GRPO and DAPO on text-to-image and chain-of-thought tasks.
DriveWAM: Video Generative Priors Enable Scalable World-Action Modeling for Autonomous Driving cs.CV · 2026-05-27 · unverdicted · none · ref 8 · internal anchor
DriveWAM converts video generative priors into a unified video-action policy for driving, reporting strong benchmark performance and positive scaling from 4k to 100k clips.
Reinforcing Few-step Generators via Reward-Tilted Distribution Matching cs.CV · 2026-05-25 · unverdicted · none · ref 73 · internal anchor
RTDMD unifies KL minimization to a reward-tilted teacher into distribution matching plus reward terms, using AC-DMD in stage one and hybrid GRPO-style gradients plus SubGRPO in stage two to reach new SOTA on preference, aesthetic, and compositional metrics with 4-step generation on SD3, SD3.5, and F

Emu3: Next-Token Prediction is All You Need

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer