Canonical reference

Unified autoregressive visual generation and understanding with continuous tokens

Lijie Fan, Luming Tang, Siyang Qin, Tianhong Li, Xuan Yang, Siyuan Qiao, Andreas Steiner, Chen Sun, Yuanzhen Li, Tao Zhu, et al · 2025 · arXiv 2503.13436

Canonical reference. 100% of citing Pith papers cite this work as background.

7 Pith papers citing it

Background 100% of classified citations

read on arXiv browse 7 citing papers

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.

Lance: Unified Multimodal Modeling by Multi-Task Synergy

cs.CV · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

Image Diffusion Preview with Consistency Solver

cs.LG · 2025-12-15 · unverdicted · novelty 6.0

ConsistencySolver enables high-quality low-step diffusion previews by adapting general linear multistep methods into a lightweight RL-optimized solver, matching multistep DPM-Solver FID with 47% fewer steps and cutting user interaction time by nearly 50%.

Show-o2: Improved Native Unified Multimodal Models

cs.CV · 2025-06-18 · unverdicted · novelty 4.0

Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.

Step1X-Edit: A Practical Framework for General Image Editing

cs.CV · 2025-04-24 · unverdicted · novelty 4.0

Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

cs.AI · 2026-04-04

citing papers explorer

Showing 7 of 7 citing papers.

Pseudo-Unification: Entropy Probing Reveals Divergent Information Patterns in Unified Multimodal Models cs.CV · 2026-04-13 · unverdicted · none · ref 15
Unified multimodal models exhibit pseudo-unification due to modality-asymmetric entropy encoding and pattern-split responses between text and image generation.
Lance: Unified Multimodal Modeling by Multi-Task Synergy cs.CV · 2026-05-18 · unverdicted · none · ref 26 · 2 links
Lance presents a dual-stream mixture-of-experts model with modality-aware positional encoding and staged multi-task training that outperforms prior open-source unified models on image and video generation while keeping strong understanding performance.
Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation cs.CV · 2026-04-27 · unverdicted · none · ref 13 · 2 links
Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.
Image Diffusion Preview with Consistency Solver cs.LG · 2025-12-15 · unverdicted · none · ref 8
ConsistencySolver enables high-quality low-step diffusion previews by adapting general linear multistep methods into a lightweight RL-optimized solver, matching multistep DPM-Solver FID with 47% fewer steps and cutting user interaction time by nearly 50%.
Show-o2: Improved Native Unified Multimodal Models cs.CV · 2025-06-18 · unverdicted · none · ref 38
Show-o2 unifies text, image, and video understanding and generation in a single autoregressive-plus-flow-matching model built on 3D causal VAE representations.
Step1X-Edit: A Practical Framework for General Image Editing cs.CV · 2025-04-24 · unverdicted · none · ref 12
Step1X-Edit integrates a multimodal LLM with a diffusion decoder, trained on a custom high-quality dataset, to deliver image editing performance that surpasses open-source baselines and approaches proprietary models on the new GEdit-Bench.
FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning cs.AI · 2026-04-04 · unreviewed · ref 14

Unified autoregressive visual generation and understanding with continuous tokens

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer