arxiv: 2604.28185 · v1 · submitted 2026-04-30 · 💻 cs.CV

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang

Pith reviewed 2026-05-07 06:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual generation, world modeling, agentic AI, generative models, taxonomy, evaluation benchmarks, diffusion models, causal reasoning

The pith

Visual generation must advance from basic image synthesis to agentic world modeling that respects structure, dynamics, and causal relations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recent visual generation models produce photorealistic outputs but lack reliable spatial reasoning, long-term consistency, and causal understanding. The paper argues the field needs to shift toward intelligent generation where outputs are grounded in physical structure, temporal dynamics, domain knowledge, and cause-effect relations rather than appearance alone. To guide this shift, it introduces a five-level taxonomy that progresses from simple atomic mapping through conditional, in-context, and agentic stages up to full world-modeling generation. The authors review technical drivers such as flow matching and unified models while critiquing how standard benchmarks emphasize perceptual quality and miss deeper failures. This roadmap aims to reorient research toward systems that can act as interactive world simulators.

Core claim

The paper establishes that visual generation evolves along five distinct capability levels—Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation—moving from passive renderers to interactive, agentic, world-aware generators. Progress requires integrating techniques like flow matching, unified understanding-generation architectures, improved representations, reward modeling, and synthetic data, while current evaluations systematically overestimate advancement by overlooking structural, temporal, and causal errors revealed through stress tests and expert case studies.

What carries the argument

A five-level taxonomy that classifies visual generation systems by increasing degrees of intelligence, from basic atomic mapping of pixels to full agentic world modeling that maintains persistent states, causal relations, and interactive dynamics.
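
As a reading aid, the ordering of the five levels can be sketched as an ordered enumeration. This is a minimal illustration, not something the paper provides: the level names come from the abstract, while the integer values, the `GenerationLevel` class, and the `levels_between` helper are assumptions introduced here.

```python
from enum import IntEnum


class GenerationLevel(IntEnum):
    """The five capability levels named in the abstract, lowest to highest.

    The integer values are an illustrative assumption: the paper names and
    orders the levels, but the numbering here is only for comparison.
    """
    ATOMIC = 1
    CONDITIONAL = 2
    IN_CONTEXT = 3
    AGENTIC = 4
    WORLD_MODELING = 5

    @property
    def label(self) -> str:
        # Human-readable name exactly as written in the abstract.
        return _LABELS[self]


_LABELS = {
    GenerationLevel.ATOMIC: "Atomic Generation",
    GenerationLevel.CONDITIONAL: "Conditional Generation",
    GenerationLevel.IN_CONTEXT: "In-Context Generation",
    GenerationLevel.AGENTIC: "Agentic Generation",
    GenerationLevel.WORLD_MODELING: "World-Modeling Generation",
}


def levels_between(start: GenerationLevel, end: GenerationLevel) -> list:
    """Hypothetical helper: levels from start up to end, inclusive."""
    return [lvl for lvl in GenerationLevel if start <= lvl <= end]


if __name__ == "__main__":
    for lvl in GenerationLevel:
        print(int(lvl), lvl.label)
```

Using an ordered type makes "is at least agentic" a simple comparison, though whether real systems fall cleanly on one rung is exactly what the referee report questions.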

If this is right

Models will require unified architectures that jointly handle visual understanding and generation rather than separate pipelines.
Training will shift emphasis toward data curation, synthetic data distillation, and reward modeling that penalize causal inconsistencies.
Evaluation protocols must incorporate stress tests for spatial reasoning and long-term temporal coherence beyond single-frame quality.
Sampling acceleration techniques will become essential to support real-time agentic interaction and iterative world updates.
Post-training methods will focus on aligning outputs with domain knowledge and physical plausibility constraints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Adopting the taxonomy could redirect funding and benchmarks away from single-image photorealism toward simulators that support planning and intervention.
The same progression might apply to other modalities such as audio or 3D scene generation, suggesting a general pattern for multimodal world models.
If world-modeling generation succeeds, downstream applications like robotics simulation and video game engines could merge with generative AI systems.
The framework implies testable predictions about which technical drivers most accelerate movement between levels.

Load-bearing premise

That the five-level taxonomy correctly captures the necessary stages of the field's evolution, and that standard evaluations fail to detect structural, temporal, and causal shortcomings.

What would settle it

A benchmark suite that scores current top models on long-horizon causal consistency and persistent state tracking in generated scenes, where performance drops sharply compared with existing perceptual metrics.
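
One way to make "drops sharply" concrete is to compare each model's perceptual score with its causal-consistency score and flag large relative gaps. The sketch below assumes both scores are normalized to [0, 1] with higher meaning better. The `ModelScores` type, the `relative_drop` and `flag_overestimated` functions, the 0.3 threshold, and the toy values are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScores:
    """Hypothetical per-model scores, both assumed normalized to [0, 1]."""
    name: str
    perceptual: float  # e.g. a rescaled perceptual-quality metric
    causal: float      # e.g. a long-horizon causal-consistency score


def relative_drop(s: ModelScores) -> float:
    """Fraction of the perceptual score lost on the causal benchmark."""
    if s.perceptual <= 0:
        raise ValueError("perceptual score must be positive")
    return (s.perceptual - s.causal) / s.perceptual


def flag_overestimated(scores, threshold=0.3):
    """Names of models whose relative drop meets the (assumed) threshold."""
    return [s.name for s in scores if relative_drop(s) >= threshold]


if __name__ == "__main__":
    # Toy placeholder values, not measurements of any real model.
    toy = [
        ModelScores("toy-a", perceptual=0.9, causal=0.45),
        ModelScores("toy-b", perceptual=0.8, causal=0.72),
    ]
    print(flag_overestimated(toy))
```

A real test would also need the two metrics to be comparable in scale, which this sketch simply assumes away.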

Original abstract

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This roadmap paper introduces a five-level taxonomy for visual generation that usefully organizes the shift toward agentic and world-modeling systems, though its claims about evaluation gaps rest more on synthesis than new quantified evidence.

The main contribution is a new five-level taxonomy (Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation) that frames progress from passive rendering to systems handling structure, dynamics, and causality. The authors argue that current models still fall short on spatial reasoning and long-horizon consistency despite gains in photorealism and instruction following. They point to drivers like flow matching, unified models, reward modeling, and synthetic data as the path forward.

This framing is clean and pulls together trends that many in the field have so far seen only piecemeal. The review of benchmarks, together with in-the-wild stress tests and case studies, gives the critique some concrete footing, which is more than a pure opinion piece would offer. The taxonomy itself is the clearest addition; it is not derived from equations but serves as an organizing lens that could help set research priorities.

The softer spots are the evaluation claims. The paper asserts that standard metrics miss structural and causal failures, and while the abstract mentions supporting reviews, the strength of that claim depends on how detailed those breakdowns turn out to be in the full text. Without substantial new data or controlled experiments, the argument stays interpretive rather than definitive. There is no circular math and there are no invented entities here, just a synthesis.

This is for researchers working on generative vision, simulation, or agentic systems who want a forward map rather than a new algorithm. Readers already deep in the literature will find the taxonomy most useful as a discussion tool. It deserves peer review because the perspective is coherent and the technical drivers are accurately identified; referees can push on the evidence for the evaluation gaps and on whether the levels map cleanly to real capability jumps. I would send it out.

Referee Report

2 major / 3 minor

Summary. The paper claims that recent visual generation models excel in photorealism, typography, instruction following, and interactive editing but continue to struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. It argues for a shift from appearance synthesis to intelligent visual generation grounded in structure, dynamics, domain knowledge, and causal relations. To frame this evolution, the manuscript introduces a five-level taxonomy progressing from Atomic Generation through Conditional, In-Context, and Agentic Generation to World-Modeling Generation. It analyzes technical drivers including flow matching, unified understanding-and-generation models, improved representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. The work further contends that current evaluations overestimate progress by prioritizing perceptual quality over structural, temporal, and causal failures, supported by a combination of benchmark review, in-the-wild stress tests, and expert-constrained case studies.

Significance. If the taxonomy and evaluation critique hold, this roadmap could meaningfully shape research priorities in computer vision by offering a capability-centered organizing framework that moves the field beyond passive rendering toward interactive, knowledge-aware systems. The synthesis of drivers such as flow matching and unified models provides practical guidance for practitioners, while the call for better evaluation practices could stimulate development of more diagnostic benchmarks. As a conceptual synthesis rather than an empirical contribution, its primary value lies in clarifying the progression toward agentic and world-modeling capabilities and highlighting gaps that current perceptual metrics miss.

major comments (2)

§3 (Taxonomy definition): The boundary between Agentic Generation and World-Modeling Generation is not sharply delineated; both levels invoke interaction, domain knowledge, and causal relations, risking overlap that could undermine the taxonomy's utility as a precise organizing lens for classifying models or tracking progress.
§4 (Evaluation section): The central claim that evaluations systematically overestimate progress by missing structural, temporal, and causal failures is supported only by a high-level description of benchmark review, stress tests, and case studies; without concrete failure statistics, quantitative comparisons to perceptual metrics, or tabulated examples of specific models' shortcomings, the critique remains qualitative and less load-bearing for the argument.

minor comments (3)

Abstract: The phrase 'expert-constrained case studies' is introduced without clarifying the nature of the constraints or how expert input was incorporated, which could affect reproducibility and reader interpretation of the evaluation methodology.
Introduction and §2: Several technical drivers (e.g., flow matching, reward modeling) are listed without brief one-sentence definitions or pointers to foundational references on first mention, reducing accessibility for readers less familiar with the sub-area.
References: The analysis of unified understanding-and-generation models would benefit from explicit citations to the specific works discussed, ensuring the literature synthesis is fully traceable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback. We appreciate the recognition that the taxonomy and evaluation critique could help shape research priorities in visual generation. We address each major comment below and will revise the manuscript accordingly to improve clarity and evidentiary support.

Referee: §3 (Taxonomy definition): The boundary between Agentic Generation and World-Modeling Generation is not sharply delineated; both levels invoke interaction, domain knowledge, and causal relations, risking overlap that could undermine the taxonomy's utility as a precise organizing lens for classifying models or tracking progress.

Authors: We thank the referee for identifying this potential source of overlap. We agree that sharper boundaries will strengthen the taxonomy's utility. In the revised manuscript, we will expand §3 with an explicit comparison table across five dimensions (interaction scope, knowledge integration, state persistence, causal reasoning depth, and evaluation focus) to differentiate the levels. Agentic Generation will be characterized as goal-directed interaction with external tools or environments for task completion, while World-Modeling Generation will be defined by the construction and maintenance of an internal, causally consistent simulation supporting prediction and long-horizon coherence. We will also add one or two additional model exemplars per level to illustrate the distinctions without altering the overall five-level progression.

Revision: yes

Referee: §4 (Evaluation section): The central claim that evaluations systematically overestimate progress by missing structural, temporal, and causal failures is supported only by a high-level description of benchmark review, stress tests, and case studies; without concrete failure statistics, quantitative comparisons to perceptual metrics, or tabulated examples of specific models' shortcomings, the critique remains qualitative and less load-bearing for the argument.

Authors: We acknowledge that the evaluation critique would benefit from more concrete presentation. The manuscript already reviews specific benchmark limitations (e.g., reliance on FID and CLIP scores that overlook spatial and causal errors) and describes failure modes from in-the-wild tests and expert case studies. To make this evidence more load-bearing, we will insert a new summary table in the revised §4 that tabulates representative models, their reported perceptual scores, and the structural/temporal/causal shortcomings observed in the stress tests. This table will draw directly from the existing benchmark review and case studies rather than introducing new experiments, thereby preserving the paper's scope as a conceptual synthesis while improving the traceability and impact of the argument.

Revision: yes

Circularity Check

0 steps flagged

No circularity: taxonomy is an external organizing framework with no derivations or self-referential reductions

The paper is a survey and roadmap proposing a five-level taxonomy (Atomic Generation through World-Modeling Generation) to frame a shift toward intelligent visual generation. No equations, fitted parameters, predictions, or uniqueness theorems appear in the abstract or described content. The taxonomy functions as an interpretive lens rather than a result derived from internal definitions, self-citations, or data fits. Technical driver analysis and evaluation critiques rely on literature synthesis and stated benchmark reviews without reducing to the paper's own inputs by construction. This is a standard non-circular conceptual contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central argument rests on domain assumptions about current model shortcomings and introduces new conceptual categories in the taxonomy without independent empirical grounding or falsifiable predictions in the abstract.

axioms (2)

domain assumption: Recent visual generation models struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding.
This limitation is stated directly in the abstract as the motivation for the proposed shift and taxonomy.
domain assumption: Current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures.
Asserted in the abstract as a key observation from benchmark review and case studies, but without the actual review details provided here.

invented entities (2)

Atomic Generation (no independent evidence)
purpose: Lowest level in the taxonomy representing passive, appearance-focused renderers.
Newly defined category in the five-level framework introduced by the authors.
World-Modeling Generation (no independent evidence)
purpose: Highest level representing interactive, agentic generators with full world awareness and causal understanding.
Newly defined target state in the taxonomy as the goal for intelligent visual generation.

pith-pipeline@v0.9.0 · 5599 in / 1575 out tokens · 90380 ms · 2026-05-07T06:33:59.071699+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...

WorldReasonBench: Human-Aligned Stress Testing of Video Generators as Future World-State Predictors
cs.CV 2026-05 unverdicted novelty 6.0

The paper presents WorldReasonBench, a benchmark that tests video generators on maintaining physical, social, logical, and informational consistency when predicting future states from initial conditions and actions.

Reference graph

Works this paper leans on

104 extracted references · 99 canonical work pages · cited by 2 Pith papers · 36 internal anchors

[1]

World Simulation with Video Foundation Models for Physical AI

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062,

work page internal anchor Pith review arXiv
[2]

Wasserstein GAN

Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein gan, 2017.https://arxiv.org/abs/1701.07875. Artificial Analysis. Artificial analysis image arena.https://artificialanalysis.ai/text-to-image,

work page Pith review arXiv 2017
[3]

Dream to manipulate: Compositional world models empowering robot imitation learning with imagination, 2025

Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination.arXiv preprint arXiv:2412.14957,

work page arXiv
[4]

Dragon: A large-scale dataset of realistic images generated by diffusion models.arXiv preprint arXiv:2505.11257, 2025a

Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, and Alessandro Piva. Dragon: A large-scale dataset of realistic images generated by diffusion models.arXiv preprint arXiv:2505.11257, 2025a. Giulia Bertazzini, Daniele Baracchi, Dasara Shullani, Isao Echizen, and Alessandro Piva. Dragon: A large-scale dataset of realistic images generated...

work page arXiv
[5]

Training Diffusion Models with Reinforcement Learning

Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301,

work page internal anchor Pith review arXiv
[6]

Large Scale GAN Training for High Fidelity Natural Image Synthesis

Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis, 2019.https://arxiv.org/abs/1809.11096. Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition...

work page internal anchor Pith review arXiv 2019
[7]

Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer

Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, et al. Z-image: An efficient image generation foundation model with single-stream diffusion transformer. arXiv preprint arXiv:2511.22699,

work page internal anchor Pith review arXiv
[8]

Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951, 2025

Siyu Cao, Hangting Chen, Peng Chen, Yiji Cheng, Yutao Cui, Xinchi Deng, Ying Dong, Kipper Gong, Tianpeng Gu, Xiusen Gu, et al. Hunyuanimage 3.0 technical report.arXiv preprint arXiv:2509.23951,

work page arXiv
[10]

Chameleon: Mixed-Modal Early-Fusion Foundation Models

https://arxiv.org/abs/2405.09818. 109 Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shijie Zhou, Weilin Huang, Gordon Wetzstein, Mohammad Soleymani, and Peng Wang. Bytemorph: Benchmarking instruction-guided image editing with non-rigid motions. arXiv preprint arXiv:2506.03107, 2025a. Di Chang, Mingdeng Cao, Yichun Shi, Bo Liu, Shengqu Cai, Shij...

work page internal anchor Pith review arXiv
[11]

Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025a. Haoyu Chen, Xiaojie Xu, Wenbo Li, Jingjing Ren, Tian Ye, Songhua Liu, Ying-Cong Chen, Lei Zhu, and Xinchao Wang...

work page internal anchor Pith review arXiv 2025
[12]

δ-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024b

Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen. δ-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125, 2024b. Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Peter Sushko, Gaoyang Jiang, Yao Wan, and Ranjay Krishna. M...

work page doi:10.1145/3746027.3758292
[13]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets, 2016.https://arxiv.org/abs/1606. 03657. Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal...

work page internal anchor Pith review arXiv 2016
[14]

PaddleOCR 3.0 Technical Report

Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, Yue Zhang, Wenyu Lv, Kui Huang, Yichao Zhang, Jing Zhang, Jun Zhang, Yi Liu, Dianhai Yu, and Yanjun Ma. Paddleocr 3.0 technical report, 2025.https://arxiv.org/abs/2507.05595. Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanj...

work page internal anchor Pith review arXiv 2025
[15]

Emerging Properties in Unified Multimodal Pretraining

https://arxiv.org/abs/2505.14683. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee,

work page internal anchor Pith review arXiv
[16]

Prism: A unified framework for photorealistic reconstruction and intrinsic scene modeling, 2025.https://arxiv.org/abs/2504.14219

Alara Dirik, Tuanfeng Wang, Duygu Ceylan, Stefanos Zafeiriou, and Anna Frühstück. Prism: A unified framework for photorealistic reconstruction and intrinsic scene modeling, 2025.https://arxiv.org/abs/2504.14219. Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-...

work page arXiv 2025
[17]

and Barry Zhang

Erik S. and Barry Zhang. Building effective agents. Anthropic Engineering Blog, December 2024.https://www. anthropic.com/engineering/building-effective-agents. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lac...

2024
[18]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

https://arxiv.org/abs/2403.03206. Jiacheng Fan, Zhiyue Zhao, Yiqian Zhang, Chao Chen, Peide Wang, Hengdi Zhang, and Zhengxue Cheng. Robopaint: From human demonstration to any robot and any view.arXiv preprint arXiv:2602.05325,

work page internal anchor Pith review arXiv
[19]

Tinyfusion: Diffusion transformers learned shallow

111 Gongfan Fang, Kunjun Li, Xinyin Ma, and Xinchao Wang. Tinyfusion: Diffusion transformers learned shallow. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 18144–18154, 2025a. HaoFang, ZechaoZhan, WeixinFeng, ZiweiHuang, XubinLi, andTiezhengGe. Tbstar-edit: Fromimageeditingpattern shifting to consistency enhancement.ArXiv...

work page arXiv
[20]

One Step Diffusion via Shortcut Models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557,

work page internal anchor Pith review arXiv
[21]

arXiv preprint arXiv:2506.01943 , year=

Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control.arXiv preprint arXiv:2506.01943,

work page arXiv
[22]

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion.arXiv preprint arXiv:2208.01618,

work page internal anchor Pith review arXiv
[23]

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al

doi: 10.1109/TPAMI.2025.3610614. Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346,

work page doi:10.1109/tpami.2025.3610614 2025
[24]

Seed-data-edit technical report: A hybrid dataset for instructional image editing

Yuying Ge, Sijie Zhao, Chen Li, Yixiao Ge, and Ying Shan. Seed-data-edit technical report: A hybrid dataset for instructional image editing.arXiv preprint arXiv:2405.04007,

work page arXiv
[25]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling.arXiv preprint arXiv:2505.13447, 2025a. Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, Linus, Di Wang, and Jie Jiang. X-omni: Reinforcement learning makes discrete a...

work page internal anchor Pith review arXiv 2025
[26]

Generative Adversarial Networks

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.https://arxiv.org/abs/1406.2661. Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron Courville. Improved training of wasserstein gans, 2017.https://arxiv.org/abs/1...

work page internal anchor Pith review arXiv 2014
[27]

Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, and Lu Jiang. Vision as a dialect: Unifying visual understanding and generation via text-aligned representations.arXiv preprint arXiv:2506.18898, 2025.https://arxiv.org/abs/2506.18898. Junjie He, Yifeng Geng, and Liefeng Bo. Uniportrait: A unified framework for identit...

work page arXiv 2025
[28]

arXiv preprint arXiv:2603.28088 , year=

Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, and Yang Yang. Gems: Agent-native multimodal generation with memory and skills.arXiv preprint arXiv:2603.28088,

work page arXiv
[29]

Classifier-Free Diffusion Guidance

Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022.https: //arxiv.org/abs/2207.12598. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851,

work page internal anchor Pith review arXiv 2022
[30]

GAIA-1: A Generative World Model for Autonomous Driving

Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. GAIA-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080,

work page internal anchor Pith review arXiv
[31]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024a. Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy w...

work page internal anchor Pith review arXiv 2025
[32]

The GAN is dead; long live the GAN! A modern GAN baseline.arXiv preprint arXiv:2501.05441, 2025a.https://arxiv.org/abs/2501.05441

Yiwen Huang, Aaron Gokaslan, Volodymyr Kuleshov, and James Tompkin. The GAN is dead; long live the GAN! A modern GAN baseline.arXiv preprint arXiv:2501.05441, 2025a.https://arxiv.org/abs/2501.05441. Zehuan Huang, Yuan-Chen Guo, Haoran Wang, Ran Yi, Lizhuang Ma, Yan-Pei Cao, and Lu Sheng. Mv-adapter: Multi-view consistent image generation made easy. InProc...

work page arXiv
[33]

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models.arXiv preprint arXiv:2309.14509,

work page internal anchor Pith review arXiv
[34]

Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models. arXiv preprint arXiv:2505.12705,

work page arXiv
[35]

COLE: A hierarchical generation framework for graphic design.arXiv preprint arXiv:2311.16974,

Peidong Jia, Chenxuan Li, Zeyu Liu, Yichao Shen, Xingru Chen, Yuhui Yuan, Yinglin Zheng, Dong Chen, Ji Li, Xiaodong Xie, et al. COLE: A hierarchical generation framework for graphic design.arXiv preprint arXiv:2311.16974,

work page arXiv
[36]

Lego-edit: A general image editing framework with model-level bricks and mllm builder.ArXiv, abs/2509.12883, 2025.https://api.semanticscholar.org/CorpusID:281325583

Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, and Guoquan Zhang. Lego-edit: A general image editing framework with model-level bricks and mllm builder.ArXiv, abs/2509.12883, 2025.https://api.semanticscholar.org/CorpusID:281325583. Bowen Jiang, Yuan Yuan, Xinyi Bai, Zhuoqun Hao, Alyson Yin, Yaojie Hu, Wenyu Liao...

[37]

pdf, accessed 2026-04-22

Available at https://joyai-image.s3.cn-north-1.jdcloud-oss.com/JoyAI-Image.pdf, accessed 2026-04-22. Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up gans for text-to-image synthesis. In CVPR,

[38]

A style-based generator architecture for generative adversarial networks

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks, 2019. https://arxiv.org/abs/1812.04948. Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems, volume 35,

[39]

Gen2sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308,

Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308,

[40]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945,

[41]

Learning to act from actionless videos through dense correspondences

Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B Tenenbaum. Learning to act from actionless videos through dense correspondences. arXiv preprint arXiv:2310.08576,

[42]

FLUX.1 Kontext: Flow Matching for In-Context Image Generation and Editing in Latent Space

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context image ...

[43]

Masquerade: Learning from in-the-wild human videos using data-editing

Marion Lepert, Jiaying Fang, and Jeannette Bohg. Masquerade: Learning from in-the-wild human videos using data-editing. arXiv preprint arXiv:2508.09976, 2025a. Marion Lepert, Jiaying Fang, and Jeannette Bohg. Phantom: Training robots without robots using only human videos. arXiv preprint arXiv:2503.00779, 2025b. Guangrun Li, Yaoxu Lyu, Zhuoyang Liu, Chengk...

[44]

Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent

Yunlong Lin, Zixu Lin, Kunjie Lin, Jinbin Bai, Panwang Pan, Chenxin Li, Haoyu Chen, Zhongdao Wang, Xinghao Ding, Wenbo Li, et al. Jarvisart: Liberating human artistic creativity via an intelligent photo retouching agent. arXiv preprint arXiv:2506.17612,

[45]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2023. https://arxiv.org/abs/2210.02747. Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yi Xin, Xinyue Li, Qi Qin, Yu Qiao, Hongsheng Li, and Peng Gao. Lumina-mgpt: Illuminate flexible photorealistic text-to-image generation with multimod...

[46]

Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

Blog post. Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081,

[47]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. In Advances in Neural Information Processing Systems, volume 35, 2022a. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of ...

[48]

Beyond flat text: Dual self-inherited guidance for visual text generation

https://api.semanticscholar.org/CorpusID:279070661. Minxing Luo, Zixun Xia, Liaojun Chen, Zhenhang Li, Weichao Zeng, Jianye Wang, Wentao Cheng, Yaxing Wang, Yu ZHOU, and Jian Yang. Beyond flat text: Dual self-inherited guidance for visual text generation. ArXiv, abs/2501.05892, 2025a. https://api.semanticscholar.org/CorpusID:275458598. Xin Luo, Jiahao Wang...

[49]

Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models

Jian Ma, Yonglin Deng, Chen Chen, Nanyang Du, Haonan Lu, and Zhenyu Yang. Glyphdraw2: Automatic generation of complex glyph posters with diffusion models and large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 5955–5963, 2025a. Jiancang Ma, Xujie Zhu, Zihao Pan, Qirong Peng, Xu Guo, Chen Chen, and H...

[50]

Shapesplat: A large-scale dataset of gaussian splats and their self-supervised pretraining. arXiv preprint arXiv:2408.10906, 2024b

Qi Ma, Yue Li, Bin Ren, Nicu Sebe, Ender Konukoglu, Theo Gevers, Luc Van Gool, and Danda Pani Paudel. Shapesplat: A large-scale dataset of gaussian splats and their self-supervised pretraining. arXiv preprint arXiv:2408.10906, 2024b. Qianli Ma, Yaowei Zheng, Zhelun Shi, Zhongkai Zhao, Bin Jia, Ziyue Huang, Zhiqi Lin, Youjie Li, Jiacheng Yang, Yanghua Peng,...

[51]

Visual autoregressive modeling for instruction-guided image editing

Qingyang Mao, Qi Cai, Yehao Li, Yingwei Pan, Mingyue Cheng, Ting Yao, Qi Liu, and Tao Mei. Visual autoregressive modeling for instruction-guided image editing. ArXiv, abs/2508.15772, 2025. https://api.semanticscholar.org/CorpusID:280700028. Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adv...

[52]

Video generation models in robotics: Applications, research challenges, future directions

Zhiting Mei, Tenny Yin, Ola Shorinwa, Apurva Badithela, Zhonghe Zheng, Joseph Bruno, Madison Bland, Lihan Zha, Asher Hancock, Jaime Fernández Fisac, Philip Dames, and Anirudha Majumdar. Video generation models in robotics: Applications, research challenges, future directions. arXiv preprint arXiv:2601.07823,

[53]

Phybench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802,

Fanqing Meng, Wenqi Shao, Lixin Luo, Yahong Wang, Yiran Chen, Quanfeng Lu, Yue Yang, Tianshuo Yang, Kaipeng Zhang, Yu Qiao, et al. Phybench: A physical commonsense benchmark for evaluating text-to-image models. arXiv preprint arXiv:2406.11802,

[54]

Spectral Normalization for Generative Adversarial Networks

Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks, 2018. https://arxiv.org/abs/1802.05957. MMagic Contributors. MMagic: OpenMMLab multimodal advanced, generative, and intelligent creation toolbox. https://github.com/open-mmlab/mmagic,

[55]

Dreamo: A unified framework for image customization

Open-source repository. Chong Mou, Yanze Wu, Wenxu Wu, Zinan Guo, Pengze Zhang, Yufeng Cheng, Yiming Luo, Fei Ding, Shiwen Zhang, Xinghui Li, et al. Dreamo: A unified framework for image customization. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers, pages 1–12,

[56]

Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881, 2026

Weili Nie, Julius Berner, Nanye Ma, Chao Liu, Saining Xie, and Arash Vahdat. Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881, 2026.

[57]

Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265,

Yuwei Niu, Munan Ning, Mengren Zheng, Weiyang Jin, Bin Lin, Peng Jin, Jiaqi Liao, Chaoran Feng, Kunpeng Ning, Bin Zhu, et al. Wise: A world knowledge-informed semantic evaluation for text-to-image generation. arXiv preprint arXiv:2503.07265,

[58]

Improving robotic manipulation robustness via NICE scene surgery. arXiv preprint arXiv:2511.22777, 2025a

Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, and Amir Rasouli. Improving robotic manipulation robustness via NICE scene surgery. arXiv preprint arXiv:2511.22777, 2025a. Sajjad Pakdamansavoji, Mozhgan Pourkeshavarz, Adam Sigal, Zhiyuan Li, Rui Heng Yang, and Amir Rasouli. Improving robotic manipulation robustness via ...

[59]

Scalable Diffusion Models with Transformers

Available at https://deepmind.google/blog/genie-2-a-large-scale-foundation-world-model/. William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023. https://arxiv.org/abs/2212.09748. Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zh...

[60]

Flow-factory: A unified framework for reinforcement learning in flow-matching models

Bowen Ping, Chengyou Jia, Minnan Luo, Hangwei Qian, and Ivor Tsang. Flow-factory: A unified framework for reinforcement learning in flow-matching models. arXiv preprint arXiv:2602.12529,

[61]

Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023

Mihir Prabhudesai, Anirudh Goyal, Deepak Pathak, and Katerina Fragkiadaki. Aligning text-to-image diffusion models with reward backpropagation. arXiv preprint arXiv:2310.03739, 2023.

[62]

Art: Anonymous region transformer for variable multi-layer transparent image generation. arXiv preprint arXiv:2502.18364, 2025a

Yifan Pu, Yiming Zhao, Zhicong Tang, Ruihong Yin, Haoxing Ye, Yuhui Yuan, Dong Chen, Jianmin Bao, Sirui Zhang, Yanbin Wang, Lin Liang, Lijuan Wang, Ji Li, Xiu Li, Zhouhui Lian, Gao Huang, and Baining Guo. Art: Anonymous region transformer for variable multi-layer transparent image generation. arXiv preprint arXiv:2502.18364, 2025a. Yifan Pu, Yiming Zha...

[63]

TokenFlow: Unified image tokenizer for multimodal understanding and generation

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K. Du, Zehuan Yuan, and Xinglong Wu. TokenFlow: Unified image tokenizer for multimodal understanding and generation. arXiv preprint arXiv:2412.03069, 2024. https://arxiv.org/abs/2412.03069. Julian Quevedo, Quinn McIntyre, Spruce Campbell, Xinlei Chen, and Robert Wachen. Oasis: ...

[64]

Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks, 2016. https://arxiv.org/abs/1511.06434. Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International conference...

[65]

High-Resolution Image Synthesis with Latent Diffusion Models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022. https://arxiv.org/abs/2112.10752. Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven genera...

[66]

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022. https://arxiv.org/abs/2205.11487. Tim...

[67]

Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks

https://zenodo.org/records/17344183. Samin Mahdizadeh Sani, Max Ku, Nima Jamali, Matina Mahdizadeh Sani, Paria Khoshtab, Wei-Chieh Sun, Parnian Fazel, Zhi Rui Tam, Thomas Chong, Edisy Kin Wai Chan, et al. Imagenworld: Stress-testing image generation models with explainable human evaluation on open-ended real-world tasks. In The Fourteenth International Con...

[68]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

[69]

Diff2flow: Training flow matching models via diffusion model alignment, 2025. https://arxiv.org/abs/2506.02221

Johannes Schusterbauer, Ming Gui, Frank Fundel, and Björn Ommer. Diff2flow: Training flow matching models via diffusion model alignment, 2025. https://arxiv.org/abs/2506.02221. Team Seedream, Yunpeng Chen, Yu Gao, Lixue Gong, Meng Guo, Qiushan Guo, Zhiyao Guo, Xiaoxia Hou, Weilin Huang, Yixuan Huang, et al. Seedream 4.0: Toward next-generation multimod...

[70]

Post-training quantization on diffusion models

Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1972–1981,

[71]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300,

[72]

IMAGHarmony: Controllable image editing with consistent object quantity and layout

Fei Shen, Xiaoyu Du, Yutong Gao, Jian Yu, Yushe Cao, Xing Lei, and Jinhui Tang. Imagharmony: Controllable image editing with consistent object quantity and layout. ArXiv, abs/2506.01949, 2025a. https://api.semanticscholar.org/CorpusID:279119734. Yang Shen, Xiu-Shen Wei, Yifan Sun, Yuxin Song, Tao Yuan, Jian Jin, Heyang Xu, Yazhou Yao, and Errui Ding. Expla...

[73]

Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963,

Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. arXiv preprint arXiv:2512.06963, 2025b. Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. Instantbooth: Personalized text-to-image generation without test-time finetuning. In Proceeding...

[74]

Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025

Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025a. https://arxiv.org/abs/2510.15301. Wenda Shi, Yiren Song, Zihan Rao, Dengming Zhang, Jiaming Liu, and Xingxing Zou. Wordcon: Word-level typography ...

[75]

Chimera: Compositional image generation using part-based concepting, 2025. https://arxiv.org/abs/2510.18083

Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, and Chitta Baral. Chimera: Compositional image generation using part-based concepting, 2025. https://arxiv.org/abs/2510.18083. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021a. Yan...

[76]

Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025

Yiren Song, Cheng Liu, Weijia Mao, and Mike Zheng Shou. Mitty: Diffusion-based human-to-robot video generation. arXiv preprint arXiv:2512.17253, 2025a. Yuxin Song, Wenkai Dong, Shizun Wang, Qi Zhang, Song Xue, Tao Yuan, Hu Yang, Haocheng Feng, Hang Zhou, Xinyan Xiao, et al. Query-kontext: An unified multimodal model for image generation and editing. arXiv ...

[78]

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

https://arxiv.org/abs/2406.06525. Peter Sushko, Ayana Bharadwaj, Zhi Yang Lim, Vasily Ilin, Ben Caffee, Dongping Chen, Mohammadreza Salehi, Cheng-Yu Hsieh, and Ranjay Krishna. Realedit: Reddit edits as a large-scale empirical dataset for image transformations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), p...

[79]

Understanding generative ai capabilities in everyday image editing tasks. arXiv preprint arXiv:2505.16181,

Mohammad Reza Taesiri, Brandon Collins, Logan Bolton, Viet Dac Lai, Franck Dernoncourt, Trung Bui, and Anh Totti Nguyen. Understanding generative ai capabilities in everyday image editing tasks. arXiv preprint arXiv:2505.16181,

[80]

Instantcharacter: Personalize any characters with a scalable diffusion transformer framework. arXiv preprint arXiv:2504.12395, 2025

Jiale Tao, Yanbing Zhang, Qixun Wang, Yiji Cheng, Haofan Wang, Xu Bai, Zhengguang Zhou, Ruihuang Li, Linqing Wang, Chunyu Wang, Qin Lin, and Qinglin Lu. Instantcharacter: Personalize any characters with a scalable diffusion transformer framework. ArXiv, abs/2504.12395, 2025a. https://api.semanticscholar.org/CorpusID:277856764. Tang Tao, Likui Zhang, Youpen...

[81]

Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. arXiv preprint arXiv:2408.14837, 2024.

[82]

DataVisT5: A pre-trained language model for jointly understanding text and data visualization

Zhuoyue Wan, Yuanfeng Song, Shuaimin Li, Chen Jason Zhang, and Raymond Chi-Wing Wong. DataVisT5: A pre-trained language model for jointly understanding text and data visualization. In 41st IEEE International Conference on Data Engineering, ICDE 2025, pages 1704–1717. IEEE,


Showing first 80 references.