arxiv: 2510.26583 · v1 · pith:F6CG7G7Vnew · submitted 2025-10-30 · 💻 cs.CV

Emu3.5: Native Multimodal Models are World Learners

Yufeng Cui, Honghao Chen, Haoge Deng, Xu Huang, Xinghang Li, Jirong Liu, Yang Liu, Zhuoyan Luo, Jinsheng Wang, Wenxuan Wang, Yueze Wang, Chengyuan Wang, Fan Zhang, Yingli Zhao, Ting Pan, Xianduo Li, Zecheng Hao, Wenxuan Ma, Zhuo Chen, Yulong Ao, Tiejun Huang, Zhongyuan Wang, Xinlong Wang

Pith reviewed 2026-05-18 01:07 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal world model · next-token prediction · vision-language generation · embodied manipulation · discrete diffusion adaptation · video sequence modeling

The pith

Emu3.5 shows that next-token prediction on trillions of video tokens induces generalizable world dynamics for exploration and manipulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Emu3.5 as a multimodal model that predicts the next state in interleaved vision and language sequences. It is trained end-to-end on more than 10 trillion tokens, drawn primarily from sequential frames and transcripts of internet videos, using only a unified next-token prediction objective. This setup lets the model accept interleaved vision-language inputs and produce interleaved vision-language outputs, and the authors report that it supports long-horizon generation and embodied manipulation. A sympathetic reader would care because the work suggests that scaling simple prediction on real-world sequential data can produce consistent representations of space and time. The authors further post-train with large-scale reinforcement learning to improve multimodal reasoning and generation, and introduce Discrete Diffusion Adaptation (DiDA), which speeds up per-image inference by roughly twenty times.

Core claim

Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. After post-training with large-scale reinforcement learning, Emu3.5 exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks.

What carries the argument

Unified next-token prediction on interleaved vision-language sequences from internet video data, which induces implicit learning of world dynamics across modalities.
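As a rough illustration of what a unified next-token objective over interleaved data means, the sketch below flattens text and vision segments into one token sequence and scores it with ordinary cross-entropy. The vocabulary sizes, the offset scheme for visual tokens, and the helper names are assumptions made for illustration; this page does not describe Emu3.5's tokenizer or loss implementation.

```python
import math

# Hypothetical vocabulary layout: text ids first, then visual codebook ids.
TEXT_VOCAB = 8
VISION_VOCAB = 4
VOCAB = TEXT_VOCAB + VISION_VOCAB


def interleave(segments):
    """Flatten [("text", ids), ("vision", ids), ...] into one id sequence.

    Vision ids are shifted by TEXT_VOCAB so both modalities share one vocabulary.
    """
    sequence = []
    for modality, ids in segments:
        offset = TEXT_VOCAB if modality == "vision" else 0
        sequence.extend(offset + i for i in ids)
    return sequence


def next_token_loss(logits, sequence):
    """Mean cross-entropy of predicting sequence[t + 1] from logits[t]."""
    total = 0.0
    for t in range(len(sequence) - 1):
        row = logits[t]
        peak = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_norm - row[sequence[t + 1]]
    return total / (len(sequence) - 1)


segments = [("text", [1, 5]), ("vision", [0, 3, 2]), ("text", [7])]
sequence = interleave(segments)
# Uniform logits stand in for a real model; the loss then equals log(VOCAB).
uniform_logits = [[0.0] * VOCAB for _ in sequence]
loss = next_token_loss(uniform_logits, sequence)
```

The point of the sketch is that the loss never branches on modality: once text and vision share one vocabulary, a single prediction target covers both.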

Load-bearing premise

Next-token prediction on large-scale interleaved internet video frames and transcripts is sufficient to induce generalizable, spatiotemporally consistent world dynamics without additional structured supervision, simulation environments, or explicit physical priors.

What would settle it

Demonstrating that Emu3.5 generates spatiotemporally inconsistent future states or fails on novel open-world manipulation tasks outside the distribution of its training videos would falsify the central claim.

Original abstract

We introduce Emu3.5, a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of vision-language interleaved data containing over 10 trillion tokens, primarily derived from sequential frames and transcripts of internet videos. The model naturally accepts interleaved vision-language inputs and generates interleaved vision-language outputs. Emu3.5 is further post-trained with large-scale reinforcement learning to enhance multimodal reasoning and generation. To improve inference efficiency, we propose Discrete Diffusion Adaptation (DiDA), which converts token-by-token decoding into bidirectional parallel prediction, accelerating per-image inference by about 20x without sacrificing performance. Emu3.5 exhibits strong native multimodal capabilities, including long-horizon vision-language generation, any-to-image (X2I) generation, and complex text-rich image generation. It also exhibits generalizable world-modeling abilities, enabling spatiotemporally consistent world exploration and open-world embodied manipulation across diverse scenarios and tasks. For comparison, Emu3.5 achieves performance comparable to Gemini 2.5 Flash Image (Nano Banana) on image generation and editing tasks and demonstrates superior results on a suite of interleaved generation tasks. We open-source Emu3.5 at https://github.com/baaivision/Emu3.5 to support community research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Emu3.5, a native multimodal world model pre-trained end-to-end via next-token prediction on over 10 trillion tokens of interleaved vision-language data derived primarily from internet video frames and transcripts. It further applies large-scale RL post-training and introduces Discrete Diffusion Adaptation (DiDA) to convert autoregressive decoding into bidirectional parallel prediction for ~20x faster per-image inference. The central claims are that the resulting model achieves performance comparable to Gemini 2.5 Flash Image on image generation/editing tasks, outperforms it on interleaved generation tasks, and exhibits generalizable world-modeling abilities that enable spatiotemporally consistent long-horizon exploration and open-world embodied manipulation across diverse scenarios.

Significance. If the empirical claims are substantiated, the work would provide evidence that scaling next-token prediction on large-scale interleaved video corpora can induce internal world models sufficient for open-ended embodied tasks without explicit physical simulators or structured priors. The open-sourcing of the model and the DiDA efficiency technique are concrete contributions that could accelerate follow-up research in multimodal world models. However, the current lack of detailed quantitative evaluation makes it difficult to gauge the precise advance relative to existing video-prediction and multimodal foundation models.

major comments (2)

[Abstract and §4] Abstract and §4 (Evaluation): The claims of performance parity with Gemini 2.5 Flash Image on image generation/editing and superiority on interleaved tasks are presented without any quantitative tables, specific metrics (e.g., FID, CLIP score, human preference rates), baselines, dataset splits, or error bars. This absence is load-bearing for the central empirical assertions and prevents verification of the stated results.
[§3.2] §3.2 (DiDA): The description of Discrete Diffusion Adaptation as converting token-by-token decoding into bidirectional parallel prediction lacks a precise algorithmic formulation, training objective, or ablation showing that the 20x speedup preserves the world-modeling capabilities claimed elsewhere. Without these details the efficiency claim cannot be assessed independently of the main results.
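One way to attach the error bars requested in the first major comment to a human preference rate is a Wilson score interval on the win rate. The sketch below is a minimal version under assumed counts; the function name and the numbers are hypothetical and are not results from the paper.

```python
import math


def wilson_interval(wins, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = wins / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts: model A preferred in 260 of 500 pairwise comparisons.
low, high = wilson_interval(wins=260, trials=500)
# A claim of parity is compatible with the data if 0.5 lies inside the interval.
parity_plausible = low <= 0.5 <= high
```

Reporting the interval alongside the point estimate lets a reader tell apart "comparable" and "better" claims, which a bare win rate cannot do.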

minor comments (2)

[§2] §2 (Related Work): The positioning against prior video world models (e.g., those using explicit dynamics or simulation) would benefit from a clearer statement of how the purely next-token objective differs mechanistically from those baselines.
[§5] Figure captions and §5 (Qualitative Results): Several qualitative rollout figures lack explicit frame indices, conditioning context length, or failure-case examples, making it harder to judge the claimed spatiotemporal consistency.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful and constructive comments on our manuscript. We appreciate the feedback highlighting areas where additional clarity and substantiation would strengthen the presentation. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: [Abstract and §4] Abstract and §4 (Evaluation): The claims of performance parity with Gemini 2.5 Flash Image on image generation/editing and superiority on interleaved tasks are presented without any quantitative tables, specific metrics (e.g., FID, CLIP score, human preference rates), baselines, dataset splits, or error bars. This absence is load-bearing for the central empirical assertions and prevents verification of the stated results.

Authors: We agree that the current presentation of the comparative claims would benefit from more detailed quantitative support to allow independent verification. In the revised manuscript, we will expand §4 with comprehensive tables that include specific metrics such as FID, CLIP scores, human preference rates, relevant baselines, dataset splits, and error bars. These additions will directly substantiate the reported performance parity with Gemini 2.5 Flash Image on generation and editing tasks as well as the superiority on interleaved generation tasks. revision: yes
Referee: [§3.2] §3.2 (DiDA): The description of Discrete Diffusion Adaptation as converting token-by-token decoding into bidirectional parallel prediction lacks a precise algorithmic formulation, training objective, or ablation showing that the 20x speedup preserves the world-modeling capabilities claimed elsewhere. Without these details the efficiency claim cannot be assessed independently of the main results.

Authors: We acknowledge that the description of Discrete Diffusion Adaptation (DiDA) in §3.2 is currently high-level. In the revision, we will add a precise algorithmic formulation of the method, explicitly state the training objective used for the adaptation, and include ablation studies demonstrating that the approximately 20x inference speedup maintains the claimed world-modeling capabilities. These details will enable independent evaluation of the efficiency technique. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript presents Emu3.5 as a native multimodal model trained end-to-end via next-token prediction on an external corpus of over 10T interleaved vision-language tokens from internet videos. Reported capabilities (long-horizon generation, X2I, world exploration, embodied manipulation) are positioned as emergent outcomes evaluated against independent benchmarks and qualitative rollouts. No equations, fitted parameters, or self-citations are shown that reduce these outcomes to quantities defined by the model's own training loop or prior author work. The derivation chain remains self-contained against external data and standard evaluation protocols.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that next-token prediction on internet video data induces world dynamics, plus the introduction of DiDA as a new decoding procedure. No machine-checked proofs or parameter-free derivations are mentioned.

free parameters (1)

Training data scale
Over 10 trillion tokens chosen to enable world learning; the exact selection criteria and filtering are not detailed in the abstract.

axioms (1)

domain assumption · Next-token prediction on interleaved vision-language video data captures generalizable world dynamics
Invoked as the pre-training objective that produces the reported world-modeling abilities.

invented entities (1)

Discrete Diffusion Adaptation (DiDA) · no independent evidence
purpose: Converts token-by-token autoregressive decoding into bidirectional parallel prediction for faster inference
New method introduced to accelerate per-image inference by about 20x, reportedly without sacrificing performance.
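To make the contrast between token-by-token decoding and parallel prediction concrete, the toy sketch below counts forward passes for each strategy. It uses a generic confidence-ordered unmasking loop and a stand-in `toy_model`; both are assumptions for illustration and are not the DiDA algorithm, whose formulation is not given on this page.

```python
MASK = -1


def toy_model(tokens):
    """Stand-in for a network: returns (token, confidence) per position.

    Purely illustrative: the "prediction" is the position modulo 7, and the
    confidence is a fixed pseudo-score so the unmasking order is deterministic.
    """
    return [(i % 7, ((i * 37) % 101) / 101) for i in range(len(tokens))]


def autoregressive_decode(length):
    """One forward pass per token, left to right."""
    tokens, passes = [], 0
    for _ in range(length):
        preds = toy_model(tokens + [MASK])
        tokens.append(preds[-1][0])
        passes += 1
    return tokens, passes


def parallel_decode(length, steps):
    """Start fully masked; each pass commits the most confident masked slots."""
    tokens, passes = [MASK] * length, 0
    per_step = -(-length // steps)  # ceiling division
    while MASK in tokens:
        preds = toy_model(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            tokens[i] = preds[i][0]
        passes += 1
    return tokens, passes


ar_tokens, ar_passes = autoregressive_decode(64)
par_tokens, par_passes = parallel_decode(64, steps=4)
```

Here the sequential loop needs one pass per token while the parallel loop needs one pass per refinement step. A pass-count ratio is not a wall-clock speedup, and in a real model the two strategies need not produce identical outputs, which is why the referee asks for an ablation.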

pith-pipeline@v0.9.0 · 5864 in / 1420 out tokens · 37038 ms · 2026-05-18T01:07:44.123776+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
cs.CV 2026-05 unverdicted novelty 7.0
Edit-Compass and EditReward-Compass are new unified benchmarks for fine-grained image editing evaluation and realistic reward modeling in reinforcement learning optimization.

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing
cs.CV 2026-05 unverdicted novelty 7.0
MULTITEXTEDIT benchmark reveals that all tested text-in-image editing models show pronounced degradation on non-English languages, especially Hebrew and Arabic, mainly in text accuracy and script fidelity.

Exploring Spatial Intelligence from a Generative Perspective
cs.CV 2026-04 unverdicted novelty 7.0
Fine-tuning multimodal models on a new synthetic spatial benchmark improves generative spatial compliance on real and synthetic tasks and transfers to better spatial understanding.

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models
cs.CV 2026-04 unverdicted novelty 7.0
UDM-GRPO is the first RL integration for uniform discrete diffusion models, using final clean samples as actions and forward-process trajectory reconstruction to raise GenEval accuracy from 69% to 96% and OCR accuracy...

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 7.0
UniGeo unifies geometric guidance across three levels in video models to reduce geometric drift and improve consistency in camera-controllable image editing.

InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
cs.CV 2026-05 conditional novelty 6.0
InsightTok improves text and face fidelity in discrete image tokenization via content-aware perceptual losses, with gains transferring to autoregressive generation.

Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
cs.CV 2026-05 unverdicted novelty 6.0
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
cs.RO 2026-04 unverdicted novelty 6.0
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0
OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
cs.CV 2026-04 unverdicted novelty 6.0
OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

UniGeo: Unifying Geometric Guidance for Camera-Controllable Image Editing via Video Models
cs.CV 2026-04 unverdicted novelty 6.0
UniGeo adds unified geometric guidance at three levels in video models to reduce geometric drift and improve structural fidelity in camera-controllable image editing.

LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
cs.CV 2026-02 unverdicted novelty 6.0
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.

SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
cs.CV 2026-05 unverdicted novelty 5.0
SenseNova-U1 presents native unified multimodal models that match top understanding VLMs while delivering strong performance in image generation, infographics, and interleaved tasks via the NEO-unify architecture.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV 2026-04 unverdicted novelty 5.0
Tuna-2 shows pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive or superior results on understanding and generation benchmarks.

Context Unrolling in Omni Models
cs.CV 2026-04 unverdicted novelty 5.0
Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm
cs.CV 2025-11 unverdicted novelty 5.0
Video generation models demonstrate competitive multimodal reasoning on a new benchmark, matching or exceeding VLMs on visual puzzles and achieving 92% on MATH and 69.2% on MMMU.

Awaking Spatial Intelligence in Unified Multimodal Understanding and Generation
cs.GR 2026-05 unverdicted novelty 4.0
JoyAI-Image unifies visual understanding, generation, and editing in one model and claims stronger spatial intelligence through bidirectional perception-generation loops.

TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
cs.AI 2026-04 unverdicted novelty 4.0
TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

131 extracted references · 131 canonical work pages · cited by 17 Pith papers · 36 internal anchors

[1]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Claude 3.5: An ai assistant by anthropic, 2023

Anthropic. Claude 3.5: An ai assistant by anthropic, 2023

work page 2023
[4]

The chosen one: Consistent characters in text-to-image diffusion models

Omri Avrahami, Amir Hertz, Yael Vinker, Moab Arar, Shlomi Fruchter, Ohad Fried, Daniel Cohen- Or, and Dani Lischinski. The chosen one: Consistent characters in text-to-image diffusion models. In ACM SIGGRAPH 2024 conference papers, pages 1–12, 2024

work page 2024
[5]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Improving image generation with better captions.https://cdn.openai.com/ papers/dall-e-3.pdf, 2023

James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh. Improving image generation with better captions.https://cdn.openai.com/ papers/dall-e-3.pdf, 2023

work page 2023
[7]

Instructpix2pix: Learning to follow image editing instructions

Tim Brooks, Aleksander Holynski, and Alexei A Efros. Instructpix2pix: Learning to follow image editing instructions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023

work page 2023
[8]

AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. Coyo-700m: Image-text pair dataset.https://github.com/kakaobrain/coyo-dataset, 2022

work page 2022
[10]

HiDream-I1: A High-Efficient Image Generative Foundation Model with Sparse Diffusion Transformer

Qi Cai, Jingwen Chen, Yang Chen, Yehao Li, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Yiheng Zhang, Fengbin Gao, Peihan Xu, et al. Hidream-i1: A high-efficient image generative foundation model with sparse diffusion transformer.arXiv preprint arXiv:2505.22705, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Flash diffusion: Acceler- ating any conditional diffusion model for few steps image generation

Clement Chadebec, Onur Tasar, Eyal Benaroche, and Benjamin Aubin. Flash diffusion: Acceler- ating any conditional diffusion model for few steps image generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 15686–15695, 2025

work page 2025
[12]

Oneig-bench: Omni-dimensional nuanced evaluation for image generation

Jingjing Chang, Yixiao Fang, Peng Xing, Shuhan Wu, Wei Cheng, Rui Wang, Xianfang Zeng, Gang Yu, and Hai-Bao Chen. Oneig-bench: Omni-dimensional nuanced evaluation for image generation. arXiv preprint arxiv:2506.07977, 2025

work page arXiv 2025
[13]

Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts

Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web- scale image-text pre-training to recognize long-tail visual concepts. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021

work page 2021
[14]

Interleaved scene graphs for interleaved text-and-image generation assess- ment.arXiv preprint arXiv:2411.17188, 2024

Dongping Chen, Ruoxi Chen, Shu Pu, Zhaoyi Liu, Yanru Wu, Caixi Chen, Benlin Liu, Yue Huang, Yao Wan, Pan Zhou, et al. Interleaved scene graphs for interleaved text-and-image generation assess- ment.arXiv preprint arXiv:2411.17188, 2024

work page arXiv 2024
[15]

BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models- architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025. 37

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, and Benyou Wang. Sharegpt-4o-image: Aligning multimodal models with gpt-4o-level image generation. arXiv preprint arXiv:2506.18095, 2025

work page arXiv 2025
[17]

Multiref: Controllable image generation with multiple visual refer- ences.ArXiv, abs/2508.06905, 2025

Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, and Ranjay Krishna. Multiref: Controllable image generation with multiple visual refer- ences.ArXiv, abs/2508.06905, 2025

work page arXiv 2025
[18]

Postercraft: Rethinking high-quality aesthetic poster generation in a unified framework.arXiv preprint arXiv:2506.10741, 2025

SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, et al. Postercraft: Rethinking high-quality aesthetic poster generation in a unified framework.arXiv preprint arXiv:2506.10741, 2025

work page arXiv 2025
[19]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Bagel: a web- based bacteriocin genome mining tool.Nucleic acids research, 34(suppl_2):W273–W279, 2006

Anne de Jong, Sacha AFT van Hijum, Jetta JE Bijlsma, Jan Kok, and Oscar P Kuipers. Bagel: a web- based bacteriocin genome mining tool.Nucleic acids research, 34(suppl_2):W273–W279, 2006

work page 2006
[21]

insightface.https://github.com/deepinsight/insightface, 2021

deepinsight. insightface.https://github.com/deepinsight/insightface, 2021

work page 2021
[22]

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025

work page 2025
[23]

Scaling vision transformers to 22 billion parameters

Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. InInternational conference on machine learning, pages 7480–

work page
[24]

Uniform discrete diffusion with metric path for video generation.arXiv preprint arXiv:2510.24717, 2025

Haoge Deng, Ting Pan, Fan Zhang, Yang Liu, Zhuoyan Luo, Yufeng Cui, Chunhua Shen, Shiguang Shan, Zhaoxiang Zhang, and Xinlong Wang. Uniform discrete diffusion with metric path for video generation.arXiv preprint arXiv:2510.24717, 2025

work page arXiv 2025
[25]

Retinaface: Single-shot multi-level face localisation in the wild

Jiankang Deng, Jia Guo, Evangelos Ververas, Irene Kotsia, and Stefanos Zafeiriou. Retinaface: Single-shot multi-level face localisation in the wild. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5203–5212, 2020

work page 2020
[26]

Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

Nikai Du, Zhennan Chen, Shan Gao, Zhizhou Chen, Xi Chen, Zhengkai Jiang, Jian Yang, and Ying Tai. Textcrafter: Accurately rendering multiple texts in complex visual scenes.arXiv preprint arXiv:2503.23461, 2025

work page arXiv 2025
[27]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. InForty-first International Conference on Machine Learning, 2024

work page 2024
[28]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021
[29]

Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092– 27112, 2023

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems, 36:27092– 27112, 2023

work page 2023
[30]

Seedream 3.0 Technical Report

Yu Gao, Lixue Gong, Qiushan Guo, Xiaoxia Hou, Zhichao Lai, Fanshi Li, Liang Li, Xiaochen Lian, Chao Liao, Liyang Liu, et al. Seedream 3.0 technical report.arXiv preprint arXiv:2504.11346, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385, 2024

Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky TQ Chen, Gabriel Synnaeve, Yossi Adi, and Yaron Lipman. Discrete flow matching.Advances in Neural Information Processing Systems, 37:133345–133385, 2024. 38

work page 2024
[32]

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation. arXiv preprint arXiv:2404.14396, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

Zigang Geng, Yibing Wang, Yeyao Ma, Chen Li, Yongming Rao, Shuyang Gu, Zhao Zhong, Qinglin Lu, Han Hu, Xiaosong Zhang, et al. X-omni: Reinforcement learning makes discrete autoregressive image generative models great again.arXiv preprint arXiv:2507.22058, 2025

work page arXiv 2025
[34]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36, 2024

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[35]

Gemini 2.0 flash.https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation, 2025

Google. Gemini 2.0 flash.https://developers.googleblog.com/en/ experiment-with-gemini-20-flash-native-image-generation, 2025

work page 2025
[36]

Imagen 3, 2024

Imagen Team Google. Imagen 3, 2024

work page 2024
[37]

Imagen 4, 2025

Imagen Team Google. Imagen 4, 2025

work page 2025
[38]

Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data.arXiv preprint arXiv:2410.18558, 2024

Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, et al. Infinity-mm: Scaling multimodal performance with large-scale and high-quality instruction data.arXiv preprint arXiv:2410.18558, 2024

work page arXiv 2024
[39]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025

work page 2025
[40]

Measuring colorfulness in natural images

David Hasler and Sabine E Suesstrunk. Measuring colorfulness in natural images. InHuman vision and electronic imaging VIII, volume 5007, pages 87–95. SPIE, 2003

work page 2003
[41]

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

Xiwei Hu, Rui Wang, Yixiao Fang, Bin Fu, Pei Cheng, and Gang Yu. Ella: Equip diffusion models with llm for enhanced semantic alignment.arXiv preprint arXiv:2403.05135, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017
[43]

Qifei Jia, Yu Liu, Yajie Chai, Xintong Yao, Qiming Lu, Yasen Zhang, Runyu Shi, Ying Huang, and Guoquan Zhang. Lego-edit: A general image editing framework with model-level bricks and mllm builder. arXiv preprint arXiv:2509.12883, 2025
[44]

Liming Jiang, Qing Yan, Yumin Jia, Zichuan Liu, Hao Kang, and Xin Lu. Infiniteyou: Flexible photo recrafting while preserving your identity. arXiv preprint arXiv:2503.16418, 2025
[45]

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157, 2021
[46]

Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, et al. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981, 2020
[47]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023
[48]

Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024
[49]

Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, et al. Flux.1 kontext: Flow matching for in-context image generation and editing in latent space. arXiv preprint arXiv:2506.15742, 2025
[50]

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024
[51]

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
[52]

Jijie Li, Li Du, Hanyu Zhao, Bo wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, and Yonghua Lin. Infinity instruct: Scaling instruction selection and synthesis to enhance language models, 2025
[53]

Zhen Li, Chuanhao Li, Xiaofeng Mao, Shaoheng Lin, Ming Li, Shitian Zhao, Zhaopan Xu, Xinyue Li, Yukang Feng, Jianwen Sun, et al. Sekai: A video dataset towards world exploration. arXiv preprint arXiv:2506.15675, 2025
[54]

Bin Lin, Zongjian Li, Xinhua Cheng, Yuwei Niu, Yang Ye, Xianyi He, Shenghai Yuan, Wangbo Yu, Shaodong Wang, Yunyang Ge, et al. Uniworld: High-resolution semantic encoders for unified visual understanding and generation. arXiv preprint arXiv:2506.03147, 2025
[55]

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017
[56]

Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022
[57]

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024
[58]

Shiyu Liu, Yucheng Han, Peng Xing, Fukun Yin, Rui Wang, Wei Cheng, Jiaqi Liao, Yingming Wang, Honghao Fu, Chunrui Han, et al. Step1x-edit: A practical framework for general image editing. arXiv preprint arXiv:2504.17761, 2025
[59]

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. Advances in neural information processing systems, 35:5775–5787, 2022
[60]

Zhuoyan Luo, Fengyuan Shi, Yixiao Ge, Yujiu Yang, Limin Wang, and Ying Shan. Open-magvit2: An open-source project toward democratizing auto-regressive visual generation. arXiv preprint arXiv:2409.04410, 2024
[61]

Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40, 2024
[62]

MidJourney. Midjourney, 2025. Accessed: 2025-03-31
[63]

OpenAI. Gpt-4o. https://openai.com/index/introducing-4o-image-generation, 2025
[64]

OpenAI. Image generation API. https://openai.com/index/image-generation-api/, 2025
[65]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
[66]

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024
[67]

Junting Pan, Keqiang Sun, Yuying Ge, Hao Li, Haodong Duan, Xiaoshi Wu, Renrui Zhang, Aojun Zhou, Zipeng Qin, Yi Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Journeydb: A benchmark for generative image understanding, 2023
[68]

Yulin Pan, Xiangteng He, Chaojie Mao, Zhen Han, Zeyinzi Jiang, Jingfeng Zhang, and Yu Liu. Ice-bench: A unified and comprehensive benchmark for image creating and editing. arXiv preprint arXiv:2503.14482, 2025
[69]

William Peebles and Saining Xie. Scalable diffusion models with transformers. arXiv preprint arXiv:2212.09748, 2022
[70]

Liao Qu, Huichao Zhang, Yiheng Liu, Xu Wang, Yi Jiang, Yiming Gao, Hu Ye, Daniel K Du, Zehuan Yuan, and Xinglong Wu. Tokenflow: Unified image tokenizer for multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2545–2555, 2025
[71]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023
[72]

Recraft. Recraft. https://www.recraft.ai/, 2024
[73]

Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22500–22510, 2023
[74]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3):211–252, 2015
[75]

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022
[76]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024
[77]

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018
[78]

Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020
[79]

Xiaoqian Shen and Mohamed Elhoseiny. Storygpt-v: Large language models as consistent story visualizers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 13273–13283, 2025
[80]

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

Showing first 80 references.