Context Unrolling in Omni Models
Pith reviewed 2026-05-09 21:59 UTC · model grok-4.3
The pith
Joint training on text, images, videos, and 3D enables explicit cross-modal reasoning in unified models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Native joint training on diverse modalities enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This aggregates complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity.
What carries the argument
Context Unrolling: explicit reasoning across multiple modal representations before prediction.
If this is right
- Achieves strong performance on multimodal generation and understanding benchmarks.
- Demonstrates advanced multimodal reasoning including in-context generation of text, images, video, and 3D geometry.
- Aggregates complementary information from different modalities for higher fidelity predictions.
- Approximates the shared multimodal knowledge manifold more closely than modality-specific approaches.
Where Pith is reading between the lines
- If joint training induces this unrolling, models trained jointly at scale might not need separate per-modality encoders.
- The approach could be extended to additional modalities like audio or touch to test if the unrolling generalizes.
- This might imply that observed gains in large multimodal models stem partly from emergent cross-representation reasoning rather than data volume alone.
- One could ablate the joint training to see whether the explicit reasoning steps vanish.
Load-bearing premise
The gains and the explicit reasoning process arise directly from the native joint training on the listed modalities rather than from model size, architecture choices, or total data volume.
What would settle it
Compare a jointly trained Omni model against an equivalent-scale model whose modality-specific components are trained separately and merged only at test time. If the merged version matches or exceeds the joint model's performance without showing cross-modal reasoning steps, the claim is falsified.
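The merged-at-test-time baseline in this protocol needs a concrete definition before the comparison means anything. A minimal sketch, assuming late fusion by logit averaging; the function names, toy logits, and the averaging rule are illustrative choices, not anything the paper specifies:

```python
import numpy as np

def merge_at_test_time(logits_per_modality):
    """Late-fusion baseline: average the class logits produced by
    independently trained single-modality models. By construction,
    no cross-modal interaction occurs before this averaging step."""
    return np.mean(np.stack(logits_per_modality), axis=0)

def accuracy(logits, labels):
    """Top-1 accuracy of argmax predictions against integer labels."""
    return float((logits.argmax(axis=-1) == labels).mean())

# Toy illustration: two "modalities" scoring two examples over two classes.
text_logits  = np.array([[2.0, 0.0], [0.0, 2.0]])  # confident and correct
image_logits = np.array([[1.0, 0.0], [1.0, 0.0]])  # biased toward class 0
labels = np.array([0, 1])

merged = merge_at_test_time([text_logits, image_logits])
```

Under the falsification protocol, the jointly trained model's benchmark score would be compared against the late-fusion score at matched parameter count and data volume; only a gap attributable to cross-modal reasoning traces would support the claim.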
Original abstract
We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Omni, a unified multimodal model natively trained on text, images, videos, 3D geometry, and hidden representations. It claims that this joint training enables 'Context Unrolling,' in which the model explicitly reasons across multiple modal representations to aggregate complementary information, more faithfully approximate a shared multimodal knowledge manifold, and thereby improve downstream reasoning fidelity. The model is reported to achieve strong performance on multimodal generation and understanding benchmarks while supporting advanced in-context generation across modalities.
Significance. If the Context Unrolling mechanism could be rigorously isolated and shown to drive gains beyond scale or data diversity, the work would offer a potentially valuable empirical observation about emergent cross-modal reasoning in jointly trained multimodal models. At present, however, the absence of supporting data leaves the significance speculative.
major comments (2)
- [Abstract] The central claim that native joint training produces 'Context Unrolling' (explicit cross-modal reasoning that aggregates complementary information and improves manifold approximation) is asserted without any quantitative benchmark results, baselines, ablation studies, or description of how the unrolling process was identified or measured.
- [Abstract] No operationalization of 'explicit reasoning across multiple modal representations' is supplied (e.g., attention rollout, per-step modality traces, or causal interventions), so observed improvements cannot be distinguished from standard scaling, architecture, or data-volume effects.
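Of the instruments the referee names, attention rollout is the easiest to make concrete. A minimal sketch, assuming head-averaged per-layer attention matrices and hand-labeled modality token spans as inputs; both are illustrative assumptions, not artifacts of the paper:

```python
import numpy as np

def attention_rollout(attn_layers):
    """Attention rollout: propagate head-averaged attention through the
    layer stack, mixing in the identity to account for residual paths."""
    n = attn_layers[0].shape[-1]
    rollout = np.eye(n)
    for attn in attn_layers:                 # attn: (n, n), rows sum to 1
        attn = 0.5 * attn + 0.5 * np.eye(n)  # model the residual connection
        attn = attn / attn.sum(axis=-1, keepdims=True)
        rollout = attn @ rollout             # compose with earlier layers
    return rollout                           # row-stochastic (n, n)

def modality_attribution(rollout, spans):
    """Fraction of rollout mass landing on each labeled modality span,
    averaged over query positions. `spans` maps names to (start, end)."""
    received = rollout.mean(axis=0)
    return {name: float(received[lo:hi].sum())
            for name, (lo, hi) in spans.items()}
```

If the unrolling is real, such attributions should place non-trivial mass on more than one modality span, and causal interventions (ablating a span) should measurably shift them; that is the kind of evidence the abstract omits.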
minor comments (2)
- The manuscript contains no equations, formal definitions, or derivations for key invented terms such as 'Context Unrolling' or 'shared multimodal knowledge manifold.'
- No references to prior work on multimodal reasoning, attention visualization, or manifold learning are provided to situate the new terminology.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the abstract. We address each point below and will revise the manuscript to improve clarity and evidence presentation while preserving the core contribution.
Point-by-point responses
- Referee: [Abstract] The central claim that native joint training produces 'Context Unrolling' (explicit cross-modal reasoning that aggregates complementary information and improves manifold approximation) is asserted without any quantitative benchmark results, baselines, ablation studies, or description of how the unrolling process was identified or measured.
  Authors: We agree that the abstract, as a concise summary, does not include the quantitative details or methodological descriptions. The full manuscript reports benchmark results against baselines, ablation studies isolating joint multimodal training, and empirical identification of Context Unrolling via performance gains and cross-modal reasoning traces. We will revise the abstract to incorporate key quantitative improvements and a high-level description of how unrolling was observed. revision: yes
- Referee: [Abstract] No operationalization of 'explicit reasoning across multiple modal representations' is supplied (e.g., attention rollout, per-step modality traces, or causal interventions), so observed improvements cannot be distinguished from standard scaling, architecture, or data-volume effects.
  Authors: The manuscript body provides qualitative examples, attention visualizations, and controlled ablations showing gains attributable to cross-modal interactions beyond scale or data volume alone. We acknowledge that explicit operationalization strengthens the claim and will add modality-trace analyses and additional ablations in the revision to better isolate the mechanism. revision: partial
Circularity Check
No significant circularity; empirical observation without derivational reduction
Full rationale
The paper's core claim is presented as an empirical finding: native joint training on text/images/videos/3D/hidden representations 'enables Context Unrolling' that aggregates information and approximates a shared manifold. No equations, derivations, or parameter-fitting steps appear in the abstract or described structure. 'Context Unrolling' is introduced as an observed process, not defined circularly in terms of itself or fitted to the same benchmarks. No self-citations are invoked to justify uniqueness theorems, ansatzes, or load-bearing premises. The description does not rename known results or treat fitted inputs as predictions. The chain is observational rather than deductive, so no step reduces to its inputs by construction. This is the expected non-finding for an empirical multimodal training paper.
Axiom & Free-Parameter Ledger
invented entities (1)
- Context Unrolling: no independent evidence
Forward citations
Cited by 1 Pith paper
- Mamoda2.5: Enhancing Unified Multimodal Model with DiT-MoE. Mamoda2.5 is a 25B-parameter DiT-MoE unified AR-Diffusion model that reaches top video generation and editing benchmarks with 4-step inference up to 95.9x faster than baselines.