Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Pith reviewed 2026-05-16 23:05 UTC · model grok-4.3
The pith
Multimodal models can improve spatial reasoning by generating images that visualize their step-by-step thinking process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MVoT enables MLLMs to generate image visualizations of their reasoning traces, using a token discrepancy loss to improve visual coherence and fidelity. This yields competitive performance on spatial reasoning tasks and robust improvements in scenarios where Chain-of-Thought prompting fails.
What carries the argument
Multimodal Visualization-of-Thought (MVoT), the generation of image-based representations of reasoning steps inside autoregressive multimodal models, guided by a token discrepancy loss that enforces alignment between generated visuals and the model's token predictions.
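The abstract describes the loss only in words; a minimal PyTorch sketch of one plausible formulation, assuming a frozen VQ codebook and hypothetical tensor shapes (an illustration, not the paper's implementation), is:

```python
import torch

def token_discrepancy_loss(logits, target_ids, codebook_emb):
    """Sketch of a token discrepancy loss (illustrative, not the paper's code).

    Penalizes probability mass placed on visual tokens whose codebook
    embeddings lie far from the ground-truth token's embedding, so near
    misses in embedding space cost less than visually distant ones.

    logits:       (batch, vocab) scores over the visual-token vocabulary
    target_ids:   (batch,) ground-truth visual token indices
    codebook_emb: (vocab, dim) frozen VQ codebook embeddings
    """
    probs = logits.softmax(dim=-1)                 # (batch, vocab)
    target_emb = codebook_emb[target_ids]          # (batch, dim)
    # Squared distance from the target embedding to every candidate token.
    dist = torch.cdist(target_emb, codebook_emb).pow(2)  # (batch, vocab)
    # Expected embedding distance under the predicted distribution.
    return (probs * dist).sum(dim=-1).mean()
```

The intuition is that predicting a visually similar wrong token should cost less than predicting a visually distant one, steering generation toward coherent images even when the exact token is missed.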
If this is right
- MVoT provides reliable gains precisely on the spatial problems where language-only chains are weakest.
- Visual reasoning traces can be produced and used within the same autoregressive generation process.
- The method maintains competitive results across easier tasks while excelling on harder ones.
- Token discrepancy loss serves as a practical way to raise the quality of generated reasoning visuals.
Where Pith is reading between the lines
- The same visualization technique could be tested on non-spatial tasks such as planning or object manipulation to check for broader usefulness.
- Training models from scratch to emit useful images without an extra loss term might further simplify the approach.
- Integration with external visual tools or simulators could turn the generated images into verifiable intermediate states.
- Similar visual trace methods might help interpretability in other multimodal systems beyond language models.
Load-bearing premise
The images created during reasoning actually capture useful internal states and help the model solve the task instead of adding new errors or distractions.
What would settle it
A controlled test in which replacing MVoT-generated images with random or blank images eliminates all performance gains over standard Chain-of-Thought on the same spatial tasks.
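A hedged sketch of that controlled test, with all interfaces (generate_step, append, check, answer) invented for illustration, might look like:

```python
import torch

def run_condition(model, tasks, visual_mode="mvot"):
    """Evaluate one inference condition on the same spatial tasks.

    visual_mode: "mvot"   - feed the model's own generated images forward
                 "blank"  - replace each generated image with a blank canvas
                 "random" - replace each generated image with noise
    All interfaces here (generate_step, append, check, answer) are
    hypothetical placeholders, not the paper's API.
    """
    correct = 0
    for task in tasks:
        state = task.prompt
        for _ in range(task.max_steps):
            text, image = model.generate_step(state)  # interleaved decoding
            if visual_mode == "blank":
                image = torch.zeros_like(image)
            elif visual_mode == "random":
                image = torch.rand_like(image)
            state = task.append(state, text, image)   # visualization fed forward
        correct += task.check(model.answer(state))
    return correct / len(tasks)

# The causal claim survives only if "mvot" beats both controls:
# accuracy = {m: run_condition(model, tasks, m) for m in ("mvot", "blank", "random")}
```

If "mvot" fails to beat both controls, the gains would be attributable to training-time regularization rather than to the visualizations themselves.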
Original abstract
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multimodal Visualization-of-Thought (MVoT), a reasoning paradigm for MLLMs that generates image visualizations of internal reasoning traces during autoregressive decoding. It introduces a token discrepancy loss to improve visual coherence and fidelity of these generated images. The central claim is that MVoT achieves competitive performance on dynamic spatial reasoning tasks and delivers robust gains precisely in the hardest cases where standard Chain-of-Thought (CoT) prompting fails, by enabling visual thinking to complement verbal reasoning.
Significance. If the causal role of the generated visualizations is established, the approach could meaningfully extend multimodal reasoning beyond language-only CoT, particularly for spatial tasks. The introduction of token discrepancy loss as a training signal for visualization quality is a concrete technical contribution that could be reusable in other MLLM settings.
major comments (3)
- [Experiments] No ablation isolates whether performance gains arise from the generated visualizations being fed forward into subsequent reasoning steps rather than from the token discrepancy loss alone (e.g., a control that trains with the loss but supplies blank or random images at inference). Without this, the claim that MVoT yields improvements 'via image visualizations' remains unverified and the central causal mechanism is left untested.
- [§4 and Experiments] The paper asserts that generated visualizations 'faithfully capture the model's internal reasoning state,' yet provides no direct evidence, such as a human evaluation of visualization fidelity against ground-truth reasoning traces, a comparison to model attention maps, or a metric quantifying how often the image is actually referenced by later tokens. This assumption underpins the claim of robust gains in hard spatial scenarios.
- [Abstract and §5] The abstract states 'competitive performance' and 'robust and reliable improvements' but supplies no quantitative metrics, error bars, dataset sizes, or statistical significance tests. The full experimental tables should be checked to confirm that the reported gains are large and consistent enough to support the 'where CoT fails' claim.
minor comments (2)
- [§4] Notation: The token discrepancy loss is introduced without an explicit equation number or a derivation showing how it differs from standard cross-entropy or perceptual losses; add an equation label and a short derivation (a hypothetical form is sketched after this list).
- [Figure 3] Figure clarity: The visualization examples in Figure 3 lack side-by-side CoT failure cases with the corresponding MVoT image traces; adding these would help readers see the claimed complementarity.
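For illustration, one hypothetical closed form such an equation could take (an assumption consistent with the loss's described behavior, not the paper's actual equation) weights the embedding-space distance to the ground-truth visual token by the predicted probability, added to the usual cross-entropy objective:

```latex
% Hypothetical token discrepancy loss (sketch, not the paper's equation).
% V: visual-token vocabulary; e(v): VQ codebook embedding of token v;
% t_i^*: ground-truth visual token at position i; p_i(v): predicted probability.
\mathcal{L}_{\mathrm{D}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{v \in V}
  p_i(v)\,\bigl\lVert e(v) - e(t_i^{*}) \bigr\rVert_2^2,
\qquad
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{D}}.
```

Unlike plain cross-entropy, which treats all wrong tokens as equally wrong, a form like this charges less for predictions whose embeddings are visually close to the target, which is one way a loss could 'improve visual coherence and fidelity'.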
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to strengthen the causal claims and evidence where appropriate.
Point-by-point responses
- Referee: [Experiments] No ablation isolates whether performance gains arise from the generated visualizations being fed forward into subsequent reasoning steps rather than from the token discrepancy loss alone (e.g., a control that trains with the loss but supplies blank or random images at inference). Without this, the claim that MVoT yields improvements 'via image visualizations' remains unverified and the central causal mechanism is left untested.
  Authors: We agree this ablation is required to isolate the contribution of the visualizations. In the revised manuscript we will add a control experiment that applies the token discrepancy loss during training but supplies blank or random images at inference time. Results from this control will be reported alongside the main results to verify that the gains depend on feeding the generated visualizations forward. revision: yes
- Referee: [§4 and Experiments] The paper asserts that generated visualizations 'faithfully capture the model's internal reasoning state,' yet provides no direct evidence, such as a human evaluation of visualization fidelity against ground-truth reasoning traces, a comparison to model attention maps, or a metric quantifying how often the image is actually referenced by later tokens. This assumption underpins the claim of robust gains in hard spatial scenarios.
  Authors: We acknowledge the lack of direct fidelity evidence. While the token discrepancy loss is intended to promote coherence with the reasoning trace, we will add a human evaluation study in the revision that rates visualization fidelity against ground-truth reasoning steps. We will also include a quantitative metric on how frequently later tokens attend to or reference the generated image tokens (one such metric is sketched after these responses), and report attention-map comparisons where feasible. revision: yes
- Referee: [Abstract and §5] The abstract states 'competitive performance' and 'robust and reliable improvements' but supplies no quantitative metrics, error bars, dataset sizes, or statistical significance tests. The full experimental tables should be checked to confirm that the reported gains are large and consistent enough to support the 'where CoT fails' claim.
  Authors: Section 5 and its tables already contain the full quantitative results, including per-task accuracies, error bars, dataset sizes, and significance tests. The abstract is intentionally high-level, but we will revise it to cite the key metrics (e.g., the average improvement on hard cases) so the claims are anchored. The tables show the gains are both statistically significant and largest precisely where CoT fails, supporting the stated conclusion. revision: partial
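The usage metric promised in the second response is not specified anywhere above; a minimal sketch of one plausible instantiation (all names and tensor shapes are assumptions, not the paper's API) measures the attention mass that post-image text tokens place on the generated image tokens:

```python
import torch

def image_attention_mass(attentions, image_positions):
    """Fraction of attention mass that post-image text tokens place on the
    generated image tokens (a hypothetical fidelity-usage metric, not one
    defined in the paper).

    attentions:      (layers, heads, seq, seq) attention weights from one
                     decoded sequence
    image_positions: boolean mask (seq,) marking generated image tokens
    """
    # Average over layers and heads -> (seq, seq).
    attn = attentions.mean(dim=(0, 1))
    last_image = image_positions.nonzero().max().item()
    later_text = torch.arange(attn.size(0)) > last_image
    # Attention mass each later text token assigns to the image tokens.
    mass = attn[later_text][:, image_positions].sum(dim=-1)
    return mass.mean().item()
```

A near-zero value on solved instances would suggest the images are decorative rather than load-bearing, which is exactly the referee's concern.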
Circularity Check
No significant circularity; new loss and visualization paradigm are independent additions
Full rationale
The paper proposes MVoT as a new reasoning paradigm that generates image visualizations of reasoning traces in MLLMs and introduces a token discrepancy loss to improve visual coherence. No equations, self-definitions, or load-bearing self-citations reduce the central claims to prior CoT results by construction. The improvements are presented as experimental outcomes on spatial tasks where CoT fails, with the added components (visual generation and the discrepancy loss) standing as novel contributions rather than renamings or fitted predictions of existing quantities. The derivation chain is self-contained and validated against external benchmarks.
Forward citations
Cited by 24 Pith papers
- ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
  ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
- UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
  UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
- SketchVLM: Vision language models can annotate images to explain thoughts and guide users
  SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
- Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
  Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in an SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
  DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
- Latent Visual Reasoning
  Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- Meta-CoT: Enhancing Granularity and Generalization in Image Editing
  Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  A visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching state-of-the-art benchmark results with faster inference than explicit chain-of-thought methods.
- AdaTooler-V: Adaptive Tool-Use for Images and Videos
  AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
- OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
  OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.
- Mull-Tokens: Modality-Agnostic Latent Thinking
  Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
  VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
  GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
  MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
- Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
  A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
- MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
  MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
- [1] [AAA+23] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] [BDK+23] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [3] [Bro16] G. Brockman. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [4] [Cha24] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [5] [CSML24] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024.
- [6] [CYF+24] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision language model. arXiv preprint arXiv:2406.01584, 2024.
- [7] [Dee24] Google DeepMind. Google Gemini AI update, December 2024. Accessed: 2024-12-27.
- [8] [DJP+24] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [9] [HSF+24] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024.
- [10] [HSW+21] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [11] [JHH+23] Emily Jin, Jiaheng Hu, Zhuoyi Huang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, and Roberto Martín-Martín. Mini-BEHAVIOR: A procedurally generated benchmark for long-horizon decision-making in embodied AI. arXiv preprint arXiv:2310.01824, 2023.
- [12] [JSM+23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [13] [LCB+24] Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied AI. arXiv preprint arXiv:2407.06886, 2024.
- [14] [LTS24] Long Hei Matthew Lam, Ramya Keerthy Thatikonda, and Ehsan Shareghi. A closer look at logical reasoning with LLMs: The choice of tool matters. arXiv preprint arXiv:2406.00284, 2024.
- [15] [LYC+24] Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. Scaffolding coordinates to promote vision-language coordination in large multi-modal models. arXiv preprint arXiv:2402.12058, 2024.
- [16] [LZZ+24] Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. TopViewRS: Vision-language models as top-view spatial reasoners. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1786–1807, Miami, Florida, USA, November 2024.
- [17] [RDT+24] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. SAT: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.
- [18] [SBW+24] Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Multimodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024.
- [19] [SJC+24] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
- [20] [SKHW24] Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M. Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. arXiv preprint arXiv:2407.02687, 2024.
- [21] [TPK24] Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: An autoregressive generative model of raw images and text. arXiv preprint arXiv:2411.19722, 2024.
- [22] [WZH+23] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
- [23] [YDG+23] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
- [24] [YLW+23] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
- [25] [ZQC+24] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024.
- [26] [ZWZ+24b] Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, and Guan Huang. Is Sora a world simulator? A comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.
- [27] [ZYB+24] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.