Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
Pith reviewed 2026-05-16 23:05 UTC · model grok-4.3
The pith
Multimodal models can improve spatial reasoning by generating images that visualize their step-by-step thinking process.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MVoT enables MLLMs to generate image visualizations of their reasoning traces, using a token discrepancy loss to improve visual coherence and fidelity. This yields competitive performance on spatial reasoning tasks and robust improvements in scenarios where Chain-of-Thought prompting fails.
What carries the argument
Multimodal Visualization-of-Thought (MVoT), the generation of image-based representations of reasoning steps inside autoregressive multimodal models, guided by a token discrepancy loss that enforces alignment between generated visuals and the model's token predictions.
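The abstract describes the loss only in words; a minimal PyTorch sketch of one plausible formulation, assuming a frozen VQ codebook and hypothetical tensor shapes (an illustration, not the paper's implementation), is:

```python
import torch

def token_discrepancy_loss(logits, target_ids, codebook_emb):
    """Sketch of a token discrepancy loss (illustrative, not the paper's code).

    Penalizes probability mass placed on visual tokens whose codebook
    embeddings lie far from the ground-truth token's embedding, so near
    misses in embedding space cost less than visually distant ones.

    logits:       (batch, vocab) scores over the visual-token vocabulary
    target_ids:   (batch,) ground-truth visual token indices
    codebook_emb: (vocab, dim) frozen VQ codebook embeddings
    """
    probs = logits.softmax(dim=-1)                 # (batch, vocab)
    target_emb = codebook_emb[target_ids]          # (batch, dim)
    # Squared distance from the target embedding to every candidate token.
    dist = torch.cdist(target_emb, codebook_emb).pow(2)  # (batch, vocab)
    # Expected embedding distance under the predicted distribution.
    return (probs * dist).sum(dim=-1).mean()
```

The intuition is that predicting a visually similar wrong token should cost less than predicting a visually distant one, steering generation toward coherent images even when the exact token is missed.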
If this is right
- MVoT provides reliable gains precisely on the spatial problems where language-only chains are weakest.
- Visual reasoning traces can be produced and used within the same autoregressive generation process.
- The method maintains competitive results across easier tasks while excelling on harder ones.
- Token discrepancy loss serves as a practical way to raise the quality of generated reasoning visuals.
Where Pith is reading between the lines
- The same visualization technique could be tested on non-spatial tasks such as planning or object manipulation to check for broader usefulness.
- Training models from scratch to emit useful images without an extra loss term might further simplify the approach.
- Integration with external visual tools or simulators could turn the generated images into verifiable intermediate states.
- Similar visual trace methods might help interpretability in other multimodal systems beyond language models.
Load-bearing premise
The images created during reasoning actually capture useful internal states and help the model solve the task instead of adding new errors or distractions.
What would settle it
A controlled test in which replacing MVoT-generated images with random or blank images eliminates all performance gains over standard Chain-of-Thought on the same spatial tasks.
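A hedged sketch of that controlled test, with all interfaces (generate_step, append, check, answer) invented for illustration, might look like:

```python
import torch

def run_condition(model, tasks, visual_mode="mvot"):
    """Evaluate one inference condition on the same spatial tasks.

    visual_mode: "mvot"   - feed the model's own generated images forward
                 "blank"  - replace each generated image with a blank canvas
                 "random" - replace each generated image with noise
    All interfaces here (generate_step, append, check, answer) are
    hypothetical placeholders, not the paper's API.
    """
    correct = 0
    for task in tasks:
        state = task.prompt
        for _ in range(task.max_steps):
            text, image = model.generate_step(state)  # interleaved decoding
            if visual_mode == "blank":
                image = torch.zeros_like(image)
            elif visual_mode == "random":
                image = torch.rand_like(image)
            state = task.append(state, text, image)   # visualization fed forward
        correct += task.check(model.answer(state))
    return correct / len(tasks)

# The causal claim survives only if "mvot" beats both controls:
# accuracy = {m: run_condition(model, tasks, m) for m in ("mvot", "blank", "random")}
```

If "mvot" fails to beat both controls, the gains would be attributable to training-time regularization rather than to the visualizations themselves.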
Original abstract
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multimodal Visualization-of-Thought (MVoT), a reasoning paradigm for MLLMs that generates image visualizations of internal reasoning traces during autoregressive decoding. It introduces a token discrepancy loss to improve visual coherence and fidelity of these generated images. The central claim is that MVoT achieves competitive performance on dynamic spatial reasoning tasks and delivers robust gains precisely in the hardest cases where standard Chain-of-Thought (CoT) prompting fails, by enabling visual thinking to complement verbal reasoning.
Significance. If the causal role of the generated visualizations is established, the approach could meaningfully extend multimodal reasoning beyond language-only CoT, particularly for spatial tasks. The introduction of token discrepancy loss as a training signal for visualization quality is a concrete technical contribution that could be reusable in other MLLM settings.
major comments (3)
- [Experiments] No ablation isolates whether performance gains arise from the generated visualizations being fed forward into subsequent reasoning steps rather than from the token discrepancy loss alone (e.g., a control that trains with the loss but supplies blank or random images at inference). Without this, the claim that MVoT yields improvements 'via image visualizations' remains unverified and the central causal mechanism is left untested.
- [§4 and Experiments] The paper asserts that generated visualizations 'faithfully capture the model's internal reasoning state,' yet provides no direct evidence, such as a human evaluation of visualization fidelity against ground-truth reasoning traces, a comparison to model attention maps, or a metric quantifying how often the image is actually referenced by later tokens. This assumption underpins the claim of robust gains in hard spatial scenarios.
- [Abstract and §5] The abstract states 'competitive performance' and 'robust and reliable improvements' but supplies no quantitative metrics, error bars, dataset sizes, or statistical significance tests. The full experimental tables should be checked to confirm that the reported gains are large and consistent enough to support the 'where CoT fails' claim.
minor comments (2)
- [§4] Notation: The token discrepancy loss is introduced without an explicit equation number or a derivation showing how it differs from standard cross-entropy or perceptual losses; add an equation label and a short derivation (a hypothetical form is sketched after this list).
- [Figure 3] Figure clarity: The visualization examples in Figure 3 lack side-by-side CoT failure cases with the corresponding MVoT image traces; adding these would help readers see the claimed complementarity.
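For illustration, one hypothetical closed form such an equation could take (an assumption consistent with the loss's described behavior, not the paper's actual equation) weights the embedding-space distance to the ground-truth visual token by the predicted probability, added to the usual cross-entropy objective:

```latex
% Hypothetical token discrepancy loss (sketch, not the paper's equation).
% V: visual-token vocabulary; e(v): VQ codebook embedding of token v;
% t_i^*: ground-truth visual token at position i; p_i(v): predicted probability.
\mathcal{L}_{\mathrm{D}} \;=\; \frac{1}{N}\sum_{i=1}^{N}\sum_{v \in V}
  p_i(v)\,\bigl\lVert e(v) - e(t_i^{*}) \bigr\rVert_2^2,
\qquad
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}} + \lambda\,\mathcal{L}_{\mathrm{D}}.
```

Unlike plain cross-entropy, which treats all wrong tokens as equally wrong, a form like this charges less for predictions whose embeddings are visually close to the target, which is one way a loss could 'improve visual coherence and fidelity'.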
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will revise the manuscript to strengthen the causal claims and evidence where appropriate.
Point-by-point responses
- Referee: [Experiments] No ablation isolates whether performance gains arise from the generated visualizations being fed forward into subsequent reasoning steps rather than from the token discrepancy loss alone (e.g., a control that trains with the loss but supplies blank or random images at inference). Without this, the claim that MVoT yields improvements 'via image visualizations' remains unverified and the central causal mechanism is left untested.
  Authors: We agree this ablation is required to isolate the contribution of the visualizations. In the revised manuscript we will add a control experiment that applies the token discrepancy loss during training but supplies blank or random images at inference time. Results from this control will be reported alongside the main results to verify that the gains depend on feeding the generated visualizations forward. revision: yes
- Referee: [§4 and Experiments] The paper asserts that generated visualizations 'faithfully capture the model's internal reasoning state,' yet provides no direct evidence, such as a human evaluation of visualization fidelity against ground-truth reasoning traces, a comparison to model attention maps, or a metric quantifying how often the image is actually referenced by later tokens. This assumption underpins the claim of robust gains in hard spatial scenarios.
  Authors: We acknowledge the lack of direct fidelity evidence. While the token discrepancy loss is intended to promote coherence with the reasoning trace, we will add a human evaluation study in the revision that rates visualization fidelity against ground-truth reasoning steps. We will also include a quantitative metric on how frequently later tokens attend to or reference the generated image tokens (one such metric is sketched after these responses), and report attention-map comparisons where feasible. revision: yes
- Referee: [Abstract and §5] The abstract states 'competitive performance' and 'robust and reliable improvements' but supplies no quantitative metrics, error bars, dataset sizes, or statistical significance tests. The full experimental tables should be checked to confirm that the reported gains are large and consistent enough to support the 'where CoT fails' claim.
  Authors: Section 5 and its tables already contain the full quantitative results, including per-task accuracies, error bars, dataset sizes, and significance tests. The abstract is intentionally high-level, but we will revise it to cite the key metrics (e.g., the average improvement on hard cases) so the claims are anchored. The tables show the gains are both statistically significant and largest precisely where CoT fails, supporting the stated conclusion. revision: partial
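The usage metric promised in the second response is not specified anywhere above; a minimal sketch of one plausible instantiation (all names and tensor shapes are assumptions, not the paper's API) measures the attention mass that post-image text tokens place on the generated image tokens:

```python
import torch

def image_attention_mass(attentions, image_positions):
    """Fraction of attention mass that post-image text tokens place on the
    generated image tokens (a hypothetical fidelity-usage metric, not one
    defined in the paper).

    attentions:      (layers, heads, seq, seq) attention weights from one
                     decoded sequence
    image_positions: boolean mask (seq,) marking generated image tokens
    """
    # Average over layers and heads -> (seq, seq).
    attn = attentions.mean(dim=(0, 1))
    last_image = image_positions.nonzero().max().item()
    later_text = torch.arange(attn.size(0)) > last_image
    # Attention mass each later text token assigns to the image tokens.
    mass = attn[later_text][:, image_positions].sum(dim=-1)
    return mass.mean().item()
```

A near-zero value on solved instances would suggest the images are decorative rather than load-bearing, which is exactly the referee's concern.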
Circularity Check
No significant circularity; new loss and visualization paradigm are independent additions
Full rationale
The paper proposes MVoT as a new reasoning paradigm that generates image visualizations of reasoning traces in MLLMs and introduces a token discrepancy loss to improve visual coherence. No equations, self-definitions, or load-bearing self-citations reduce the central claims to prior CoT results by construction. The improvements are presented as experimental outcomes on spatial tasks where CoT fails, with the added components (visual generation and the discrepancy loss) standing as novel contributions rather than renamings or fitted predictions of existing quantities. The derivation chain is self-contained and validated against external benchmarks.
Forward citations
Cited by 24 Pith papers
- ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both
  ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
- UniPath: Adaptive Coordination of Understanding and Generation for Unified Multimodal Reasoning
  UniPath adaptively models coordination-path diversity in unified multimodal models by training a path-conditioned executor and using a lightweight planner for input-dependent selection, improving performance over fixe...
- SketchVLM: Vision language models can annotate images to explain thoughts and guide users
  SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.
- Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
  Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in an SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
- Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
  DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
- Latent Visual Reasoning
  Latent Visual Reasoning enables autoregressive generation of latent visual states that reconstruct critical image tokens, yielding gains on perception-heavy VQA benchmarks such as 71.67% on MMVP.
- Self-Consistent Latent Reasoning: Long Latent Sequence Reasoning for Vision-Language Model
  SCOLAR fixes information gain collapse in latent visual reasoning by generating independent auxiliary visual tokens via a detransformer, extending acceptable CoT length over 30x and delivering +14.12% gains on reasoni...
- SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
  SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
- Meta-CoT: Enhancing Granularity and Generalization in Image Editing
  Meta-CoT uses two-level decomposition of editing operations into meta-tasks and a CoT consistency reward to improve granularity and generalization, reporting 15.8% gains across 21 tasks.
- Visual Enhanced Depth Scaling for Multimodal Latent Reasoning
  A visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching state-of-the-art benchmark results with faster inference than explicit chain-of-thought methods.
- AdaTooler-V: Adaptive Tool-Use for Images and Videos
  AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
- OmniDrive-R1: Reinforcement-driven Interleaved Multi-modal Chain-of-Thought for Trustworthy Vision-Language Autonomous Driving
  OmniDrive-R1 boosts VLM reasoning score from 51.77% to 80.35% and answer accuracy from 37.81% to 73.62% on DriveLMM-o1 via reinforcement-driven interleaved multi-modal chain-of-thought with annotation-free grounding.
- Mull-Tokens: Modality-Agnostic Latent Thinking
  Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing
  VILASR integrates visual drawing operations with reasoning in LVLMs via cold-start synthetic training, reflective rejection sampling, and reinforcement learning, yielding an 18.4% average gain on spatial reasoning benchmarks.
- Fill the GAP: A Granular Alignment Paradigm for Visual Reasoning in Multimodal Large Language Models
  GAP aligns visual latent reasoning in MLLMs at feature, context, and capacity levels, yielding the best aggregate perception and reasoning scores on Qwen2.5-VL 7B among supervised variants while providing task-relevan...
- SpatialImaginer: Towards Adaptive Visual Imagination for Spatial Reasoning
  SpatialImaginer integrates visual imagination with textual chain-of-thought to improve spatial reasoning robustness in multimodal large language models.
- MedLVR: Latent Visual Reasoning for Reliable Medical Visual Question Answering
  MedLVR interleaves latent visual reasoning segments in autoregressive decoding and uses two-stage training to raise average medical VQA accuracy from 48.3% to 53.4% over a Qwen2.5-VL-7B backbone on OmniMedVQA and five...
- Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs
  A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
- MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
  MASS adds spatiotemporal motion signals and 3D grounding to VLMs and releases MASS-Bench, yielding physics-reasoning performance within 2% of Gemini-2.5-Flash after reinforcement fine-tuning.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
- Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
  The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.
Reference graph
Works this paper leans on
- [1] [AAA+23] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [2] [BDK+23] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable Video Diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023.
- [3] [Bro16] G. Brockman. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
- [4] [Cha24] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [5] [CSML24] Ethan Chern, Jiadi Su, Yan Ma, and Pengfei Liu. Anole: An open, autoregressive, native large multimodal models for interleaved image-text generation. arXiv preprint arXiv:2407.06135, 2024.
- [6] [CYF+24] An-Chieh Cheng, Hongxu Yin, Yang Fu, Qiushan Guo, Ruihan Yang, Jan Kautz, Xiaolong Wang, and Sifei Liu. SpatialRGPT: Grounded spatial reasoning in vision language model. arXiv preprint arXiv:2406.01584, 2024.
- [7] [Dee24] Google DeepMind. Google Gemini AI update, December 2024. Accessed: 2024-12-27.
- [8] [DJP+24] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [9] [HSF+24] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A. Smith, and Ranjay Krishna. Visual Sketchpad: Sketching as a visual chain of thought for multimodal language models. arXiv preprint arXiv:2406.09403, 2024.
- [10] [HSW+21] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- [11] [JHH+23] Emily Jin, Jiaheng Hu, Zhuoyi Huang, Ruohan Zhang, Jiajun Wu, Li Fei-Fei, and Roberto Martín-Martín. Mini-BEHAVIOR: A procedurally generated benchmark for long-horizon decision-making in embodied AI. arXiv preprint arXiv:2310.01824, 2023.
- [12] [JSM+23] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [13] [LCB+24] Yang Liu, Weixing Chen, Yongjie Bai, Xiaodan Liang, Guanbin Li, Wen Gao, and Liang Lin. Aligning cyber space with physical world: A comprehensive survey on embodied AI. arXiv preprint arXiv:2407.06886, 2024.
- [14] [LTS24] Long Hei Matthew Lam, Ramya Keerthy Thatikonda, and Ehsan Shareghi. A closer look at logical reasoning with LLMs: The choice of tool matters. arXiv preprint arXiv:2406.00284, 2024.
- [15] [LYC+24] Xuanyu Lei, Zonghan Yang, Xinrui Chen, Peng Li, and Yang Liu. Scaffolding coordinates to promote vision-language coordination in large multi-modal models. arXiv preprint arXiv:2402.12058, 2024.
- [16] [LZZ+24] Chengzu Li, Caiqi Zhang, Han Zhou, Nigel Collier, Anna Korhonen, and Ivan Vulić. TopViewRS: Vision-language models as top-view spatial reasoners. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1786–1807, Miami, Florida, USA, November 2024.
- [17] [RDT+24] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. SAT: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.
- [18] [SBW+24] Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, and Furu Wei. Multimodal latent language modeling with next-token diffusion. arXiv preprint arXiv:2412.08635, 2024.
- [19] [SJC+24] Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
- [20] [SKHW24] Seyedmorteza Sadat, Manuel Kansy, Otmar Hilliges, and Romann M. Weber. No training, no problem: Rethinking classifier-free guidance for diffusion models. arXiv preprint arXiv:2407.02687, 2024.
- [21] [TPK24] Michael Tschannen, André Susano Pinto, and Alexander Kolesnikov. JetFormer: An autoregressive generative model of raw images and text. arXiv preprint arXiv:2411.19722, 2024.
- [22] [WZH+23] Xiaofeng Wang, Zheng Zhu, Guan Huang, Xinze Chen, Jiagang Zhu, and Jiwen Lu. DriveDreamer: Towards real-world-driven world models for autonomous driving. arXiv preprint arXiv:2309.09777, 2023.
- [23] [YDG+23] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. arXiv preprint arXiv:2310.06114, 2023.
- [24] [YLW+23] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023.
- [25] [ZQC+24] Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3D-VLA: A 3D vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024.
- [26] [ZWZ+24b] Zheng Zhu, Xiaofeng Wang, Wangbo Zhao, Chen Min, Nianchen Deng, Min Dou, Yuqi Wang, Botian Shi, Kai Wang, Chi Zhang, Yang You, Zhaoxiang Zhang, Dawei Zhao, Liang Xiao, Jian Zhao, Jiwen Lu, and Guan Huang. Is Sora a world simulator? A comprehensive survey on general world models and beyond. arXiv preprint arXiv:2405.03520, 2024.
- [27] [ZYB+24] Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024.