pith. machine review for the scientific record.

arxiv: 2604.09580 · v1 · submitted 2026-02-25 · 💻 cs.AI · cs.LG

Recognition: no theorem link

OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:27 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords object-oriented world modeling · embodied reasoning · UML diagrams · robotic planning · large language models · world modeling · planning coherence

The pith

Representing the world as UML class hierarchies and activity diagrams lets language models build explicit state models and control flows that improve robotic planning over linear text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard chain-of-thought reasoning uses linear natural language that cannot clearly encode object relationships, state spaces, or causal transitions needed for physical tasks. The paper instead defines the world model as an explicit tuple of state abstraction and transition logic, then materializes it with UML class diagrams to organize perceived objects and activity diagrams to sequence actions. A training process of supervised fine-tuning plus group relative policy optimization uses only final plan success as reward to refine the diagrams even when detailed labels are scarce. On the MRoom-30k benchmark this produces plans with higher coherence, execution success, and structural fidelity than unstructured textual baselines.
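The tuple framing can be made concrete with a minimal sketch: plain Python dataclasses standing in for the UML class diagram (the state abstraction S) and an explicit function standing in for the activity diagram (the transition logic T: S × A → S′). All class, field, and action names below are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, replace

# Minimal sketch of the symbolic world model W = <S, T>.
# Dataclasses stand in for the UML class diagram (state abstraction S);
# an explicit function stands in for the activity diagram
# (transition logic T: S x A -> S'). All names are illustrative.

@dataclass(frozen=True)
class Obj:
    name: str
    location: str
    holding: bool = False  # True while grasped by the gripper

@dataclass(frozen=True)
class State:
    objects: tuple  # tuple of Obj — the grounded object hierarchy

def transition(state: State, action: tuple) -> State:
    """T: S x A -> S'. Actions: ('pick', name) or ('place', name, loc)."""
    verb, name, *rest = action
    updated = []
    for o in state.objects:
        if o.name != name:
            updated.append(o)
        elif verb == "pick":
            updated.append(replace(o, holding=True, location="gripper"))
        elif verb == "place":
            updated.append(replace(o, holding=False, location=rest[0]))
    return State(objects=tuple(updated))

s0 = State(objects=(Obj("cup", "table"),))
s2 = transition(transition(s0, ("pick", "cup")), ("place", "cup", "shelf"))
print(s2.objects[0].location)  # -> shelf
```

The point of the explicit structure is that object state is tracked by construction rather than re-inferred from free-form text at every reasoning step.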

Core claim

OOWM redefines the world model as the symbolic tuple W = ⟨S, T⟩, with S instantiated by UML class diagrams that ground visual perception into object hierarchies and T realized by activity diagrams that encode executable control policies. A three-stage pipeline combines supervised fine-tuning with Group Relative Policy Optimization driven by outcome-based rewards from the completed plan. This structured representation yields measurably higher planning coherence, execution success, and structural fidelity than linear text baselines on the MRoom-30k benchmark.
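The group-relative, outcome-reward idea at the core of GRPO (Shao et al., 2024) can be sketched in a few lines, under the assumptions that one binary reward scores each sampled plan and that rewards are normalized within the sampled group; no policy network is shown, and the reward values are invented for illustration.

```python
import statistics

# Hedged sketch of the group-relative advantage used by GRPO:
# sample a group of candidate plans for one task, score each with a
# sparse outcome reward (did the final plan succeed?), and normalize
# the rewards within the group. Reward values here are illustrative.

def group_relative_advantages(rewards):
    """Advantage_i = (r_i - group mean) / group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mu) / sigma for r in rewards]

# Four sampled plans; only the first and third succeeded.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_relative_advantages(rewards)
print(advs)  # successes pushed up, failures down, relative to the group
```

Because only the final plan is scored, the intermediate diagram structure is optimized implicitly, which is what makes the method workable when detailed diagram labels are scarce.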

What carries the argument

The explicit world model W = ⟨S, T⟩ realized through UML Class Diagrams for state abstraction and Activity Diagrams for control policy.

If this is right

  • Explicit object hierarchies reduce state-tracking errors during multi-step planning.
  • Activity diagrams produce control flows that map directly to robot-executable sequences.
  • Outcome-based rewards enable effective learning of structured reasoning with sparse supervision.
  • The method raises structural fidelity of generated plans compared with free-form text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing the framework with vision models could allow automatic construction and updating of the UML diagrams from camera input during execution.
  • The same class-and-activity structure could be tested in non-robotic settings such as game environments or logistics planning that require tracking object states.
  • Predefined UML formalisms may constrain adaptability when the environment contains objects or relations outside the trained class vocabulary.

Load-bearing premise

That UML class hierarchies and activity diagrams can fully capture the state space, object relationships, and causal dependencies required for reliable robotic planning.

What would settle it

If OOWM plans show no higher execution success rate than standard chain-of-thought text on the MRoom-30k benchmark, the performance advantage claim is falsified.

Figures

Figures reproduced from arXiv: 2604.09580 by Guangrun Wang, Hongyu Chen, Liang Lin.

Figure 1. Comparison of Standard Text-based CoT vs. Object-Oriented Programmatic World Modeling (OOWM). (a) Input: …
Figure 2. Instantiation of the State Abstraction (G_state).
Figure 3. Instantiation of the Control Policy (G_control). (a) The visual abstraction rendered as a UML Activity Diagram, illustrating the executable control flow. (b) The corresponding serialized PlantUML code.
Figure 4. Overview of the proposed OOWM architecture. The input image is processed via dynamic resolution slicing and encoded …
Figure 5. Overview of Group Relative Policy Optimization.
Figure 6. Comparison of training dynamics during Stage 3.
Figure 7. Impact of SFT initialization on OOWM instantiation success (x-axis: epoch; y-axis: value; curves: similarity, precision, recall, f1).
Original abstract

Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_\text{state}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_\text{control}$) representing the transition logic $T: S \times A \rightarrow S'$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.
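The serialization step the abstract and Figure 3 describe — operationalizing a plan as an executable control flow in PlantUML text — can be illustrated with a toy emitter. The step names and purely linear flow are invented for illustration; the paper's actual diagrams carry richer branching and object references.

```python
# Hedged illustration of rendering a plan's control flow as
# PlantUML activity-diagram text. Step names are invented; a real
# OOWM diagram would reference the grounded object hierarchy.

steps = ["locate cup", "pick cup", "move to shelf", "place cup"]
lines = ["@startuml", "start"]
lines += [f":{s};" for s in steps]  # one activity node per plan step
lines += ["stop", "@enduml"]
plantuml = "\n".join(lines)
print(plantuml)
```

A serialized form like this is what lets the training pipeline treat the diagram as ordinary model output while still being renderable and checkable as a structured artifact.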

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Object-Oriented World Modeling (OOWM), which redefines the world model as an explicit symbolic tuple W = ⟨S, T⟩ consisting of a State Abstraction (G_state) and Control Policy (G_control). It materializes this via UML Class Diagrams for object hierarchies from visual perception and Activity Diagrams for planning control flows, trained via a three-stage SFT + GRPO pipeline that uses outcome rewards on final plans. The central claim is that this structured approach significantly outperforms unstructured textual baselines on the MRoom-30k benchmark in planning coherence, execution success, and structural fidelity.

Significance. If the empirical results and ablations hold, OOWM would offer a concrete bridge between software-engineering formalisms and LLM-based embodied planning, potentially improving robustness by explicitly encoding object hierarchies and causal transitions rather than relying on linear text. The use of GRPO with sparse outcome rewards is a practical strength for learning structured representations.

major comments (2)
  1. [Experiments] Experiments section: The abstract asserts 'significant outperformance' on MRoom-30k yet supplies no quantitative metrics, tables, ablation results, error bars, or implementation details. This leaves the central empirical claim unsupported by visible evidence and prevents assessment of effect sizes.
  2. [Method and Experiments] Method and Experiments sections: The paper attributes gains to the UML class/activity diagrams capturing state-space, object hierarchies, and causal dependencies. However, no ablation isolates the diagrammatic representation from the GRPO outcome-reward training (e.g., by treating generated diagrams as unstructured text while retaining the same SFT+GRPO pipeline). This is load-bearing for the claim that the explicit object-oriented formalism, rather than the training objective alone, drives the reported improvements in coherence and success.
minor comments (2)
  1. [Abstract] Abstract: The mapping from the tuple W = ⟨S, T⟩ to the UML components G_state and G_control is stated but would benefit from an explicit equation or small diagram showing how visual input populates the class hierarchy and how activity flows encode transitions T.
  2. [Introduction/Method] Notation: The symbols G_state and G_control are introduced without a clear definition of their formal relationship to the state set S and transition function T; a short definitional paragraph or equation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, clarifying our position and committing to revisions that will strengthen the empirical presentation without misrepresenting the work.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The abstract asserts 'significant outperformance' on MRoom-30k yet supplies no quantitative metrics, tables, ablation results, error bars, or implementation details. This leaves the central empirical claim unsupported by visible evidence and prevents assessment of effect sizes.

    Authors: We acknowledge that the submitted version does not present the full quantitative details in a readily accessible form within the Experiments section. In the revised manuscript we will expand this section with comprehensive tables reporting exact success rates, coherence scores, execution success, and structural fidelity metrics on MRoom-30k, including error bars from multiple runs, full ablation tables, and complete implementation details (model sizes, hyperparameters, training schedules, dataset statistics, and evaluation protocols). These additions will directly support the abstract claims with visible evidence. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: The paper attributes gains to the UML class/activity diagrams capturing state-space, object hierarchies, and causal dependencies. However, no ablation isolates the diagrammatic representation from the GRPO outcome-reward training (e.g., by treating generated diagrams as unstructured text while retaining the same SFT+GRPO pipeline). This is load-bearing for the claim that the explicit object-oriented formalism, rather than the training objective alone, drives the reported improvements in coherence and success.

    Authors: We agree that an ablation isolating the UML diagrammatic structure from the GRPO training objective is necessary to substantiate the central claim. In the revised manuscript we will add this experiment by applying the identical three-stage SFT+GRPO pipeline to a baseline that generates unstructured textual plans rather than UML diagrams. The results will be reported in a new table within the Experiments section, allowing direct comparison of the contribution of the explicit object-oriented formalism. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the modeling choice and pipeline are independent of the inputs.

full rationale

The paper redefines the world model as the tuple W = ⟨S, T⟩ and materializes it via UML Class and Activity Diagrams as an explicit design decision rather than a derivation from prior equations or data fits. The three-stage SFT+GRPO pipeline uses outcome rewards on final plans but does not reduce any claimed prediction to a fitted parameter or self-referential loop by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a derived result. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level claim that UML diagrams suffice for world modeling.

pith-pipeline@v0.9.0 · 5591 in / 1069 out tokens · 54522 ms · 2026-05-15T19:27:35.030363+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

cs.RO · 2026-04 · unverdicted · novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1] Charles Ashbacher. 2004. "The Unified Modeling Language Reference Manual, Second Edition", by James Rumbaugh. J. Object Technol. 3, 10 (2004), 193–195.

  2. [2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In AAAI. AAAI Press, 17682–17690.

  3. [3] Yash Bhalgat, Iro Laina, João F. Henriques, Andrea Vedaldi, and Andrew Zisserman. 2023. Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion. In NeurIPS.

  4. [4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 2024. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024).

  5. [5] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. 2022. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817 (2022).

  6. [6] Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, and Guangcong Wang. 2025. Style4D-Bench: A Benchmark Suite for 4D Stylization. arXiv preprint arXiv:2508.19243 (2025).

  7. [7] Yuhao Chen, Zhihao Zhan, Xiaoxin Lin, Zijian Song, Hao Liu, Qinhan Lyu, Yubo Zu, Xiao Chen, Zhiyuan Liu, Tao Pu, Tianshui Chen, Keze Wang, Liang Lin, and Guangrun Wang. 2026. RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation. Technical Report. Sun Yat-sen University…

  8. [8] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271 (2024).

  9. [9] Kun Chu, Xufeng Zhao, Cornelius Weber, and Stefan Wermter. 2025. LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language. CoRR abs/2503.17309 (2025).

  10. [10] Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In ICLR. OpenReview.net.

  11. [11] Muzhi Han, Yifeng Zhu, Song-Chun Zhu, Ying Nian Wu, and Yuke Zhu. 2024. INTERPRET: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning. In Robotics: Science and Systems.

  12. [12] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 2025. $\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054 (2025).

  13. [13] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In NeurIPS.

  14. [14] Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, and Guangrun Wang. 2025. VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling. arXiv preprint arXiv:2512.02902 (2025).

  15. [15] Xiao Li, Jiaqi Zhang, Shuxiang Zhang, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. In-Situ Tweedie Discrete Diffusion Models. arXiv preprint arXiv:2510.01047 (2025).

  16. [16] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. In NeurIPS.

  17. [17] Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. In EMNLP (Findings). Association for Computational Linguistics, 3806–3824.

  18. [18] Ariadna Quattoni and Antonio Torralba. 2009. Recognizing Indoor Scenes. In CVPR. IEEE Computer Society, 413–420.

  19. [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. CoRR abs/2402.03300 (2024).

  20. [20] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. 2025. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. CoRR abs/2504.07615 (2025).

  21. [21] Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. Physical Autoregressive Model for Robotic Manipulation without Action Pretraining. arXiv preprint arXiv:2508.09822 (2025).

  22. [22] Xin Su, Tiep Le, Steven Bethard, and Phillip Howard. 2024. Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning. In NAACL-HLT. Association for Computational Linguistics, 8597–8613.

  23. [23] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In ACL (1). Association for Computational Linguistics, 2609–2634.

  24. [24] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. 2024. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. arXiv preprint arXiv:2411.10442 (2024).

  25. [25] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In ICLR. OpenReview.net.

  26. [26] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.

  27. [27] Lintao Xiang, Xinkai Chen, Jianhuang Lai, and Guangcong Wang. 2025. Distilled-3DGS: Distilled 3D Gaussian Splatting. arXiv preprint arXiv:2508.14037 (2025).

  28. [28] Siheng Xiong, Jieyu Zhou, Zhangding Liu, and Yusen Su. 2025. SymPlanner: Deliberate Planning in Language Models with Symbolic Representation. CoRR abs/2505.01479 (2025).

  29. [29] Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu.

  30. [30] Faithful Logical Reasoning via Symbolic Chain-of-Thought. In ACL (1). Association for Computational Linguistics, 13326–13365.

  31. [31] Yuanfeng Xu, Yuhao Chen, Liang Lin, and Guangrun Wang. 2026. Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion. arXiv preprint arXiv:2601.04056 (2026).

  32. [32] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In NeurIPS.

  33. [33] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF: Rank Responses to Align Language Models with Human Feedback without tears. CoRR abs/2304.05302 (2023).

  34. [34] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. STaR: Bootstrapping Reasoning With Reasoning. In NeurIPS.

  35. [35] Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. Stable Language Guidance for Vision-Language-Action Models. arXiv preprint arXiv:2601.04052 (2026).

  36. [36] Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Keze Wang, et al. 2025. E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion. arXiv preprint arXiv:2511.21542 (2025).

  37. [37] Li Zhang. 2024. Structured Event Reasoning with Large Language Models. CoRR abs/2408.16098 (2024).

  38. [38] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2024. Multimodal Chain-of-Thought Reasoning in Language Models. Trans. Mach. Learn. Res. 2024 (2024).

  39. [39] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In ICLR. OpenReview.net.

  40. [40] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. 2023. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Conference on Robot Learning. PMLR, 2165–2183.

  41. [41] Heqing Zou, Tianze Luo, Guiyang Xie, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang, et al. 2024. From Seconds to Hours: Reviewing Multimodal Large Language Models on Comprehensive Long Video Understanding. arXiv preprint arXiv:2409.18938 (2024).