pith. machine review for the scientific record.

arxiv: 2604.09580 · v1 · submitted 2026-02-25 · 💻 cs.AI · cs.LG

Recognition: no theorem link

OOWM: Structuring Embodied Reasoning and Planning via Object-Oriented Programmatic World Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:27 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords object-oriented world modeling · embodied reasoning · UML diagrams · robotic planning · large language models · world modeling · planning coherence

The pith

Representing the world as UML class hierarchies and activity diagrams lets language models build explicit state models and control flows that improve robotic planning over linear text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard chain-of-thought reasoning uses linear natural language that cannot clearly encode object relationships, state spaces, or causal transitions needed for physical tasks. The paper instead defines the world model as an explicit tuple of state abstraction and transition logic, then materializes it with UML class diagrams to organize perceived objects and activity diagrams to sequence actions. A training process of supervised fine-tuning plus group relative policy optimization uses only final plan success as reward to refine the diagrams even when detailed labels are scarce. On the MRoom-30k benchmark this produces plans with higher coherence, execution success, and structural fidelity than unstructured textual baselines.
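The tuple framing can be made concrete with a minimal sketch: plain Python dataclasses standing in for the UML class diagram (the state abstraction S) and an explicit function standing in for the activity diagram (the transition logic T: S × A → S′). All class, field, and action names below are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass, replace

# Minimal sketch of the symbolic world model W = <S, T>.
# Dataclasses stand in for the UML class diagram (state abstraction S);
# an explicit function stands in for the activity diagram
# (transition logic T: S x A -> S'). All names are illustrative.

@dataclass(frozen=True)
class Obj:
    name: str
    location: str
    holding: bool = False  # True while grasped by the gripper

@dataclass(frozen=True)
class State:
    objects: tuple  # tuple of Obj — the grounded object hierarchy

def transition(state: State, action: tuple) -> State:
    """T: S x A -> S'. Actions: ('pick', name) or ('place', name, loc)."""
    verb, name, *rest = action
    updated = []
    for o in state.objects:
        if o.name != name:
            updated.append(o)
        elif verb == "pick":
            updated.append(replace(o, holding=True, location="gripper"))
        elif verb == "place":
            updated.append(replace(o, holding=False, location=rest[0]))
    return State(objects=tuple(updated))

s0 = State(objects=(Obj("cup", "table"),))
s2 = transition(transition(s0, ("pick", "cup")), ("place", "cup", "shelf"))
print(s2.objects[0].location)  # -> shelf
```

The point of the explicit structure is that object state is tracked by construction rather than re-inferred from free-form text at every reasoning step.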

Core claim

OOWM redefines the world model as the symbolic tuple W = ⟨S, T⟩, with S instantiated by UML class diagrams that ground visual perception into object hierarchies and T realized by activity diagrams that encode executable control policies. A three-stage pipeline combines supervised fine-tuning with Group Relative Policy Optimization driven by outcome-based rewards from the completed plan. This structured representation yields measurably higher planning coherence, execution success, and structural fidelity than linear text baselines on the MRoom-30k benchmark.
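The group-relative, outcome-reward idea at the core of GRPO (Shao et al., 2024) can be sketched in a few lines, under the assumptions that one binary reward scores each sampled plan and that rewards are normalized within the sampled group; no policy network is shown, and the reward values are invented for illustration.

```python
import statistics

# Hedged sketch of the group-relative advantage used by GRPO:
# sample a group of candidate plans for one task, score each with a
# sparse outcome reward (did the final plan succeed?), and normalize
# the rewards within the group. Reward values here are illustrative.

def group_relative_advantages(rewards):
    """Advantage_i = (r_i - group mean) / group std."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # avoid divide-by-zero
    return [(r - mu) / sigma for r in rewards]

# Four sampled plans; only the first and third succeeded.
rewards = [1.0, 0.0, 1.0, 0.0]
advs = group_relative_advantages(rewards)
print(advs)  # successes pushed up, failures down, relative to the group
```

Because only the final plan is scored, the intermediate diagram structure is optimized implicitly, which is what makes the method workable when detailed diagram labels are scarce.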

What carries the argument

The explicit world model W = ⟨S, T⟩ realized through UML Class Diagrams for state abstraction and Activity Diagrams for control policy.

If this is right

  • Explicit object hierarchies reduce state-tracking errors during multi-step planning.
  • Activity diagrams produce control flows that map directly to robot-executable sequences.
  • Outcome-based rewards enable effective learning of structured reasoning with sparse supervision.
  • The method raises structural fidelity of generated plans compared with free-form text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing the framework with vision models could allow automatic construction and updating of the UML diagrams from camera input during execution.
  • The same class-and-activity structure could be tested in non-robotic settings such as game environments or logistics planning that require tracking object states.
  • Predefined UML formalisms may constrain adaptability when the environment contains objects or relations outside the trained class vocabulary.

Load-bearing premise

That UML class hierarchies and activity diagrams can fully capture the state space, object relationships, and causal dependencies required for reliable robotic planning.

What would settle it

If OOWM plans show no higher execution success rate than standard chain-of-thought text on the MRoom-30k benchmark, the performance advantage claim is falsified.

Figures

Figures reproduced from arXiv: 2604.09580 by Guangrun Wang, Hongyu Chen, Liang Lin.

Figure 1. Comparison of Standard Text-based CoT vs. Object-Oriented Programmatic World Modeling (OOWM). (a) Input: …
Figure 2. Instantiation of the State Abstraction (G_state).
Figure 3. Instantiation of the Control Policy (G_control). (a) The visual abstraction rendered as a UML Activity Diagram, illustrating the executable control flow. (b) The corresponding serialized PlantUML code.
Figure 4. Overview of the proposed OOWM architecture. The input image is processed via dynamic resolution slicing and encoded …
Figure 5. Overview of Group Relative Policy Optimization.
Figure 6. Comparison of training dynamics during Stage 3.
Figure 7. Impact of SFT initialization on OOWM instantiation success (x-axis: epoch; y-axis: value; curves: similarity, precision, recall, f1).
Original abstract

Standard Chain-of-Thought (CoT) prompting empowers Large Language Models (LLMs) with reasoning capabilities, yet its reliance on linear natural language is inherently insufficient for effective world modeling in embodied tasks. While text offers flexibility, it fails to explicitly represent the state-space, object hierarchies, and causal dependencies required for robust robotic planning. To address these limitations, we propose Object-Oriented World Modeling (OOWM), a novel framework that structures embodied reasoning through the lens of software engineering formalisms. We redefine the world model not as a latent vector space, but as an explicit symbolic tuple $W = \langle S, T \rangle$: a State Abstraction ($G_\text{state}$) instantiating the environmental state $S$, coupled with a Control Policy ($G_\text{control}$) representing the transition logic $T: S \times A \rightarrow S'$. OOWM leverages the Unified Modeling Language (UML) to materialize this definition: it employs Class Diagrams to ground visual perception into rigorous object hierarchies, and Activity Diagrams to operationalize planning into executable control flows. Furthermore, we introduce a three-stage training pipeline combining Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO). Crucially, this method utilizes outcome-based rewards from the final plan to implicitly optimize the underlying object-oriented reasoning structure, enabling effective learning even with sparse annotations. Extensive evaluations on the MRoom-30k benchmark demonstrate that OOWM significantly outperforms unstructured textual baselines in planning coherence, execution success, and structural fidelity, establishing a new paradigm for structured embodied reasoning.
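The serialization step the abstract and Figure 3 describe — operationalizing a plan as an executable control flow in PlantUML text — can be illustrated with a toy emitter. The step names and purely linear flow are invented for illustration; the paper's actual diagrams carry richer branching and object references.

```python
# Hedged illustration of rendering a plan's control flow as
# PlantUML activity-diagram text. Step names are invented; a real
# OOWM diagram would reference the grounded object hierarchy.

steps = ["locate cup", "pick cup", "move to shelf", "place cup"]
lines = ["@startuml", "start"]
lines += [f":{s};" for s in steps]  # one activity node per plan step
lines += ["stop", "@enduml"]
plantuml = "\n".join(lines)
print(plantuml)
```

A serialized form like this is what lets the training pipeline treat the diagram as ordinary model output while still being renderable and checkable as a structured artifact.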

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Object-Oriented World Modeling (OOWM), which redefines the world model as an explicit symbolic tuple W = ⟨S, T⟩ consisting of a State Abstraction (G_state) and Control Policy (G_control). It materializes this via UML Class Diagrams for object hierarchies from visual perception and Activity Diagrams for planning control flows, trained via a three-stage SFT + GRPO pipeline that uses outcome rewards on final plans. The central claim is that this structured approach significantly outperforms unstructured textual baselines on the MRoom-30k benchmark in planning coherence, execution success, and structural fidelity.

Significance. If the empirical results and ablations hold, OOWM would offer a concrete bridge between software-engineering formalisms and LLM-based embodied planning, potentially improving robustness by explicitly encoding object hierarchies and causal transitions rather than relying on linear text. The use of GRPO with sparse outcome rewards is a practical strength for learning structured representations.

major comments (2)
  1. [Experiments] Experiments section: The abstract asserts 'significant outperformance' on MRoom-30k yet supplies no quantitative metrics, tables, ablation results, error bars, or implementation details. This leaves the central empirical claim unsupported by visible evidence and prevents assessment of effect sizes.
  2. [Method and Experiments] Method and Experiments sections: The paper attributes gains to the UML class/activity diagrams capturing state-space, object hierarchies, and causal dependencies. However, no ablation isolates the diagrammatic representation from the GRPO outcome-reward training (e.g., by treating generated diagrams as unstructured text while retaining the same SFT+GRPO pipeline). This is load-bearing for the claim that the explicit object-oriented formalism, rather than the training objective alone, drives the reported improvements in coherence and success.
minor comments (2)
  1. [Abstract] Abstract: The mapping from the tuple W = ⟨S, T⟩ to the UML components G_state and G_control is stated but would benefit from an explicit equation or small diagram showing how visual input populates the class hierarchy and how activity flows encode transitions T.
  2. [Introduction/Method] Notation: The symbols G_state and G_control are introduced without a clear definition of their formal relationship to the state set S and transition function T; a short definitional paragraph or equation would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, clarifying our position and committing to revisions that will strengthen the empirical presentation without misrepresenting the work.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The abstract asserts 'significant outperformance' on MRoom-30k yet supplies no quantitative metrics, tables, ablation results, error bars, or implementation details. This leaves the central empirical claim unsupported by visible evidence and prevents assessment of effect sizes.

    Authors: We acknowledge that the submitted version does not present the full quantitative details in a readily accessible form within the Experiments section. In the revised manuscript we will expand this section with comprehensive tables reporting exact success rates, coherence scores, execution success, and structural fidelity metrics on MRoom-30k, including error bars from multiple runs, full ablation tables, and complete implementation details (model sizes, hyperparameters, training schedules, dataset statistics, and evaluation protocols). These additions will directly support the abstract claims with visible evidence. revision: yes

  2. Referee: [Method and Experiments] Method and Experiments sections: The paper attributes gains to the UML class/activity diagrams capturing state-space, object hierarchies, and causal dependencies. However, no ablation isolates the diagrammatic representation from the GRPO outcome-reward training (e.g., by treating generated diagrams as unstructured text while retaining the same SFT+GRPO pipeline). This is load-bearing for the claim that the explicit object-oriented formalism, rather than the training objective alone, drives the reported improvements in coherence and success.

    Authors: We agree that an ablation isolating the UML diagrammatic structure from the GRPO training objective is necessary to substantiate the central claim. In the revised manuscript we will add this experiment by applying the identical three-stage SFT+GRPO pipeline to a baseline that generates unstructured textual plans rather than UML diagrams. The results will be reported in a new table within the Experiments section, allowing direct comparison of the contribution of the explicit object-oriented formalism. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the modeling choice and pipeline are independent of the inputs.

full rationale

The paper redefines the world model as the tuple W = ⟨S, T⟩ and materializes it via UML Class and Activity Diagrams as an explicit design decision rather than a derivation from prior equations or data fits. The three-stage SFT+GRPO pipeline uses outcome rewards on final plans but does not reduce any claimed prediction to a fitted parameter or self-referential loop by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatz or renaming of known results is presented as a derived result. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level claim that UML diagrams suffice for world modeling.

pith-pipeline@v0.9.0 · 5591 in / 1069 out tokens · 54522 ms · 2026-05-15T19:27:35.030363+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

cs.RO · 2026-04 · unverdicted · novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1] Charles Ashbacher. 2004. "The Unified Modeling Language Reference Manual, Second Edition", by James Rumbaugh. J. Object Technol. 3, 10 (2004), 193–195.

  2. [2] Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Michal Podstawski, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Hubert Niewiadomski, Piotr Nyczyk, and Torsten Hoefler. 2024. Graph of Thoughts: Solving Elaborate Problems with Large Language Models. In AAAI. AAAI Press, 17682–17690.

  3. [3] Yash Bhalgat, Iro Laina, João F. Henriques, Andrea Vedaldi, and Andrew Zisserman. 2023. Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion. In NeurIPS.

  4. [4] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 2024. $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control. arXiv preprint arXiv:2410.24164 (2024).

  5. [5] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. 2022. RT-1: Robotics Transformer for Real-World Control at Scale. arXiv preprint arXiv:2212.06817 (2022).

  6. [6] Beiqi Chen, Shuai Shao, Haitang Feng, Jianhuang Lai, Jianlou Si, and Guangcong Wang. 2025. Style4D-Bench: A Benchmark Suite for 4D Stylization. arXiv preprint arXiv:2508.19243 (2025).

  7. [7] Yuhao Chen, Zhihao Zhan, Xiaoxin Lin, Zijian Song, Hao Liu, Qinhan Lyu, Yubo Zu, Xiao Chen, Zhiyuan Liu, Tao Pu, Tianshui Chen, Keze Wang, Liang Lin, and Guangrun Wang. 2026. RADAR: Benchmarking Vision-Language-Action Generalization via Real-World Dynamics, Spatial-Physical Intelligence, and Autonomous Evaluation. Technical Report. Sun Yat-sen University…

  8. [8] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling. arXiv preprint arXiv:2412.05271 (2024).

  9. [9] Kun Chu, Xufeng Zhao, Cornelius Weber, and Stefan Wermter. 2025. LLM+MAP: Bimanual Robot Task Planning using Large Language Models and Planning Domain Definition Language. CoRR abs/2503.17309 (2025).

  10. [10] Antonia Creswell, Murray Shanahan, and Irina Higgins. 2023. Selection-Inference: Exploiting Large Language Models for Interpretable Logical Reasoning. In ICLR. OpenReview.net.

  11. [11] Muzhi Han, Yifeng Zhu, Song-Chun Zhu, Ying Nian Wu, and Yuke Zhu. 2024. INTERPRET: Interactive Predicate Learning from Language Feedback for Generalizable Task Planning. In Robotics: Science and Systems.

  12. [12] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. 2025. $\pi_{0.5}$: A Vision-Language-Action Model with Open-World Generalization. arXiv preprint arXiv:2504.16054 (2025).

  13. [13] Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. 2022. Large Language Models are Zero-Shot Reasoners. In NeurIPS.

  14. [14] Weiqi Li, Quande Zhang, Ruifeng Zhai, Liang Lin, and Guangrun Wang. 2025. VLA Models Are More Generalizable Than You Think: Revisiting Physical and Spatial Modeling. arXiv preprint arXiv:2512.02902 (2025).

  15. [15] Xiao Li, Jiaqi Zhang, Shuxiang Zhang, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. In-Situ Tweedie Discrete Diffusion Models. arXiv preprint arXiv:2510.01047 (2025).

  16. [16] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training Language Models to Follow Instructions with Human Feedback. In NeurIPS.

  17. [17] Liangming Pan, Alon Albalak, Xinyi Wang, and William Yang Wang. 2023. Logic-LM: Empowering Large Language Models with Symbolic Solvers for Faithful Logical Reasoning. In EMNLP (Findings). Association for Computational Linguistics, 3806–3824.

  18. [18] Ariadna Quattoni and Antonio Torralba. 2009. Recognizing Indoor Scenes. In CVPR. IEEE Computer Society, 413–420.

  19. [19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. CoRR abs/2402.03300 (2024).

  20. [20] Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. 2025. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model. CoRR abs/2504.07615 (2025).

  21. [21] Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, and Guangrun Wang. 2025. Physical Autoregressive Model for Robotic Manipulation without Action Pretraining. arXiv preprint arXiv:2508.09822 (2025).

  22. [22] Xin Su, Tiep Le, Steven Bethard, and Phillip Howard. 2024. Semi-Structured Chain-of-Thought: Integrating Multiple Sources of Knowledge for Improved Language Model Reasoning. In NAACL-HLT. Association for Computational Linguistics, 8597–8613.

  23. [23] Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, Roy Ka-Wei Lee, and Ee-Peng Lim. 2023. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. In ACL (1). Association for Computational Linguistics, 2609–2634.

  24. [24] Weiyun Wang, Zhe Chen, Wenhai Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Jinguo Zhu, Xizhou Zhu, Lewei Lu, Yu Qiao, and Jifeng Dai. 2024. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization. arXiv preprint arXiv:2411.10442 (2024).

  25. [25] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. In ICLR. OpenReview.net.

  26. [26] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS.

  27. [27] Lintao Xiang, Xinkai Chen, Jianhuang Lai, and Guangcong Wang. 2025. Distilled-3DGS: Distilled 3D Gaussian Splatting. arXiv preprint arXiv:2508.14037 (2025).

  28. [28] Siheng Xiong, Jieyu Zhou, Zhangding Liu, and Yusen Su. 2025. SymPlanner: Deliberate Planning in Language Models with Symbolic Representation. CoRR abs/2505.01479 (2025).

  29. [29] Jundong Xu, Hao Fei, Liangming Pan, Qian Liu, Mong-Li Lee, and Wynne Hsu.

  30. [30] Faithful Logical Reasoning via Symbolic Chain-of-Thought. In ACL (1). Association for Computational Linguistics, 13326–13365.

  31. [31] Yuanfeng Xu, Yuhao Chen, Liang Lin, and Guangrun Wang. 2026. Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion. arXiv preprint arXiv:2601.04056 (2026).

  32. [32] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In NeurIPS.

  33. [33] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. 2023. RRHF: Rank Responses to Align Language Models with Human Feedback without tears. CoRR abs/2304.05302 (2023).

  34. [34] Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah D. Goodman. 2022. STaR: Bootstrapping Reasoning With Reasoning. In NeurIPS.

  35. [35] Zhihao Zhan, Yuhao Chen, Jiaying Zhou, Qinhan Lv, Hao Liu, Keze Wang, Liang Lin, and Guangrun Wang. 2026. Stable Language Guidance for Vision-Language-Action Models. arXiv preprint arXiv:2601.04052 (2026).

  36. [36] Zhihao Zhan, Jiaying Zhou, Likui Zhang, Qinhan Lv, Hao Liu, Jusheng Zhang, Weizheng Li, Ziliang Chen, Tianshui Chen, Keze Wang, et al. 2025. E0: Enhancing Generalization and Fine-Grained Control in VLA Models via Continuized Discrete Diffusion. arXiv preprint arXiv:2511.21542 (2025).

  37. [37] Li Zhang. 2024. Structured Event Reasoning with Large Language Models. CoRR abs/2408.16098 (2024).

  38. [38] Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2024. Multimodal Chain-of-Thought Reasoning in Language Models. Trans. Mach. Learn. Res. 2024 (2024).

  39. [39] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc V. Le, and Ed H. Chi. 2023. Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. In ICLR. OpenReview.net.

  40. [40] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. 2023. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. In Conference on Robot Learning. PMLR, 2165–2183.

  41. [41] Heqing Zou, Tianze Luo, Guiyang Xie, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang, et al. 2024. From Seconds to Hours: Reviewing Multimodal Large Language Models on Comprehensive Long Video Understanding. arXiv preprint arXiv:2409.18938 (2024).