pith. machine review for the scientific record.

arxiv: 2604.07034 · v1 · submitted 2026-04-08 · 💻 cs.RO · cs.AI · cs.CV

Recognition: no theorem link

KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:43 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords robot failure analysis · vision-language models · keyframes · bird's-eye view · training-free · tokenized evidence · RoboFAC benchmark · VLM prompting

The pith

KITE turns robot videos into compact keyframe and bird's-eye-view tokens so off-the-shelf VLMs can detect, identify, and explain failures without any training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KITE as a training-free front-end that distills long robot execution videos into a small set of motion-salient keyframes paired with schematic bird's-eye-view representations. Each keyframe carries open-vocabulary detections while the BEV encodes object layout, axes, timestamps, and confidence; these are serialized with robot-profile tokens into a unified prompt. The resulting evidence lets any off-the-shelf vision-language model perform failure detection, identification, localization, explanation, and correction. On the RoboFAC benchmark this approach yields large gains over vanilla Qwen2.5-VL in the training-free setting, especially for simulation tasks, while staying competitive with a tuned baseline. A modest QLoRA fine-tune further lifts explanation and correction quality, and qualitative tests on real dual-arm hardware indicate practical applicability.
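To make the serialization step concrete, here is a minimal sketch of how per-keyframe detections, a BEV summary, and robot-profile text could be flattened into one prompt for an off-the-shelf VLM. The dataclass fields and bracketed token labels are illustrative assumptions, not the schema released with the paper; the keyframe and BEV images themselves would be attached to the VLM call separately.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Detection:
    label: str          # open-vocabulary class name, e.g. "mug"
    box: tuple          # (x1, y1, x2, y2) in pixels
    confidence: float

@dataclass
class KeyframeEvidence:
    timestamp_s: float
    detections: List[Detection]
    bev_summary: str    # short textual description of the pseudo-BEV layout

def serialize_prompt(robot_profile: str, scene_context: str,
                     evidence: List[KeyframeEvidence], query: str) -> str:
    """Flatten robot profile, scene context, and per-keyframe evidence into a
    single text prompt (hypothetical format, not the paper's exact template)."""
    lines = [f"[ROBOT PROFILE] {robot_profile}",
             f"[SCENE CONTEXT] {scene_context}"]
    for i, kf in enumerate(evidence):
        dets = "; ".join(f"{d.label} (conf {d.confidence:.2f})"
                         for d in kf.detections)
        lines.append(f"[KEYFRAME {i} @ {kf.timestamp_s:.1f}s] detections: {dets}")
        lines.append(f"[BEV {i}] {kf.bev_summary}")
    lines.append(f"[QUERY] {query}")
    return "\n".join(lines)
```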

Core claim

KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view representation that encodes relative object layout, axes, timestamps, and detection confidence; these visual cues are serialized with robot-profile and scene-context tokens into a unified prompt that supports failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM.

What carries the argument

KITE front-end: keyframe selection from motion salience, paired with schematic BEV layouts and serialized robot-profile tokens that produce compact tokenized evidence for VLMs.
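A minimal sketch of what keyframe selection from motion salience could look like, assuming Farnebäck optical flow (which the paper cites) and simple peak picking over per-frame flow magnitude. The threshold and minimum spacing below are placeholders; the paper's exact selection rule is not specified in this summary.

```python
import cv2
import numpy as np
from scipy.signal import find_peaks

def motion_salient_keyframes(video_path: str,
                             min_gap_frames: int = 15,
                             rel_height: float = 0.5) -> list[int]:
    """Score each frame by mean Farnebäck optical-flow magnitude, then keep
    peaks above a relative threshold that are at least min_gap_frames apart."""
    cap = cv2.VideoCapture(video_path)
    ok, prev = cap.read()
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    scores = [0.0]                      # frame 0 has no preceding frame
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        scores.append(float(np.linalg.norm(flow, axis=2).mean()))
        prev_gray = gray
    cap.release()
    scores = np.asarray(scores)
    peaks, _ = find_peaks(scores,
                          height=rel_height * scores.max(),
                          distance=min_gap_frames)
    return peaks.tolist()
```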

If this is right

  • On RoboFAC, KITE with Qwen2.5-VL substantially outperforms vanilla Qwen2.5-VL in training-free failure detection, identification, and localization.
  • The same KITE prompt supports the full pipeline of detection through correction with one off-the-shelf model.
  • A small QLoRA fine-tune on top of KITE further improves explanation and correction quality.
  • Qualitative results on real dual-arm robots indicate that the structured evidence transfers beyond simulation.
  • KITE remains competitive with a fully tuned RoboFAC baseline while requiring no task-specific training for the base VLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same keyframe-plus-BEV serialization could be reused for other long-horizon video tasks such as anomaly detection in assembly or navigation logs.
  • Because the BEV is schematic and human-readable, the evidence stream may also serve as an interpretable log for human oversight or regulatory review.
  • If keyframe selection misses subtle contact events, adding a lightweight optical-flow or contact-sensor filter before tokenization would be a direct extension.
  • The method's reliance on open-vocabulary detections suggests easy swapping of the underlying detector without retraining the VLM prompt logic.
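The last bullet amounts to programming the detector against a narrow interface so it can be swapped without touching the prompt logic. A hypothetical sketch (the protocol and helper below are invented for illustration; Grounding DINO is one detector the paper cites):

```python
from typing import List, Protocol, Tuple

class OpenVocabDetector(Protocol):
    """Anything that maps (image, text prompts) to labelled boxes with scores."""
    def detect(self, image, prompts: List[str]
               ) -> List[Tuple[str, Tuple[int, int, int, int], float]]:
        ...

def build_keyframe_evidence(image, detector: OpenVocabDetector,
                            prompts: List[str], timestamp_s: float) -> dict:
    # Any detector honouring the protocol (Grounding DINO, OWL-ViT, ...) can be
    # dropped in; the downstream serialization only sees this dictionary.
    dets = detector.detect(image, prompts)
    return {
        "timestamp_s": timestamp_s,
        "detections": [{"label": lbl, "box": box, "confidence": conf}
                       for lbl, box, conf in dets],
    }
```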

Load-bearing premise

The chosen keyframes and BEV schematics preserve every piece of information an off-the-shelf VLM needs to correctly analyze failures without critical omissions or misleading cues.

What would settle it

A robot failure whose root cause is visible only in frames or spatial details omitted by the keyframe and BEV selection process, causing the VLM to produce an incorrect detection or explanation.

Figures

Figures reproduced from arXiv: 2604.07034 by Feras Dayoub, King Hang Wong, Mehdi Hosseinzadeh.

Figure 1. Failure explanation in real-world settings with KITE. Example sequence from the Dual-Arm Robot (DART) in the lab. Left (top to bottom): optical flow estimates used for keyframe selection; RGB keyframe with object detection overlays; single-view depth estimates; and a pseudo-BEV schematic (circles with radius ∝ confidence; X/Z axes; timestamp). Notably, in its failure explanation, the VLM references the BEV…
Figure 2. Overview of KITE. The proposed pipeline takes a raw video and distills it into a small set of salient keyframes, identified using motion-based peaks. For each keyframe, we run open-vocabulary detection to localize the robot and surrounding objects, and render a pseudo-BEV schematic that depicts the scene layout with simple, interpretable symbols. These visual elements are paired with a structured context a…
Figure 3. Qualitative results in simulation (RoboFAC dataset). Each panel shows: RGB keyframe with object-detection overlays; optical-flow estimates; pseudo-BEV schematic (consistent object IDs; circle radius ∝ confidence; timestamp); and single-view depth estimates, all for the corresponding keyframes. We also illustrate a short structured-context excerpt, KITE’s response to a failure-localization query, and a fina…
Figure 4. Qualitative results in real-world (ALOHA-2). Each panel shows: RGB keyframe with object-detection overlays; optical-flow estimates; pseudo-BEV schematic (consistent object IDs; circle radius ∝ confidence; timestamp); and single-view depth estimates, all for the corresponding keyframes. We also illustrate a short structured-context excerpt, KITE’s response to a failure-localization query, and a final narrat…
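The captions describe the pseudo-BEV schematic as circles whose radius scales with detection confidence, plus X/Z axes and a timestamp. A rough matplotlib sketch of such a rendering, assuming object positions have already been derived from detection boxes and monocular depth (function and field names are hypothetical):

```python
import matplotlib.pyplot as plt

def render_pseudo_bev(objects, timestamp_s, out_path="bev.png"):
    """Draw a schematic top-down layout: one circle per object at (x, z),
    radius proportional to detection confidence, labelled X/Z axes, timestamp."""
    fig, ax = plt.subplots(figsize=(4, 4))
    for obj in objects:   # e.g. {"id": "cup_1", "x": 0.2, "z": 0.6, "conf": 0.87}
        radius = 0.02 + 0.08 * obj["conf"]          # radius ∝ confidence
        ax.add_patch(plt.Circle((obj["x"], obj["z"]), radius, fill=False))
        ax.annotate(obj["id"], (obj["x"], obj["z"]),
                    ha="center", va="center", fontsize=7)
    ax.set_xlabel("X (lateral, m)")
    ax.set_ylabel("Z (depth, m)")
    ax.set_title(f"pseudo-BEV @ t = {timestamp_s:.1f}s")
    ax.set_aspect("equal")
    ax.autoscale_view()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
```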
Original abstract

We present KITE, a training-free, keyframe-anchored, layout-grounded front-end that converts long robot-execution videos into compact, interpretable tokenized evidence for vision-language models (VLMs). KITE distills each trajectory into a small set of motion-salient keyframes with open-vocabulary detections and pairs each keyframe with a schematic bird's-eye-view (BEV) representation that encodes relative object layout, axes, timestamps, and detection confidence. These visual cues are serialized with robot-profile and scene-context tokens into a unified prompt, allowing the same front-end to support failure detection, identification, localization, explanation, and correction with an off-the-shelf VLM. On the RoboFAC benchmark, KITE with Qwen2.5-VL substantially improves over vanilla Qwen2.5-VL in the training-free setting, with especially large gains on simulation failure detection, identification, and localization, while remaining competitive with a RoboFAC-tuned baseline. A small QLoRA fine-tune further improves explanation and correction quality. We also report qualitative results on real dual-arm robots, demonstrating the practical applicability of KITE as a structured and interpretable front-end for robot failure analysis. Code and models are released on our project page: https://m80hz.github.io/kite/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces KITE, a training-free front-end that distills robot execution trajectories into a compact set of motion-salient keyframes augmented with open-vocabulary detections and schematic bird's-eye-view (BEV) representations encoding layout, axes, timestamps, and confidence. These are serialized with robot-profile and scene-context tokens to form prompts for off-the-shelf VLMs, enabling failure detection, identification, localization, explanation, and correction. On the RoboFAC benchmark, KITE paired with Qwen2.5-VL yields substantial gains over the vanilla VLM in the training-free regime (especially on simulation tasks) while remaining competitive with a RoboFAC-tuned baseline; a small QLoRA fine-tune further boosts explanation and correction quality. Qualitative results on real dual-arm robots are also presented, and code/models are released.

Significance. If the benchmark results hold under scrutiny, KITE provides a practical, interpretable, and training-free mechanism for injecting structured visual evidence into VLMs for robot failure analysis, which could improve reliability and debuggability in deployed robotic systems. The explicit release of code and models is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. [§3] §3 (Keyframe extraction and BEV serialization): The motion-salient keyframe heuristic is described at a high level but without the precise selection rule (e.g., optical-flow magnitude threshold, change-detection metric, or minimum inter-keyframe interval). This is load-bearing for the central claim because low-velocity failure modes (slow drift, grasp instability, force-threshold violations) may be systematically excluded; if the extractor drops such cues, the subsequent BEV prompt receives no signal and the reported gains on RoboFAC may be benchmark-specific rather than a general property of the representation.
  2. [§4] §4 (Experimental evaluation): The manuscript reports large improvements on RoboFAC failure detection/identification/localization but does not specify the exact evaluation metrics, confidence intervals, statistical tests, or controls for keyframe-selection bias. Without these details it is impossible to determine whether the gains are robust or sensitive to the particular failure distribution in the benchmark.
minor comments (2)
  1. The abstract and §4 could state the typical number of keyframes retained per trajectory and the resulting prompt token count; this would help readers assess the claimed compactness.
  2. Figure captions for the qualitative real-robot examples should explicitly note which failure types are illustrated and whether any low-motion cases were included.
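Minor comment 1 asks for the retained keyframe count and prompt token budget. A minimal way to measure the text-side token count, assuming the Qwen2.5-VL tokenizer on Hugging Face (the checkpoint name is an assumption, and vision tokens from the attached keyframe/BEV images are not counted here):

```python
from transformers import AutoTokenizer

def prompt_token_count(prompt: str,
                       model_id: str = "Qwen/Qwen2.5-VL-7B-Instruct") -> int:
    """Tokenize a serialized KITE-style prompt and return its length."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return len(tokenizer(prompt)["input_ids"])
```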

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important areas where additional precision will strengthen the manuscript. We address each major comment below and will incorporate the necessary revisions.

Point-by-point responses
  1. Referee: [§3] §3 (Keyframe extraction and BEV serialization): The motion-salient keyframe heuristic is described at a high level but without the precise selection rule (e.g., optical-flow magnitude threshold, change-detection metric, or minimum inter-keyframe interval). This is load-bearing for the central claim because low-velocity failure modes (slow drift, grasp instability, force-threshold violations) may be systematically excluded; if the extractor drops such cues, the subsequent BEV prompt receives no signal and the reported gains on RoboFAC may be benchmark-specific rather than a general property of the representation.

    Authors: We agree that the keyframe extraction procedure requires a more explicit description to support reproducibility and to allow evaluation of its coverage for low-velocity failures. In the revised manuscript we will expand §3 with the precise selection rule as implemented in the released code, including the motion-saliency metric, any thresholds, and the minimum inter-keyframe interval. We will also add a short discussion of how low-velocity or static failure cues are captured through the complementary open-vocabulary detections and BEV layout encodings, while acknowledging that the current heuristic is primarily motion-driven and may benefit from future extensions for purely static anomalies. revision: yes

  2. Referee: [§4] §4 (Experimental evaluation): The manuscript reports large improvements on RoboFAC failure detection/identification/localization but does not specify the exact evaluation metrics, confidence intervals, statistical tests, or controls for keyframe-selection bias. Without these details it is impossible to determine whether the gains are robust or sensitive to the particular failure distribution in the benchmark.

    Authors: We concur that the experimental reporting should be more complete. In the revised §4 we will explicitly define the metrics used for each sub-task (detection, identification, localization, explanation, and correction), report confidence intervals, include the results of appropriate statistical tests, and add controls or ablations that address potential keyframe-selection bias (for example, comparisons against uniform or random keyframe baselines). These additions will make the robustness of the observed gains clearer. revision: yes
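For the promised confidence intervals, a percentile bootstrap over per-example correctness is one standard option; the sketch below is illustrative and does not reflect the paper's actual evaluation protocol. Bootstrapping the paired difference between KITE and a uniform- or random-keyframe baseline on the same items would similarly address the keyframe-selection-bias control.

```python
import numpy as np

def bootstrap_accuracy_ci(correct: np.ndarray, n_boot: int = 10_000,
                          alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for per-example accuracy (correct: 0/1 array)."""
    rng = np.random.default_rng(seed)
    n = len(correct)
    boots = np.array([correct[rng.integers(0, n, n)].mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(boots, [alpha / 2, 1 - alpha / 2])
    return correct.mean(), (lo, hi)

# Paired comparison: bootstrap kite_correct - baseline_correct the same way to
# get a CI on the accuracy gap between KITE and a uniform-keyframe baseline.
```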

Circularity Check

0 steps flagged

No circularity in KITE empirical pipeline

Full rationale

The paper describes an empirical front-end pipeline that selects motion-salient keyframes from robot videos, generates schematic BEV representations, and serializes them with tokens for an off-the-shelf VLM. Performance is measured via direct comparison on the external RoboFAC benchmark against vanilla VLM and a tuned baseline. No equations, fitted parameters, or predictions are presented that reduce to the method's own inputs by construction. No uniqueness theorems, self-cited ansatzes, or self-definitional steps appear in the provided text. The keyframe heuristic and BEV encoding are presented as standard components without load-bearing self-references.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, new axioms, or invented entities are detailed; the approach relies on standard computer vision components whose effectiveness is assumed.

pith-pipeline@v0.9.0 · 5541 in / 1327 out tokens · 52560 ms · 2026-05-10T18:43:16.923352+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

49 extracted references · 26 canonical work pages · 11 internal anchors

  1. [1]

    Do as i can, not as i say: Grounding language in robotic affordances,

    A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian,et al., “Do as i can, not as i say: Grounding language in robotic affordances,” inConference on robot learning. PMLR, 2023, pp. 287–318

  2. [2]

    PaLM-E: An Embodied Multimodal Language Model

    D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu,et al., “Palm-e: An embodied multimodal language model,”arXiv preprint arXiv:2303.03378, 2023

  3. [3]

    A Generalist Agent

    S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y . Sulsky, J. Kay, J. T. Springenberg, et al., “A generalist agent,”arXiv preprint arXiv:2205.06175, 2022

  4. [4]

    Toward general-purpose robots via foundation models: A survey and meta-analysis

    Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. Keetha, S. Kim, Y . Xie, T. Zhang, Z. Zhao,et al., “Toward general-purpose robots via foundation models: A survey and meta-analysis,”arXiv preprint arXiv:2312.08782, 2023

  5. [5]

    Foundation models in robotics: Applications, challenges, and the future

    R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y . Zhu, S. Song, A. Kapoor, K. Hausman,et al., “Foundation models in robotics: Applications, challenges, and the future,”arXiv preprint arXiv:2312.07843, 2023

  6. [6]

    Moka: Open-vocabulary robotic manipulation through mark-based visual prompting

    F. Liu, K. Fang, P. Abbeel, and S. Levine, “Moka: Open-vocabulary robotic manipulation through mark-based visual prompting,”arXiv preprint arXiv:2403.03174, 2024

  7. [7]

    Copa: General robotic manipulation through spatial constraints of parts with foundation models,

    H. Huang, F. Lin, Y . Hu, S. Wang, and Y . Gao, “Copa: General robotic manipulation through spatial constraints of parts with foundation models,”arXiv preprint arXiv:2403.08248, 2024

  8. [8]

    Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation

    W. Huang, C. Wang, Y . Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,”arXiv preprint arXiv:2409.01652, 2024

  9. [9]

    Reflect: Summarizing robot experiences for failure explanation and correction,

    Z. Liu, A. Bahety, and S. Song, “Reflect: Summarizing robot experiences for failure explanation and correction,” in CoRL, 2023

  10. [10]

    Aha: A vision-language-model for detecting and reasoning over failures in robotic manipulation,

    J. Duan, W. Pumacay, N. Kumar, Y . R. Wang, S. Tian, W. Yuan, R. Krishna, D. Fox, A. Mandlekar, and Y . Guo, “Aha: A vision-language-model for detecting and reasoning over failures in robotic manipulation,” in ICLR, 2025

  11. [11]

    Robofac: A comprehensive framework for robotic failure analysis and correction,

    W. Lu, M. Ye, Z. Ye, R. Tao, S. Yang, and B. Zhao, “Robofac: A comprehensive framework for robotic failure analysis and correction,”

  12. [12]

    arXiv preprint arXiv:2505.12224

    [Online]. Available: https://arxiv.org/abs/2505.12224

  13. [13]

    Compound robot - realman robotics,

    RealMan Robotics, “Compound robot - realman robotics,” https://www.realman-robotics.com/compound-robot, 2024, accessed: 2024-09-07

  14. [14]

    Aloha 2: An enhanced low-cost hardware for bimanual teleoperation,

    ALOHA 2 Team, J. Aldaco, T. Armstrong, R. Baruch, J. Bingham, S. Chan, K. Draper, D. Dwibedi, C. Finn, P. Florence, S. Goodrich, W. Gramlich, T. Hage, A. Herzog, J. Hoech, T. Nguyen, I. Storz, B. Tabanpour, L. Takayama, J. Tompson, A. Wahid, T. Wahrburg, S. Xu, S. Yaroshenko, K. Zakka, and T. Z. Zhao, “Aloha 2: An enhanced low-cost hardware for bimanual te...

  15. [15]

    Large language models for robotics: A survey

    F. Zeng, W. Gan, Y . Wang, N. Liu, and P. S. Yu, “Large language models for robotics: A survey,”arXiv preprint arXiv:2311.07226, 2023

  16. [16]

    Large language models for human–robot interaction: A review,

    C. Zhang, J. Chen, J. Li, Y . Peng, and Z. Mao, “Large language models for human–robot interaction: A review,”Biomimetic Intelligence and Robotics, vol. 3, no. 4, p. 100131, 2023

  17. [17]

    GPT-4 Technical Report

    J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat,et al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

  18. [18]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale,et al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  19. [19]

    GPT-4o System Card

    O. team, “Gpt-4o system card,” 2024. [Online]. Available: https://arxiv.org/abs/2410.21276

  20. [20]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,” 2023

  21. [21]

    Llava-next: Improved reasoning, ocr, and world knowledge,

    H. Liu, C. Li, Y . Li, B. Li, Y . Zhang, S. Shen, and Y . J. Lee, “Llava-next: Improved reasoning, ocr, and world knowledge,” January 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-01-30-llava-next

  22. [22]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser,et al., “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,”arXiv preprint arXiv:2403.05530, 2024

  23. [23]

    Gemini: A Family of Highly Capable Multimodal Models

    G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican,et al., “Gemini: a family of highly capable multimodal models,”arXiv preprint arXiv:2312.11805, 2023

  24. [24]

    A survey of embodied ai: From simulators to research tasks,

    J. Duan, S. Yu, H. L. Tan, H. Zhu, and C. Tan, “A survey of embodied ai: From simulators to research tasks,”IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 6, no. 2, pp. 230–244, 2022

  25. [25]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray,et al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  26. [26]

    Automated agent decomposition for classical planning,

    M. Crosby, M. Rovatsos, and R. Petrick, “Automated agent decomposition for classical planning,” in Proceedings of the International Conference on Automated Planning and Scheduling, vol. 23, 2013, pp. 46–54

  27. [27]

    Rewoo: Decoupling reasoning from observations for efficient augmented language models

    B. Xu, Z. Peng, B. Lei, S. Mukherjee, Y . Liu, and D. Xu, “Rewoo: Decoupling reasoning from observations for efficient augmented language models,”arXiv preprint arXiv:2305.18323, 2023

  28. [28]

    Large language models are zero-shot reasoners,

    T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22 199–22 213, 2022

  29. [29]

    Explainable ai for robot failures: Generating explanations that improve user assistance in fault recovery,

    D. Das, S. Banerjee, and S. Chernova, “Explainable ai for robot failures: Generating explanations that improve user assistance in fault recovery,” inProceedings of the 2021 ACM/IEEE international conference on human-robot interaction, 2021, pp. 351–360

  30. [30]

    Verbalization: Narration of autonomous robot experience

    S. Rosenthal, S. P. Selvaraj, and M. M. Veloso, “Verbalization: Narration of autonomous robot experience.” inIJCAI, vol. 16, 2016, pp. 862–868

  31. [31]

    Human trust after robot mistakes: Study of the effects of different forms of robot communication,

    S. Ye, G. Neville, M. Schrum, M. Gombolay, S. Chernova, and A. Howard, “Human trust after robot mistakes: Study of the effects of different forms of robot communication,” in 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE, 2019, pp. 1–7

  32. [32]

    User study exploring the role of explanation of failures by robots in human robot collaboration tasks,

    P. Khanna, E. Yadollahi, M. Björkman, I. Leite, and C. Smith, “User study exploring the role of explanation of failures by robots in human robot collaboration tasks,”arXiv preprint arXiv:2303.16010, 2023

  33. [33]

    Multimodal estimation and communication of latent semantic knowledge for robust execution of robot instructions,

    J. Arkin, D. Park, S. Roy, M. R. Walter, N. Roy, T. M. Howard, and R. Paul, “Multimodal estimation and communication of latent semantic knowledge for robust execution of robot instructions,”The International Journal of Robotics Research, vol. 39, no. 10-11, pp. 1279–1304, 2020

  34. [34]

    Latte: Language trajectory transformer,

    A. Bucker, L. Figueredo, S. Haddadin, A. Kapoor, S. Ma, S. Vemprala, and R. Bonatti, “Latte: Language trajectory transformer,”arXiv preprint arXiv:2208.02918, 2022

  35. [35]

    Cape: Corrective actions from precondition errors using large language models,

    S. S. Raman, V . Cohen, I. Idrees, E. Rosen, R. Mooney, S. Tellex, and D. Paulius, “Cape: Corrective actions from precondition errors using large language models,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 14 070–14 077

  36. [36]

    I can tell what i am doing: Toward real-world natural language grounding of robot experiences,

    Z. Wang, B. Liang, V . Dhat, Z. Brumbaugh, N. Walker, R. Krishna, and M. Cakmak, “I can tell what i am doing: Toward real-world natural language grounding of robot experiences,”arXiv preprint arXiv:2411.12960, 2024

  37. [37]

    Learning to summarize and answer questions about a virtual robot’s past actions,

    C. DeChant, I. Akinola, and D. Bauer, “Learning to summarize and answer questions about a virtual robot’s past actions,”Autonomous robots, vol. 47, no. 8, pp. 1103–1118, 2023

  38. [38]

    Vision-language models as success detectors

    Y . Du, K. Konyushkova, M. Denil, A. Raju, J. Landon, F. Hill, N. de Freitas, and S. Cabi, “Vision-language models as success detectors,”arXiv preprint arXiv:2303.07280, 2023

  39. [39]

    VIP: Towards Universal Visual Reward and Representation via Value-Implicit Pre-Training

    Y . J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V . Kumar, and A. Zhang, “Vip: Towards universal visual reward and representation via value-implicit pre-training,”arXiv preprint arXiv:2210.00030, 2022

  40. [40]

    Scaling up and distilling down: Language-guided robot skill acquisition,

    H. Ha, P. Florence, and S. Song, “Scaling up and distilling down: Language-guided robot skill acquisition,” inConference on Robot Learning. PMLR, 2023, pp. 3766–3777

  41. [41]

    Gensim: Generating robotic simulation tasks via large language models,

    L. Wang, Y . Ling, Z. Yuan, M. Shridhar, C. Bao, Y . Qin, B. Wang, H. Xu, and X. Wang, “Gensim: Generating robotic simulation tasks via large language models,”arXiv preprint arXiv:2310.01361, 2023

  42. [42]

    Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning

    K. Rana, J. Haviland, S. Garg, J. Abou-Chakra, I. Reid, and N. Suenderhauf, “Sayplan: Grounding large language models using 3d scene graphs for scalable robot task planning,” arXiv preprint arXiv:2307.06135, 2023

  43. [43]

    Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving,

    T. Choudhary, V . Dewangan, S. Chandhok, S. Priyadarshan, A. Jain, A. K. Singh, S. Srivastava, K. M. Jatavallabhula, and K. M. Krishna, “Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 16 345–16 352

  44. [44]

    Sketch, ground, and refine: Top-down dense video captioning,

    C. Deng, S. Chen, D. Chen, Y . He, and Q. Wu, “Sketch, ground, and refine: Top-down dense video captioning,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 234–243

  45. [45]

    Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

    S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, et al., “Grounding dino: Marrying dino with grounded pre-training for open-set object detection,” arXiv preprint arXiv:2303.05499, 2023

  46. [46]

    Depth Anything V2

    L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”arXiv:2406.09414, 2024

  47. [47]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, H. Zhong, Y . Zhu, M. Yang, Z. Li, J. Wan, P. Wang, W. Ding, Z. Fu, Y . Xu, J. Ye, X. Zhang, T. Xie, Z. Cheng, H. Zhang, Z. Yang, H. Xu, and J. Lin, “Qwen2.5-vl technical report,”arXiv preprint arXiv:2502.13923, 2025

  48. [48]

    Two-frame motion estimation based on polynomial expansion,

    G. Farnebäck, “Two-frame motion estimation based on polynomial expansion,” inScandinavian conference on Image analysis. Springer, 2003, pp. 363–370

  49. [49]

    QLoRA: Efficient Finetuning of Quantized LLMs

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023