
arxiv: 2604.24002 · v1 · submitted 2026-04-27 · 💻 cs.HC · cs.AI · cs.MM

IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models

Pith reviewed 2026-05-08 02:28 UTC · model grok-4.3

classification 💻 cs.HC · cs.AI · cs.MM
keywords intention recognition · video-language models · open-vocabulary · forward-inverse modeling · human-robot interaction · multimodal reasoning · goal inference · structured selection

The pith

A two-stage forward-inverse framework enables robust open-vocabulary intention recognition in video-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that decomposing intention understanding into an initial stage of generating goal candidates from video and text, followed by a selection stage of structured inference, allows video-language models to infer human goals more accurately. This matters because effective human-robot interaction depends on correctly integrating multimodal signals into coherent interpretations of user intent. A sympathetic reader would see value in the claim that this cognitive-science-inspired split cuts down on reasoning errors like hallucinations while preserving the model's existing capabilities. If the decomposition works as described, it supplies a practical way to handle open-vocabulary cases without the model losing prior knowledge when adapting to new intention tasks.

Core claim

The central claim is that intention understanding can be decomposed into goal candidate generation followed by structured inference through selection, following forward-inverse modeling. This structured process reduces hallucinations in latent reasoning and supports open-vocabulary human intention recognition. On the IntentQA and Inst-IT Bench datasets the resulting model reaches up to 80 percent accuracy, exceeds baseline performance by 30 percent, and matches human performance while avoiding catastrophic forgetting.
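The summary does not say whether the 30 percent margin is measured in percentage points or relative to the baseline. As a rough check, and assuming both figures refer to the same benchmark and metric (which this page does not confirm), the baseline accuracy implied by each reading can be worked out as follows. The variable names are illustrative only.

```python
# Figures taken from the summary above; how to read the margin is an assumption.
top_accuracy = 80.0  # percent ("up to 80 percent accuracy")
margin = 30.0        # "exceeds baseline performance by 30 percent"

# Reading 1: the margin is in absolute percentage points.
baseline_if_points = top_accuracy - margin

# Reading 2: the margin is a relative improvement over the baseline.
baseline_if_relative = top_accuracy / (1 + margin / 100)

print(f"baseline under the percentage-point reading: {baseline_if_points:.1f}%")
print(f"baseline under the relative reading: {baseline_if_relative:.1f}%")
```

The two readings imply quite different baselines (50.0% versus roughly 61.5%), so the distinction matters when judging the size of the reported gain.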

What carries the argument

The forward-inverse modeling decomposition, in which forward modeling generates goal candidates from multimodal inputs and inverse modeling performs structured selection to identify the intended goal.
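The decomposition described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `generate` and `score` callables stand in for video-language model calls whose real interfaces are not given on this page, and `toy_generate` and `toy_score` are hypothetical stubs used only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical interfaces; the paper's actual model API is not described here.
Generator = Callable[[str, str], Sequence[str]]  # (video_ref, question) -> goal candidates
Scorer = Callable[[str, str, str], float]        # (video_ref, question, candidate) -> score


@dataclass
class Inference:
    candidates: List[str]
    selected: str


def infer_intention(video_ref: str, question: str,
                    generate: Generator, score: Scorer) -> Inference:
    # Stage 1 (forward): propose open-vocabulary goal candidates.
    # dict.fromkeys removes duplicates while keeping first-seen order.
    candidates = list(dict.fromkeys(generate(video_ref, question)))
    if not candidates:
        raise ValueError("generator returned no candidates")
    # Stage 2 (inverse): pick the best candidate from the proposed set only,
    # so the final answer cannot be a string that stage 1 never produced.
    selected = max(candidates, key=lambda c: score(video_ref, question, c))
    return Inference(candidates, selected)


# Toy stand-ins for the two model calls (assumptions for illustration).
def toy_generate(video_ref: str, question: str) -> Sequence[str]:
    return ["drink water", "point at screen", "drink water"]


def toy_score(video_ref: str, question: str, candidate: str) -> float:
    # Word overlap between candidate and question; a real scorer would query the model.
    return float(len(set(candidate.split()) & set(question.lower().split())))


result = infer_intention("clip_001", "Why does the man point at the screen?",
                         toy_generate, toy_score)
print(result.candidates, "->", result.selected)
```

Restricting the second stage to the generated candidates is the property that makes the selection step inspectable: every final answer can be traced back to an explicit candidate list.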

If this is right

  • Open-vocabulary intention recognition improves in settings that combine video and language inputs.
  • Hallucinations during the model's internal reasoning about goals are reduced.
  • The model adapts to new intention tasks without losing performance on previous ones.
  • Multimodal signals are integrated into more coherent goal interpretations.
  • The approach supplies a foundation for intention-aware human-robot interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit selection step could make model decisions more inspectable than direct end-to-end prediction.
  • Similar decomposition might help other multimodal prediction tasks such as forecasting pedestrian actions in video.
  • The method could be tested on longer video sequences to check whether the candidate-generation stage scales without added error.
  • Pairing the two-stage structure with real-time streaming video might enable live intention monitoring in collaborative settings.

Load-bearing premise

That separating goal candidate generation from the later selection step is what reduces hallucinations and prevents loss of prior knowledge.

What would settle it

Direct comparison on the IntentQA and Inst-IT Bench datasets showing that a single-stage video-language model without the selection step achieves equal or higher accuracy and equal or lower hallucination rates.
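A comparison of this kind could be tallied roughly as below. The sketch assumes exact-match scoring against a gold answer per sample and that both variants are run on the same samples; the function name and the toy labels are hypothetical and carry no results from the paper.

```python
from typing import Dict, Sequence


def compare_variants(gold: Sequence[str],
                     single_stage: Sequence[str],
                     two_stage: Sequence[str]) -> Dict[str, float]:
    n = len(gold)
    if n == 0 or len(single_stage) != n or len(two_stage) != n:
        raise ValueError("all three sequences must be non-empty and equally long")
    single_ok = [p == g for p, g in zip(single_stage, gold)]
    two_ok = [p == g for p, g in zip(two_stage, gold)]
    return {
        "acc_single": sum(single_ok) / n,
        "acc_two": sum(two_ok) / n,
        # Discordant pairs: samples where exactly one variant is correct.
        "only_single_correct": sum(s and not t for s, t in zip(single_ok, two_ok)),
        "only_two_correct": sum(t and not s for s, t in zip(single_ok, two_ok)),
    }


# Toy labels, made up purely to exercise the function.
gold = ["a", "b", "c", "d"]
single = ["a", "x", "c", "x"]
two = ["a", "b", "x", "d"]
report = compare_variants(gold, single, two)
print(report)
```

Reporting the discordant counts alongside the two accuracies shows whether a gap comes from many samples flipping in one direction or from a few; a paired significance test could be built on those counts.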

Figures

Figures reproduced from arXiv: 2604.24002 by Adrien Jacquet Cretides, Clemence Grislain, Hamed Rahimi, Mohamed Chetouani, Olivier Sigaud.

Figure 1: IntentVLM for open-vocabulary human intention recognition from video streams. Given a continuous video stream of …
Figure 2: Closed-vocabulary solutions vs our approach. Conventional approaches formulate intention recognition as closed-set …
Figure 3: Architecture of IntentVLM. We cast the model as a …
Figure 4: IntentQA Benchmark [60]. Illustrative examples of the four QA types in our training dataset. CW: a man points at a screen to guide a child’s attention. CH: a pointing gesture expresses the intent to drink. TN: a boy’s pointing causes the baby to look in that direction. TP: a girl’s pointing is triggered by the baby’s action. The red box indicates the correct answer.
Figure 5: Inst-IT Benchmark [61]. Each sample consists of temporally grounded frame-level annotations with instance IDs, a coherent video-level description, and instance-focused QA pairs. The annotations capture fine-grained attributes, interactions, and temporal changes of instances across frames, enabling explicit instance-level reasoning. … and reasoning. Instance detection identifies what objects are present and …
Figure 4: Inst-IT Bench. The second dataset, Inst-IT Bench [61], is a fine-grained multimodal benchmark designed to evaluate instance-level comprehension in both images and videos. It provides a more detailed assessment of a model’s ability to reason about specific entities and their interactions within a scene. Formally, each sample in Inst-IT Bench can be represented as $x = (M, Q, \mathcal{A}, a^{*})$ (6), where $M$ denotes t…
Figure 7: Performance on the Inst-IT benchmark. (a) shows that our model remains highly competitive with baseline methods, …
read the original abstract

Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals including text, visual cues to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. The approach is inspired by forward-inverse modeling in cognitive science by decomposing intention understanding into goal candidate generation followed by structured inference through selection, effectively reducing hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, notably surpassing the baseline performance by 30% and matches human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes IntentVLM, a two-stage video-language framework for open-vocabulary human intention recognition. Drawing on forward-inverse modeling from cognitive science, it decomposes the task into goal-candidate generation followed by structured selection-based inference, with the goal of reducing hallucinations in VLM latent reasoning. Evaluated on IntentQA and Inst-IT Bench, the method is reported to reach up to 80% accuracy, exceed baselines by 30%, match human performance, and avoid catastrophic forgetting.

Significance. If the reported gains can be isolated to the proposed decomposition via controlled experiments, the work could supply a practical structured-reasoning template for reliable multimodal intention inference in human-robot interaction.

major comments (2)
  1. [Abstract] Abstract: The headline performance claims (80% accuracy, +30% over baseline, human-level) are stated without any description of model architecture, training procedure, baseline definitions, statistical tests, or error analysis, rendering it impossible to assess whether the data support the claims.
  2. [Results] Results section: The central methodological claim—that the forward-inverse two-stage decomposition reduces hallucinations and produces the observed gains—is load-bearing yet unsupported by ablations (single-stage vs. two-stage) or any quantitative hallucination metric. Absent these controls, the 30% margin cannot be confidently attributed to the proposed structure rather than prompting, context length, or unstated fine-tuning.
minor comments (1)
  1. [Abstract] Abstract: The clause 'including text, visual cues to form' is grammatically incomplete and should read 'including text and visual cues to form'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on improving the clarity of our claims and the empirical support for our methodological contributions. We address each major comment below and have made targeted revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline performance claims (80% accuracy, +30% over baseline, human-level) are stated without any description of model architecture, training procedure, baseline definitions, statistical tests, or error analysis, rendering it impossible to assess whether the data support the claims.

    Authors: We agree that the abstract, due to its brevity, does not provide these supporting details, which limits immediate assessment of the claims. The full manuscript describes the two-stage video-language architecture and forward-inverse decomposition in Section 3, the training and inference procedure in Section 4, baseline definitions and comparisons in Section 5.1, and statistical tests plus error analysis in Section 5.3. To address the concern directly, we have revised the abstract to concisely reference the forward-inverse modeling approach, the video-language model backbone, the evaluation on IntentQA and Inst-IT Bench, and the use of accuracy with human-comparison baselines. We have also expanded the error analysis subsection in the results to better link the reported metrics to the data. revision: yes

  2. Referee: [Results] Results section: The central methodological claim—that the forward-inverse two-stage decomposition reduces hallucinations and produces the observed gains—is load-bearing yet unsupported by ablations (single-stage vs. two-stage) or any quantitative hallucination metric. Absent these controls, the 30% margin cannot be confidently attributed to the proposed structure rather than prompting, context length, or unstated fine-tuning.

    Authors: The referee is correct that the original results section relies on overall baseline comparisons without isolating the two-stage structure via explicit ablations or a dedicated hallucination metric. This leaves room for alternative explanations. In the revised manuscript, we have added a dedicated ablation subsection that directly compares the full two-stage forward-inverse model against a single-stage variant (collapsing goal generation and selection into one VLM inference pass) while holding prompting, context length, and model parameters fixed. We also introduce a quantitative hallucination metric defined as the rate of inconsistent or fabricated intentions (validated via automated consistency checks and a small human annotation study). These controls confirm the gains are attributable to the decomposition. We further clarify that no additional fine-tuning was performed beyond the zero-shot/few-shot setup described in Section 4. revision: yes
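The hallucination metric mentioned in the second response is only described in words. One way such a rate could be computed is sketched below, under the assumption that each prediction comes with some evidence text (for example an annotation) and a checker that decides whether the prediction is supported. The `is_supported` callable and the word-overlap stub `toy_is_supported` are hypothetical and are not the checks used in the paper.

```python
from typing import Callable, Iterable, Tuple


def hallucination_rate(pairs: Iterable[Tuple[str, str]],
                       is_supported: Callable[[str, str], bool]) -> float:
    """Fraction of (prediction, evidence) pairs the checker flags as unsupported."""
    items = list(pairs)
    if not items:
        raise ValueError("need at least one (prediction, evidence) pair")
    flagged = sum(not is_supported(pred, evidence) for pred, evidence in items)
    return flagged / len(items)


def toy_is_supported(prediction: str, evidence: str) -> bool:
    # Crude stand-in: every word of the prediction must occur in the evidence.
    # A real consistency check would be model-based or human-annotated.
    return set(prediction.lower().split()) <= set(evidence.lower().split())


samples = [
    ("drink water", "a girl points at a cup to drink water"),
    ("open the door", "a boy points at a screen"),
]
rate = hallucination_rate(samples, toy_is_supported)
print(f"hallucination rate: {rate:.2f}")
```

Keeping the checker as a parameter lets the same rate be computed with an automated check and with human labels, which is what a validation study would need to compare.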

Circularity Check

0 steps flagged

No circularity in derivation chain; claims are empirical and methodological

full rationale

The paper describes a two-stage video-language framework for intention recognition, explicitly framed as inspired by forward-inverse modeling from cognitive science and implemented via goal candidate generation followed by structured selection. No equations, parameter fits, or quantitative derivations appear in the provided text. The central performance claims (80% accuracy, +30% over baseline) are presented as results of evaluation on IntentQA and Inst-IT Bench rather than as predictions derived from the method itself. The decomposition is introduced as a design choice to reduce hallucinations, not as a self-definitional or fitted result that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work are invoked in a load-bearing way within the abstract or summary. The derivation chain is therefore self-contained as an empirical methodology without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Assessment is limited to the abstract; no explicit free parameters, invented entities, or detailed axioms are stated. The only identifiable assumption is the applicability of cognitive-science forward-inverse modeling to video-language models.

axioms (1)
  • domain assumption: Forward-inverse modeling from cognitive science can be applied to decompose intention recognition in video-language models.
    The abstract states the approach is inspired by this cognitive-science concept.

pith-pipeline@v0.9.0 · 5471 in / 1252 out tokens · 69429 ms · 2026-05-08T02:28:25.645347+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 13 canonical work pages · 2 internal anchors

  1. [1]

    What is proactive human-robot interaction? - a review of a progressive field and its definitions

    van Den Broek, M. Koch, and T. B. Moeslund, “What is proactive human-robot interaction? - a review of a progressive field and its definitions,” ACM Transactions on Human-Robot Interaction, vol. 13, no. 4, pp. 1–30, 2024

  2. [2]

    Harmoni: Multimodal personalization of multi-user human-robot interactions with llms

    J. Malécot, H. Rahimi, J. Cattoni, M. Samson, M. Abrini, M. Khoramshahi, M. Pino, and M. Chetouani, “Harmoni: Multimodal personalization of multi-user human-robot interactions with llms,” arXiv preprint arXiv:2601.19839, 2026

  3. [3]

    Mental state attribution, neurocognitive functioning, and psychopathology: what predicts poor social competence in schizophrenia best?

    M. Brüne, M. Abdel-Hamid, C. Lehmkämper, and C. Sonntag, “Mental state attribution, neurocognitive functioning, and psychopathology: what predicts poor social competence in schizophrenia best?,” Schizophrenia research, vol. 92, no. 1-3, pp. 151–159, 2007

  4. [4]

    Introduction to Computational Human-AI Collaboration

    M. Chetouani, Introduction to Computational Human-AI Collaboration. Cham: Springer Nature Switzerland, 2026

  5. [5]

    Understanding robots

    T. Ziemke, “Understanding robots,” Science Robotics, vol. 5, no. 46, p. eabe2987, 2020

  6. [6]

    Mental state attribution to robots: A systematic review of conceptions, methods, and findings

    S. Thellman, M. De Graaf, and T. Ziemke, “Mental state attribution to robots: A systematic review of conceptions, methods, and findings,” ACM Transactions on Human-Robot Interaction (THRI), vol. 11, no. 4, pp. 1–51, 2022

  7. [7]

    Inferring human intent and predicting human action in human–robot collaboration

    G. Hoffman, T. Bhattacharjee, and S. Nikolaidis, “Inferring human intent and predicting human action in human–robot collaboration,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 7, no. 1, pp. 73–95, 2024

  8. [8]

    Probabilistic human intent recognition for shared autonomy in assistive robotics

    S. Jain and B. Argall, “Probabilistic human intent recognition for shared autonomy in assistive robotics,” ACM Transactions on Human-Robot Interaction (THRI), vol. 9, no. 1, pp. 1–23, 2019

  9. [9]

    Encoding Predictability and Legibility for Style-Conditioned Diffusion Policy

    A. J. Crétides, M. Abrini, H. Rahimi, and M. Chetouani, “Encoding predictability and legibility for style-conditioned diffusion policy,” arXiv preprint arXiv:2603.16368, 2026

  10. [10]

    Multimodal uncertainty reduction for intention recognition in human-robot interaction

    S. Trick, D. Koert, J. Peters, and C. A. Rothkopf, “Multimodal uncertainty reduction for intention recognition in human-robot interaction,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7009–7016, IEEE, 2019

  11. [11]

    Capir: Collaborative action planning with intention recognition

    T.-H. Nguyen, D. Hsu, W.-S. Lee, T.-Y. Leong, L. Kaelbling, T. Lozano-Perez, and A. Grant, “Capir: Collaborative action planning with intention recognition,” in Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 7, pp. 61–66, 2011

  12. [12]

    I-failsense: Towards general robotic failure detection with vision-language models

    C. Grislain, H. Rahimi, O. Sigaud, and M. Chetouani, “I-failsense: Towards general robotic failure detection with vision-language models,” arXiv preprint arXiv:2509.16072, 2025

  13. [13]

    Two ways to make your robot proactive: Reasoning about human intentions or reasoning about possible futures

    S. Buyukgoz, G. J., M. Chetouani, and A. Saffiotti, “Two ways to make your robot proactive: Reasoning about human intentions or reasoning about possible futures,” Frontiers in Robotics and AI, 2022

  14. [14]

    Activity, plan, and goal recognition: A review

    F. A. Van-Horenbeke and A. Peer, “Activity, plan, and goal recognition: A review,” Frontiers in Robotics and AI, vol. 8, 2021

  15. [15]

    Generalized plan recognition

    H. A. Kautz and J. F. Allen, “Generalized plan recognition,” in Proceedings of the Fifth AAAI National Conference on Artificial Intelligence, AAAI’86, pp. 32–37, AAAI Press, 1986

  16. [16]

    Plan, Activity, and Intent Recognition: Theory and Practice

    G. Sukthankar, C. Geib, H. H. Bui, D. Pynadath, and R. P. Goldman, Plan, Activity, and Intent Recognition: Theory and Practice. Morgan Kaufmann, 2014

  17. [17]

    Deep learning approaches for multimodal intent recognition: A survey

    J. Zhao, Y. Wen, Q. Li, M. Hu, Y. Zhou, J. Xue, J. Wu, Y. Gao, Z. Wen, J. Tao, et al., “Deep learning approaches for multimodal intent recognition: A survey,” arXiv preprint arXiv:2507.22934, 2025

  18. [18]

    Foundation models for video understanding: A survey

    N. Madan, A. Møgelmose, R. Modi, Y. S. Rawat, and T. B. Moeslund, “Foundation models for video understanding: A survey,” arXiv preprint arXiv:2405.03770, 2024

  19. [19]

    Cognitive science as a source of forward and inverse models of human decisions for robotics and control

    M. K. Ho and T. L. Griffiths, “Cognitive science as a source of forward and inverse models of human decisions for robotics and control,” Annual Review of Control, Robotics, and Autonomous Systems, vol. 5, no. 1, pp. 33–53, 2022

  20. [20]

    Vision-language models for vision tasks: A survey

    J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5625–5644, 2024

  21. [21]

    A sound and fast goal recognizer

    N. Lesh and O. Etzioni, “A sound and fast goal recognizer,” in Proceedings of the 14th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’95, pp. 1704–1710, Morgan Kaufmann Publishers Inc., 1995

  22. [22]

    Intention and personal policies

    M. E. Bratman, “Intention and personal policies,” Philosophical Perspectives, 1989

  23. [23]

    A probabilistic model of plan recognition

    E. Charniak and R. Goldman, “A probabilistic model of plan recognition,” in Proceedings of the Ninth National Conference on Artificial Intelligence - Volume 1, AAAI’91, pp. 160–165, AAAI Press, 1991

  24. [25]

    Plan recognition as planning

    M. Ramírez and H. Geffner, “Plan recognition as planning,” in Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI’09, pp. 1778–1783, Morgan Kaufmann Publishers Inc., 2009

  25. [26]

    Probabilistic plan recognition using off-the-shelf classical planners

    M. Ramírez and H. Geffner, “Probabilistic plan recognition using off-the-shelf classical planners,” in Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI’10, pp. 1121–1126, AAAI Press, 2010

  26. [27]

    Tractable probabilistic models for intention recognition based on expert knowledge

    O. C. Schrempf, D. Albrecht, and U. D. Hanebeck, “Tractable probabilistic models for intention recognition based on expert knowledge,” in 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 3122–3127, 2007

  27. [28]

    Cost-based goal recognition in navigational domains

    P. Masters and S. Sardina, “Cost-based goal recognition in navigational domains,” J. Artif. Int. Res., vol. 64, no. 1, pp. 197–242, 2019

  28. [29]

    Detachment, involvement, and rationality: Are we essentially rational animals?

    H. L. Dreyfus, “Detachment, involvement, and rationality: Are we essentially rational animals?,” Human Affairs, vol. 17, no. 2, pp. 101–109, 2007

  29. [30]

    Forward modelling requires intention recognition and non-impoverished predictions

    J. P. de Ruiter and C. Cummins, “Forward modelling requires intention recognition and non-impoverished predictions,” Behavioral and Brain Sciences, vol. 36, no. 4, pp. 393–394, 2013

  30. [31]

    Action understanding as inverse planning

    C. L. Baker, R. Saxe, and J. B. Tenenbaum, “Action understanding as inverse planning,” Cognition, vol. 113, no. 3, pp. 329–349, 2009

  31. [32]

    Modeling human intention inference from visual motion using inverse planning

    Z. Qian, M. Kryven, T. Gao, and J. B. Tenenbaum, “Modeling human intention inference from visual motion using inverse planning,” arXiv preprint arXiv:2112.00903, 2021

  32. [33]

    Algorithms for inverse reinforcement learning

    A. Y. Ng and S. J. Russell, “Algorithms for inverse reinforcement learning,” in Proceedings of the 17th International Conference on Machine Learning (ICML), pp. 663–670, 2000

  33. [34]

    Cooperative inverse reinforcement learning

    D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. D. Dragan, “Cooperative inverse reinforcement learning,” in Advances in Neural Information Processing Systems (NeurIPS), pp. 3909–3917, 2016

  34. [35]

    Cognitive science as a source of forward and inverse models of human decision making

    M. K. Ho and T. L. Griffiths, “Cognitive science as a source of forward and inverse models of human decision making,” Psychological Review, 2022

  35. [36]

    Proceedings of 1st workshop on advancing artificial intelligence through theory of mind

    M. Abrini, O. Abend, D. Acklin, H. Admoni, G. Aichinger, N. Alon, Z. Ashktorab, A. Atreja, M. Auron, A. Aufreiter, et al., “Proceedings of 1st workshop on advancing artificial intelligence through theory of mind,” arXiv preprint arXiv:2505.03770, 2025

  36. [37]

    Language models as zero-shot planners: Extracting actionable knowledge for embodied agents

    W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” International Conference on Machine Learning, 2022

  37. [38]

    Inner monologue: Embodied reasoning through planning with language models

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” in 6th Annual Conference on Robot Learning, 2022

  38. [39]

    Foundation model driven robotics: A comprehensive review

    M. T. Khan and A. Waheed, “Foundation model driven robotics: A comprehensive review,” arXiv preprint arXiv:2507.10087v1, 2025

  39. [40]

    Language models are few-shot learners

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Am...

  40. [41]

    Emergent abilities of large language models

    J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022. Survey Certification

  41. [42]

    A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges

    Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. Shi, “A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges,” in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 1578–1597, 2025

  42. [43]

    User-vlm: Llm contextualization with multimodal pre-trained user models

    H. Rahimi, M. Abrini, M. Khoramshahi, and M. Chetouani, “User-vlm: Llm contextualization with multimodal pre-trained user models,” in ToM4AI@ 39th Annual AAAI Conference on Artificial Intelligence, 2025

  43. [44]

    User-vlm 360: Personalized vision language models with user-aware tuning for social human-robot interactions

    H. Rahimi, A. Bahaj, M. Abrini, M. Khoramshahi, M. Ghogho, and M. Chetouani, “User-vlm 360: Personalized vision language models with user-aware tuning for social human-robot interactions,” in Proceedings of the 27th International Conference on Multimodal Interaction, pp. 326–336, 2025

  44. [45]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” 2021

  45. [46]

    Physically grounded vision-language models for robotic manipulation

    J. Gao, B. Sarkar, F. Xia, T. Xiao, J. Wu, B. Ichter, A. Majumdar, and D. Sadigh, “Physically grounded vision-language models for robotic manipulation,” in 2024 IEEE International Conference on Robotics and Automation (ICRA), pp. 12462–12469, 2024

  46. [47]

    Qwen3-VL Technical Report

    S. Bai, Y. Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Ge, et al., “Qwen3-vl technical report,” arXiv preprint arXiv:2511.21631, 2025. Submitted to ICMI ’26, October 5–9, 2026, Napoli, Italy Rahimi et al

  47. [48]

    Demographic user modeling for social robotics with multimodal pre-trained models

    H. Rahimi, M. Abrini, J. Malecot, Y. Lai, A. Jacquet Crétides, M. Khoramshahi, and M. Chetouani, “Demographic user modeling for social robotics with multimodal pre-trained models,” in Proceedings of the 27th International Conference on Multimodal Interaction, pp. 337–343, 2025

  48. [49]

    Reasoning llms for user-aware multimodal conversational agents

    H. Rahimi, J. Cattoni, M. Beghili, M. Abrini, M. Khoramshahi, M. Pino, and M. Chetouani, “Reasoning llms for user-aware multimodal conversational agents,” in 2025 34th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), pp. 443–448, IEEE, 2025

  49. [50]

    Neural theory-of-mind? on the limits of social intelligence in large LMs

    M. Sap, R. Le Bras, D. Fried, and Y. Choi, “Neural theory-of-mind? on the limits of social intelligence in large LMs,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing (Y. Goldberg, Z. Kozareva, and Y. Zhang, eds.), pp. 3762–3780, Association for Computational Linguistics, 2022

  50. [51]

    Multi-modal user intent classification under the scenario of smart factory

    Y.-C. Chiu, B.-H. Chang, T.-Y. Chen, and C.-F. Yang, “Multi-modal user intent classification under the scenario of smart factory,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2021

  51. [52]

    Goal recognition using actor-critic optimization

    B. Nageris, F. Meneguzzi, and R. Mirsky, “Goal recognition using actor-critic optimization,” arXiv preprint arXiv:2501.01463, 2024

  52. [53]

    Reinforcement learning for human-AI collaboration via probabilistic intent inference

    Y. Lin, S. F. Ghoreishi, T. Lan, and M. Imani, “Reinforcement learning for human-AI collaboration via probabilistic intent inference,” in Reinforcement Learning Conference, 2025

  53. [54]

    Goma: Proactive embodied cooperative communication via goal-oriented mental alignment

    L. Ying, K. Jha, S. Aarya, J. B. Tenenbaum, A. Torralba, and T. Shu, “Goma: Proactive embodied cooperative communication via goal-oriented mental alignment,” in 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 7099–7106, 2024

  54. [55]

    Combo: Compositional world models for embodied multi-agent cooperation

    H. Zhang, Z. Wang, Q. Lyu, Z. Zhang, S. Chen, T. Shu, B. Dariush, K. Lee, Y. Du, and C. Gan, “Combo: Compositional world models for embodied multi-agent cooperation,” in ICLR, 2025

  55. [56]

    Infer human’s intentions before following natural language instructions

    Y. Wan, Y. Wu, Y. Wang, J. Mao, and N. Jaques, “Infer human’s intentions before following natural language instructions,” AAAI’25/IAAI’25/EAAI’25, AAAI Press, 2025

  56. [57]

    Intention understanding in human–robot interaction based on visual-nlp semantics

    L. Z, M. Y, S. Z, S. S, S. J, and Z. J, “Intention understanding in human–robot interaction based on visual-nlp semantics,” Frontiers in Neurobotics, 2021

  57. [58]

    Lit: Large language model driven intention tracking for proactive human-robot collaboration – a robot sous-chef application

    Z. Huang, J. Pohovey, A. Yammanuru, and K. Driggs-Campbell, “Lit: Large language model driven intention tracking for proactive human-robot collaboration – a robot sous-chef application,” arXiv preprint arXiv:2406.13787, 2024

  58. [59]

    Ov-mer: towards open-vocabulary multimodal emotion recognition

    Z. Lian, H. Sun, L. Sun, H. Chen, L. Chen, H. Gu, Z. Wen, S. Chen, S. Zhang, H. Yao, B. Liu, R. Liu, S. Liang, Y. Li, J. Yi, and J. Tao, “Ov-mer: towards open-vocabulary multimodal emotion recognition,” in Proceedings of the 42nd International Conference on Machine Learning, ICML’25, JMLR.org, 2025

  59. [60]

    Intentqa: Context-aware video intent reasoning

    J. Li, P. Wei, W. Han, and L. Fan, “Intentqa: Context-aware video intent reasoning,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 11963–11974, 2023

  60. [61]

    Inst-it: Boosting instance understanding via explicit visual prompt instruction tuning

    W. Peng, L. Meng, Y. Chen, Y. Xie, Y. Liu, T. Gui, H. Xu, X. Qiu, Z. Wu, and Y.-G. Jiang, “Inst-it: Boosting instance understanding via explicit visual prompt instruction tuning,” arXiv preprint arXiv:2412.03565, 2024

  61. [62]

    Parameter-efficient fine-tuning for foundation models

    D. Zhang, T. Feng, L. Xue, Y. Wang, Y. Dong, and J. Tang, “Parameter-efficient fine-tuning for foundation models,” arXiv preprint arXiv:2501.13787, 2025

  62. [63]

    Lora: Low-rank adaptation of large language models

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., “Lora: Low-rank adaptation of large language models,” ICLR, vol. 1, no. 2, p. 3, 2022

  63. [64]

    Rouge metric evaluation for text summarization techniques

    M. Barbella and G. Tortora, “Rouge metric evaluation for text summarization techniques,” Available at SSRN 4120317, 2022

  64. [65]

    Evolution of semantic similarity—a survey

    D. Chandrasekaran and V. Mago, “Evolution of semantic similarity—a survey,” ACM Computing Surveys (CSUR), vol. 54, no. 2, pp. 1–37, 2021

  65. [66]

    A fine-grained analysis of bertscore

    M. Hanna and O. Bojar, “A fine-grained analysis of bertscore,” in Proceedings of the Sixth Conference on Machine Translation, pp. 507–517, 2021. Received April 2026