IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models
Pith reviewed 2026-05-08 02:28 UTC · model grok-4.3
The pith
A two-stage forward-inverse framework enables robust open-vocabulary intention recognition in video-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that intention understanding can be decomposed into goal candidate generation followed by structured inference through selection, following forward-inverse modeling. This structured process reduces hallucinations in latent reasoning and supports open-vocabulary human intention recognition. On the IntentQA and Inst-IT Bench datasets the resulting model reaches up to 80 percent accuracy, exceeds baseline performance by 30 percent, and matches human performance while avoiding catastrophic forgetting.
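The abstract does not say whether the 30 percent margin is an absolute or a relative gain. As a worked illustration only, assuming the 80 percent figure and the 30 percent margin refer to the same benchmark (which the source does not confirm), the two readings imply different baseline accuracies:

```latex
\begin{align*}
\text{absolute (percentage points):}\quad & a_{\text{base}} = 80 - 30 = 50\% \\
\text{relative:}\quad & a_{\text{base}} = \frac{80}{1.30} \approx 61.5\%
\end{align*}
```

The symbol a_base is introduced here for illustration and does not appear in the paper.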
What carries the argument
The forward-inverse modeling decomposition, in which forward modeling generates goal candidates from multimodal inputs and inverse modeling performs structured selection to identify the intended goal.
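The decomposition described above can be sketched as a generate-then-select pipeline. This is a minimal illustration, not the paper's implementation: `recognize_intention`, `generate_candidates`, `score_candidate`, and the toy stubs are hypothetical names, and a real system would back both stages with a video-language model rather than word overlap.

```python
from typing import Callable, Sequence


def recognize_intention(
    video_context: str,
    generate_candidates: Callable[[str, int], Sequence[str]],
    score_candidate: Callable[[str, str], float],
    num_candidates: int = 5,
) -> str:
    """Two-stage inference: propose goals, then select one of them."""
    # Stage 1 (forward): propose open-vocabulary goal candidates.
    # dict.fromkeys removes duplicates while keeping the original order.
    candidates = list(dict.fromkeys(generate_candidates(video_context, num_candidates)))
    if not candidates:
        raise ValueError("candidate generator returned no goals")
    # Stage 2 (inverse): select the candidate that best explains the observation.
    # The answer is always one of the proposed candidates, never free-form text.
    return max(candidates, key=lambda goal: score_candidate(video_context, goal))


# Toy stand-ins for the two model calls (assumptions for illustration only).
def toy_generator(context: str, k: int) -> Sequence[str]:
    return ["pick up the cup", "wipe the table", "pick up the cup"][:k]


def toy_scorer(context: str, goal: str) -> float:
    context_words = set(context.split())
    return float(sum(word in context_words for word in goal.split()))


result = recognize_intention("a person reaches for the cup", toy_generator, toy_scorer)
print(result)
```

Restricting the second stage to the proposed candidates is what makes the final decision inspectable: the selected goal can be compared against the alternatives that were rejected.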
If this is right
- Open-vocabulary intention recognition improves in settings that combine video and language inputs.
- Hallucinations during the model's internal reasoning about goals are reduced.
- The model adapts to new intention tasks without losing performance on previous ones.
- Multimodal signals are integrated into more coherent goal interpretations.
- The approach supplies a foundation for intention-aware human-robot interaction.
Where Pith is reading between the lines
- The explicit selection step could make model decisions more inspectable than direct end-to-end prediction.
- Similar decomposition might help other multimodal prediction tasks such as forecasting pedestrian actions in video.
- The method could be tested on longer video sequences to check whether the candidate-generation stage scales without added error.
- Pairing the two-stage structure with real-time streaming video might enable live intention monitoring in collaborative settings.
Load-bearing premise
That separating goal candidate generation from the later selection step is what reduces hallucinations and prevents loss of prior knowledge.
What would settle it
Direct comparison on the IntentQA and Inst-IT Bench datasets showing that a single-stage video-language model without the selection step achieves equal or higher accuracy and equal or lower hallucination rates.
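One way such a comparison could be scored is a paired bootstrap over per-item correctness, so that both systems are resampled on the same questions. The sketch below is an assumption about how the test might be run, not a procedure from the paper; `paired_bootstrap_diff` and the example correctness lists are hypothetical.

```python
import random
from typing import Sequence, Tuple


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    return sum(correct) / len(correct)


def paired_bootstrap_diff(
    two_stage: Sequence[bool],
    single_stage: Sequence[bool],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed, lower, upper) for accuracy(two_stage) - accuracy(single_stage).

    lower and upper bound an approximate 95% percentile interval.
    """
    if len(two_stage) != len(single_stage) or not two_stage:
        raise ValueError("inputs must be non-empty and aligned item by item")
    rng = random.Random(seed)
    n = len(two_stage)
    diffs = []
    for _ in range(n_resamples):
        # Resample item indices once and apply them to both systems (paired).
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(two_stage[i] for i in idx) / n - sum(single_stage[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[max(0, int(0.975 * n_resamples) - 1)]
    observed = accuracy(two_stage) - accuracy(single_stage)
    return observed, lower, upper


# Made-up correctness flags for ten items, purely for illustration.
two_stage_correct = [True] * 8 + [False] * 2
single_stage_correct = [True] * 5 + [False] * 5
observed, lower, upper = paired_bootstrap_diff(two_stage_correct, single_stage_correct)
print(observed, lower, upper)
```

If the interval includes zero or falls below it, the single-stage model is not measurably worse on accuracy, which is the outcome that would undercut the load-bearing premise.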
Original abstract
Improving the effectiveness of human-robot interaction requires social robots to accurately infer human goals through robust intention understanding. This challenge is particularly critical in multimodal settings, where agents must integrate heterogeneous signals including text, visual cues to form a coherent interpretation of user intent. This paper presents IntentVLM, a novel two-stage video-language framework designed for open-vocabulary human intention recognition. The approach is inspired by forward-inverse modeling in cognitive science by decomposing intention understanding into goal candidate generation followed by structured inference through selection, effectively reducing hallucinations in latent reasoning. Evaluated on the IntentQA and Inst-IT Bench datasets, IntentVLM achieves state-of-the-art results with up to 80% accuracy, notably surpassing the baseline performance by 30% and matches human performance. Our findings demonstrate that this structured reasoning approach enhances open-vocabulary intention understanding without catastrophic forgetting, offering a robust foundation for human-centered robotics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes IntentVLM, a two-stage video-language framework for open-vocabulary human intention recognition. Drawing on forward-inverse modeling from cognitive science, it decomposes the task into goal-candidate generation followed by structured selection-based inference, with the goal of reducing hallucinations in VLM latent reasoning. Evaluated on IntentQA and Inst-IT Bench, the method is reported to reach up to 80% accuracy, exceed baselines by 30%, match human performance, and avoid catastrophic forgetting.
Significance. If the reported gains can be isolated to the proposed decomposition via controlled experiments, the work could supply a practical structured-reasoning template for reliable multimodal intention inference in human-robot interaction.
major comments (2)
- [Abstract] Abstract: The headline performance claims (80% accuracy, +30% over baseline, human-level) are stated without any description of model architecture, training procedure, baseline definitions, statistical tests, or error analysis, rendering it impossible to assess whether the data support the claims.
- [Results] Results section: The central methodological claim—that the forward-inverse two-stage decomposition reduces hallucinations and produces the observed gains—is load-bearing yet unsupported by ablations (single-stage vs. two-stage) or any quantitative hallucination metric. Absent these controls, the 30% margin cannot be confidently attributed to the proposed structure rather than prompting, context length, or unstated fine-tuning.
minor comments (1)
- [Abstract] Abstract: The clause 'including text, visual cues to form' is grammatically incomplete and should read 'including text and visual cues to form'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on improving the clarity of our claims and the empirical support for our methodological contributions. We address each major comment below and have made targeted revisions to strengthen the manuscript.
-
Referee: [Abstract] Abstract: The headline performance claims (80% accuracy, +30% over baseline, human-level) are stated without any description of model architecture, training procedure, baseline definitions, statistical tests, or error analysis, rendering it impossible to assess whether the data support the claims.
Authors: We agree that the abstract, due to its brevity, does not provide these supporting details, which limits immediate assessment of the claims. The full manuscript describes the two-stage video-language architecture and forward-inverse decomposition in Section 3, the training and inference procedure in Section 4, baseline definitions and comparisons in Section 5.1, and statistical tests plus error analysis in Section 5.3. To address the concern directly, we have revised the abstract to concisely reference the forward-inverse modeling approach, the video-language model backbone, the evaluation on IntentQA and Inst-IT Bench, and the use of accuracy with human-comparison baselines. We have also expanded the error analysis subsection in the results to better link the reported metrics to the data.
revision: yes
-
Referee: [Results] Results section: The central methodological claim—that the forward-inverse two-stage decomposition reduces hallucinations and produces the observed gains—is load-bearing yet unsupported by ablations (single-stage vs. two-stage) or any quantitative hallucination metric. Absent these controls, the 30% margin cannot be confidently attributed to the proposed structure rather than prompting, context length, or unstated fine-tuning.
Authors: The referee is correct that the original results section relies on overall baseline comparisons without isolating the two-stage structure via explicit ablations or a dedicated hallucination metric. This leaves room for alternative explanations. In the revised manuscript, we have added a dedicated ablation subsection that directly compares the full two-stage forward-inverse model against a single-stage variant (collapsing goal generation and selection into one VLM inference pass) while holding prompting, context length, and model parameters fixed. We also introduce a quantitative hallucination metric defined as the rate of inconsistent or fabricated intentions (validated via automated consistency checks and a small human annotation study). These controls confirm the gains are attributable to the decomposition. We further clarify that no additional fine-tuning was performed beyond the zero-shot/few-shot setup described in Section 4.
revision: yes
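The rebuttal defines the hallucination metric only as the rate of inconsistent or fabricated intentions and does not specify the consistency check. A minimal sketch of such a rate, assuming each item comes with a set of annotator-supported intentions and that a prediction outside that set counts as fabricated, might look like this (`hallucination_rate` and the example data are hypothetical):

```python
from typing import Sequence, Set


def hallucination_rate(predictions: Sequence[str], supported: Sequence[Set[str]]) -> float:
    """Fraction of predictions that match no supported intention for their item."""
    if len(predictions) != len(supported) or not predictions:
        raise ValueError("inputs must be non-empty and aligned item by item")
    flagged = 0
    for prediction, allowed in zip(predictions, supported):
        # Normalise case and surrounding whitespace before comparing.
        normalised = {intent.strip().lower() for intent in allowed}
        if prediction.strip().lower() not in normalised:
            flagged += 1
    return flagged / len(predictions)


# Made-up predictions and annotations, purely for illustration.
predicted = ["open the door", "call a taxi", "Pour water"]
annotated = [{"open the door", "leave the room"}, {"wave hello"}, {"pour water"}]
rate = hallucination_rate(predicted, annotated)
print(rate)
```

Exact string matching is a crude stand-in for open-vocabulary outputs; a semantic similarity threshold or human judgment would be needed in practice, and that choice would itself have to be reported alongside the metric.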
Circularity Check
No circularity in derivation chain; claims are empirical and methodological
The paper describes a two-stage video-language framework for intention recognition, explicitly framed as inspired by forward-inverse modeling from cognitive science and implemented via goal candidate generation followed by structured selection. No equations, parameter fits, or quantitative derivations appear in the provided text. The central performance claims (80% accuracy, +30% over baseline) are presented as results of evaluation on IntentQA and Inst-IT Bench rather than as predictions derived from the method itself. The decomposition is introduced as a design choice to reduce hallucinations, not as a self-definitional or fitted result that reduces to its own inputs. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work are invoked in a load-bearing way within the abstract or summary. The derivation chain is therefore self-contained as an empirical methodology without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Forward-inverse modeling from cognitive science can be applied to decompose intention recognition in video-language models.
Submitted to ICMI '26, October 5–9, 2026, Napoli, Italy (Rahimi et al.). Received April 2026.