Harmoni: Multimodal personalization of multi-user human-robot interactions with LLMs
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.HC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
IntentVLM: Open-Vocabulary Intention Recognition through Forward-Inverse Modeling with Video-Language Models
IntentVLM uses forward-inverse modeling in a two-stage video-language setup to reach up to 80% accuracy on open-vocabulary intention recognition benchmarks, beating baselines by 30% and matching human performance.