DUALVISION is a new lightweight fusion module using localized cross-attention to integrate infrared with RGB data in MLLMs, improving robustness to degradations and supported by the new DV-204K training dataset and DV-500 benchmark.
5 Pith papers cite this work.
Years: 2026 (5)
Verdicts: UNVERDICTED (5)
citing papers explorer
- DUALVISION: RGB-Infrared Multimodal Large Language Models for Robust Visual Reasoning
  DUALVISION is a new lightweight fusion module using localized cross-attention to integrate infrared with RGB data in MLLMs, improving robustness to degradations and supported by the new DV-204K training dataset and DV-500 benchmark.
- Attention-space Contrastive Guidance for Efficient Hallucination Mitigation in LVLMs
  ACG mitigates hallucinations in LVLMs via single-pass contrastive guidance in attention space that suppresses text-only biases through masking and orthogonal projection.
- AutoGUI-v2: A Comprehensive Multi-Modal GUI Functionality Understanding Benchmark
  AutoGUI-v2 is a new benchmark exposing that VLMs handle basic GUI grounding but struggle with complex interaction logic and state prediction.
- How to Correctly Make Mistakes: A Framework for Constructing and Benchmarking Mistake Aware Egocentric Procedural Videos
  PIE-V is a framework that injects plausible mistakes and corrections into egocentric procedural videos via psychology-informed planning and LLM-assisted video synthesis, paired with a nine-metric human rubric for benchmarking.
- From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents
  A conformal interpretability method labels LLM agent states step-by-step and extracts linearly separable temporal concept directions aligned with task success on ScienceWorld and AlfWorld.