Dissecting multimodality in VideoQA transformer models by impairing modality fusion. arXiv preprint arXiv:2306.08889, 2023.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
H-GRPO: Permutation-Invariant Reinforcement Learning for Grounded Visual Reasoning
De-compositional Evidence Grounding decomposes visual reasoning into atomic sub-questions, each tied to a specific image region, to improve VLM performance and interpretability.