WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3representative citing papers
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
Systematic evaluation finds cross-modal skill injection via model merging succeeds in instruction-following and cross-lingual scenarios but fails in mathematical reasoning, with TA and DARE methods outperforming others after hyperparameter analysis.
citing papers explorer
-
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
-
Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
-
Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
Systematic evaluation finds cross-modal skill injection via model merging succeeds in instruction-following and cross-lingual scenarios but fails in mathematical reasoning, with TA and DARE methods outperforming others after hyperparameter analysis.