WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
Pith reviewed 2026-05-21 04:37 UTC · model grok-4.3
The pith
WikiVQABench supplies Wikipedia images paired with questions that need both visual content and external facts from Wikidata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WikiVQABench is a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Large language models generate candidate multiple-choice image-question-answer sets that are then reviewed by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence. Evaluation of fifteen VLMs ranging from 256M to 90B parameters reveals a performance range of 24.7 to 75.6 percent accuracy.
What carries the argument
The WikiVQABench construction pipeline, which merges Wikipedia images and captions with Wikidata entries, uses LLMs to propose multiple-choice questions, and applies human curation to confirm factual accuracy and the requirement for external knowledge.
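The curation gate at the end of that pipeline can be sketched in a few lines. This is a minimal illustration, not the authors' code: the `Candidate` and `Review` types and every field name are hypothetical, and only the three review criteria (factual correctness, visual-text consistency, need for external knowledge) come from the paper's description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One LLM-proposed multiple-choice item (hypothetical schema)."""
    image_url: str
    caption: str
    wikidata_facts: dict[str, str]
    question: str
    options: tuple[str, ...]
    answer_index: int


@dataclass(frozen=True)
class Review:
    """One human verdict on the three criteria named in the paper."""
    factually_correct: bool
    consistent_with_image: bool
    needs_external_knowledge: bool


def accept(review: Review) -> bool:
    # Assumption: an item is kept only if it passes all three checks.
    return (
        review.factually_correct
        and review.consistent_with_image
        and review.needs_external_knowledge
    )


def curate(pairs: list[tuple[Candidate, Review]]) -> list[Candidate]:
    """Keep the candidates whose review passes every criterion."""
    return [candidate for candidate, review in pairs if accept(review)]
```

Treating the three checks as a strict conjunction is the simplest reading of the description; a real pipeline might instead aggregate several annotators per item before deciding.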
If this is right
- VLMs can be compared directly on their capacity to combine visual perception with structured external knowledge.
- The public dataset and code enable consistent evaluation of progress toward knowledge-aware vision-language models.
- Large gaps in model performance point to specific weaknesses in retrieving and applying facts not present in the image.
Where Pith is reading between the lines
- The same Wikipedia-plus-Wikidata pipeline could be repeated for other languages or narrower domains such as science or history.
- Strong results on WikiVQABench may predict better performance on practical tasks like answering factual questions about photographed places or objects.
- Researchers could analyze error patterns to identify which categories of external knowledge remain hardest for current models.
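The error-pattern analysis suggested in the last point could start from per-category accuracy. The sketch below assumes each evaluated question carries a knowledge-category label; the summary does not say whether the released dataset includes such labels, so the record layout and the category names in the usage example are hypothetical.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """records: iterable of (category, predicted_index, gold_index) tuples."""
    totals = defaultdict(lambda: [0, 0])  # category -> [correct, seen]
    for category, predicted, gold in records:
        totals[category][0] += int(predicted == gold)
        totals[category][1] += 1
    return {cat: correct / seen for cat, (correct, seen) in totals.items()}


def hardest_first(records):
    """Categories sorted from lowest to highest accuracy."""
    return sorted(accuracy_by_category(records).items(), key=lambda kv: kv[1])
```

For example, `hardest_first([("dates", 0, 0), ("dates", 1, 2), ("people", 3, 3)])` ranks the hypothetical `"dates"` category ahead of `"people"` because only half of its answers are correct.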
Load-bearing premise
Human annotators can reliably verify that each generated question requires external knowledge in addition to visual evidence and that the facts drawn from Wikidata are accurate and consistent with the image.
What would settle it
If human annotators frequently disagree on whether questions truly need external knowledge, or if the fifteen VLMs produce nearly identical accuracy scores irrespective of size, the benchmark would fail to discriminate among models on knowledge-intensive reasoning.
Original abstract
Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WikiVQABench, a knowledge-grounded VQA benchmark built from Wikipedia images, article captions, and Wikidata facts. Candidate multiple-choice questions are generated via LLMs and then filtered through human review to enforce factual accuracy, image consistency, and the requirement for external knowledge beyond visual content. Fifteen VLMs (256M–90B parameters) are evaluated, yielding accuracies from 24.7% to 75.6%; the authors conclude that this range shows the benchmark successfully discriminates knowledge-intensive reasoning capabilities. The dataset and evaluation code are released publicly.
Significance. If the human curation reliably isolates questions that require external Wikidata knowledge, the benchmark would fill a clear gap between perception-only VQA datasets and real-world knowledge-intensive tasks. The public release of data and code, together with the broad model-size sweep, would support reproducible progress on knowledge-aware VLMs.
major comments (1)
- [Abstract] The human curation process is described only qualitatively ('reviewed and curated by human annotators to ensure … that each question requires external knowledge'). No inter-annotator agreement, annotator count, rejection rate, or post-curation error analysis is reported. Because the benchmark's claim to discriminate knowledge-aware reasoning rests entirely on the validity of this filter, the missing quantitative validation is load-bearing.
minor comments (2)
- [Evaluation] The evaluation section would benefit from reporting per-model standard errors or confidence intervals on the accuracy figures to support the claim of a 'wide performance range'.
- [Dataset] A short table summarizing the final dataset statistics (number of images, questions, average options, rejection rate) would improve clarity.
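The first minor comment can be made concrete with a Wilson score interval on each accuracy figure. The sketch below is one common choice, not a method taken from the paper, and the test-set size is a placeholder because this summary does not state the real number of questions.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical test-set size; replace with the real question count.
n_questions = 1000
low, high = wilson_interval(round(0.247 * n_questions), n_questions)
print(f"24.7% accuracy on {n_questions} questions: [{low:.3f}, {high:.3f}]")
```

Intervals of this kind would show whether adjacent models in the reported 24.7%-75.6% range are separated by more than sampling noise.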
Simulated Author's Rebuttal
We thank the referee for the thoughtful review and for highlighting the importance of quantitative validation for the human curation process. We agree that this aspect is central to the benchmark's credibility and will strengthen the manuscript with additional details in the revision.
Point-by-point responses
- Referee: [Abstract] The human curation process is described only qualitatively ('reviewed and curated by human annotators to ensure … that each question requires external knowledge'). No inter-annotator agreement, annotator count, rejection rate, or post-curation error analysis is reported. Because the benchmark's claim to discriminate knowledge-aware reasoning rests entirely on the validity of this filter, the missing quantitative validation is load-bearing.
  Authors: We agree that the current manuscript provides only a qualitative description of the curation process. In the revised version, we will expand the relevant sections (including the abstract and methods) to report: (1) the number of annotators and their qualifications, (2) inter-annotator agreement statistics (e.g., Fleiss' kappa or percentage agreement on key criteria such as factual correctness and knowledge requirement), (3) the overall rejection rate of LLM-generated candidates, and (4) a post-curation error analysis performed on a held-out sample of accepted items. These additions will directly address the load-bearing concern about the filter's reliability.
  Revision: yes
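For reference, the Fleiss' kappa statistic named in the rebuttal can be computed as follows. This is a plain-Python sketch of the standard formula, assuming every item is rated by the same number of annotators; the input layout is an illustrative choice, and no annotation data from the paper is used.

```python
import math


def fleiss_kappa(counts: list[list[int]]) -> float:
    """counts[i][j] = number of annotators who assigned item i to category j."""
    if not counts:
        raise ValueError("need at least one item")
    n_items = len(counts)
    n_cats = len(counts[0])
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two annotators per item")
    if any(len(row) != n_cats or sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same categories and rater count")

    # Share of all ratings that fall in each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    if math.isclose(p_exp, 1.0):
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_exp) / (1 - p_exp)
```

With a binary criterion such as "requires external knowledge", each row would hold two counts, for example `[3, 0]` when three annotators all answer yes.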
Circularity Check
No circularity in dataset construction or evaluation
The paper presents a pipeline for constructing a VQA benchmark by combining Wikipedia images, captions, and Wikidata facts, using LLMs to generate candidates followed by human review for factual accuracy and external-knowledge requirement. It then evaluates fifteen existing VLMs on the resulting dataset and reports accuracy ranges. No mathematical derivations, equations, fitted parameters, or predictions appear in the described process; the central claims rest on the curation steps and empirical results rather than any self-referential reduction or self-citation load-bearing argument. The work is self-contained as a data-construction and benchmarking effort.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean: absolute_floor_iff_bare_distinguishability (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.