
arxiv: 2605.21479 · v1 · pith:O5UKZ2ZEnew · submitted 2026-05-20 · 💻 cs.CV · cs.AI

WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata

Pith reviewed 2026-05-21 04:37 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords visual question answering · knowledge-grounded VQA · vision-language models · Wikipedia · Wikidata · benchmark dataset · external knowledge · multiple-choice questions

The pith

WikiVQABench supplies Wikipedia images paired with questions that need both visual content and external facts from Wikidata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most existing visual question answering benchmarks can be solved from the image alone. This paper builds WikiVQABench to test models on questions that also require outside knowledge not visible in the picture. The dataset draws images and captions from Wikipedia and facts from Wikidata. Large language models first generate multiple-choice questions and answers, then human reviewers check them for accuracy, consistency with the image, and the genuine need for external knowledge. Testing fifteen vision-language models of widely different sizes produces accuracy scores that spread from 24.7 percent to 75.6 percent, showing the benchmark can separate models that handle knowledge-intensive reasoning from those that cannot.

Core claim

WikiVQABench is a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Large language models generate candidate multiple-choice image-question-answer sets that are then reviewed by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence. Evaluation of fifteen VLMs ranging from 256M to 90B parameters reveals a performance range of 24.7 to 75.6 percent accuracy.
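As a minimal sketch of how a multiple-choice accuracy figure like those above is commonly computed: the function name, the letter-label answer format, and the toy data are assumptions made for illustration, and this is not the authors' released benchmarking code.

```python
from typing import Sequence


def multiple_choice_accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of questions whose predicted option label equals the gold label."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    # Compare labels case-insensitively and ignore surrounding whitespace.
    correct = sum(p.strip().upper() == g.strip().upper()
                  for p, g in zip(predictions, gold))
    return correct / len(gold)


# Toy labels, invented for illustration only (not benchmark data).
gold = ["A", "C", "B", "D"]
preds = ["A", "c", "D", "D"]
accuracy = multiple_choice_accuracy(preds, gold)
print(f"{accuracy:.1%}")
```

Exact label matching is the simplest scoring rule; a real harness would also need a policy for free-form model outputs that do not name an option, which this sketch does not address.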

What carries the argument

The WikiVQABench construction pipeline, which merges Wikipedia images and captions with Wikidata entries, uses LLMs to propose multiple-choice questions, and applies human curation to confirm factual accuracy and the requirement for external knowledge.
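The curation step described above can be pictured roughly as follows. This is a sketch under stated assumptions: the `Candidate` and `Review` schemas, the `curate` function, and the toy items are all hypothetical, and only the three review criteria come from the paper's description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    """One LLM-proposed item (hypothetical schema, not the paper's format)."""
    item_id: str
    caption: str
    question: str
    options: Tuple[str, ...]
    answer_index: int


@dataclass(frozen=True)
class Review:
    """A human reviewer's verdict on the three criteria the paper names."""
    factually_correct: bool
    consistent_with_image: bool
    needs_external_knowledge: bool

    def accepted(self) -> bool:
        return (self.factually_correct
                and self.consistent_with_image
                and self.needs_external_knowledge)


def curate(candidates: List[Candidate], reviews: Dict[str, Review]) -> List[Candidate]:
    """Keep only candidates that were reviewed and passed all three checks."""
    return [c for c in candidates
            if c.item_id in reviews and reviews[c.item_id].accepted()]


# Toy data, invented for illustration only.
candidates = [
    Candidate("item-1", "A spider on a leaf",
              "Which family does this spider belong to?",
              ("Family A", "Family B", "Family C", "Family D"), 0),
    Candidate("item-2", "A red car", "What colour is the car?",
              ("Red", "Blue", "Green", "Black"), 0),
]
reviews = {
    "item-1": Review(True, True, True),
    # Rejected: the question is answerable from the image alone.
    "item-2": Review(True, True, False),
}
kept = curate(candidates, reviews)
print([c.item_id for c in kept])
```

Requiring all three checks to pass mirrors the paper's stated goals; how the actual pipeline resolves reviewer disagreement is not described in the material summarized here.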

If this is right

  • VLMs can be compared directly on their capacity to combine visual perception with structured external knowledge.
  • The public dataset and code enable consistent evaluation of progress toward knowledge-aware vision-language models.
  • Large gaps in model performance point to specific weaknesses in retrieving and applying facts not present in the image.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Wikipedia-plus-Wikidata pipeline could be repeated for other languages or narrower domains such as science or history.
  • Strong results on WikiVQABench may predict better performance on practical tasks like answering factual questions about photographed places or objects.
  • Researchers could analyze error patterns to identify which categories of external knowledge remain hardest for current models.

Load-bearing premise

Human annotators can reliably verify that each generated question requires external knowledge in addition to visual evidence and that the facts drawn from Wikidata are accurate and consistent with the image.

What would settle it

If human annotators frequently disagree on whether questions truly need external knowledge, or if the fifteen VLMs produce nearly identical accuracy scores irrespective of size, the benchmark would fail to discriminate knowledge-intensive reasoning.

Figures

Figures reproduced from arXiv: 2605.21479 by Anna Lisa Gentile, Basel Shbita, Pengyuan Li.

Figure 1. Example from WikiVQABench illustrating a knowledge-grounded multiple-choice VQA instance. The image depicts a spider whose taxonomic classification cannot be determined from visual appearance alone. Correctly answering the question requires external biological knowledge linking visual cues to entity-level taxonomy (e.g., family or genus), demonstrating the benchmark’s emphasis on required knowledge beyon…
Figure 2. Overview of the WikiVQABench dataset construction pipeline. Left: image and caption …
Figure 3. Question difficulty distribution. Each bar shows …
Figure 4. Screenshot of the UI used for human curation and quality control. Annotators are shown …
Original abstract

Visual Question Answering (VQA) benchmarks have largely emphasized perception-based tasks that can be solved from visual content alone. In contrast, many real-world scenarios require external knowledge that is not directly observable in the image to answer correctly. We introduce WikiVQABench, a human-curated knowledge-grounded VQA benchmark constructed by systematically combining Wikipedia images, their associated article captions, and structured knowledge from Wikidata. Our pipeline uses large language models (LLMs) to generate candidate multiple-choice image-question-answer sets. All generated instances are subsequently reviewed and curated by human annotators to ensure factual correctness, visual-text consistency, and that each question requires external knowledge in addition to visual evidence for correct resolution. WikiVQABench comprises a substantial collection of Wikipedia images with curated multiple-choice questions designed to benchmark knowledge-aware vision-language models (VLMs). Evaluation of fifteen VLMs (256M-90B parameters) reveals a wide performance range (24.7%-75.6% accuracy), demonstrating that the benchmark effectively discriminates model capabilities on knowledge-intensive reasoning. The dataset and benchmarking code are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces WikiVQABench, a knowledge-grounded VQA benchmark built from Wikipedia images, article captions, and Wikidata facts. Candidate multiple-choice questions are generated via LLMs and then filtered through human review to enforce factual accuracy, image consistency, and the requirement for external knowledge beyond visual content. Fifteen VLMs (256M–90B parameters) are evaluated, yielding accuracies from 24.7% to 75.6%; the authors conclude that this range shows the benchmark successfully discriminates knowledge-intensive reasoning capabilities. The dataset and evaluation code are released publicly.

Significance. If the human curation reliably isolates questions that require external Wikidata knowledge, the benchmark would fill a clear gap between perception-only VQA datasets and real-world knowledge-intensive tasks. The public release of data and code, together with the broad model-size sweep, would support reproducible progress on knowledge-aware VLMs.

major comments (1)
  1. [Abstract] The human curation process is described only qualitatively ('reviewed and curated by human annotators to ensure … that each question requires external knowledge'). No inter-annotator agreement, annotator count, rejection rate, or post-curation error analysis is reported. Because the benchmark’s claim to discriminate knowledge-aware reasoning rests entirely on the validity of this filter, the missing quantitative validation is load-bearing.
minor comments (2)
  1. [Evaluation] The evaluation section would benefit from reporting per-model standard errors or confidence intervals on the accuracy figures to support the claim of a 'wide performance range'.
  2. [Dataset] A short table summarizing the final dataset statistics (number of images, questions, average options, rejection rate) would improve clarity.
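To make the first minor comment concrete, here is one way a per-model confidence interval on accuracy could be computed. The Wilson score interval is chosen here for illustration and is not a method the paper reports using; the question count of 1000 is a placeholder, since the dataset size is not stated in the material above.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Placeholder counts: 756 correct out of an ASSUMED 1000 questions.
low, high = wilson_interval(756, 1000)
print(f"accuracy 75.6%, approx. 95% CI: [{low:.3f}, {high:.3f}]")
```

The interval width shrinks roughly with the square root of the number of questions, so whether two models' scores are distinguishable depends on the dataset size.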

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and for highlighting the importance of quantitative validation for the human curation process. We agree that this aspect is central to the benchmark's credibility and will strengthen the manuscript with additional details in the revision.

Point-by-point responses
  1. Referee: [Abstract] The human curation process is described only qualitatively ('reviewed and curated by human annotators to ensure … that each question requires external knowledge'). No inter-annotator agreement, annotator count, rejection rate, or post-curation error analysis is reported. Because the benchmark’s claim to discriminate knowledge-aware reasoning rests entirely on the validity of this filter, the missing quantitative validation is load-bearing.

    Authors: We agree that the current manuscript provides only a qualitative description of the curation process. In the revised version, we will expand the relevant sections (including the abstract and methods) to report: (1) the number of annotators and their qualifications, (2) inter-annotator agreement statistics (e.g., Fleiss' kappa or percentage agreement on key criteria such as factual correctness and knowledge requirement), (3) the overall rejection rate of LLM-generated candidates, and (4) a post-curation error analysis performed on a held-out sample of accepted items. These additions will directly address the load-bearing concern about the filter's reliability.
    revision: yes
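For reference, the Fleiss' kappa statistic mentioned in the rebuttal can be computed from a table of per-item category counts. The sketch below is illustrative only: the function name and the toy ratings are assumptions, and no agreement figures from the paper are implied.

```python
from typing import List, Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of annotators who put item i in category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    if not counts or not counts[0]:
        raise ValueError("need at least one item and one category")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])
    # Share of all ratings that fell into each category.
    p_j: List[float] = [sum(row[j] for row in counts) / (n_items * n_raters)
                        for j in range(n_categories)]
    # Observed agreement per item, then averaged over items.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_items
    # Agreement expected by chance.
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


# Toy ratings (invented): 4 items, 3 annotators, categories = [accept, reject].
ratings = [[3, 0], [2, 1], [1, 2], [0, 3]]
kappa = fleiss_kappa(ratings)
print(f"Fleiss' kappa = {kappa:.3f}")
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when most candidates fall into a single category such as "accept".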

Circularity Check

0 steps flagged

No circularity in dataset construction or evaluation


The paper presents a pipeline for constructing a VQA benchmark by combining Wikipedia images, captions, and Wikidata facts, using LLMs to generate candidates followed by human review for factual accuracy and external-knowledge requirement. It then evaluates fifteen existing VLMs on the resulting dataset and reports accuracy ranges. No mathematical derivations, equations, fitted parameters, or predictions appear in the described process; the central claims rest on the curation steps and empirical results rather than on any self-referential reduction or load-bearing self-citation. The work is self-contained as a data-construction and benchmarking effort.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new dataset rather than deriving results from prior equations or postulates; no free parameters, axioms, or invented entities are required beyond standard assumptions about human annotation quality and Wikidata accuracy.




Reference graph

Works this paper leans on

132 extracted references · 132 canonical work pages · 9 internal anchors

  1. [1]

    Abril and Robert Plant

    Patricia S. Abril and Robert Plant. The patent holder's dilemma: Buy, sell, or troll?. Communications of the ACM. doi:10.1145/1188913.1188915

  2. [2]

    Deciding equivalances among conjunctive aggregate queries

    Sarah Cohen and Werner Nutt and Yehoshua Sagic. Deciding equivalances among conjunctive aggregate queries. doi:10.1145/1219092.1219093

  3. [3]

    Special issue: Digital Libraries. 1996

  4. [4]

    Understanding Policy-Based Networking

    David Kosiur. Understanding Policy-Based Networking

  5. [7]

    doi:10.1007/3-540-09237-4

    The title of book two. doi:10.1007/3-540-09237-4

  6. [8]

    Asad Z. Spector. Achieving application requirements. Distributed Systems. doi:10.1145/90417.90738

  7. [9]

    Douglass and David Harel and Mark B

    Bruce P. Douglass and David Harel and Mark B. Trakhtenbrot. Statecarts in use: structured analysis and object-orientation. Lectures on Embedded Systems. doi:10.1007/3-540-65193-4_29

  8. [10]

    Donald E. Knuth. The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd. ed.)

  9. [11]

    Donald E. Knuth. The Art of Computer Programming

  10. [12]

    Structured Variational Inference Procedures and their Realizations (as incol)

    Dan Geiger and Christopher Meek. Structured Variational Inference Procedures and their Realizations (as incol). Proceedings of Tenth International Workshop on Artificial Intelligence and Statistics, The Barbados

  11. [13]

    Stan W. Smith. An experiment in bibliographic mark-up: Parsing metadata for XML export. Proceedings of the 3rd. annual workshop on Librarians and Computers

  12. [14]

    Catch me, if you can: Evading network signatures with web-based polymorphic worms

    Matthew Van Gundy and Davide Balzarotti and Giovanni Vigna. Catch me, if you can: Evading network signatures with web-based polymorphic worms. Proceedings of the first USENIX workshop on Offensive Technologies

  13. [15]

    Predicate Path expressions

    Sten Andler. Predicate Path expressions. Proceedings of the 6th. ACM SIGACT-SIGPLAN symposium on Principles of Programming Languages. doi:10.1145/567752.567774

  14. [16]

    LOGICS of Programs: AXIOMATICS and DESCRIPTIVE POWER

    David Harel. LOGICS of Programs: AXIOMATICS and DESCRIPTIVE POWER

  15. [17]

    Anisi , title =

    David A. Anisi , title =

  16. [18]

    Clarkson

    Kenneth L. Clarkson. Algorithms for Closest-Point Problems (Computational Geometry)

  17. [19]

    Introduction to Bayesian Statistics

    Harry Thornburg. Introduction to Bayesian Statistics. 2001

  18. [20]

    CLIFFORD: a Maple 11 Package for Clifford Algebra Computations, version 11

    Rafal Ablamowicz and Bertfried Fauser. CLIFFORD: a Maple 11 Package for Clifford Algebra Computations, version 11. 2007

  19. [21]

    Stats and Analysis

    Poker-Edge.Com. Stats and Analysis. 2006

  20. [22]

    A more perfect union

    Barack Obama. A more perfect union

  21. [23]

    The fountain of youth

    Joseph Scientist. The fountain of youth

  22. [24]

    Solder man

    Dave Novak. Solder man. ACM SIGGRAPH 2003 Video Review on Animation theater Program: Part I - Vol. 145 (July 27--27, 2003). doi:10.945/woot07-S422

  23. [25]

    Interview with Bill Kinder: January 13, 2005

    Newton Lee. Interview with Bill Kinder: January 13, 2005. Comput. Entertain. doi:10.1145/1057270.1057278

  24. [26]

    The Enabling of Digital Libraries

    Bernard Rous. The Enabling of Digital Libraries. Digital Libraries

  25. [28]

    (new) Finding minimum congestion spanning trees , journal =

    Werneck, Renato and Setubal, Jo\. (new) Finding minimum congestion spanning trees , journal =. doi:10.1145/351827.384253 , acmid = 384253, publisher =

  26. [30]

    and Mei, Alessandro , title =

    Conti, Mauro and Di Pietro, Roberto and Mancini, Luigi V. and Mei, Alessandro , title =. Inf. Fusion , volume =. 2009 , issn =. doi:10.1016/j.inffus.2009.01.002 , acmid =

  27. [31]

    and Hutchful, David K

    Li, Cheng-Lun and Buyuktur, Ayse G. and Hutchful, David K. and Sant, Natasha B. and Nainwal, Satyendra K. , title =. CHI '08 extended abstracts on Human factors in computing systems , year =. doi:10.1145/1358628.1358946 , acmid =

  28. [32]

    , title =

    Hollis, Billy S. , title =. 1999 , isbn =

  29. [33]

    Goossens, Michel and Rahtz, S. P. and Moore, Ross and Sutor, Robert S. , title =. 1999 , isbn =

  30. [34]

    and Rosenberg, Arnold L

    Buss, Jonathan F. and Rosenberg, Arnold L. and Knott, Judson D. , title =. 1987 , source =

  31. [35]

    CHI '08: CHI '08 extended abstracts on Human factors in computing systems , year =

    , note =. CHI '08: CHI '08 extended abstracts on Human factors in computing systems , year =

  32. [36]

    Algorithms for Closest-Point Problems (Computational Geometry) , year =

    Clarkson, Kenneth Lee , advisor =. Algorithms for Closest-Point Problems (Computational Geometry) , year =

  33. [37]

    SIGCOMM Comput. Commun. Rev. , year =

  34. [38]

    2004 , isbn =

    IEEE TCSC Executive Committee , booktitle =. 2004 , isbn =. doi:http://dx.doi.org/10.1109/ICWS.2004.64 , acmid =

  35. [39]

    Distributed systems (2nd Ed.) , year =

  36. [40]

    , title =

    Petrie, Charles J. , title =. 1986 , source =

  37. [41]

    Donald E. Knuth. Seminumerical Algorithms. 1981

  38. [42]

    E-commerce and cultural values , year =

    Kong, Wei-Chang , Title =. E-commerce and cultural values , year =

  39. [43]

    E-commerce and cultural values , year =

    Kong, Wei-Chang , type =. E-commerce and cultural values , year =

  40. [44]

    Chapter 9 , booktitle =

    Kong, Wei-Chang , editor =. Chapter 9 , booktitle =

  41. [45]

    E-commerce and cultural values , editor =

    Kong, Wei-Chang , title =. E-commerce and cultural values , editor =. 2003 , isbn =

  42. [46]

    E-commerce and cultural values - (InBook-num-in-chap) , chapter =

    Kong, Wei-Chang , editor =. E-commerce and cultural values - (InBook-num-in-chap) , chapter =. 2004 , address =

  43. [47]

    E-commerce and cultural values (Inbook-text-in-chap) , chapter =

    Kong, Wei-Chang , editor =. E-commerce and cultural values (Inbook-text-in-chap) , chapter =. 2005 , address =

  44. [48]

    E-commerce and cultural values (Inbook-num chap) , chapter =

    Kong, Wei-Chang , editor =. E-commerce and cultural values (Inbook-num chap) , chapter =. 2006 , address =

  45. [49]

    Microelectron

    Mehdi Saeedi and Morteza Saheb Zamani and Mehdi Sedighi , title =. Microelectron. J. , volume =. 2010 , pages =

  46. [50]

    Mehdi Saeedi and Morteza Saheb Zamani and Mehdi Sedighi and Zahra Sasanian , title =. J. Emerg. Technol. Comput. Syst. , volume =

  47. [51]

    Kirschmer, Markus and Voight, John , title =. SIAM J. Comput. , issue_date =. 2010 , issn =. doi:https://doi.org/10.1137/080734467 , acmid =

  48. [52]

    Hoare, C. A. R. , title =. Structured programming (incoll) , editor =. 1972 , isbn =

  49. [53]

    History of programming languages I (incoll) , editor =

    Lee, Jan , title =. History of programming languages I (incoll) , editor =. 1981 , isbn =. doi:http://doi.acm.org/10.1145/800025.1198348 , acmid =

  50. [54]

    , title =

    Dijkstra, E. , title =. Classics in software engineering (incoll) , year =

  51. [55]

    , title =

    Wenzel, Elizabeth M. , title =. Multimedia interface design (incoll) , year =. doi:10.1145/146022.146089 , acmid =

  52. [56]

    , title =

    Mumford, E. , title =. Critical issues in information systems research (incoll) , year =

  53. [57]

    and Golden, Donald G

    McCracken, Daniel D. and Golden, Donald G. , title =. 1990 , isbn =

  54. [58]

    The analysis of linear partial differential operators

    H. The analysis of linear partial differential operators. 1985 , PAGES =

  55. [59]

    IEEE", address =

    A. Adya and P. Bahl and J. Padhye and A.Wolman and L. Zhou , title =. Proceedings of the IEEE 1st International Conference on Broadnets Networks (BroadNets'04) , publisher = "IEEE", address = "Los Alamitos, CA", year =

  56. [60]

    I. F. Akyildiz and W. Su and Y. Sankarasubramaniam and E. Cayirci , title =. Comm. ACM , volume = 38, number = "4", year =

  57. [61]

    I. F. Akyildiz and T. Melodia and K. R. Chowdhury , title =. Computer Netw. , volume = 51, number = "4", year =

  58. [62]

    ACM", address =

    P. Bahl and R. Chancre and J. Dungeon , title =. Proceeding of the 10th International Conference on Mobile Computing and Networking (MobiCom'04) , publisher = "ACM", address = "New York, NY", year =

  59. [63]

    8 (Special Issue on Sensor Networks)

    D. Culler and D. Estrin and M. Srivastava , title =. IEEE Comput. , volume = 37, number = "8 (Special Issue on Sensor Networks)", publisher = "IEEE", address = "Los Alamitos, CA", year =

  60. [64]

    Natarajan and M

    A. Natarajan and M. Motani and B. de Silva and K. Yap and K. C. Chua , title =. Network Architectures , editor =. 960935712

  61. [65]

    Tzamaloukas and J

    A. Tzamaloukas and J. J. Garcia-Luna-Aceves , title =

  62. [66]

    Zhou and J

    G. Zhou and J. Lu and C.-Y. Wan and M. D. Yarvis and J. A. Stankovic , title =

  63. [67]

    Mapping Powerlists onto Hypercubes

    Jacob Kornerup. Mapping Powerlists onto Hypercubes. 1994

  64. [68]

    Automatic Parallelization for Distributed-Memory Multiprocessing Systems

    Michael Gerndt. Automatic Parallelization for Distributed-Memory Multiprocessing Systems

  65. [69]

    J. E. Archer, Jr. and R. Conway and F. B. Schneider. User recovery and reversal in interactive systems. ACM Trans. Program. Lang. Syst

  66. [70]

    D. D. Dunlop and V. R. Basili. Generalizing specifications for uniformly implemented loops. ACM Trans. Program. Lang. Syst

  67. [71]

    Heering and P

    J. Heering and P. Klint. Towards monolingual programming environments. ACM Trans. Program. Lang. Syst

  68. [72]

    Donald E. Knuth. The book

  69. [73]

    Korach and D

    E. Korach and D. Rotem and N. Santoro. Distributed algorithms for finding centers and medians in networks. ACM Trans. Program. Lang. Syst

  70. [74]

    : A Document Preparation System

    Leslie Lamport. : A Document Preparation System

  71. [75]

    F. Nielson. Program transformations in a denotational setting. ACM Trans. Program. Lang. Syst

  72. [76]

    Brian K. Reid. A high-level approach to computer document formatting. Proceedings of the 7th Annual Symposium on Principles of Programming Languages

  73. [77]

    and Abdelzaher, Tarek F

    Zhou, Gang and Wu, Yafeng and Yan, Ting and He, Tian and Huang, Chengdu and Stankovic, John A. and Abdelzaher, Tarek F. , title =. ACM Trans. Embed. Comput. Syst. , issue_date =. doi:10.1145/1721695.1721705 , acmid = 1721705, publisher =

  74. [78]

    Institutional members of the Users Group

  75. [79]

    Boris Veytsman , title =

  76. [80]

    Robin Schneider , title =

  77. [81]

    and Peterson, Larry L

    Bowman, Mic and Debray, Saumya K. and Peterson, Larry L. , title =. ACM Trans. Program. Lang. Syst. , volume =. 1993 , doi =

  78. [82]

    TUGboat , volume =

    Braams, Johannes , title =. TUGboat , volume =

  79. [83]

    Post Congress Tristesse

    Malcolm Clark. Post Congress Tristesse. TeX90 Conference Proceedings

  80. [84]

    ACM Trans

    Herlihy, Maurice , title =. ACM Trans. Program. Lang. Syst. , volume =. 1993 , doi =

Showing first 80 references.