HakushoBench: A Japanese Chart and Table VQA Benchmark from Governmental White Papers

Issa Sugiura; Naoaki Okazaki; Shuhei Kurita; Yusuke Oda

arxiv: 2606.01132 · v1 · pith:KZSH6DI5new · submitted 2026-05-31 · 💻 cs.CV

Pith reviewed 2026-06-28 17:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords: Japanese VQA, chart understanding, table understanding, vision-language models, governmental documents, benchmark construction, non-English evaluation

The pith

HakushoBench, built from 33 Japanese governmental white papers, shows the best open-weight vision-language model reaches only 58.6 percent accuracy on chart and table questions with a 34.9-point gap to proprietary models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds HakushoBench to test whether advances in English chart and table understanding carry over to Japanese. It extracts 2,053 images spanning more than ten types from freely available government reports and pairs them with manually written questions that require integrated reasoning over the entire visual element. Experiments on a range of vision-language models establish that open-weight systems top out at 58.6 percent while proprietary models sit 34.9 points higher. This gap indicates that current English-centric progress leaves substantial shortfalls in non-English document tasks. The release of the dataset and code supplies a concrete test bed for closing that shortfall.

Core claim

HakushoBench supplies 2,053 chart and table images drawn from 33 Japanese governmental white papers together with manually annotated questions that target deep holistic understanding; evaluation across vision-language models shows open-weight models limited to 58.6 percent accuracy and a 34.9-point deficit relative to proprietary systems.

What carries the argument

HakushoBench, the dataset of charts and tables extracted from governmental white papers and paired with questions written to demand integrated visual-textual reasoning rather than isolated cues.

If this is right

Open-weight vision-language models require further development to handle non-English chart and table reasoning at proprietary levels.
Governmental white papers offer a scalable route to realistic multilingual benchmarks in other countries.
English benchmark gains do not automatically translate to Japanese document understanding tasks.
Annotation that prioritizes holistic questions produces harder evaluations than methods focused on local features.
Public release of the images and questions allows direct measurement of progress toward closing the observed gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Comparable benchmarks could be assembled from white papers in additional languages to check cross-lingual generalization.
Models adapted to HakushoBench may improve accuracy in practical Japanese administrative document processing.
The size of the gap suggests language-specific data or architectural adjustments may matter more than raw scale alone.

Load-bearing premise

The manually created questions genuinely require deep and holistic understanding of the charts and tables instead of local visual shortcuts.

What would settle it

A new open-weight model that scores above 80 percent on the full HakushoBench test set without task-specific fine-tuning would show the claimed performance gap does not hold.

Figures

Figures reproduced from arXiv: 2606.01132 by Issa Sugiura, Naoaki Okazaki, Shuhei Kurita, Yusuke Oda.

**Figure 2.** Diversity of image types in HakushoBench. One randomly sampled example is shown for each image type.

**Figure 3.** Construction pipeline of HakushoBench. Chart and table images are collected from 33 Japanese white …

**Figure 4.** Distribution of QA pairs in HakushoBench.

**Figure 5.** Representative VQA pairs in HakushoBench, requiring multi-hop reasoning and global image understand…

**Figure 6.** Performance of each model on HakushoBench under the Direct and CoT settings. Since Gemini 3 Pro …

**Figure 7.** Accuracy spread across models on HakushoBench, grouped by image type. Categories with fewer than 50 examples (Area, Scatter, Bubble, and Other) are omitted.

HakushoBench is more challenging than JGraphQA. Compared to JGraphQA, HakushoBench is much harder for open-weight models: Qwen3-VL 8B reaches only 58.6% on HakushoBench versus 88.8% on JGraphQA, and Sarashina2.2-Vision 3B reaches 37.7% versus 81.0% …

**Figure 8.** Representative failure cases of Gemini 3 Pro on HakushoBench. Left: a perception error in reading scatter …

**Figure 9.** Accuracy spread across models on HakushoBench, grouped by question type.

F Evaluation Results on Existing Chart and Table Benchmarks. Figures 10, 11, 12, and 13 visualize the per-model Direct and CoT accuracy on ChartQA, ChartQAPro, CharXiv, and JGraphQA as bar charts.

G Image Showcases for Comparison Benchmarks. Figures 14, 16, 17, and 15 show one randomly sampled image per image-type category for each benchm…

**Figure 10.** Performance of each model on ChartQA. [Bar chart: Direct and CoT accuracy (%) per model.]

**Figure 11.** Performance of each model on ChartQAPro.

**Figure 12.** Performance of each model on CharXiv. [Bar chart: Direct and CoT accuracy (%) per model.]

**Figure 13.** Performance of each model on JGraphQA.

**Figure 14.** ChartQA: one randomly sampled image per image type.

**Figure 15.** JGraphQA: one randomly sampled image per image type.

**Figure 16.** ChartQAPro: one randomly sampled image per image type.

**Figure 17.** CharXiv: one randomly sampled image per image type.
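The per-image-type breakdown described in the Figure 7 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the record layout (`image_type`, `correct`) and the function name `accuracy_by_type` are assumptions, and the data is a toy example. Only the rule of omitting categories with fewer than 50 examples comes from the caption.

```python
from collections import defaultdict

def accuracy_by_type(records, min_examples=50):
    """Return {image_type: accuracy in percent}, skipping small categories.

    `records` is an iterable of dicts with the hypothetical keys
    'image_type' (str) and 'correct' (bool).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["image_type"]] += 1
        hits[rec["image_type"]] += int(rec["correct"])
    return {
        t: 100.0 * hits[t] / n
        for t, n in totals.items()
        if n >= min_examples  # categories below the threshold are omitted
    }

# Toy data, not taken from the paper: 60 "Bar" items (45 correct), 10 "Bubble" items.
toy = (
    [{"image_type": "Bar", "correct": i < 45} for i in range(60)]
    + [{"image_type": "Bubble", "correct": True} for _ in range(10)]
)
print(accuracy_by_type(toy))  # {'Bar': 75.0}
```

The threshold keeps very small categories from dominating the plot with noisy estimates; here "Bubble" has only 10 items and is dropped.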

Original abstract

Understanding chart and table images is essential for applying vision-language models (VLMs) to real-world document understanding. While English benchmarks have advanced rapidly, non-English counterparts remain scarce, leaving it unclear whether this progress generalizes across languages. A key obstacle is the difficulty of collecting realistic and diverse non-English chart and table images at scale. To address this, we leverage governmental white papers as a scalable source for benchmark construction beyond English, as they contain naturally occurring charts and tables across diverse formats and domains and are freely accessible in many countries. As a first instantiation, we introduce HakushoBench, a challenging Japanese chart and table VQA benchmark built from 33 governmental white papers. HakushoBench contains 2,053 images spanning over 10 image types, with manually annotated QA pairs, designed to assess deep and holistic understanding of charts and tables, rather than local visual cues alone. Experiments across a broad range of VLMs demonstrate that HakushoBench remains challenging for open-weight models: the best open-weight model achieves only 58.6% accuracy, and a 34.9-point gap between open-weight and proprietary models highlights substantial room for improvement in complex chart and table understanding. We release our dataset and code.
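The headline comparison in the abstract is a difference between two accuracies, expressed in percentage points. The sketch below assumes the 34.9-point gap is measured between the best open-weight and the best proprietary model, which the abstract does not spell out. The helper name `gap_points` and the model labels are hypothetical.

```python
def gap_points(results):
    """Gap in percentage points between the best proprietary and best open-weight model.

    `results` maps a model label to (accuracy_percent, is_open_weight).
    """
    best_open = max(acc for acc, is_open in results.values() if is_open)
    best_prop = max(acc for acc, is_open in results.values() if not is_open)
    return round(best_prop - best_open, 1)

# 58.6 is the reported best open-weight accuracy. The proprietary value is
# NOT quoted in the abstract; 93.5 is simply 58.6 + 34.9 under the
# assumption stated in the lead-in.
results = {
    "best-open-weight": (58.6, True),
    "best-proprietary (implied)": (93.5, False),
}
print(gap_points(results))  # 34.9
```

Reporting the gap in points rather than as a ratio matches the abstract's wording and avoids confusing a 34.9-point difference with a 34.9 percent relative change.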

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

HakushoBench gives a practical new Japanese chart/table VQA set from real government papers, but missing annotation details make the difficulty claims hard to trust.

The main takeaway is that this paper delivers a new benchmark, HakushoBench, built from 33 Japanese governmental white papers with 2,053 images and manually annotated QA pairs. It targets a real gap in non-English chart and table VQA, where most existing work stays English-only.

The construction approach works well. Sourcing from public white papers supplies naturally occurring charts and tables across more than 10 types without relying on synthetic generation. That scale and domain variety are a step up from smaller or narrower prior sets. Running a range of VLMs and reporting the 58.6% ceiling for open-weight models plus the 34.9-point gap to proprietary ones gives a clear empirical signal that current open models still struggle here. Releasing the data and code is also straightforward and useful.

The soft spot sits in the annotation process. The abstract states the questions are meant to test deep and holistic understanding rather than local cues, yet it supplies no protocol, number of annotators, inter-annotator agreement, or human baseline. Without those, the reported accuracies cannot be confidently read as evidence of benchmark hardness. The same holds for any error analysis that might show whether models fail on reasoning or on question artifacts. The soundness score stays modest because the central claim depends on unshown verification steps.

This work is aimed at groups building or evaluating multilingual document VLMs. Anyone needing a Japanese test set for charts and tables will find the resource itself worth examining. It deserves peer review because the dataset fills a documented need and the performance numbers are worth checking once the annotation details are filled in.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces HakushoBench, a Japanese visual question answering benchmark for charts and tables extracted from 33 governmental white papers. The benchmark comprises 2,053 images across more than 10 image types, accompanied by manually annotated QA pairs intended to evaluate deep and holistic understanding of charts and tables. Through experiments on a range of vision-language models, it reports that the best-performing open-weight model achieves 58.6% accuracy, with a 34.9 percentage point gap to proprietary models, underscoring the difficulty of the task and the need for improved complex document understanding capabilities in non-English settings.

Significance. If the QA annotations prove reliable, this benchmark would fill an important gap in non-English resources for chart and table VQA, leveraging a scalable source of real-world governmental documents. The reported performance gap between open-weight and proprietary models provides a concrete measure of current limitations in multilingual VLM capabilities for document understanding, which could guide future model development and evaluation practices.

major comments (1)

[Abstract (and Dataset Construction section)] The central claim that HakushoBench assesses 'deep and holistic understanding' rather than local visual cues, and that the reported accuracies (best open-weight 58.6%, 34.9-point gap) demonstrate its challenge, depends on the quality and reliability of the manually annotated QA pairs. However, the abstract provides no details on the annotation methodology, number of annotators, inter-annotator agreement, or validation procedures such as a human performance baseline. This information is load-bearing for interpreting the results as evidence of model limitations in complex understanding rather than annotation artifacts.

minor comments (1)

A table summarizing VLM results (model names, open-weight vs. proprietary status, accuracies) would improve readability of the experimental findings.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential value of HakushoBench in addressing the gap in non-English chart and table VQA resources. We agree that the reliability of the QA annotations is central to interpreting the benchmark's difficulty and the reported performance gaps, and we will revise the manuscript to provide the requested details.

Referee: [Abstract (and Dataset Construction section)] The central claim that HakushoBench assesses 'deep and holistic understanding' rather than local visual cues, and that the reported accuracies (best open-weight 58.6%, 34.9-point gap) demonstrate its challenge, depends on the quality and reliability of the manually annotated QA pairs. However, the abstract provides no details on the annotation methodology, number of annotators, inter-annotator agreement, or validation procedures such as a human performance baseline. This information is load-bearing for interpreting the results as evidence of model limitations in complex understanding rather than annotation artifacts.

Authors: We agree that the abstract and Dataset Construction section currently lack sufficient detail on annotation procedures, which is necessary to substantiate claims about deep understanding. In the revised manuscript we will expand both sections to describe the annotation methodology (including how questions were designed to target holistic chart/table comprehension rather than local cues), the number of annotators, the process for resolving disagreements, inter-annotator agreement statistics, and a human performance baseline computed on a held-out subset. These additions will allow readers to assess annotation quality directly and strengthen the interpretation of the 58.6% open-weight accuracy and the 34.9-point gap to proprietary models as evidence of model limitations. revision: yes
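Inter-annotator agreement, one of the statistics promised in the response above, is often reported as Cohen's kappa when two annotators label the same items. The sketch below illustrates that statistic only. Which agreement measure the authors would actually report is not stated, and the labels in the example are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Made-up judgments on eight items (not paper data).
ann_a = ["ok", "ok", "bad", "ok", "bad", "ok", "bad", "ok"]
ann_b = ["ok", "ok", "bad", "bad", "bad", "ok", "ok", "ok"]
print(round(cohens_kappa(ann_a, ann_b), 3))  # 0.467
```

In this toy case the annotators agree on 6 of 8 items (0.75), but chance agreement is already 34/64, so kappa is 7/15, roughly 0.467. That correction is why raw agreement alone can overstate annotation reliability.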

Circularity Check

0 steps flagged

No circularity: empirical benchmark reporting with no derivations or fitted predictions

full rationale

The paper constructs HakushoBench from governmental white papers and reports VLM accuracies on it. No equations, parameter fitting, or derivation chain exists that could reduce a claimed prediction to its inputs by construction. The central claims rest on dataset collection and external model evaluation rather than any self-referential mathematical step. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz. This is a standard empirical benchmark paper with no circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces a new dataset without introducing free parameters, new mathematical axioms, or invented entities; it relies on standard assumptions about public document accessibility and manual annotation quality.

axioms (1)

Domain assumption: Governmental white papers contain naturally occurring charts and tables across diverse formats and domains and are freely accessible.
Invoked in the abstract as the scalable source for benchmark construction.

A Licenses for Our Resources

HakushoBench and its evaluation code are released under the Apache 2.0 License. Note that we distribute only image URLs rather than the raw image data.

B Use of AI Assistants

We used AI assistants ...
