INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents

Anathapindika Dravichi; Gaurav Harit; Somraj Gautam

arXiv: 2604.11970 · v1 · submitted 2026-04-13 · cs.CV · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3

keywords: INDOTABVQA · cross-lingual table VQA · Bahasa Indonesia documents · vision-language models · table understanding · fine-tuning · spatial priors · multilingual document AI

The pith

A new benchmark for Indonesian table visual questions allows fine-tuning to improve VLM accuracy by up to 17.8%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces INDOTABVQA as a benchmark consisting of 1,593 real-world Bahasa Indonesia document images with tables and question-answer pairs in four languages. It evaluates vision-language models on both monolingual and cross-lingual table VQA tasks, revealing performance gaps on complex tables and low-resource languages. Fine-tuning smaller models on the dataset produces measurable accuracy gains, and supplying table region coordinates adds further improvement. A reader would care because current document AI systems largely overlook non-English languages, and this work offers a direct way to adapt models for practical multilingual document tasks.

Core claim

The paper establishes that the INDOTABVQA dataset, comprising 1,593 document images across bordered, borderless, and colorful table styles with 1,593 multilingual question-answer sets in Bahasa Indonesia, English, Hindi, and Arabic, exposes substantial gaps in leading VLMs for table reasoning. Fine-tuning a compact 3B model and LoRA-tuning a 7B model on the dataset yields accuracy improvements of 11.6% and 17.8%, respectively, while adding explicit table region coordinates as input boosts performance by a further 4-7% via spatial priors.

What carries the argument

The INDOTABVQA benchmark dataset that pairs real-world Indonesian document images containing one or more tables with question-answer sets in four languages and optional table region coordinates to enable spatial priors for VLM reasoning.
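To make the idea of a spatial prior concrete, here is a minimal Python sketch of how table region coordinates might be serialized into the text prompt that accompanies a document image. The prompt wording, the `TableQuery` type, the `build_prompt` helper, and the pixel-coordinate box format are all assumptions for illustration; this page does not show the prompt format the authors actually used.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[int, int, int, int]


@dataclass
class TableQuery:
    """One question about a document image, with optional table regions."""
    question: str
    language: str
    table_boxes: List[Box]


def build_prompt(query: TableQuery, use_spatial_prior: bool = True) -> str:
    """Build the text half of a VLM prompt (the image is passed separately).

    With use_spatial_prior=True, each table region is listed before the
    question so the model is told where the tables are. This layout is an
    illustrative assumption, not the paper's format.
    """
    lines = []
    if use_spatial_prior:
        for i, (x0, y0, x1, y1) in enumerate(query.table_boxes, start=1):
            lines.append(f"Table {i} region: [{x0}, {y0}, {x1}, {y1}]")
    lines.append(f"Question ({query.language}): {query.question}")
    lines.append("Answer using only the table content in the image.")
    return "\n".join(lines)


if __name__ == "__main__":
    q = TableQuery("What is the total?", "English", [(10, 20, 300, 400)])
    print(build_prompt(q))
    print("---")
    print(build_prompt(q, use_spatial_prior=False))
```

Running the same question with and without the region lines gives the two conditions whose accuracy difference the 4-7% figure describes.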

Load-bearing premise

The accuracy gains come primarily from the new dataset and fine-tuning process rather than from unstated factors in evaluation protocol, model selection, or dataset construction, and the 1,593 images adequately represent real-world variability in Indonesian documents.

What would settle it

Retraining the same models on a different collection of Indonesian table documents of similar size and style produces no accuracy gain or less gain than reported, or performance on a new set of held-out Indonesian documents shows no improvement after fine-tuning.

Figures

Figures reproduced from arXiv: 2604.11970 by Anathapindika Dravichi, Gaurav Harit, Somraj Gautam.

**Figure 1.** INDOTABVQA presents document images in Bahasa Indonesia, and semantically aligned QA pairs in four languages, enabling cross-lingual evaluation of VLMs.

**Figure 2.** Architecture comparison with left-to-right pipeline flow across three evaluation settings. Each row …

**Figure 3.** Global language coverage map for the INDOTABVQA benchmark. The shading intensity indicates the number of supported languages (1–3) spoken in each country. For example, Canada supports both English and Hindi. This visualization highlights the geographical and cultural reach of our cross-lingual benchmark.

**Figure 4.** Comparative distribution of prediction error types across the four languages in the …

**Figure 5.** Example of the INDOTABVQA correct predictions on mono-lingual and cross-lingual question answering across three table formats: bordered (left), borderless (middle), and colorful (right). The examples include questions in Bahasa Indonesia, English, Hindi, and Arabic.

Original abstract

We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful), each containing one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B model and a LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset can be accessed on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

New Indonesian table VQA dataset with cross-lingual questions is a practical addition, but the reported fine-tuning gains need explicit baselines and ablations to be convincing.

The paper introduces INDOTABVQA, a dataset of 1,593 real Indonesian document images with tables in bordered, borderless, and colorful styles, plus matching questions in Bahasa Indonesia, English, Hindi, and Arabic. It benchmarks several open VLMs and GPT-4o, then shows that fine-tuning a 3B model and a LoRA 7B model on the data lifts accuracy by 11.6% and 17.8%, with another 4-7% from feeding explicit table coordinates. The data is released publicly on Hugging Face, which is the clearest positive step here.

This setup targets an actual gap in structure-aware, cross-lingual document understanding for a low-resource language, and the three visual styles plus multi-table cases give it some coverage that prior table VQA sets lack. The authors also flag where current models fall short on complex tables and non-English questions, which is useful context for anyone working on multilingual VLMs.

The main weakness is that the accuracy deltas are presented without the pre-fine-tuning baseline numbers on the identical test split, prompt format, or decoding settings. Without those controls, or any mention of multiple runs and statistical checks, it is hard to rule out that the gains come from changes in evaluation protocol rather than the new data itself. The 1,593-sample size also leaves open the chance that models are picking up surface patterns instead of learning general table reasoning. The spatial-prior result would carry more weight if the coordinates were produced by an automatic detector instead of oracle input.

This work is aimed at groups building or testing document VQA systems for Southeast Asian or other low-resource languages. A reader who needs a starting benchmark for cross-lingual table tasks would get immediate value from the released data and the reported gaps. It is coherent enough on its own terms to deserve a serious referee, mainly to push for clearer experimental details and perhaps some external validation on held-out Indonesian documents.

Referee Report

3 major / 1 minor

Summary. The paper introduces INDOTABVQA, a benchmark of 1,593 real-world Bahasa Indonesia document images containing bordered, borderless, or colorful tables, paired with 1,593 QA sets in four languages (Bahasa Indonesia, English, Hindi, Arabic). It benchmarks open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o, reports performance gaps especially on complex tables and low-resource languages, and claims that fine-tuning a 3B model and a LoRA-tuned 7B model on the dataset yields 11.6% and 17.8% accuracy gains; providing explicit table-region coordinates adds a further 4-7% improvement. The work positions the dataset as a resource for cross-lingual, structure-aware document understanding.

Significance. If the accuracy gains can be shown to arise specifically from the new Indonesian table data rather than evaluation-protocol changes, this benchmark would supply a needed resource for low-resource-language document VQA and would usefully demonstrate the value of spatial priors and targeted fine-tuning for table reasoning in VLMs.

major comments (3)

[Abstract] The reported 11.6% and 17.8% accuracy improvements after fine-tuning the 3B and LoRA-7B models are presented without the corresponding pre-fine-tuning baseline accuracies on the identical test split, prompt format, and decoding settings. Without these numbers it is impossible to attribute the deltas to the INDOTABVQA data rather than to differences in evaluation protocol or answer extraction.
[Abstract] No details are supplied on the train/test split of the 1,593 samples, the number of evaluation runs, the precise metric (exact match, token F1, etc.), or any statistical significance test. Given the modest dataset size, these omissions leave open the possibility that the gains reflect memorization or post-hoc choices rather than improved cross-lingual table reasoning.
[Abstract] The additional 4-7% gain from supplying explicit table-region coordinates is only informative if the coordinates are shown to be automatically obtainable; the manuscript must clarify whether these are oracle ground-truth boxes or outputs of an automatic detector and must report the performance drop when realistic detection noise is introduced.
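
The controls requested in the first two comments can be sketched briefly in Python. This is a hedged illustration, not the paper's evaluation code: it assumes exact match after case and whitespace normalization as the metric (the abstract does not name one), and it uses a paired bootstrap over test examples as one possible significance check. The function names `normalize`, `exact_match`, and `paired_bootstrap_delta` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def exact_match(preds: Sequence[str], golds: Sequence[str]) -> List[int]:
    """Per-example 0/1 correctness under normalized exact match."""
    if len(preds) != len(golds):
        raise ValueError("predictions and references must align")
    return [int(normalize(p) == normalize(g)) for p, g in zip(preds, golds)]


def paired_bootstrap_delta(
    base: Sequence[int],
    tuned: Sequence[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy delta (tuned - base) with a 95% percentile interval.

    Both inputs are 0/1 correctness lists over the SAME test examples,
    produced with the same prompt and decoding settings. Resampling
    example indices keeps the pairing intact.
    """
    n = len(base)
    if n == 0 or n != len(tuned):
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)
    observed = (sum(tuned) - sum(base)) / n
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(tuned[i] - base[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int(0.025 * n_resamples)]
    upper = deltas[int(0.975 * n_resamples) - 1]
    return observed, lower, upper


if __name__ == "__main__":
    # Toy data only; these are not results from the paper.
    golds = ["rp 10.000", "jakarta", "2021", "15"]
    base_preds = ["Rp 10.000", "bandung", "2020", "15"]
    tuned_preds = ["Rp 10.000", "Jakarta", "2020", "15"]
    base = exact_match(base_preds, golds)
    tuned = exact_match(tuned_preds, golds)
    print(paired_bootstrap_delta(base, tuned, n_resamples=1000))
```

An interval that excludes zero on the held-out split, reported alongside the baseline accuracy, would address the attribution concern more directly than a bare delta.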

minor comments (1)

[Abstract] The abstract states that 'substantial performance gaps' exist but does not quantify them with concrete accuracy or F1 numbers for each model and language pair; adding a compact results table would improve readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments below and have made revisions to the manuscript to improve clarity and completeness.

Referee: [Abstract] The reported 11.6% and 17.8% accuracy improvements after fine-tuning the 3B and LoRA-7B models are presented without the corresponding pre-fine-tuning baseline accuracies on the identical test split, prompt format, and decoding settings. Without these numbers it is impossible to attribute the deltas to the INDOTABVQA data rather than to differences in evaluation protocol or answer extraction.

Authors: We agree that the absolute baseline accuracies are essential for proper interpretation. The full manuscript reports these baselines in the experimental results section using the same test split, prompts, and decoding settings. We will revise the abstract to explicitly include the pre-fine-tuning accuracy figures for both models, allowing readers to directly verify the reported gains.

revision: yes

Referee: [Abstract] No details are supplied on the train/test split of the 1,593 samples, the number of evaluation runs, the precise metric (exact match, token F1, etc.), or any statistical significance test. Given the modest dataset size, these omissions leave open the possibility that the gains reflect memorization or post-hoc choices rather than improved cross-lingual table reasoning.

Authors: We concur that these details are important for assessing the reliability of the results. We will add the necessary information to the abstract and expand the evaluation protocol description in the paper, including the train/test split details, the number of runs, the exact metric employed, and any statistical tests performed. This will help demonstrate that the improvements are robust and not due to memorization.

revision: yes

Referee: [Abstract] The additional 4-7% gain from supplying explicit table-region coordinates is only informative if the coordinates are shown to be automatically obtainable; the manuscript must clarify whether these are oracle ground-truth boxes or outputs of an automatic detector and must report the performance drop when realistic detection noise is introduced.

Authors: The coordinates used are ground-truth table region annotations from our dataset. We will clarify this explicitly in the revised abstract and manuscript. Regarding automatic detection, while we did not include experiments with noisy detections in the original submission, we recognize its importance for practical applicability. We will add a note on this limitation and, if space permits, preliminary results using an automatic detector to show the performance under realistic conditions.

revision: partial

standing simulated objections not resolved

The requirement to report the performance drop when using realistic automatic table detection noise, as this would require conducting new experiments not included in the current manuscript.
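
One inexpensive way to approximate the missing experiment, short of running a real detector, is to perturb the ground-truth boxes and re-evaluate at several noise levels. The Python sketch below is an assumption-laden illustration: the Gaussian noise model, the `rel_sigma` parameter, and the helper names `iou` and `jitter_box` are hypothetical and do not come from the paper. Real detector errors (missed tables, merged regions) would not be captured by coordinate jitter alone.

```python
import random
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in image pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def jitter_box(box: Box, rel_sigma: float, width: float, height: float,
               rng: random.Random) -> Box:
    """Shift each edge by Gaussian noise scaled to the box size.

    rel_sigma is the noise standard deviation as a fraction of the box
    width (for x edges) or height (for y edges). The result is clamped
    to the image and re-ordered so it stays a valid box.
    """
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0

    def clamp(v: float, hi: float) -> float:
        return min(max(v, 0.0), hi)

    nx0 = clamp(x0 + rng.gauss(0.0, rel_sigma * w), width)
    nx1 = clamp(x1 + rng.gauss(0.0, rel_sigma * w), width)
    ny0 = clamp(y0 + rng.gauss(0.0, rel_sigma * h), height)
    ny1 = clamp(y1 + rng.gauss(0.0, rel_sigma * h), height)
    nx0, nx1 = sorted((nx0, nx1))
    ny0, ny1 = sorted((ny0, ny1))
    return (nx0, ny0, nx1, ny1)


if __name__ == "__main__":
    rng = random.Random(0)
    gt = (100.0, 200.0, 500.0, 600.0)
    for sigma in (0.0, 0.05, 0.1, 0.2):
        noisy = jitter_box(gt, sigma, width=800.0, height=1000.0, rng=rng)
        print(sigma, round(iou(gt, noisy), 3))
```

Plotting accuracy against mean IoU of the perturbed boxes would show how quickly the spatial-prior gain decays as box quality falls.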

Circularity Check

0 steps flagged

No circularity: empirical dataset benchmark with no derivations or self-referential claims

The paper introduces the INDOTABVQA dataset (1,593 images and QA pairs) and reports empirical VLM benchmarks plus fine-tuning gains (11.6% for 3B model, 17.8% for LoRA 7B) and coordinate-input improvements (4-7%). No equations, fitted parameters, predictions, or derivations exist. Claims rest on direct experimental results rather than any chain that reduces to inputs by construction. No self-citations are load-bearing for a mathematical result, and no ansatz, uniqueness theorem, or renaming of known results occurs. This is a standard data-release and evaluation paper whose central content is independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the creation of a new empirical benchmark and standard fine-tuning of existing VLMs; no free parameters, axioms, or invented entities are introduced beyond the dataset itself.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

[1] Gemma 3 Technical Report

Pubtables-1m: Towards comprehensive table extraction from unstructured documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4634–4642. Jingqun Tang, Qi Liu, Yongjie Ye, Jinghui Lu, Shu Wei, An-Lan Wang, Chunhui Lin, Hao Feng, Zhen Zhao, Yanjie Wang, and 1 others. 2025. Mtvqa: Benchmarking multilingual...

[2] TAT-QA: A question answering benchmark on a hybrid of tabular and textual content in finance. arXiv preprint arXiv:2105.07624, 2021

Publaynet: largest dataset ever for document layout analysis. In 2019 International conference on document analysis and recognition (ICDAR), pages 1015–1022. IEEE. Fengbin Zhu, Wenqiang Lei, Youcheng Huang, Chao Wang, Shuo Zhang, Jiancheng Lv, Fuli Feng, and Tat-Seng Chua. 2021. Tat-qa: A question answering benchmark on a hybrid of tabular and textual cont...
