INDOTABVQA: A Benchmark for Cross-Lingual Table Understanding in Bahasa Indonesia Documents
Pith reviewed 2026-05-10 16:13 UTC · model grok-4.3
The pith
A new benchmark for Indonesian table visual questions allows fine-tuning to improve VLM accuracy by up to 17.8%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the INDOTABVQA dataset, comprising 1,593 document images across bordered, borderless, and colorful table styles with 1,593 multilingual question-answer sets in Bahasa Indonesia, English, Hindi, and Arabic, exposes substantial gaps in leading VLMs for table reasoning. Fine-tuning a compact 3B model and LoRA-tuning a 7B model on the dataset yields accuracy improvements of 11.6% and 17.8%, respectively, while adding explicit table region coordinates as input boosts performance by a further 4-7%, which the paper attributes to spatial priors.
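The summary above does not say whether 11.6% and 17.8% are absolute percentage-point gains or relative gains, and the two readings diverge once a baseline is fixed. The sketch below illustrates the difference; the 50% baseline and both helper names are hypothetical and do not come from the paper.

```python
def after_absolute_gain(baseline: float, gain_points: float) -> float:
    """Accuracy (in percent) after adding `gain_points` percentage points."""
    return baseline + gain_points


def after_relative_gain(baseline: float, gain_percent: float) -> float:
    """Accuracy (in percent) after a relative increase of `gain_percent` percent."""
    return baseline * (1.0 + gain_percent / 100.0)


# HYPOTHETICAL baseline accuracy in percent; the summary reports no baseline.
baseline = 50.0
for gain in (11.6, 17.8):  # the two gains quoted in the core claim
    absolute = after_absolute_gain(baseline, gain)
    relative = after_relative_gain(baseline, gain)
    print(f"gain {gain}: absolute -> {absolute:.1f}%, relative -> {relative:.1f}%")
```

Under this assumed baseline the two readings differ by several points, which is why the pre-fine-tuning accuracies matter for interpreting the claim.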
What carries the argument
The argument rests on the INDOTABVQA benchmark, which pairs real-world Indonesian document images containing one or more tables with question-answer sets in four languages and optional table region coordinates that supply spatial priors for VLM reasoning.
Load-bearing premise
The accuracy gains come primarily from the new dataset and fine-tuning process rather than from unstated factors in evaluation protocol, model selection, or dataset construction, and the 1,593 images adequately represent real-world variability in Indonesian documents.
What would settle it
Retraining the same models on a different collection of Indonesian table documents of similar size and style produces no accuracy gain or less gain than reported, or performance on a new set of held-out Indonesian documents shows no improvement after fine-tuning.
Original abstract
We introduce INDOTABVQA, a benchmark for evaluating cross-lingual Table Visual Question Answering (VQA) on real-world document images in Bahasa Indonesia. The dataset comprises 1,593 document images across three visual styles (bordered, borderless, and colorful) with one or more tables, and 1,593 question-answer sets in four languages: Bahasa Indonesia, English, Hindi, and Arabic. This enables evaluation of Vision-Language Models (VLMs) in both monolingual (Bahasa documents with Bahasa questions) and cross-lingual settings (Bahasa documents with questions in other languages). We benchmark leading open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o and reveal substantial performance gaps, particularly on structurally complex tables and in low-resource languages. Fine-tuning a compact 3B and LoRA-finetuned 7B model on our dataset yields 11.6% and 17.8% improvements in accuracy. Providing explicit table region coordinates as additional input further improves performance by 4-7%, demonstrating the value of spatial priors for table-based reasoning. Our findings underscore the importance of language-diverse, domain-specific datasets and demonstrate that targeted fine-tuning can significantly enhance VLM performance on specialized document understanding tasks. INDOTABVQA provides a valuable resource for advancing research in cross-lingual, structure-aware document understanding, especially in underrepresented regions of the world. The full dataset is available on Hugging Face at: https://huggingface.co/datasets/NusaBharat/INDOTABVQA
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces INDOTABVQA, a benchmark of 1,593 real-world Bahasa Indonesia document images containing bordered, borderless, or colorful tables, paired with 1,593 QA sets in four languages (Bahasa Indonesia, English, Hindi, Arabic). It benchmarks open-source VLMs (Qwen2.5-VL, Gemma-3, LLaMA-3.2) and GPT-4o, reports performance gaps especially on complex tables and low-resource languages, and claims that fine-tuning a 3B model and a LoRA-tuned 7B model on the dataset yields 11.6% and 17.8% accuracy gains; providing explicit table-region coordinates adds a further 4-7% improvement. The work positions the dataset as a resource for cross-lingual, structure-aware document understanding.
Significance. If the accuracy gains can be shown to arise specifically from the new Indonesian table data rather than evaluation-protocol changes, this benchmark would supply a needed resource for low-resource-language document VQA and would usefully demonstrate the value of spatial priors and targeted fine-tuning for table reasoning in VLMs.
major comments (3)
- [Abstract] Abstract: the reported 11.6% and 17.8% accuracy improvements after fine-tuning the 3B and LoRA-7B models are presented without the corresponding pre-fine-tuning baseline accuracies on the identical test split, prompt format, and decoding settings. Without these numbers it is impossible to attribute the deltas to the INDOTABVQA data rather than to differences in evaluation protocol or answer extraction.
- [Abstract] Abstract: no details are supplied on the train/test split of the 1,593 samples, the number of evaluation runs, the precise metric (exact match, token F1, etc.), or any statistical significance test. Given the modest dataset size, these omissions leave open the possibility that the gains reflect memorization or post-hoc choices rather than improved cross-lingual table reasoning.
- [Abstract] Abstract: the additional 4-7% gain from supplying explicit table-region coordinates is only informative if the coordinates are shown to be automatically obtainable; the manuscript must clarify whether these are oracle ground-truth boxes or outputs of an automatic detector and must report the performance drop when realistic detection noise is introduced.
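The metric and significance concerns in the second comment can be made concrete with a small sketch: exact match and token-level F1 as two candidate metrics, plus a paired bootstrap interval for the accuracy difference between a base and a fine-tuned model on the same test items. The whitespace normalization is an assumption (it is likely too crude for Hindi and Arabic answers), and nothing here is confirmed to be the paper's protocol.

```python
import random
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase and split on whitespace (an assumed, deliberately simple normalization)."""
    return text.lower().split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)


def paired_bootstrap_ci(base, tuned, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile interval for mean(tuned) - mean(base) over the same test items."""
    assert base and len(base) == len(tuned)
    rng = random.Random(seed)
    n = len(base)
    deltas = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping each base/tuned pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(tuned[i] - base[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

An interval that excludes zero on a held-out split would support the claim that the gain is not a sampling artifact, although it would not by itself rule out memorization if train and test documents overlap.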
minor comments (1)
- [Abstract] The abstract states that 'substantial performance gaps' exist but does not quantify them with concrete accuracy or F1 numbers for each model and language pair; adding a compact results table would improve readability.
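The compact results table requested above could be produced from per-example outcomes along the following lines. This is a minimal sketch: the record layout, the model labels, and the language codes are hypothetical placeholders, and the sample outcomes are not results from the paper.

```python
from collections import defaultdict


def accuracy_table(records):
    """Aggregate (model, language, correct) records into accuracy per (model, language)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for model, language, correct in records:
        totals[(model, language)] += 1
        hits[(model, language)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def render(table):
    """Format the accuracy dictionary as a plain-text grid, one row per model."""
    models = sorted({m for m, _ in table})
    languages = sorted({lang for _, lang in table})
    lines = ["model".ljust(12) + "".join(lang.rjust(8) for lang in languages)]
    for m in models:
        cells = "".join(
            (f"{100 * table[(m, lang)]:.1f}" if (m, lang) in table else "-").rjust(8)
            for lang in languages
        )
        lines.append(m.ljust(12) + cells)
    return "\n".join(lines)


# HYPOTHETICAL per-example outcomes, not results from the paper.
records = [
    ("model-a", "id", True), ("model-a", "id", False),
    ("model-a", "ar", False), ("model-b", "id", True),
]
print(render(accuracy_table(records)))
```

Missing model and language combinations are shown as "-" rather than as zero, so an unevaluated cell is not mistaken for a failed one.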
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable suggestions. We address each of the major comments below and have made revisions to the manuscript to improve clarity and completeness.
Point-by-point responses
-
Referee: [Abstract] Abstract: the reported 11.6% and 17.8% accuracy improvements after fine-tuning the 3B and LoRA-7B models are presented without the corresponding pre-fine-tuning baseline accuracies on the identical test split, prompt format, and decoding settings. Without these numbers it is impossible to attribute the deltas to the INDOTABVQA data rather than to differences in evaluation protocol or answer extraction.
Authors: We agree that the absolute baseline accuracies are essential for proper interpretation. The full manuscript reports these baselines in the experimental results section using the same test split, prompts, and decoding settings. We will revise the abstract to explicitly include the pre-fine-tuning accuracy figures for both models, allowing readers to directly verify the reported gains.
Revision: yes
-
Referee: [Abstract] Abstract: no details are supplied on the train/test split of the 1,593 samples, the number of evaluation runs, the precise metric (exact match, token F1, etc.), or any statistical significance test. Given the modest dataset size, these omissions leave open the possibility that the gains reflect memorization or post-hoc choices rather than improved cross-lingual table reasoning.
Authors: We concur that these details are important for assessing the reliability of the results. We will add the necessary information to the abstract and expand the evaluation protocol description in the paper, including the train/test split details, the number of runs, the exact metric employed, and any statistical tests performed. This will help demonstrate that the improvements are robust and not due to memorization.
Revision: yes
-
Referee: [Abstract] Abstract: the additional 4-7% gain from supplying explicit table-region coordinates is only informative if the coordinates are shown to be automatically obtainable; the manuscript must clarify whether these are oracle ground-truth boxes or outputs of an automatic detector and must report the performance drop when realistic detection noise is introduced.
Authors: The coordinates used are ground-truth table region annotations from our dataset. We will clarify this explicitly in the revised abstract and manuscript. Regarding automatic detection, while we did not include experiments with noisy detections in the original submission, we recognize its importance for practical applicability. We will add a note on this limitation and, if space permits, preliminary results using an automatic detector to show the performance under realistic conditions.
Revision: partial
- Not resolved: the requirement to report the performance drop under realistic automatic table-detection noise, as this would require conducting new experiments not included in the current manuscript.
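The detection-noise experiment discussed in the third response could be approximated, short of running a real detector, by perturbing the ground-truth boxes and tracking how far they drift. The sketch below is one assumed way to do that: uniform per-edge jitter is a stand-in for detector error, not a model of any specific detector, and the `(x0, y0, x1, y1)` pixel box format is an assumption about the annotations.

```python
import random


def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def jitter_box(box, rel_noise, rng, width, height):
    """Shift each edge by uniform noise scaled to the box size, then clip to the page."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    nx0 = x0 + rng.uniform(-rel_noise, rel_noise) * w
    ny0 = y0 + rng.uniform(-rel_noise, rel_noise) * h
    nx1 = x1 + rng.uniform(-rel_noise, rel_noise) * w
    ny1 = y1 + rng.uniform(-rel_noise, rel_noise) * h
    # Clip to the page and re-order in case large noise made the edges cross.
    nx0, nx1 = sorted((min(max(nx0, 0.0), width), min(max(nx1, 0.0), width)))
    ny0, ny1 = sorted((min(max(ny0, 0.0), height), min(max(ny1, 0.0), height)))
    return (nx0, ny0, nx1, ny1)


# HYPOTHETICAL ground-truth box on a hypothetical 1000 x 1400 pixel page.
rng = random.Random(0)
gold = (100.0, 200.0, 500.0, 600.0)
for rel_noise in (0.0, 0.05, 0.10):
    noisy = jitter_box(gold, rel_noise, rng, width=1000.0, height=1400.0)
    print(f"rel_noise={rel_noise:.2f}  IoU with gold={iou(gold, noisy):.3f}")
```

Feeding the jittered boxes to the model at several noise levels and plotting accuracy against mean IoU would indicate how much of the 4-7% gain survives imperfect localization.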
Circularity Check
No circularity: empirical dataset benchmark with no derivations or self-referential claims
Full rationale
The paper introduces the INDOTABVQA dataset (1,593 images and QA pairs) and reports empirical VLM benchmarks plus fine-tuning gains (11.6% for 3B model, 17.8% for LoRA 7B) and coordinate-input improvements (4-7%). No equations, fitted parameters, predictions, or derivations exist. Claims rest on direct experimental results rather than any chain that reduces to inputs by construction. No self-citations are load-bearing for a mathematical result, and no ansatz, uniqueness theorem, or renaming of known results occurs. This is a standard data-release and evaluation paper whose central content is independent of any circular reduction.