ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Pith reviewed 2026-05-15 21:10 UTC · model grok-4.3
The pith
ChartQA introduces a benchmark of 32.7K questions requiring visual and logical reasoning over charts, plus transformer models that fuse image features with data tables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present ChartQA, a benchmark consisting of 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to test visual and logical reasoning over charts. To tackle this, we introduce two transformer-based models that unify visual features extracted from the chart image with the data table representation to generate answers. Our models achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA, although challenges remain for complex reasoning questions.
What carries the argument
Two transformer-based models that combine visual features from the chart image and the data table of the chart in a unified way to answer questions.
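As a rough sketch of this unification idea (an illustrative assumption, not a reproduction of the paper's models; every module name and dimension below is invented for the example), projected image patch features and embedded question and table tokens can be concatenated into a single transformer encoder sequence:

```python
# Minimal sketch of image + data-table fusion for chart QA.
# Illustrative only: dimensions, the feature extractor, and the
# answer head are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ChartQAFusion(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, vis_dim=2048):
        super().__init__()
        # Project pre-extracted chart-image patch features (e.g., from
        # a CNN backbone) into the shared transformer dimension.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        # One embedding table for question tokens and flattened table cells.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Toy answer head: one vocabulary distribution from the pooled state.
        self.answer_head = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, question_ids, table_ids):
        # vis_feats: (B, n_patches, vis_dim); *_ids: (B, seq_len)
        vis = self.vis_proj(vis_feats)
        txt = self.tok_embed(torch.cat([question_ids, table_ids], dim=1))
        fused = self.encoder(torch.cat([vis, txt], dim=1))  # one sequence
        return self.answer_head(fused.mean(dim=1))

model = ChartQAFusion()
logits = model(torch.randn(2, 49, 2048),
               torch.randint(0, 30000, (2, 16)),
               torch.randint(0, 30000, (2, 64)))
print(logits.shape)  # torch.Size([2, 30000])
```

The load-bearing step is the single concatenated sequence in forward: both modalities attend to each other in one encoder rather than passing through separate pipelines.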
If this is right
- Transformer models that jointly encode chart images and data tables outperform prior approaches on both existing and new chart QA tasks.
- The benchmark exposes persistent gaps in handling multi-step arithmetic and visual reference questions even at state-of-the-art performance.
- A public resource now exists for training and evaluating systems that must interpret data visualizations beyond template matching.
Where Pith is reading between the lines
- Systems trained on ChartQA may transfer to business-intelligence tools that let users ask natural questions about dashboard charts.
- The image-plus-table fusion strategy could extend to other domains such as interpreting scientific figures or infographics that pair visuals with numeric data.
- Future work could measure how much performance drops when questions require reasoning patterns absent from the summary-generated portion of the data.
Load-bearing premise
That the collected human-written questions and summary-generated questions sufficiently capture the full range of complex visual and logical reasoning people perform on real-world charts.
What would settle it
A controlled test set of real-world charts with novel visual-logical question combinations, where current models score near random while human experts score above 80 percent, would show the benchmark does not capture the full range of reasoning.
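A toy rendering of that settling condition, with the thresholds taken from the text above (the margin around the random baseline is an added assumption):

```python
# Hedged sketch of the falsification test described above: the coverage
# claim would be undermined if, on novel visual-logical questions, models
# sit near a random baseline while human experts clear 80 percent.
def coverage_gap(model_acc: float, human_acc: float,
                 random_baseline: float, margin: float = 0.05) -> bool:
    near_random = abs(model_acc - random_baseline) <= margin
    humans_competent = human_acc > 0.80
    return near_random and humans_competent  # True => gap exposed

print(coverage_gap(model_acc=0.27, human_acc=0.86, random_baseline=0.25))
# True: models near chance, humans well above 80% -> coverage gap shown
```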
original abstract
Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a fixed-vocabulary. In this work, we present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. To address the unique challenges in our benchmark involving visual and logical reasoning over charts, we present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions. While our models achieve the state-of-the-art results on the previous datasets as well as on our benchmark, the evaluation also reveals several challenges in answering complex reasoning questions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChartQA, a benchmark for chart question answering focused on complex visual and logical reasoning. It consists of 9.6K human-written questions plus 23.1K questions generated from human-written chart summaries. The authors propose two transformer-based models that fuse visual features extracted from the chart image with the underlying data table in a unified architecture. These models are shown to achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA itself, while the evaluation highlights remaining difficulties with multi-step arithmetic and visual-reference questions.
Significance. If the results hold, ChartQA fills a clear gap by moving beyond template-based, fixed-vocabulary questions to realistic reasoning scenarios that combine arithmetic, logical operations, and visual references. The unified visual-plus-table models demonstrate a practical fusion strategy that could transfer to other multimodal data-analysis tasks. The dual construction (human-written and summary-generated) also enables controlled study of different reasoning distributions, making the benchmark a useful standard for future work in chart understanding and visual QA.
major comments (2)
- [Abstract] In the abstract and the Evaluation section, the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset the benchmark is intended to stress.
- [Experiments] No ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
minor comments (3)
- [Related Work] Prior chart QA datasets should be summarized in a comparison table that includes question type, vocabulary size, and reasoning depth so readers can immediately see how ChartQA differs.
- Figure captions: several figures showing model architecture or example questions would benefit from explicit labels indicating which components correspond to visual encoder, table encoder, and fusion layers.
- Notation: the description of the unified input representation occasionally switches between “visual features” and “image embeddings”; consistent terminology would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive recommendation to accept and for the constructive feedback. We address each major comment point by point below and will make targeted revisions to improve clarity.
point-by-point responses
- Referee: [Abstract] In the abstract and the Evaluation section, the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset the benchmark is intended to stress.
  Authors: We agree that the abstract would benefit from more specific numbers. The Experiments section (Tables 3–5) already reports full SOTA comparisons with absolute deltas over prior models and per-category breakdowns (human-written vs. machine-generated questions, as well as by reasoning complexity). We will revise the abstract to include key deltas (e.g., “outperforming prior SOTA by 4–8 points on complex reasoning questions”) so readers can immediately assess gains on the challenging subset. revision: yes
- Referee: [Experiments] No ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
  Authors: We acknowledge the value of explicit ablations. The manuscript does compare the unified model against strong baselines that use either visual features alone or table encoding alone (Section 4.2 and Table 2), showing consistent gains from fusion. Section 5.3 already provides error analysis with qualitative examples of failures on multi-step arithmetic and visual-reference questions. We will expand the discussion of these results in the revision to better attribute performance to the fusion strategy, but a full factorial ablation study was not conducted in the original experiments. revision: partial
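A minimal sketch of the per-category scoring at issue in this exchange, assuming ChartQA-style relaxed accuracy (exact string match, with a 5% tolerance on numeric answers) and illustrative field names:

```python
# Hedged sketch of the per-category breakdown the referee requests.
# The 5% numeric tolerance mirrors ChartQA's relaxed-accuracy metric;
# the dict keys ('category', 'pred', 'gold') are assumptions.
def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= tol * abs(g) if g != 0 else p == g
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()

def per_category_accuracy(examples):
    # examples: iterable of dicts with 'category', 'pred', 'gold' keys
    totals, hits = {}, {}
    for ex in examples:
        c = ex["category"]
        totals[c] = totals.get(c, 0) + 1
        hits[c] = hits.get(c, 0) + relaxed_match(ex["pred"], ex["gold"])
    return {c: hits[c] / totals[c] for c in totals}

demo = [
    {"category": "human", "pred": "41.9", "gold": "42"},     # within 5%
    {"category": "human", "pred": "30", "gold": "42"},       # miss
    {"category": "augmented", "pred": "Peru", "gold": "peru"},
]
print(per_category_accuracy(demo))  # {'human': 0.5, 'augmented': 1.0}
```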
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper introduces a new benchmark (ChartQA) consisting of human-written questions and summary-generated questions, along with two transformer-based models that combine visual features and chart data tables. The central claims of SOTA performance on prior datasets and the new benchmark follow directly from the described model architectures, training procedures, and standard evaluation protocols. No load-bearing steps reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the results are externally falsifiable via the released benchmark and independent prior datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer models can be adapted to jointly process visual chart features and tabular data for question answering.
Forward citations
Cited by 22 Pith papers
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- Can Multimodal Large Language Models Truly Understand Small Objects?
  Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
- Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
  Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
  PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- Reinforcing Multimodal Reasoning Against Visual Degradation
  ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
  Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
- Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
  Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.
- Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
  PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
- EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
  EviSearch automates ontology-aligned clinical evidence table creation from native PDFs with comprehensive provenance logging for auditability and iterative improvement.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- DeepSeek-OCR: Contexts Optical Compression
  DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
- Qwen3-Omni Technical Report
  Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal counterparts.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
  DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.