ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Pith reviewed 2026-05-15 21:10 UTC · model grok-4.3
The pith
ChartQA introduces a benchmark of 32.7K questions requiring visual and logical reasoning over charts, plus transformer models that fuse image features with data tables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present ChartQA, a benchmark consisting of 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to test visual and logical reasoning over charts. To tackle this, we introduce two transformer-based models that unify visual features extracted from the chart image with the data table representation to generate answers. Our models achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA, although challenges remain for complex reasoning questions.
What carries the argument
Two transformer-based models that combine visual features from the chart image and the data table of the chart in a unified way to answer questions.
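As a rough sketch of this unification idea (an illustrative assumption, not a reproduction of the paper's models; every module name and dimension below is invented for the example), projected image patch features and embedded question and table tokens can be concatenated into a single transformer encoder sequence:

```python
# Minimal sketch of image + data-table fusion for chart QA.
# Illustrative only: dimensions, the feature extractor, and the
# answer head are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ChartQAFusion(nn.Module):
    def __init__(self, vocab_size=30000, d_model=256, vis_dim=2048):
        super().__init__()
        # Project pre-extracted chart-image patch features (e.g., from
        # a CNN backbone) into the shared transformer dimension.
        self.vis_proj = nn.Linear(vis_dim, d_model)
        # One embedding table for question tokens and flattened table cells.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Toy answer head: one vocabulary distribution from the pooled state.
        self.answer_head = nn.Linear(d_model, vocab_size)

    def forward(self, vis_feats, question_ids, table_ids):
        # vis_feats: (B, n_patches, vis_dim); *_ids: (B, seq_len)
        vis = self.vis_proj(vis_feats)
        txt = self.tok_embed(torch.cat([question_ids, table_ids], dim=1))
        fused = self.encoder(torch.cat([vis, txt], dim=1))  # one sequence
        return self.answer_head(fused.mean(dim=1))

model = ChartQAFusion()
logits = model(torch.randn(2, 49, 2048),
               torch.randint(0, 30000, (2, 16)),
               torch.randint(0, 30000, (2, 64)))
print(logits.shape)  # torch.Size([2, 30000])
```

The load-bearing step is the single concatenated sequence in forward: both modalities attend to each other in one encoder rather than passing through separate pipelines.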
If this is right
- Transformer models that jointly encode chart images and data tables outperform prior approaches on both existing and new chart QA tasks.
- The benchmark exposes persistent gaps in handling multi-step arithmetic and visual reference questions even at state-of-the-art performance.
- A public resource now exists for training and evaluating systems that must interpret data visualizations beyond template matching.
Where Pith is reading between the lines
- Systems trained on ChartQA may transfer to business-intelligence tools that let users ask natural questions about dashboard charts.
- The image-plus-table fusion strategy could extend to other domains such as interpreting scientific figures or infographics that pair visuals with numeric data.
- Future work could measure how much performance drops when questions require reasoning patterns absent from the summary-generated portion of the data.
Load-bearing premise
That the collected human-written questions and summary-generated questions sufficiently capture the full range of complex visual and logical reasoning people perform on real-world charts.
What would settle it
A controlled test set of real-world charts with novel visual-logical question combinations, where current models score near random while human experts score above 80 percent, would show the benchmark does not capture the full range of reasoning.
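A toy rendering of that settling condition, with the thresholds taken from the text above (the margin around the random baseline is an added assumption):

```python
# Hedged sketch of the falsification test described above: the coverage
# claim would be undermined if, on novel visual-logical questions, models
# sit near a random baseline while human experts clear 80 percent.
def coverage_gap(model_acc: float, human_acc: float,
                 random_baseline: float, margin: float = 0.05) -> bool:
    near_random = abs(model_acc - random_baseline) <= margin
    humans_competent = human_acc > 0.80
    return near_random and humans_competent  # True => gap exposed

print(coverage_gap(model_acc=0.27, human_acc=0.86, random_baseline=0.25))
# True: models near chance, humans well above 80% -> coverage gap shown
```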
original abstract
Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a fixed-vocabulary. In this work, we present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. To address the unique challenges in our benchmark involving visual and logical reasoning over charts, we present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions. While our models achieve the state-of-the-art results on the previous datasets as well as on our benchmark, the evaluation also reveals several challenges in answering complex reasoning questions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChartQA, a benchmark for chart question answering focused on complex visual and logical reasoning. It consists of 9.6K human-written questions plus 23.1K questions generated from human-written chart summaries. The authors propose two transformer-based models that fuse visual features extracted from the chart image with the underlying data table in a unified architecture. These models are shown to achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA itself, while the evaluation highlights remaining difficulties with multi-step arithmetic and visual-reference questions.
Significance. If the results hold, ChartQA fills a clear gap by moving beyond template-based, fixed-vocabulary questions to realistic reasoning scenarios that combine arithmetic, logical operations, and visual references. The unified visual-plus-table models demonstrate a practical fusion strategy that could transfer to other multimodal data-analysis tasks. The dual construction (human-written and summary-generated) also enables controlled study of different reasoning distributions, making the benchmark a useful standard for future work in chart understanding and visual QA.
major comments (2)
- [Abstract] In the abstract and the Evaluation section, the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset the benchmark is intended to stress.
- [Experiments] No ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
minor comments (3)
- [Related Work] Prior chart QA datasets should be summarized in a comparison table that includes question type, vocabulary size, and reasoning depth so readers can immediately see how ChartQA differs.
- Figure captions: several figures showing model architecture or example questions would benefit from explicit labels indicating which components correspond to visual encoder, table encoder, and fusion layers.
- Notation: the description of the unified input representation occasionally switches between “visual features” and “image embeddings”; consistent terminology would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive recommendation to accept and for the constructive feedback. We address each major comment point by point below and will make targeted revisions to improve clarity.
point-by-point responses
- Referee: [Abstract] In the abstract and the Evaluation section, the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset the benchmark is intended to stress.
  Authors: We agree that the abstract would benefit from more specific numbers. The Experiments section (Tables 3–5) already reports full SOTA comparisons with absolute deltas over prior models and per-category breakdowns (human-written vs. machine-generated questions, as well as by reasoning complexity). We will revise the abstract to include key deltas (e.g., “outperforming prior SOTA by 4–8 points on complex reasoning questions”) so readers can immediately assess gains on the challenging subset. revision: yes
- Referee: [Experiments] No ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
  Authors: We acknowledge the value of explicit ablations. The manuscript does compare the unified model against strong baselines that use either visual features alone or table encoding alone (Section 4.2 and Table 2), showing consistent gains from fusion. Section 5.3 already provides error analysis with qualitative examples of failures on multi-step arithmetic and visual-reference questions. We will expand the discussion of these results in the revision to better attribute performance to the fusion strategy, but a full factorial ablation study was not conducted in the original experiments. revision: partial
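A minimal sketch of the per-category scoring at issue in this exchange, assuming ChartQA-style relaxed accuracy (exact string match, with a 5% tolerance on numeric answers) and illustrative field names:

```python
# Hedged sketch of the per-category breakdown the referee requests.
# The 5% numeric tolerance mirrors ChartQA's relaxed-accuracy metric;
# the dict keys ('category', 'pred', 'gold') are assumptions.
def relaxed_match(pred: str, gold: str, tol: float = 0.05) -> bool:
    try:
        p, g = float(pred), float(gold)
        return abs(p - g) <= tol * abs(g) if g != 0 else p == g
    except ValueError:
        return pred.strip().lower() == gold.strip().lower()

def per_category_accuracy(examples):
    # examples: iterable of dicts with 'category', 'pred', 'gold' keys
    totals, hits = {}, {}
    for ex in examples:
        c = ex["category"]
        totals[c] = totals.get(c, 0) + 1
        hits[c] = hits.get(c, 0) + relaxed_match(ex["pred"], ex["gold"])
    return {c: hits[c] / totals[c] for c in totals}

demo = [
    {"category": "human", "pred": "41.9", "gold": "42"},     # within 5%
    {"category": "human", "pred": "30", "gold": "42"},       # miss
    {"category": "augmented", "pred": "Peru", "gold": "peru"},
]
print(per_category_accuracy(demo))  # {'human': 0.5, 'augmented': 1.0}
```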
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper introduces a new benchmark (ChartQA) consisting of human-written questions and summary-generated questions, along with two transformer-based models that combine visual features and chart data tables. The central claims of SOTA performance on prior datasets and the new benchmark follow directly from the described model architectures, training procedures, and standard evaluation protocols. No load-bearing steps reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the results are externally falsifiable via the released benchmark and independent prior datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer models can be adapted to jointly process visual chart features and tabular data for question answering.
Forward citations
Cited by 22 Pith papers
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- Can Multimodal Large Language Models Truly Understand Small Objects?
  Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
- Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
  Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
- PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
  PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- Reinforcing Multimodal Reasoning Against Visual Degradation
  ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
- Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
  Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
- Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
  Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
- Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
  Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.
- Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
  PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
- EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
  EviSearch automates ontology-aligned clinical evidence table creation from native PDFs with comprehensive provenance logging for auditability and iterative improvement.
- DeepEyesV2: Toward Agentic Multimodal Model
  DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
- DeepSeek-OCR: Contexts Optical Compression
  DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
- Qwen3-Omni Technical Report
  Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-modal counterparts.
- Emu3: Next-Token Prediction is All You Need
  Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
- SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
  SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
- MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
  MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
- DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
  DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B activated parameters.
- MiniCPM-V: A GPT-4V Level MLLM on Your Phone
  MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
  InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
- Seed1.5-VL Technical Report
  Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
- VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
  VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.