ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning
Pith reviewed 2026-05-15 21:10 UTC · model grok-4.3
The pith
ChartQA introduces a benchmark of 32.7K questions requiring visual and logical reasoning over charts, plus transformer models that fuse image features with data tables.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present ChartQA, a benchmark consisting of 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to test visual and logical reasoning over charts. To tackle these questions, we introduce two transformer-based models that unify visual features extracted from the chart image with the data table representation to generate answers. Our models achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA, although challenges remain for complex reasoning questions.
What carries the argument
Two transformer-based models that combine visual features from the chart image and the data table of the chart in a unified way to answer questions.
If this is right
- Transformer models that jointly encode chart images and data tables outperform prior approaches on both existing and new chart QA tasks.
- The benchmark exposes persistent gaps in handling multi-step arithmetic and visual reference questions even at state-of-the-art performance.
- A public resource now exists for training and evaluating systems that must interpret data visualizations beyond template matching.
Where Pith is reading between the lines
- Systems trained on ChartQA may transfer to business-intelligence tools that let users ask natural questions about dashboard charts.
- The image-plus-table fusion strategy could extend to other domains such as interpreting scientific figures or infographics that pair visuals with numeric data.
- Future work could measure how much performance drops when questions require reasoning patterns absent from the summary-generated portion of the data.
Load-bearing premise
That the collected human-written questions and summary-generated questions sufficiently capture the full range of complex visual and logical reasoning people perform on real-world charts.
What would settle it
A controlled test set of real-world charts with novel visual-logical question combinations, where current models score near random while human experts score above 80 percent, would show the benchmark does not capture the full range of reasoning.
Original abstract
Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a fixed-vocabulary. In this work, we present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. To address the unique challenges in our benchmark involving visual and logical reasoning over charts, we present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions. While our models achieve the state-of-the-art results on the previous datasets as well as on our benchmark, the evaluation also reveals several challenges in answering complex reasoning questions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ChartQA, a benchmark for chart question answering focused on complex visual and logical reasoning. It consists of 9.6K human-written questions plus 23.1K questions generated from human-written chart summaries. The authors propose two transformer-based models that fuse visual features extracted from the chart image with the underlying data table in a unified architecture. These models are shown to achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA itself, while the evaluation highlights remaining difficulties with multi-step arithmetic and visual-reference questions.
Significance. If the results hold, ChartQA fills a clear gap by moving beyond template-based, fixed-vocabulary questions to realistic reasoning scenarios that combine arithmetic, logical operations, and visual references. The unified visual-plus-table models demonstrate a practical fusion strategy that could transfer to other multimodal data-analysis tasks. The dual construction (human-written and summary-generated) also enables controlled study of different reasoning distributions, making the benchmark a useful standard for future work in chart understanding and visual QA.
major comments (2)
- [Abstract] Abstract and Evaluation section: the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset that the benchmark is intended to stress.
- [Experiments] Experiments: no ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
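The per-category breakdown requested in the first major comment can be sketched in a few lines. This is only an illustration of the bookkeeping, not the paper's evaluation code: the category names (`lookup`, `multi_step`), the helper functions, and the toy records are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} from (category, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}


def deltas_in_points(new_model, baseline):
    """Per-category accuracy difference (new minus baseline), in points."""
    return {
        c: round(100 * (new_model[c] - baseline[c]), 1)
        for c in new_model
        if c in baseline
    }


# Hypothetical toy records; these are NOT results from the paper.
baseline_records = [
    ("lookup", True), ("lookup", True), ("lookup", False), ("lookup", True),
    ("multi_step", False), ("multi_step", False),
    ("multi_step", True), ("multi_step", False),
]
fusion_records = [
    ("lookup", True), ("lookup", True), ("lookup", True), ("lookup", False),
    ("multi_step", True), ("multi_step", False),
    ("multi_step", True), ("multi_step", False),
]

baseline_acc = accuracy_by_category(baseline_records)
fusion_acc = accuracy_by_category(fusion_records)
print(deltas_in_points(fusion_acc, baseline_acc))
```

Reporting deltas per category rather than one aggregate number makes it visible whether a gain comes from simple lookup questions or from the complex-reasoning subset. In this toy data the whole gain sits in the `multi_step` category.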
minor comments (3)
- [Related Work] Related Work: prior chart QA datasets should be summarized in a comparison table that includes question type, vocabulary size, and reasoning depth so readers can immediately see how ChartQA differs.
- Figure captions: several figures showing model architecture or example questions would benefit from explicit labels indicating which components correspond to visual encoder, table encoder, and fusion layers.
- Notation: the description of the unified input representation occasionally switches between “visual features” and “image embeddings”; consistent terminology would improve readability.
Simulated Author's Rebuttal
We thank the referee for the positive recommendation to accept and for the constructive feedback. We address each major comment point by point below and will make targeted revisions to improve clarity.
Point-by-point responses
- Referee: [Abstract] Abstract and Evaluation section: the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset that the benchmark is intended to stress.
  Authors: We agree that the abstract would benefit from more specific numbers. The Experiments section (Tables 3–5) already reports full SOTA comparisons with absolute deltas over prior models and per-category breakdowns (human-written vs. machine-generated questions, as well as by reasoning complexity). We will revise the abstract to include key deltas (e.g., “outperforming prior SOTA by 4–8 points on complex reasoning questions”) so readers can immediately assess gains on the challenging subset.
  Revision: yes
- Referee: [Experiments] Experiments: no ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
  Authors: We acknowledge the value of explicit ablations. The manuscript does compare the unified model against strong baselines that use either visual features alone or table encoding alone (Section 4.2 and Table 2), showing consistent gains from fusion. Section 5.3 already provides error analysis with qualitative examples of failures on multi-step arithmetic and visual-reference questions. We will expand the discussion of these results in the revision to better attribute performance to the fusion strategy, but a full factorial ablation study was not conducted in the original experiments.
  Revision: partial
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper introduces a new benchmark (ChartQA) consisting of human-written questions and summary-generated questions, along with two transformer-based models that combine visual features and chart data tables. The central claims of SOTA performance on prior datasets and the new benchmark follow directly from the described model architectures, training procedures, and standard evaluation protocols. No load-bearing steps reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the results are externally falsifiable via the released benchmark and independent prior datasets.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Transformer models can be adapted to jointly process visual chart features and tabular data for question answering.
Forward citations
Cited by 54 Pith papers