arxiv: 1710.07300 · v2 · pith:BFYLXW5Fnew · submitted 2017-10-19 · 💻 cs.CV

FigureQA: An Annotated Figure Dataset for Visual Reasoning

classification 💻 cs.CV
keywords reasoning, visual, data, elements, figure, figureqa, plot, questions

We introduce FigureQA, a visual reasoning corpus of over one million question-answer pairs grounded in over 100,000 images. The images are synthetic, scientific-style figures from five classes: line plots, dot-line plots, vertical and horizontal bar graphs, and pie charts. We formulate our reasoning task by generating questions from 15 templates; questions concern various relationships between plot elements and examine characteristics like the maximum, the minimum, area-under-the-curve, smoothness, and intersection. To resolve, such questions often require reference to multiple plot elements and synthesis of information distributed spatially throughout a figure. To facilitate the training of machine learning systems, the corpus also includes side data that can be used to formulate auxiliary objectives. In particular, we provide the numerical data used to generate each figure as well as bounding-box annotations for all plot elements. We study the proposed visual reasoning task by training several models, including the recently proposed Relation Network as a strong baseline. Preliminary results indicate that the task poses a significant machine learning challenge. We envision FigureQA as a first step towards developing models that can intuitively recognize patterns from visual representations of data.
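
As a rough illustration of how template questions can be answered from the numerical data behind a figure, the sketch below builds a few question-answer pairs for a line plot. It is not the authors' generator: the question wordings, the yes/no answer format, the evenly spaced x-grid, and the names `area_under_curve`, `intersects`, and `generate_qa` are all assumptions made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: series name -> y-values sampled on a shared,
# evenly spaced x-grid (an assumption, not the dataset's actual format).
Series = Dict[str, List[float]]


def area_under_curve(ys: List[float], dx: float = 1.0) -> float:
    """Trapezoidal-rule area for evenly spaced samples."""
    return sum((a + b) / 2.0 * dx for a, b in zip(ys, ys[1:]))


def intersects(ys_a: List[float], ys_b: List[float]) -> bool:
    """True if the two piecewise-linear curves touch or cross."""
    diffs = [a - b for a, b in zip(ys_a, ys_b)]
    if any(d == 0 for d in diffs):
        return True  # the curves share a sample point
    # A sign change between neighbouring samples means a crossing in between.
    return any(d1 * d2 < 0 for d1, d2 in zip(diffs, diffs[1:]))


def generate_qa(series: Series) -> List[Tuple[str, bool]]:
    """Fill illustrative templates and answer them from the raw data."""
    names = sorted(series)
    auc = {name: area_under_curve(series[name]) for name in names}
    largest, smallest = max(auc.values()), min(auc.values())
    qa: List[Tuple[str, bool]] = []
    for name in names:
        qa.append((f"Does {name} have the maximum area under the curve?",
                   auc[name] == largest))
        qa.append((f"Does {name} have the minimum area under the curve?",
                   auc[name] == smallest))
    # Pairwise questions refer to two plot elements at once.
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            qa.append((f"Does {first} intersect {second}?",
                       intersects(series[first], series[second])))
    return qa


if __name__ == "__main__":
    example = {"Red": [1.0, 2.0, 4.0], "Blue": [3.0, 2.0, 1.0], "Green": [5.0, 5.0, 5.0]}
    for question, answer in generate_qa(example):
        print(question, "yes" if answer else "no")
```

Answering from the underlying numbers rather than from pixels is what makes such labels cheap to produce at scale; a model trained on the images has to recover the same relationships visually.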

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision2Code: A Multi-Domain Benchmark for Evaluating Image-to-Code Generation

    cs.CV 2026-05 accept novelty 8.0

    Vision2Code is a multi-domain benchmark that evaluates image-to-code generation via rendered outputs scored by a VLM rater with dataset-specific rubrics, revealing domain-dependent model performance and enabling impro...

  2. Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    cs.CV 2026-01 unverdicted novelty 8.0

    Molmo2 delivers state-of-the-art open-weight video VLMs with new grounding datasets and training methods that outperform prior open models and match or exceed some proprietary ones on pointing and tracking tasks.

  3. Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    cs.CV 2024-09 accept novelty 8.0

    Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.

  4. QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

    quant-ph 2026-04 unverdicted novelty 7.0

    Introduces QCalEval benchmark showing best zero-shot VLM score of 72.3 on quantum calibration plots, with fine-tuning and in-context learning effects varying by model type.

  5. Beyond Single Plots: A Benchmark for Question Answering on Multi-Charts

    cs.CL 2026-04 unverdicted novelty 7.0

    PolyChartQA is a new mid-scale dataset for multi-chart question answering that reveals a 27.4% accuracy drop for multimodal models on human-authored questions compared to AI-generated ones, plus a modest gain from a p...

  6. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  7. GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.

  8. Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.

  9. PlotChain: Deterministic Checkpointed Evaluation of Multimodal LLMs on Engineering Plot Reading

    cs.AI 2026-01 conditional novelty 7.0

    PlotChain benchmark reports top MLLMs reaching ~80% field-level accuracy on engineering plot reading under human-like tolerances, but with persistent failures on frequency-domain tasks like bandpass and FFT spectra.

  10. From Static to Interactive: Authoring Interactive Visualizations via Natural Language

    cs.HC 2026-01 unverdicted novelty 7.0

    Athanor converts static visualizations to interactive versions via MLLMs, a multi-agent analyzer, and an abstraction transformer, allowing natural language authoring of interactions.

  11. InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information

    cs.CL 2025-08 unverdicted novelty 7.0

    InterChart is a new benchmark that reveals steep drops in VLM accuracy when moving from single-chart facts to integrative reasoning over 2-3 related charts, with better performance after decomposing complex charts.

  12. Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

    cs.AI 2024-10 unverdicted novelty 7.0

    PolyMATH is a new 5,000-image benchmark where top MLLMs reach at most 41 percent accuracy on multi-modal mathematical reasoning, with ablation showing minimal gain from text over images.

  13. Process Rewards with Learned Reliability

    cs.CL 2026-05 unverdicted novelty 6.0

    BetaPRM learns distributional step rewards with explicit reliability via Beta-Binomial modeling, enabling ACA that cuts token use by up to 33.57% while raising final-answer accuracy on reasoning benchmarks.

  14. Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts

    cs.CV 2026-05 unverdicted novelty 6.0

    Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.

  15. CAGE: Bridging the Accuracy-Aesthetics Gap in Educational Diagrams via Code-Anchored Generative Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    CAGE uses LLM-generated code for label-correct diagrams followed by ControlNet-conditioned diffusion refinement to produce both accurate and visually engaging educational graphics, backed by the new EduDiagram-2K dataset.

  16. Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.

  17. ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

    cs.CV 2026-01 conditional novelty 6.0

    ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.

  18. OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

    cs.CV 2025-03 conditional novelty 6.0

    Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

  19. LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    cs.CL 2025-03 unverdicted novelty 6.0

    A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.

  20. Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    cs.CV 2024-12 unverdicted novelty 6.0

    InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.

  21. Assessing Y-Axis Influence: Bias in Multimodal Language Models on Chart-to-Table Translation

    cs.AI 2026-04 unverdicted novelty 5.0

    Y-axis features such as major tick digit length, number of ticks, value range, and format introduce significant biases in multimodal models during chart-to-table tasks, with y-axis prompting improving performance for ...

  22. Responses Fall Short of Understanding: Revealing the Gap between Internal Representations and Responses in Visual Document Understanding

    cs.CL 2026-04 unverdicted novelty 5.0

    Linear probing reveals a gap between internal representations and responses in LVLMs for visual document understanding, with task information encoded more linearly in intermediate layers than the final layer, and fine...

  23. MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    cs.CV 2024-08 conditional novelty 5.0

    MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.

  24. ZAYA1-VL-8B Technical Report

    cs.CV 2026-05 unverdicted novelty 4.0

    ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...

  25. Seed1.5-VL Technical Report

    cs.CV 2025-05 unverdicted novelty 4.0

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.