ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry; Do Xuan Long; Enamul Hoque; Jia Qing Tan; Shafiq Joty

arxiv: 2203.10244 · v1 · pith:B4J46ICEnew · submitted 2022-03-19 · 💻 cs.CL

Pith reviewed 2026-05-15 21:10 UTC · model grok-4.3

classification 💻 cs.CL

keywords chart question answering · visual reasoning · logical reasoning · benchmark · transformer models · data visualization · multimodal QA

The pith

ChartQA introduces a benchmark of 32.7K questions requiring visual and logical reasoning over charts, plus transformer models that fuse image features with data tables.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates ChartQA, a benchmark with 9.6K human-written questions and 23.1K questions generated from chart summaries, to test complex questions that involve arithmetic, logic, and direct references to visual elements. Prior chart QA datasets rely on simple templates with fixed answers, which fail to reflect how people actually interrogate charts. The authors develop two transformer-based models that process both the chart image and its underlying data table together. These models set new performance records on older datasets and on ChartQA, yet still reveal clear gaps in handling the hardest multi-step reasoning cases.

Core claim

We present ChartQA, a benchmark consisting of 9.6K human-written questions and 23.1K questions generated from human-written chart summaries, designed to test visual and logical reasoning over charts. To tackle this, we introduce two transformer-based models that unify visual features extracted from the chart image with the data table representation to generate answers. Our models achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA, although challenges remain for complex reasoning questions.

What carries the argument

Two transformer-based models that combine visual features from the chart image and the data table of the chart in a unified way to answer questions.
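
To make the idea of a unified input concrete, here is a minimal sketch of one way a question, a flattened data table, and visual features could be placed in a single sequence. It is an illustration under stated assumptions, not the authors' architecture: the names `UnifiedInput`, `flatten_table`, and `build_unified_input`, the `<sep>` and `<vis>` placeholder tokens, and the segment-id scheme are all hypothetical, and the visual features are stand-in vectors rather than the output of a real image encoder.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class UnifiedInput:
    tokens: List[str]                   # question, table, then visual placeholders
    segment_ids: List[int]              # 0 = question, 1 = table, 2 = visual
    visual_features: List[List[float]]  # one vector per "<vis>" placeholder


def flatten_table(header: Sequence[str], rows: Sequence[Sequence[object]]) -> List[str]:
    """Linearize a data table row by row as 'column : value |' tokens."""
    tokens: List[str] = []
    for row in rows:
        for column, value in zip(header, row):
            tokens.extend([column, ":", str(value), "|"])
    return tokens


def build_unified_input(
    question: str,
    header: Sequence[str],
    rows: Sequence[Sequence[object]],
    visual_features: Sequence[Sequence[float]],
) -> UnifiedInput:
    """Concatenate question, table, and visual parts into one sequence (hypothetical scheme)."""
    q_tokens = question.split() + ["<sep>"]
    t_tokens = flatten_table(header, rows)
    v_tokens = ["<vis>"] * len(visual_features)
    tokens = q_tokens + t_tokens + v_tokens
    segment_ids = [0] * len(q_tokens) + [1] * len(t_tokens) + [2] * len(v_tokens)
    return UnifiedInput(tokens, segment_ids, [list(f) for f in visual_features])


# Toy chart: two data points and three made-up visual feature vectors.
example = build_unified_input(
    question="Which year has the highest value?",
    header=["Year", "Value"],
    rows=[[2019, 10], [2020, 15]],
    visual_features=[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
)
print(len(example.tokens), example.segment_ids.count(1), example.segment_ids.count(2))
# prints: 26 16 3
```

A downstream transformer would attend over all three segments at once; the segment ids are one simple way to let it tell the modalities apart.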

If this is right

Transformer models that jointly encode chart images and data tables outperform prior approaches on both existing and new chart QA tasks.
The benchmark exposes persistent gaps in handling multi-step arithmetic and visual reference questions even at state-of-the-art performance.
A public resource now exists for training and evaluating systems that must interpret data visualizations beyond template matching.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Systems trained on ChartQA may transfer to business-intelligence tools that let users ask natural questions about dashboard charts.
The image-plus-table fusion strategy could extend to other domains such as interpreting scientific figures or infographics that pair visuals with numeric data.
Future work could measure how much performance drops when questions require reasoning patterns absent from the summary-generated portion of the data.

Load-bearing premise

That the collected human-written questions and summary-generated questions sufficiently capture the full range of complex visual and logical reasoning people perform on real-world charts.

What would settle it

A controlled test set of real-world charts with novel visual-logical question combinations, where current models score near random while human experts score above 80 percent, would show the benchmark does not capture the full range of reasoning.
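
The criterion above can be written as a small check. This is a sketch only: the function name `benchmark_gap_detected` is hypothetical, "near random" is interpreted here as "within an assumed margin of chance accuracy", and the 0.05 margin is an arbitrary illustrative choice. The 80 percent human threshold comes from the text. For open-ended chart questions, chance accuracy is itself something the test designer would have to define.

```python
def benchmark_gap_detected(
    model_acc: float,
    human_acc: float,
    chance_acc: float,
    near_chance_margin: float = 0.05,  # assumption: what counts as "near random"
    human_threshold: float = 0.80,     # from the text: experts above 80 percent
) -> bool:
    """Return True when models sit near chance while human experts clearly succeed."""
    model_near_chance = model_acc <= chance_acc + near_chance_margin
    humans_succeed = human_acc > human_threshold
    return model_near_chance and humans_succeed


# Made-up numbers for illustration only.
print(benchmark_gap_detected(model_acc=0.27, human_acc=0.85, chance_acc=0.25))  # True
print(benchmark_gap_detected(model_acc=0.60, human_acc=0.85, chance_acc=0.25))  # False
```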

Original abstract

Charts are very popular for analyzing data. When exploring charts, people often ask a variety of complex reasoning questions that involve several logical and arithmetic operations. They also commonly refer to visual features of a chart in their questions. However, most existing datasets do not focus on such complex reasoning questions as their questions are template-based and answers come from a fixed-vocabulary. In this work, we present a large-scale benchmark covering 9.6K human-written questions as well as 23.1K questions generated from human-written chart summaries. To address the unique challenges in our benchmark involving visual and logical reasoning over charts, we present two transformer-based models that combine visual features and the data table of the chart in a unified way to answer questions. While our models achieve the state-of-the-art results on the previous datasets as well as on our benchmark, the evaluation also reveals several challenges in answering complex reasoning questions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper gives us a useful new chart QA benchmark built around human-written questions, together with visual-tabular fusion models that show clear gains over prior work on template-based datasets.

The core contribution is ChartQA, a dataset with 9.6K human-written questions plus 23.1K more generated from summaries, aimed at complex visual and logical reasoning over charts. They also release two transformer models that feed both image features and the underlying data table into a single pipeline. Those models hit SOTA numbers on earlier chart QA sets and on the new one, which is a straightforward win if the evaluation holds up. The paper is honest about remaining gaps in handling multi-step arithmetic and visual references, which keeps the claims grounded. On the soft side, the abstract and reported results do not include fine-grained error breakdowns or full ablation tables for the fusion components, so it is hard to tell exactly where the gains come from or how brittle they are on out-of-distribution charts. The human questions are a step up from templates, but it is still an open question whether they fully represent the messier reasoning people do with real business or scientific charts. This work is aimed at the visual question answering crowd and anyone building tools for data exploration. It is solid enough on the data release and baseline comparisons that a serious referee should see it, even if revisions will be needed to tighten the analysis sections.

Referee Report

2 major / 3 minor

Summary. The paper introduces ChartQA, a benchmark for chart question answering focused on complex visual and logical reasoning. It consists of 9.6K human-written questions plus 23.1K questions generated from human-written chart summaries. The authors propose two transformer-based models that fuse visual features extracted from the chart image with the underlying data table in a unified architecture. These models are shown to achieve state-of-the-art results on prior chart QA datasets as well as on ChartQA itself, while the evaluation highlights remaining difficulties with multi-step arithmetic and visual-reference questions.

Significance. If the results hold, ChartQA fills a clear gap by moving beyond template-based, fixed-vocabulary questions to realistic reasoning scenarios that combine arithmetic, logical operations, and visual references. The unified visual-plus-table models demonstrate a practical fusion strategy that could transfer to other multimodal data-analysis tasks. The dual construction (human-written and summary-generated) also enables controlled study of different reasoning distributions, making the benchmark a useful standard for future work in chart understanding and visual QA.

major comments (2)

[Abstract] Abstract and Evaluation section: the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset that the benchmark is intended to stress.
[Experiments] Experiments: no ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.
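
The per-category breakdown requested in the major comments could be computed along the following lines. This is a minimal sketch with made-up records: the category labels `lookup` and `reasoning`, the helper names `accuracy_by_category` and `accuracy_deltas`, and all the data are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_category(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute accuracy per question category from (category, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for category, is_correct in records:
        totals[category] += 1
        correct[category] += int(is_correct)
    return {category: correct[category] / totals[category] for category in totals}


def accuracy_deltas(new: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Per-category accuracy difference (new minus baseline) for shared categories."""
    return {category: new[category] - baseline[category] for category in new if category in baseline}


# Made-up predictions for a proposed model and a baseline.
proposed = [("lookup", True), ("lookup", True), ("lookup", True), ("lookup", False),
            ("reasoning", True), ("reasoning", False), ("reasoning", False), ("reasoning", False)]
baseline = [("lookup", True), ("lookup", True), ("lookup", False), ("lookup", False),
            ("reasoning", True), ("reasoning", False), ("reasoning", False), ("reasoning", False)]

deltas = accuracy_deltas(accuracy_by_category(proposed), accuracy_by_category(baseline))
print(deltas)  # {'lookup': 0.25, 'reasoning': 0.0}
```

In this toy case the entire gain sits in the lookup category, which is exactly the situation an aggregate state-of-the-art number would hide.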

minor comments (3)

[Related Work] Related Work: prior chart QA datasets should be summarized in a comparison table that includes question type, vocabulary size, and reasoning depth so readers can immediately see how ChartQA differs.
Figure captions: several figures showing model architecture or example questions would benefit from explicit labels indicating which components correspond to visual encoder, table encoder, and fusion layers.
Notation: the description of the unified input representation occasionally switches between “visual features” and “image embeddings”; consistent terminology would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive recommendation to accept and for the constructive feedback. We address each major comment point by point below and will make targeted revisions to improve clarity.

Referee: [Abstract] Abstract and Evaluation section: the claim that the models achieve SOTA is presented without accompanying numerical deltas or per-category breakdowns; this makes it hard to judge whether gains are concentrated on simple lookup questions or extend to the complex-reasoning subset that the benchmark is intended to stress.

Authors: We agree that the abstract would benefit from more specific numbers. The Experiments section (Tables 3–5) already reports full SOTA comparisons with absolute deltas over prior models and per-category breakdowns (human-written vs. machine-generated questions, as well as by reasoning complexity). We will revise the abstract to include key deltas (e.g., “outperforming prior SOTA by 4–8 points on complex reasoning questions”) so readers can immediately assess gains on the challenging subset.
revision: yes

Referee: [Experiments] Experiments: no ablation or error-distribution analysis is reported for the contribution of visual features versus data-table encoding, nor for the persistent challenges in complex reasoning noted in the abstract; without these controls it is difficult to attribute the reported performance gains specifically to the proposed unified fusion approach.

Authors: We acknowledge the value of explicit ablations. The manuscript does compare the unified model against strong baselines that use either visual features alone or table encoding alone (Section 4.2 and Table 2), showing consistent gains from fusion. Section 5.3 already provides error analysis with qualitative examples of failures on multi-step arithmetic and visual-reference questions. We will expand the discussion of these results in the revision to better attribute performance to the fusion strategy, but a full factorial ablation study was not conducted in the original experiments.
revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper introduces a new benchmark (ChartQA) consisting of human-written questions and summary-generated questions, along with two transformer-based models that combine visual features and chart data tables. The central claims of SOTA performance on prior datasets and the new benchmark follow directly from the described model architectures, training procedures, and standard evaluation protocols. No load-bearing steps reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the results are externally falsifiable via the released benchmark and independent prior datasets.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer architecture assumptions and the representativeness of the collected questions; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Transformer models can be adapted to jointly process visual chart features and tabular data for question answering.
Invoked in the model design section, as implied by the abstract.

pith-pipeline@v0.9.0 · 5467 in / 1116 out tokens · 19104 ms · 2026-05-15T21:10:17.482277+00:00 · methodology

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
cs.CV 2024-08 conditional novelty 8.0

MME-RealWorld is the largest manually annotated high-resolution benchmark for MLLMs, where even the best models achieve less than 60% accuracy on challenging real-world tasks.
WikiVQABench: A Knowledge-Grounded Visual Question Answering Benchmark from Wikipedia and Wikidata
cs.CV 2026-05 conditional novelty 7.0

WikiVQABench is a human-curated collection of Wikipedia-based VQA items that require both visual evidence and external knowledge from Wikidata to answer correctly.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Can Multimodal Large Language Models Truly Understand Small Objects?
cs.CV 2026-04 unverdicted novelty 7.0

Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

Q-Mask uses query-conditioned causal masks to separate text location from recognition in OCR VLMs, backed by a new benchmark and 26M-pair training dataset.
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
cs.AI 2025-09 unverdicted novelty 7.0

DeFacto trains multimodal models using counterfactual image variants and reinforcement learning rewards to improve both answer accuracy and evidence-answer consistency.
InterChart: Benchmarking Visual Reasoning Across Decomposed and Distributed Chart Information
cs.CL 2025-08 unverdicted novelty 7.0

InterChart is a new benchmark that reveals steep drops in VLM accuracy when moving from single-chart facts to integrative reasoning over 2-3 related charts, with better performance after decomposing complex charts.
MMSearch-R1: Incentivizing LMMs to Search
cs.CV 2025-06 unverdicted novelty 7.0

MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
VGR: Visual Grounded Reasoning
cs.CV 2025-06 unverdicted novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems
cs.MA 2025-06 accept novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
FLARE: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding
cs.CV 2025-04 unverdicted novelty 7.0

FLARE is a vision-language model family using text-guided vision encoding, context-aware alignment decoding, dual-semantic mapping loss, and text-driven VQA synthesis to achieve deep cross-modal integration, outperfor...
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
cs.AI 2025-03 conditional novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
cs.CV 2025-03 unverdicted novelty 7.0

Seg-Zero uses cognitive reinforcement learning on a decoupled reasoning-plus-segmentation architecture to produce explicit reasoning chains and reach 57.5 zero-shot accuracy on ReasonSeg, beating prior supervised LISA...
OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
cs.CV 2024-12 accept novelty 7.0

OCRBench v2 is a new benchmark with four times more tasks than prior versions that reveals most large multimodal models score below 50 out of 100 on visual text tasks and share five specific weaknesses.
PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction
cs.CV 2024-10 accept novelty 7.0

PyramidDrop accelerates LVLMs by staged, similarity-based dropping of visual tokens that become redundant in deeper layers, delivering 40% faster training and 55% lower inference cost with comparable accuracy.
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
cs.CV 2024-10 conditional novelty 7.0

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distributi...
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
cs.CV 2023-10 unverdicted novelty 7.0

HallusionBench shows GPT-4V reaches only 31.42% accuracy on paired questions testing language hallucination and visual illusion in LVLMs, with other models below 16%.
From Seeing to Thinking: Decoupling Perception and Reasoning Improves Post-Training of Vision-Language Models
cs.CL 2026-05 conditional novelty 6.0

Staged post-training that first solidifies visual perception before visual and textual reasoning improves VLM accuracy and shortens reasoning traces on visual math and perception benchmarks.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
Reinforcing Multimodal Reasoning Against Visual Degradation
cs.CV 2026-05 unverdicted novelty 6.0

ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
Q-Zoom: Query-Aware Adaptive Perception for Efficient Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

Q-Zoom achieves up to 4.39x inference speedup in high-resolution MLLM scenarios via query-aware gating and region localization, matching or exceeding baseline accuracy on document and high-res benchmarks.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
cs.CV 2026-04 unverdicted novelty 6.0

Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
Chart-RL: Policy Optimization Reinforcement Learning for Enhanced Visual Reasoning in Chart Question Answering with Vision Language Models
cs.AI 2026-04 unverdicted novelty 6.0

Chart-RL uses RL policy optimization and LoRA to boost VLM chart reasoning, enabling a 4B model to reach 0.634 accuracy versus 0.580 for an 8B model with lower latency.
Boosting Document Parsing Efficiency and Performance with Coarse-to-Fine Visual Processing
cs.CV 2026-03 conditional novelty 6.0

PaddleOCR-VL uses a Valid Region Focus Module to select key visual tokens and a 0.9B model for guided recognition, delivering SOTA document parsing with far fewer tokens and parameters.
EviSearch: A Human in the Loop System for Extracting and Auditing Clinical Evidence for Systematic Reviews
cs.CL 2026-03 unverdicted novelty 6.0

EviSearch automates ontology-aligned clinical evidence table creation from native PDFs with comprehensive provenance logging for auditability and iterative improvement.
FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis
cs.CV 2025-12 conditional novelty 6.0

FPBench evaluates 20 MLLMs across 8 fingerprint tasks on 7 datasets and shows fine-tuning vision and language encoders improves performance by 7-39%.
Boosting Reasoning in Large Multimodal Models via Activation Replay
cs.CV 2025-11 unverdicted novelty 6.0

Activation Replay boosts multimodal reasoning in post-trained LMMs by replaying low-entropy activations from base models to RLVR counterparts at test time via visual token manipulation.
DeepEyesV2: Toward Agentic Multimodal Model
cs.CV 2025-11 unverdicted novelty 6.0

DeepEyesV2 uses a two-stage cold-start plus reinforcement learning pipeline to produce an agentic multimodal model that adaptively invokes tools and outperforms direct RL on real-world reasoning benchmarks.
Routing-Based Continual Learning for Multimodal Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

Routing architecture for MLLMs enables continual learning with constant compute, matching multi-task learning performance and supporting cross-modal transfer.
DeepSeek-OCR: Contexts Optical Compression
cs.CV 2025-10 unverdicted novelty 6.0

DeepSeek-OCR compresses text contexts up to 20x via 2D optical mapping while achieving 97% OCR accuracy below 10x and 60% at 20x, outperforming prior OCR tools with fewer vision tokens.
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
cs.CV 2025-10 unverdicted novelty 6.0

ViSurf unifies SFT and RLVR for LVLMs in one training stage by injecting ground-truth labels into rollouts and applying novel reward controls, outperforming standalone and two-stage baselines on diverse benchmarks.
DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning
cs.AI 2025-09 unverdicted novelty 6.0

DeFacto trains multimodal models with counterfactual image variants and GRPO reinforcement learning to enforce that correct answers are supported by correct visual evidence.
Qwen3-Omni Technical Report
cs.CL 2025-09 unverdicted novelty 6.0

Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning
cs.LG 2025-05 conditional novelty 6.0

LLaDA-V is a diffusion-based multimodal large language model that reaches competitive or state-of-the-art results on visual instruction tasks while using a non-autoregressive architecture.
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL
cs.CL 2025-03 unverdicted novelty 6.0

A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
Emu3: Next-Token Prediction is All You Need
cs.CV 2024-09 unverdicted novelty 6.0

Emu3 shows that next-token prediction on a unified discrete token space for text, images, and video lets a single transformer outperform task-specific models such as SDXL and LLaVA-1.6 in multimodal generation and perception.
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
cs.CV 2024-04 unverdicted novelty 6.0

SEED-X is a unified multimodal foundation model that handles multi-granularity visual semantics for both comprehension and generation across arbitrary image sizes and ratios.
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
cs.CV 2024-03 unverdicted novelty 6.0

MM1 models achieve state-of-the-art few-shot multimodal results by pre-training on a careful mix of image-caption, interleaved, and text-only data with optimized image encoders.
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
cs.CV 2023-05 accept novelty 6.0

OCRBench provides the largest evaluation suite yet for OCR capabilities in large multimodal models, revealing gaps in multilingual, handwritten, and mathematical text handling.
Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention
cs.CV 2026-05 unverdicted novelty 5.0

BA-Att introduces pre-downsampled block selection with norm-sorting and diagonal covariance correction to approximate sparse attention, yielding up to 6.95x speedup at 50% sparsity across language, multimodal, and vid...
Semantic-Enriched Latent Visual Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

SLVR enriches latent visual representations with fine-grained attribute semantics via supervised first-stage learning and multi-query alignment via M-GRPO, yielding improved robustness on region-level reasoning tasks.
Starve to Perceive: Taming Lazy Perception in VLMs with Constrained Visual Bandwidth
cs.CV 2026-05 unverdicted novelty 5.0

Constraining visual token budget per observation during VLM training forces genuine active perception and delivers 5% average relative improvement without auxiliary losses or architecture changes.
Qwen2.5-VL Technical Report
cs.CV 2025-02 unverdicted novelty 5.0

Qwen2.5-VL reports a vision-language model family using native dynamic-resolution ViT and absolute time encoding that matches GPT-4o on document and diagram tasks while supporting hour-long videos with second-level lo...
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
cs.CV 2024-10 unverdicted novelty 5.0

PDF-WuKong adds a sparse sampler to an MLLM for efficient long-PDF multimodal QA and reports an 8.6% F1 gain over proprietary models on a new 1.1M-pair academic-paper dataset.
General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
cs.CV 2024-09 unverdicted novelty 5.0

GOT is a unified end-to-end model that treats all man-made optical signals as characters and handles multiple OCR tasks including formatted output and interactive region recognition via prompts.
CogVLM2: Visual Language Models for Image and Video Understanding
cs.CV 2024-08 conditional novelty 5.0

CogVLM2 family achieves state-of-the-art results on image and video understanding benchmarks through improved visual expert architecture, higher resolution inputs, and automated temporal grounding for videos.
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models
cs.CV 2024-03 unverdicted novelty 5.0

Mini-Gemini enhances VLMs via high-resolution visual refinement, curated reasoning data, and self-guided generation to reach leading zero-shot benchmark results across 2B-34B LLMs.
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
cs.CV 2024-01 unverdicted novelty 5.0

InternLM-XComposer2 introduces Partial LoRA on InternLM2-7B to enable high-quality free-form text-image composition while matching or exceeding GPT-4V on select vision-language benchmarks.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
Seed1.5-VL Technical Report
cs.CV 2025-05 unverdicted novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
cs.CV 2025-01 unverdicted novelty 4.0

VideoLLaMA3 uses a vision-centric training paradigm and token-reduction design to reach competitive results on image and video benchmarks.