Canonical reference

Visulogic: A benchmark for evaluating visual reasoning in multi-modal large language models. arXiv preprint arXiv:2504.15279, 2025

Weiye Xu, Jiahao Wang, Weiyun Wang, Zhe Chen, Wengang Zhou, Aijun Yang, Lewei Lu, Houqiang Li, Xiaohua Wang, Xizhou Zhu, et al. · 2025 · arXiv 2504.15279

Canonical reference. 88% of citing Pith papers cite this work as background.

17 Pith papers citing it

Background: 88% of classified citations


citation-role summary

background: 6 · dataset: 1 · method: 1

citation-polarity summary

background: 7 · use method: 1

representative citing papers

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

cs.CV · 2026-04-20 · unverdicted · novelty 8.0

EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning

cs.AI · 2026-04-04 · unverdicted · novelty 8.0

FeynmanBench is the first benchmark for evaluating multimodal LLMs on diagrammatic reasoning with Feynman diagrams, revealing systematic failures in enforcing physical constraints and global topology.

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

MLLMs scoring 70-83% on Cartesian visual tasks drop to 31-39% on logically equivalent polar versions, exposing reliance on grid discretization shortcuts instead of topology-invariant reasoning.

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training

cs.CV · 2026-04-21 · unverdicted · novelty 7.0

EmbodiedMidtrain mid-trains VLMs on curated VLA-aligned data subsets to improve downstream performance on robot manipulation benchmarks.

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.

Leveraging Latent Visual Reasoning in Silence

cs.CV · 2026-05-18 · conditional · novelty 6.0

Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.

Anisotropic Modality Align

cs.MM · 2026-05-08 · unverdicted · novelty 6.0

Modality representations share dominant semantic geometry but have an anisotropic residual gap; AnisoAlign corrects source representations boundedly using target geometry for unpaired alignment.

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Tuna-2 shows that direct pixel embeddings can replace vision encoders in unified multimodal models, achieving competitive generation and stronger understanding at scale.

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?

cs.LG · 2026-02-20 · conditional · novelty 6.0 · 2 refs

MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

cs.CV · 2026-02-02 · unverdicted · novelty 6.0

ReAlign corrects the modality gap in unpaired data to let MLLMs learn visual distributions from text alone before instruction tuning, reducing dependence on expensive paired corpora.

ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch

cs.CV · 2026-01-20 · conditional · novelty 6.0

ChartVerse uses Rollout Posterior Entropy and truth-anchored inverse QA synthesis to produce 640K high-quality chart reasoning samples, training an 8B model that surpasses its 30B teacher.

SPHINX: A Synthetic Environment for Visual Perception and Reasoning

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

SPHINX generates synthetic visual puzzles for benchmarking LVLMs, where GPT-5 scores 51.1% and RLVR training improves both in-domain and external visual reasoning performance.

What's Holding Back Latent Visual Reasoning?

cs.CV · 2026-05-18 · unverdicted · novelty 5.0

Latent visual reasoning fails in current models because standard datasets make oracle latents uninformative and inference-time latents collapse away from useful representations.

Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

TTSP resolves the Grounding Paradox by treating perception as a scalable test-time process that generates, filters, and iteratively refines multiple visual exploration traces, outperforming baselines on high-resolution and multimodal reasoning tasks.

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise

cs.CV · 2026-05-06 · unverdicted · novelty 4.0 · 2 refs

Mild rotations and noise significantly increase relation hallucinations in VLMs across models and datasets, with prompt and preprocessing fixes providing only partial relief.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

cs.AI · 2026-04-11

citing papers explorer


EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
cs.CV · 2026-04-20 · unverdicted · none · ref 46

FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
cs.AI · 2026-04-04 · unverdicted · none · ref 52

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space
cs.CV · 2026-05-11 · unverdicted · none · ref 38

EmbodiedMidtrain: Bridging the Gap between Vision-Language Models and Vision-Language-Action Models via Mid-training
cs.CV · 2026-04-21 · unverdicted · none · ref 22

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning
cs.AI · 2026-05-19 · unverdicted · none · ref 37

Leveraging Latent Visual Reasoning in Silence
cs.CV · 2026-05-18 · conditional · none · ref 36

Anisotropic Modality Align
cs.MM · 2026-05-08 · unverdicted · none · ref 15

Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation
cs.CV · 2026-04-27 · unverdicted · none · ref 47 · 2 links

MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?
cs.LG · 2026-02-20 · conditional · none · ref 87 · 2 links

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models
cs.CV · 2026-02-02 · unverdicted · none · ref 31

ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch
cs.CV · 2026-01-20 · conditional · none · ref 42

SPHINX: A Synthetic Environment for Visual Perception and Reasoning
cs.CV · 2025-11-25 · unverdicted · none · ref 62

What's Holding Back Latent Visual Reasoning?
cs.CV · 2026-05-18 · unverdicted · none · ref 23

Test-time Scaling over Perception: Resolving the Grounding Paradox in Thinking with Images
cs.CV · 2026-04-13 · unverdicted · none · ref 39

When Relations Break: Analyzing Relation Hallucination in Vision-Language Model Under Rotation and Noise
cs.CV · 2026-05-06 · unverdicted · none · ref 14 · 2 links

Seed1.5-VL Technical Report
cs.CV · 2025-05-11 · unverdicted · none · ref 155

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI · 2026-04-11 · unreviewed · ref 100
