Explain before you answer: A survey on compositional visual reasoning

Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, et al. · 2025 · arXiv 2508.17298

9 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Dynamic Execution Commitment of Vision-Language-Action Models

cs.CV · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.

CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

Introduces CORTEX benchmark supplying 76,177 validated four-stage diagnostic reasoning traces for open/closed VQA and report generation on chest CT to enable traceable MLLM supervision and evaluation.

The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm

cs.CV · 2026-04-22 · unverdicted · novelty 6.0 · 2 refs

Proposes the Modality Translation Protocol with metrics ToS, CoS, FoS and SSC to quantify visual knowledge bottlenecks in VLMs, plus a Divergence Law hypothesis that scaling language models may increase the penalty.

Mull-Tokens: Modality-Agnostic Latent Thinking

cs.CV · 2025-12-11 · unverdicted · novelty 6.0

Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.

Visual Compositional Tuning

cs.CV · 2025-04-30 · unverdicted · novelty 6.0

COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.

GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes

cs.CV · 2026-05-22 · unverdicted · novelty 5.0

GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

cs.LG · 2026-04-14 · unverdicted · novelty 5.0

Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.

ARIS: Agentic and Relationship Intelligence System for Social Robots

cs.RO · 2026-05-01 · unverdicted · novelty 4.0

ARIS integrates a graph-based Social World Model, RAG, and agentic architecture for social robots and reports higher user ratings for intelligence, animacy, anthropomorphism, and likeability than an LLM baseline in a 23-person study with the Pepper robot.

Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification

cs.AI · 2026-04-18

citing papers explorer

Showing 7 of 7 citing papers after filters.

Dynamic Execution Commitment of Vision-Language-Action Models · cs.CV · 2026-05-12 · unverdicted · none · ref 33 · 3 links
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs · cs.CV · 2026-06-25 · unverdicted · none · ref 11
Introduces CORTEX benchmark supplying 76,177 validated four-stage diagnostic reasoning traces for open/closed VQA and report generation on chest CT to enable traceable MLLM supervision and evaluation.
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm · cs.CV · 2026-04-22 · unverdicted · none · ref 10 · 2 links
Proposes the Modality Translation Protocol with metrics ToS, CoS, FoS and SSC to quantify visual knowledge bottlenecks in VLMs, plus a Divergence Law hypothesis that scaling language models may increase the penalty.
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes · cs.CV · 2026-05-22 · unverdicted · none · ref 24
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding · cs.LG · 2026-04-14 · unverdicted · none · ref 68
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
ARIS: Agentic and Relationship Intelligence System for Social Robots · cs.RO · 2026-05-01 · unverdicted · none · ref 16
ARIS integrates a graph-based Social World Model, RAG, and agentic architecture for social robots and reports higher user ratings for intelligence, animacy, anthropomorphism, and likeability than an LLM baseline in a 23-person study with the Pepper robot.
Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification · cs.AI · 2026-04-18 · unreviewed · ref 26
