A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
Explain before you answer: A survey on compositional visual reasoning
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Introduces CORTEX benchmark supplying 76,177 validated four-stage diagnostic reasoning traces for open/closed VQA and report generation on chest CT to enable traceable MLLM supervision and evaluation.
Proposes the Modality Translation Protocol with metrics ToS, CoS, FoS and SSC to quantify visual knowledge bottlenecks in VLMs, plus a Divergence Law hypothesis that scaling language models may increase the penalty.
Mull-Tokens are modality-agnostic latent tokens that enable free-form multimodal thinking and deliver up to 16% gains on spatial reasoning benchmarks.
COMPACT synthesizes compositional visual instruction data to reduce VIT training data by 90% while achieving 100.2% of full performance across eight multimodal benchmarks.
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
ARIS integrates a graph-based Social World Model, RAG, and agentic architecture for social robots and reports higher user ratings for intelligence, animacy, anthropomorphism, and likeability than an LLM baseline in a 23-person study with the Pepper robot.
citing papers explorer
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 reframes dynamic action chunk commitment in VLA models as self-speculative prefix verification, accepting the longest continuous sequence of actions that satisfies consensus-ordered conditional invariance and prefix-closed sequential consistency.
-
CORTEX: A Structured Reasoning Benchmark for Trustworthy 3D Chest CT MLLMs
Introduces CORTEX benchmark supplying 76,177 validated four-stage diagnostic reasoning traces for open/closed VQA and report generation on chest CT to enable traceable MLLM supervision and evaluation.
-
The Expense of Seeing: Attaining Trustworthy Multimodal Reasoning Within the Monolithic Paradigm
Proposes the Modality Translation Protocol with metrics ToS, CoS, FoS and SSC to quantify visual knowledge bottlenecks in VLMs, plus a Divergence Law hypothesis that scaling language models may increase the penalty.
-
GlowGS: Generative Semantic Feature Learning for 3D Gaussian Splatting in Nighttime Glow Scenes
GlowGS improves 3D Gaussian Splatting in nighttime glow scenes via semantic feature generation from diffusion models and novel-view semantic learning with vision foundation models.
-
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding
Using lexical concreteness to guide contrastive negative mining and a new margin-based Cement loss, the Slipform framework reaches state-of-the-art on compositional benchmarks for vision-language models.
-
ARIS: Agentic and Relationship Intelligence System for Social Robots
ARIS integrates a graph-based Social World Model, RAG, and agentic architecture for social robots and reports higher user ratings for intelligence, animacy, anthropomorphism, and likeability than an LLM baseline in a 23-person study with the Pepper robot.
- Rule-VLN: Bridging Perception and Compliance via Semantic Reasoning and Geometric Rectification