4 Pith papers cite this work.
years: 2026 (4)
representative citing papers
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
Frozen multimodal embeddings with trait-specific late fusion cut personality prediction MSE by 19% relative to baseline in the 2026 AVI challenge, while cognitive results are attributed to validation shortcuts rather than content-based inference.
citing papers explorer
- StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning
  StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.
- Evaluating Cognitive Age Alignment in Interactive AI Agents
  The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
- The Last Visible Pixel: Probing Fine-Scale Perception in Vision-Language Models
  FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.