The StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems. It finds a persistent rule-to-instance binding gap even when both perception and the rule are correct.
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
Representative citing papers
The paper presents ChildAgentEval as the first psychometrically grounded benchmark comparing MLLM-based agents' reasoning performance to age-specific human cognitive stages.
FineSightBench reveals VLMs perceive patterns down to 12px but show persistent failures in fine-scale reasoning such as numeracy and sequencing.
Frozen multimodal embeddings with trait-specific late fusion cut personality prediction MSE by 19% relative to baseline in the 2026 AVI challenge, while cognitive results are attributed to validation shortcuts rather than content-based inference.