A paired-image benchmark reveals that many MLLMs fail to update predictions when task-critical visual evidence changes, even when they answer individual images correctly.
Core knowledge deficits in multi-modal language models
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
VLMs show partial alignment with children's performance on six cognitive tasks, with stronger models matching better at task and item levels but struggling on matrix reasoning and mental rotation.
SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
citing papers explorer
-
VisualFLIP: Do Predictions Depend on Task-Critical Visual Evidence in Multimodal Reasoning?
A paired-image benchmark reveals that many MLLMs fail to update predictions when task-critical visual evidence changes, even when they answer individual images correctly.
-
LEVANTE-bench: Multi-Scale Comparison of VLMs to Children Using Cognitive Tasks (or, "Is Your VLM Smarter Than a 5th Grader?")
VLMs show partial alignment with children's performance on six cognitive tasks, with stronger models matching better at task and item levels but struggling on matrix reasoning and mental rotation.
-
SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes
SpatialAct benchmark shows VLMs handle isolated spatial reasoning but fail to maintain coherent spatial beliefs and produce reliable actions in multi-turn 3D interactions, underperforming humans.