Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs
Pith reviewed 2026-05-23 02:33 UTC · model grok-4.3
The pith
The best MLLM reaches only 54% on digitized human tests of basic visual cognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VisFactor digitizes 20 vision-centric subtests from the Factor-Referenced Cognitive Test (FRCT) across four domains of human visual cognition and supplies algorithms for automatic, controllable test-case generation. When 39 frontier MLLMs are evaluated, the best model scores 54.0%, and all models fail consistently on mental rotation, spatial relation inference, and figure-ground discrimination, independent of size or prompting strategy. The benchmark shows a Cronbach's alpha of 0.94 and construct validity relative to existing vision benchmarks, implying that improvements on general benchmarks may not reflect genuine human-like visual cognition.
What carries the argument
VisFactor, the benchmark that digitizes 20 FRCT subtests and automatically generates validated test cases with controllable difficulty.
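The paper's generation algorithms are not reproduced on this page. Purely as a hypothetical illustration of what "automatic generation with validation and a difficulty knob" can mean, the sketch below builds a pattern-search item on a binary grid and derives its label from an exact-match checker. The grid encoding, the `difficulty` parameter, and the names `make_item` and `contains_pattern` are assumptions made for this example, not the authors' method.

```python
import random


def contains_pattern(grid, pattern):
    """Sliding-window exact match: no rotation, flip, or scaling."""
    grid_h, grid_w = len(grid), len(grid[0])
    pat_h, pat_w = len(pattern), len(pattern[0])
    for r in range(grid_h - pat_h + 1):
        for c in range(grid_w - pat_w + 1):
            if all(grid[r + i][c:c + pat_w] == pattern[i] for i in range(pat_h)):
                return True
    return False


def make_item(difficulty, embed, seed):
    """Hypothetical generator: a larger `difficulty` gives a larger search grid."""
    rng = random.Random(seed)  # seeded so the same arguments give the same item
    pattern = [[rng.randint(0, 1) for _ in range(3)] for _ in range(3)]
    size = 4 + 2 * difficulty
    grid = [[rng.randint(0, 1) for _ in range(size)] for _ in range(size)]
    if embed:
        # Paste the 3x3 pattern at a random position that fits inside the grid.
        r, c = rng.randrange(size - 2), rng.randrange(size - 2)
        for i in range(3):
            grid[r + i][c:c + 3] = pattern[i]
    # Validation step: the label comes from the checker, so a random grid that
    # happens to contain the pattern by accident is still labelled correctly.
    answer = contains_pattern(grid, pattern)
    return {"pattern": pattern, "grid": grid, "answer": answer}


item = make_item(difficulty=2, embed=True, seed=0)
```

Deriving the answer from the checker rather than from the `embed` flag is the design point: the stored label then always agrees with the rendered item, which is one simple way a generator can validate its own output.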
If this is right
- MLLMs trained directly on complex tasks bypass the bottom-up visual hierarchy humans rely on.
- High scores on existing general vision-language benchmarks can coexist with absence of basic visual primitives.
- Failures on mental rotation, spatial inference, and figure-ground tasks occur regardless of model scale or prompting.
- The benchmark demonstrates high internal consistency (Cronbach's alpha = 0.94) and construct validity.
- Current performance gains may represent superficial mastery rather than foundational visual competence.
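The internal-consistency figure cited above is Cronbach's alpha. As a rough illustration of how such a value is computed, the sketch below applies the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), to toy data. The matrix layout (rows as respondents, columns as items) and the numbers are assumptions for illustration; this page does not state the paper's exact unit of analysis.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a score matrix: rows = respondents, columns = items."""
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Toy data only: four respondents whose three item scores rise together.
toy_scores = [
    [0.20, 0.30, 0.25],
    [0.40, 0.50, 0.45],
    [0.60, 0.70, 0.65],
    [0.80, 0.90, 0.85],
]
alpha = cronbach_alpha(toy_scores)
```

In the toy matrix every item ranks the respondents identically, so alpha comes out at its maximum of 1; items that disagree about the ranking pull the value down.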
Where Pith is reading between the lines
- Models may continue to exhibit these gaps even as parameter counts increase unless training explicitly includes bottom-up visual primitives.
- Downstream errors in robotics, navigation, or diagram interpretation could trace directly to the identified failures.
- The automatic generation method could be reused to create similar benchmarks for auditory or temporal cognition in AI systems.
Load-bearing premise
The 20 digitized FRCT subtests and the automatic generation algorithm faithfully capture foundational visual primitives without introducing artifacts that change task difficulty.
What would settle it
Human participants scoring substantially lower on the digitized VisFactor versions than on the original paper FRCT subtests, or any MLLM reaching human-level accuracy on VisFactor while still failing the same tasks in real-world spatial applications.
Original abstract
Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT, a well-established cognitive psychology assessment spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 39 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 54.0%. Analysis reveals good internal consistency (Cronbach's alpha = 0.94) and construct validity (compared to existing vision benchmarks). Models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance improvements on existing general benchmarks might represent castles in the air instead of a genuine mastery of human-like visual cognition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VisFactor, a benchmark that digitizes 20 vision-centric subtests from the FRCT cognitive assessment across four domains of human visual cognition. It evaluates 39 MLLMs (proprietary and open-source) using automatically generated test cases with controllable difficulty, reporting that the best model scores only 54.0% with consistent failures on mental rotation, spatial relation inference, and figure-ground discrimination regardless of scale or prompting. The benchmark exhibits high internal consistency (Cronbach's alpha = 0.94) and construct validity against existing vision benchmarks, suggesting MLLMs bypass bottom-up visual hierarchies.
Significance. If the digitized subtests and generation algorithm preserve the original task demands and human performance baselines, the results would provide concrete evidence that gains on general vision-language benchmarks do not reflect mastery of foundational visual primitives. The automatic generation of unlimited controllable test cases is a methodological strength that supports scalability and reproducibility.
major comments (2)
- [Methods] Methods (automatic construction and validation): The central claim that low MLLM scores (54%) reflect missing foundational primitives requires that the generated items match original FRCT demands for humans; however, no human accuracy results on the VisFactor items themselves are reported, only internal consistency and correlation with other benchmarks. This leaves open the possibility that digitization or generation artifacts alter difficulty.
- [Results] Results (54.0% aggregate and per-task failures): The headline performance figure and the analysis of systematic failures on mental rotation etc. are presented without per-subtest human baselines, error bars, or details on test-case counts per domain, making it difficult to attribute gaps specifically to MLLM visual processing rather than benchmark construction.
minor comments (2)
- [Abstract] Abstract: The phrase 'construct validity (compared to existing vision benchmarks)' does not name the specific benchmarks or report the correlation coefficients.
- The manuscript would benefit from a table listing the 20 subtests with their domain assignments and example generated items.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments below and indicate where revisions will be made to strengthen the manuscript.
- Referee: [Methods] Methods (automatic construction and validation): The central claim that low MLLM scores (54%) reflect missing foundational primitives requires that the generated items match original FRCT demands for humans; however, no human accuracy results on the VisFactor items themselves are reported, only internal consistency and correlation with other benchmarks. This leaves open the possibility that digitization or generation artifacts alter difficulty.
  Authors: We agree that direct human performance data on the newly generated VisFactor items would provide the strongest confirmation that task demands are preserved. The current manuscript relies on the established human baselines from the original FRCT, high internal consistency (Cronbach's alpha = 0.94), and construct validity correlations with existing vision benchmarks. To address this concern, we will add a dedicated validation subsection describing the generation algorithm's fidelity checks (e.g., parameter matching to FRCT specifications) and, where feasible, report pilot human accuracy on a sample of items. revision: partial
- Referee: [Results] Results (54.0% aggregate and per-task failures): The headline performance figure and the analysis of systematic failures on mental rotation etc. are presented without per-subtest human baselines, error bars, or details on test-case counts per domain, making it difficult to attribute gaps specifically to MLLM visual processing rather than benchmark construction.
  Authors: We will revise the results section to include: (1) per-subtest human baselines drawn from the FRCT literature where available, (2) error bars or confidence intervals on all aggregate and per-domain scores, and (3) explicit counts of test cases generated per domain and difficulty level. These additions will make the attribution of failures to MLLM visual processing more transparent while preserving the core finding that even the best model reaches only 54%. revision: yes
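The rebuttal promises error bars or confidence intervals on accuracy scores. One common choice for a proportion is the Wilson score interval, sketched below. It assumes each test case is scored as simply correct or incorrect and that cases are independent; the count of 1000 items in the usage line is hypothetical, since this page does not report test-case counts.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: 540 correct out of 1000 items (54.0% accuracy).
low, high = wilson_interval(540, 1000)
```

Unlike the plain normal approximation, the Wilson interval stays inside [0, 1] and behaves sensibly for small per-subtest samples, which matters if some subtests have few generated items.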
Circularity Check
No significant circularity; empirical benchmark evaluation is self-contained
The paper introduces VisFactor as a new benchmark by digitizing existing FRCT subtests and reports direct empirical scores (e.g., a 54.0% ceiling) on 39 MLLMs. It contains no equations, fitted parameters, or derivations that would reduce the reported performance or validity claims to their inputs by construction. FRCT is an established external assessment, with no indicated author overlap and no self-citation that the central result depends on. Internal consistency metrics and benchmark correlations are standard empirical checks, not circular reductions. The derivation chain consists of benchmark construction followed by model evaluation and is independent of the target claims.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: FRCT subtests validly isolate bottom-up visual primitives that are prerequisite for human-like visual cognition.