Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs

Dasen Dai; Haodong Duan; Jen-tse Huang; Jen-Yuan Huang; Pinjia He; Wenxiang Jiao; Wenxuan Wang; Xiaoyuan Liu; Youliang Yuan; Zhaopeng Tu

arxiv: 2502.16435 · v4 · submitted 2025-02-23 · 💻 cs.CV · cs.CL

Pith reviewed 2026-05-23 02:33 UTC · model grok-4.3

keywords: multimodal large language models · visual cognition · cognitive benchmark · mental rotation · spatial reasoning · figure-ground discrimination · VisFactor · FRCT

The pith

The best MLLM reaches only 54% on digitized human tests of basic visual cognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper digitizes 20 subtests from the FRCT cognitive assessment into VisFactor to measure whether MLLMs possess the bottom-up visual primitives humans develop first. Evaluation of 39 models shows the highest score is 54%, with repeated failures on mental rotation, spatial relation inference, and figure-ground discrimination. These shortfalls persist across model sizes and prompting methods, and the benchmark shows strong internal consistency and validity against other vision tests. The results indicate that gains on complex downstream tasks do not demonstrate mastery of the foundational visual skills that precede semantics in human perception.

Core claim

VisFactor digitizes 20 vision-centric subtests from FRCT across four domains of human visual cognition and supplies algorithms for automatic, controllable test-case generation. When 39 frontier MLLMs are evaluated, the best model scores 54.0% and all models fail consistently on mental rotation, spatial relation inference, and figure-ground discrimination, independent of size or prompting strategy. The benchmark exhibits Cronbach's alpha of 0.94 and construct validity relative to existing vision benchmarks, implying that improvements on general benchmarks may not reflect genuine human-like visual cognition.
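
For readers unfamiliar with the reliability statistic quoted above, here is a minimal sketch of how Cronbach's alpha is conventionally computed from a respondents-by-items score matrix: alpha = k/(k-1) * (1 - sum of item variances / variance of total scores). It assumes each row is one evaluated model and each column one subtest. How the paper itself computed the statistic is not described on this page, and the function name and toy data are illustrative only.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix.

    scores: list of rows, one per respondent (assumed here: one per model);
    each row holds that respondent's score on each of the k items (subtests).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least 2 items and 2 respondents")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[i] for row in scores]) for i in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Toy data, invented for illustration (not taken from the paper):
# three subtests that rise and fall together across four models.
toy_scores = [
    [0.20, 0.30, 0.25],
    [0.40, 0.50, 0.45],
    [0.60, 0.70, 0.65],
    [0.80, 0.90, 0.85],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 3))
```

Because the toy columns are perfectly correlated, alpha comes out at its maximum; real score matrices with noisier subtests give lower values.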

What carries the argument

VisFactor, the benchmark that digitizes 20 FRCT subtests and automatically generates validated test cases with controllable difficulty.
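
To make "controllable difficulty" concrete, the sketch below generates a grid-tracing item in the spirit of the CF3 Copying Test, where the grid size and the number of line segments act as difficulty knobs and the ground-truth answer follows from the construction. This is a hypothetical illustration, not the authors' generation algorithm: the function names, the move set, and the dictionary layout are all assumptions.

```python
import random

# Eight grid directions (row delta, column delta): an assumed move set.
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1), (-1, -1), (-1, 1), (1, -1), (1, 1)]


def generate_copying_case(grid_size=5, n_segments=4, seed=None):
    """Build one hypothetical test case: a start dot, a path, and the end dot."""
    if grid_size < 2:
        raise ValueError("grid_size must be at least 2")
    rng = random.Random(seed)  # seeded so cases are reproducible
    start = (rng.randrange(grid_size), rng.randrange(grid_size))
    pos, moves = start, []
    while len(moves) < n_segments:
        # Keep only moves that stay on the grid.
        legal = [(dr, dc) for dr, dc in MOVES
                 if 0 <= pos[0] + dr < grid_size and 0 <= pos[1] + dc < grid_size]
        step = rng.choice(legal)
        moves.append(step)
        pos = (pos[0] + step[0], pos[1] + step[1])
    return {"grid_size": grid_size, "start": start, "moves": moves, "answer": pos}


def validate_case(case):
    """Replay the path and confirm it stays on the grid and ends at the answer."""
    r, c = case["start"]
    n = case["grid_size"]
    for dr, dc in case["moves"]:
        r, c = r + dr, c + dc
        if not (0 <= r < n and 0 <= c < n):
            return False
    return (r, c) == case["answer"]


# A harder variant: a larger grid and a longer path.
hard_case = generate_copying_case(grid_size=6, n_segments=6, seed=0)
print(validate_case(hard_case))
```

Separating generation from validation mirrors the "construct and validate" wording of the abstract: the validator replays each item independently of the generator's internal state.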

If this is right

MLLMs trained directly on complex tasks bypass the bottom-up visual hierarchy humans rely on.
High scores on existing general vision-language benchmarks can coexist with the absence of basic visual primitives.
Failures on mental rotation, spatial inference, and figure-ground tasks occur regardless of model scale or prompting.
The benchmark demonstrates high internal consistency (Cronbach's alpha = 0.94) and construct validity.
Current performance gains may represent superficial mastery rather than foundational visual competence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may continue to exhibit these gaps even as parameter counts increase unless training explicitly includes bottom-up visual primitives.
Downstream errors in robotics, navigation, or diagram interpretation could trace directly to the identified failures.
The automatic generation method could be reused to create similar benchmarks for auditory or temporal cognition in AI systems.

Load-bearing premise

The 20 digitized FRCT subtests and the automatic generation algorithm faithfully capture foundational visual primitives without introducing artifacts that change task difficulty.

What would settle it

Human participants scoring substantially lower on the digitized VisFactor versions than on the original paper FRCT subtests, or any MLLM reaching human-level accuracy on VisFactor while still failing the same tasks in real-world spatial applications.

Figures

Figures reproduced from arXiv: 2502.16435 by Dasen Dai, Haodong Duan, Jen-tse Huang, Jen-Yuan Huang, Pinjia He, Wenxiang Jiao, Wenxuan Wang, Xiaoyuan Liu, Youliang Yuan, Zhaopeng Tu.

**Figure 2.** VISFACTOR comprises 20 vision-centric cognitive subtests. Each task is designed to isolate core factors of human visual cognition, covering 10 distinct factors in total. The subtests are converted into either yes/no questions or fill-in-the-blank questions according to §2.3. Example stimuli, questions, and ground-truth answers are shown for each task.

**Figure 2.** Samples of our generated images. We can dynamically adjust test difficulties in VISFACTOR. For example, the grid size of CF3 is changed to 6 × 6 instead of 5 × 5. folding. (iv) Claude performs best on CS2, MV1, MV3, S2, and SS3. (v) Seed achieves the top score on CS3 and MV2. Model size and recency do not guarantee superior performance. For example, Qwen-2.5-72B is surpassed by both the smaller Qwen-2.5-3…

**Figure 3.** An example of our generated MA1 image-number pairs using CF2 and MV1 figures.

**Figure 4.** VISFACTOR integrates 20 subtests adapted from standardized human cognitive assessments. Subtests are organized into four major domains and weighted by test case count (shown numerically), which determines each segment's visual area.

**Figure 5.** Pearson correlation between all subtests in VISFACTOR.
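
As a reading aid for the correlation figure, the following sketch shows how a pairwise Pearson correlation matrix between subtests can be computed from per-model scores. The subtest codes are taken from this page, but the scores are invented, and the helper names are assumptions rather than anything from the paper's code.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def correlation_matrix(scores_by_subtest):
    """Map each pair of subtests to the correlation of their per-model scores."""
    names = list(scores_by_subtest)
    return {a: {b: pearson(scores_by_subtest[a], scores_by_subtest[b])
                for b in names}
            for a in names}


# Invented scores for four hypothetical models on three subtests.
toy = {
    "CF2": [0.31, 0.42, 0.55, 0.61],
    "MV1": [0.28, 0.47, 0.52, 0.66],
    "VZ2": [0.50, 0.48, 0.53, 0.49],
}
matrix = correlation_matrix(toy)
for name, row in matrix.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```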

Original abstract

Humans develop perception through a bottom-up hierarchy: from basic primitives and Gestalt principles to high-level semantics. In contrast, current Multimodal Large Language Models (MLLMs) are trained directly on complex downstream tasks, often bypassing these foundational visual capabilities. To systematically investigate this gap, we introduce VisFactor, a benchmark that digitizes 20 vision-centric subtests from FRCT, a well-established cognitive psychology assessment spanning four domains of human visual cognition. Furthermore, we design algorithms to automatically construct and validate unlimited test cases with controllable difficulty. Using VisFactor, we evaluate 39 frontier MLLMs, including both proprietary (e.g., GPT, Gemini) and open-source (e.g., LLaMA, Qwen) models. The best model achieves a score of only 54.0%. Analysis reveals good internal consistency (Cronbach's alpha = 0.94) and construct validity (compared to existing vision benchmarks). Models consistently fail on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination, regardless of model size or prompting strategy. These findings suggest that performance improvements on existing general benchmarks might represent castles in the air instead of a genuine mastery of human-like visual cognition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VisFactor, a benchmark that digitizes 20 vision-centric subtests from the FRCT cognitive assessment across four domains of human visual cognition. It evaluates 39 MLLMs (proprietary and open-source) using automatically generated test cases with controllable difficulty, reporting that the best model scores only 54.0% with consistent failures on mental rotation, spatial relation inference, and figure-ground discrimination regardless of scale or prompting. The benchmark exhibits high internal consistency (Cronbach's alpha = 0.94) and construct validity against existing vision benchmarks, suggesting MLLMs bypass bottom-up visual hierarchies.

Significance. If the digitized subtests and generation algorithm preserve the original task demands and human performance baselines, the results would provide concrete evidence that gains on general vision-language benchmarks do not reflect mastery of foundational visual primitives. The automatic generation of unlimited controllable test cases is a methodological strength that supports scalability and reproducibility.

major comments (2)

[Methods] Methods (automatic construction and validation): The central claim that low MLLM scores (54%) reflect missing foundational primitives requires that the generated items match original FRCT demands for humans; however, no human accuracy results on the VisFactor items themselves are reported, only internal consistency and correlation with other benchmarks. This leaves open the possibility that digitization or generation artifacts alter difficulty.
[Results] Results (54.0% aggregate and per-task failures): The headline performance figure and the analysis of systematic failures on mental rotation etc. are presented without per-subtest human baselines, error bars, or details on test-case counts per domain, making it difficult to attribute gaps specifically to MLLM visual processing rather than benchmark construction.

minor comments (2)

[Abstract] Abstract: The phrase 'construct validity (compared to existing vision benchmarks)' does not name the specific benchmarks or report the correlation coefficients.
The manuscript would benefit from a table listing the 20 subtests with their domain assignments and example generated items.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Methods] Methods (automatic construction and validation): The central claim that low MLLM scores (54%) reflect missing foundational primitives requires that the generated items match original FRCT demands for humans; however, no human accuracy results on the VisFactor items themselves are reported, only internal consistency and correlation with other benchmarks. This leaves open the possibility that digitization or generation artifacts alter difficulty.

Authors: We agree that direct human performance data on the newly generated VisFactor items would provide the strongest confirmation that task demands are preserved. The current manuscript relies on the established human baselines from the original FRCT, high internal consistency (Cronbach's alpha = 0.94), and construct validity correlations with existing vision benchmarks. To address this concern, we will add a dedicated validation subsection describing the generation algorithm's fidelity checks (e.g., parameter matching to FRCT specifications) and, where feasible, report pilot human accuracy on a sample of items. revision: partial

Referee: [Results] Results (54.0% aggregate and per-task failures): The headline performance figure and the analysis of systematic failures on mental rotation etc. are presented without per-subtest human baselines, error bars, or details on test-case counts per domain, making it difficult to attribute gaps specifically to MLLM visual processing rather than benchmark construction.

Authors: We will revise the results section to include: (1) per-subtest human baselines drawn from the FRCT literature where available, (2) error bars or confidence intervals on all aggregate and per-domain scores, and (3) explicit counts of test cases generated per domain and difficulty level. These additions will make the attribution of failures to MLLM visual processing more transparent while preserving the core finding that even the best model reaches only 54%. revision: yes
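
One common way to attach an interval to an accuracy figure is the Wilson score interval for a binomial proportion, sketched below. This is illustrative only: the rebuttal does not say which interval the authors would use, and the count of 1,000 test cases is a made-up number chosen so that the point estimate equals 54%.

```python
from math import sqrt


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Hypothetical counts: 540 correct out of 1,000 items (not from the paper).
low, high = wilson_interval(540, 1000)
print(f"accuracy 0.540, interval ({low:.3f}, {high:.3f})")
```

The interval narrows as the number of test cases grows, which is one reason per-domain test-case counts matter when comparing models whose scores are close.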

Circularity Check

0 steps flagged

No significant circularity; empirical benchmark evaluation is self-contained

The paper introduces VisFactor as a new benchmark by digitizing existing FRCT subtests and reports direct empirical scores (e.g., 54.0% ceiling) on 39 MLLMs. No equations, fitted parameters, or derivations are present that reduce reported performance or validity claims to inputs by construction. FRCT is an external established assessment with no indicated author overlap or self-citation load-bearing the central result. Internal consistency metrics and benchmark correlations are standard empirical checks, not circular reductions. The derivation chain consists of benchmark construction followed by model evaluation and is independent of the target claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that FRCT subtests measure foundational visual capabilities independent of the complex semantics MLLMs are trained on, and that the automatic construction algorithm preserves the original test properties.

axioms (1)

Domain assumption: FRCT subtests validly isolate bottom-up visual primitives that are prerequisite for human-like visual cognition.
Invoked in the abstract when contrasting human bottom-up hierarchy with MLLM training and when claiming the benchmark reveals foundational gaps.

[1] [1]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

URL https://qwenlm.github.io/blog/ qwen-vl/. Team, Q. Qwen3-vl: Sharper vision, deeper thought, broader action.Qwen Blogs Sep 22 2025, 2025. URL https://qwen.ai/blog?id=qwen3-vl. Thurstone, L. L. Primary mental abilities:.Psychology Monographs, 1, 1938. Thurstone, L. L.A factorial study of perception.The University of Chicago Press, 1944. Thurstone, L. L....

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

A circle inside another: all things in the inner group belong to the outer group

work page

[3] [3]

Circles that overlap partly: the two groups share some, but not all, things

work page

[4] [4]

TRUE” if it shows the rela- tionships for the three groups, “FALSE

Circles that do not touch: the two groups share nothing. Task: Decide whether the image follows these rules for the three groups: Desks, furniture, pencils. Output: Respond with only one word: “TRUE” if it shows the rela- tionships for the three groups, “FALSE” if it does not, in JSON format as follows:{"answer": YOUR ANSWER HERE}. 25 Human Cognitive Benc...

work page 1954

[5] [5]

Every face shows a different letter, number, or symbol

Each cube has six faces. Every face shows a different letter, number, or symbol

work page

[6] [6]

TRUE” or “FALSE

Hidden faces may show any symbols, but no symbol appears on more than one face of the same cube. Task: Decide whether the following statement is true or false: the first cube is a certain view of the second cube after it is turned. (!!!) Three other prompts are: (1) the first cube is not any view of the second cube no matter how it is turned (2) the secon...

work page 1974

[7] [7]

You may switch lines only where a black dot is drawn

work page

[8] [8]

Lines that cross or touch without a dot are not connected

work page

[9] [9]

Task: For box E, decide if there is one continuous line that:

The path must stay inside the chosen box and must not stop at a dead-end. Task: For box E, decide if there is one continuous line that:

work page

[10] [10]

Starts at S inside that box

work page

[11] [11]

Reaches the single circle at the top

work page

[12] [12]

TRUE” if box E meets all the rules, “FALSE

Comes back to F inside the same box without entering any other box. Output: Respond with only one word: “TRUE” if box E meets all the rules, “FALSE” if it does not, in JSON format as follows:{"answer": YOUR ANSWER HERE}. Prompt for SS3: Map Planning Test Look at the city map shown in the image below: In the map:

work page

[13] [13]

Streets = black lines

work page

[14] [14]

Circles = road-blocks (you cannot cross there)

work page

[15] [15]

Task: Find the shortest street route from F to T

Numbered squares = buildings. Task: Find the shortest street route from F to T. Rules:

work page

[16] [16]

The route will always touch the side of one and only one numbered building

work page

[17] [17]

Touching only a corner does not count

work page

[18] [18]

The ability to manipulate or transform the im- age of spatial patterns into other arrangements

Move only along streets (horizontal or vertical), never through circles. Output: Respond with only one number: the number on the build- ing your shortest route touches, in JSON format as follows: {"answer": YOUR ANSWER HERE}. 27 Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs F.10. Visualization (VZ) The Factor: “The ability to manipul...

work page 1971

[19] [19]

Use 2–5 of the pieces to fill the figure exactly

work page

[20] [20]

TRUE” if it is or “FALSE

You may rotate pieces but do not flip them. Task: Decide whether the Fifth piece is in the set of pieces that makes the figure. Output: Respond with only one word: “TRUE” if it is or “FALSE” if it is not, in JSON format as follows:{"answer": YOUR ANSWER HERE}. Prompt for VZ2: Paper Folding Test Look at the two images: Below is the first image, a step-by-s...

work page

[21] [21]

Do not flip or rotate the paper except for the folds shown

Mentally follow every fold in the first image exactly as drawn. Do not flip or rotate the paper except for the folds shown

work page

[22] [22]

Imagine a hole being punched through all layers where each circle is drawn

work page

[23] [23]

Unfold the paper, step by step, in reverse order of the folds, keeping the sheet’s original orientation

work page

[24] [24]

After it is flat, note where every hole should appear on the sheet

work page

[25] [25]

TRUE” if every hole (num- ber and position) in the second image matches your mental result exactly, otherwise “FALSE

Compare this mental result with the pattern of holes in the second image. Output: Respond with only one word: “TRUE” if every hole (num- ber and position) in the second image matches your mental result exactly, otherwise “FALSE”, in JSON format as follows: {"answer": YOUR ANSWER HERE}. Prompt for VZ3: Surface Development Test Look at the two images: Below...

1. **Identify the Pattern**: Examine the small shape in the first image and record its exact pixel or cell configuration (e.g., a 2D grid of colors or pixels).
2. **Scan the Larger Image**: Systematically slide a window of the same size as the first image over the second image, checking each possible sub-region.
3. **Compare**: For each sub-region, check if it matches the pattern from the first image exactly: no rotation, flip, or size change allowed.
4. **Decision**: If an exact match is found, output `"answer": "TRUE"`. If no match is found after scanning the entire larger image, output `"answer": "FALSE"`.

Solution to CF2: Hidden Patterns Test
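Both the scan just described and the Hidden Patterns steps that follow reduce to an exact-match window search. The sketch below is illustrative only: it assumes the two images have already been reduced to small grids of integer cell values, which is a simplification rather than how the benchmark renders its items, and the names `contains_exact` and `answer` are hypothetical.

```python
from typing import Sequence

Grid = Sequence[Sequence[int]]  # assumed encoding: one integer per cell


def contains_exact(pattern: Grid, image: Grid) -> bool:
    """Return True if `pattern` appears in `image` as-is.

    No rotation, flip, or rescaling is attempted, mirroring the
    "exact match" requirement in the written steps.
    """
    ph, pw = len(pattern), len(pattern[0])
    ih, iw = len(image), len(image[0])
    # Slide a pattern-sized window from top-left to bottom-right.
    # If the pattern is larger than the image, the ranges are empty.
    for top in range(ih - ph + 1):
        for left in range(iw - pw + 1):
            if all(
                list(image[top + r][left:left + pw]) == list(pattern[r])
                for r in range(ph)
            ):
                return True
    return False


def answer(pattern: Grid, image: Grid) -> dict:
    """Wrap the result in the JSON-style answer format used by the prompts."""
    return {"answer": "TRUE" if contains_exact(pattern, image) else "FALSE"}


if __name__ == "__main__":
    image = [[0, 1, 0],
             [1, 1, 0],
             [0, 0, 1]]
    print(answer([[1, 0], [1, 0]], image))  # pattern present at row 0, column 1
    print(answer([[0, 0], [1, 1]], image))  # pattern absent in this orientation
```

The search is brute force, which is adequate for the small grids assumed here; it says nothing about how a model should perceive the figure in the first place.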

1. **Identify Model Dimensions**: Note the size (rows x columns) of the model in the first image.
2. **Scan Pattern Image**: Slide a window of the same dimensions across the second image (top-left to bottom-right).
3. **Check for Exact Match**: At each position, compare the sub-section of the pattern with the model.
4. **No Rotation or Flip**: Ensure the comparison uses the model as-is, without any transformations.
5. **Return Result**: If an exact match is found, return `"answer": "TRUE"`. Otherwise, return `"answer": "FALSE"`.

Solution to CF3: Copying Test

1. **Observe the shape** in the first image and break it into straight line segments along the grid.
2. **Start at the circled dot** in the second image.
3. **Trace the same movements** (up/down/left/right/diagonal) from the start point, replicating the shape exactly by placing corners on the grid dots.
4. **Count steps carefully** to ensure each corner aligns with a grid dot as in the original shape.
5. **Record the final dot** reached after completing the entire shape.

Solution to CS1: Gestalt Completion Test

1. **Observe the drawing**: Look closely at the curved and linear segments to infer what object is being sketched.
2. **Look for familiar outlines**: Identify key features (shapes, proportions, and positioning) that suggest a common object (e.g., wheels, body, handles).
3. **Mentally complete the figure**: Use the partial lines to visualize what the full object would look like.
4. **Identify the object**: Based on the partial sketch, determine the most likely object.

Solution to CS2: Concealed Words Test

1. **Analyze the visible fragments**: Identify parts of letters that are still visible and match them to possible lowercase letters.
2. **Visualize missing parts**: Mentally fill in the gaps based on typical letter structures.
3. **Look for patterns**: Combine identified letters into a coherent word, considering common English words.
4. **Verify length**: Ensure the word is at least four letters long and uses only lowercase letters.

Solution to CS3: Snowy Pictures

1. **Identify visible features**: Focus on the parts that are not hidden: shape, color, structure, or details that hint at the object.
2. **Infer the whole object**: Use context and symmetry to mentally complete the object, even if part is obscured.
3. **Choose the most likely object**: Based on the visible portion and common objects with that appearance.

Solution to I3: Figure Classification

1. **Examine Group 1 and Group 2 figures**: Look for common traits shared within each group (e.g., shape count, orientation, fill patterns, symmetry).
2. **Identify the rule per group**: Determine what consistent rule applies to all three figures in each group (e.g., all shapes have a diagonal line, or all contain a specific number of elements).
3. **Compare rules between groups**: Make sure the rule is not shared across groups; each group must have a distinct rule.
4. **Analyze the figure to classify**: Determine which group’s rule the new figure follows.
5. **Assign it to the correct group**: Match the figure to the group with the corresponding visual rule.

Solution to MA1: Picture-Number Test

1. **Study the 21 picture-number pairs** in the first image: Memorize or note the associations between each unique picture and its corresponding number.
2. **Examine the picture in the second image**: Identify the object or scene shown.
3. **Match the second image** to one of the 21 pictures from the first image by comparing visual features.
4. **Retrieve the associated number** from the first image that corresponds to the matched picture.
5. **Return the number** in the required JSON format.

Solution to MV1: Shape Memory Test

1. **Memorize the shapes and orientations** in the first image: Focus on each shape’s design and the direction it’s facing (rotation or reflection).
2. **Examine the second image**: Identify the specific shape and its orientation shown here.
3. **Compare it to the memorized shapes**: Look for an exact match in both shape and orientation from the first image.
4. **Evaluate the statement**: Determine if the given claim (e.g., “This shape matches one from the first image”) accurately reflects what is shown.
5. **Decide if the statement is TRUE or FALSE** based on your comparison.

Solution to MV2: Building Memory

1. **Memorize the street map** in the first image: Note the position of each unique building relative to the street layout.
2. **Study the block labels** in the second image: Understand how the blocks (A–E) correspond to the same street layout from the first image.
3. **Identify the building** in the third image: Match its shape, size, and features to one on the original street map.
4. **Locate that building** on the labeled block map from the second image.
5. **Determine if it is in the specified block**: Compare its actual position to the named block in the question.

Solution to MV3: Map Memory

1. **Memorize the maps** in the first image: Focus on the layout of walls, paths, and any unique structures in each map.
2. **Examine the single map** in the second image: Pay attention to the same features: structure, layout, and orientation.
3. **Compare the second map** to the ones memorized: Check for exact matches or close similarities, including possible rotations or reflections.
4. **Evaluate the statement**: Determine whether it correctly asserts a match (or lack thereof) between the second map and any from the first image.
5. **Answer TRUE or FALSE** depending on whether the claim aligns with your comparison.

Solution to P3: Identical Pictures Test

1. **Study the target object** in the first image: Note its overall shape, orientation, components, and details.
2. **Examine the test object** in the second image: Observe the same features: shape, structure, and orientation.
3. **Compare both objects** precisely: Check for any differences in angles, positioning, parts, or missing elements.
4. **Determine exact match**: Decide if the test object is an identical copy of the target object in all aspects.

Solution to RL2: Diagramming Relationships

1. **Understand the group relationships described** in the statement (e.g., one group is a subset of another, or groups partially overlap or are completely separate).