arxiv: 2606.25066 · v1 · pith:PO5E5JHDnew · submitted 2026-06-23 · 💻 cs.AI · cs.CV

Do vision-language models search like humans? Reasoning tokens as a reaction-time analog in classic visual-search paradigms

Pith reviewed 2026-06-25 22:57 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords visual search, vision-language models, reasoning tokens, attention, psychophysics, feature search, conjunction search, enumeration

The pith

Vision-language models reproduce several human visual-search signatures when reasoning-token count is treated as a reaction-time analog, yet reverse the target-present versus target-absent effort ordering and maintain enumeration accuracy w

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts classic visual-search tasks—feature versus conjunction search, T-versus-L configuration search, enumeration, and orientation asymmetry—to current frontier and mid-tier vision-language models. It treats the number of reasoning tokens generated on each trial as a within-model proxy for the search effort that human reaction times measure in the Wolfe et al. 2010 benchmark. The models produce the expected flat effort function for feature search and rising effort for conjunction search; frontier models keep high accuracy while mid-tier models drop to chance; and a resolution control confirms the rising cost is not merely a perceptual-resolution limit. The same models diverge from humans by showing steeper target-present than target-absent slopes and by preserving accurate enumeration at larger set sizes. These parallels and divergences are offered as an inexpensive behavioral probe of machine visual cognition.

Core claim

When reasoning-token usage per trial is measured on feature, conjunction, T-vs-L, enumeration, and asymmetry displays and compared with human reaction-time data, token counts remain flat with set size in feature search but increase in conjunction search. This increase survives image enlargement, and frontier models sustain accuracy where mid-tier models collapse. Yet the target-present effort slope exceeds the target-absent slope (reversing the human ordering), and enumeration accuracy remains high at set sizes where humans would lose count. A reasoning model with adaptive deliberation skips deliberation on detection tasks, turning the same search into an effort gradient in one model and an accuracy cliff in an

What carries the argument

Reasoning-token count per trial, used as a within-model analog of search effort to compare against human reaction times.

If this is right

  • Feature search produces flat effort across set size while conjunction search produces rising effort, reproducing the human parallel-versus-serial distinction.
  • Frontier models maintain high accuracy on conjunction tasks at set sizes where mid-tier models fall to chance.
  • The conjunction-search cost survives enlargement of the stimuli, showing it is not explained by difficulty resolving small shapes.
  • Target-present effort slopes exceed target-absent slopes, the opposite of the human ordering.
  • Enumeration accuracy remains high at larger set sizes where human accuracy declines.
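
The flat-versus-rising distinction in the list above comes down to a slope: effort regressed on set size. Below is a minimal Python sketch of that computation, assuming per-set-size mean token counts are already in hand. The numbers are placeholders chosen for illustration, not values from the paper, the set sizes are borrowed from the T-vs-L design in the Figure 1 caption, and `ols_slope` is a hypothetical helper name. The Figure 2 caption fragment mentions ordinary least squares, but the paper's exact fitting procedure is not visible on this page.

```python
from statistics import mean


def ols_slope(xs, ys):
    """Return (slope, intercept) of the least-squares line of ys on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


# Placeholder values, NOT data from the paper.
set_sizes = [4, 8, 16, 32]                       # assumed set sizes
feature_tokens = [50.0, 50.0, 50.0, 50.0]        # flat effort
conjunction_tokens = [60.0, 100.0, 180.0, 340.0]  # 10 tokens per item + 20

feature_slope, _ = ols_slope(set_sizes, feature_tokens)
conj_slope, conj_intercept = ols_slope(set_sizes, conjunction_tokens)

print(f"feature slope:     {feature_slope:.2f} tokens/item")
print(f"conjunction slope: {conj_slope:.2f} tokens/item")
```

A slope near zero corresponds to the "parallel" signature and a clearly positive slope to the "serial" one; the unit here is tokens per item rather than milliseconds per item.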

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If token count genuinely indexes effort, models that can be instructed to vary deliberation depth might be used to test whether the same architecture can switch between effort-gradient and accuracy-cliff regimes on identical images.
  • The reversed target-present versus target-absent ordering could be tested by measuring whether models continue exhaustive inspection after locating a target, unlike humans, who terminate early.
  • Applying the identical token-count method to other psychophysical tasks such as visual short-term memory or multiple-object tracking could reveal whether the observed search signatures generalize to other attention-like behaviors.
  • The accuracy cliff in mid-tier models versus sustained performance in frontier models suggests a scale threshold at which internal representations become sufficient for serial-like search.

Load-bearing premise

The number of reasoning tokens a model spends per trial functions as a valid within-model analog of search effort that can be directly compared to human reaction times.

What would settle it

If the same models are forced to emit a fixed token budget on every trial, the set-size-dependent rise in token count for conjunction search should disappear while accuracy patterns remain unchanged.

Figures

Figures reproduced from arXiv: 2606.25066 by Farahnaz Wick.

Figure 1: Example displays for the five conditions. The first two (feature, conjunction) retain color; the last three remove it, isolating spatial-configuration search, enumeration, and a search asymmetry. The T-vs-L study placed a single black T among black Ls (set sizes 4, 8, 16, 32; present/absent; 25 per cell; 200 displays). The enumeration study placed one to four black Ts among black Ls (set sizes 8, 16, 32; 2…
Figure 2: Proportion correct by set size for humans and four model configurations. Feature search (green) is at ceiling for every model; conjunction accuracy (gray) falls with set size, gently for the mid-tier model with thinking and steeply for GPT-4o and o4-mini. Humans (leftmost) stay near-perfect in both conditions because they trade time for accuracy. Shaded bands are 95% Wilson CIs. by ordinary least squares o…
Figure 3: Human reaction time and frontier-model reasoning effort by set size. Feature search is flat and conjunction search climbs in all three panels. The currency differs (milliseconds for people, reasoning tokens for models) but the shape is shared. Shaded bands are 95% CIs; Claude Opus 4.8’s wide conjunction band reflects its erratic, near-floor token spending.
Figure 4: Human reaction time versus model reasoning effort across the sixteen matched cells, per thinking model; each point is one cell (green feature, gray conjunction; filled present, open absent). The correspondence is strong for GPT-5.5 and the mid-tier model, modest for o4-mini, and essentially absent for Claude Opus 4.8, whose effort sits near the floor. ρ = 0.91) and the mid-tier thinking model r = 0.63 (p =…
Figure 5: Top: one set-size-16 conjunction display at four detail levels, from crisp (left) to a few-pixel smear (right); color is preserved throughout. Bottom: reasoning effort at set size 32 across the detail ladder (x-axis: pixels carried before reblurring; higher is crisper). Feature effort (green) stays flat; conjunction effort (gray) is higher and rises only at the blurriest end, with the crisp baseline alread…
Figure 6: T-vs-L reasoning effort by set size, target present (solid) and absent (open), with 95% CIs. GPT-5.5’s effort climbs steeply, most so on absent trials; Claude remains flat near four tokens. (Overlaying the feature and conjunction curves from Experiment 1 places GPT-5.5’s T-vs-L line on its conjunction curve and Claude’s on its feature floor.)
Figure 7: Enumeration. Left block: mean reported count versus true count by set size, with the identity line. The leftmost panel is the pattern predicted for humans by subsequent-search-miss errors (illustrative, not measured); the two model panels are measured (bars are 95% CIs) and sit on the identity line. Right: mean reasoning tokens per trial by set size and target count; effort grows with both, and GPT-5.5 spe…
Figure 8: Search asymmetry on target-absent trials, both models per panel. Left: accuracy (bars are 95% Wilson CIs); only Claude breaks, falling to chance when ruling out a tilted bar among vertical ones. Right: reasoning tokens by set size with 95% CIs; only GPT-5.5 works harder, and only in that same hard direction.
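
The captions for Figures 2 and 8 report 95% Wilson confidence intervals on accuracy. As a rough illustration of what such a band represents, here is a minimal Python sketch of the standard Wilson score interval for a binomial proportion. The function name `wilson_interval` is hypothetical, and the example uses 25 trials in a cell (the per-cell count given for the T-vs-L study in the Figure 1 caption) purely as an assumed sample size; whether the paper computes its bands in exactly this way is not stated on this page.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical cell: 25 of 25 trials correct (ceiling performance).
lo_ci, hi_ci = wilson_interval(25, 25)
print(f"25/25 correct -> 95% Wilson CI [{lo_ci:.3f}, {hi_ci:.3f}]")
```

Unlike the simple normal-approximation interval, the Wilson interval does not collapse to zero width at 0% or 100% accuracy, which matters for ceiling-level cells such as feature search.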
Original abstract

Visual search has been one of the most productive paradigms in the study of visual attention: the way reaction time scales with the number of items distinguishes parallel, "pop-out" search from serial, attention-demanding search. I ask whether vision-language models (VLMs) exhibit the same behavioral signatures. I adapt four classic paradigms: feature versus conjunction search, spatial-configuration (T-vs-L) search, enumeration, and the tilted/vertical search asymmetry; and present them to current frontier and mid-tier models. Because a single model call has no reaction time, I use the number of reasoning ("thinking") tokens a model spends per trial as a within-model analog of search effort, and I compare against a large public human benchmark (Wolfe et al., 2010). The models reproduce several human signatures: feature search costs flat effort while conjunction effort climbs with set size; frontier models hold accuracy where mid-tier models collapse to chance; and a resolution control shows the conjunction cost is genuine search rather than difficulty resolving small shapes. They also diverge from humans in informative ways. The target-present effort slope exceeds the target-absent slope, reversing the human ordering; enumeration remains accurate where humans would lose count; and a reasoning model with adaptive deliberation declines to deliberate on detection tasks altogether, so that a single search expresses itself as an effort gradient in one model and as an accuracy cliff in another. I argue that psychophysical paradigms, applied behaviorally, are a sharp and inexpensive probe of machine visual cognition, and that the points of divergence are as informative as the points of agreement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper adapts classic visual search paradigms (feature vs. conjunction, T-vs-L, enumeration, search asymmetry) to frontier and mid-tier VLMs, using the count of reasoning tokens per trial as a within-model proxy for human reaction time. It compares results to the Wolfe et al. (2010) human benchmark and reports that models reproduce several signatures (flat feature-search effort, set-size-dependent conjunction effort, frontier models maintaining accuracy) while diverging in others (reversed target-present/absent slopes, preserved enumeration accuracy). A resolution control is cited to argue the conjunction cost reflects search rather than low-level resolution difficulty.

Significance. If the token-count measure can be shown to index visual search effort rather than output-generation length, the work supplies a low-cost, structured behavioral assay for probing machine visual cognition against established human benchmarks. The explicit comparison to a public human dataset and the documentation of both alignments and divergences constitute a useful contribution to the growing literature on model psychophysics.

major comments (3)
  1. [Abstract] The central claim that reasoning-token count functions as a valid analog of search effort is load-bearing for all reported signatures, yet the text provides no ablation that holds output format and instruction constant while varying only visual-search demand; the resolution control rules out pixel-level difficulty but leaves linguistic or enumeration confounds unaddressed.
  2. [Abstract] The reported qualitative matches to human patterns (flat feature-search cost, rising conjunction cost) are presented without quantitative slopes, error bars, statistical tests, or full methods, making it impossible to assess effect sizes or reproducibility against the Wolfe et al. (2010) benchmark.
  3. [Abstract] The divergence that 'target-present effort slope exceeds the target-absent slope' reverses the canonical human ordering; without quantitative data or a stated statistical criterion, it is unclear whether this constitutes a reliable model-human difference or an artifact of token-count measurement.
minor comments (1)
  1. [Abstract] The abstract refers to 'a large public human benchmark (Wolfe et al., 2010)' but does not specify which exact conditions or dependent measures were extracted for comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the paper accordingly where the concerns identify gaps in the current presentation or controls.

Point-by-point responses
  1. Referee: [Abstract] The central claim that reasoning-token count functions as a valid analog of search effort is load-bearing for all reported signatures, yet the text provides no ablation that holds output format and instruction constant while varying only visual-search demand; the resolution control rules out pixel-level difficulty but leaves linguistic or enumeration confounds unaddressed.

    Authors: We agree that an ablation isolating visual-search demand while holding output format and instruction fixed would provide stronger evidence. The existing resolution control varies visual complexity under the same search instruction, but does not fully rule out linguistic or enumeration-related token usage. In the revision we will add a control condition in which the model is instructed to enumerate all items without searching for a target, using identical output formatting and prompt structure, to directly compare token counts attributable to search versus enumeration.

    Revision: yes

  2. Referee: [Abstract] The reported qualitative matches to human patterns (flat feature-search cost, rising conjunction cost) are presented without quantitative slopes, error bars, statistical tests, or full methods, making it impossible to assess effect sizes or reproducibility against the Wolfe et al. (2010) benchmark.

    Authors: The abstract is intentionally concise, but the full manuscript reports quantitative slopes, error bars, and statistical comparisons against the Wolfe et al. benchmark in the Results section, with complete methods provided. To address the concern, we will revise the abstract to include key quantitative values (e.g., slopes and confidence intervals) and explicitly reference the statistical tests and methods section.

    Revision: yes

  3. Referee: [Abstract] The divergence that 'target-present effort slope exceeds the target-absent slope' reverses the canonical human ordering; without quantitative data or a stated statistical criterion, it is unclear whether this constitutes a reliable model-human difference or an artifact of token-count measurement.

    Authors: The full manuscript already contains the quantitative slopes and a direct statistical comparison of target-present versus target-absent slopes. We will revise the abstract to report these values explicitly and state the statistical criterion used to identify the reversal as a reliable difference. This will clarify that the divergence is not presented solely qualitatively.

    Revision: yes
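
One way a "direct statistical comparison" of target-present versus target-absent slopes could be carried out is a stratified bootstrap on the slope difference. The Python sketch below is an assumption about how such a test might look, not the procedure used in the manuscript: the data are synthetic (generated with a fixed seed), the set sizes and 25 trials per cell are borrowed from the T-vs-L design in the Figure 1 caption, and every function name is hypothetical.

```python
import random
from statistics import mean

SET_SIZES = (4, 8, 16, 32)  # assumed set sizes


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx


def slope_of(cells):
    """cells maps set size -> list of per-trial token counts."""
    xs = [n for n, trials in cells.items() for _ in trials]
    ys = [t for trials in cells.values() for t in trials]
    return ols_slope(xs, ys)


def resample(cells, rng):
    """Resample trials with replacement within each set-size cell."""
    return {n: rng.choices(trials, k=len(trials)) for n, trials in cells.items()}


def bootstrap_slope_diff(present, absent, n_boot=1000, seed=0):
    """Point estimate and 95% percentile CI for (present - absent) slope."""
    rng = random.Random(seed)
    diffs = sorted(
        slope_of(resample(present, rng)) - slope_of(resample(absent, rng))
        for _ in range(n_boot)
    )
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot) - 1]
    return slope_of(present) - slope_of(absent), (lo, hi)


def synthetic_cells(slope, rng, per_cell=25):
    """Synthetic placeholder trials: linear effort plus Gaussian noise."""
    return {n: [slope * n + rng.gauss(0, 20) for _ in range(per_cell)]
            for n in SET_SIZES}


data_rng = random.Random(1)
present = synthetic_cells(12.0, data_rng)  # NOT data from the paper
absent = synthetic_cells(6.0, data_rng)    # NOT data from the paper
diff, (lo, hi) = bootstrap_slope_diff(present, absent)
print(f"present - absent slope: {diff:.2f} tokens/item, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Under this sketch, a confidence interval that excludes zero on the positive side would support the reversal (present steeper than absent); an interval spanning zero would leave the ordering unresolved.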

Circularity Check

0 steps flagged

No circularity: empirical comparison to external benchmark

The paper performs a direct empirical test by feeding visual-search stimuli to VLMs and recording reasoning-token counts, then comparing the resulting set-size slopes and accuracy patterns against the independent Wolfe et al. 2010 human data set. No equations, fitted parameters, or self-citations appear in the derivation; the token-count proxy is introduced as a measurement choice rather than derived from the target patterns themselves. The reported signatures therefore stand or fall on the observed model outputs versus the external benchmark, with no reduction by construction.
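
The comparison against the external benchmark can be summarized per matched cell (the Figure 4 caption describes sixteen such cells). A minimal Python sketch of a rank correlation between human reaction time and model token counts follows, assuming the two lists are already paired cell by cell. The four values in each list are placeholders, not measurements from the paper or from Wolfe et al. (2010), and the helper names are hypothetical.

```python
from statistics import mean


def ranks(values):
    """1-based ranks, with ties assigned their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Placeholder cell means, NOT data from the paper.
human_rt_ms = [450.0, 470.0, 600.0, 900.0]
model_tokens = [40.0, 35.0, 120.0, 300.0]

rho = spearman(human_rt_ms, model_tokens)
print(f"Spearman rho across cells: {rho:.2f}")
```

A rank correlation is a natural choice here because milliseconds and tokens are different currencies, so only the ordering of cells, not the scale, is directly comparable.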

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; ledger is therefore minimal and provisional.

axioms (1)
  • [domain assumption] Number of reasoning tokens per trial measures search effort in a manner comparable to human reaction time.
    Central to the entire experimental design and to the comparison with the Wolfe et al. (2010) benchmark.

pith-pipeline@v0.9.1-grok · 5816 in / 1253 out tokens · 22332 ms · 2026-06-25T22:57:02.207285+00:00 · methodology

Reference graph

Works this paper leans on

17 extracted references · 5 canonical work pages

  1. [1]

    Budny, N., Ghods, K., Campbell, D., Marjieh, R., Joshi, A., Kumar, S., Cohen, J. D., Webb, T. W., & Griffiths, T. L. (2025). Visual serial processing deficits explain divergences in human and VLM reasoning. arXiv. https://arxiv.org/abs/2509.25142

  2. [2]

    Cain, M. S., & Mitroff, S. R. (2013). Memory for found targets interferes with subsequent performance in multiple-target visual search. Journal of Experimental Psychology: Human Perception and Performance, 39(5), 1398–1408

  3. [3]

    Campbell, D., Rane, S., Giallanza, T., De Sabbata, N., Ghods, K., Joshi, A., Ku, A., Frankland, S. M., Griffiths, T. L., Cohen, J. D., & Webb, T. (2024). Understanding the limits of vision language models through the lens of the binding problem. Advances in Neural Information Processing Systems, 37. https://arxiv.org/abs/2411.00238

  4. [4]

    Fleck, M. S., Samei, E., & Mitroff, S. R. (2010). Generalized “satisfaction of search”: Adverse influences on dual-target search accuracy. Journal of Experimental Psychology: Applied, 16(1), 60–71

  5. [5]

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N. A., Ma, W.-C., & Krishna, R. (2024). BLINK: Multimodal large language models can see but not perceive. In European Conference on Computer Vision (ECCV). https://arxiv.org/abs/2404.12390

  6. [6, 7]

    Hulleman, J., Lund, K., & Skarratt, P. A. (2020). Medium versus difficult visual search: How a quantitative change in the functional visual field leads to a qualitative difference in performance. Attention, Perception, & Psychophysics, 82(1), 118–139. https://doi.org/10.3758/s13414-019-01787-4

  8. [8]

    Palmer, E. M., Horowitz, T. S., Torralba, A., & Wolfe, J. M. (2011). What are the shapes of response time distributions in visual search? Journal of Experimental Psychology: Human Perception and Performance, 37(1), 58–71

  9. [9]

    Rahmanzadehgervi, P., Bolton, L., Taesiri, M. R., & Nguyen, A. T. (2024). Vision language models are blind. In Proceedings of the Asian Conference on Computer Vision (ACCV) (pp. 18–34). https://arxiv.org/abs/2407.06581

  10. [10]

    Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12(1), 97–136. https://doi.org/10.1016/0010-0285(80)90005-5

  11. [11]

    Treisman, A., & Gormican, S. (1988). Feature analysis in early vision: Evidence from search asymmetries. Psychological Review, 95(1), 15–48. https://doi.org/10.1037/0033-295X.95.1.15

  12. [12]

    Trick, L. M., & Pylyshyn, Z. W. (1994). Why are small and large numbers enumerated differently? A limited-capacity preattentive stage in vision. Psychological Review, 101(1), 80–102

  13. [13]

    Ullman, S. (1984). Visual routines. Cognition, 18(1–3), 97–159. https://doi.org/10.1016/0010-0277(84)90023-4

  14. [14]

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E. H., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35

  15. [15]

    Wolfe, J. M. (2021). Guided Search 6.0: An updated model of visual search. Psychonomic Bulletin & Review, 28(4), 1060–1092

  16. [16]

    Wolfe, J. M., Cave, K. R., & Franzel, S. L. (1989). Guided search: An alternative to the feature integration model for visual search. Journal of Experimental Psychology: Human Perception and Performance, 15(3), 419–433

  17. [17]

    Wolfe, J. M., Palmer, E. M., & Horowitz, T. S. (2010). Reaction time distributions constrain models of visual search. Vision Research, 50(14), 1304–1311. https://doi.org/10.1016/j.visres.2009.11.002