arxiv: 2510.08482 · v3 · pith:2T2XWNVWnew · submitted 2025-10-09 · 💻 cs.CV · cs.CL

The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping

Pith reviewed 2026-05-21 20:17 UTC · model grok-4.3

keywords: iconicity, sign language, vision-language models, phonological prediction, visual grounding, form-meaning mapping, transparency, benchmark

The pith

Vision-language models that better predict sign phonological forms also align more closely with human iconicity judgments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Visual Iconicity Challenge as a video-based benchmark to test how vision-language models recover form-meaning mappings in signed languages. It adapts three psycholinguistic tasks to sign videos from the Sign Language of the Netherlands: predicting phonological features such as handshape and location, inferring meaning from visual form, and giving graded iconicity ratings. Across 13 state-of-the-art models in zero- and few-shot settings, performance stays below human levels on all tasks, yet models stronger at phonological prediction show higher correlation with human iconicity ratings. This pattern indicates that both models and people draw on shared sensitivity to visually grounded structure in dynamic signs. The results support using these tasks to guide improvements in visual grounding for multimodal systems.

Core claim

The central claim is that VLMs recover some handshape and location detail from sign videos but remain below human performance on phonological form prediction, fall far short of human baselines on transparency, and reach only moderate correlation with human iconicity ratings even in the top models. Crucially, models with stronger phonological form prediction correlate better with human iconicity judgments, pointing to shared sensitivity to visually grounded structure.

What carries the argument

The Visual Iconicity Challenge benchmark with its three adapted tasks on dynamic sign videos: phonological sign-form prediction, transparency inference, and graded iconicity ratings.

If this is right

  • VLMs recover some handshape and location details from sign videos but remain below human performance levels.
  • VLMs perform far below human baselines when inferring meaning from visual sign forms in transparency tasks.
  • Only the top VLMs achieve moderate correlation with human graded iconicity ratings.
  • Stronger phonological form prediction performance predicts better correlation with human iconicity judgments.
  • The findings support developing human-centric signals and embodied learning methods to model iconicity.
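
The third bullet depends on how alignment with human ratings is scored. A minimal sketch, assuming a rank correlation (Spearman's rho) between one model's ratings and the mean human rating per sign. The summary above does not name the paper's statistic, and the ratings, their 1-7 scale, and all function names below are made up for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented ratings for five signs (illustrative only, not data from the paper).
human_mean_ratings = [6.2, 1.8, 4.5, 3.1, 5.4]
model_ratings = [5.0, 2.0, 3.0, 4.0, 6.0]
rho = spearman(human_mean_ratings, model_ratings)
```

A rank-based statistic is a reasonable default here because graded ratings are ordinal and a model may use the scale differently from human raters while still ordering the signs similarly.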

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The correlation pattern could be tested in other visual domains such as gesture or object recognition to see if phonological-style prediction aids general visual grounding.
  • Extending the benchmark to additional signed languages would check whether the link between form prediction and iconicity holds beyond one language community.
  • Training regimes that emphasize motion prediction might close the gap to human performance on transparency without requiring explicit iconicity labels.

Load-bearing premise

The three adapted psycholinguistic tasks on video data provide a valid and unbiased measure of visual grounding and iconicity understanding in VLMs.

What would settle it

Observing no correlation between phonological form prediction accuracy and alignment with human iconicity ratings when testing a broad set of VLMs on the same sign videos would undermine the shared-sensitivity claim.
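
One way to make this test concrete is a permutation test on the across-model correlation. The sketch below assumes one phonological accuracy and one iconicity-alignment score per model. The 13 value pairs are invented placeholders, not results from the paper, and the permutation test is an assumption of this sketch rather than the authors' method.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided p-value: how often a shuffled pairing matches |r| observed."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the model-wise pairing
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical per-model scores for 13 models (placeholders, not paper data).
phon_accuracy = [0.21, 0.25, 0.28, 0.30, 0.33, 0.35, 0.38,
                 0.41, 0.44, 0.47, 0.50, 0.55, 0.60]
icon_alignment = [0.05, 0.12, 0.08, 0.15, 0.20, 0.18, 0.25,
                  0.22, 0.30, 0.35, 0.33, 0.45, 0.50]

r = pearson(phon_accuracy, icon_alignment)
p_value = permutation_p_value(phon_accuracy, icon_alignment)
```

With only 13 models, a permutation test avoids distributional assumptions that a parametric test would need. A large p-value on the real scores would be the outcome that undermines the shared-sensitivity claim.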

Original abstract

Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Visual Iconicity Challenge, a video-based benchmark adapting three psycholinguistic tasks (phonological sign-form prediction for handshape/location, transparency inference from visual form, and graded iconicity ratings) to evaluate 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands data. It compares model performance to human baselines and reports that VLMs recover some phonological details but lag on transparency, with only top models showing moderate correlation to human iconicity ratings; crucially, stronger phonological prediction correlates with better human alignment, interpreted as evidence of shared sensitivity to visually grounded structure.

Significance. If the reported correlation holds after controlling for confounds, the work would be significant for providing a novel human-centric benchmark that leverages the structured visual-linguistic mappings in sign languages to probe visual grounding in VLMs. It offers concrete diagnostic tasks and motivates embodied or phonological-aware training approaches, with the human baselines and multi-task evaluation adding value for the field of multimodal learning.

major comments (2)
  1. [Abstract and Results] The abstract and reported findings give directional results and a correlation observation but no dataset size, statistical tests, error bars, or exclusion criteria. The central comparative claims, including the key correlation between phonological prediction and iconicity alignment, are therefore under-supported by visible evidence, and that gap is load-bearing for the reliability of the conclusions.
  2. [Results (correlation analysis)] The paper interprets the finding that models with stronger phonological form prediction correlate better with human iconicity judgments as evidence of shared sensitivity to visually grounded structure. However, no controls such as partial correlations or regressions for confounds like parameter count, pretraining scale, or performance on unrelated visual tasks are described. Without isolating the effect, the correlation may reflect general model capability rather than specific visual grounding sensitivity, which is central to the interpretation.
minor comments (2)
  1. [Methods] The description of the three adapted tasks would benefit from explicit discussion of how video input is tokenized or prompted for VLMs to ensure reproducibility.
  2. [Experiments] The table or figure presenting the 13 VLMs should include model sizes or parameter counts to aid interpretation of the correlation results.
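
The error bars requested in major comment 1 could be produced with a percentile bootstrap over signs. This is a sketch under assumptions: per-sign correctness is coded 0/1, the 40 outcomes below are invented, and the paper may use a different interval method.

```python
import random


def bootstrap_ci(samples, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `samples`."""
    rng = random.Random(seed)
    n = len(samples)
    # Resample signs with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choices(samples, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes for one model on 40 signs: 1 = correct, 0 = incorrect.
per_sign_correct = [1] * 24 + [0] * 16
accuracy = sum(per_sign_correct) / len(per_sign_correct)
ci_low, ci_high = bootstrap_ci(per_sign_correct)
```

Resampling at the level of signs treats the sign set as the source of uncertainty. If the benchmark also varies prompts or few-shot examples, those would need their own resampling level.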

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify how to strengthen the presentation and interpretation of the Visual Iconicity Challenge results. We address each major comment below and describe the revisions we will implement.

Point-by-point responses
  1. Referee: [Abstract and Results] The abstract and reported findings give directional results and a correlation observation but no dataset size, statistical tests, error bars, or exclusion criteria. The central comparative claims, including the key correlation between phonological prediction and iconicity alignment, are therefore under-supported by visible evidence, and that gap is load-bearing for the reliability of the conclusions.

    Authors: We agree that the abstract and high-level results summary would be strengthened by explicitly stating dataset size, statistical tests, error bars, and exclusion criteria. The full manuscript reports these details in the Methods and Results sections (including the number of NGT signs evaluated and human baseline procedures), but we will revise the abstract to include key quantitative information and ensure all figures display error bars with appropriate statistical reporting. Exclusion criteria will also be clarified in the revised text.
    Revision: yes

  2. Referee: [Results (correlation analysis)] The paper interprets the finding that models with stronger phonological form prediction correlate better with human iconicity judgments as evidence of shared sensitivity to visually grounded structure. However, no controls such as partial correlations or regressions for confounds like parameter count, pretraining scale, or performance on unrelated visual tasks are described. Without isolating the effect, the correlation may reflect general model capability rather than specific visual grounding sensitivity, which is central to the interpretation.

    Authors: We acknowledge that controlling for confounds is necessary to support the interpretation that the correlation reflects shared sensitivity to visually grounded structure. In the revised manuscript we will add partial correlation and regression analyses that control for model parameter count and pretraining scale. We will also report correlations against performance on unrelated visual tasks where such data are available for the evaluated models, to help isolate the effect from general capability.
    Revision: yes
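
The partial-correlation control promised in this response can be sketched with the standard first-order formula, r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The example assumes a single confound (log parameter count). The four data points are contrived, centred scores in arbitrary units, chosen so that the raw correlation is driven entirely by the confound. They are not values from the paper, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Contrived scores for four hypothetical models (illustrative only).
log_params = [1.0, 1.0, -1.0, -1.0]      # confound: model size
phon_accuracy = [2.0, 0.0, 0.0, -2.0]    # size effect plus independent noise
icon_alignment = [2.0, 0.0, -2.0, 0.0]   # size effect plus independent noise

raw = pearson(phon_accuracy, icon_alignment)
controlled = partial_correlation(phon_accuracy, icon_alignment, log_params)
```

In this toy case the raw correlation is 0.5 while the partial correlation is zero up to floating-point error, which is the scenario the referee describes: an association explained by general capability rather than by visual grounding. Controlling for several confounds at once would call for a regression rather than this single-variable formula.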

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation against external human baselines

Full rationale

The paper introduces a video-based benchmark adapting three psycholinguistic tasks and reports empirical results from evaluating 13 VLMs in zero- and few-shot settings on Sign Language of the Netherlands data, with direct comparisons to independent human baselines. The key observation—that models stronger on phonological form prediction correlate better with human iconicity ratings—is a statistical finding from these external comparisons rather than any derivation, equation, or fitted parameter that reduces to the inputs by construction. No self-definitional steps, load-bearing self-citations, ansatzes, or renamed known results appear in the reported chain; the methodology is self-contained against the provided human benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation framework depends on two assumptions: that human judgments constitute a reliable external standard, and that the chosen video tasks isolate iconicity without confounding factors from model pretraining.

axioms (1)
  • Domain assumption: Human performance on the same tasks constitutes the appropriate reference standard for assessing VLM capabilities.
    The paper repeatedly positions results relative to human baselines without independently validating that this comparison isolates model limitations rather than task artifacts.

pith-pipeline@v0.9.0 · 5770 in / 1167 out tokens · 52863 ms · 2026-05-21T20:17:26.382497+00:00 · methodology
