The Visual Iconicity Challenge: Evaluating Vision-Language Models on Sign Language Form-Meaning Mapping
Pith reviewed 2026-05-21 20:17 UTC · model grok-4.3
The pith
Vision-language models that better predict sign phonological forms also align more closely with human iconicity judgments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that VLMs recover some handshape and location detail from sign videos but remain below human performance on phonological form prediction, fall far short of human baselines on transparency, and correlate only moderately with human iconicity ratings even in the top models. Crucially, models with stronger phonological form prediction correlate better with human iconicity judgments, pointing to shared sensitivity to visually grounded structure.
What carries the argument
The Visual Iconicity Challenge benchmark with its three adapted tasks on dynamic sign videos: phonological sign-form prediction, transparency inference, and graded iconicity ratings.
If this is right
- VLMs recover some handshape and location details from sign videos but remain below human performance levels.
- VLMs perform far below human baselines when inferring meaning from visual sign forms in transparency tasks.
- Only the top VLMs achieve moderate correlation with human graded iconicity ratings.
- Stronger phonological form prediction performance predicts better correlation with human iconicity judgments.
- The findings support developing human-centric signals and embodied learning methods to model iconicity.
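The correlation with human iconicity ratings listed above can be made concrete. The following is a minimal sketch, assuming alignment is measured as a rank correlation (Spearman's rho) between a model's per-sign ratings and the mean human ratings. The abstract does not name the statistic, and the helper names and rating values are hypothetical, invented only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sign ratings (NOT data from the paper).
human_ratings = [6.2, 1.8, 4.5, 3.1, 5.7, 2.4]
model_ratings = [5, 2, 5, 2, 6, 3]

rho = spearman(human_ratings, model_ratings)
print(f"Spearman rho = {rho:.2f}")
```

A rank-based statistic is used here because it does not require the model's rating scale to be linearly related to the human one; whether the paper makes the same choice is not stated in the abstract.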
Where Pith is reading between the lines
- The correlation pattern could be tested in other visual domains such as gesture or object recognition to see if phonological-style prediction aids general visual grounding.
- Extending the benchmark to additional signed languages would check whether the link between form prediction and iconicity holds beyond one language community.
- Training regimes that emphasize motion prediction might close the gap to human performance on transparency without requiring explicit iconicity labels.
Load-bearing premise
The three adapted psycholinguistic tasks on video data provide a valid and unbiased measure of visual grounding and iconicity understanding in VLMs.
What would settle it
Observing no correlation between phonological form prediction accuracy and alignment with human iconicity ratings when testing a broad set of VLMs on the same sign videos would undermine the shared-sensitivity claim.
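One way to run the check described above is a permutation test on model-level scores. This is a minimal sketch under stated assumptions: each model contributes one phonological accuracy and one iconicity-alignment score, and the function names and all numbers are hypothetical placeholders, not results from the paper.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def permutation_p_value(x, y, n_permutations=5000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson(x, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical scores for eight models (NOT data from the paper).
phonological_accuracy = [0.21, 0.25, 0.30, 0.34, 0.41, 0.44, 0.52, 0.58]
iconicity_alignment = [0.05, 0.12, 0.10, 0.22, 0.27, 0.25, 0.38, 0.41]

r = pearson(phonological_accuracy, iconicity_alignment)
p = permutation_p_value(phonological_accuracy, iconicity_alignment)
print(f"r = {r:.2f}, permutation p = {p:.4f}")
```

A permutation test suits this setting because only a small number of models is available, so distributional assumptions behind parametric tests are hard to justify. A large p-value on a broad model set would be the outcome that undermines the shared-sensitivity claim.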
Original abstract
Iconicity, the resemblance between linguistic form and meaning, is pervasive in signed languages, offering a natural testbed for visual grounding. For vision-language models (VLMs), the challenge is to recover such essential mappings from dynamic human motion rather than static context. We introduce the Visual Iconicity Challenge, a novel video-based benchmark that adapts psycholinguistic measures to evaluate VLMs on three tasks: (i) phonological sign-form prediction (e.g., handshape, location), (ii) transparency (inferring meaning from visual form), and (iii) graded iconicity ratings. We assess 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands and compare them to human baselines. On phonological form prediction, VLMs recover some handshape and location detail but remain below human performance; on transparency, they are far from human baselines; and only top models correlate moderately with human iconicity ratings. Interestingly, models with stronger phonological form prediction correlate better with human iconicity judgment, indicating shared sensitivity to visually grounded structure. Our findings validate these diagnostic tasks and motivate human-centric signals and embodied learning methods for modelling iconicity and improving visual grounding in multimodal models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Visual Iconicity Challenge, a video-based benchmark adapting three psycholinguistic tasks (phonological sign-form prediction for handshape/location, transparency inference from visual form, and graded iconicity ratings) to evaluate 13 state-of-the-art VLMs in zero- and few-shot settings on Sign Language of the Netherlands data. It compares model performance to human baselines and reports that VLMs recover some phonological details but lag on transparency, with only top models showing moderate correlation to human iconicity ratings; crucially, stronger phonological prediction correlates with better human alignment, interpreted as evidence of shared sensitivity to visually grounded structure.
Significance. If the reported correlation holds after controlling for confounds, the work would be significant for providing a novel human-centric benchmark that leverages the structured visual-linguistic mappings in sign languages to probe visual grounding in VLMs. It offers concrete diagnostic tasks and motivates embodied or phonological-aware training approaches, with the human baselines and multi-task evaluation adding value for the field of multimodal learning.
major comments (2)
- [Abstract and Results] The abstract and reported findings give directional results and a correlation observation but no dataset size, statistical tests, error bars, or exclusion criteria. This omission is load-bearing for the reliability of the conclusions: it leaves the central comparative claims (including the key correlation between phonological prediction and iconicity alignment) under-supported by visible evidence.
- [Results (correlation analysis)] The manuscript claims that models with stronger phonological form prediction correlate better with human iconicity judgments, and reads this as evidence of shared sensitivity to visually grounded structure. However, no controls such as partial correlations or regressions for confounds like parameter count, pretraining scale, or performance on unrelated visual tasks are described. Without isolating the effect, the correlation may reflect general model capability rather than specific visual grounding sensitivity, a distinction that is central to the interpretation.
minor comments (2)
- [Methods] The description of the three adapted tasks would benefit from explicit discussion of how video input is tokenized or prompted for VLMs to ensure reproducibility.
- [Experiments] Table or figure presenting the 13 VLMs should include model sizes or parameter counts to aid interpretation of the correlation results.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify how to strengthen the presentation and interpretation of the Visual Iconicity Challenge results. We address each major comment below and describe the revisions we will implement.
- Referee: [Abstract and Results] The abstract and reported findings provide directional results and a correlation observation but include no dataset size, statistical tests, error bars, or exclusion criteria. This leaves the central comparative claims (including the key correlation between phonological prediction and iconicity alignment) under-supported by visible evidence and is load-bearing for the reliability of the conclusions.
Authors: We agree that the abstract and high-level results summary would be strengthened by explicitly stating dataset size, statistical tests, error bars, and exclusion criteria. The full manuscript reports these details in the Methods and Results sections (including the number of NGT signs evaluated and human baseline procedures), but we will revise the abstract to include key quantitative information and ensure all figures display error bars with appropriate statistical reporting. Exclusion criteria will also be clarified in the revised text. revision: yes
- Referee: [Results (correlation analysis)] The claim that models with stronger phonological form prediction correlate better with human iconicity judgment indicates shared sensitivity to visually grounded structure. However, no controls such as partial correlations or regressions for confounds like parameter count, pretraining scale, or performance on unrelated visual tasks are described. Without isolating the effect, the correlation may reflect general model capability rather than specific visual grounding sensitivity, which is central to the interpretation.
Authors: We acknowledge that controlling for confounds is necessary to support the interpretation that the correlation reflects shared sensitivity to visually grounded structure. In the revised manuscript we will add partial correlation and regression analyses that control for model parameter count and pretraining scale. We will also report correlations against performance on unrelated visual tasks where such data are available for the evaluated models, to help isolate the effect from general capability. revision: yes
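The partial correlation analysis promised in the rebuttal could take roughly the following form. This is a minimal sketch, assuming a single confound (log parameter count) and the standard first-order partial correlation formula. It is not the authors' analysis, and the function names and values are hypothetical.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(x, y, z):
    """Correlation of x and y after removing the linear effect of z from both.

    Uses r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2)).
    """
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical model-level values (NOT data from the paper).
phonological_accuracy = [0.21, 0.25, 0.30, 0.34, 0.41, 0.44, 0.52, 0.58]
iconicity_alignment = [0.05, 0.12, 0.10, 0.22, 0.27, 0.25, 0.38, 0.41]
params_billions = [3, 7, 7, 13, 34, 40, 72, 110]
log_params = [math.log10(p) for p in params_billions]

raw = pearson(phonological_accuracy, iconicity_alignment)
controlled = partial_correlation(
    phonological_accuracy, iconicity_alignment, log_params
)
print(f"raw r = {raw:.2f}, partial r (given log params) = {controlled:.2f}")
```

If the partial correlation stays well above zero, the link between form prediction and iconicity alignment is not explained by model size alone. If it collapses toward zero, the referee's general-capability reading becomes the more economical one. Controlling for several confounds at once would call for a regression rather than this single-covariate formula.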
Circularity Check
No significant circularity; empirical evaluation against external human baselines
The paper introduces a video-based benchmark adapting three psycholinguistic tasks and reports empirical results from evaluating 13 VLMs in zero- and few-shot settings on Sign Language of the Netherlands data, with direct comparisons to independent human baselines. The key observation—that models stronger on phonological form prediction correlate better with human iconicity ratings—is a statistical finding from these external comparisons rather than any derivation, equation, or fitted parameter that reduces to the inputs by construction. No self-definitional steps, load-bearing self-citations, ansatzes, or renamed known results appear in the reported chain; the methodology is self-contained against the provided human benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human performance on the same tasks constitutes the appropriate reference standard for assessing VLM capabilities.