What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric
Pith reviewed 2026-05-10 18:27 UTC · model grok-4.3
The pith
Semantic scanpath similarity via VLMs captures content agreement even when gaze paths diverge spatially.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper converts each fixation into a concise textual description using patch-based and marker-based VLM encodings, then aggregates these descriptions into scanpath-level representations. Semantic similarity computed over those representations with embedding-based and lexical NLP metrics captures variance that is partially independent of geometric alignment. On free-viewing data, it identifies scanpath pairs with high content agreement despite spatial divergence.
What carries the argument
VLM-based fixation-to-text encoding (patch and marker strategies) followed by NLP metric comparison, which turns attended image regions into language representations for similarity scoring.
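The fixation-to-text-to-score pipeline described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `fake_vlm` and `FAKE_DESCRIPTIONS` are hypothetical stand-ins for a real VLM call, joining descriptions with a period is one assumed aggregation scheme, and token-set Jaccard overlap is a deliberately simple stand-in for the lexical metrics the paper uses.

```python
from typing import Callable, Sequence

Fixation = tuple[float, float]  # (x, y) in image pixels


def aggregate_scanpath(
    fixations: Sequence[Fixation],
    describe: Callable[[Fixation], str],
) -> str:
    """Join per-fixation descriptions, in viewing order, into one scanpath text."""
    return ". ".join(describe(f) for f in fixations)


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Token-set overlap |A & B| / |A | B| (an assumed, simple lexical metric)."""
    tokens_a = set(text_a.lower().replace(".", " ").split())
    tokens_b = set(text_b.lower().replace(".", " ").split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical stand-in for a VLM: a lookup table keyed by fixation position.
FAKE_DESCRIPTIONS = {
    (120.0, 80.0): "a red car",
    (400.0, 300.0): "a red car",
    (50.0, 200.0): "a street sign",
    (310.0, 90.0): "a tall tree",
}


def fake_vlm(fixation: Fixation) -> str:
    return FAKE_DESCRIPTIONS[fixation]


scanpath_a = [(120.0, 80.0), (50.0, 200.0)]
scanpath_b = [(400.0, 300.0), (310.0, 90.0)]

text_a = aggregate_scanpath(scanpath_a, fake_vlm)
text_b = aggregate_scanpath(scanpath_b, fake_vlm)
score = jaccard_similarity(text_a, text_b)
```

Passing the describer as a function keeps the aggregation and scoring steps independent of which VLM, prompt, or encoding strategy produces the text.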
If this is right
- Semantic metrics can flag scanpath pairs that agree on attended objects even when fixation locations differ.
- Patch-based versus marker-based encoding choices measurably affect description stability and final similarity values.
- Multimodal models supply an interpretable, content-aware layer that extends classical spatial scanpath tools.
- The approach supplies a complementary signal for any eye-tracking study that cares about what observers understood rather than only where they looked.
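A toy case makes the first implication concrete. The sketch below assumes two observers who attend to the same two objects at different image locations; the coordinates and labels are invented for illustration, `label_agreement` is a hypothetical helper, and the DTW routine is the textbook dynamic-programming form rather than the exact variant the paper evaluates.

```python
import math


def dtw_distance(path_a, path_b):
    """Classic dynamic time warping cost over Euclidean point distances."""
    n, m = len(path_a), len(path_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            # Extend the cheapest of the three admissible warping moves.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def label_agreement(labels_a, labels_b):
    """Fraction of distinct attended labels shared by both observers (hypothetical helper)."""
    a, b = set(labels_a), set(labels_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Invented data: same content, fixated 300 px apart vertically.
coords_a = [(100.0, 100.0), (300.0, 100.0)]
coords_b = [(100.0, 400.0), (300.0, 400.0)]
labels_a = ["dog", "ball"]
labels_b = ["dog", "ball"]

spatial_cost = dtw_distance(coords_a, coords_b)        # large: paths are far apart
content_score = label_agreement(labels_a, labels_b)    # maximal: same objects
```

Here the spatial measure reports a large mismatch while the content measure reports full agreement, which is the kind of pair a semantic metric is meant to surface.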
Where Pith is reading between the lines
- The method could be tested on task-driven viewing datasets to check whether semantic agreement predicts performance differences that spatial metrics overlook.
- Combining the semantic score with spatial measures in a single composite metric might improve downstream applications such as saliency model evaluation or clinical gaze analysis.
- If VLM description quality improves with newer models, the independence from spatial measures is likely to become even clearer.
Load-bearing premise
The textual descriptions generated by the VLM faithfully reflect the semantic content of the attended image regions without introducing model-specific biases or hallucinations.
What would settle it
A dataset in which semantic similarity scores show no additional variance beyond MultiMatch or DTW correlations, or in which VLM descriptions of the same attended object vary wildly across equivalent patches.
Abstract
Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.
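The abstract names two ways of controlling visual context but does not define them here. One plausible reading, shown below purely as an assumption, is that a patch-based encoding crops a window around the fixation, while a marker-based encoding keeps the full image and overlays a marker at the fixated location. The function names, the square window, and the single-pixel marker are all hypothetical; the image is a toy grid of integers so the sketch stays dependency-free.

```python
def patch_box(fx, fy, img_w, img_h, size):
    """Square window of side `size` around (fx, fy), clipped to the image bounds."""
    half = size // 2
    left = max(0, fx - half)
    top = max(0, fy - half)
    right = min(img_w, fx + half)
    bottom = min(img_h, fy + half)
    return left, top, right, bottom


def crop_patch(image, fx, fy, size):
    """Patch-style context (assumed): the model sees only the cropped window."""
    left, top, right, bottom = patch_box(fx, fy, len(image[0]), len(image), size)
    return [row[left:right] for row in image[top:bottom]]


def draw_marker(image, fx, fy, value=9):
    """Marker-style context (assumed): the model sees the full image plus a marker."""
    marked = [row[:] for row in image]  # copy so the original stays untouched
    marked[fy][fx] = value
    return marked


# Toy 6x6 "image" of zeros; fixation coordinates are integer pixel indices.
image = [[0] * 6 for _ in range(6)]

center_patch = crop_patch(image, 3, 3, size=4)   # full 4x4 window
corner_patch = crop_patch(image, 0, 0, size=4)   # clipped at the image border
marked_image = draw_marker(image, 3, 3)
```

Under this reading, the two strategies trade off local detail against scene context, which is consistent with the abstract's statement that the encoding choice affects description fidelity.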
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a semantic scanpath similarity framework that encodes individual fixations from eye-tracking data into textual descriptions using vision-language models (VLMs) under patch-based and marker-based contextual strategies. These descriptions are aggregated into scanpath-level representations, and semantic similarity is computed with embedding-based and lexical NLP metrics. The resulting scores are compared to geometric measures such as MultiMatch and DTW on free-viewing eye-tracking datasets, with the central claim that semantic similarity captures partially independent variance from spatial alignment. The work additionally examines effects of contextual encoding on description fidelity and metric stability, suggesting multimodal models enable interpretable, content-aware extensions to classical scanpath analysis.
Significance. If the independence result holds with adequate controls, the framework supplies a useful complementary dimension for gaze research by moving beyond purely geometric alignment to semantic content agreement. A notable strength is the parameter-free reliance on off-the-shelf pre-trained VLMs and standard NLP metrics rather than any quantities fitted within the paper, which supports reproducibility. The explicit comparison of encoding strategies and stability analysis offers practical guidance for the ETRA community. The approach could extend existing tools if the VLM descriptions are shown to be faithful.
major comments (2)
- [Experiments] Experiments section: the central claim that semantic similarity captures partially independent variance from geometric measures is stated without reported effect sizes, correlation coefficients, R² values, or statistical tests (e.g., partial correlations controlling for spatial alignment). This quantitative gap prevents assessment of whether the independence is substantive or marginal.
- [Methods / Results] Methods and Results on description fidelity: the assumption that patch- and marker-based VLM encodings produce descriptions that faithfully reflect attended semantics (without hallucinations or model-specific biases) is load-bearing for the independence claim, yet the manuscript provides only internal metric stability rather than external validation against human annotations or controlled stimuli with known semantics.
minor comments (2)
- [Abstract] Abstract: the title uses 'metric' (singular) while the text refers to multiple NLP metrics; consistent terminology would improve clarity.
- [Results] The manuscript would benefit from an explicit table or figure summarizing the correlation matrix between semantic and geometric scores across datasets to make the independence claim immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript's quantitative rigor and transparency regarding the assumptions in our framework.
- Referee: [Experiments] Experiments section: the central claim that semantic similarity captures partially independent variance from geometric measures is stated without reported effect sizes, correlation coefficients, R² values, or statistical tests (e.g., partial correlations controlling for spatial alignment). This quantitative gap prevents assessment of whether the independence is substantive or marginal.
  Authors: We agree that additional quantitative analyses would better support the independence claim. In the revised manuscript, we will compute and report Pearson and Spearman correlation coefficients between the semantic similarity scores and the geometric measures (MultiMatch and DTW). We will also perform partial correlation analyses to assess the unique contribution of semantic similarity after controlling for spatial alignment. Furthermore, we will include R² values from linear regression models where semantic similarity is regressed on geometric scores to quantify the independent variance. These additions will be presented in the Experiments section with appropriate statistical tests for significance.
  revision: yes
- Referee: [Methods / Results] Methods and Results on description fidelity: the assumption that patch- and marker-based VLM encodings produce descriptions that faithfully reflect attended semantics (without hallucinations or model-specific biases) is load-bearing for the independence claim, yet the manuscript provides only internal metric stability rather than external validation against human annotations or controlled stimuli with known semantics.
  Authors: We acknowledge the importance of validating the fidelity of VLM-generated descriptions. The current manuscript emphasizes internal stability across different NLP metrics and encoding strategies to demonstrate robustness. However, we recognize that external validation would strengthen the claims. In the revision, we will expand the discussion to explicitly address potential limitations, including risks of hallucinations and biases in VLMs. We will also propose controlled experiments with known semantics as future work. If space permits, we may include a preliminary comparison with a small set of human-annotated descriptions from a subset of the data to provide initial external validation.
  revision: partial
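The correlation analyses promised in the first response can be sketched with the standard formulas. This is a hedged illustration, not the authors' analysis: the pairing of variables (semantic score as `x`, MultiMatch as `y`, DTW-based similarity as `z`) is an assumption, and the numbers are invented toy values. For a simple linear regression, R² equals the squared Pearson coefficient, so 1 − R² is the share of variance in the semantic score that the geometric score does not linearly explain.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation of x and y after linearly removing z from both."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def r_squared(x, y):
    """R² of a simple linear regression of y on x (equals Pearson r squared)."""
    return pearson(x, y) ** 2


# Invented toy scores for five scanpath pairs; not results from the paper.
semantic = [0.9, 0.4, 0.7, 0.2, 0.8]
multimatch = [0.5, 0.6, 0.7, 0.4, 0.9]
dtw_similarity = [0.3, 0.5, 0.6, 0.2, 0.8]

r_sem_mm = pearson(semantic, multimatch)
r_sem_mm_given_dtw = partial_correlation(semantic, multimatch, dtw_similarity)
unexplained = 1.0 - r_squared(multimatch, semantic)
```

A Spearman coefficient follows from the same `pearson` function applied to ranks; that step is omitted here to keep the sketch short.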
Circularity Check
No circularity detected: semantic similarity derived from external VLMs and standard NLP metrics
The paper constructs its semantic scanpath similarity by feeding fixation patches or marked images into pre-trained vision-language models to produce textual descriptions, then applying off-the-shelf embedding-based and lexical NLP metrics (e.g., cosine similarity on embeddings or BLEU/ROUGE-style scores). These steps rely on external foundation models and established NLP tools whose parameters and training data are independent of the present work. The subsequent comparison to geometric measures (MultiMatch, DTW) and the claim of partially independent variance are obtained by direct computation on public eye-tracking datasets; no quantity is fitted inside the paper and then re-used as a 'prediction,' and no self-citation chain is invoked to justify the core representation. The analysis of contextual-encoding effects on description fidelity is likewise an empirical stability check rather than a definitional loop. The derivation chain is therefore self-contained and externally grounded.
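The two metric families the rationale mentions can be illustrated without any trained model. The sketch below is an assumption-laden example: the vectors passed to `cosine_similarity` are hand-written toy values standing in for real sentence embeddings, and `rouge_l_f1` is a simplified ROUGE-L-style F1 based on the longest common subsequence of whitespace tokens, not the scoring configuration used in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def rouge_l_f1(candidate, reference):
    """Simplified ROUGE-L-style F1 over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy vectors standing in for embeddings of two scanpath descriptions.
embedding_score = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
lexical_score = rouge_l_f1("a red car", "a blue car")
```

The embedding-style score is insensitive to word order and exact wording, while the LCS-based score rewards shared tokens in the same order, which is one reason the two families can disagree on the same pair of descriptions.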