What They Saw, Not Just Where They Looked: Semantic Scanpath Similarity via VLMs and NLP metric

Aladine Chetouani; Alessandro Bruno; Bin Wang; Marouane Tliba; Mohamed Amine Kerkouri; Ulas Bagci

arxiv: 2604.08494 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.CL · cs.HC

Pith reviewed 2026-05-10 18:27 UTC · model grok-4.3

classification 💻 cs.CV · cs.CL · cs.HC

keywords scanpath similarity · eye tracking · vision-language models · semantic analysis · NLP metrics · gaze research · free-viewing

The pith

Semantic scanpath similarity via VLMs captures content agreement even when gaze paths diverge spatially.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to compare eye-movement scanpaths by their semantic content rather than only by geometric positions. It encodes each fixation as a short text description using a vision-language model under two controlled visual contexts, then measures similarity between those descriptions with standard NLP embedding and lexical metrics. This semantic score is tested against classic spatial tools like MultiMatch and DTW on free-viewing eye-tracking datasets. The central result is that semantic similarity accounts for variance that spatial measures miss, including cases where observers attend to equivalent image content from different locations.

Core claim

By converting fixations into concise textual descriptions via patch-based and marker-based VLM encodings and aggregating them into scanpath representations, semantic similarity computed with embedding and lexical NLP metrics reveals partially independent variance from geometric alignment, identifying high content agreement despite spatial divergence on free-viewing data.

What carries the argument

VLM-based fixation-to-text encoding (patch and marker strategies) followed by NLP metric comparison, which turns attended image regions into language representations for similarity scoring.
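The patch-based half of this encoding can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `patch_box`, `scanpath_to_text`, the patch size, and the image dimensions are all hypothetical names and values, and `toy_describe` is a stand-in for a real VLM call. The marker-based strategy would instead pass the full image with the fixation drawn on it.

```python
from typing import Callable, List, Tuple

Fixation = Tuple[float, float]    # (x, y) in image pixels
Box = Tuple[int, int, int, int]   # (left, top, right, bottom)


def patch_box(fix: Fixation, size: int, width: int, height: int) -> Box:
    """Square crop of side `size` centred on the fixation, shifted to stay inside the image."""
    half = size // 2
    left = min(max(int(round(fix[0])) - half, 0), max(width - size, 0))
    top = min(max(int(round(fix[1])) - half, 0), max(height - size, 0))
    return (left, top, min(left + size, width), min(top + size, height))


def scanpath_to_text(fixations: List[Fixation], describe: Callable[[Box], str],
                     size: int, width: int, height: int) -> str:
    """Describe each fixation's patch in viewing order and join the descriptions."""
    return "; ".join(describe(patch_box(f, size, width, height)) for f in fixations)


def toy_describe(box: Box) -> str:
    """Hypothetical stand-in for a VLM call: labels a crop by which image half it sits in."""
    centre_x = (box[0] + box[2]) / 2
    return "a brown dog" if centre_x < 320 else "a red ball"


# Two fixations on an assumed 640x480 image, encoded with assumed 64-pixel patches.
text = scanpath_to_text([(100, 200), (500, 220)], toy_describe, 64, 640, 480)
print(text)
```

Passing the describer in as a callable keeps the crop geometry and aggregation independent of whichever model produces the text.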

If this is right

Semantic metrics can flag scanpath pairs that agree on attended objects even when fixation locations differ.
Patch-based versus marker-based encoding choices measurably affect description stability and final similarity values.
Multimodal models supply an interpretable, content-aware layer that extends classical spatial scanpath tools.
The approach supplies a complementary signal for any eye-tracking study that cares about what observers understood rather than only where they looked.
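A toy contrast between a spatial and a content-based score may make the first point concrete. The sketch below uses a textbook dynamic time warping distance over fixation coordinates and a simple Jaccard overlap of object labels; the coordinates and labels are invented for illustration, and the Jaccard score is a simplification standing in for the paper's NLP metrics.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def dtw_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Classic DTW: minimum cumulative Euclidean cost over monotone alignments."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def label_agreement(labels_a: List[str], labels_b: List[str]) -> float:
    """Jaccard overlap of the sets of attended-object labels (order is ignored)."""
    sa, sb = set(labels_a), set(labels_b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 1.0


# Invented example: both observers look at the dog, then the ball,
# but land on different parts of each object.
path_a = [(100.0, 100.0), (400.0, 300.0)]
path_b = [(220.0, 140.0), (430.0, 320.0)]
labels_a = ["dog", "ball"]
labels_b = ["dog", "ball"]

spatial = dtw_distance(path_a, path_b)          # large: the paths are far apart in pixels
semantic = label_agreement(labels_a, labels_b)  # 1.0: the attended content is identical
print(spatial, semantic)
```

Here the pixel-level distance is well above zero while the content agreement is perfect, which is the kind of pair a purely geometric metric would score as dissimilar.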

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be tested on task-driven viewing datasets to check whether semantic agreement predicts performance differences that spatial metrics overlook.
Combining the semantic score with spatial measures in a single composite metric might improve downstream applications such as saliency model evaluation or clinical gaze analysis.
If VLM description quality improves with newer models, the independence from spatial measures is likely to become even clearer.

Load-bearing premise

The textual descriptions generated by the VLM faithfully reflect the semantic content of the attended image regions without introducing model-specific biases or hallucinations.

What would settle it

A dataset in which semantic similarity scores show no additional variance beyond MultiMatch or DTW correlations, or in which VLM descriptions of the same attended object vary wildly across equivalent patches.

Figures

Figures reproduced from arXiv: 2604.08494 by Aladine Chetouani, Alessandro Bruno, Bin Wang, Marouane Tliba, Mohamed Amine Kerkouri, Ulas Bagci.

**Figure 1.** Pipeline overview: fixations are encoded via patch extraction (left) or marker annotation (right), described by a VLM, aggregated into ...

**Figure 2.** Correlation comparisons between our NLP-based metrics framework and spatial/temporal metrics.

Original abstract

Scanpath similarity metrics are central to eye-movement research, yet existing methods predominantly evaluate spatial and temporal alignment while neglecting semantic equivalence between attended image regions. We present a semantic scanpath similarity framework that integrates vision-language models (VLMs) into eye-tracking analysis. Each fixation is encoded under controlled visual context (patch-based and marker-based strategies) and transformed into concise textual descriptions, which are aggregated into scanpath-level representations. Semantic similarity is then computed using embedding-based and lexical NLP metrics and compared against established spatial measures, including MultiMatch and DTW. Experiments on free-viewing eye-tracking data demonstrate that semantic similarity captures partially independent variance from geometric alignment, revealing cases of high content agreement despite spatial divergence. We further analyze the impact of contextual encoding on description fidelity and metric stability. Our findings suggest that multimodal foundation models enable interpretable, content-aware extensions of classical scanpath analysis, providing a complementary dimension for gaze research within the ETRA community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a semantic scanpath similarity framework that encodes individual fixations from eye-tracking data into textual descriptions using vision-language models (VLMs) under patch-based and marker-based contextual strategies. These descriptions are aggregated into scanpath-level representations, and semantic similarity is computed with embedding-based and lexical NLP metrics. The resulting scores are compared to geometric measures such as MultiMatch and DTW on free-viewing eye-tracking datasets, with the central claim that semantic similarity captures partially independent variance from spatial alignment. The work additionally examines effects of contextual encoding on description fidelity and metric stability, suggesting multimodal models enable interpretable, content-aware extensions to classical scanpath analysis.

Significance. If the independence result holds with adequate controls, the framework supplies a useful complementary dimension for gaze research by moving beyond purely geometric alignment to semantic content agreement. A notable strength is the parameter-free reliance on off-the-shelf pre-trained VLMs and standard NLP metrics rather than any quantities fitted within the paper, which supports reproducibility. The explicit comparison of encoding strategies and stability analysis offers practical guidance for the ETRA community. The approach could extend existing tools if the VLM descriptions are shown to be faithful.

major comments (2)

[Experiments] Experiments section: the central claim that semantic similarity captures partially independent variance from geometric measures is stated without reported effect sizes, correlation coefficients, R² values, or statistical tests (e.g., partial correlations controlling for spatial alignment). This quantitative gap prevents assessment of whether the independence is substantive or marginal.
[Methods / Results] Methods and Results on description fidelity: the assumption that patch- and marker-based VLM encodings produce descriptions that faithfully reflect attended semantics (without hallucinations or model-specific biases) is load-bearing for the independence claim, yet the manuscript provides only internal metric stability rather than external validation against human annotations or controlled stimuli with known semantics.

minor comments (2)

[Abstract] Abstract: the title uses 'metric' (singular) while the text refers to multiple NLP metrics; consistent terminology would improve clarity.
[Results] The manuscript would benefit from an explicit table or figure summarizing the correlation matrix between semantic and geometric scores across datasets to make the independence claim immediately verifiable.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each major comment below, providing clarifications and committing to revisions that strengthen the manuscript's quantitative rigor and transparency regarding the assumptions in our framework.

Point-by-point responses

Referee: [Experiments] Experiments section: the central claim that semantic similarity captures partially independent variance from geometric measures is stated without reported effect sizes, correlation coefficients, R² values, or statistical tests (e.g., partial correlations controlling for spatial alignment). This quantitative gap prevents assessment of whether the independence is substantive or marginal.

Authors: We agree that additional quantitative analyses would better support the independence claim. In the revised manuscript, we will compute and report Pearson and Spearman correlation coefficients between the semantic similarity scores and the geometric measures (MultiMatch and DTW). We will also perform partial correlation analyses to assess the unique contribution of semantic similarity after controlling for spatial alignment. Furthermore, we will include R² values from linear regression models where semantic similarity is regressed on geometric scores to quantify the independent variance. These additions will be presented in the Experiments section with appropriate statistical tests for significance. revision: yes
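The statistics promised in this response can be sketched with the standard first-order partial correlation formula. The score vectors below are invented toy values, not results from the paper, and the choice of which geometric measure to control for is an assumption made only for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes both inputs have non-zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R^2 of a one-predictor linear regression, which equals the squared Pearson r."""
    return pearson(x, y) ** 2


def partial_corr(x: Sequence[float], y: Sequence[float], z: Sequence[float]) -> float:
    """Correlation between x and y after removing the linear effect of z from both."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy per-pair scores (illustrative only).
semantic = [0.1, 0.3, 0.2, 0.4]
multimatch = [0.1, 0.2, 0.3, 0.4]
dtw_sim = [0.6, 0.4, 0.4, 0.6]

shared = r_squared(semantic, multimatch)             # variance shared with the spatial score
unique = 1.0 - shared                                # variance the spatial score leaves unexplained
controlled = partial_corr(semantic, multimatch, dtw_sim)
print(shared, unique, controlled)
```

A low `shared` value together with a stable `controlled` value would be one way to argue that the semantic score is not just a proxy for spatial alignment; significance testing would still be needed on real data.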
Referee: [Methods / Results] Methods and Results on description fidelity: the assumption that patch- and marker-based VLM encodings produce descriptions that faithfully reflect attended semantics (without hallucinations or model-specific biases) is load-bearing for the independence claim, yet the manuscript provides only internal metric stability rather than external validation against human annotations or controlled stimuli with known semantics.

Authors: We acknowledge the importance of validating the fidelity of VLM-generated descriptions. The current manuscript emphasizes internal stability across different NLP metrics and encoding strategies to demonstrate robustness. However, we recognize that external validation would strengthen the claims. In the revision, we will expand the discussion to explicitly address potential limitations, including risks of hallucinations and biases in VLMs. We will also propose controlled experiments with known semantics as future work. If space permits, we may include a preliminary comparison with a small set of human-annotated descriptions from a subset of the data to provide initial external validation. revision: partial

Circularity Check

0 steps flagged

No circularity detected: semantic similarity derived from external VLMs and standard NLP metrics

Full rationale

The paper constructs its semantic scanpath similarity by feeding fixation patches or marked images into pre-trained vision-language models to produce textual descriptions, then applying off-the-shelf embedding-based and lexical NLP metrics (e.g., cosine similarity on embeddings or BLEU/ROUGE-style scores). These steps rely on external foundation models and established NLP tools whose parameters and training data are independent of the present work. The subsequent comparison to geometric measures (MultiMatch, DTW) and the claim of partially independent variance are obtained by direct computation on public eye-tracking datasets; no quantity is fitted inside the paper and then re-used as a 'prediction,' and no self-citation chain is invoked to justify the core representation. The analysis of contextual-encoding effects on description fidelity is likewise an empirical stability check rather than a definitional loop. The derivation chain is therefore self-contained and externally grounded.
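The two metric families mentioned here can be illustrated with standard-library stand-ins. In the sketch below, cosine similarity is computed on bag-of-words count vectors rather than on learned sentence embeddings, and the lexical score is a ROUGE-1-style unigram F1; both are simplified assumptions, and the page does not specify which exact embedding model or lexical metric the paper uses.

```python
import math
import re
from collections import Counter
from typing import List


def tokens(text: str) -> List[str]:
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_bow(a: str, b: str) -> float:
    """Cosine similarity of term-count vectors (a crude stand-in for embedding cosine)."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1: harmonic mean of clipped unigram precision and recall."""
    cc, cr = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cc & cr).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cc.values())
    recall = overlap / sum(cr.values())
    return 2 * precision * recall / (precision + recall)


# Compare two aggregated scanpath descriptions (invented strings).
desc_a = "a brown dog; a red ball"
desc_b = "a dog; a ball on grass"
print(cosine_bow(desc_a, desc_b), unigram_f1(desc_a, desc_b))
```

Swapping `cosine_bow` for cosine similarity over vectors from a sentence-embedding model would bring the sketch closer to the embedding-based variant described above.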

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the framework uses off-the-shelf VLMs and NLP metrics without introducing new free parameters, axioms, or invented entities.


Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · 1 internal anchor

[1] Qwen3-VL Technical Report

Qwen3-VL Technical Report. arXiv preprint arXiv:2511.21631 (2025). Donald J. Berndt and James Clifford. 1994. Using dynamic time warping to find patterns in time series. In Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining (Seattle, WA) (AAAIWS’94). AAAI Press, 359–370. Florian Bordes, Richard Yuanzhe Pang, Anurag Ajay, Ale...

[2] Alessandro Bruno, Marouane Tliba, Mohamed Amine Kerkouri, Aladine Chetouani, Carlo Calogero Giunta, and Arzu Çöltekin

An introduction to vision-language modeling. arXiv preprint arXiv:2405.17247 (2024). Alessandro Bruno, Marouane Tliba, Mohamed Amine Kerkouri, Aladine Chetouani, Carlo Calogero Giunta, and Arzu Çöltekin. 2023. Detecting colour vision deficiencies via Webcam-based Eye-tracking: A case study. In Proceedings of the 2023 Symposium on Eye Tracking Research and Ap...

doi:10.1145/3588015.3590133
