Lazy or Efficient? Towards Accessible Eye-Tracking Event Detection Using LLMs
Pith reviewed 2026-05-10 14:12 UTC · model grok-4.3
The pith
Large language models can convert natural language instructions into accurate eye-tracking event detection pipelines without requiring specialized programming.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a pipeline built on large language models can take raw eye-tracking files plus natural-language prompts, then generate and execute code that cleans the data, implements event detectors, labels fixations and saccades, and outputs results with explanations, performing comparably to traditional I-VT and I-DT methods while substantially lowering technical barriers.
What carries the argument
An LLM-driven code generation pipeline that inspects raw files, creates detection routines from user prompts, applies them to label events, and supports iterative refinement.
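As a concrete anchor, a minimal sketch of such a loop appears below. The `generate_code` wrapper, the CSV input, and the `detect(df)` contract are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the described pipeline, assuming a hypothetical
# generate_code() LLM wrapper and CSV input; not the authors' implementation.
import pandas as pd

def generate_code(prompt: str) -> str:
    """Hypothetical LLM call that returns runnable Python source."""
    raise NotImplementedError("wire this to an LLM provider")

def run_pipeline(raw_path: str, user_prompt: str) -> pd.DataFrame:
    # Step 1: inspect the raw file so the model can infer structure/metadata.
    with open(raw_path) as f:
        snippet = f.read(2000)
    # Step 2: generate cleaning + detection code from the user's prompt.
    source = generate_code(
        f"Data snippet:\n{snippet}\n\nTask: {user_prompt}\n"
        "Return only runnable Python defining detect(df) -> DataFrame "
        "with an 'event' column ('fixation' or 'saccade')."
    )
    # Step 3: execute the generated routine and apply it to the full file.
    namespace: dict = {}
    exec(source, namespace)  # assumes a trusted or sandboxed environment
    return namespace["detect"](pd.read_csv(raw_path))
```

Under this sketch, the paper's iterative refinement reduces to re-running `run_pipeline` with an edited `user_prompt`.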
If this is right
- Users can analyze eye-tracking data without programming knowledge.
- The system handles heterogeneous data formats automatically.
- Accuracy matches classical detectors on public benchmarks.
- Iterative prompt editing allows optimization of the generated code.
- Technical overhead for eye-tracking analysis is substantially reduced.
Where Pith is reading between the lines
- This approach may extend to other signal processing tasks where raw data formats vary widely.
- Wider use of eye-tracking could occur in applied settings outside research labs.
- Integration with live data streams might enable prompt-based real-time analysis.
Load-bearing premise
That large language models will reliably generate correct, error-free code for eye-tracking event detection from brief natural language descriptions.
What would settle it
Running the system on a public eye-tracking benchmark dataset and finding that the LLM-generated labels deviate significantly from ground truth or from established I-VT and I-DT results.
Original abstract
Gaze event detection is fundamental to vision science, human-computer interaction, and applied analytics. However, current workflows often require specialized programming knowledge and careful handling of heterogeneous raw data formats. Classical detectors such as I-VT and I-DT are effective but highly sensitive to preprocessing and parameterization, limiting their usability outside specialized laboratories. This work introduces a code-free, large language model (LLM)-driven pipeline that converts natural language instructions into an end-to-end analysis. The system (1) inspects raw eye-tracking files to infer structure and metadata, (2) generates executable routines for data cleaning and detector implementation from concise user prompts, (3) applies the generated detector to label fixations and saccades, and (4) returns results and explanatory reports, and allows users to iteratively optimize their code by editing the prompt. Evaluated on public benchmarks, the approach achieves accuracy comparable to traditional methods while substantially reducing technical overhead. The framework lowers barriers to entry for eye-tracking research, providing a flexible and accessible alternative to code-intensive workflows.
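For readers unfamiliar with the baselines the abstract names, a textbook-style I-VT sketch follows; the 50 deg/s default and the `t`/`x_deg`/`y_deg` columns are illustrative assumptions (I-DT instead thresholds positional dispersion over a sliding window). The single velocity threshold is precisely the parameter sensitivity the abstract flags.

```python
# Textbook-style I-VT sketch (not from the paper): label each sample by
# point-to-point angular velocity against a threshold in deg/s.
import numpy as np
import pandas as pd

def ivt_label(df: pd.DataFrame, threshold_deg_s: float = 50.0) -> pd.DataFrame:
    dt = df["t"].diff().to_numpy()        # inter-sample interval (s)
    dx = df["x_deg"].diff().to_numpy()    # gaze shift in degrees
    dy = df["y_deg"].diff().to_numpy()
    velocity = np.hypot(dx, dy) / dt      # angular velocity (deg/s)
    out = df.copy()
    out["event"] = np.where(velocity < threshold_deg_s, "fixation", "saccade")
    out.iloc[0, out.columns.get_loc("event")] = "fixation"  # no velocity at sample 0
    return out
```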
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an LLM-based pipeline for eye-tracking event detection that accepts natural-language prompts to (1) inspect raw data files and infer structure, (2) generate executable Python routines for cleaning and implementing detectors such as I-VT or I-DT, (3) apply the detector to label fixations and saccades, and (4) return results plus explanatory reports, with support for iterative prompt refinement. It claims that this code-free workflow achieves accuracy comparable to classical methods on public benchmarks while substantially lowering technical barriers.
Significance. If the empirical claims are substantiated with transparent quantitative evidence, the work could meaningfully expand access to eye-tracking analysis for researchers in HCI, vision science, and applied domains who lack programming expertise. The approach directly addresses a recognized pain point in the field—the sensitivity of I-VT/I-DT detectors to preprocessing and parameterization—by shifting the burden to LLM code generation. However, the accessibility benefit is inseparable from questions of robustness, reproducibility, and whether LLM outputs introduce systematic labeling errors on heterogeneous sampling rates and noise profiles.
major comments (2)
- [Evaluation] Evaluation section: the central claim that the approach 'achieves accuracy comparable to traditional methods' is unsupported by any reported quantitative metrics, error bars, dataset identifiers, or comparison tables. Without these, it is impossible to determine whether performance reflects genuine zero-shot generation or post-hoc prompt tuning and data selection.
- [Methodology] Methodology section (iterative prompt editing): the pipeline explicitly permits users to refine prompts until satisfactory output is obtained. The manuscript provides no controls, variance statistics, or failure-mode analysis across multiple generations or raw-file formats to separate consistent LLM performance from expert-guided iteration, which directly undermines the claim of reduced technical overhead.
minor comments (2)
- [Abstract] Abstract: the phrase 'public benchmarks' is used without naming the specific datasets or providing even summary statistics; adding one sentence with benchmark names and high-level accuracy figures would improve clarity.
- [Pipeline description] The manuscript should include a brief discussion of how sampling-rate heterogeneity and noise characteristics are handled in the generated cleaning routines, as these are known to affect I-VT/I-DT boundary placement.
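One plausible shape for such a routine, offered as an assumption rather than the paper's method: a median filter whose window is set in milliseconds and converted to samples using the inferred sampling rate, so the same prompt behaves comparably across 60 Hz and 1000 Hz recordings.

```python
# Illustrative cleaning step (an assumption, not the paper's routine):
# a median filter whose window scales with the inferred sampling rate.
import numpy as np
import pandas as pd

def clean_gaze(df: pd.DataFrame, window_ms: float = 20.0) -> pd.DataFrame:
    fs = 1.0 / float(np.median(np.diff(df["t"])))        # sampling rate (Hz)
    k = max(3, int(round(window_ms / 1000.0 * fs)) | 1)  # odd window length
    out = df.copy()
    for col in ("x_deg", "y_deg"):
        out[col] = df[col].rolling(k, center=True, min_periods=1).median()
    return out
```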
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our evaluation and methodology. We address each major comment below and outline the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Evaluation] Evaluation section: the central claim that the approach 'achieves accuracy comparable to traditional methods' is unsupported by any reported quantitative metrics, error bars, dataset identifiers, or comparison tables. Without these, it is impossible to determine whether performance reflects genuine zero-shot generation or post-hoc prompt tuning and data selection.
Authors: We agree that the current evaluation section does not provide sufficient quantitative detail to fully support the comparability claim. While the manuscript references evaluation on public benchmarks, we acknowledge the absence of explicit metrics, error bars, dataset identifiers, and comparison tables. In the revised version, we will add a dedicated results subsection with quantitative metrics (e.g., F1-scores, precision, and recall for fixation and saccade detection), standard deviations or error bars across datasets or runs, specific public benchmark identifiers, and side-by-side comparison tables against classical I-VT and I-DT implementations. This will clarify the performance characteristics and address concerns about zero-shot versus tuned generation.
Revision: yes
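The per-class metrics promised here could take a form like the following sketch; the scikit-learn dependency and the sample-level label names are assumptions, not details from the manuscript.

```python
# Hedged sketch of per-class precision/recall/F1 against ground-truth
# sample labels (label names assumed, not from the manuscript).
from sklearn.metrics import precision_recall_fscore_support

EVENTS = ["fixation", "saccade"]

def event_scores(truth, pred):
    p, r, f1, _ = precision_recall_fscore_support(
        truth, pred, labels=EVENTS, zero_division=0
    )
    return {lab: {"precision": p[i], "recall": r[i], "f1": f1[i]}
            for i, lab in enumerate(EVENTS)}
```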
Referee: [Methodology] Methodology section (iterative prompt editing): the pipeline explicitly permits users to refine prompts until satisfactory output is obtained. The manuscript provides no controls, variance statistics, or failure-mode analysis across multiple generations or raw-file formats to separate consistent LLM performance from expert-guided iteration, which directly undermines the claim of reduced technical overhead.
Authors: We agree that additional controls and analysis are needed to substantiate the claim of reduced technical overhead. The iterative refinement is presented as an optional user feature to improve accessibility, not as a core requirement. In the revision, we will expand the methodology and results sections to include variance statistics from multiple generations using fixed initial prompts, success rates and failure-mode analysis across different raw-file formats and sampling rates, and quantitative indicators of how often initial generations suffice without iteration. This will better isolate baseline LLM performance from the effects of user-guided refinement.
Revision: yes
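The promised variance analysis could be as simple as the sketch below: score several independent generations from one fixed prompt and report the spread. The `generate_and_score` callable is hypothetical.

```python
# Sketch of the promised multi-generation analysis: n independent LLM
# generations from one fixed prompt, summarized by mean/stdev/min/max F1.
import statistics

def generation_variance(generate_and_score, n_runs: int = 10) -> dict:
    """generate_and_score() -> float: F1 of one fresh generation (hypothetical)."""
    scores = [generate_and_score() for _ in range(n_runs)]
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n_runs > 1 else 0.0,
        "min": min(scores),
        "max": max(scores),
    }
```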
Circularity Check
No circularity: empirical benchmark evaluation is independent
Full rationale
The paper introduces an LLM-based pipeline that converts natural-language prompts into eye-tracking event detectors and evaluates the resulting accuracy on public benchmarks. No mathematical derivations, equations, fitted parameters, or self-citations appear in the provided abstract or description that would reduce the central claim to its own inputs by construction. The performance statement rests on external benchmark comparisons rather than any self-definitional loop, renamed known result, or load-bearing prior work by the same authors. This is the expected non-finding for an empirically evaluated systems paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can translate concise natural-language instructions into correct and robust Python code for eye-tracking data cleaning and event detection.