
arxiv: 2602.09259 · v2 · pith:EEQLKH4Lnew · submitted 2026-02-09 · 💻 cs.RO · cs.HC

Data-centric Design of Learning-based Surgical Gaze Perception Models in Multi-Task Simulation

Pith reviewed 2026-05-21 12:56 UTC · model grok-4.3

classification 💻 cs.RO cs.HC
keywords surgical gaze · passive gaze · active gaze · saliency modeling · robot-assisted surgery · eye tracking · simulator dataset · attention perception

The pith

Passive novice gaze from simulator videos can substitute for active intermediate gaze in training surgical perception models with limited accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines whether gaze data collected passively from observers watching surgical simulator videos can replace the harder-to-get gaze recorded during actual active task performance by more skilled users. The authors created a dataset pairing active eye-tracked execution with passive viewing of the exact same videos across multiple drills on a surgical simulator. They analyze differences due to expertise and modality using overlap of fixation densities and test how well saliency models trained on one type predict the other. The results indicate that passive data captures much of the active attention patterns, and that even novice passive gaze works well enough for intermediate targets, pointing to easier ways to gather supervision for AI models in surgery.

Core claim

By collecting paired active and passive gaze on the same simulator videos, the work shows that models trained on passive gaze recover a substantial portion of intermediate active attention patterns, albeit with some degradation, and that transfer is asymmetric. Novice passive labels approximate intermediate passive targets with limited loss, particularly on higher-quality demonstrations. This establishes a data-centric path for using more accessible passive observations to build gaze perception models for robot-assisted minimally invasive surgery.

What carries the argument

The paired active-passive multi-task surgical gaze dataset from the da Vinci SimNow simulator, analyzed through fixation density overlap and single-frame saliency modeling to test substitutability of passive for active supervision.
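
As a rough illustration of what a fixation density overlap computation can look like, the sketch below builds a normalized density map from fixation points and compares two maps by histogram intersection. The function names, the Gaussian width `sigma`, the grid size, the example fixation coordinates, and the choice of histogram intersection are all assumptions made for illustration; this page does not state which overlap measure or smoothing the paper uses.

```python
import math


def fixation_density(fixations, width, height, sigma=1.5):
    """Sum an isotropic Gaussian at each (x, y) fixation, then normalize to sum to 1."""
    grid = [[0.0] * width for _ in range(height)]
    for fx, fy in fixations:
        for y in range(height):
            for x in range(width):
                d2 = (x - fx) ** 2 + (y - fy) ** 2
                grid[y][x] += math.exp(-d2 / (2.0 * sigma ** 2))
    total = sum(sum(row) for row in grid)
    if total == 0.0:
        raise ValueError("need at least one fixation")
    return [[v / total for v in row] for row in grid]


def density_overlap(p, q):
    """Histogram intersection of two normalized maps (1 = identical, near 0 = disjoint)."""
    return sum(min(a, b) for row_p, row_q in zip(p, q) for a, b in zip(row_p, row_q))


# Hypothetical fixations on a coarse 32x24 grid (illustrative values only).
active = fixation_density([(8, 8), (9, 7)], 32, 24)
passive_near = fixation_density([(9, 8), (10, 8)], 32, 24)
passive_far = fixation_density([(25, 18), (26, 19)], 32, 24)

print(density_overlap(active, passive_near))  # relatively high
print(density_overlap(active, passive_far))   # relatively low
```

Histogram intersection is used here only because it is bounded and easy to read; a correlation or divergence measure could be substituted without changing the structure of the comparison.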

If this is right

  • Models trained on passive gaze recover a substantial portion of intermediate active attention with some degradation.
  • Transfer between active and passive targets is asymmetric.
  • Novice passive labels approximate intermediate-passive targets with limited loss on higher-quality demonstrations.
  • MSI-Net produces stable predictions aligned with human fixations while SalGAN is often unstable and misaligned.
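
The asymmetry and degradation statements above can be made concrete with simple arithmetic over a grid of train/test scores. The sketch below is illustrative only: `transfer_gap`, `relative_degradation`, and the numbers in `scores` are hypothetical placeholders, not results from the paper, and any higher-is-better saliency score could stand in.

```python
def transfer_gap(scores, a, b):
    """Score of (train on a, test on b) minus score of (train on b, test on a).

    A positive value means a -> b transfers better than b -> a.
    """
    return scores[(a, b)] - scores[(b, a)]


def relative_degradation(scores, source, target):
    """Fraction of in-domain performance lost when `source` labels replace
    `target` labels for training, with evaluation on `target`."""
    in_domain = scores[(target, target)]
    if in_domain <= 0:
        raise ValueError("in-domain score must be positive")
    return (in_domain - scores[(source, target)]) / in_domain


# Placeholder scores keyed by (training labels, test labels).
# These values are invented for illustration and are NOT from the paper.
scores = {
    ("active", "active"): 0.80,
    ("passive", "active"): 0.60,
    ("active", "passive"): 0.70,
    ("passive", "passive"): 0.75,
}

gap = transfer_gap(scores, "active", "passive")
loss = relative_degradation(scores, "passive", "active")
print(gap, loss)
```

A nonzero `gap` is what "asymmetric transfer" would mean in this framing, and `loss` is one way to express "some degradation" as a fraction of the in-domain score.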

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Validating these simulator results in real operating rooms would strengthen the case for using passive data in live surgical settings.
  • Crowd-sourcing passive gaze collection could scale up datasets for surgical AI without heavy reliance on expert time.
  • The asymmetry in transfer suggests designing models that account for modality differences to improve performance.

Load-bearing premise

That fixation density overlap analyses and single-frame saliency modeling on simulator videos are sufficient to establish substitutability of passive novice gaze for active intermediate gaze in real surgical perception tasks.

What would settle it

Testing the trained saliency models on gaze data from actual robot-assisted minimally invasive surgeries and checking if the performance gap between passive novice training and active intermediate training remains comparable to the simulator results.

Figures

Figures reproduced from arXiv: 2602.09259 by Jiaji Su, Shuyuan Yang, Yizhou Li, Zonghe Chua.

Figure 2: Gaze metrics for active and passive demonstrations. (A) Fixation rate, (B) scanpath speed, (C) fixation ratio, (D) …
Figure 3: Gaze overlap metrics across observer–source skill …
Abstract

In robot-assisted minimally invasive surgery (RMIS), reduced haptic feedback and depth cues increase reliance on expert visual perception, motivating gaze-guided training and learning-based surgical perception models. However, operative expert gaze is costly to collect, and it remains unclear how the source of gaze supervision, both expertise level (intermediate vs. novice) and perceptual modality (active execution vs. passive viewing), shapes what attention models learn. We introduce a paired active-passive, multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze was recorded during task execution using a VR headset with eye tracking, and the corresponding videos were reused as stimuli to collect passive gaze from observers, enabling controlled same-video comparisons. We quantify skill- and modality-dependent differences in gaze organization and evaluate the substitutability of passive gaze for operative supervision using fixation density overlap analyses and single-frame saliency modeling. Across settings, MSI-Net produced stable, interpretable predictions, whereas SalGAN was unstable and often poorly aligned with human fixations. Models trained on passive gaze recovered a substantial portion of intermediate active attention, but with predictable degradation, and transfer was asymmetric between active and passive targets. Notably, novice passive labels approximated intermediate-passive targets with limited loss on higher-quality demonstrations, suggesting a practical path for scalable, crowd-sourced gaze supervision in surgical coaching and perception modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces a paired active-passive multi-task surgical gaze dataset collected on the da Vinci SimNow simulator across four drills. Active gaze is recorded during task execution with VR eye tracking, and the same videos are used to collect passive gaze from observers. It quantifies skill- and modality-dependent gaze differences via fixation density overlap and evaluates substitutability for learning-based models using single-frame saliency prediction with MSI-Net and SalGAN. The central claim is that passive gaze, especially from novices, recovers a substantial portion of intermediate active attention with predictable but limited degradation and asymmetric transfer, offering a scalable path for crowd-sourced supervision in surgical perception modeling.

Significance. If the empirical findings hold under more rigorous validation, the work could lower barriers to collecting supervision data for gaze-guided RMIS training and perception models by showing that novice passive viewing can approximate intermediate active attention. The controlled same-video active-passive pairing is a methodological strength that enables direct comparisons not easily obtained in real OR settings.

major comments (3)
  1. [Methods] Methods: No details are provided on participant sample sizes, expertise classification criteria, number of trials per drill, exclusion criteria, or statistical tests (e.g., for overlap metrics or transfer asymmetry). Without these, the reliability of the reported 'substantial portion' recovery and 'limited loss' cannot be assessed.
  2. [Results] Saliency modeling and evaluation: The single-frame MSI-Net and SalGAN results are presented without temporal modeling or comparison to a video-based baseline. Given the dynamic nature of surgical tasks (instrument motion, tissue deformation), this leaves the substitutability claim for real RMIS untested.
  3. [Discussion] Generalization: The abstract and results emphasize simulator-based fixation overlap and single-frame predictions but contain no sim-to-real hold-out evaluation or quantification of effect sizes against confounds such as video order. This makes the practical claim for crowd-sourced supervision dependent on an unverified assumption.
minor comments (2)
  1. [Abstract] Abstract: Consider adding the exact number of participants and trials to convey scale to readers.
  2. [Introduction] Notation: Clarify how 'intermediate' expertise is operationalized relative to 'novice' and 'expert' categories throughout the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve transparency and appropriately scope our claims.

  1. Referee: [Methods] Methods: No details are provided on participant sample sizes, expertise classification criteria, number of trials per drill, exclusion criteria, or statistical tests (e.g., for overlap metrics or transfer asymmetry). Without these, the reliability of the reported 'substantial portion' recovery and 'limited loss' cannot be assessed.

    Authors: We agree that these methodological details were insufficiently reported. In the revised manuscript we have expanded the Methods section with a new 'Participants and Experimental Protocol' subsection that now specifies participant sample sizes for both active and passive cohorts, the expertise classification criteria (based on prior simulation and surgical training hours), the number of trials per drill, exclusion criteria applied to eye-tracking data quality, and the statistical tests (including non-parametric tests for overlap metrics with reported p-values and effect sizes).

    Revision: yes

  2. Referee: [Results] Saliency modeling and evaluation: The single-frame MSI-Net and SalGAN results are presented without temporal modeling or comparison to a video-based baseline. Given the dynamic nature of surgical tasks (instrument motion, tissue deformation), this leaves the substitutability claim for real RMIS untested.

    Authors: We deliberately employed single-frame models to isolate the effects of gaze source (active vs. passive, novice vs. intermediate) without confounding temporal factors. We have revised the Results and added a paragraph in the Discussion that references video-based saliency literature, explains the rationale for the current design, and explicitly states that temporal modeling remains future work. The core substitutability findings are therefore scoped to the single-frame setting used.

    Revision: partial

  3. Referee: [Discussion] Generalization: The abstract and results emphasize simulator-based fixation overlap and single-frame predictions but contain no sim-to-real hold-out evaluation or quantification of effect sizes against confounds such as video order. This makes the practical claim for crowd-sourced supervision dependent on an unverified assumption.

    Authors: We accept that the absence of sim-to-real evaluation limits direct claims about real RMIS. The paired active-passive design was intentionally performed in simulation to enable controlled same-video comparisons that are logistically difficult in the operating room. We have added a Limitations subsection that discusses the simulator setting, reports effect sizes for overlap metrics, notes the randomization of video order, and scopes the crowd-sourced supervision suggestion as a direction for future real-world validation. The abstract has been updated accordingly.

    Revision: yes
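
For readers wondering what the non-parametric testing and effect sizes mentioned in the first response might involve, here is a minimal sketch of a two-sided permutation test with Cliff's delta as the effect size. The specific tests used in the revised manuscript are not named on this page, so both choices, the function names, and the sample values are assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(x, y, n_resamples=2000, seed=0):
    """Two-sided permutation test on the absolute difference of group means."""
    rng = random.Random(seed)
    observed = abs(_mean(x) - _mean(y))
    pooled = list(x) + list(y)
    n_x = len(x)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_x]) - _mean(pooled[n_x:]))
        # Small tolerance guards against floating-point rounding on ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def cliffs_delta(x, y):
    """Effect size in [-1, 1]: P(x > y) - P(x < y) over all cross-group pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


# Hypothetical overlap scores for two observer groups (illustrative values only).
group_a = [0.61, 0.58, 0.66, 0.63, 0.60]
group_b = [0.41, 0.45, 0.39, 0.47, 0.43]

p_value = permutation_test(group_a, group_b)
delta = cliffs_delta(group_a, group_b)
print(p_value, delta)
```

A permutation test makes no distributional assumption about the overlap scores, which is why it is a plausible stand-in here; a rank-based test would serve the same purpose.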

Circularity Check

0 steps flagged

No significant circularity; claims rest on new empirical data and evaluations


The paper introduces a new paired active-passive gaze dataset collected on the da Vinci SimNow simulator across four drills, with active gaze recorded via VR eye tracking and passive gaze collected from observers viewing the same videos. It then performs fixation density overlap analyses and trains/evaluates single-frame saliency models (MSI-Net, SalGAN) to quantify skill- and modality-dependent differences and assess substitutability. These steps rely on fresh data collection and direct model performance metrics on the collected simulator videos rather than any self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations. The derivation chain is self-contained through experimentation and does not reduce to prior fitted quantities or internal definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the validity of the new paired dataset and the chosen evaluation metrics (fixation density overlap and single-frame saliency); no free parameters, invented entities, or non-standard axioms are described in the abstract.

axioms (1)
  • Domain assumption: Fixation density overlap and single-frame saliency metrics are appropriate proxies for assessing gaze substitutability in surgical video stimuli.
    Invoked when quantifying skill- and modality-dependent differences and evaluating model predictions.
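
To show what a single-frame saliency metric measures, the sketch below computes Normalized Scanpath Saliency (NSS), a commonly used score that averages the z-scored predicted saliency at fixated pixels. Whether the paper reports NSS is not stated on this page, so treat the metric choice, the function name `nss`, and the toy map as assumptions.

```python
import math


def nss(saliency, fixations):
    """Mean z-scored saliency at fixated (x, y) pixels of a 2D saliency map."""
    values = [v for row in saliency for v in row]
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    if var == 0.0:
        raise ValueError("saliency map is constant; NSS is undefined")
    sd = math.sqrt(var)
    return sum((saliency[y][x] - mu) / sd for x, y in fixations) / len(fixations)


# Toy 3x3 prediction with a single peak in the centre (illustrative only).
toy_map = [
    [0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0],
]

on_peak = nss(toy_map, [(1, 1)])   # fixation lands on the predicted peak
off_peak = nss(toy_map, [(0, 0)])  # fixation lands in a low-saliency corner
print(on_peak, off_peak)
```

Positive values indicate that fixations fall on regions the model rates above its own average; values near or below zero indicate misalignment, which is the kind of behavior the page attributes to unstable predictions.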

pith-pipeline@v0.9.0 · 5783 in / 1304 out tokens · 51414 ms · 2026-05-21T12:56:36.191568+00:00 · methodology


