
arxiv: 2606.03485 · v1 · pith:DKQLFHP6new · submitted 2026-06-02 · 💻 cs.HC

Analyzing Visual Attention Patterns During Band Rehearsal with Mobile Eye Tracking

Pith reviewed 2026-06-28 08:32 UTC · model grok-4.3

classification 💻 cs.HC
keywords visual attention · mobile eye tracking · ensemble rehearsal · gaze patterns · hub-and-spoke topology · band coordination · fixation analysis · transition matrices

The pith

Band rehearsals form a hub-and-spoke gaze pattern centered on the leader, with attention stabilizing after repeated attempts on new material.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies mobile eye tracking to a four-person band rehearsing three songs to map where musicians look during real practice sessions. It identifies a consistent pattern in which the session leader receives most gazes from everyone else, while one member directs up to 97 percent of their interpersonal looks at that single person. Gaze shifts between players drop by up to 65 percent on average between attempts at unfamiliar pieces, and visual scanning becomes more settled. Time-based plots show attention breaking apart during teaching interruptions and locking in during continuous play, patterns that match the musicians' own later descriptions. This description of attention flow supplies a concrete basis for designing rehearsal aids that respond to where players are actually looking.

Core claim

The central claim is that visual attention during ensemble rehearsal exhibits a hub-and-spoke topology, with the session leader as the dominant fixation target for all members and the learning guitarist directing up to 97 percent of interpersonal dwell time to this reference. Transition matrices show gaze shifts falling by up to 65 percent on average (82 percent for some individuals) between successive attempts on unfamiliar material, while scarf plots distinguish fragmented attention during teaching breakdowns from consolidated attention during uninterrupted runs. These quantitative patterns align with participants' post-session reflections.

What carries the argument

The hub-and-spoke attention topology, recovered from fixation matrices, transition matrices, and temporal scarf plots built from mobile eye-tracking data mapped to people and objects via YOLOv8 scene annotations.
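As a rough illustration of how a transition matrix of this kind can be tallied, the sketch below counts shifts between consecutive fixation targets in an area-of-interest (AOI) sequence and compares two attempts. It is not the authors' pipeline: the function names, the target labels, and both sequences are made-up assumptions, and skipping repeated fixations on the same target is only one possible convention.

```python
from collections import Counter


def transition_counts(aoi_sequence):
    """Count gaze shifts between consecutive fixations on different targets."""
    counts = Counter()
    for src, dst in zip(aoi_sequence, aoi_sequence[1:]):
        if src != dst:  # staying on the same target is not counted as a shift
            counts[(src, dst)] += 1
    return counts


def total_transitions(aoi_sequence):
    """Total number of gaze shifts in one recording."""
    return sum(transition_counts(aoi_sequence).values())


def percent_reduction(before, after):
    """Relative drop in transitions from one attempt to the next, in percent."""
    if before == 0:
        raise ValueError("no transitions in the first attempt")
    return 100.0 * (before - after) / before


# Hypothetical fixation targets for one musician (illustrative data only).
attempt_1 = ["Leader", "Drummer", "Leader", "Bassist", "Leader", "Drummer", "Leader"]
attempt_2 = ["Leader", "Leader", "Drummer", "Leader", "Leader", "Bassist", "Bassist"]

n1 = total_transitions(attempt_1)
n2 = total_transitions(attempt_2)
print(n1, n2, percent_reduction(n1, n2))
```

In a hub-and-spoke pattern, most non-zero cells of such a matrix would involve the hub (here the hypothetical "Leader" label) as either source or destination.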

If this is right

  • Attention concentrates on one reference person rather than distributing evenly across the group.
  • Repeated practice on new material reduces the frequency of gaze shifts between members.
  • Teaching interruptions produce visible fragmentation in the sequence of fixations.
  • Uninterrupted performance runs produce visibly consolidated fixation sequences.
  • Participant self-reports after the session match the recorded gaze patterns.
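The contrast between fragmented and consolidated fixation sequences can be made concrete with a small sketch. The code below merges consecutive fixations on the same target into the segments a scarf plot would draw; more and shorter segments over the same time span correspond to a visually more fragmented plot. The tuple layout, the labels, and the timings are assumptions for illustration, and the segment count is only one possible indicator, not a metric taken from the paper.

```python
from itertools import groupby


def scarf_segments(fixations):
    """Merge runs of consecutive same-target fixations into (target, start, end).

    `fixations` is a time-ordered list of (target, start_s, end_s) tuples.
    """
    segments = []
    for target, run in groupby(fixations, key=lambda f: f[0]):
        run = list(run)
        segments.append((target, run[0][1], run[-1][2]))
    return segments


# Hypothetical fixations for one musician (illustrative data only).
fixations = [
    ("Leader", 0.0, 0.4),
    ("Leader", 0.4, 1.1),
    ("Phone", 1.1, 1.5),
    ("Leader", 1.5, 2.0),
]

segments = scarf_segments(fixations)
for target, start, end in segments:
    print(f"{target}: {start:.1f}s to {end:.1f}s")
```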

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Rehearsal software could track live attention distribution and flag moments when focus drifts from the leader.
  • The same recording method could be tested in other coordinated group activities such as chamber music or team sports to check for similar topologies.
  • If the stabilization effect proves reliable, rehearsal protocols might deliberately include repeated attempts to accelerate the drop in unnecessary scanning.

Load-bearing premise

The automated scene annotations correctly assign fixations to individual musicians and objects, and the small group of four players and three songs reveals patterns that hold more generally.

What would settle it

Repeating the same rehearsal protocol with a different ensemble or larger sample and finding either no single dominant gaze target or no consistent drop in transitions between attempts would falsify the reported topology and stabilization effect.

Figures

Figures reproduced from arXiv: 2606.03485 by Arvind Srinivasan, Michael Sedlmair, Tobias Rau.

Figure 1: Overview of our rehearsal study and analysis setup. (A) Multi-camera recording interface with real-time person ...
Figure 2: In (a), each cell shows the proportion of classified-target dwell (Other and Phone excluded, renormalized) from ...
Figure 3: (A) Relationship between maximum classified-target dwell (%) and total transition count across all 24 recordings ...
Figure 4: Changes between Attempt 1 and Attempt 2 for each song. S2 shows the largest redistribution, driven by the shift from ...
Figure 5: Scarf plots of sequential AOI fixation targets for all six sessions (3 songs ...
Original abstract

Visual attention is central to ensemble coordination, yet how musicians allocate gaze during naturalistic rehearsal remains poorly understood. We present a pilot study using mobile eye tracking to examine gaze behaviour in a four-member band across three songs, each practiced twice. Musicians wore Pupil Labs Neon eye trackers, and YOLOv8-assisted scene annotations mapped fixations to ensemble members and objects in view. Analyzing fixation matrices, transition matrices, temporal scarf plots, and dwell-transition correlations, we uncover a hub-and-spoke attention topology: the session leader was the dominant gaze target for all members, while the learning guitarist concentrated up to 97% of interpersonal dwell on this single reference. Between attempts, gaze transitions decreased by up to 65% on average for unfamiliar material (up to 82% for individual participants) as scanning stabilized. Scarf plots reveal how teaching breakdowns fragment attention and uninterrupted runs consolidate it. Post-session participant reflections align with the quantitative patterns, and we discuss implications for gaze-aware tools in ensemble pedagogy.
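The Figure 2 caption describes dwell proportions computed over classified targets with Other and Phone excluded and the remainder renormalized. A minimal sketch of that renormalization follows, assuming per-target dwell times in seconds are already available. The function name, the musician labels, and the numbers are hypothetical, and the paper's "interpersonal dwell" may apply further exclusions that are not modeled here.

```python
def dwell_shares(dwell_seconds, excluded=("Other", "Phone")):
    """Return each kept target's share of dwell time after dropping excluded targets."""
    kept = {t: d for t, d in dwell_seconds.items() if t not in excluded}
    total = sum(kept.values())
    if total == 0:
        return {t: 0.0 for t in kept}
    return {t: d / total for t, d in kept.items()}


# Hypothetical dwell times in seconds for one musician (illustrative data only).
dwell = {"Leader": 48.0, "Drummer": 8.0, "Bassist": 4.0, "Other": 30.0, "Phone": 10.0}

shares = dwell_shares(dwell)
for target, share in shares.items():
    print(f"{target}: {share:.1%}")
```

One such row per wearer, stacked, would give a fixation matrix whose rows each sum to 1.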

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a pilot study using mobile eye tracking (Pupil Labs Neon) on four band members rehearsing three songs twice each. YOLOv8-assisted scene annotations map fixations to ensemble members and objects; analysis of fixation/transition matrices, scarf plots, and dwell correlations reveals a hub-and-spoke attention topology with the session leader as dominant target (up to 97% interpersonal dwell concentration for the learning guitarist) and reductions in gaze transitions (up to 65% average, 82% for individuals) between attempts as scanning stabilizes.

Significance. If the fixation-to-object mappings prove reliable, the work supplies the first quantitative description of visual attention dynamics in naturalistic ensemble rehearsal, documenting practice-induced stabilization and alignment with participant reflections. The naturalistic mobile-eye-tracking design in a moving rehearsal setting is a methodological strength that could inform gaze-aware ensemble pedagogy tools.

major comments (2)
  1. [Abstract] Abstract (methods/results): All reported percentages (97% dwell concentration, 65% transition reduction) and the hub-and-spoke topology are derived from fixation matrices produced by YOLOv8-assisted annotations. No accuracy, precision, recall, or inter-annotator agreement metrics are supplied for the annotation step in a dynamic, multi-person, moving-camera rehearsal environment; without such validation the quantitative claims rest on an untested measurement pipeline.
  2. [Abstract] Abstract (results/discussion): The sample comprises only four musicians and three songs. The manuscript must clarify whether the observed topology and transition reductions are presented as general ensemble phenomena or as case-specific observations, and must address how the small N affects the strength of the stabilization claim.
minor comments (1)
  1. [Abstract] The abstract states that 'post-session participant reflections align with the quantitative patterns' but supplies no information on interview protocol, coding, or how alignment was assessed.
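The validation the first major comment asks for could, for example, be a spot check that compares automated labels against manually assigned labels on a sample of fixations. The sketch below computes per-target precision and recall plus Cohen's kappa as a chance-corrected agreement measure. Nothing here comes from the paper: the label lists are invented, and treating the manual labels as the reference is an assumption.

```python
from collections import Counter


def precision_recall(reference, predicted, label):
    """Precision and recall of `predicted` for one target label."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == label and p == label for r, p in pairs)
    fp = sum(r != label and p == label for r, p in pairs)
    fn = sum(r == label and p != label for r, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohens_kappa(a, b):
    """Chance-corrected agreement between two equally long label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:  # both sequences use a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical spot-check sample (illustrative data only).
manual = ["Leader", "Leader", "Drummer", "Bassist", "Leader", "Drummer", "Other", "Leader"]
auto = ["Leader", "Leader", "Drummer", "Leader", "Leader", "Bassist", "Other", "Leader"]

p, r = precision_recall(manual, auto, "Leader")
kappa = cohens_kappa(manual, auto)
print(p, r, kappa)
```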

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our pilot study. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of methods and the framing of results.

  1. Referee: [Abstract] Abstract (methods/results): All reported percentages (97% dwell concentration, 65% transition reduction) and the hub-and-spoke topology are derived from fixation matrices produced by YOLOv8-assisted annotations. No accuracy, precision, recall, or inter-annotator agreement metrics are supplied for the annotation step in a dynamic, multi-person, moving-camera rehearsal environment; without such validation the quantitative claims rest on an untested measurement pipeline.

    Authors: We agree that formal validation metrics for the annotation pipeline are absent from the current manuscript. The study is a pilot, and annotations combined automated YOLOv8 detection with manual review by the research team, but no quantitative metrics (e.g., precision/recall or inter-annotator agreement) were computed. In the revised version we will add a methods subsection describing the annotation workflow in detail, report any available spot-check agreement figures, and explicitly list the lack of full validation metrics as a limitation of the pilot. This will qualify the quantitative claims without altering the reported patterns.

    Revision: yes

  2. Referee: [Abstract] Abstract (results/discussion): The sample comprises only four musicians and three songs. The manuscript must clarify whether the observed topology and transition reductions are presented as general ensemble phenomena or as case-specific observations, and must address how the small N affects the strength of the stabilization claim.

    Authors: The manuscript already labels the work a 'pilot study,' but we accept that the abstract and discussion do not sufficiently emphasize the case-specific nature of the findings. In revision we will (1) rephrase the abstract and results to state that the hub-and-spoke topology and transition reductions are observations from this particular four-person ensemble and these three songs, and (2) add an explicit paragraph in the discussion addressing the implications of N=4 for the stabilization claim, noting that the patterns are consistent with participant reflections yet require larger-scale replication before generalizing to ensemble rehearsal at large.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely observational descriptive analysis


The paper is a pilot observational study that collects mobile eye-tracking data, applies YOLOv8-assisted annotations to map fixations, and reports descriptive statistics (dwell percentages, transition counts, scarf plots) on the resulting matrices. No equations, fitted models, predictions, or derivation chains exist that could reduce to author-defined inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The hub-and-spoke topology and percentage reductions are direct empirical summaries of the annotated data, not quantities defined in terms of themselves. This is the normal case of a self-contained descriptive study with no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the accuracy of wearable eye tracking and computer-vision labeling in a small naturalistic setting rather than on mathematical derivations or new theoretical entities.

axioms (1)
  • domain assumption: Mobile eye trackers and YOLOv8 scene annotations produce sufficiently accurate fixation-to-target mappings for the purposes of the analysis.
    Invoked implicitly when the abstract states that annotations mapped fixations to ensemble members and objects.

pith-pipeline@v0.9.1-grok · 5703 in / 1339 out tokens · 27369 ms · 2026-06-28T08:32:23.201422+00:00 · methodology

