arxiv: 2605.23992 · v1 · pith:ABWVTAJ4new · submitted 2026-05-17 · 💻 cs.CV · cs.AI

A World Model of Radiologist Reading for Medical Image Representation Learning

Pith reviewed 2026-06-30 18:53 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords medical image representation learning · radiologist gaze tracking · world models · fixation sequences · chest X-ray diagnosis · zero-shot transfer · autoregressive prediction · spatial completion

The pith

A world model trained on radiologist fixation sequences produces image features that achieve state-of-the-art diagnostic accuracy without gaze data at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that radiologist eye-tracking data can be used to train a world model in which the image is the environment and fixation sequences are trajectories through it. The model learns by autoregressively predicting the latent representation of the next fixated patch from prior ones while also completing spatial information for unvisited regions. At test time the model generates patch representations from the image alone. A sympathetic reader would care because this shifts pretraining focus from static image content or labels to the dynamic process experts use to accumulate diagnostic evidence, and the reported results show these features transfer to both supervised and zero-shot tasks on standard chest X-ray benchmarks.

Core claim

GazeWorld treats the image as a world and the radiologist's fixation sequence as a trajectory through it. It autoregressively predicts the latent representation of the next fixated patch from all previously visited ones and adds a spatial-completion branch for unvisited regions. At inference the model generates a sequence of patch representations from the image alone without real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, the highest zero-shot accuracy on the same three benchmarks, and allow a generic decoder to outperform the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED on the GazeSearch benchmark.

What carries the argument

GazeWorld, a world model that autoregressively predicts the latent representation of the next fixated patch from prior fixations together with a spatial-completion branch for unvisited image regions.
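
As a rough illustration of the data side of this objective, the sketch below maps fixations to patch indices, splits the grid into visited and unvisited patches, and builds (context, target) pairs for next-fixation prediction. The 14 × 14 grid (chosen to match the 196-patch grid mentioned in the extracted appendix text), the 224-pixel image size, and every function name are assumptions made for illustration. The real model predicts latent patch representations, not raw indices, and its exact formulation is not given in the abstract.

```python
from typing import List, Sequence, Tuple

GRID = 14  # assumption: 14 x 14 = 196 patches


def fixation_to_patch(x: float, y: float, width: int, height: int,
                      grid: int = GRID) -> int:
    """Map a fixation in pixel coordinates to a row-major patch index."""
    col = max(0, min(int(x / width * grid), grid - 1))
    row = max(0, min(int(y / height * grid), grid - 1))
    return row * grid + col


def split_visited(patch_seq: Sequence[int],
                  grid: int = GRID) -> Tuple[List[int], List[int]]:
    """Divide the patch grid into visited and unvisited patch indices."""
    visited = set(patch_seq)
    unvisited = [p for p in range(grid * grid) if p not in visited]
    return sorted(visited), unvisited


def autoregressive_pairs(patch_seq: Sequence[int]) -> List[Tuple[List[int], int]]:
    """Build (context, target) pairs: step t is predicted from steps before t."""
    return [(list(patch_seq[:t]), patch_seq[t]) for t in range(1, len(patch_seq))]


# Hypothetical fixations (x, y) in pixels on a 224 x 224 image.
fixations = [(30.0, 40.0), (120.0, 60.0), (200.0, 210.0)]
seq = [fixation_to_patch(x, y, 224, 224) for x, y in fixations]
visited, unvisited = split_visited(seq)
pairs = autoregressive_pairs(seq)
```

In this toy setup the visited patches would feed the autoregressive predictor and the unvisited ones would be the targets of the spatial-completion branch; how the two are actually combined is not stated in the source.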

If this is right

  • Frozen features from the model reach state-of-the-art accuracy in every supervised diagnostic setting tested on the three chest X-ray benchmarks.
  • The same frozen features deliver the highest zero-shot accuracy on those benchmarks.
  • A generic decoder trained on the features outperforms a specialized gaze-prediction model on the GazeSearch benchmark by over 16% ScanMatch and 22% SED.
  • Modeling the process of expert image reading supplies a viable pretraining paradigm for medical imaging AI.
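
For readers unfamiliar with the scanpath metrics named above, SED is commonly computed as an edit distance between two discretized fixation sequences, where lower is better. The sketch below assumes that reading of SED and assumes scanpaths are discretized to patch indices; the GazeSearch protocol may discretize or normalize differently. ScanMatch is generally described as an alignment-based similarity score and is not sketched here.

```python
from typing import Sequence


def edit_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Levenshtein distance between two sequences of patch indices."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ai in enumerate(a, start=1):
        curr = [i]
        for j, bj in enumerate(b, start=1):
            cost = 0 if ai == bj else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical scanpaths, already discretized to patch indices.
human = [29, 49, 50, 194]
predicted = [29, 50, 194]
distance = edit_distance(human, predicted)
```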

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Expert visual search patterns appear to encode diagnostic information that is not fully captured by image labels or standard self-supervision alone.
  • The method could allow existing clinical gaze recordings to serve as a source of supervision that reduces reliance on new manual annotations.
  • Similar trajectory-based world models might be tested on other modalities where expert attention traces exist, such as digital pathology slides.

Load-bearing premise

The assumption that autoregressively predicting next-fixation patch representations from prior ones plus spatial completion produces diagnostic knowledge that transfers when real gaze data is unavailable at inference.

What would settle it

Training an otherwise identical model without the autoregressive next-fixation prediction and spatial-completion objectives and observing whether the resulting features still match or exceed the reported diagnostic accuracies on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax.

Figures

Figures reproduced from arXiv: 2605.23992 by Chao Cao, Dajiang Zhu, Huaqin Zhao, Lin Zhao, Tianming Liu, Yifan Zhou, Yiwei Li, Zihao Wu.

Figure 1: GazeWorld overview. After the chest radiograph and radiologist fixation sequence are processed in Part A, the image is converted into patch-level semantic tokens and the patch grid is divided into visited and unvisited regions. In Part B, visited patch tokens are combined with spatial, temporal, and duration information by the fixation embedder, and an autoregressive predictor performs latent-space predict…
Figure 2: Grad-CAM [Selvaraju et al., 2017] attention visualizations across seven pathologies, …
Figure 3: Qualitative scanpath comparison across seven pathologies. The first row shows the …
Figure 4: t-SNE visualization of learned representations on the CheXpert 5 …
Original abstract

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces GazeWorld, a medical imaging world model that treats the image as the world and radiologist fixation sequences as trajectories through it. The model autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, augmented by a spatial-completion branch for unvisited regions. At inference, it generates patch representations from the image alone without real gaze data. Frozen GazeWorld features are reported to achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, the highest zero-shot accuracy on the same benchmarks, and to outperform the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED on the GazeSearch benchmark when used with a generic decoder.

Significance. If the empirical claims hold after verification of methods and controls, the work would demonstrate a viable pretraining paradigm that incorporates expert search behavior into representation learning for medical images, with the practical advantage that gaze data is required only during training.

minor comments (3)
  1. Abstract: the claim of SOTA across 'all nine supervised settings' lacks explicit identification of the competing methods, exact performance deltas, or statistical significance tests.
  2. Abstract: no description is given of the backbone architecture, loss formulation, training data splits, or how the autoregressive prediction and spatial-completion objectives are balanced.
  3. Abstract: the GazeSearch results compare a generic decoder against LogitGaze-Med, but the training regime for the generic decoder (e.g., whether it sees any gaze supervision) is not stated.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for noting the potential significance of a pretraining paradigm that incorporates expert search behavior, with the advantage that gaze data is needed only at training time. We address the minor comments below.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The abstract and provided context contain no equations, derivation steps, or explicit self-citations. The model is described as trained on real gaze sequences to predict latent patch representations autoregressively, then used at inference without gaze data to produce features for downstream tasks. No load-bearing step reduces by construction to its own inputs, no fitted parameter is renamed as a prediction, and no uniqueness theorem or ansatz is imported via self-citation. The derivation chain cannot be inspected for circularity without the full manuscript equations, but the given material shows no evidence of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, training details, or parameter lists are available, so free parameters, axioms, and invented entities cannot be enumerated from the source.

Reference graph

Works this paper leans on

11 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning

    Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225,

  2. [2]

    Chexagent: Towards a foundation model for chest x-ray interpretation

    Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation. In AAAI 2024 Spring Symposium on Clinical Foundation Models,

  3. [3]

    A path towards autonomous machine intelligence version 0.9

    Yann LeCun et al. A path towards autonomous machine intelligence version 0.9.2, 2022-06-27. Open Review, 62(1):1–62,

  4. [4]

    Scalable pre-training of large autoregressive image models

    Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541,

  5. [5]

    Us-jepa: A joint embedding predictive architecture for medical ultrasound

    Ashwath Radhachandran, Vedrana Ivezić, Shreeram Athreya, Ronit Anilkumar, Corey W Arnold, and William Speier. Us-jepa: A joint embedding predictive architecture for medical ultrasound. arXiv preprint arXiv:2602.19322,

  6. [6]

    Medclip: Contrastive learning from unpaired medical images and text

    Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3876–3887, 2022b. Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boe...

  7. [7]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101,

  8. [8]

    Simulating human saccadic scanpaths on natural images

    Wei Wang, Cheng Chen, Yizhou Wang, Tingting Jiang, Fang Fang, and Yuan Yao. Simulating human saccadic scanpaths on natural images. In CVPR 2011, pages 441–448. IEEE,

  9. [9]

    A vector-based, multidimensional scanpath similarity measure

    Halszka Jarodzka, Kenneth Holmqvist, and Marcus Nyström. A vector-based, multidimensional scanpath similarity measure. In Proceedings of the 2010 symposium on eye-tracking research & applications, pages 211–218,

  10. [10]

    Reported

    A Baseline Protocol Details. A.1 Dataset Details. MIMIC-EYE. MIMIC-EYE [Hsieh et al., 2023] is an eye-tracking extension of the MIMIC-CXR database [Johnson et al., 2019]. It records radiologist fixation sequences from 3,032 frontal chest radiographs during routine clinical reading sessions using a Tobii Pro Nano eye tracker (sampling rate 60 Hz). Each record...

  11. [11]

    processes the growing fixation sequence with temporal positional encoding. Three output heads operate on each hidden state ht: (i) a spatial head, implemented as a 196-way softmax over the patch grid; (ii) a duration head that regresses fixation dwell time; and (iii) a termination head. Following the GazeSearch evaluation protocol, the decoder emits 7 fix...
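
The three output heads described in the excerpt above can be pictured with a small sketch. Only the 196-way softmax over the patch grid, the duration regression, and the existence of a termination head come from the source. The hidden size, the random weights, the linear form of each head, and the sigmoid on the termination logit are assumptions for illustration, not the paper's decoder.

```python
import math
import random
from typing import List, Tuple

NUM_PATCHES = 196  # from the source: 196-way softmax over the patch grid
HIDDEN = 8         # hypothetical hidden size, chosen small for readability


def linear(weights: List[List[float]], bias: List[float],
           h: List[float]) -> List[float]:
    """Dense layer: one output per weight row."""
    return [sum(w * x for w, x in zip(row, h)) + b
            for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def init(rng: random.Random, out_dim: int,
         in_dim: int) -> Tuple[List[List[float]], List[float]]:
    """Random weights stand in for trained parameters."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


rng = random.Random(0)
spatial_w, spatial_b = init(rng, NUM_PATCHES, HIDDEN)
duration_w, duration_b = init(rng, 1, HIDDEN)
stop_w, stop_b = init(rng, 1, HIDDEN)


def heads(h_t: List[float]) -> Tuple[List[float], float, float]:
    """Apply the spatial, duration, and termination heads to one hidden state."""
    patch_probs = softmax(linear(spatial_w, spatial_b, h_t))  # where to look next
    duration = linear(duration_w, duration_b, h_t)[0]         # regressed dwell time
    p_stop = sigmoid(linear(stop_w, stop_b, h_t)[0])          # assumed stop probability
    return patch_probs, duration, p_stop


h_t = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN)]  # stand-in hidden state
patch_probs, duration, p_stop = heads(h_t)
next_patch = max(range(NUM_PATCHES), key=lambda p: patch_probs[p])
```

Taking the argmax over `patch_probs` is only one possible decoding rule; the source does not say whether the decoder samples or picks the most likely patch.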