A World Model of Radiologist Reading for Medical Image Representation Learning

Chao Cao; Dajiang Zhu; Huaqin Zhao; Lin Zhao; Tianming Liu; Yifan Zhou; Yiwei Li; Zihao Wu

arxiv: 2605.23992 · v1 · pith:ABWVTAJ4new · submitted 2026-05-17 · 💻 cs.CV · cs.AI

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao

Pith reviewed 2026-06-30 18:53 UTC · model grok-4.3

keywords medical image representation learning · radiologist gaze tracking · world models · fixation sequences · chest X-ray diagnosis · zero-shot transfer · autoregressive prediction · spatial completion

The pith

A world model trained on radiologist fixation sequences produces image features that achieve state-of-the-art diagnostic accuracy without gaze data at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that radiologist eye-tracking data can be used to train a world model in which the image is the environment and fixation sequences are trajectories through it. The model learns by autoregressively predicting the latent representation of the next fixated patch from prior ones while also completing spatial information for unvisited regions. At test time the model generates patch representations from the image alone. A sympathetic reader would care because this shifts pretraining focus from static image content or labels to the dynamic process experts use to accumulate diagnostic evidence, and the reported results show these features transfer to both supervised and zero-shot tasks on standard chest X-ray benchmarks.

Core claim

GazeWorld treats the image as a world and the radiologist's fixation sequence as a trajectory through it. It autoregressively predicts the latent representation of the next fixated patch from all previously visited ones and adds a spatial-completion branch for unvisited regions. At inference the model generates a sequence of patch representations from the image alone without real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, the highest zero-shot accuracy on the same three benchmarks, and allow a generic decoder to outperform the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED on the GazeSearch benchmark.

What carries the argument

GazeWorld, a world model that autoregressively predicts the latent representation of the next fixated patch from prior fixations together with a spatial-completion branch for unvisited image regions.
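
A minimal sketch of how the two objectives described above could be combined, under stated assumptions. The function names, the mean-squared-error loss, the `weight` parameter, and the callable interfaces `predict_next` and `complete` are all hypothetical; the abstract does not specify the loss, the architecture, or how the two terms are balanced.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def gaze_world_losses(patch_latents, fixations, predict_next, complete, weight=1.0):
    """Toy two-term objective (illustrative, not the authors' implementation).

    patch_latents: one latent vector per image patch (the "world").
    fixations:     ordered patch indices the reader visited (the trajectory).
    predict_next:  hypothetical predictor, list of context latents -> latent.
    complete:      hypothetical completion branch,
                   (visited latents, patch index) -> latent.
    """
    # Autoregressive term: at step t only the fixations before t are visible.
    ar_terms = []
    for t in range(1, len(fixations)):
        context = [patch_latents[i] for i in fixations[:t]]
        target = patch_latents[fixations[t]]
        ar_terms.append(mse(predict_next(context), target))
    ar_loss = sum(ar_terms) / len(ar_terms) if ar_terms else 0.0

    # Completion term: predict latents of patches the reader never visited.
    visited = set(fixations)
    visited_latents = [patch_latents[i] for i in sorted(visited)]
    unvisited = [i for i in range(len(patch_latents)) if i not in visited]
    sc_terms = [mse(complete(visited_latents, i), patch_latents[i]) for i in unvisited]
    sc_loss = sum(sc_terms) / len(sc_terms) if sc_terms else 0.0

    # Assumed combination: a simple weighted sum of the two terms.
    return ar_loss, sc_loss, ar_loss + weight * sc_loss


# Stand-in "models": both simply average the context latents.
latents = [[0.0], [2.0], [4.0]]
ar, sc, total = gaze_world_losses(
    latents, [0, 1], mean_vec, lambda ctx, idx: mean_vec(ctx), weight=0.5
)
```

In a real system the two callables would be learned networks and the targets would likely come from an encoder rather than a fixed table; the sketch only shows which patches serve as context and which as targets in each term.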

If this is right

Frozen features from the model reach state-of-the-art accuracy in every supervised diagnostic setting tested on the three chest X-ray benchmarks.
The same frozen features deliver the highest zero-shot accuracy on those benchmarks.
A generic decoder trained on the features outperforms a specialized gaze-prediction model on the GazeSearch benchmark by over 16% ScanMatch and 22% SED.
Modeling the process of expert image reading supplies a viable pretraining paradigm for medical imaging AI.
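
The scanpath comparison above can be made concrete with a small sketch. It assumes that SED refers to a string-edit-style distance between discretized fixation sequences, which the source does not spell out; the function name is hypothetical, and the GazeSearch benchmark's own implementation may discretize or normalize differently.

```python
def scanpath_edit_distance(pred, ref):
    """Levenshtein distance between two sequences of patch indices.

    Assumption: fixations have already been discretized to patch indices,
    and every insertion, deletion, or substitution costs 1.
    """
    prev = list(range(len(ref) + 1))  # distance from an empty prefix of pred
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if p == r else 1
            curr.append(min(
                prev[j] + 1,         # delete p from pred
                curr[j - 1] + 1,     # insert r into pred
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


# A predicted scanpath that skips one reference fixation.
distance = scanpath_edit_distance([17, 45, 102], [17, 45, 88, 102])
```

ScanMatch additionally accounts for spatial similarity and fixation duration when aligning sequences, so it is not reproduced by this sketch.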

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Expert visual search patterns appear to encode diagnostic information that is not fully captured by image labels or standard self-supervision alone.
The method could allow existing clinical gaze recordings to serve as a source of supervision that reduces reliance on new manual annotations.
Similar trajectory-based world models might be tested on other modalities where expert attention traces exist, such as digital pathology slides.

Load-bearing premise

The assumption that autoregressively predicting next-fixation patch representations from prior ones plus spatial completion produces diagnostic knowledge that transfers when real gaze data is unavailable at inference.

What would settle it

Training an otherwise identical model without the autoregressive next-fixation prediction and spatial-completion objectives and observing whether the resulting features still match or exceed the reported diagnostic accuracies on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax.

Figures

Figures reproduced from arXiv: 2605.23992 by Chao Cao, Dajiang Zhu, Huaqin Zhao, Lin Zhao, Tianming Liu, Yifan Zhou, Yiwei Li, Zihao Wu.

**Figure 1.** GazeWorld overview. After the chest radiograph and radiologist fixation sequence are processed in Part A, the image is converted into patch-level semantic tokens and the patch grid is divided into visited and unvisited regions. In Part B, visited patch tokens are combined with spatial, temporal, and duration information by the fixation embedder, and an autoregressive predictor performs latent-space predict…

**Figure 2.** Grad-CAM [Selvaraju et al., 2017] attention visualizations across seven pathologies, …

**Figure 3.** Qualitative scanpath comparison across seven pathologies. The first row shows the …

**Figure 4.** t-SNE visualization of learned representations on the CheXpert 5 …

Original abstract

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading; yet, existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GazeWorld frames radiologist fixations as autoregressive trajectories in a world model with a completion branch, claiming the resulting frozen features set new marks on diagnosis benchmarks and even a gaze task.

The main thing to know is that this work treats the sequence of radiologist eye fixations as a trajectory through image patches. The model learns to predict the latent state of the next patch from the ones already visited and adds a branch that completes the unvisited parts of the image. At test time the same model produces patch representations from the image alone, without any real gaze input.
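
To make the trajectory framing concrete, here is a small sketch of how raw fixations might be turned into a patch-index trajectory plus a set of unvisited patches. The 224-pixel image size, 16-pixel patch size, row-major indexing, and the merging of consecutive fixations on the same patch are all assumptions for illustration; the only grounded detail is that the extracted appendix text mentions a 196-way output over the patch grid, which is consistent with a 14 × 14 grid.

```python
def fixation_to_patch(x, y, image_size=224, patch_size=16):
    """Map a pixel-space fixation (x, y) to a row-major patch index.

    Coordinates outside the image are clamped to the nearest border patch.
    """
    grid = image_size // patch_size
    col = min(max(int(x) // patch_size, 0), grid - 1)
    row = min(max(int(y) // patch_size, 0), grid - 1)
    return row * grid + col


def split_patches(fixations, image_size=224, patch_size=16):
    """Return (trajectory, unvisited) for a list of (x, y) fixations.

    trajectory: visited patch indices in reading order, with consecutive
                repeats merged (an assumption, not stated in the source).
    unvisited:  indices of all patches the trajectory never touches.
    """
    grid = image_size // patch_size
    trajectory = []
    for x, y in fixations:
        idx = fixation_to_patch(x, y, image_size, patch_size)
        if not trajectory or trajectory[-1] != idx:
            trajectory.append(idx)
    visited = set(trajectory)
    unvisited = [i for i in range(grid * grid) if i not in visited]
    return trajectory, unvisited


# Two fixations land in the top-left patch, the third one patch to the right.
trajectory, unvisited = split_patches([(0, 0), (5, 5), (20, 0)])
```

Under these assumptions the trajectory would feed the next-patch prediction and the unvisited list would define the targets of the completion branch.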

What the paper does well is shift the pretraining signal from static attention maps or decoupled gaze prediction to an explicit model of the reading process itself. The reported transfer results are the strongest part of the abstract: the frozen features reach the highest numbers on all nine supervised splits across CheXpert, RSNA Pneumonia, and SIIM-ACR, plus the best zero-shot scores, and a simple decoder on those features beats a purpose-built gaze model on the GazeSearch benchmark by double-digit margins in ScanMatch and SED.

The soft spots are straightforward. This reading rests on the abstract alone, so there are no equations, architecture details, training procedure, dataset splits, or ablation tables to inspect. Without those it is impossible to judge whether the SOTA numbers reflect the method or post-hoc tuning, whether the gaze data collection introduces selection bias, or how much the spatial-completion branch actually contributes. The central assumption, that autoregressive prediction of fixation latents produces diagnostically useful representations, cannot be evaluated yet.

This paper is aimed at medical imaging researchers who already work with eye-tracking data or are looking for richer pretraining signals beyond labels. Anyone building self-supervised models for radiology could get value from the formulation if the experiments hold up.

I would send it for peer review. The idea is distinct from the static or auxiliary uses of gaze mentioned in the abstract, and the claimed gains are large enough to deserve a full check.

Referee Report

0 major / 3 minor

Summary. The paper introduces GazeWorld, a medical imaging world model that treats the image as the world and radiologist fixation sequences as trajectories through it. The model autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, augmented by a spatial-completion branch for unvisited regions. At inference, it generates patch representations from the image alone without real gaze data. Frozen GazeWorld features are reported to achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, the highest zero-shot accuracy on the same benchmarks, and to outperform the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED on the GazeSearch benchmark when used with a generic decoder.

Significance. If the empirical claims hold after verification of methods and controls, the work would demonstrate a viable pretraining paradigm that incorporates expert search behavior into representation learning for medical images, with the practical advantage that gaze data is required only during training.

minor comments (3)

Abstract: the claim of SOTA across 'all nine supervised settings' lacks explicit identification of the competing methods, exact performance deltas, or statistical significance tests.
Abstract: no description is given of the backbone architecture, loss formulation, training data splits, or how the autoregressive prediction and spatial-completion objectives are balanced.
Abstract: the GazeSearch results compare a generic decoder against LogitGaze-Med, but the training regime for the generic decoder (e.g., whether it sees any gaze supervision) is not stated.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for noting the potential significance of a pretraining paradigm that incorporates expert search behavior, with the advantage that gaze data is needed only at training time.

Circularity Check

0 steps flagged

No significant circularity

The abstract and provided context contain no equations, derivation steps, or explicit self-citations. The model is described as trained on real gaze sequences to predict latent patch representations autoregressively, then used at inference without gaze data to produce features for downstream tasks. No load-bearing step reduces by construction to its own inputs, no fitted parameter is renamed as a prediction, and no uniqueness theorem or ansatz is imported via self-citation. The derivation chain cannot be inspected for circularity without the full manuscript equations, but the given material shows no evidence of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, training details, or parameter lists are available, so free parameters, axioms, and invented entities cannot be enumerated from the source.

Reference graph

Works this paper leans on

11 extracted references · 4 canonical work pages · 2 internal anchors

[1] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. CheXNet: Radiologist-level pneumonia detection on chest X-rays with deep learning. arXiv preprint arXiv:1711.05225.

[2] Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. CheXagent: Towards a foundation model for chest X-ray interpretation. In AAAI 2024 Spring Symposium on Clinical Foundation Models, 2024.

[3] Yann LeCun et al. A path towards autonomous machine intelligence, version 0.9.2, 2022-06-27. Open Review, 62(1):1–62, 2022.

[4] Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Angel Bautista, Alexander Toshev, Vaishaal Shankar, Joshua M Susskind, and Armand Joulin. Scalable pre-training of large autoregressive image models. arXiv preprint arXiv:2401.08541.

[5] Ashwath Radhachandran, Vedrana Ivezić, Shreeram Athreya, Ronit Anilkumar, Corey W Arnold, and William Speier. US-JEPA: A joint embedding predictive architecture for medical ultrasound. arXiv preprint arXiv:2602.19322.

[6] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. MedCLIP: Contrastive learning from unpaired medical images and text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3876–3887, 2022b.

Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boe...

[7] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[8] Wei Wang, Cheng Chen, Yizhou Wang, Tingting Jiang, Fang Fang, and Yuan Yao. Simulating human saccadic scanpaths on natural images. In CVPR 2011, pages 441–448. IEEE, 2011.

[9] Halszka Jarodzka, Kenneth Holmqvist, and Marcus Nyström. A vector-based, multidimensional scanpath similarity measure. In Proceedings of the 2010 Symposium on Eye-Tracking Research & Applications, pages 211–218, 2010.

[10] A Baseline Protocol Details. A.1 Dataset Details. MIMIC-EYE. MIMIC-EYE [Hsieh et al., 2023] is an eye-tracking extension of the MIMIC-CXR database [Johnson et al., 2019]. It records radiologist fixation sequences from 3,032 frontal chest radiographs during routine clinical reading sessions using a Tobii Pro Nano eye tracker (sampling rate 60 Hz). Each record...

[11] processes the growing fixation sequence with temporal positional encoding. Three output heads operate on each hidden state h_t: (i) a spatial head, implemented as a 196-way softmax over the patch grid; (ii) a duration head that regresses fixation dwell time; and (iii) a termination head. Following the GazeSearch evaluation protocol, the decoder emits 7 fix...
