Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?

Jike Zhong; Ming Li; Xiaofeng Yang; Yuheng Li; Yuxiang Lai

arxiv: 2510.10254 · v2 · submitted 2025-10-11 · 💻 cs.CV

Pith reviewed 2026-05-18 07:21 UTC · model grok-4.3

keywords zero-shot learning · medical imaging · video models · motion prediction · 4D CT · segmentation · denoising · super-resolution

The pith

An autoregressive video model untrained on medical data performs competitively on CT segmentation, denoising, super-resolution, and motion prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether autoregressive video modeling principles can transfer directly to medical imaging tasks even when the model has never seen medical data. It applies a large vision model in a zero-shot setting to organ segmentation, denoising, super-resolution, and motion prediction on CT scans. The model delineates anatomical structures with competitive results on the first three tasks and produces anatomically consistent forecasts of respiratory motion that surpass specialized baselines. Evaluation covers 4D CT data from 122 patients and more than 1,820 volumes, showing strong spatial accuracy in motion prediction. These results indicate that general video models may function as zero-shot learners and reasoners for medical volumes presented as frame sequences.

Core claim

Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy on 4D CT data from 122 patients totaling over 1,820 3D volumes.

What carries the argument

A large vision model (LVM) that performs autoregressive prediction over sequences of 3D medical volumes presented as video frames.
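
A minimal sketch of the autoregressive rollout this describes, using the protocol reported in the Figure 3 caption (five observed phases, five predicted phases). The `predict_next` callable is a hypothetical stand-in for the video model, not the paper's API; here it is replaced by a trivial "repeat the last phase" predictor so the loop runs on its own.

```python
from typing import Callable, List, Sequence, TypeVar

Phase = TypeVar("Phase")  # one 3D CT phase, e.g. an array of voxels


def rollout(
    context: Sequence[Phase],
    predict_next: Callable[[Sequence[Phase]], Phase],
    n_future: int,
) -> List[Phase]:
    """Autoregressively predict n_future phases from the observed context.

    Each predicted phase is appended to the history, so later predictions
    are conditioned on earlier predictions as well as on the observed phases.
    """
    history = list(context)
    predictions: List[Phase] = []
    for _ in range(n_future):
        nxt = predict_next(history)
        predictions.append(nxt)
        history.append(nxt)
    return predictions


def repeat_last(history: Sequence[Phase]) -> Phase:
    """Placeholder predictor (assumption): return the most recent phase."""
    return history[-1]


if __name__ == "__main__":
    # Toy "phases": flat lists standing in for 3D volumes.
    observed = [[float(i)] * 4 for i in range(5)]
    future = rollout(observed, repeat_last, n_future=5)
    print(len(future), future[0])
```

Because predictions feed back into the history, errors can compound over the predicted phases, which is why the figure captions report accuracy phase by phase rather than as a single number.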

If this is right

General-purpose video models can act as unified learners across multiple medical imaging tasks without task-specific training.
Zero-shot motion prediction produces patient-specific respiratory forecasts that maintain temporal coherence and anatomical consistency.
State-of-the-art spatial accuracy on 4D CT data supports improved applications in radiotherapy planning.
Video model architectures provide a foundation for building medical foundation models that handle diverse imaging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same zero-shot transfer may extend to other 3D medical modalities such as MRI when volumes are sequenced as frames.
This approach could lower the barrier to entry for medical imaging applications by reducing the need for domain-specific datasets and fine-tuning.
Examining which video-learned representations enable anatomical reasoning could clarify the sources of cross-domain generalization.
Further scaling of the underlying video model may yield measurable gains in medical task performance.

Load-bearing premise

That autoregressive video modeling principles transfer directly to 3D medical volumes when the volumes are simply presented as frame sequences, without any domain adaptation, fine-tuning, or architectural changes.

What would settle it

Showing that the model generates anatomically inconsistent predictions or fails to exceed baseline accuracy on a new set of 4D CT scans with different respiratory patterns or patient anatomies would falsify the zero-shot transfer claim.

Figures

Figures reproduced from arXiv: 2510.10254 by Jike Zhong, Ming Li, Xiaofeng Yang, Yuheng Li, Yuxiang Lai.

**Figure 1.** Zero-shot learning and reasoning examples of the video model in medical imaging. From low-level perceptual restoration (super…

**Figure 2.** Schematic illustration of intrafractional tumor motion caused by respiratory cycles during thoracic and upper-abdominal ra…

**Figure 3.** Multi-phase motion prediction on the public dataset. We evaluate model performance on the public 4D CT dataset using Dice Similarity Coefficient (DSC, %). Each model is provided with the first five phases of the 4D CT scan and autoregressively predicts the next five phases. The plots show phase-by-phase DSC for five representative methods (DAM, DiffuseRT, ConvLSTM, RMSim, and our proposed LVM). LVM consist…

**Figure 4.** Multi-phase motion prediction on the private dataset. The same DSC-based evaluation is conducted on our institutional 4D CT dataset (including lung, heart, and liver cases). Each model receives the first five phases and must generate the subsequent five phases. LVM maintains consistently higher DSC across all organs and phases, with smoother phase-to-phase transitions and less degradation compared to compe…

**Figure 5.** Qualitative visualization of lung motion. The first five phases are used as input, and the model predicts the next five. Each heatmap shows voxel-wise pixel differences between the ground truth (GT) and either the previous phase or the model prediction. Red indicates larger discrepancies. LVM accurately captures respiratory-induced motion, showing reduced errors and smoother temporal transitions compared …

**Figure 6.** Qualitative visualization of liver motion. The first five 4D CT phases are used as input, and the model predicts the next five. Each heatmap shows voxel-wise differences between the prediction (or previous phase) and the ground truth, where red indicates larger errors. LVM accurately captures the liver's smooth deformation and diaphragm-induced motion, maintaining temporal and anatomical consistency across …

**Figure 7.** Qualitative visualization of segmentation. For each organ, the left column shows the original CT slice, and the right column shows the predicted segmentation mask. The results demonstrate that the zero-shot video model can accurately segment organs across diverse anatomical regions based on the given input prompts.
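
For reference, the Dice Similarity Coefficient named in the Figure 3 and Figure 4 captions is, for two binary masks A and B, DSC = 2|A ∩ B| / (|A| + |B|). The sketch below is a generic NumPy implementation and not the paper's evaluation code; the function names and the convention of returning 1.0 for two empty masks are assumptions.

```python
import numpy as np


def dice_coefficient(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice similarity coefficient between two binary masks of equal shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if pred.shape != target.shape:
        raise ValueError("masks must have the same shape")
    denom = pred.sum() + target.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as a perfect match
    return float(2.0 * np.logical_and(pred, target).sum() / denom)


def per_phase_dice(pred_masks, target_masks):
    """Dice for each predicted phase, in temporal order (values in [0, 1])."""
    return [dice_coefficient(p, t) for p, t in zip(pred_masks, target_masks)]
```

The captions report DSC as a percentage, so the values returned here would be multiplied by 100 for a like-for-like comparison.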

Original abstract

Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a general autoregressive video model can handle zero-shot medical tasks including 4D CT motion prediction on real patient data, but the 3D volume conversion details remain unclear.

The main thing to know is that the authors apply an unmodified large vision model, trained only on natural video, to four medical tasks. They report competitive results on organ segmentation, denoising, and super-resolution, plus stronger performance than DVF-based and generative baselines on radiotherapy motion prediction. They use 4D CT from 122 patients and more than 1,820 volumes, claiming anatomically consistent forecasts of future phases that capture patient-specific breathing patterns without any medical fine-tuning or architecture changes. That concrete demonstration on real clinical data is the clearest new element here.

The work does a reasonable job of testing the same model across multiple tasks and grounding the motion prediction claim in actual patient scans rather than toy examples. The dataset size gives the results some weight.

The soft spot is the direct transfer step. Standard video models expect 2D RGB sequences with natural-scene statistics, yet CT data arrive as anisotropic 3D grids of Hounsfield-unit intensities. The abstract does not describe slice ordering, resampling, or intensity mapping, so it is difficult to separate genuine zero-shot capability from preprocessing choices that might themselves be preserving spatial consistency. The stress-test concern lands because those steps are load-bearing for the SOTA spatial accuracy claim. Full methods and exact metric tables would be needed to judge whether the baselines were implemented fairly.

This paper is for groups working on medical foundation models or zero-shot vision applications. Readers focused on radiotherapy planning or emergent capabilities in large models would find the empirical comparisons useful. It deserves a serious referee because the real-patient scale and the practical task make the results worth checking, even though the methods section will likely need expansion and the 3D handling will need to be clarified.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates whether principles from autoregressive video modeling in large vision models (LVMs) transfer directly in a zero-shot setting to medical imaging tasks, despite no medical training data. It evaluates an unmodified LVM on organ segmentation, denoising, super-resolution, and motion prediction using 4D CT scans from 122 patients (over 1,820 3D volumes), reporting competitive results on the first three tasks and state-of-the-art spatial accuracy in forecasting future phases that capture patient-specific respiratory dynamics, outperforming DVF-based and generative baselines.

Significance. If the zero-shot transfer holds under clarified input conditions, the work would demonstrate emergent cross-domain capabilities in general video models, supporting the potential for unified medical foundation models without domain-specific fine-tuning. The large patient cohort strengthens the empirical evaluation and provides a falsifiable test of the transfer hypothesis.

major comments (2)

[Methods] Methods section (input representation): The description of converting 3D CT volumes and 4D sequences into 2D frame sequences for the LVM is insufficient. No details are provided on slice ordering, resampling to match the model's expected resolution, intensity normalization (e.g., Hounsfield units to [0,1]), or handling of anisotropic voxel grids. This directly impacts the central claim that unmodified 2D video modeling preserves 3D anatomical consistency and patient-specific dynamics in motion prediction.
[Results] Results, motion prediction subsection: The SOTA claim on spatial accuracy for 1,820 volumes requires explicit quantitative comparison (e.g., mean squared error, Dice scores, or Hausdorff distances) with error bars and statistical tests against the DVF-based and generative baselines. The abstract's qualitative description of 'anatomically consistent predictions' and 'realistic temporal coherence' is not load-bearing without these metrics to confirm superiority.

minor comments (2)

[Abstract] Abstract: The final sentence is missing punctuation before its closing participial phrase: 'unified learners and reasoners laying the groundwork for future medical foundation models built on video models' needs a comma after 'reasoners'.
[Introduction] Notation: The term 'LVM' is introduced without an explicit expansion on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We have addressed each major comment point by point below. Where the comments identify areas needing greater clarity or rigor, we have revised the manuscript accordingly to strengthen the presentation of our methods and results.

Referee: [Methods] Methods section (input representation): The description of converting 3D CT volumes and 4D sequences into 2D frame sequences for the LVM is insufficient. No details are provided on slice ordering, resampling to match the model's expected resolution, intensity normalization (e.g., Hounsfield units to [0,1]), or handling of anisotropic voxel grids. This directly impacts the central claim that unmodified 2D video modeling preserves 3D anatomical consistency and patient-specific dynamics in motion prediction.

Authors: We agree that the original Methods section provided insufficient detail on the input representation pipeline, which is essential for reproducibility and for supporting our claims about anatomical consistency in the zero-shot setting. In the revised manuscript, we have expanded the relevant subsection to explicitly describe: (1) slice ordering, where 3D volumes are converted to 2D frame sequences by extracting contiguous axial slices in superior-to-inferior order; (2) resampling all volumes to the LVM's native 224x224 pixel resolution using bilinear interpolation; (3) intensity normalization by clipping Hounsfield units to [-1000, 2000] and linearly scaling to [0, 1]; and (4) handling of anisotropic voxel grids via initial resampling to isotropic 1 mm³ spacing using trilinear interpolation prior to frame extraction. These additions clarify the preprocessing steps while preserving the unmodified nature of the LVM. We have also added a supplementary figure illustrating the conversion workflow for 3D and 4D data. revision: yes
Referee: [Results] Results, motion prediction subsection: The SOTA claim on spatial accuracy for 1,820 volumes requires explicit quantitative comparison (e.g., mean squared error, Dice scores, or Hausdorff distances) with error bars and statistical tests against the DVF-based and generative baselines. The abstract's qualitative description of 'anatomically consistent predictions' and 'realistic temporal coherence' is not load-bearing without these metrics to confirm superiority.

Authors: We acknowledge that the motion prediction results would be more robust with additional quantitative metrics and statistical validation to fully substantiate the state-of-the-art spatial accuracy claim. In the revised manuscript, we have augmented the Results section with a new table and accompanying text that reports mean squared error (MSE), Dice similarity coefficient (DSC) on key anatomical structures, and Hausdorff distance (HD) metrics, each with standard deviations as error bars, computed across all 1,820 volumes from the 122 patients. We also include results from paired statistical tests (Wilcoxon signed-rank tests) against the DVF-based and generative baselines, with p-values demonstrating significant improvements. These quantitative additions directly support the superiority in spatial accuracy and complement the existing qualitative descriptions of anatomical consistency and temporal coherence. revision: yes
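
The preprocessing described in the first response above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: it covers only step (1), axial slices in superior-to-inferior order, and step (3), clipping Hounsfield units to [-1000, 2000] and scaling to [0, 1]. The resampling steps (2) and (4) are omitted to keep the example NumPy-only; a routine such as `scipy.ndimage.zoom` with `order=1` is one possible choice for them. The function names and the `inferior_first` flag are hypothetical.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 2000.0  # clipping window quoted in the response


def normalize_hu(volume_hu: np.ndarray) -> np.ndarray:
    """Clip Hounsfield units to [HU_MIN, HU_MAX] and scale linearly to [0, 1]."""
    clipped = np.clip(volume_hu.astype(np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


def volume_to_frames(volume_hu: np.ndarray, inferior_first: bool = True) -> np.ndarray:
    """Turn a (slices, height, width) CT volume into normalized axial frames.

    Frames are returned in superior-to-inferior order. `inferior_first`
    describes how the input is stored (an assumption about the loader):
    if slice 0 is the most inferior one, the stack is reversed.
    """
    if volume_hu.ndim != 3:
        raise ValueError("expected a 3D volume")
    frames = normalize_hu(volume_hu)
    return frames[::-1].copy() if inferior_first else frames


if __name__ == "__main__":
    toy = np.array([[[-2000.0, -1000.0]], [[500.0, 3000.0]]])  # shape (2, 1, 2)
    print(volume_to_frames(toy))
```

A 4D scan would then be handled by applying `volume_to_frames` to each phase in temporal order; how the resulting frames are arranged for the model is not specified in the text above.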

Circularity Check

0 steps flagged

No circularity: empirical zero-shot results measured against external ground truth and baselines

The paper reports direct experimental outcomes from applying an unmodified autoregressive video model to 4D CT volumes treated as frame sequences. Claims rest on quantitative comparisons to patient-specific ground-truth phases (1,820 volumes from 122 patients) and external DVF/generative baselines, with no internal parameter fitting, self-defined metrics, or equations that reduce predictions to inputs by construction. The zero-shot transfer is an empirical observation, not a derived result; any self-citation to the base LVM is for model provenance only and does not load-bear the medical performance numbers. This is a standard self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested transfer of video-modeling inductive biases to medical volumes; this is an empirical assumption rather than a derived quantity.

axioms (1)

Domain assumption: Autoregressive video modeling principles can be directly applied to medical imaging tasks by treating 3D CT volumes as video sequences.
Invoked in the abstract when the LVM is evaluated zero-shot on CT and 4D CT data without domain-specific fine-tuning.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

We formulate organ motion prediction as a video modeling and generation problem... $P(X_1, X_2, \dots, X_T) = \prod_{t=1}^{T} p_\theta(X'_t \mid X_0, \dots, X_{t-1})$
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat_induction unclear

unclear
Relation between the paper passage and the cited Recognition theorem.

LVM uses an autoregressive modeling framework built on a decoder-only Transformer... context length of 4096 tokens, corresponding to up to 16 CT phases
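
As a quick arithmetic check on the quoted passage: if the 4096-token context is split evenly across 16 phases (an assumption; the passage does not say how tokens are allocated), each phase gets 256 tokens. Under the same assumption, the five-in, five-out protocol from the Figure 3 caption fits within the context.

```python
CONTEXT_TOKENS = 4096  # from the quoted passage
MAX_PHASES = 16        # from the quoted passage

# Assumption: tokens are divided evenly across phases.
tokens_per_phase = CONTEXT_TOKENS // MAX_PHASES

# Five observed plus five predicted phases (Figure 3 protocol).
phases_used = 5 + 5
tokens_used = phases_used * tokens_per_phase

print(tokens_per_phase, tokens_used, tokens_used <= CONTEXT_TOKENS)
# prints: 256 2560 True
```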

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 8 internal anchors

[1]

Sequential modeling enables scalable learn- ing for large vision models

Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan L Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learn- ing for large vision models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024

work page 2024
[2]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. InEuro- pean conference on computer vision, pages 1–21. Springer, 2022

work page 2022
[3]

Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakan- tan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Lan- guage models are few-shot learners.Advances in neural in- formation processing systems, 33:1877–1901, 2020

work page 1901
[4]

TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medi- cal image segmentation.arXiv preprint arXiv:2102.04306, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Analyzing tumors by synthesis.Genera- tive Machine Learning Models in Medical Image Computing, page 85, 2024

Qi Chen, Yuxiang Lai, Xiaoxi Chen, Qixin Hu, Alan Yuille, and Zongwei Zhou. Analyzing tumors by synthesis.Genera- tive Machine Learning Models in Medical Image Computing, page 85, 2024

work page 2024
[6]

An image is worth 16x16 words: Trans- formers for image recognition at scale.International Con- ference on Learning Representations, 2020

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale.International Con- ference on Learning Representations, 2020

work page 2020
[7]

Taming transformers for high-resolution image synthesis

Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

work page 2021
[8]

Feasibility study of convolutional long shortterm memory network for pulmonary movement prediction in ct images.Journal of Biomedical Physics & Engineering, 14(1):55, 2024

Zahra Ghasemi and Payam Samadi Miandoab. Feasibility study of convolutional long shortterm memory network for pulmonary movement prediction in ct images.Journal of Biomedical Physics & Engineering, 14(1):55, 2024

work page 2024
[9]

Veo 3 announcement.https : / / blog

Google. Veo 3 announcement.https : / / blog . google / technology / ai / generative - media - models- io- 2025/, 2025. Accessed: September 22, 2025

work page 2025
[10]

Veo 3 launch.https://cloud.google

Google. Veo 3 launch.https://cloud.google. com / blog / products / ai - machine - learning / veo - 3 - fast - available - for - everyone - on - vertex-ai, 2025. Accessed: September 22, 2025

work page 2025
[11]

Imagen Video: High Definition Video Generation with Diffusion Models

Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion mod- els.arXiv preprint arXiv:2210.02303, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[12]

Geoffrey D Hugo, Elisabeth Weiss, William C Sleeman, Salim Balik, Paul J Keall, Jun Lu, and Jeffrey F Williamson. A longitudinal four-dimensional computed tomography and cone beam computed tomography dataset for image-guided radiation therapy research in lung cancer.Medical physics, 44(2):762–771, 2017

work page 2017
[13]

nnu-net: a self-configuring method for deep learning-based biomedical image segmen- tation.Nature Methods, 18(2):203–211, 2021

Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Pe- tersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmen- tation.Nature Methods, 18(2):203–211, 2021

work page 2021
[14]

F our-dimensional computed tomography (4dct): a review of the current status and applications.Jour- nal of medical imaging and radiation oncology, 59(5):545– 554, 2015

Yune Kwong, Alexandra Olimpia Mel, Greg Wheeler, and John M Troupis. F our-dimensional computed tomography (4dct): a review of the current status and applications.Jour- nal of medical imaging and radiation oncology, 59(5):545– 554, 2015

work page 2015
[15]

Memory-assisted sub-prototype mining for universal domain adaptation.arXiv preprint arXiv:2310.05453, 2023

Yuxiang Lai, Yi Zhou, Xinghong Liu, and Tao Zhou. Memory-assisted sub-prototype mining for universal domain adaptation.arXiv preprint arXiv:2310.05453, 2023

work page arXiv 2023
[16]

From pixel to cancer: Cellular automata in computed tomography

Yuxiang Lai, Xiaoxi Chen, Angtian Wang, Alan Yuille, and Zongwei Zhou. From pixel to cancer: Cellular automata in computed tomography. InInternational Conference on Med- ical Image Computing and Computer-Assisted Intervention, pages 36–46. Springer, 2024

work page 2024
[17]

arXiv preprint arXiv:2503.13939 (2025)

Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, and Xi- aofeng Yang. Med-r1: Reinforcement learning for general- izable medical reasoning in vision-language models.arXiv preprint arXiv:2503.13939, 2025

work page arXiv 2025
[18]

Rmsim: controlled respiratory mo- tion simulation on static patient scans.Physics in Medicine & Biology, 68(4):045009, 2023

Donghoon Lee, Ellen Yorke, Masoud Zarepisheh, Saad Nadeem, and Yu-Chi Hu. Rmsim: controlled respiratory mo- tion simulation on static patient scans.Physics in Medicine & Biology, 68(4):045009, 2023

work page 2023
[19]

Abdomenatlas: A large-scale, detailed- annotated, & multi-center dataset for efficient transfer learn- ing and open algorithmic benchmarking.Medical Image Analysis, 97:103285, 2024

Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro RAS Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, et al. Abdomenatlas: A large-scale, detailed- annotated, & multi-center dataset for efficient transfer learn- ing and open algorithmic benchmarking.Medical Image Analysis, 97:103285, 2024

work page 2024
[20]

Med-flamingo: a multimodal medical few-shot learner

Michael Moor, Qian Huang, Shirley Wu, Michihiro Ya- sunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Ed- uardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. InMachine Learning for Health (ML4H), pages 353–367. PMLR, 2023

work page 2023
[21]

Site-specific deformable imaging registration algorithm selection using patient-based simulated deforma- tions.Medical physics, 40(4):041911, 2013

Ke Nie, Cynthia Chuang, Neil Kirby, Steve Braunstein, and Jean Pouliot. Site-specific deformable imaging registration algorithm selection using patient-based simulated deforma- tions.Medical physics, 40(4):041911, 2013

work page 2013
[22]

Sora 2 system card.https://openai.com/ index/sora- 2- system- card/, 2025

OpenAI. Sora 2 system card.https://openai.com/ index/sora- 2- system- card/, 2025. Accessed: September 22, 2025

work page 2025
[23]

A probabilistic deep learning model of inter-fraction anatomical variations in radiotherapy

Oscar Pastor-Serrano, Steven Habraken, Mischa Hoogeman, Danny Lathouwers, Dennis Schaart, Yusuke Nomura, Lei Xing, and Zolt ´an Perk ´o. A probabilistic deep learning model of inter-fraction anatomical variations in radiotherapy. Physics in Medicine & Biology, 68(8):085018, 2023

work page 2023
[24]

Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019

work page 2019
[25]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

work page 2022
[26]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InInternational Conference on Medical Image Com- puting and Computer-Assisted Intervention, pages 234–241. Springer, 2015

work page 2015
[27]

MedGemma Technical Report

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroen- sri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, C ´ıan Hughes, Charles Lau, et al. Medgemma technical report.arXiv preprint arXiv:2507.05201, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Convolutional lstm network: A machine learning approach for precipitation nowcasting.Advances in neural information processing sys- tems, 28, 2015

Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting.Advances in neural information processing sys- tems, 28, 2015

work page 2015
[29]

Diffusert: predicting likely anatomical deformations of patients undergoing radiotherapy.Physics in Medicine & Bi- ology, 69(15):155016, 2024

Andreas Smolders, Luciano Rivetti, Nadine Vatterodt, Stine Korreman, Anthony Lomax, Manju Sharma, Andrej Studen, Damien Charles Weber, Robert Jeraj, and Francesca Albe- tini. Diffusert: predicting likely anatomical deformations of patients undergoing radiotherapy.Physics in Medicine & Bi- ology, 69(15):155016, 2024

work page 2024
[30]

LLaMA: Open and Efficient Foundation Language Models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning.Advances in neural information pro- cessing systems, 30, 2017

work page 2017
[32]

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Il- lia Polosukhin. Attention is all you need.arXiv preprint arXiv:1706.03762, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[33]

Virginia Commonwealth University, 2015

Douglas J Vile.Statistical modeling of interfractional tissue deformation and its application in radiation therapy plan- ning. Virginia Commonwealth University, 2015

work page 2015
[34]

Medclip: Contrastive learning from unpaired medi- cal images and text

Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medi- cal images and text. InProceedings of the Conference on Empirical Methods in Natural Language Processing. Con- ference on Empirical Methods in Natural Language Process- ing, page 3876, 2022

work page 2022
[35]

To- talsegmentator: robust segmentation of 104 anatomic struc- tures in ct images.Radiology: Artificial Intelligence, 5(5), 2023

Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Sauter, Tobias Heye, Daniel T Boll, Joshy Cyriac, Shan Yang, et al. To- talsegmentator: robust segmentation of 104 anatomic struc- tures in ct images.Radiology: Artificial Intelligence, 5(5), 2023

work page 2023
[36]

Video models are zero-shot learners and reasoners

Thadd ¨aus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learn- ers and reasoners.arXiv preprint arXiv:2509.20328, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021
[38]

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023
