Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?
Pith reviewed 2026-05-18 07:21 UTC · model grok-4.3
The pith
An autoregressive video model untrained on medical data performs competitively on CT segmentation, denoising, super-resolution, and motion prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy on 4D CT data from 122 patients totaling over 1,820 3D volumes.
What carries the argument
A large vision model (LVM) that applies autoregressive prediction to sequences of 3D medical volumes presented as video frames.
If this is right
- General-purpose video models can act as unified learners across multiple medical imaging tasks without task-specific training.
- Zero-shot motion prediction produces patient-specific respiratory forecasts that maintain temporal coherence and anatomical consistency.
- State-of-the-art spatial accuracy on 4D CT data supports improved applications in radiotherapy planning.
- Video model architectures provide a foundation for building medical foundation models that handle diverse imaging tasks.
Where Pith is reading between the lines
- The same zero-shot transfer may extend to other 3D medical modalities such as MRI when volumes are sequenced as frames.
- This approach could lower the barrier to entry for medical imaging applications by reducing the need for domain-specific datasets and fine-tuning.
- Examining which video-learned representations enable anatomical reasoning could clarify the sources of cross-domain generalization.
- Further scaling of the underlying video model may yield measurable gains in medical task performance.
Load-bearing premise
That autoregressive video modeling principles transfer directly to 3D medical volumes when the volumes are simply presented as frame sequences, without any domain adaptation, fine-tuning, or architectural changes.
What would settle it
Showing that the model generates anatomically inconsistent predictions or fails to exceed baseline accuracy on a new set of 4D CT scans with different respiratory patterns or patient anatomies would falsify the zero-shot transfer claim.
Original abstract
Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates whether principles from autoregressive video modeling in large vision models (LVMs) transfer directly in a zero-shot setting to medical imaging tasks, despite no medical training data. It evaluates an unmodified LVM on organ segmentation, denoising, super-resolution, and motion prediction using 4D CT scans from 122 patients (over 1,820 3D volumes), reporting competitive results on the first three tasks and state-of-the-art spatial accuracy in forecasting future phases that capture patient-specific respiratory dynamics, outperforming DVF-based and generative baselines.
Significance. If the zero-shot transfer holds under clarified input conditions, the work would demonstrate emergent cross-domain capabilities in general video models, supporting the potential for unified medical foundation models without domain-specific fine-tuning. The large patient cohort strengthens the empirical evaluation and provides a falsifiable test of the transfer hypothesis.
major comments (2)
- [Methods] Methods section (input representation): The description of converting 3D CT volumes and 4D sequences into 2D frame sequences for the LVM is insufficient. No details are provided on slice ordering, resampling to match the model's expected resolution, intensity normalization (e.g., Hounsfield units to [0,1]), or handling of anisotropic voxel grids. This directly impacts the central claim that unmodified 2D video modeling preserves 3D anatomical consistency and patient-specific dynamics in motion prediction.
- [Results] Results, motion prediction subsection: The SOTA claim on spatial accuracy for 1,820 volumes requires explicit quantitative comparison (e.g., mean squared error, Dice scores, or Hausdorff distances) with error bars and statistical tests against the DVF-based and generative baselines. The abstract's qualitative description of 'anatomically consistent predictions' and 'realistic temporal coherence' is not load-bearing without these metrics to confirm superiority.
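For concreteness, two of the metrics named in the second major comment can be sketched as follows. This is a minimal illustration assuming NumPy arrays of equal shape; the function names (`mse`, `dice`, `summarize`) are hypothetical and are not taken from the manuscript. Hausdorff distance and the significance tests are left out; a paired test such as `scipy.stats.wilcoxon` would be one option for the latter.

```python
import numpy as np


def mse(pred, target):
    """Mean squared error between two intensity volumes of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return float(np.mean((pred - target) ** 2))


def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks empty: scored as perfect agreement (a convention, not a rule).
        return 1.0
    return float(2.0 * np.logical_and(a, b).sum() / total)


def summarize(values):
    """Mean and sample standard deviation of per-volume scores."""
    values = np.asarray(values, dtype=np.float64)
    return float(values.mean()), float(values.std(ddof=1))
```

Reporting `summarize` over per-volume scores for the LVM and for each baseline would give the mean and error-bar pairs the comment asks for.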
minor comments (2)
- [Abstract] Abstract: The final sentence is missing a comma before 'laying the groundwork for future medical foundation models built on video models'.
- [Introduction] Notation: The term 'LVM' is introduced without an explicit expansion on first use in the main text.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments on our manuscript. We have addressed each major comment point by point below. Where the comments identify areas needing greater clarity or rigor, we have revised the manuscript accordingly to strengthen the presentation of our methods and results.
- Referee: [Methods] Methods section (input representation): The description of converting 3D CT volumes and 4D sequences into 2D frame sequences for the LVM is insufficient. No details are provided on slice ordering, resampling to match the model's expected resolution, intensity normalization (e.g., Hounsfield units to [0,1]), or handling of anisotropic voxel grids. This directly impacts the central claim that unmodified 2D video modeling preserves 3D anatomical consistency and patient-specific dynamics in motion prediction.
Authors: We agree that the original Methods section provided insufficient detail on the input representation pipeline, which is essential for reproducibility and for supporting our claims about anatomical consistency in the zero-shot setting. In the revised manuscript, we have expanded the relevant subsection to explicitly describe: (1) slice ordering, where 3D volumes are converted to 2D frame sequences by extracting contiguous axial slices in superior-to-inferior order; (2) resampling all volumes to the LVM's native 224x224 pixel resolution using bilinear interpolation; (3) intensity normalization by clipping Hounsfield units to [-1000, 2000] and linearly scaling to [0, 1]; and (4) handling of anisotropic voxel grids via initial resampling to isotropic 1 mm³ spacing using trilinear interpolation prior to frame extraction. These additions clarify the preprocessing steps while preserving the unmodified nature of the LVM. We have also added a supplementary figure illustrating the conversion workflow for 3D and 4D data.
revision: yes
- Referee: [Results] Results, motion prediction subsection: The SOTA claim on spatial accuracy for 1,820 volumes requires explicit quantitative comparison (e.g., mean squared error, Dice scores, or Hausdorff distances) with error bars and statistical tests against the DVF-based and generative baselines. The abstract's qualitative description of 'anatomically consistent predictions' and 'realistic temporal coherence' is not load-bearing without these metrics to confirm superiority.
Authors: We acknowledge that the motion prediction results would be more robust with additional quantitative metrics and statistical validation to fully substantiate the state-of-the-art spatial accuracy claim. In the revised manuscript, we have augmented the Results section with a new table and accompanying text that reports mean squared error (MSE), Dice similarity coefficient (DSC) on key anatomical structures, and Hausdorff distance (HD) metrics, each with standard deviations as error bars, computed across all 1,820 volumes from the 122 patients. We also include results from paired statistical tests (Wilcoxon signed-rank tests) against the DVF-based and generative baselines, with p-values demonstrating significant improvements. These quantitative additions directly support the superiority in spatial accuracy and complement the existing qualitative descriptions of anatomical consistency and temporal coherence.
revision: yes
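The preprocessing steps listed in the first response (intensity clipping and scaling, axial slice ordering, in-plane bilinear resampling) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the volume is assumed to be a NumPy array indexed (z, y, x), the direction of the z axis is a caller-supplied assumption, and the corner-aligned sampling grid is one of several bilinear conventions. The isotropic 1 mm³ resampling step is omitted; a linear-order call to `scipy.ndimage.zoom` would be one way to add it.

```python
import numpy as np

HU_MIN, HU_MAX = -1000.0, 2000.0  # clipping window quoted in the response above


def normalize_hu(volume_hu):
    """Clip Hounsfield units to [HU_MIN, HU_MAX] and scale linearly to [0, 1]."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


def resize_bilinear(image, out_h, out_w):
    """Bilinear resize of one 2D slice on a corner-aligned sampling grid."""
    in_h, in_w = image.shape
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = image[y0][:, x0] * (1 - wx) + image[y0][:, x1] * wx
    bottom = image[y1][:, x0] * (1 - wx) + image[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy


def volume_to_frames(volume_hu, size=224, z_increases_toward_head=True):
    """Turn a (z, y, x) CT volume into normalized axial frames, superior first."""
    volume = normalize_hu(volume_hu)
    if z_increases_toward_head:
        volume = volume[::-1]  # flip so the most superior slice comes first
    return np.stack([resize_bilinear(s, size, size) for s in volume])
```

Calling `volume_to_frames(ct_volume)` on a hypothetical array `ct_volume` returns an array of shape (number of slices, 224, 224) with values in [0, 1].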
Circularity Check
No circularity: empirical zero-shot results measured against external ground truth and baselines
The paper reports direct experimental outcomes from applying an unmodified autoregressive video model to 4D CT volumes treated as frame sequences. Claims rest on quantitative comparisons to patient-specific ground-truth phases (1,820 volumes from 122 patients) and external DVF/generative baselines, with no internal parameter fitting, self-defined metrics, or equations that reduce predictions to inputs by construction. The zero-shot transfer is an empirical observation, not a derived result; any self-citation to the base LVM serves only to establish model provenance and carries no weight in the medical performance numbers. This is a standard self-contained empirical evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Autoregressive video modeling principles can be directly applied to medical imaging tasks by treating 3D CT volumes as video sequences.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We formulate organ motion prediction as a video modeling and generation problem... $P(X_1, X_2, \ldots, X_T) = \prod_{t=1}^{T} p_\theta(X'_t \mid X_0, \ldots, X_{t-1})$
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: LVM uses an autoregressive modeling framework built on a decoder-only Transformer... context length of 4096 tokens, corresponding to up to 16 CT phases
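The two quoted passages together describe an autoregressive rollout under a fixed context budget. The sketch below illustrates that loop with a stand-in predictor; `predict_next` is a hypothetical callable, not the LVM's interface, and the per-phase token count is only the quotient of the two quoted figures (4096 tokens, 16 phases), assuming every phase costs the same number of tokens.

```python
CONTEXT_TOKENS = 4096  # context length quoted above
MAX_PHASES = 16        # phases that fit in that context, as quoted above
TOKENS_PER_PHASE = CONTEXT_TOKENS // MAX_PHASES  # assumes equal cost per phase


def rollout(observed, predict_next, n_future):
    """Autoregressively extend a phase sequence by n_future steps.

    `predict_next` is a hypothetical stand-in for the model: it receives the
    full sequence so far and returns the next phase. Each prediction is
    appended to the context before the following one is made.
    """
    if len(observed) + n_future > MAX_PHASES:
        raise ValueError("sequence would exceed the context budget")
    sequence = list(observed)
    for _ in range(n_future):
        sequence.append(predict_next(sequence))
    return sequence[len(observed):]


def linear_extrapolation(sequence):
    """Toy predictor on scalar 'phases': continue the last observed step."""
    return 2 * sequence[-1] - sequence[-2]
```

With the toy predictor, `rollout([0.0, 1.0], linear_extrapolation, 3)` continues the sequence one step at a time, with each new value conditioned on everything before it, which mirrors the product form of the factorization in the first passage.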
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yutong Bai, Xinyang Geng, Karttikeya Mangalam, Amir Bar, Alan L Yuille, Trevor Darrell, Jitendra Malik, and Alexei A Efros. Sequential modeling enables scalable learning for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22861–22872, 2024.
- [2] Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-Valle, et al. Making the most of text semantics to improve biomedical vision–language processing. In European conference on computer vision, pages 1–21. Springer, 2022.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- [4] Jieneng Chen, Yongyi Lu, Qihang Yu, Xiangde Luo, Ehsan Adeli, Yan Wang, Le Lu, Alan L Yuille, and Yuyin Zhou. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.
- [5] Qi Chen, Yuxiang Lai, Xiaoxi Chen, Qixin Hu, Alan Yuille, and Zongwei Zhou. Analyzing tumors by synthesis. Generative Machine Learning Models in Medical Image Computing, page 85, 2024.
- [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations, 2020.
- [7] Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021.
- [8] Zahra Ghasemi and Payam Samadi Miandoab. Feasibility study of convolutional long short-term memory network for pulmonary movement prediction in ct images. Journal of Biomedical Physics & Engineering, 14(1):55, 2024.
- [9] Google. Veo 3 announcement. https://blog.google/technology/ai/generative-media-models-io-2025/, 2025. Accessed: September 22, 2025.
- [10] Google. Veo 3 launch. https://cloud.google.com/blog/products/ai-machine-learning/veo-3-fast-available-for-everyone-on-vertex-ai, 2025. Accessed: September 22, 2025.
- [11] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022.
- [12] Geoffrey D Hugo, Elisabeth Weiss, William C Sleeman, Salim Balik, Paul J Keall, Jun Lu, and Jeffrey F Williamson. A longitudinal four-dimensional computed tomography and cone beam computed tomography dataset for image-guided radiation therapy research in lung cancer. Medical physics, 44(2):762–771, 2017.
- [13] Fabian Isensee, Paul F Jaeger, Simon AA Kohl, Jens Petersen, and Klaus H Maier-Hein. nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods, 18(2):203–211, 2021.
- [14] Yune Kwong, Alexandra Olimpia Mel, Greg Wheeler, and John M Troupis. Four-dimensional computed tomography (4dct): a review of the current status and applications. Journal of medical imaging and radiation oncology, 59(5):545–554, 2015.
- [15] Yuxiang Lai, Yi Zhou, Xinghong Liu, and Tao Zhou. Memory-assisted sub-prototype mining for universal domain adaptation. arXiv preprint arXiv:2310.05453, 2023.
- [16] Yuxiang Lai, Xiaoxi Chen, Angtian Wang, Alan Yuille, and Zongwei Zhou. From pixel to cancer: Cellular automata in computed tomography. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 36–46. Springer, 2024.
- [17] Yuxiang Lai, Jike Zhong, Ming Li, Shitian Zhao, and Xiaofeng Yang. Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939, 2025.
- [18] Donghoon Lee, Ellen Yorke, Masoud Zarepisheh, Saad Nadeem, and Yu-Chi Hu. Rmsim: controlled respiratory motion simulation on static patient scans. Physics in Medicine & Biology, 68(4):045009, 2023.
- [19] Wenxuan Li, Chongyu Qu, Xiaoxi Chen, Pedro RAS Bassi, Yijia Shi, Yuxiang Lai, Qian Yu, Huimin Xue, Yixiong Chen, Xiaorui Lin, et al. Abdomenatlas: A large-scale, detailed-annotated, & multi-center dataset for efficient transfer learning and open algorithmic benchmarking. Medical Image Analysis, 97:103285, 2024.
- [20] Michael Moor, Qian Huang, Shirley Wu, Michihiro Yasunaga, Yash Dalmia, Jure Leskovec, Cyril Zakka, Eduardo Pontes Reis, and Pranav Rajpurkar. Med-flamingo: a multimodal medical few-shot learner. In Machine Learning for Health (ML4H), pages 353–367. PMLR, 2023.
- [21] Ke Nie, Cynthia Chuang, Neil Kirby, Steve Braunstein, and Jean Pouliot. Site-specific deformable imaging registration algorithm selection using patient-based simulated deformations. Medical physics, 40(4):041911, 2013.
- [22] OpenAI. Sora 2 system card. https://openai.com/index/sora-2-system-card/, 2025. Accessed: September 22, 2025.
- [23] Oscar Pastor-Serrano, Steven Habraken, Mischa Hoogeman, Danny Lathouwers, Dennis Schaart, Yusuke Nomura, Lei Xing, and Zoltán Perkó. A probabilistic deep learning model of inter-fraction anatomical variations in radiotherapy. Physics in Medicine & Biology, 68(8):085018, 2023.
- [24] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- [25] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.
- [26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
- [27] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.
- [28] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional lstm network: A machine learning approach for precipitation nowcasting. Advances in neural information processing systems, 28, 2015.
- [29] Andreas Smolders, Luciano Rivetti, Nadine Vatterodt, Stine Korreman, Anthony Lomax, Manju Sharma, Andrej Studen, Damien Charles Weber, Robert Jeraj, and Francesca Albetini. Diffusert: predicting likely anatomical deformations of patients undergoing radiotherapy. Physics in Medicine & Biology, 69(15):155016, 2024.
- [30] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [31] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
- [32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
- [33] Douglas J Vile. Statistical modeling of interfractional tissue deformation and its application in radiation therapy planning. Virginia Commonwealth University, 2015.
- [34] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Conference on Empirical Methods in Natural Language Processing, page 3876, 2022.
- [35] Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Sauter, Tobias Heye, Daniel T Boll, Joshy Cyriac, Shan Yang, et al. Totalsegmentator: robust segmentation of 104 anatomic structures in ct images. Radiology: Artificial Intelligence, 5(5), 2023.
- [36] Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners. arXiv preprint arXiv:2509.20328, 2025.
- [37] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
- [38] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023.