
arxiv: 2510.10254 · v2 · submitted 2025-10-11 · 💻 cs.CV

Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging?

Pith reviewed 2026-05-18 07:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords: zero-shot learning, medical imaging, video models, motion prediction, 4D CT, segmentation, denoising, super-resolution

The pith

An autoregressive video model untrained on medical data performs competitively on CT segmentation, denoising, super-resolution, and motion prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether autoregressive video modeling principles can transfer directly to medical imaging tasks even when the model has never seen medical data. It applies a large vision model in a zero-shot setting to organ segmentation, denoising, super-resolution, and motion prediction on CT scans. The model delineates anatomical structures with competitive results on the first three tasks and produces anatomically consistent forecasts of respiratory motion that surpass specialized baselines. Evaluation covers 4D CT data from 122 patients and more than 1,820 volumes, showing strong spatial accuracy in motion prediction. These results indicate that general video models may function as zero-shot learners and reasoners for medical volumes presented as frame sequences.

Core claim

Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy on 4D CT data from 122 patients totaling over 1,820 3D volumes.

What carries the argument

A large vision model (LVM) that applies autoregressive prediction to sequences of 3D medical volumes presented as video frames.

If this is right

  • General-purpose video models can act as unified learners across multiple medical imaging tasks without task-specific training.
  • Zero-shot motion prediction produces patient-specific respiratory forecasts that maintain temporal coherence and anatomical consistency.
  • State-of-the-art spatial accuracy on 4D CT data supports improved applications in radiotherapy planning.
  • Video model architectures provide a foundation for building medical foundation models that handle diverse imaging tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same zero-shot transfer may extend to other 3D medical modalities such as MRI when volumes are sequenced as frames.
  • This approach could lower the barrier to entry for medical imaging applications by reducing the need for domain-specific datasets and fine-tuning.
  • Examining which video-learned representations enable anatomical reasoning could clarify the sources of cross-domain generalization.
  • Further scaling of the underlying video model may yield measurable gains in medical task performance.

Load-bearing premise

That autoregressive video modeling principles transfer directly to 3D medical volumes when the volumes are simply presented as frame sequences, without any domain adaptation, fine-tuning, or architectural changes.

What would settle it

Showing that the model generates anatomically inconsistent predictions or fails to exceed baseline accuracy on a new set of 4D CT scans with different respiratory patterns or patient anatomies would falsify the zero-shot transfer claim.

Figures

Figures reproduced from arXiv: 2510.10254 by Jike Zhong, Ming Li, Xiaofeng Yang, Yuheng Li, Yuxiang Lai.

Figure 1: Zero-shot learning and reasoning examples of the video model in medical imaging. From low-level perceptual restoration (super…
Figure 2: Schematic illustration of intrafractional tumor motion caused by respiratory cycles during thoracic and upper-abdominal ra…
Figure 3: Multi-phase motion prediction on the public dataset. We evaluate model performance on the public 4D CT dataset using Dice Similarity Coefficient (DSC, %). Each model is provided with the first five phases of the 4D CT scan and autoregressively predicts the next five phases. The plots show phase-by-phase DSC for five representative methods (DAM, DiffuseRT, ConvLSTM, RMSim, and our proposed LVM). LVM consist…
Figure 4: Multi-phase motion prediction on the private dataset. The same DSC-based evaluation is conducted on our institutional 4D CT dataset (including lung, heart, and liver cases). Each model receives the first five phases and must generate the subsequent five phases. LVM maintains consistently higher DSC across all organs and phases, with smoother phase-to-phase transitions and less degradation compared to compe…
Figure 5: Qualitative visualization of lung motion. The first five phases are used as input, and the model predicts the next five. Each heatmap shows voxel-wise pixel differences between the ground truth (GT) and either the previous phase or the model prediction. Red indicates larger discrepancies. LVM accurately captures respiratory-induced motion, showing reduced errors and smoother temporal transitions compared …
Figure 6: Qualitative visualization of liver motion. The first five 4D CT phases are used as input, and the model predicts the next five. Each heatmap shows voxel-wise differences between the prediction (or previous phase) and the ground truth, where red indicates larger errors. LVM accurately captures the liver's smooth deformation and diaphragm-induced motion, maintaining temporal and anatomical consistency across …
Figure 7: Qualitative visualization of segmentation. For each organ, the left column shows the original CT slice, and the right column shows the predicted segmentation mask. The results demonstrate that the zero-shot video model can accurately segment organs across diverse anatomical regions based on the given input prompts.
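
The evaluation protocol in the Figure 3 and Figure 4 captions (five observed phases in, five predicted phases out) amounts to an autoregressive rollout. The following is a minimal Python sketch of that loop, not the paper's implementation: `rollout`, `predict_next`, and `linear_extrapolation` are hypothetical names, and the toy extrapolator merely stands in for the LVM, whose interface is not described on this page.

```python
import numpy as np


def rollout(context, predict_next, n_future=5):
    """Autoregressively extend a sequence of phases.

    `predict_next` is a hypothetical callable that maps the list of phases
    seen so far to the next phase. Each prediction is appended to the
    history and fed back in for the following step.
    """
    history = list(context)
    predictions = []
    for _ in range(n_future):
        next_phase = predict_next(history)
        predictions.append(next_phase)
        history.append(next_phase)
    return predictions


def linear_extrapolation(history):
    """Toy stand-in for the model: repeat the last phase-to-phase change."""
    return history[-1] + (history[-1] - history[-2])


# Five toy input "phases": constant 2x2 arrays with values 0, 1, 2, 3, 4.
phases = [np.full((2, 2), float(t)) for t in range(5)]
future = rollout(phases, linear_extrapolation)
```

Because each prediction is appended to the history before the next call, later phases depend on earlier predictions, so errors can accumulate along the rollout.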
Original abstract

Recent advances in large generative models have shown that simple autoregressive formulations, when scaled appropriately, can exhibit strong zero-shot generalization across domains. Motivated by this trend, we investigate whether autoregressive video modeling principles can be directly applied to medical imaging tasks, despite the model never being trained on medical data. Specifically, we evaluate a large vision model (LVM) in a zero-shot setting across four representative tasks: organ segmentation, denoising, super-resolution, and motion prediction. Remarkably, even without domain-specific fine-tuning, the LVM can delineate anatomical structures in CT scans and achieve competitive performance on segmentation, denoising, and super-resolution. Most notably, in radiotherapy motion prediction, the model forecasts future 3D CT phases directly from prior phases of a 4D CT scan, producing anatomically consistent predictions that capture patient-specific respiratory dynamics with realistic temporal coherence. We evaluate the LVM on 4D CT data from 122 patients, totaling over 1,820 3D CT volumes. Despite no prior exposure to medical data, the model achieves strong performance across all tasks and surpasses specialized DVF-based and generative baselines in motion prediction, achieving state-of-the-art spatial accuracy. These findings reveal the emergence of zero-shot capabilities in medical video modeling and highlight the potential of general-purpose video models to serve as unified learners and reasoners laying the groundwork for future medical foundation models built on video models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates whether principles from autoregressive video modeling in large vision models (LVMs) transfer directly in a zero-shot setting to medical imaging tasks, despite no medical training data. It evaluates an unmodified LVM on organ segmentation, denoising, super-resolution, and motion prediction using 4D CT scans from 122 patients (over 1,820 3D volumes), reporting competitive results on the first three tasks and state-of-the-art spatial accuracy in forecasting future phases that capture patient-specific respiratory dynamics, outperforming DVF-based and generative baselines.

Significance. If the zero-shot transfer holds under clarified input conditions, the work would demonstrate emergent cross-domain capabilities in general video models, supporting the potential for unified medical foundation models without domain-specific fine-tuning. The large patient cohort strengthens the empirical evaluation and provides a falsifiable test of the transfer hypothesis.

major comments (2)
  1. [Methods] Methods section (input representation): The description of converting 3D CT volumes and 4D sequences into 2D frame sequences for the LVM is insufficient. No details are provided on slice ordering, resampling to match the model's expected resolution, intensity normalization (e.g., Hounsfield units to [0,1]), or handling of anisotropic voxel grids. This directly impacts the central claim that unmodified 2D video modeling preserves 3D anatomical consistency and patient-specific dynamics in motion prediction.
  2. [Results] Results, motion prediction subsection: The SOTA claim on spatial accuracy for 1,820 volumes requires explicit quantitative comparison (e.g., mean squared error, Dice scores, or Hausdorff distances) with error bars and statistical tests against the DVF-based and generative baselines. The abstract's qualitative description of 'anatomically consistent predictions' and 'realistic temporal coherence' is not load-bearing without these metrics to confirm superiority.
minor comments (2)
  1. [Abstract] Abstract: The final sentence contains a minor grammatical issue: a comma is missing before 'laying the groundwork for future medical foundation models built on video models'.
  2. [Introduction] Notation: The term 'LVM' is introduced without an explicit expansion on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We have addressed each major comment point by point below. Where the comments identify areas needing greater clarity or rigor, we have revised the manuscript accordingly to strengthen the presentation of our methods and results.

Point-by-point responses
  1. Referee: [Methods] Methods section (input representation): The description of converting 3D CT volumes and 4D sequences into 2D frame sequences for the LVM is insufficient. No details are provided on slice ordering, resampling to match the model's expected resolution, intensity normalization (e.g., Hounsfield units to [0,1]), or handling of anisotropic voxel grids. This directly impacts the central claim that unmodified 2D video modeling preserves 3D anatomical consistency and patient-specific dynamics in motion prediction.

    Authors: We agree that the original Methods section provided insufficient detail on the input representation pipeline, which is essential for reproducibility and for supporting our claims about anatomical consistency in the zero-shot setting. In the revised manuscript, we have expanded the relevant subsection to explicitly describe: (1) slice ordering, where 3D volumes are converted to 2D frame sequences by extracting contiguous axial slices in superior-to-inferior order; (2) resampling all volumes to the LVM's native 224x224 pixel resolution using bilinear interpolation; (3) intensity normalization by clipping Hounsfield units to [-1000, 2000] and linearly scaling to [0, 1]; and (4) handling of anisotropic voxel grids via initial resampling to isotropic 1 mm³ spacing using trilinear interpolation prior to frame extraction. These additions clarify the preprocessing steps while preserving the unmodified nature of the LVM. We have also added a supplementary figure illustrating the conversion workflow for 3D and 4D data. revision: yes

  2. Referee: [Results] Results, motion prediction subsection: The SOTA claim on spatial accuracy for 1,820 volumes requires explicit quantitative comparison (e.g., mean squared error, Dice scores, or Hausdorff distances) with error bars and statistical tests against the DVF-based and generative baselines. The abstract's qualitative description of 'anatomically consistent predictions' and 'realistic temporal coherence' is not load-bearing without these metrics to confirm superiority.

    Authors: We acknowledge that the motion prediction results would be more robust with additional quantitative metrics and statistical validation to fully substantiate the state-of-the-art spatial accuracy claim. In the revised manuscript, we have augmented the Results section with a new table and accompanying text that reports mean squared error (MSE), Dice similarity coefficient (DSC) on key anatomical structures, and Hausdorff distance (HD) metrics, each with standard deviations as error bars, computed across all 1,820 volumes from the 122 patients. We also include results from paired statistical tests (Wilcoxon signed-rank tests) against the DVF-based and generative baselines, with p-values demonstrating significant improvements. These quantitative additions directly support the superiority in spatial accuracy and complement the existing qualitative descriptions of anatomical consistency and temporal coherence. revision: yes
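
Two of the metrics named in this response, DSC (reported in percent in the Figure 3 caption) and MSE, are simple to state in code. The sketch below uses NumPy and is only illustrative: the function names are hypothetical, the empty-mask convention is an assumption, and the Hausdorff distance and Wilcoxon test are left out.

```python
import numpy as np


def dice_percent(pred_mask, gt_mask):
    """Dice similarity coefficient between two binary masks, in percent."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 100.0
    intersection = np.logical_and(pred, gt).sum()
    return float(200.0 * intersection / denom)


def mse(pred, gt):
    """Mean squared error between two intensity arrays of equal shape."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    return float(np.mean((pred - gt) ** 2))
```

Per-volume scores computed this way could then be paired across methods for the signed-rank test the response mentions.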

Circularity Check

0 steps flagged

No circularity: empirical zero-shot results measured against external ground truth and baselines

The paper reports direct experimental outcomes from applying an unmodified autoregressive video model to 4D CT volumes treated as frame sequences. Claims rest on quantitative comparisons to patient-specific ground-truth phases (1,820 volumes from 122 patients) and external DVF/generative baselines, with no internal parameter fitting, self-defined metrics, or equations that reduce predictions to inputs by construction. The zero-shot transfer is an empirical observation, not a derived result; any self-citation to the base LVM is for model provenance only and does not load-bear the medical performance numbers. This is a standard self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the untested transfer of video-modeling inductive biases to medical volumes; this is an empirical assumption rather than a derived quantity.

axioms (1)
  • [domain assumption] Autoregressive video modeling principles can be directly applied to medical imaging tasks by treating 3D CT volumes as video sequences.
    Invoked in the abstract when the LVM is evaluated zero-shot on CT and 4D CT data without domain-specific fine-tuning.
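
What "treating 3D CT volumes as video sequences" means in practice depends on preprocessing choices. The simulated rebuttal above lists one such pipeline; the sketch below illustrates only its intensity normalization and axial slice ordering, using NumPy. The function names and the assumption that axis 0 indexes axial slices are illustrative, and the isotropic resampling and 224x224 resizing steps are omitted.

```python
import numpy as np

# Clipping window quoted in the simulated rebuttal.
HU_MIN, HU_MAX = -1000.0, 2000.0


def normalize_hu(volume_hu):
    """Clip Hounsfield units to [HU_MIN, HU_MAX] and scale linearly to [0, 1]."""
    clipped = np.clip(np.asarray(volume_hu, dtype=np.float32), HU_MIN, HU_MAX)
    return (clipped - HU_MIN) / (HU_MAX - HU_MIN)


def volume_to_frames(volume_hu, superior_first=True):
    """Split a (slices, height, width) volume into a list of axial frames.

    Assumption: axis 0 indexes axial slices. If the array is stored
    inferior-to-superior, pass superior_first=False to reverse the order.
    """
    volume = normalize_hu(volume_hu)
    if not superior_first:
        volume = volume[::-1]
    return [volume[i] for i in range(volume.shape[0])]
```

For a 4D scan, the same conversion would be applied phase by phase; how the resulting frames are concatenated for the model is not specified on this page.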

