pith. sign in

arxiv: 2605.13623 · v1 · pith:XTQ2KCBYnew · submitted 2026-05-13 · 💻 cs.LG

Multimodal Graph-based Classification of Esophageal Motility Disorders

Pith reviewed 2026-05-14 19:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords esophageal motility disordershigh-resolution impedance manometrygraph neural networksmultimodal classificationmachine learning in medicinespatio-temporal graphspatient data fusion
0
0 comments X

The pith

Multimodal graph neural networks that fuse esophageal pressure graphs with patient data improve classification of motility disorders over single-modality baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a machine learning method to classify esophageal motility disorders from high-resolution impedance manometry recordings combined with patient-specific details. The recordings are modeled as spatio-temporal graphs capturing pressure values and their relations over space and time, which a graph neural network processes to learn representations. These are then fused with embeddings extracted from patient demographics, clinical notes, and symptoms for multi-class prediction of swallow events. A sympathetic reader would care because manual reading of these signals shows high variability between clinicians, and reducing that inconsistency could support more reliable diagnosis. Experiments on recordings from 104 patients, including ablations, indicate gains from both the patient data and the graph structure.

Core claim

The central claim is that representing HRIM data as spatio-temporal graphs, processing them with a graph neural network to extract physiologically meaningful features, and fusing the results with patient embeddings produces better multi-class classification of esophageal motility disorders than models that use only HRIM-derived features or vision-based classifiers. Ablation studies and baseline comparisons confirm the contribution of each modality and the advantage of the graph construction.

What carries the argument

Spatio-temporal graphs of HRIM recordings, with nodes corresponding to pressure values along the esophagus and edges encoding spatial adjacency and impedance dynamics, processed by a graph neural network and fused with patient embeddings derived from demographics, symptoms, and clinical notes.

If this is right

  • The multimodal model outperforms HRIM-feature-only models across every classification category.
  • Graph-based modeling adds measurable gains beyond vision-based classifier baselines.
  • Ablation results confirm the complementary contribution of patient embeddings and graph representations.
  • The approach demonstrates feasibility for systematic integration of multiple data modalities in swallow-event classification.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If validated on larger and more diverse patient cohorts, the method could support standardized interpretation that reduces clinician-to-clinician variability in practice.
  • The same graph construction for physiological time-series signals could be tested on other gastrointestinal or cardiac recordings.
  • Combining language-model extraction of patient features from free-text notes with graph signal processing opens routes to fully automated multimodal diagnostic pipelines.

Load-bearing premise

The graph representation of HRIM recordings encodes physiologically meaningful features that improve multi-class classification performance when fused with patient embeddings.

What would settle it

Repeating the classification experiments on an independent set of HRIM recordings and patient data and finding no improvement or a drop in accuracy metrics for the multimodal graph model compared with the HRIM-only and vision baselines.

read the original abstract

Diagnosing esophageal motility disorders pose significant challenges due to the complexity of high-resolution impedance manometry (HRIM) data and variability in clinical interpretation. This work explores the feasibility of a multimodal Machine Learning (ML)-based classification approach that combines HRIM recordings with patient-specific information and incorporates a graph-based modeling of esophageal physiology. We analyze HRIM recordings with corresponding patient information from 104 patients with esophageal motility disorders. Patient data includes demographic, clinical, and symptom information extracted from structured questionnaires and free-text notes using keyword detection and large language model-based processing. HRIM data is represented as spatio-temporal graphs, where nodes correspond to pressure values along the esophagus and edges encode spatial adjacency and impedance dynamics. A graph neural network (GNN) is applied to learn physiologically meaningful representations, which are fused with patient embeddings for multi-category, multi-class classification of swallow events. The impact of patient features and graph-based modeling is evaluated by ablation studies and comparison to vision-based classifier baselines. The proposed multimodal approach indicates improvements over models that rely solely on HRIM-derived features across all classification categories. Additionally, the graph-based modeling provides gains compared to vision-based baselines. Our experiments systematically assess the complementary contribution of multiple modalities, as well as demonstrate the feasibility of our proposed graph-based approach. Our initial findings demonstrate that integrating patient-level data with graph-based representations of HRIM signals appears to be a promising direction for more accurate classification of esophageal motility disorders.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a multimodal machine learning approach for classifying esophageal motility disorders from high-resolution impedance manometry (HRIM) recordings. HRIM data is modeled as spatio-temporal graphs (nodes as pressure values, edges encoding spatial adjacency and impedance dynamics) processed by a graph neural network (GNN) to learn representations, which are fused with patient embeddings derived from demographics, clinical data, and symptom information (extracted via keyword detection and LLMs). The method is evaluated via ablation studies and comparisons to HRIM-only and vision-based baselines on data from 104 patients, with the central claim being that the multimodal graph-based fusion yields improvements over single-modality models across classification categories.

Significance. If the quantitative results and ablation details support the claims, the work could advance automated diagnosis in gastroenterology by demonstrating that graph representations of physiological signals can capture motility patterns complementary to patient-level data, potentially reducing inter-observer variability in HRIM interpretation.

major comments (3)
  1. [Abstract] Abstract: The claims that the multimodal approach 'indicates improvements over models that rely solely on HRIM-derived features across all classification categories' and that 'graph-based modeling provides gains compared to vision-based baselines' are presented without any quantitative metrics (e.g., accuracy, F1, AUC), statistical significance tests, or performance deltas, making it impossible to assess whether the data supports the central claim.
  2. [Methods] Methods (graph construction and GNN): The spatio-temporal graph representation (nodes as pressure values, edges as spatial adjacency plus impedance dynamics) and the assumption that it encodes physiologically meaningful features for GNN learning lack specific hyperparameters, edge-weight definitions, or validation against clinical motility metrics such as peristaltic velocity or integrated relaxation pressure; this is load-bearing for the claim that GNN representations are superior to vision baselines.
  3. [Experiments] Experiments and ablations: Ablation studies evaluating the contribution of patient features and graph modeling are referenced but supply no numerical results, tables, or comparisons (e.g., performance change when removing the GNN component), preventing verification of the complementary modality contributions.
minor comments (1)
  1. [Methods] The manuscript would benefit from explicit statements of the multi-class label taxonomy and the exact fusion architecture (e.g., concatenation, attention) between GNN embeddings and patient embeddings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating revisions where they will strengthen the work.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The claims that the multimodal approach 'indicates improvements over models that rely solely on HRIM-derived features across all classification categories' and that 'graph-based modeling provides gains compared to vision-based baselines' are presented without any quantitative metrics (e.g., accuracy, F1, AUC), statistical significance tests, or performance deltas, making it impossible to assess whether the data supports the central claim.

    Authors: We agree that the abstract would be strengthened by including quantitative metrics. The full manuscript reports specific results (accuracy, F1, and AUC) in the Experiments section supporting the stated improvements. In the revised version we will update the abstract to include key numerical values and any statistical tests performed, allowing direct assessment of the claims. revision: yes

  2. Referee: [Methods] Methods (graph construction and GNN): The spatio-temporal graph representation (nodes as pressure values, edges as spatial adjacency plus impedance dynamics) and the assumption that it encodes physiologically meaningful features for GNN learning lack specific hyperparameters, edge-weight definitions, or validation against clinical motility metrics such as peristaltic velocity or integrated relaxation pressure; this is load-bearing for the claim that GNN representations are superior to vision baselines.

    Authors: The Methods section describes the graph nodes as pressure values and edges as spatial adjacency plus impedance dynamics, with the GNN applied to learn representations. We will expand this section in revision to specify all hyperparameters (layer count, hidden size, etc.), exact edge-weight formulas, and any correlations or comparisons performed against clinical metrics such as peristaltic velocity. This will make the physiological grounding and superiority claim more transparent. revision: yes

  3. Referee: [Experiments] Experiments and ablations: Ablation studies evaluating the contribution of patient features and graph modeling are referenced but supply no numerical results, tables, or comparisons (e.g., performance change when removing the GNN component), preventing verification of the complementary modality contributions.

    Authors: The Experiments section contains ablation results with numerical tables comparing full multimodal model against HRIM-only and vision baselines, including performance changes when patient embeddings or the GNN component are removed. To improve clarity we will add explicit cross-references in the main text to the relevant table rows and deltas, and if needed include a concise summary table of ablation contributions. revision: partial

Circularity Check

0 steps flagged

No circularity: standard supervised GNN pipeline with empirical ablations

full rationale

The paper presents a conventional supervised learning workflow: HRIM recordings from 104 patients are converted to spatio-temporal graphs (nodes as pressure values, edges as adjacency plus impedance), a GNN learns representations, these are fused with patient embeddings derived from demographics/symptoms, and the combined features are used for multi-class swallow classification. Ablation studies and comparisons to HRIM-only and vision baselines are reported as empirical results. No equations, parameters, or claims reduce by construction to their own inputs; no self-citation load-bearing uniqueness theorems, fitted inputs renamed as predictions, or ansatzes smuggled via prior work appear. The central claim of multimodal gains rests on held-out evaluation rather than definitional equivalence, making the derivation self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

Relies on standard assumptions from graph signal processing and multimodal ML; no free parameters explicitly fitted beyond typical ML tuning, and no new entities postulated.

free parameters (1)
  • Various GNN and embedding model hyperparameters
    Chosen during training to optimize classification performance on the dataset.
axioms (2)
  • domain assumption HRIM data can be meaningfully represented as spatio-temporal graphs with nodes as pressure values and edges encoding spatial adjacency and impedance dynamics
    Core modeling choice for applying GNNs.
  • domain assumption Patient information extracted via keyword detection and LLM processing provides complementary features for classification
    Assumed in the multimodal fusion.

pith-pipeline@v0.9.0 · 5567 in / 1394 out tokens · 109616 ms · 2026-05-14T19:30:29.684280+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    Otolaryn- gologic Clinics of North America57(4) (2024)

    Hunter, C.J., Tulunay-Ugur, O.E.: Dysphagia in the Aging Population. Otolaryn- gologic Clinics of North America57(4) (2024)

  2. [2]

    International Journal of Language & Communication Disorders58(2) (2023)

    Smith, R., Bryant, L., Hemsley, B.: The true cost of dysphagia on quality of life: The views of adults with swallowing disability. International Journal of Language & Communication Disorders58(2) (2023)

  3. [3]

    The Laryngoscope130(4) (2020)

    Allen, J., Greene, M., Sabido, I., Stretton, M., Miles, A.: Economic costs of dysphagia among hospitalized patients. The Laryngoscope130(4) (2020)

  4. [4]

    Rheumatology47(6) (2008)

    Sheehan, N.J.: Dysphagia and other manifestations of oesophageal involvement in the musculoskeletal diseases. Rheumatology47(6) (2008)

  5. [5]

    American Family Physician102(5) (2020)

    Wilkinson, J.M., Halland, M.: Esophageal Motility Disorders. American Family Physician102(5) (2020)

  6. [6]

    Digestive and Liver Disease40(3) (2008)

    Bredenoord, A.J., Smout, A.J.P.M.: High-resolution manometry. Digestive and Liver Disease40(3) (2008)

  7. [7]

    Gastroenterology & Hepatology11(6) (2015) 12

    Carlson, D.A., Pandolfino, J.E.: High-Resolution Manometry in Clinical Practice. Gastroenterology & Hepatology11(6) (2015) 12

  8. [8]

    Diseases of the Esophagus: Official Journal of the International Society for Diseases of the Esophagus28(8) (2015)

    Fox, M.R., Pandolfino, J.E., Sweis, R., et al.: Inter-observer agreement for diag- nostic classification of esophageal motility disorders defined in high-resolution manometry. Diseases of the Esophagus: Official Journal of the International Society for Diseases of the Esophagus28(8) (2015)

  9. [9]

    Journal of Neurogastroenterology and Motility24(1) (2018)

    Kim, J.H., Kim, S.E., Cho, Y.K., Lim, C.-H., Park, M.I., Hwang, J.W., Jang, J.-S., Oh, M.: Factors Determining the Inter-observer Variability and Diagnos- tic Accuracy of High-resolution Manometry for Esophageal Motility Disorders. Journal of Neurogastroenterology and Motility24(1) (2018)

  10. [10]

    Journal of Medical Internet Research27(1) (2025)

    Gong, E.J., Bang, C.S., Lee, J.J., Baik, G.H.: AI in Esophageal Motility Dis- orders: Systematic Review of High-Resolution Manometry Studies. Journal of Medical Internet Research27(1) (2025)

  11. [11]

    Physiology & Behavior165(2016)

    Jungheim, M., Busche, A., Miller, S., Schilling, N., Schmidt-Thieme, L., Ptok, M.: Calculation of upper esophageal sphincter restitution time from high resolution manometry data using machine learning. Physiology & Behavior165(2016)

  12. [12]

    The Turkish Journal of Gastroenterology: The Official Journal of Turkish Society of Gastroenterology25(5) (2014)

    Lee, T.H., Lee, J.S., Hong, S.J., Lee, J.S., Jeon, S.R., Kim, W.J., Kim, H.G., Cho, J.Y., Kim, J.O., Cho, J.H., Park, W.Y., Park, J.W., Lee, Y.G.: High- resolution manometry: Reliability of automated analysis of upper esophageal sphincter relaxation parameters. The Turkish Journal of Gastroenterology: The Official Journal of Turkish Society of Gastroenter...

  13. [13]

    Visceral Medicine36(6) (2020)

    Jell, A., Kuttler, C., Ostler, D., H¨ user, N.: How to Cope with Big Data in Functional Analysis of the Esophagus. Visceral Medicine36(6) (2020)

  14. [14]

    Communications Medicine5(1) (2025)

    Geiger, A., Wagner, L., Rueckert, D., Wilhelm, D., Jell, A.: A deep learning- based approach to enhance accuracy and feasibility of long-term high-resolution manometry examinations. Communications Medicine5(1) (2025)

  15. [15]

    International Journal of Computer Assisted Radiology and Surgery20(4) (2025)

    Geiger, A., Bernhard, L., Gassert, F., Feußner, H., Wilhelm, D., Friess, H., Jell, A.: Towards multimodal visualization of esophageal motility: Fusion of manome- try, impedance, and videofluoroscopic image sequences. International Journal of Computer Assisted Radiology and Surgery20(4) (2025)

  16. [16]

    The Laryngoscope123(3) (2013)

    Hoffman, M.R., Mielens, J.D., Omari, T.I., Rommel, N., Jiang, J.J., McCul- loch, T.M.: Artificial neural network classification of pharyngeal high-resolution manometry with impedance data. The Laryngoscope123(3) (2013)

  17. [17]

    Otolaryngology–head and neck surgery149(1) (2013)

    Hoffman, M.R., Jones, C.A., Geng, Z., Abelhalim, S.M., Walczak, C.C., Mitchell, A.R., Jiang, J.J., McCulloch, T.M.: Classification of high-resolution manome- try data according to videofluoroscopic parameters using pattern recognition. Otolaryngology–head and neck surgery149(1) (2013)

  18. [18]

    Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine230(9) (2016)

    Carniel, E.L., Frigo, A., Costantini, M., Giuliani, T., Nicoletti, L., Merigliano, S., Natali, A.N.: A physiological model for the investigation of esophageal motility in 13 healthy and pathologic conditions. Proceedings of the Institution of Mechanical Engineers, Part H: Journal of Engineering in Medicine230(9) (2016)

  19. [19]

    IEEE Transactions on Biomedical Engineering65(7) (2018)

    Frigo, A., Costantini, M., Fontanella, C.G., Salvador, R., Merigliano, S., Carniel, E.L.: A Procedure for the Automatic Analysis of High-Resolution Manometry Data to Support the Clinical Diagnosis of Esophageal Motility Disorders. IEEE Transactions on Biomedical Engineering65(7) (2018)

  20. [20]

    Applied Sciences13(18) (2023)

    Zifan, A., Lin, J., Peng, Z., Bo, Y., Mittal, R.K.: Unraveling Functional Dys- phagia: A Game-Changing Automated Machine-Learning Diagnostic Approach. Applied Sciences13(18) (2023)

  21. [21]

    American Journal of Physiology-Gastrointestinal and Liver Physiology 327(3) (2024)

    Zifan, A., Lee, J.M., Mittal, R.K.: Enhancing the diagnostic yield of esophageal manometry using distension-contraction plots of peristalsis and artificial intel- ligence. American Journal of Physiology-Gastrointestinal and Liver Physiology 327(3) (2024)

  22. [22]

    Journal of Gastrointestinal and Liver Diseases31(4) (2022)

    Popa, S.L., Surdea-Blaga, T., Dumitrascu, D.L., Chiarioni, G., Savarino, E., David, L., Ismaiel, A., Leucuta, D.C., Zsigmond, I., Sebestyen, G., Hangan, A., Czako, Z.: Automatic Diagnosis of High-Resolution Esophageal Manometry Using Artificial Intelligence. Journal of Gastrointestinal and Liver Diseases31(4) (2022)

  23. [23]

    Sensors22(14) (2022)

    Surdea-Blaga, T., Sebestyen, G., Czako, Z., Hangan, A., Dumitrascu, D.L., Ismaiel, A., David, L., Zsigmond, I., Chiarioni, G., Savarino, E., Leucuta, D.C., Popa, S.L.: Automated Chicago Classification for Esophageal Motility Disorder Diagnosis Using Machine Learning. Sensors22(14) (2022)

  24. [24]

    Neurogas- troenterology and motility : the official journal of the European Gastrointestinal Motility Society34(7) (2022)

    Kou, W., Galal, G.O., Klug, M.W., Mukhin, V., Carlson, D.A., Etemadi, M., Kahrilas, P.J., Pandolfino, J.E.: Deep learning based artificial intelligence model for identifying swallow types in esophageal high-resolution manometry. Neurogas- troenterology and motility : the official journal of the European Gastrointestinal Motility Society34(7) (2022)

  25. [25]

    Artificial Intelligence in Medicine124(2022)

    Kou, W., Carlson, D.A., Baumann, A.J., Donnan, E.N., Schauer, J.M., Etemadi, M., Pandolfino, J.E.: A multi-stage machine learning model for diagnosis of esophageal manometry. Artificial Intelligence in Medicine124(2022)

  26. [26]

    Computer Methods and Programs in Biomedicine207(2021)

    Wang, Z., Hou, M., Yan, L., Dai, Y., Yin, Y., Liu, X.: Deep learning for tracing esophageal motility function over time. Computer Methods and Programs in Biomedicine207(2021)

  27. [27]

    Journal of Biomedical Informatics141(2023)

    Rafieivand, S., Moradi, M.H., Momayez Sanat, Z., Asl Soleimani, H.: A fuzzy-based framework for diagnosing esophageal mobility disorder using high- resolution manometry. Journal of Biomedical Informatics141(2023)

  28. [28]

    PLOS ONE20(2) (2025)

    Wu, X., Guo, C., Lin, J., Lin, Z., Chen, Q.: Mixed attention ensemble for 14 esophageal motility disorders classification. PLOS ONE20(2) (2025)

  29. [29]

    Journal of Neurogastroenterology and Motility23(2) (2017)

    Shim, Y.K., Kim, N., Park, Y.H., Lee, J.-C., Sung, J., Choi, Y.J., Yoon, H., Shin, C.M., Park, Y.S., Lee, D.H.: Effects of Age on Esophageal Motility: Use of High- resolution Esophageal Impedance Manometry. Journal of Neurogastroenterology and Motility23(2) (2017)

  30. [30]

    Revista de Gastroenterolog´ ıa de M´ exico85(3) (2020)

    Kunen, L.C.B., Fontes, L.H.S., Moraes-Filho, J.P., Assirati, F.S., Navarro- Rodriguez, T.: Esophageal motility patterns are altered in older adult patients. Revista de Gastroenterolog´ ıa de M´ exico85(3) (2020)

  31. [31]

    Brazilian Journal of Medical and Biological Research31(1998)

    Dantas, R.O., Ferriolli, E., Souza, M.a.N.: Gender effects on esophageal motility. Brazilian Journal of Medical and Biological Research31(1998)

  32. [32]

    Gastroenterology Report6(3) (2018)

    Kamal, A., Shakya, S., Lopez, R., Thota, P.N.: Gender, medication use and other factors associated with esophageal motility disorders in non-obstructive dysphagia. Gastroenterology Report6(3) (2018)

  33. [33]

    Journal of Neurogastroenterology and Motility27(4) (2021)

    Takahashi, S., Matsumura, T., Kaneko, T., Tokunaga, M., Oura, H., Ishikawa, T., Nagashima, A., Shiratori, W., Akizue, N., Ohta, Y., Kikuchi, A., Fujie, M., Saito, K., Okimoto, K., Maruoka, D., Nakagawa, T., Arai, M., Kato, J., Kato, N.: Clinical Characteristics of Esophageal Motility Disorders in Patients With Heartburn. Journal of Neurogastroenterology a...

  34. [34]

    Neurogastroenterology & Motility36(11) (2024)

    Le, K.H.N., Low, E.E., Sharma, P., Greytak, M., Yadlapati, R.: Normative high resolution esophageal manometry values in asymptomatic patients with obesity. Neurogastroenterology & Motility36(11) (2024)

  35. [35]

    Dysphagia38(4) (2023)

    Cohen, D.L., Hijazi, B., Omari, A., Bermont, A., Shirin, H., Said Ahmad, H., Azzam, N., Shibli, F., Dickman, R., Mari, A.: Ethnic Differences in Clinical Pre- sentations and Esophageal High-Resolution Manometry Findings in Patients with Achalasia. Dysphagia38(4) (2023)

  36. [36]

    Neurogastroenterology & Motility27(2) (2015)

    Kahrilas, P.J., Bredenoord, A.J., Fox, M., Gyawali, C.P., Roman, S., Smout, A.J.P.M., Pandolfino, J.E., Group, I.H.R.M.W.: The Chicago Classification of esophageal motility disorders, v3.0. Neurogastroenterology & Motility27(2) (2015)

  37. [37]

    Qwen3 Technical Report

    Qwen Team: Qwen3 Technical Report (2025). https://arxiv.org/abs/2505.09388

  38. [38]

    Brody, S., Alon, U., Yahav, E.: How Attentive are Graph Attention Networks? In: International Conference on Learning Representations (2022)

  39. [39]

    arXiv (2020)

    Li, G., Xiong, C., Thabet, A., Ghanem, B.: DeeperGCN: All You Need to Train Deeper GCNs. arXiv (2020)

  40. [40]

    In: Advances in Neural Information Processing Systems, vol

    Khosla, P., Teterwak, P., Wang, C., Sarna, A., Tian, Y., Isola, P., Maschinot, A., 15 Liu, C., Krishnan, D.: Supervised Contrastive Learning. In: Advances in Neural Information Processing Systems, vol. 33 (2020)

  41. [41]

    In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recog- nition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, Las Vegas, NV, USA (2016)

  42. [42]

    In: International Conference on Learning Representations (2021)

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: International Conference on Learning Representations (2021)

  43. [43]

    In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 16

    Liu, Z., Hu, H., Lin, Y., Yao, Z., Xie, Z., Wei, Y., Ning, J., Cao, Y., Zhang, Z., Dong, L., Wei, F., Guo, B.: Swin Transformer V2: Scaling Up Capacity and Resolution. In: 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022) 16