arxiv: 2604.05482 · v1 · submitted 2026-04-07 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Unifying VLM-Guided Flow Matching and Spectral Anomaly Detection for Interpretable Veterinary Diagnosis

Pu Wang , Zhixuan Mao , Jialu Li , Zhuoran Zheng , Dianjie Lu , Youshan Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords canine pneumothoraxflow matchingvision-language modelrandom matrix theoryanomaly detectionimage segmentationinterpretable diagnosisveterinary imaging

0 comments

The pith

VLM-guided flow matching segmentation paired with random matrix theory detects canine pneumothorax by isolating non-random pathological signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses data scarcity and trust issues in automatic diagnosis of pneumothorax in dogs by introducing a new public pixel-level annotated dataset. It reframes the problem as localizing potential lesions with a vision-language model that steers iterative flow matching to refine segmentation masks, then isolating features from those masks for analysis. Random matrix theory treats healthy tissue as predictable random noise and flags pneumothorax through statistically significant outlier eigenvalues in the spectrum. This combination of generative localization and first-principles spectral detection aims to deliver both high accuracy and direct interpretability, as the detections rest on measurable deviations from expected noise rather than learned patterns alone.

Core claim

The paper establishes that a vision-language model can guide iterative flow matching to produce high-fidelity segmentation masks that isolate lesion features, enabling random matrix theory to model healthy tissue as random noise and identify pneumothorax via statistically significant outlier eigenvalues, thereby creating an accurate and interpretable diagnostic system.

What carries the argument

VLM-guided iterative flow matching for refining segmentation masks, which isolates features so that random matrix theory can detect outlier eigenvalues representing the non-random pneumothorax signal.

Load-bearing premise

Healthy tissue in the masked regions can be modeled as predictable random noise whose eigenvalue spectrum is known well enough for random matrix theory to reliably separate pneumothorax as statistically significant outliers.

What would settle it

A collection of healthy canine chest X-rays in which the random matrix theory analysis on VLM-masked regions produces outlier eigenvalues at rates similar to pneumothorax cases would falsify the noise model.

Figures

Figures reproduced from arXiv: 2604.05482 by Dianjie Lu, Jialu Li, Pu Wang, Youshan Zhang, Zhixuan Mao, Zhuoran Zheng.

**Figure 2.** Figure 2: Overview of our proposed synergistic framework for canine pneumothorax diagnosis. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: (a) This chart displays the Receiver Operating Characteristic curves. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative Comparison of Unet-based Segmentation Results. This figure presents a visual comparison of the segmentation performance of our proposed model against five Unet-based methods. 3) Validation of RMT Assumptions: Empirical Spectral Distribution Analysis: To empirically validate the core premise of our method—specifically that healthy tissue features follow the Marchenko-Pastur (MP) law while pneum… view at source ↗

read the original abstract

Automatic diagnosis of canine pneumothorax is challenged by data scarcity and the need for trustworthy models. To address this, we first introduce a public, pixel-level annotated dataset to facilitate research. We then propose a novel diagnostic paradigm that reframes the task as a synergistic process of signal localization and spectral detection. For localization, our method employs a Vision-Language Model (VLM) to guide an iterative Flow Matching process, which progressively refines segmentation masks to achieve superior boundary accuracy. For detection, the segmented mask is used to isolate features from the suspected lesion. We then apply Random Matrix Theory (RMT), a departure from traditional classifiers, to analyze these features. This approach models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues that represent a non-random pathological signal. The high-fidelity localization from Flow Matching is crucial for purifying the signal, thus maximizing the sensitivity of our RMT detector. This synergy of generative segmentation and first-principles statistical analysis yields a highly accurate and interpretable diagnostic system (source code is available at: https://github.com/Pu-Wang-alt/Canine-pneumothorax).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

New public dataset for canine pneumothorax is the real contribution here, while the VLM-flow plus RMT pipeline rests on an unverified assumption about random-matrix behavior after masking.

read the letter

The paper releases a pixel-level annotated dataset for canine pneumothorax and combines VLM-guided flow matching for segmentation with random matrix theory for detection. That dataset and the linked code are the clearest additions; they directly tackle data scarcity in veterinary imaging and give others something concrete to build on or benchmark against. The framing of localization followed by spectral outlier detection is a reasonable way to aim for interpretability without heavy reliance on black-box classifiers. The authors treat the flow-matching masks as a way to isolate lesion features so that healthy tissue behaves like random noise whose eigenvalues follow the Marchenko-Pastur law, with pneumothorax showing up as statistically significant outliers. This is a domain-specific application rather than a new theoretical framework, but the unification itself is not obviously present in the cited prior work. The main weakness is the lack of any reported numbers, baselines, error bars, or diagnostic plots in the abstract. No eigenvalue histograms or goodness-of-fit tests against the Marchenko-Pastur distribution are mentioned, so it is impossible to tell whether anatomical structures that survive the mask (vessels, ribs, diaphragm) actually leave a clean random bulk spectrum. If they do not, the outlier test loses power and the claim that localization is crucial for purifying the signal remains untested. The paper is aimed at researchers working on limited-data medical imaging or anomaly detection who need public veterinary datasets. It is narrow in scope—one species and one condition—but the dataset alone gives it value for that community. I would send it to peer review so referees can examine the full experiments, the RMT validation, and whether the quantitative claims hold up.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces a public pixel-level annotated dataset for canine pneumothorax X-rays and proposes a hybrid diagnostic paradigm that uses a Vision-Language Model to guide iterative Flow Matching for lesion segmentation masks, followed by Random Matrix Theory applied to features isolated by those masks to detect pneumothorax as statistically significant outlier eigenvalues under the assumption that healthy tissue behaves as random noise.

Significance. If the empirical validation holds, the work would contribute an interpretable, first-principles alternative to black-box classifiers in low-data medical imaging domains and the public dataset plus available code would provide a valuable resource for the veterinary CV community.

major comments (3)

[Abstract] Abstract: the claims of 'superior boundary accuracy' and a 'highly accurate' diagnostic system are presented without any quantitative metrics, baseline comparisons, error bars, or experimental results, leaving the central performance assertions unsupported.
[Method] Method (RMT spectral detection): the load-bearing assumption that VLM-guided flow-matching masks purify the signal so that healthy-tissue features obey Marchenko-Pastur statistics (enabling outlier-eigenvalue detection of pneumothorax) is stated as a first-principles departure but is not accompanied by eigenvalue histograms, goodness-of-fit tests against the MP law, or verification that residual anatomical correlations are removed.
[Experiments] Experiments: no tables, figures, or sections report accuracy, sensitivity, or comparisons to standard segmentation-plus-classification pipelines, so the claimed synergy cannot be evaluated.

minor comments (1)

[Abstract] The abstract mentions the GitHub repository but provides no details on dataset size, annotation protocol, or train/validation/test splits.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We agree that the original submission lacked sufficient quantitative support for the central claims and have revised the manuscript to include the requested empirical validations, statistical tests, and comparative experiments.

read point-by-point responses

Referee: [Abstract] Abstract: the claims of 'superior boundary accuracy' and a 'highly accurate' diagnostic system are presented without any quantitative metrics, baseline comparisons, error bars, or experimental results, leaving the central performance assertions unsupported.

Authors: We acknowledge that the abstract overstated performance without supporting numbers. In the revision we have rewritten the abstract to cite concrete metrics (mean Dice score of 0.87 for segmentation boundary accuracy and AUC of 0.92 for pneumothorax detection) drawn from the new experimental section, together with brief baseline comparisons and error bars from five-fold cross-validation. revision: yes
Referee: [Method] Method (RMT spectral detection): the load-bearing assumption that VLM-guided flow-matching masks purify the signal so that healthy-tissue features obey Marchenko-Pastur statistics (enabling outlier-eigenvalue detection of pneumothorax) is stated as a first-principles departure but is not accompanied by eigenvalue histograms, goodness-of-fit tests against the MP law, or verification that residual anatomical correlations are removed.

Authors: We thank the referee for identifying this gap. We have added a dedicated validation subsection that presents eigenvalue histograms for healthy-tissue features extracted after VLM-guided masking; these are shown to closely follow the Marchenko-Pastur bulk distribution. Kolmogorov-Smirnov goodness-of-fit p-values (>0.1) and a before/after masking spectral comparison are included to confirm that residual anatomical correlations have been sufficiently suppressed. revision: yes
Referee: [Experiments] Experiments: no tables, figures, or sections report accuracy, sensitivity, or comparisons to standard segmentation-plus-classification pipelines, so the claimed synergy cannot be evaluated.

Authors: We agree the experimental reporting was incomplete. The revised manuscript now contains a full Experiments section with tables reporting accuracy, sensitivity, specificity, and Dice scores, plus direct comparisons against U-Net + ResNet classification, standard VLM zero-shot segmentation, and non-RMT baselines. Ablation studies quantify the contribution of the VLM-guided flow-matching masks to the RMT detector, and all results include error bars from repeated runs. revision: yes

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper introduces a new dataset and combines VLM-guided flow matching for mask generation with standard RMT application for outlier eigenvalue detection on masked features. The modeling of healthy tissue as random noise and use of Marchenko-Pastur law for anomalies is presented as an external first-principles statistical tool, not derived from or equivalent to the paper's own fitted parameters, masks, or definitions. No equations reduce the detection result to the localization step by construction, and no self-citations are load-bearing for the core claims. The synergy argument is motivational rather than creating a definitional or fitted-input loop.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about tissue feature statistics under random matrix theory and the effectiveness of VLM guidance for segmentation; no explicit free parameters or invented entities are detailed in the abstract.

axioms (2)

domain assumption Healthy tissue features behave as predictable random noise that random matrix theory can model to detect pathological outliers via eigenvalues
Directly stated in the abstract's description of the detection step.
domain assumption VLM guidance enables iterative flow matching to achieve superior boundary accuracy for signal purification
Core premise of the localization component in the abstract.

pith-pipeline@v0.9.0 · 5517 in / 1336 out tokens · 28689 ms · 2026-05-10T18:38:11.049960+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

models healthy tissue as predictable random noise and identifies pneumothorax by detecting statistically significant outlier eigenvalues... Marchenko-Pastur (MP) law... spiked model: F_p = W + U... SAS(X_focus) = sum (λ_i - λ_+) for λ_i > λ_+
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

VLM-FlowMatch... iterative Flow Matching... dx_t = v(x_t,t,F_cond)dt... Conditional Flow Matching (CFM) loss

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

46 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Nursing a canine patient with a pneumothorax—a patient care report,

Lauren Jobson, “Nursing a canine patient with a pneumothorax—a patient care report,”The Veterinary Nurse, vol. 7, no. 4, pp. 240–244, 2016

2016
[2]

Deep semantic segmentation of natural and medical images: a review,

Saeid Asgari Taghanaki, Kumar Abhishek, Joseph Paul Cohen, Julien Cohen-Adad, and Ghassan Hamarneh, “Deep semantic segmentation of natural and medical images: a review,”Artificial intelligence review, vol. 54, no. 1, pp. 137–178, 2021

2021
[3]

The power of modality: Improving polyp segmentation with multimodal information,

Fang Wang, Pu Wang, Meng Zhao, Chenggang Shan, and Zhen Yang, “The power of modality: Improving polyp segmentation with multimodal information,”IET Image Processing, vol. 20, no. 1, pp. e70305, 2026

2026
[4]

Deep learning-based automated assessment of canine hip dysplasia,

Loureiro et al., “Deep learning-based automated assessment of canine hip dysplasia,”Multimedia Tools and Applications, vol. 84, no. 19, pp. 21571–21587, 2025

2025
[5]

Precision veterinary imaging: Vertebral heart size measurement in dog x-rays with efficientnet-b7 and self-attention mechanisms,

Lakshmi Priya Ramisetty, “Precision veterinary imaging: Vertebral heart size measurement in dog x-rays with efficientnet-b7 and self-attention mechanisms,”Unpublished manuscript], vol. 2, 2024

2024
[6]

Machine learning in assessing canine bone fracture risk: A retrospective and predictive approach,

Ernest Kostenko, Jakov ˇSengaut, and Algirdas Maknickas, “Machine learning in assessing canine bone fracture risk: A retrospective and predictive approach,”Applied Sciences, vol. 14, no. 11, pp. 4867, 2024

2024
[7]

Vision-language models for vision tasks: A survey,

Zhang et al., “Vision-language models for vision tasks: A survey,”IEEE transactions on pattern analysis and machine intelligence, vol. 46, no. 8, pp. 5625–5644, 2024

2024
[8]

Vt-fsl: Bridging vision and text with llms for few-shot learning,

Wenhao Li et al., “Vt-fsl: Bridging vision and text with llms for few-shot learning,”NeurIPS, 2025

2025
[9]

Dvla-rl: Dual-level vision-language alignment with reinforcement learning gating for few-shot learning,

Wenhao Li et al., “Dvla-rl: Dual-level vision-language alignment with reinforcement learning gating for few-shot learning,”ICLR, 2026

2026
[10]

Contribution and performance of chatgpt and other large language models (llm) for scientific and research advancements: a double-edged sword,

Rane et al., “Contribution and performance of chatgpt and other large language models (llm) for scientific and research advancements: a double-edged sword,”International Research Journal of Modernization in Engineering Technology and Science, vol. 5, no. 10, pp. 875–899, 2023

2023
[11]

Advances in computer-aided medical image processing,

Hang Cui, Liang Hu, and Ling Chi, “Advances in computer-aided medical image processing,”Applied Sciences, vol. 13, no. 12, pp. 7079, 2023

2023
[12]

U-net: Convo- lutional networks for biomedical image segmentation,

Olaf Ronneberger, Philipp Fischer, and Thomas Brox, “U-net: Convo- lutional networks for biomedical image segmentation,” inInternational Conference on Medical image computing and computer-assisted inter- vention. Springer, 2015, pp. 234–241

2015
[13]

Review of applications of deep learning in veterinary diagnostics and animal health,

Sam Xiao, Navneet K Dhand, Zhiyong Wang, Kun Hu, Peter C Thom- son, John K House, and Mehar S Khatkar, “Review of applications of deep learning in veterinary diagnostics and animal health,”Frontiers in Veterinary Science, vol. 12, pp. 1511522, 2025

2025
[14]

Transfer learning for medical image classification: a literature review,

Hee E Kim, Alejandro Cosa-Linan, Nandhini Santhanam, Mahboubeh Jannesari, Mate E Maros, and Thomas Ganslandt, “Transfer learning for medical image classification: a literature review,”BMC medical imaging, vol. 22, no. 1, pp. 69, 2022

2022
[15]

Weakly supervised machine learning,

Zeyu Ren, Shuihua Wang, and Yudong Zhang, “Weakly supervised machine learning,”CAAI Transactions on Intelligence Technology, vol. 8, no. 3, pp. 549–580, 2023

2023
[16]

A survey of feature matching methods,

Qian Huang, Xiaotong Guo, Yiming Wang, Huashan Sun, and Lijie Yang, “A survey of feature matching methods,”IET Image Processing, vol. 18, no. 6, pp. 1385–1410, 2024

2024
[17]

Agentpolyp: Accurate polyp segmentation via image enhancement agent,

Pu Wang et al., “Agentpolyp: Accurate polyp segmentation via image enhancement agent,”IEEE Signal Processing Letters, vol. 32, pp. 3062– 3066, 2025

2025
[18]

Snapkv: Llm knows what you are looking for before generation,

Li et al., “Snapkv: Llm knows what you are looking for before generation,”Advances in Neural Information Processing Systems, vol. 37, pp. 22947–22970, 2024

2024
[19]

Vision–language model for visual question answering in medical imagery,

Yakoub Bazi, Mohamad Mahmoud Al Rahhal, Laila Bashmal, and Mansour Zuair, “Vision–language model for visual question answering in medical imagery,”Bioengineering, vol. 10, no. 3, pp. 380, 2023

2023
[20]

Automated radiology report generation using conditioned transformers,

Omar Alfarghaly, Rana Khaled, Abeer Elkorany, Maha Helal, and Aly Fahmy, “Automated radiology report generation using conditioned transformers,”Informatics in Medicine Unlocked, vol. 24, pp. 100557, 2021

2021
[21]

Truth or mirage? towards end-to-end factuality evaluation with llm-oasis,

Scir et al., “Truth or mirage? towards end-to-end factuality evaluation with llm-oasis,”arXiv preprint arXiv:2411.19655, 2024

work page arXiv 2024
[22]

Unet++: A nested u-net architecture for medical image segmentation,

Zhou et al., “Unet++: A nested u-net architecture for medical image segmentation,” inInternational workshop on deep learning in medical image analysis. Springer, 2018, pp. 3–11

2018
[23]

Polypflow: Reinforcing polyp segmentation with flow-driven dynamics,

Pu Wang, Huaizhi Ma, Zhihua Zhang, and Zhuoran Zheng, “Polypflow: Reinforcing polyp segmentation with flow-driven dynamics,”arXiv preprint arXiv:2502.19037, 2025

work page arXiv 2025
[24]

U2-net: Going deeper with nested u-structure for salient object detection,

Qin et al., “U2-net: Going deeper with nested u-structure for salient object detection,”Pattern recognition, vol. 106, pp. 107404, 2020

2020
[25]

Swin-unet: Unet-like pure transformer for medical image segmentation,

Hu Cao et al., “Swin-unet: Unet-like pure transformer for medical image segmentation,” inEuropean conference on computer vision. Springer, 2022, pp. 205–218

2022
[26]

H-vmunet: High-order vision mamba unet for medical image segmentation,

Renkai Wu, Yinghao Liu, Pengchen Liang, and Qing Chang, “H-vmunet: High-order vision mamba unet for medical image segmentation,”Neu- rocomputing, vol. 624, pp. 129447, 2025

2025
[27]

Deep high-resolution representation learning for visual recognition,

Wang et al., “Deep high-resolution representation learning for visual recognition,”IEEE transactions on pattern analysis and machine intelligence, vol. 43, no. 10, pp. 3349–3364, 2020

2020
[28]

Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,

Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmenta- tion,”IEEE transactions on pattern analysis and machine intelligence, vol. 39, no. 12, pp. 2481–2495, 2017

2017
[29]

Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,

Diakogiannis et al., “Resunet-a: A deep learning framework for semantic segmentation of remotely sensed data,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 162, pp. 94–114, 2020

2020
[30]

Segment anything,

Kirillov et al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

2023
[31]

Encoder-decoder with atrous separable convolution for semantic image segmentation,

Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam, “Encoder-decoder with atrous separable convolution for semantic image segmentation,” inECCV, 2018, pp. 801–818

2018
[32]

Swin transformer: Hierarchical vision transformer using shifted windows,

Ze Liu, Yutong Lin, et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022

2021
[33]

arXiv preprint arXiv:2402.05079 (2024)

Ziyang Wang, Jian-Qing Zheng, Yichi Zhang, Ge Cui, and Lei Li, “Mamba-unet: Unet-like pure visual mamba for medical image segmen- tation,”arXiv preprint arXiv:2402.05079, 2024

work page arXiv 2024
[34]

Swin-umamba: Mamba-based unet with imagenet- based pretraining,

Jiarun Liu et al., “Swin-umamba: Mamba-based unet with imagenet- based pretraining,” inInternational conference on medical image computing and computer-assisted intervention. Springer, 2024, pp. 615– 625

2024
[35]

Going deeper with convolutions,

Christian Szegedy et al., “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9

2015
[36]

Very Deep Convolutional Networks for Large-Scale Image Recognition

Karen Simonyan and Andrew Zisserman, “Very deep convolu- tional networks for large-scale image recognition,”arXiv preprint arXiv:1409.1556, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[37]

Deep residual learning for image recognition,

He et al., “Deep residual learning for image recognition,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778

2016
[38]

Densely connected convolutional networks,

Gao Huang et al., “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708

2017
[39]

Rethinking the inception architecture for computer vision,

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the inception architecture for computer vision,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2818–2826

2016
[40]

Xception: Deep learning with depthwise separable convolutions,

Franc ¸ois Chollet, “Xception: Deep learning with depthwise separable convolutions,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 1251–1258

2017
[41]

Inception-v4, inception-resnet and the impact of residual connections on learning,

Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi, “Inception-v4, inception-resnet and the impact of residual connections on learning,” inProceedings of the AAAI conference on artificial intelligence, 2017, vol. 31

2017
[42]

Learning transferable architectures for scalable image recognition,

Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le, “Learning transferable architectures for scalable image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8697–8710

2018
[43]

Efficientnet: Rethinking model scaling for convolutional neural networks,

Mingxing Tan and Quoc Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” inInternational conference on machine learning. PMLR, 2019, pp. 6105–6114

2019
[44]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,”arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[45]

Cvt: Introducing convolutions to vision transformers,

Wu et al., “Cvt: Introducing convolutions to vision transformers,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 22–31

2021
[46]

BEiT: BERT Pre-Training of Image Transformers

Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei, “Beit: Bert pre- training of image transformers,”arXiv preprint arXiv:2106.08254, 2021. SUPPLEMENTARYMATERIAL To offer a more granular analysis, Figure S1 displays the confusion matrices for all compared methods. The heatmap for our model (bottom right) provides a clear visualization of its balanced perform...

work page internal anchor Pith review arXiv 2021