VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Bo-Cheng Qiu; Chia-Ming Lee; Chih-Chung Hsu; Fang-Ying Lin; Ming-Han Sun; Yu-Fan Lin

arxiv: 2605.22096 · v1 · pith:MVBYXGXCnew · submitted 2026-05-21 · 💻 cs.CV


Pith reviewed 2026-05-22 07:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords: capsule endoscopy, event detection, rare pathology, foundation models, video analysis, temporal decoding, medical imaging, anatomical awareness


The pith

VISTA integrates temporal and spatial foundation models with anatomical decoding to detect rare events in capsule endoscopy videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Capsule endoscopy videos contain long sequences where important medical findings are both rare and visually varied, so frame-level accuracy falls short of clinical needs. The paper develops VISTA to fuse outputs from a temporal foundation model and a spatial one, using validation to guide the weighting and adding anatomy-aware decoding to produce event-level predictions. This setup is shown to raise performance on a hidden test set for the RAREVISION task. A sympathetic reader would care because better detection could help doctors find subtle issues in these examinations without sifting through thousands of frames manually.
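To make the gap between frame-level and event-level output concrete, the sketch below groups consecutive above-threshold frames into events. It is a generic baseline decoder, not the paper's ATED step; the function name `frames_to_events` and the `min_len` parameter are hypothetical, and the anatomy-aware logic is not described in this summary.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # inclusive (start_frame, end_frame)


def frames_to_events(probs: List[float], threshold: float, min_len: int = 1) -> List[Event]:
    """Group runs of consecutive frames with prob >= threshold into events.

    Assumption: an event is a maximal run of above-threshold frames that is
    at least `min_len` frames long. Real decoders may also merge short gaps.
    """
    events: List[Event] = []
    start = None  # index where the current run began, or None if not in a run
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - 1))
            start = None
    # Close a run that extends to the last frame.
    if start is not None and len(probs) - start >= min_len:
        events.append((start, len(probs) - 1))
    return events
```

For example, `frames_to_events([0.1, 0.9, 0.8, 0.2, 0.7], 0.5)` yields two events, one covering frames 1 to 2 and one covering frame 4 alone. Raising `min_len` to 2 drops the single-frame event, which shows how sensitive event-level output is to decoding choices.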

Core claim

The VISTA framework achieves hidden-test temporal mAP@0.5 of 0.3726 and mAP@0.95 of 0.3431. It combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Weighted Fusion, and Anatomy-Aware Temporal Event Decoding. The reported scores come from a post-competition global coarse search that improved on the original submission's 0.3530 and 0.3235.

What carries the argument

Validation-Guided Weighted Fusion (VGWF) that weights multi-backbone predictions based on validation performance, paired with Anatomy-Aware Temporal Event Decoding (ATED) to align outputs with event-level metrics.
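A minimal sketch of what validation-guided weighting could look like is given below. The exact rule used by VGWF is not stated in this summary, so the choice here (weights proportional to each backbone's validation score, then a weighted average of per-frame probabilities) is an assumption, as are the names `validation_weights` and `fuse`.

```python
from typing import Dict, List


def validation_weights(val_scores: Dict[str, float]) -> Dict[str, float]:
    """Turn per-model validation scores into weights that sum to 1.

    Assumption: weights are proportional to the validation score.
    Falls back to uniform weights if all scores are zero or negative.
    """
    total = sum(val_scores.values())
    if total <= 0:
        return {name: 1.0 / len(val_scores) for name in val_scores}
    return {name: score / total for name, score in val_scores.items()}


def fuse(probs: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-frame probabilities across models."""
    n_frames = len(next(iter(probs.values())))
    fused = [0.0] * n_frames
    for name, model_probs in probs.items():
        if len(model_probs) != n_frames:
            raise ValueError(f"model {name!r} has a different number of frames")
        w = weights[name]
        for i, p in enumerate(model_probs):
            fused[i] += w * p
    return fused
```

With hypothetical validation scores of 0.3 for the temporal model and 0.1 for the spatial model, the weights come out near 0.75 and 0.25. The point of the sketch is that the weights depend only on validation data, which is exactly the property the load-bearing premise relies on.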

Load-bearing premise

That the validation performance used to guide fusion and threshold choices will translate to performance on truly unseen clinical distributions rather than being tuned to the competition sets.

What would settle it

Evaluating the VISTA model on a new collection of capsule endoscopy videos collected from different hospitals or patient groups, checking if the mAP remains close to 0.37 or falls significantly.

Figures

Figures reproduced from arXiv: 2605.22096 by Bo-Cheng Qiu, Chia-Ming Lee, Chih-Chung Hsu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin.

**Figure 1.** Overview of the developed pipeline. We report validation-set ablations using the official event-level metrics temporal mAP@0.5 and temporal mAP@0.95. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png]

Original abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
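As a rough illustration of why event-level scoring is stricter than frame accuracy, the sketch below computes a temporal IoU between segments and a simple average precision at a chosen IoU threshold. This is not the official RARE-VISION scorer: the inclusive-frame IoU, the greedy matching, and the non-interpolated AP are all assumptions, and the names `temporal_iou` and `average_precision` are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # inclusive (start_frame, end_frame)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two inclusive frame ranges."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def average_precision(preds: List[Tuple[Segment, float]],
                      gts: List[Segment],
                      iou_thr: float) -> float:
    """Non-interpolated AP for one class at one IoU threshold.

    Predictions are visited in descending score order; each is greedily
    matched to the unmatched ground-truth segment with the highest IoU.
    """
    if not gts:
        return 0.0
    matched = [False] * len(gts)
    ap, true_pos = 0.0, 0
    ranked = sorted(preds, key=lambda item: -item[1])
    for rank, (seg, _score) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if matched[j]:
                continue
            iou = temporal_iou(seg, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_thr:
            matched[best_j] = True
            true_pos += 1
            ap += true_pos / rank  # precision at this recall step
    return ap / len(gts)
```

In this sketch a prediction of frames 20 to 30 against a ground-truth event at frames 20 to 29 has IoU 10/11, so it counts as a hit at IoU 0.5 but as a miss at IoU 0.95. A single extra frame is enough to lose the match at the stricter threshold.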

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VISTA wires EndoFM-LV and DINOv3 into a working pipeline for rare VCE events and posts a modest hidden-test mAP lift, but the post-competition threshold search undercuts how much we can trust the final numbers.


This paper shows a concrete way to combine a temporal foundation model with a visual one for detecting sparse pathology events in capsule endoscopy video. It reaches second place in the RAREVISION post-competition ranking with temporal mAP@0.5 of 0.3726 and mAP@0.95 of 0.3431 on the hidden test set. The pipeline uses EndoFM-LV for sequence context, DINOv3 for frame features, a diverse head ensemble, validation-guided fusion, and anatomy-aware decoding. That integration is the practical contribution, and the hidden-test numbers give it some external grounding that pure validation scores lack. The original submission already sat at 0.3530/0.3235, so the method itself is doing real work before any extra tuning.

The post-competition global coarse search over thresholds is the clearest soft spot. Because it was run after results were known, the chosen values can encode test-set statistics rather than just anatomical or temporal priors. No ablation isolates that search from the hidden labels, and the abstract gives no error bars, dataset size details, or full hyperparameter ranges. This makes the final gain of about 0.02 mAP (0.0196 on both metrics) harder to read as generalizable rather than distribution-specific fitting. The rest of the approach looks like standard ensemble and fusion steps applied to existing models, which is fine for an engineering paper but does not claim new primitives.

Readers who build diagnostic tools for gastroenterology or who run similar medical-video competitions will find the recipe useful. It is worth sending to peer review so referees can check the implementation details and ask for the missing ablations. The empirical ranking on a hidden set is enough to justify the time even if the tuning step needs tighter controls.

Referee Report

1 major / 2 minor

Summary. The paper presents VISTA, a multi-backbone framework for rare-pathology video capsule endoscopy (VCE) event detection that fuses EndoFM-LV temporal context with DINOv3 ViT-L/16 frame-level semantics via a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). It reports an original hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235, improved to 0.3726 and 0.3431 after post-competition extension of local threshold refinement with a global coarse search, achieving second place in post-competition ranking.

Significance. If the performance gains are shown to arise from the core VISTA pipeline rather than post-competition threshold fitting, the work would demonstrate a practical route for combining spatial-temporal foundation models with anatomical priors to address sparse, event-level detection in medical video, with direct relevance to improving diagnostic sensitivity for rare findings where conventional frame-wise metrics fall short.

major comments (1)

[Abstract] The headline claim of improved hidden-test mAP (0.3726/0.3431) rests on a global coarse search over detection thresholds performed after competition results were known. No ablation isolating this search from hidden-test labels is referenced, and the description does not establish that the procedure was fixed and validation-only; this directly weakens the generalizability of the reported ranking and the assertion that the gains reflect the DHE+VGWF+ATED pipeline.

minor comments (2)

The manuscript lacks dataset statistics (e.g., number of positive events, class imbalance ratios), error bars or variance estimates on the mAP scores, and component ablations that would quantify the individual contributions of DHE, VGWF, and ATED.
Notation for the threshold search (local refinement vs. global coarse) should be formalized, ideally with pseudocode or explicit parameter ranges, to allow readers to reproduce the exact post-competition procedure.
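The kind of pseudocode the second minor comment asks for might look like the sketch below: a global coarse grid over the threshold followed by a local refinement around the best coarse value. The grid steps, the search range, and the name `coarse_then_local_search` are assumptions, since the paper's actual ranges are not given here. The `score_fn` argument stands for a validation-only metric, such as validation temporal mAP at a given threshold.

```python
from typing import Callable, Tuple


def coarse_then_local_search(score_fn: Callable[[float], float],
                             coarse_step: float = 0.1,
                             fine_step: float = 0.01) -> Tuple[float, float]:
    """Pick a detection threshold in (0, 1) by coarse-to-fine grid search.

    `score_fn(threshold)` must be computed on validation data only;
    feeding it hidden-test feedback would leak test statistics.
    """
    # Global coarse stage: evenly spaced thresholds strictly inside (0, 1).
    n_coarse = round(1.0 / coarse_step)
    coarse_grid = [i * coarse_step for i in range(1, n_coarse)]
    best_t = max(coarse_grid, key=score_fn)

    # Local refinement stage: finer grid within one coarse step of the best value.
    k = round(coarse_step / fine_step)
    fine_grid = [best_t + j * fine_step for j in range(-k, k + 1)]
    fine_grid = [t for t in fine_grid if 0.0 < t < 1.0]
    best_t = max(fine_grid, key=score_fn)
    return best_t, score_fn(best_t)
```

Writing the procedure this way makes the referee's concern testable: if `score_fn` only ever sees validation data, the search is fixed before the hidden test set is touched, and the resulting threshold can be reported alongside the grid parameters.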

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comment regarding the abstract and post-competition threshold refinement below, and we commit to revisions that improve clarity without altering the reported contributions of the VISTA pipeline.


Referee: [Abstract] The headline claim of improved hidden-test mAP (0.3726/0.3431) rests on a global coarse search over detection thresholds performed after competition results were known. No ablation isolating this search from hidden-test labels is referenced, and the description does not establish that the procedure was fixed and validation-only; this directly weakens the generalizability of the reported ranking and the assertion that the gains reflect the DHE+VGWF+ATED pipeline.

Authors: We agree that the current abstract wording emphasizes the post-competition metrics (0.3726 mAP@0.5 and 0.3431 mAP@0.95), which were obtained by extending local threshold refinement with a global coarse search after competition results were known. The official competition submission, using thresholds fixed from validation data only, achieved 0.3530 mAP@0.5 and 0.3235 mAP@0.95; these are the results that directly reflect the DHE+VGWF+ATED pipeline. We will revise the abstract to lead with the official competition metrics as the primary reported outcomes and move the post-competition refinement to a brief note on potential further gains. This revision will also explicitly state that all pre-submission threshold decisions were validation-only. An ablation that isolates the global search while using hidden-test labels is not feasible, as test annotations remain unavailable to participants; the search relied on validation-based heuristics extended after the fact. We maintain that the competitive ranking and core performance gains originate from the proposed multi-backbone fusion and decoding components rather than threshold tuning alone.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with hidden-test grounding


The paper presents an applied ML framework (DHE + VGWF + ATED) for VCE event detection and reports measured mAP on a hidden test set. No derivation chain, first-principles result, or closed-form prediction is claimed that reduces by construction to its own inputs, fitted parameters, or self-citations. The post-competition threshold search is described as an empirical extension rather than a mathematical step whose output is definitionally identical to its input. External hidden-test evaluation supplies independent grounding, making the central ranking claim self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that pre-trained EndoFM-LV and DINOv3 features transfer usefully to VCE, plus the modeling choice that validation-guided fusion plus threshold search will improve event-level mAP without overfitting.

free parameters (1)

detection thresholds
Local and global thresholds refined post-competition via coarse search to raise hidden-test mAP.

axioms (1)

domain assumption: Foundation models trained on general or endoscopic data supply transferable features for rare VCE pathology detection.
Invoked by the choice to use EndoFM-LV and DINOv3 as backbones without domain-specific retraining from scratch.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: VISTA combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED).

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1] Maxime Le Floch, Fabian Wolf, Lucian McIntyre, Paul Herzog, Christoph Weinert, Albrecht Palm, Konrad Volk, Sophie Helene Kirk, Jonas L. Steinhäuser, Catrein Stopp, Mark Enrik Geissler, Moritz Herzog, Stefan Sulk, Jakob Nikolas Kather, Alexander Meining, Alexander Hann, Jochen Hampe, Nora Herzog, and Franz Brinkmann. Galar - a large multi-label vid... doi:10.25452/figshare.plus.25304616, 2025

[2] Anni Lawniczak, Manas Dhir, Maxime Le Floch, Palak Handa, and Anastasios Koulaouzidis. ICPR 2026 RARE-VISION Competition Document and Flyer. 12 2025. doi: 10.6084/m9.figshare.30884858.v3. URL https://figshare.com/articles/preprint/ICPR_2026_RARE-VISION_Competition_Document_and_Flyer/30884858

[3] Yu-Fan Lin, Bo-Cheng Qiu, Chia-Ming Lee, and Chih-Chung Hsu. Divide and conquer: Grounding a bleeding areas in gastrointestinal image with two-stage model. arXiv preprint arXiv:2412.16723, 2024

[4] Zhao Wang, Chang Liu, Lingting Zhu, Tongtong Wang, Shaoting Zhang, and Qi Dou. Improving foundation model for endoscopy video analysis via representation learning on long sequences. IEEE Journal of Biomedical and Health Informatics, 2025

[5] Chia-Ming Lee, Bo-Cheng Qiu, Ting-Yao Chen, Ming-Han Sun, Yu-Fan Lin, Fang-Ying Lin, Jung-Tse Tsai, I-An Tsai, and Chih-Chung Hsu. Taming domain shift in multi-source ct-scan classification via input-space standardization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7331–7339, 2025

[6] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025

[7] Wongi Park, Inhyuk Park, Sungeun Kim, and Jongbin Ryu. Robust asymmetric loss for multi-label long-tailed learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2711–2720, 2023

[8] Ying-Chih Lin and Yong-Sheng Chen. Weighted stratification in multi-label contrastive learning for long-tailed medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 677–

[9] Seung-Joo Nam, Gwiseong Moon, Jung-Hwan Park, Yoon Kim, Yun Jeong Lim, and Hyun-Soo Choi. Deep learning-based real-time organ localization and transit time estimation in wireless capsule endoscopy. Biomedicines, 12(8):1704, 2024

[10] Anni Lawniczak, Manas Dhir, Maxime Le Floch, Palak Handa, and Anastasios Koulaouzidis. Rare-vision-2026-competition website. https://github.com/RAREChallenge2026/RARE-VISION-2026-Challenge, 2026. Website and GitHub repository for the ICPR 2026 RARE-VISION Competition; accessed 2026-03-27

[11] Maxime Le Floch, Anni Lawniczak, Catrein Stopp, Alexander Zech, Alexandra Kolbig, Hannah Tolle, Jonas L. Steinhaeuser-Meerz, Jochen Hampe, and Franz Brinkmann. Test data for icpr 2026 - rare-vision competition, 2026. URL https://doi.org/10.25532/OPARA-1119

[12] Manas Dhir, Palak Handa, Anni Lawniczak, and Maxime Le Floch. Rareeval scoring app. https://scoringrarevision.streamlit.app/, 2026. Streamlit application, accessed 2026-03-27
