VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results
Pith reviewed 2026-05-22 07:27 UTC · model grok-4.3
The pith
VISTA integrates temporal and spatial foundation models with anatomical decoding to detect rare events in capsule endoscopy videos.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The VISTA framework reaches a hidden-test temporal mAP@0.5 of 0.3726 and mAP@0.95 of 0.3431. It combines EndoFM-LV for temporal context with DINOv3 ViT-L/16 for frame-level visual semantics, then applies a Diverse Head Ensemble, Validation-Guided Weighted Fusion, and Anatomy-Aware Temporal Event Decoding. The headline figures come from a post-competition global coarse search; the original official submission scored 0.3530 and 0.3235.
What carries the argument
Validation-Guided Weighted Fusion (VGWF) that weights multi-backbone predictions based on validation performance, paired with Anatomy-Aware Temporal Event Decoding (ATED) to align outputs with event-level metrics.
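To make the fusion idea concrete, here is a minimal sketch of weighting per-model frame probabilities by validation performance. It is not the authors' implementation: the normalization rule (weights proportional to validation score), the function names `validation_weights` and `fuse`, the head names, and all numbers are assumptions chosen for illustration.

```python
from typing import Dict, List


def validation_weights(val_scores: Dict[str, float]) -> Dict[str, float]:
    """Turn per-model validation scores into fusion weights that sum to 1.

    Assumption: weights are simply proportional to the validation score.
    """
    total = sum(val_scores.values())
    if total <= 0:
        # No model has a positive score, so fall back to uniform weights.
        return {name: 1.0 / len(val_scores) for name in val_scores}
    return {name: score / total for name, score in val_scores.items()}


def fuse(frame_probs: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted average of per-frame probabilities across models."""
    n_frames = len(next(iter(frame_probs.values())))
    return [
        sum(weights[name] * probs[i] for name, probs in frame_probs.items())
        for i in range(n_frames)
    ]


# Hypothetical validation scores and per-frame probabilities for one class.
val_scores = {"endofm_lv_head": 0.30, "dinov3_head": 0.10}
frame_probs = {
    "endofm_lv_head": [0.8, 0.4],
    "dinov3_head": [0.4, 0.0],
}

weights = validation_weights(val_scores)
fused = fuse(frame_probs, weights)
print(weights, fused)
```

Because the weights are derived from validation scores, any mismatch between the validation and deployment distributions carries directly into the fused output, which is the concern raised under the load-bearing premise.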
Load-bearing premise
That the validation performance used to guide fusion and threshold choices will translate to performance on truly unseen clinical distributions rather than being tuned to the competition sets.
What would settle it
Evaluating the VISTA model on a new set of capsule endoscopy videos from different hospitals or patient groups, and checking whether temporal mAP@0.5 stays close to 0.37 or drops substantially.
Original abstract
Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
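The gap between the mAP@0.5 and mAP@0.95 figures is easier to read with an event-level matching example. The sketch below assumes that the number after "@" is a temporal intersection-over-union (IoU) threshold and that events are inclusive frame intervals; the official scoring code may differ, and the names `temporal_iou` and `count_true_positives` are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (start_frame, end_frame), both inclusive


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection-over-union of two inclusive frame intervals."""
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def count_true_positives(preds: List[Event], truths: List[Event], iou_threshold: float) -> int:
    """Greedily match predictions to unmatched ground-truth events.

    Assumption: `preds` is already sorted by descending confidence.
    """
    matched = set()
    true_positives = 0
    for pred in preds:
        best_iou, best_idx = 0.0, None
        for idx, truth in enumerate(truths):
            if idx in matched:
                continue
            iou = temporal_iou(pred, truth)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            true_positives += 1
    return true_positives


# A prediction that misses the first 10 frames of a 100-frame event.
truths = [(100, 199)]
preds = [(110, 199)]

print(temporal_iou(preds[0], truths[0]))
print(count_true_positives(preds, truths, 0.5))
print(count_true_positives(preds, truths, 0.95))
```

Under these assumptions the same prediction counts as a hit at the 0.5 threshold and as a miss at 0.95, which shows why boundary placement matters for event-level scoring even when most frames are labeled correctly.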
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents VISTA, a multi-backbone framework for rare-pathology video capsule endoscopy (VCE) event detection that fuses EndoFM-LV temporal context with DINOv3 ViT-L/16 frame-level semantics via a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). It reports an original hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235, improved to 0.3726 and 0.3431 after a post-competition extension of local threshold refinement with a global coarse search, which placed the team second in the post-competition ranking.
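For readers unfamiliar with temporal event decoding, the following sketch shows a generic threshold-and-merge decoder that turns per-frame probabilities into event intervals. It is not ATED: the anatomy-aware component is not described in this summary and is omitted here. The function name `decode_events` and the parameters `threshold`, `max_gap`, and `min_length` are assumptions for illustration.

```python
from typing import List, Tuple


def decode_events(
    probs: List[float],
    threshold: float = 0.5,
    max_gap: int = 2,
    min_length: int = 3,
) -> List[Tuple[int, int]]:
    """Convert per-frame probabilities into inclusive (start, end) events.

    Frames at or above `threshold` are positive. Positive runs separated by
    at most `max_gap` negative frames are merged, and events shorter than
    `min_length` frames are discarded.
    """
    events = []
    start = None      # first frame of the event being built
    last_pos = None   # most recent positive frame seen
    for i, p in enumerate(probs):
        if p < threshold:
            continue
        if start is None:
            start = i
        elif i - last_pos - 1 > max_gap:
            # The gap is too long, so close the current event and open a new one.
            events.append((start, last_pos))
            start = i
        last_pos = i
    if start is not None:
        events.append((start, last_pos))
    return [(s, e) for s, e in events if e - s + 1 >= min_length]


# Hypothetical per-frame probabilities for one class.
probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1]
events = decode_events(probs)
print(events)
```

The three parameters here are examples of the kind of decoding choices that would normally be fixed on validation data, since each one changes event boundaries and therefore the event-level score.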
Significance. If the performance gains are shown to arise from the core VISTA pipeline rather than post-competition threshold fitting, the work would demonstrate a practical route for combining spatial-temporal foundation models with anatomical priors to address sparse, event-level detection in medical video, with direct relevance to improving diagnostic sensitivity for rare findings where conventional frame-wise metrics fall short.
major comments (1)
- [Abstract] Abstract: the headline claim of improved hidden-test mAP (0.3726/0.3431) rests on a global coarse search over detection thresholds performed after competition results were known. No ablation isolating this search from hidden-test labels is referenced, and the description does not establish that the procedure was fixed and validation-only; this directly weakens the generalizability of the reported ranking and the assertion that the gains reflect the DHE+VGWF+ATED pipeline.
minor comments (2)
- The manuscript lacks dataset statistics (e.g., number of positive events, class imbalance ratios), error bars or variance estimates on the mAP scores, and component ablations that would quantify the individual contributions of DHE, VGWF, and ATED.
- Notation for the threshold search (local refinement vs. global coarse) should be formalized, ideally with pseudocode or explicit parameter ranges, to allow readers to reproduce the exact post-competition procedure.
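As an illustration of the kind of pseudocode the second minor comment asks for, here is a sketch of a coarse global grid search followed by a local refinement around the best coarse value. The ordering, ranges, and step sizes are assumptions, not the authors' procedure. `score_fn` is a hypothetical callable that must score a threshold on validation data only; the toy scoring curve stands in for a real validation metric.

```python
from typing import Callable, List, Tuple


def grid(lo: float, hi: float, step: float) -> List[float]:
    """Evenly spaced values from lo to hi inclusive, rounded to avoid float drift."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 6) for i in range(n + 1)]


def coarse_then_local(
    score_fn: Callable[[float], float],
    lo: float = 0.05,
    hi: float = 0.95,
    coarse_step: float = 0.1,
    fine_step: float = 0.01,
) -> Tuple[float, float]:
    """Pick a threshold with a coarse global pass, then refine it locally."""
    # Global coarse pass over the whole allowed range.
    coarse_best = max(grid(lo, hi, coarse_step), key=score_fn)
    # Local refinement within one coarse step on either side of the best value.
    fine_lo = max(lo, coarse_best - coarse_step)
    fine_hi = min(hi, coarse_best + coarse_step)
    fine_best = max(grid(fine_lo, fine_hi, fine_step), key=score_fn)
    return fine_best, score_fn(fine_best)


def toy_validation_score(threshold: float) -> float:
    """Hypothetical validation curve that peaks at a threshold of 0.62."""
    return 1.0 - abs(threshold - 0.62)


best_threshold, best_score = coarse_then_local(toy_validation_score)
print(best_threshold, best_score)
```

Stating the ranges and step sizes explicitly, and confirming that `score_fn` never touches hidden-test feedback, would address both the reproducibility request and the major comment about validation-only tuning.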
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address the major comment regarding the abstract and post-competition threshold refinement below, and we commit to revisions that improve clarity without altering the reported contributions of the VISTA pipeline.
Point-by-point responses
- Referee: [Abstract] Abstract: the headline claim of improved hidden-test mAP (0.3726/0.3431) rests on a global coarse search over detection thresholds performed after competition results were known. No ablation isolating this search from hidden-test labels is referenced, and the description does not establish that the procedure was fixed and validation-only; this directly weakens the generalizability of the reported ranking and the assertion that the gains reflect the DHE+VGWF+ATED pipeline.
Authors: We agree that the current abstract wording emphasizes the post-competition metrics (0.3726 mAP@0.5 and 0.3431 mAP@0.95), which were obtained by extending local threshold refinement with a global coarse search after competition results were known. The official competition submission, using thresholds fixed from validation data only, achieved 0.3530 mAP@0.5 and 0.3235 mAP@0.95; these are the results that directly reflect the DHE+VGWF+ATED pipeline. We will revise the abstract to lead with the official competition metrics as the primary reported outcomes and move the post-competition refinement to a brief note on potential further gains. This revision will also explicitly state that all pre-submission threshold decisions were validation-only. An ablation that isolates the global search while using hidden-test labels is not feasible, as test annotations remain unavailable to participants; the search relied on validation-based heuristics extended after the fact. We maintain that the competitive ranking and core performance gains originate from the proposed multi-backbone fusion and decoding components rather than threshold tuning alone.
Revision: yes
Circularity Check
No circularity: empirical framework with hidden-test grounding
The paper presents an applied ML framework (DHE + VGWF + ATED) for VCE event detection and reports measured mAP on a hidden test set. No derivation chain, first-principles result, or closed-form prediction is claimed that reduces by construction to its own inputs, fitted parameters, or self-citations. The post-competition threshold search is described as an empirical extension rather than a mathematical step whose output is definitionally identical to its input. External hidden-test evaluation supplies independent grounding, making the central ranking claim self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- detection thresholds
axioms (1)
- domain assumption: Foundation models trained on general or endoscopic data supply transferable features for rare VCE pathology detection.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "VISTA combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED)."
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: alpha_pin_under_high_calibration
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.