
arxiv: 2605.22096 · v1 · pith:MVBYXGXCnew · submitted 2026-05-21 · 💻 cs.CV

VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Pith reviewed 2026-05-22 07:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords capsule endoscopy, event detection, rare pathology, foundation models, video analysis, temporal decoding, medical imaging, anatomical awareness

The pith

VISTA integrates temporal and spatial foundation models with anatomical decoding to detect rare events in capsule endoscopy videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Capsule endoscopy videos are long sequences in which important medical findings are both rare and visually varied, so frame-level accuracy falls short of clinical needs. The paper develops VISTA to fuse the outputs of a temporal foundation model and a spatial one, using validation performance to guide the weighting and adding anatomy-aware decoding to produce event-level predictions. On the hidden test set for the RAREVISION task, the setup reaches a temporal mAP@0.5 of 0.3726 after a post-competition threshold search, up from 0.3530 in the official submission. A sympathetic reader would care because better detection could help doctors find subtle issues in these examinations without manually sifting through thousands of frames.

Core claim

The VISTA framework reaches a hidden-test temporal mAP@0.5 of 0.3726 and mAP@0.95 of 0.3431. It combines EndoFM-LV for temporal context with DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble, Validation-Guided Weighted Fusion, and Anatomy-Aware Temporal Event Decoding. These scores come from a post-competition global coarse search; the original submission scored 0.3530 and 0.3235.

What carries the argument

Validation-Guided Weighted Fusion (VGWF) that weights multi-backbone predictions based on validation performance, paired with Anatomy-Aware Temporal Event Decoding (ATED) to align outputs with event-level metrics.
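
The weighting rule itself is not spelled out on this page, so the following is only a minimal sketch of one plausible reading: each backbone's frame probabilities are averaged with weights derived from its validation score. The function names, the softmax weighting, the `temperature` parameter, and the example scores are all assumptions made for illustration, not the authors' implementation.

```python
import math

def validation_weights(val_scores, temperature=0.05):
    """Map per-model validation scores to weights that sum to 1 (softmax)."""
    top = max(val_scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in val_scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(per_model_probs, weights):
    """Weighted average; per_model_probs[m][t] is model m's probability for frame t."""
    n_frames = len(per_model_probs[0])
    return [
        sum(w * probs[t] for w, probs in zip(weights, per_model_probs))
        for t in range(n_frames)
    ]

# Hypothetical validation scores for two backbones (illustrative values only)
weights = validation_weights([0.30, 0.35])
fused = fuse([[0.2, 0.9], [0.4, 0.7]], weights)
```

A lower `temperature` pushes the fusion toward the single best-validating model, while a higher one approaches a plain average; either way the weights depend only on validation data.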

Load-bearing premise

That the validation performance used to guide fusion and threshold choices will translate to performance on truly unseen clinical distributions rather than being tuned to the competition sets.

What would settle it

Evaluating the VISTA model on a new collection of capsule endoscopy videos collected from different hospitals or patient groups, checking if the mAP remains close to 0.37 or falls significantly.

Figures

Figures reproduced from arXiv: 2605.22096 by Bo-Cheng Qiu, Chia-Ming Lee, Chih-Chung Hsu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin.

Figure 1: Overview of the developed pipeline. We report validation-set ablations using the official event-level metrics temporal mAP@0.5 and temporal mAP@0.95.

Original abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViTL/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
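
To make "evaluated at the event level" concrete, here is a minimal sketch of how per-frame probabilities might be turned into events and compared with a ground-truth interval by temporal IoU. It assumes inclusive frame intervals and a single fixed threshold, and it leaves out the anatomical part of ATED entirely; `decode_events`, `temporal_iou`, and their parameters are hypothetical names, not the paper's or the challenge's code.

```python
def decode_events(frame_probs, threshold=0.5, min_len=1):
    """Merge contiguous frames with prob >= threshold into (start, end) events.

    Intervals are inclusive; runs shorter than min_len frames are dropped.
    """
    events, start = [], None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i  # a new run begins
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - 1))
            start = None  # the run has ended
    # Close a run that extends to the last frame
    if start is not None and len(frame_probs) - start >= min_len:
        events.append((start, len(frame_probs) - 1))
    return events

def temporal_iou(a, b):
    """Intersection over union of two inclusive frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union

predicted = decode_events([0.1, 0.8, 0.9, 0.2, 0.7])
overlap = temporal_iou((0, 9), (5, 14))
```

Under an IoU threshold of 0.95, a predicted event counts as a match only if its boundaries almost coincide with the annotated ones, which is why decoding and threshold choices can move the event-level score even when the frame-level predictions stay the same.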

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents VISTA, a multi-backbone framework for rare-pathology video capsule endoscopy (VCE) event detection that fuses EndoFM-LV temporal context with DINOv3 ViTL/16 frame-level semantics via a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). It reports an original hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235, improved to 0.3726 and 0.3431 after post-competition extension of local threshold refinement with a global coarse search, achieving second place in post-competition ranking.

Significance. If the performance gains are shown to arise from the core VISTA pipeline rather than post-competition threshold fitting, the work would demonstrate a practical route for combining spatial-temporal foundation models with anatomical priors to address sparse, event-level detection in medical video, with direct relevance to improving diagnostic sensitivity for rare findings where conventional frame-wise metrics fall short.

major comments (1)
  1. [Abstract] The headline claim of improved hidden-test mAP (0.3726/0.3431) rests on a global coarse search over detection thresholds performed after competition results were known. No ablation isolating this search from hidden-test labels is referenced, and the description does not establish that the procedure was fixed and validation-only; this directly weakens the generalizability of the reported ranking and the assertion that the gains reflect the DHE+VGWF+ATED pipeline.
minor comments (2)
  1. The manuscript lacks dataset statistics (e.g., number of positive events, class imbalance ratios), error bars or variance estimates on the mAP scores, and component ablations that would quantify the individual contributions of DHE, VGWF, and ATED.
  2. Notation for the threshold search (local refinement vs. global coarse) should be formalized, ideally with pseudocode or explicit parameter ranges, to allow readers to reproduce the exact post-competition procedure.
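
As an illustration of the kind of formalization minor comment 2 asks for, the sketch below shows one generic coarse-to-fine threshold search: a global pass on a coarse grid, then a local refinement around the coarse optimum. The parameter ranges, step sizes, and the `score_fn` callback are assumptions chosen for illustration; the page does not state the authors' actual grids or the order in which their two stages were run.

```python
def frange(start, stop, step):
    """Inclusive float grid from start to stop, rounded to limit drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 6) for i in range(n + 1)]

def coarse_to_fine_search(score_fn, lo=0.05, hi=0.95,
                          coarse_step=0.10, fine_step=0.01):
    """Return (best_threshold, best_score).

    score_fn(threshold) should compute the metric on validation data only,
    so that the selected threshold never depends on test labels.
    """
    # Global coarse pass over the full range
    best_t = max(frange(lo, hi, coarse_step), key=score_fn)
    # Local refinement within one coarse step of the coarse optimum
    window_lo = max(lo, best_t - coarse_step)
    window_hi = min(hi, best_t + coarse_step)
    best_t = max(frange(window_lo, window_hi, fine_step), key=score_fn)
    return best_t, score_fn(best_t)

# Toy objective standing in for validation mAP as a function of the threshold
best_t, best_score = coarse_to_fine_search(lambda t: -(t - 0.42) ** 2)
```

Stating the grids and the data split behind `score_fn` in this explicit form is what would let a reader check whether a search was fixed in advance and validation-only.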

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address the major comment regarding the abstract and post-competition threshold refinement below, and we commit to revisions that improve clarity without altering the reported contributions of the VISTA pipeline.

point-by-point responses
  1. Referee: [Abstract] The headline claim of improved hidden-test mAP (0.3726/0.3431) rests on a global coarse search over detection thresholds performed after competition results were known. No ablation isolating this search from hidden-test labels is referenced, and the description does not establish that the procedure was fixed and validation-only; this directly weakens the generalizability of the reported ranking and the assertion that the gains reflect the DHE+VGWF+ATED pipeline.

    Authors: We agree that the current abstract wording emphasizes the post-competition metrics (0.3726 mAP@0.5 and 0.3431 mAP@0.95), which were obtained by extending local threshold refinement with a global coarse search after competition results were known. The official competition submission, using thresholds fixed from validation data only, achieved 0.3530 mAP@0.5 and 0.3235 mAP@0.95; these are the results that directly reflect the DHE+VGWF+ATED pipeline. We will revise the abstract to lead with the official competition metrics as the primary reported outcomes and move the post-competition refinement to a brief note on potential further gains. This revision will also explicitly state that all pre-submission threshold decisions were validation-only. An ablation that isolates the global search while using hidden-test labels is not feasible, as test annotations remain unavailable to participants; the search relied on validation-based heuristics extended after the fact. We maintain that the competitive ranking and core performance gains originate from the proposed multi-backbone fusion and decoding components rather than threshold tuning alone.

    revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with hidden-test grounding

The paper presents an applied ML framework (DHE + VGWF + ATED) for VCE event detection and reports measured mAP on a hidden test set. No derivation chain, first-principles result, or closed-form prediction is claimed that reduces by construction to its own inputs, fitted parameters, or self-citations. The post-competition threshold search is described as an empirical extension rather than a mathematical step whose output is definitionally identical to its input. External hidden-test evaluation supplies independent grounding, making the central ranking claim self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that pre-trained EndoFM-LV and DINOv3 features transfer usefully to VCE, plus the modeling choice that validation-guided fusion plus threshold search will improve event-level mAP without overfitting.

free parameters (1)
  • detection thresholds
    Local and global thresholds refined post-competition via coarse search to raise hidden-test mAP.
axioms (1)
  • domain assumption: Foundation models trained on general or endoscopic data supply transferable features for rare VCE pathology detection.
    Invoked by the choice to use EndoFM-LV and DINOv3 as backbones without domain-specific retraining from scratch.

pith-pipeline@v0.9.0 · 5729 in / 1475 out tokens · 57786 ms · 2026-05-22T07:27:19.681985+00:00 · methodology


