pith. sign in

arxiv: 2606.23531 · v2 · pith:PNIOGZJ3new · submitted 2026-06-22 · 💻 cs.RO

BiliVLA: Scene-Aware Vision-Language-Action Model with Reinforcement Learning for Autonomous Biliary Endoscopic Navigation

Pith reviewed 2026-07-02 21:33 UTC · model grok-4.3

classification 💻 cs.RO
keywords ERCPendoscopic navigationvision-language-actionreinforcement learningautonomous roboticsscene-aware supervisionbiliary cannulation
0
0 comments X

The pith

BiliVLA combines scene-aware vision-language-action modeling with reinforcement learning to navigate endoscopes autonomously during ERCP procedures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BiliVLA to address the challenges of precise navigation in ERCP, where narrow views, reflections, and tissue contact often hinder manual or robotic control. It frames the task as predicting actions from endoscopic images paired with language instructions for specific stages. Scene-aware supervision improves consistency in identifying targets while safety supervision encourages retreat when contact occurs. Training proceeds in two stages, starting with supervised fine-tuning that includes grounding and then applying Group Relative Policy Optimization for better decisions. Results from phantom tests across three subtasks show strong action precision and success rates, suggesting the approach can support more reliable closed-loop autonomy where variability and artifacts are present.

Core claim

BiliVLA formulates biliary endoscopic navigation as an instruction-conditioned visuomotor learning problem. Given an endoscopic observation and a stage-specific language instruction, BiliVLA jointly predicts the target category, a grounded bounding box, and a discrete three degrees of freedom (DoF) motor command for a continuum endoscope. The proposed framework incorporates scene-aware supervision to enhance semantic target consistency and safety-aware recovery supervision to induce conservative retreat behaviors under luminal wall contact. A key component of BiliVLA is a two-stage training paradigm that combines grounding-enhanced supervised fine-tuning (SFT) with Group Relative Policy Opti

What carries the argument

The scene-aware Vision-Language-Action (VLA) framework that jointly predicts target category, grounded bounding box, and 3DoF motor commands, trained via grounding-enhanced SFT followed by GRPO.

If this is right

  • Scene-aware supervision produces more consistent semantic targets across varying views.
  • Safety-aware recovery supervision leads to conservative retreat actions on wall contact.
  • The two-stage SFT then GRPO training increases action reliability and decision consistency in closed-loop use.
  • The combined elements improve perception-action alignment under the visual conditions tested.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same instruction-conditioned VLA structure might apply to navigation tasks in other narrow-channel endoscopic procedures.
  • Extending the safety supervision to continuous rather than discrete commands could further reduce contact risks.

Load-bearing premise

Phantom experiments with controlled conditions sufficiently capture the pronounced anatomical variability, specular reflections, partial occlusions, and tissue contact present in live human ERCP procedures.

What would settle it

A substantial drop in success rate when the same model is evaluated in live human ERCP cases with natural anatomical variations and artifacts would indicate the phantom results do not generalize.

Figures

Figures reproduced from arXiv: 2606.23531 by Chi Kit Ng, Eddie Cheung, Hongliang Ren, Huxin Gao, Jiewen Lai, Jinsong Lin, Tabassum Tamima, Yihan Hu, Zhiyong Xiong, Zikang Pan, Ziyi Hao.

Figure 1
Figure 1. Figure 1: Instruction-guided robotic endoscopic navigation for ERCP workflows. Left: stage-specific language instructions with endoscopic observations. Middle: robotic endoscope backend with bending (x/y) and axial (z) DoF. Right: clinical ERCP navigation context with a zoomed papilla view. Bottom￾right: comparison of EndoVLA [14] and BiliVLA in DoF, action space, dataset size, and scene/safety awareness. hood of un… view at source ↗
Figure 2
Figure 2. Figure 2: shows the hardware setup of the robotic endoscopic navigation system for ERCP procedures [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Generation of the BiliVLA-Motion dataset. Each frame is annotated with the navigation stage, target bounding box, and a discrete motion command, forming 10K image–motion pairs across three ERCP tasks. B. BiliVLA-Motion Dataset To address the lack of multimodal data for continuum robots, we construct the BiliVLA-Motion, a vision-language￾kinematic dataset designed to train the BiliVLA model. The dataset con… view at source ↗
Figure 4
Figure 4. Figure 4: Overview of BiliVLA. The framework takes a monocular endoscopic observation and a stage-specific language instruction as input, and outputs [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of BiliVLA in ERCP navigation. (a) Representative predictions across three navigation stages: entry navigation, lumen traversal, and calculus localization. Green boxes denote grounded targets, yellow arrows indicate the predicted bending direction, and the circle marker represents axial motion (insertion/stop/retraction). (b) Safety-aware recovery under luminal wall contact, where the p… view at source ↗
read the original abstract

Endoscopic retrograde cholangiopancreatography (ERCP) demands precise endoscopic navigation and stable biliary cannulation within a narrow monocular field characterized by specular reflections, partial occlusions, and frequent tissue contact. Although recent robotic systems and vision-based assistance techniques improve operator ergonomics and provide perceptual cues, their performance degrades under pronounced anatomical variability and safety-critical visual artifacts, which hinders reliable autonomy in cannulation-grade procedures. Here, we present BiliVLA, a scene-aware Vision-Language-Action (VLA) framework that formulates biliary endoscopic navigation as an instruction-conditioned visuomotor learning problem. Given an endoscopic observation and a stage-specific language instruction, BiliVLA jointly predicts the target category, a grounded bounding box, and a discrete three degrees of freedom (DoF) motor command for a continuum endoscope. The proposed framework incorporates scene-aware supervision to enhance semantic target consistency and safety-aware recovery supervision to induce conservative retreat behaviors under luminal wall contact. A key component of BiliVLA is a two-stage training paradigm that combines grounding-enhanced supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), which significantly improves action reliability and decision consistency during closed-loop navigation. Across three ERCP subtasks, BiliVLA achieves an average action precision of 91.96\% and an overall success rate (SR) of 84.85\% in real-world phantom experiments. These results indicate that integrating semantic grounding, scene-aware learning, and reward-guided optimization improves perception-action alignment and enables robust autonomous endoscopic navigation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces BiliVLA, a scene-aware Vision-Language-Action (VLA) framework for autonomous biliary endoscopic navigation during ERCP. It casts navigation as an instruction-conditioned visuomotor problem that jointly predicts target category, grounded bounding box, and discrete 3-DoF motor commands from monocular endoscopic images. The method adds scene-aware supervision for semantic consistency and safety-aware recovery supervision for conservative retreat under wall contact, trained via a two-stage pipeline of grounding-enhanced supervised fine-tuning followed by Group Relative Policy Optimization (GRPO). In real-world phantom experiments across three ERCP subtasks, the system reports 91.96% average action precision and 84.85% overall success rate.

Significance. If the reported metrics are shown to be statistically reliable and the phantom conditions are demonstrated to capture the cited visual and anatomical challenges, the work would constitute a concrete advance in applying VLA models with RL to safety-critical continuum-robot navigation. The explicit design of safety-aware recovery supervision and the use of GRPO to improve decision consistency are strengths that directly address closed-loop reliability in visually degraded environments.

major comments (2)
  1. [Abstract] Abstract: The headline metrics (91.96% action precision, 84.85% SR) are presented as aggregate values with no trial counts, baseline comparisons, statistical tests, error bars, or exclusion criteria supplied. Without these details it is impossible to judge whether the numbers support the claims of improved action reliability and decision consistency.
  2. [Abstract] Abstract: All quantitative results derive exclusively from real-world phantom experiments. The text itself identifies pronounced anatomical variability, specular reflections, partial occlusions, and tissue contact as the core challenges that degrade prior methods in live human ERCP, yet no evidence is given that the phantom reproduces patient-to-patient duct geometry, dynamic deformation, or lighting variation. This gap is load-bearing for the central claim that BiliVLA enables robust autonomous endoscopic navigation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive assessment of the work's significance. We address each major comment below with clarifications and proposed revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline metrics (91.96% action precision, 84.85% SR) are presented as aggregate values with no trial counts, baseline comparisons, statistical tests, error bars, or exclusion criteria supplied. Without these details it is impossible to judge whether the numbers support the claims of improved action reliability and decision consistency.

    Authors: We agree that the abstract would be strengthened by including supporting experimental details. The full manuscript reports the underlying trial counts (multiple runs across the three subtasks), baseline comparisons, and evaluation protocol in the Experiments section. We will revise the abstract to incorporate the total number of trials, a reference to the statistical reliability of the results, and a brief note on the evaluation criteria. revision: yes

  2. Referee: [Abstract] Abstract: All quantitative results derive exclusively from real-world phantom experiments. The text itself identifies pronounced anatomical variability, specular reflections, partial occlusions, and tissue contact as the core challenges that degrade prior methods in live human ERCP, yet no evidence is given that the phantom reproduces patient-to-patient duct geometry, dynamic deformation, or lighting variation. This gap is load-bearing for the central claim that BiliVLA enables robust autonomous endoscopic navigation.

    Authors: The reported results are from phantom experiments constructed to replicate the visual artifacts and contact conditions highlighted in the introduction. We acknowledge that the current phantom does not fully capture patient-to-patient duct geometry or dynamic deformation. We will revise the manuscript to add an explicit limitations paragraph clarifying the scope of the phantom validation, the specific challenges it does reproduce, and the need for future clinical studies. revision: partial

Circularity Check

0 steps flagged

No significant circularity; performance metrics are direct experimental outcomes.

full rationale

The paper presents BiliVLA as a VLA framework trained with grounding-enhanced SFT followed by GRPO, then reports action precision and success rate as measured results from phantom experiments. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described pipeline. The central claims reduce to empirical measurements rather than any self-referential construction, satisfying the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; evaluation is limited to surface claims.

pith-pipeline@v0.9.1-grok · 5858 in / 1138 out tokens · 27749 ms · 2026-07-02T21:33:40.551061+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

31 extracted references · 11 canonical work pages · 9 internal anchors

  1. [1]

    Prophylaxis of post-ERCP pancreatitis: European society of gastrointestinal endoscopy (ESGE) guideline–updated june 2014,

    J.-M. Dumonceau, A. Andriulli, B. J. Elmunzer, A. Mariani, T. Meis- ter, J. Deviere, T. Marek, T. H. Baron, C. Hassan, P. A. Testoni et al., “Prophylaxis of post-ERCP pancreatitis: European society of gastrointestinal endoscopy (ESGE) guideline–updated june 2014,” Endoscopy, vol. 46, no. 09, pp. 799–815, 2014

  2. [2]

    Papillary cannulation and sphincterotomy techniques at ERCP: European society of gastrointestinal endoscopy (ESGE) clinical guideline,

    P. A. Testoni, A. Mariani, L. Aabakken, M. Arvanitakis, E. Bories, G. Costamagna, J. Devi `ere, M. Dinis-Ribeiro, J.-M. Dumonceau, M. Giovanniniet al., “Papillary cannulation and sphincterotomy techniques at ERCP: European society of gastrointestinal endoscopy (ESGE) clinical guideline,”Endoscopy, vol. 48, no. 07, pp. 657–683, 2016

  3. [3]

    Post-ERCP pan- creatitis: prevention, diagnosis and management,

    O. Cahyadi, N. Tehami, E. de Madaria, and K. Siau, “Post-ERCP pan- creatitis: prevention, diagnosis and management,”Medicina, vol. 58, no. 9, p. 1261, 2022

  4. [4]

    Robot-assisted endoscopic retrograde cholangiopan- creatography: A pilot study,

    X. Chen, Z. Wang, H. Wu, Q. Lou, W. Gu, W. Ma, J.-F. Yang, H.-b. Jin, and X. Zhang, “Robot-assisted endoscopic retrograde cholangiopan- creatography: A pilot study,”Endoscopy, 2026

  5. [5]

    Artificial intelligence- assisted analysis of endoscopic retrograde cholangiopancreatography image for identifying ampulla and difficulty of selective cannulation,

    T. Kim, J. Kim, H. S. Choi, E. S. Kim, B. Keum, Y . T. Jeen, H. S. Lee, H. J. Chun, S. Y . Han, D. U. Kimet al., “Artificial intelligence- assisted analysis of endoscopic retrograde cholangiopancreatography image for identifying ampulla and difficulty of selective cannulation,” Scientific Reports, vol. 11, no. 1, p. 8381, 2021

  6. [6]

    DD-VNB: A depth-based dual-loop framework for real- time visually navigated bronchoscopy,

    Q. Tian, H. Liao, X. Huang, J. Chen, Z. Zhang, B. Yang, S. Ourselin, and H. Liu, “DD-VNB: A depth-based dual-loop framework for real- time visually navigated bronchoscopy,” in2024 IEEE/RSJ Interna- tional Conference on Intelligent Robots and Systems (IROS). IEEE, 2024, pp. 12 979–12 986

  7. [7]

    Robotic endoscope control via autonomous instrument tracking,

    C. Gruijthuijsen, L. C. Garcia-Peraza-Herrera, G. Borghesan, D. Rey- naerts, J. Deprest, S. Ourselin, T. Vercauteren, and E. Vander Poorten, “Robotic endoscope control via autonomous instrument tracking,” Frontiers in Robotics and AI, vol. 9, p. 832208, 2022

  8. [8]

    Enabling autonomous colonoscopy intervention using a robotic endoscope platform,

    Q. Zhang, J. M. Prendergast, G. A. Formosa, M. J. Fulton, and M. E. Rentschler, “Enabling autonomous colonoscopy intervention using a robotic endoscope platform,”IEEE Transactions on Biomedical Engineering, vol. 68, no. 6, pp. 1957–1968, 2020

  9. [9]

    Navigation of tendon- driven flexible robotic endoscope through deep reinforcement learn- ing,

    C. Ng, H. Gao, T.-A. Ren, J. Lai, and H. Ren, “Navigation of tendon- driven flexible robotic endoscope through deep reinforcement learn- ing,” in2024 IEEE International Conference on Advanced Robotics and Its Social Impacts (ARSO). IEEE, 2024, pp. 134–139

  10. [10]

    Contact-aided navigation of flexible robotic endoscope using deep reinforcement learning in dynamic stomach,

    C. K. Ng, H. Gao, T.-A. Ren, J. Lai, and H. Ren, “Contact-aided navigation of flexible robotic endoscope using deep reinforcement learning in dynamic stomach,”arXiv preprint arXiv:2509.00319, 2025

  11. [11]

    Jacobian exploratory dual-phase rein- forcement learning for dynamic endoluminal navigation of deformable continuum robots,

    Y . Tian, C. K. Ng, and H. Ren, “Jacobian exploratory dual-phase rein- forcement learning for dynamic endoluminal navigation of deformable continuum robots,” in2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 21 559– 21 565

  12. [12]

    A Survey on Vision-Language-Action Models: An Action Tokenization Perspective

    Y . Zhong, F. Bai, S. Cai, X. Huang, Z. Chen, X. Zhang, Y . Wang, S. Guo, T. Guan, K. N. Luiet al., “A survey on Vision-Language- Action models: An action tokenization perspective,”arXiv preprint arXiv:2507.01925, 2025

  13. [13]

    Sapkota, Y

    R. Sapkota, Y . Cao, K. I. Roumeliotis, and M. Karkee, “Vision- Language-Action (VLA) models: Concepts, progress, applications and challenges,”arXiv preprint arXiv:2505.04769, 2025

  14. [14]

    EndoVLA: Dual-phase Vision-Language-Action for precise autonomous tracking in endoscopy,

    C. K. Ng, L. Bai, G. Wang, Y . Wang, H. Gao, C. Jin, T. Zeng, H. Renet al., “EndoVLA: Dual-phase Vision-Language-Action for precise autonomous tracking in endoscopy,” in9th Annual Conference on Robot Learning, 2025

  15. [15]

    SRT-H: A hierarchical framework for autonomous surgery via language-conditioned imitation learning,

    J. W. Kim, J.-T. Chen, P. Hansen, L. X. Shi, A. Goldenberg, S. Schmidgall, P. M. Scheikl, A. Deguet, B. M. White, D. R. Tsai et al., “SRT-H: A hierarchical framework for autonomous surgery via language-conditioned imitation learning,”Science Robotics, vol. 10, no. 104, p. eadt5254, 2025

  16. [16]

    The gastrointestinal tract,

    C. Van Ginneken, “The gastrointestinal tract,”The Minipig in Biomed- ical Research, pp. 218–237, 2025

  17. [17]

    Specular reflections removal of gastrointestinal polyps based on endoscopic image,

    K. Qian, C. Xu, B. Feng, and Z. An, “Specular reflections removal of gastrointestinal polyps based on endoscopic image,” in2022 7th International Conference on Signal and Image Processing (ICSIP). IEEE, 2022, pp. 627–631

  18. [18]

    American society for gastrointestinal endoscopy guideline on post-ERCP pancreatitis prevention strategies: summary and recommendations,

    J. L. Buxbaum, M. Freeman, S. K. Amateau, J. M. Chalhoub, N. Coelho-Prabhu, M. Desai, S. E. Elhanafi, N. Forbes, L. L. Fujii-Lau, D. R. Kohliet al., “American society for gastrointestinal endoscopy guideline on post-ERCP pancreatitis prevention strategies: summary and recommendations,”Gastrointestinal Endoscopy, vol. 97, no. 2, pp. 153–162, 2023

  19. [19]

    Difficult biliary cannulation during ERCP: how to facilitate biliary access and minimize the risk of post-ERCP pancreatitis,

    P. A. Testoni, S. Testoni, and A. Giussani, “Difficult biliary cannulation during ERCP: how to facilitate biliary access and minimize the risk of post-ERCP pancreatitis,”Digestive and Liver Disease, vol. 43, no. 8, pp. 596–603, 2011

  20. [20]

    RT-1: Robotics Transformer for Real-World Control at Scale

    A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsuet al., “RT-1: Robotics transformer for real-world control at scale,”arXiv preprint arXiv:2212.06817, 2022

  21. [21]

    RT-2: Vision-Language-Action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, J. Wu, P. Wohlhart, S. Welker, A. Wahidet al., “RT-2: Vision-Language-Action models transfer web knowledge to robotic control,” inConference on Robot Learning. PMLR, 2023, pp. 2165–2183

  22. [22]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichteret al., “π 0: A Vision- Language-Action flow model for general robot control,”arXiv preprint arXiv:2410.24164, 2024

  23. [23]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    P. Intelligence, K. Black, N. Brown, J. Darpinian, K. Dhabalia, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusaiet al., “π 0.5: a Vision-Language-Action model with open-world generalization,” arXiv preprint arXiv:2504.16054, 2025

  24. [24]

    RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

    S. Liu, L. Wu, B. Li, H. Tan, H. Chen, Z. Wang, K. Xu, H. Su, and J. Zhu, “RDT-1B: a diffusion foundation model for bimanual manipulation,”arXiv preprint arXiv:2410.07864, 2024

  25. [25]

    Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    N. Nelson, J.-T. Chen, J. Haworth, X. Chen, L. Zbinden, D. Huang, A. E. Abdelaal, A. Arezzo, A. Acar, F. Alambeigiet al., “Open-h- embodiment: A large-scale dataset for enabling foundation models in medical robotics,”arXiv preprint arXiv:2604.21017, 2026

  26. [26]

    YOLOv11: An Overview of the Key Architectural Enhancements

    R. Khanam and M. Hussain, “YOLOv11: An overview of the key architectural enhancements,”arXiv preprint arXiv:2410.17725, 2024

  27. [27]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “LoRA: Low-rank adaptation of large language models,”ICLR, vol. 1, no. 2, p. 3, 2022

  28. [28]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Biet al., “DeepSeek-R1: Incentivizing reason- ing capability in LLMs via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  29. [29]

    Unsloth: Fast language model fine-tuning,

    U. Team, “Unsloth: Fast language model fine-tuning,” GitHub repos- itory, 2024, https://github.com/unslothai/unsloth

  30. [30]

    Qwen3-VL Technical Report

    S. Bai, Y . Cai, R. Chen, K. Chen, X. Chen, Z. Cheng, L. Deng, W. Ding, C. Gao, C. Geet al., “Qwen3-VL technical report,”arXiv preprint arXiv:2511.21631, 2025

  31. [31]

    TRL: Transformer reinforcement learning,

    L. von Werra, J. Beeching, Y . Belkada, N. Carper, M. Cointepas, L. Gatti, C. Harris, Q. Lhoest, T. Thrush, L. Tunstallet al., “TRL: Transformer reinforcement learning,” GitHub repository, 2024, https: //github.com/huggingface/trl