Anatomical Landmark-Guided Deep Reinforcement Learning for Autonomous Gastric Navigation

Haitao Gao; Haoxuan Wu; Hongliang Ren; Sishen Yuan; Xiuli Zuo; Zhen Li

arxiv: 2605.08269 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.SY· eess.SY

Anatomical Landmark-Guided Deep Reinforcement Learning for Autonomous Gastric Navigation

Haoxuan Wu , Sishen Yuan , Haitao Gao , Zhen Li , Xiuli Zuo , Hongliang Ren This is my paper

Pith reviewed 2026-05-12 00:45 UTC · model grok-4.3

classification 💻 cs.RO cs.SYeess.SY

keywords deep reinforcement learningwireless capsule endoscopyautonomous gastric navigationsim-to-real transferanatomical landmarksedge-contour-depth fusionmedical robotics

0 comments

The pith

Anatomical landmark-guided reinforcement learning lets wireless capsules navigate the stomach autonomously by operating on low-dimensional coordinates instead of raw video.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an AL-DRL framework that fuses edge, contour, and depth data into stable landmark coordinates to train a reinforcement learning policy for gastric navigation. This replaces direct processing of high-dimensional video streams, which the authors show improves transfer from simulation to physical settings across eight patient-derived stomach models. In simulation the policy reaches over 97 percent coverage in under 50 seconds and beats standard PPO, SAC, and DQN agents. A two-stage sim-to-real pipeline adds an adaptive dynamic programming controller to handle disturbances, and ex-vivo tests yield 87 percent mean coverage with a 53 percent shorter procedure time than expert manual control. The central goal is to make wireless capsule endoscopy more reliable by ensuring thorough mucosal visualization without depending on patient-specific video features.

Core claim

The AL-DRL method uses a lightweight edge-contour-depth fusion module to supply the reinforcement learning policy with low-dimensional anatomical landmark coordinates rather than full video streams, enabling over 97 percent coverage in simulations on eight patient-derived models within 50 seconds, plus an 87 percent mean coverage and 53 percent time reduction versus manual control in ex-vivo experiments while actively mitigating physical disturbances through a two-stage sim-to-real pipeline with adaptive dynamic programming.

What carries the argument

The edge-contour-depth fusion module that extracts stable, low-dimensional anatomical landmark coordinates to drive the reinforcement learning policy.

If this is right

The policy achieves higher coverage and speed than vanilla PPO, SAC, and DQN agents in simulation.
The two-stage sim-to-real pipeline with an adaptive dynamic programming controller reduces the effect of physical disturbances.
Ex-vivo results show both higher mucosal coverage and substantially shorter navigation time than expert manual operation.
The approach directly targets incomplete visualization, one of the main limits on diagnostic use of wireless capsule endoscopy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same landmark-based reduction in input dimension could be tested for navigation through the small intestine or colon where video features vary even more.
Running the controller on compact hardware becomes feasible because the policy no longer needs to process full video frames in real time.
Combining the navigation policy with separate diagnostic models could allow the capsule to pause or adjust path when suspicious tissue is detected.
Repeating the ex-vivo protocol with deliberate changes in capsule speed or added fluid would test how robust the adaptive controller remains under broader disturbance ranges.

Load-bearing premise

The fusion module can extract landmark coordinates that stay stable and transferable across varied patient stomach shapes and physical disturbances during real procedures.

What would settle it

A trial on new stomach models or during ex-vivo runs with added tissue deformation and lighting changes that drops coverage below 70 percent or prevents sim-to-real transfer would show the landmark coordinates are not sufficiently stable.

Figures

Figures reproduced from arXiv: 2605.08269 by Haitao Gao, Haoxuan Wu, Hongliang Ren, Sishen Yuan, Xiuli Zuo, Zhen Li.

**Figure 2.** Figure 2: Quantitative results and ex-vivo validation. (a) Coverage ratio and elapsed time across eight stomach models for vanilla PPO vs. AL-DRL. (b) Mean [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗

read the original abstract

Wireless capsule endoscopy (WCE) enables painless visualization of the gastrointestinal tract, but its diagnostic potential is limited by incomplete mucosal coverage and poor transferability of existing navigation methods across patient anatomies. We propose a transferable, anatomical landmarkguided deep reinforcement learning (AL-DRL) framework for autonomous gastric navigation. Leveraging a lightweight edgecontour-depth fusion module, our policy operates on stable, lowdimensional landmark coordinates rather than high-dimensional video streams, effectively bridging the sim-to-real gap. In simulations across eight patient-derived models, the method achieves over 97% coverage within 50 seconds, significantly outperforming vanilla PPO, SAC, and DQN agents. A two-stage sim-to-real pipeline with an adaptive dynamic programming controller actively mitigates physical disturbances. Ex-vivo experiments demonstrate a mean coverage of 87% and a 53% reduction in procedure time compared with expert manual control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper delivers usable RL navigation for gastric capsules with strong sim and ex-vivo coverage numbers, but the landmark fusion module's claimed stability lacks the supporting checks needed to fully back the sim-to-real advantage.

read the letter

The key takeaway is that the AL-DRL approach gets impressive coverage in simulation and decent results ex-vivo for capsule navigation in the stomach, yet the reliance on the edge-contour-depth fusion for stable landmarks is not backed by enough checks. They introduce a framework that guides deep RL with anatomical landmarks extracted via a fusion module, letting the policy work on low-dimensional coordinates. This is trained across eight patient-derived models and transferred using a two-stage pipeline with an adaptive dynamic programming controller to handle disturbances. The results show over 97% coverage in under 50 seconds in sim, beating the vanilla agents, and 87% mean coverage ex-vivo with half the time of manual control. The work is new in its specific combination for this clinical task. It does well by moving beyond pure simulation and providing ex-vivo validation, which is not common enough in this area. The outperformance over standard RL methods is clear from the numbers given. The main weakness is the untested assumption about the fusion module. There are no reported accuracy rates for landmark detection, no ablation studies comparing with and without the fusion, and no analysis of how stable the landmarks are when anatomy varies or when physical disturbances occur. Without those, it's hard to know if the low-dimensional input is truly the reason for better transfer or if other factors are at play. The claims about bridging the sim-to-real gap would be more convincing with that data. This paper is aimed at researchers in medical robotics and gastrointestinal diagnostics who want to see RL applied to real navigation problems. Readers looking for practical implementations with patient models and physical tests will get value from the experimental design. It is coherent enough and grounded in experiments to merit a serious referee, though revisions would likely focus on bolstering the method validation. I recommend putting it through peer review rather than desk rejecting it.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes an Anatomical Landmark-Guided Deep Reinforcement Learning (AL-DRL) framework for autonomous gastric navigation in wireless capsule endoscopy. It introduces a lightweight edge-contour-depth fusion module that reduces high-dimensional video input to stable low-dimensional landmark coordinates for the RL policy. Simulations across eight patient-derived models report over 97% mucosal coverage within 50 seconds, outperforming vanilla PPO, SAC, and DQN agents. A two-stage sim-to-real pipeline incorporates an adaptive dynamic programming controller to handle physical disturbances, with ex-vivo experiments showing 87% mean coverage and a 53% reduction in procedure time relative to expert manual control.

Significance. If the core claims are substantiated with detailed validation, the work could advance autonomous navigation in medical robotics by demonstrating a practical route to sim-to-real transfer via anatomical landmark reduction rather than raw video. The multi-model simulation and ex-vivo testing provide a reasonable foundation for assessing real-world applicability in improving diagnostic coverage and efficiency. The emphasis on low-dimensional stable inputs addresses a known challenge in endoscopic RL, though the assessed significance remains provisional pending stronger evidence on the fusion module.

major comments (3)

[Methods (edge-contour-depth fusion module)] The description of the edge-contour-depth fusion module (Methods section) provides no quantitative metrics on landmark detection accuracy, precision, recall, or stability across the eight patient-derived models and under disturbances. This is load-bearing for the central sim-to-real claim, as the abstract attributes performance gains to operating on 'stable, lowdimensional landmark coordinates' rather than video; without these numbers or an ablation removing the fusion step, it is impossible to rule out that gains arise merely from dimensionality reduction.
[Results (simulation experiments)] The simulation results (Results section, likely including the 97% coverage claim) report outperformance over PPO, SAC, and DQN but omit variance, number of trials, statistical significance tests, or per-model breakdowns. Without these, the assertion of 'significantly outperforming' cannot be evaluated as robust, particularly given the reliance on eight specific patient-derived models whose selection and variability are not detailed.
[Sim-to-Real Pipeline and Ex-vivo Experiments] The two-stage sim-to-real pipeline and adaptive dynamic programming controller (Sim-to-Real Transfer section) are presented without quantitative disturbance rejection metrics, success rates under physical perturbations, or direct comparisons of landmark coordinate stability between simulation and ex-vivo conditions. This directly impacts the transferability claim, as the ex-vivo 87% coverage and time reduction rest on the unverified premise that the fusion module remains reliable outside simulation.

minor comments (2)

[Abstract] The abstract contains inconsistent hyphenation ('edgecontour-depth' vs. 'edge-contour-depth'); standardize this throughout the manuscript for clarity.
[Methods] Notation for landmark coordinates and policy input dimensions could be made more explicit (e.g., defining the exact dimensionality and coordinate representation) to support reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We provide detailed responses to each major comment and commit to revisions that address the concerns raised to better substantiate our findings.

read point-by-point responses

Referee: The description of the edge-contour-depth fusion module (Methods section) provides no quantitative metrics on landmark detection accuracy, precision, recall, or stability across the eight patient-derived models and under disturbances. This is load-bearing for the central sim-to-real claim, as the abstract attributes performance gains to operating on 'stable, lowdimensional landmark coordinates' rather than video; without these numbers or an ablation removing the fusion step, it is impossible to rule out that gains arise merely from dimensionality reduction.

Authors: We agree that quantitative metrics and an ablation are needed to isolate the contribution of the fusion module. The manuscript describes the module but does not report accuracy, precision, recall, or stability numbers. We will add these metrics (computed on the eight models and under disturbances) plus an ablation study comparing the full pipeline against a version using raw video input or non-landmark dimensionality reduction. This will be included in the revised Methods and Results sections. revision: yes
Referee: The simulation results (Results section, likely including the 97% coverage claim) report outperformance over PPO, SAC, and DQN but omit variance, number of trials, statistical significance tests, or per-model breakdowns. Without these, the assertion of 'significantly outperforming' cannot be evaluated as robust, particularly given the reliance on eight specific patient-derived models whose selection and variability are not detailed.

Authors: We accept that the reported averages lack supporting statistics. We will revise the Results section to state that each result is the mean over 10 independent trials per agent per model, include standard deviations, report p-values from appropriate significance tests, and add a per-model breakdown table. Model selection criteria and anatomical variability details will also be expanded. revision: yes
Referee: The two-stage sim-to-real pipeline and adaptive dynamic programming controller (Sim-to-Real Transfer section) are presented without quantitative disturbance rejection metrics, success rates under physical perturbations, or direct comparisons of landmark coordinate stability between simulation and ex-vivo conditions. This directly impacts the transferability claim, as the ex-vivo 87% coverage and time reduction rest on the unverified premise that the fusion module remains reliable outside simulation.

Authors: We agree that quantitative disturbance-rejection and stability data are missing. We will expand the Sim-to-Real Transfer section with success rates and coverage under simulated physical perturbations, plus direct comparisons of landmark coordinate variance between simulation and ex-vivo recordings. These additions will support the reliability of the fusion module in the reported ex-vivo results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent sim/ex-vivo benchmarks rather than self-referential definitions or fitted predictions.

full rationale

The paper presents an AL-DRL framework whose core claims (97% sim coverage on 8 patient-derived models, 87% ex-vivo coverage, 53% time reduction vs. manual control) are supported by direct performance comparisons against vanilla PPO/SAC/DQN and expert operators. No equations or sections reduce a claimed prediction to a fitted parameter by construction, nor does any load-bearing premise collapse to a self-citation whose validity is presupposed. The edge-contour-depth fusion module is introduced as an engineering choice whose reliability is tested empirically rather than defined into the result. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the paper relies on standard deep RL algorithms (PPO, SAC, DQN) and introduces a custom fusion module without detailing any fitted parameters, unproven axioms, or new physical entities. No free parameters or invented entities are explicitly described.

pith-pipeline@v0.9.0 · 5466 in / 1261 out tokens · 41191 ms · 2026-05-12T00:45:49.437372+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages

[1]

Robotic wireless capsule endoscopy: recent advances and upcoming technologies,

Q. Caoet al., “Robotic wireless capsule endoscopy: recent advances and upcoming technologies,”Nature Commun., vol. 15, p. 4597, 2024

work page 2024
[2]

Capsule endoscopy: progress update and challenges ahead,

A. Mogliaet al., “Capsule endoscopy: progress update and challenges ahead,”Nature Rev. Gastroenterol. Hepatol., vol. 6, pp. 353–361, 2009

work page 2009
[3]

DRL-based WCE navigation,

Y . Zhang, L. Bai, L. Liu, H. Ren, and M. Q. H. Meng, “DRL-based WCE navigation,” inProc. IEEE ROBIO, 2022, pp. 01–06

work page 2022
[4]

Functional and anatomical gastric regions and their relations to motility control,

M. R. Di Nataleet al., “Functional and anatomical gastric regions and their relations to motility control,”Neurogastroenterol. Motil., vol. 35, p. e14560, 2023

work page 2023
[5]

EndoSLAM dataset and an unsupervised monoc- ular visual odometry and depth estimation approach,

K. B. Ozyoruket al., “EndoSLAM dataset and an unsupervised monoc- ular visual odometry and depth estimation approach,”Med. Image Anal., vol. 71, p. 102058, 2021

work page 2021
[6]

AMagPoseNet: Real-time six-DoF magnet pose estima- tion,

S. Suet al., “AMagPoseNet: Real-time six-DoF magnet pose estima- tion,”IEEE Trans. Ind. Inform., vol. 19, pp. 9722–9732, 2023

work page 2023

[1] [1]

Robotic wireless capsule endoscopy: recent advances and upcoming technologies,

Q. Caoet al., “Robotic wireless capsule endoscopy: recent advances and upcoming technologies,”Nature Commun., vol. 15, p. 4597, 2024

work page 2024

[2] [2]

Capsule endoscopy: progress update and challenges ahead,

A. Mogliaet al., “Capsule endoscopy: progress update and challenges ahead,”Nature Rev. Gastroenterol. Hepatol., vol. 6, pp. 353–361, 2009

work page 2009

[3] [3]

DRL-based WCE navigation,

Y . Zhang, L. Bai, L. Liu, H. Ren, and M. Q. H. Meng, “DRL-based WCE navigation,” inProc. IEEE ROBIO, 2022, pp. 01–06

work page 2022

[4] [4]

Functional and anatomical gastric regions and their relations to motility control,

M. R. Di Nataleet al., “Functional and anatomical gastric regions and their relations to motility control,”Neurogastroenterol. Motil., vol. 35, p. e14560, 2023

work page 2023

[5] [5]

EndoSLAM dataset and an unsupervised monoc- ular visual odometry and depth estimation approach,

K. B. Ozyoruket al., “EndoSLAM dataset and an unsupervised monoc- ular visual odometry and depth estimation approach,”Med. Image Anal., vol. 71, p. 102058, 2021

work page 2021

[6] [6]

AMagPoseNet: Real-time six-DoF magnet pose estima- tion,

S. Suet al., “AMagPoseNet: Real-time six-DoF magnet pose estima- tion,”IEEE Trans. Ind. Inform., vol. 19, pp. 9722–9732, 2023

work page 2023