pith. sign in

arxiv: 2512.14157 · v2 · pith:4DLLRZEVnew · submitted 2025-12-16 · 💻 cs.AI · cs.CV

Ophiuchus: Incentivizing Tool-augmented "Think with Images" for Joint Medical Segmentation, Understanding and Reasoning

classification 💻 cs.AI cs.CV
keywords ophiuchusmedicalreasoningsegmentationfine-grainedmllmtooltool-augmented
0
0 comments X
read the original abstract

Recent medical MLLMs have made significant progress in generating step-by-step textual reasoning chains. However, they still struggle with complex clinical tasks that necessitate dynamic and iterative focusing on fine-grained visual regions. To close this gap, we introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when fine-grained visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought for precise segmentation and diagnosis. Ophiuchus moves beyond mere tool-calling by tightly fusing the MLLM's inherent grounding and reasoning capabilities with external tools, enabling more accurate and trustworthy decisions. The core of our method is a three-stage training strategy: cold-start SFT for basic tool selection; self-reflection fine-tuning to strengthen decision revision; and agentic tool reinforcement learning to elicit sophisticated, expert-like diagnostic behaviors. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our project code is available at https://github.com/SII-zyj/Ophiuchus.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

    cs.CV 2026-05 accept novelty 8.0

    DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

  2. MedScribe: Clinically Grounded CT Reporting through Agentic Workflows

    cs.CV 2026-05 unverdicted novelty 6.0

    MedScribe reformulates CT radiology reporting as an agentic evidence-acquisition workflow using LLM-invoked diagnostic tools and pathology-aligned retrieval, yielding higher clinical accuracy and consistency than stan...

  3. UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA

    cs.CV 2026-06 unverdicted novelty 4.0

    UniReason-Med introduces a unified framework for 2D and 3D medical VQA with shared grounded reasoning, trained on a 220K dataset, claiming that joint 2D+3D supervision improves 3D performance over 3D-only training.