Ophiuchus: Incentivizing Tool-augmented "Think with Images" for Joint Medical Segmentation, Understanding and Reasoning

Jintai Chen; Peng Zhang; Shihui Zhen; Wenjie Li; Xiaoming Shi; Yankai Jiang; Yichen Li; Yujie Zhang

arxiv: 2512.14157 · v2 · pith:4DLLRZEVnew · submitted 2025-12-16 · 💻 cs.AI · cs.CV

Yankai Jiang, Yujie Zhang, Peng Zhang, Wenjie Li, Yichen Li, Jintai Chen, Xiaoming Shi, Shihui Zhen

keywords: ophiuchus, medical, reasoning, segmentation, fine-grained, mllm, tool, tool-augmented

Recent medical MLLMs have made significant progress in generating step-by-step textual reasoning chains. However, they still struggle with complex clinical tasks that necessitate dynamic and iterative focusing on fine-grained visual regions. To close this gap, we introduce Ophiuchus, a versatile, tool-augmented framework that equips an MLLM to (i) decide when fine-grained visual evidence is needed, (ii) determine where to probe and ground within the medical image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved, multimodal chain of thought for precise segmentation and diagnosis. Ophiuchus moves beyond mere tool-calling by tightly fusing the MLLM's inherent grounding and reasoning capabilities with external tools, enabling more accurate and trustworthy decisions. The core of our method is a three-stage training strategy: cold-start SFT for basic tool selection; self-reflection fine-tuning to strengthen decision revision; and agentic tool reinforcement learning to elicit sophisticated, expert-like diagnostic behaviors. Extensive experiments show that Ophiuchus consistently outperforms both closed-source and open-source SOTA methods across diverse medical benchmarks, including VQA, detection, and reasoning-based segmentation. Our project code is available at https://github.com/SII-zyj/Ophiuchus.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents
cs.CV 2026-05 accept novelty 8.0

DeepTumorVQA is a new stage-wise 3D CT VQA benchmark showing that quantitative measurement is the main failure point for current medical VLMs and that tool augmentation substantially improves later reasoning stages.

MedScribe: Clinically Grounded CT Reporting through Agentic Workflows
cs.CV 2026-05 unverdicted novelty 6.0

MedScribe reformulates CT radiology reporting as an agentic evidence-acquisition workflow using LLM-invoked diagnostic tools and pathology-aligned retrieval, yielding higher clinical accuracy and consistency than stan...

UniReason-Med: A Shared Grounded Reasoning Interface for 2D-to-3D Transfer in Medical VQA
cs.CV 2026-06 unverdicted novelty 4.0

UniReason-Med introduces a unified framework for 2D and 3D medical VQA with shared grounded reasoning, trained on a 220K dataset, claiming that joint 2D+3D supervision improves 3D performance over 3D-only training.