IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
Pith reviewed 2026-05-16 17:26 UTC · model grok-4.3
The pith
IBISAgent lets MLLMs perform medical segmentation through iterative reasoning and text-based tool calls without model changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IBISAgent reformulates segmentation as a vision-centric, multi-step decision-making process. The model generates interleaved reasoning and text-based click actions, invokes segmentation tools, and produces high-quality masks without architectural modifications. Iterative visual reasoning on masked image features supports mask refinement and builds pixel-level capabilities. A two-stage training framework of cold-start supervised fine-tuning followed by agentic reinforcement learning with tailored fine-grained rewards strengthens robustness on complex medical referring and segmentation tasks, yielding consistent gains over both closed-source and open-source baselines.
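Read as a mechanism, the claim describes a control loop rather than a single forward pass: the MLLM only ever emits text, and all pixel-level work is delegated to a promptable tool. A minimal sketch of how such a loop could be wired is below; the `<click ...>`/`<done>` action syntax, the function signatures, and the step budget are assumptions made for illustration, not details taken from the paper.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical action syntax; the paper does not publish its exact format.
CLICK = re.compile(r"<click x=(\d+) y=(\d+) label=(positive|negative)>")

@dataclass
class AgentState:
    clicks: List[Tuple[int, int, int]] = field(default_factory=list)  # (x, y, 1=positive / 0=negative)
    mask: Optional[object] = None
    history: str = ""

def run_agentic_segmentation(
    image,
    instruction: str,
    mllm_generate: Callable,        # (image, mask, history) -> interleaved reasoning + one text action
    segment_from_clicks: Callable,  # (image, clicks) -> binary mask; a SAM-style promptable tool
    max_steps: int = 6,
):
    """Multi-step loop in the spirit of the agentic formulation: the MLLM reasons
    over the image plus the current mask, emits a text click, an external tool
    turns the accumulated clicks into a mask, and the loop repeats until the
    model signals completion or the step budget runs out."""
    state = AgentState(history=instruction)
    for _ in range(max_steps):
        reply = mllm_generate(image, state.mask, state.history)
        state.history += "\n" + reply
        if "<done>" in reply:          # model judges the current mask good enough
            break
        m = CLICK.search(reply)
        if m is None:                  # unparsable action: skip the tool call this step
            continue
        x, y, label = int(m.group(1)), int(m.group(2)), m.group(3)
        state.clicks.append((x, y, 1 if label == "positive" else 0))
        state.mask = segment_from_clicks(image, state.clicks)  # iterative mask refinement
    return state.mask
```

The structural point the sketch makes explicit is that no segmentation token or pixel decoder is trained into the MLLM; the text interface and the external tool carry all of the pixel-level work.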
What carries the argument
The agentic loop of interleaved reasoning and text-based click actions that invoke external segmentation tools, enabling iterative refinement on masked features.
If this is right
- Segmentation becomes possible without joint fine-tuning of the MLLM and any pixel decoder, lowering the chance of catastrophic forgetting.
- The model can iteratively refine masks by reasoning over previously masked features in successive steps.
- Performance on out-of-domain medical images improves relative to single-pass methods that lack refinement.
- Pixel-level visual reasoning emerges directly inside the MLLM rather than being off-loaded to a separate decoder.
- The two-stage training with fine-grained rewards produces stronger results on referring and reasoning segmentation benchmarks.
Where Pith is reading between the lines
- The same action-plus-reasoning pattern could be tested on non-medical fine-grained vision tasks such as instance segmentation in natural scenes.
- If external tools prove stable, future MLLMs might avoid dedicated pixel decoders altogether.
- Clinical users could inspect and correct individual reasoning steps rather than receiving a final mask only.
- Extension to 3D volumes would require checking whether the click-action interface scales without additional architectural changes.
Load-bearing premise
That text-based click actions can reliably drive external segmentation tools to accurate results across varied medical images without new failure modes.
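This premise can be probed before any accuracy comparison by checking whether emitted actions are even usable by the tool. A minimal sketch, reusing the hypothetical `<click ...>` syntax from the loop above; the failure categories and the validity-rate metric are illustrative, not taken from the paper.

```python
import re

CLICK = re.compile(r"<click x=(-?\d+) y=(-?\d+) label=(positive|negative)>")

def validate_click_action(text: str, width: int, height: int):
    """Classify a raw model reply before handing it to the segmentation tool.
    Returns (ok, reason, parsed) where `parsed` is (x, y, label) when ok."""
    m = CLICK.search(text)
    if m is None:
        return False, "malformed_action", None            # no parsable click at all
    x, y, label = int(m.group(1)), int(m.group(2)), m.group(3)
    if not (0 <= x < width and 0 <= y < height):
        return False, "out_of_bounds", None                # click lands outside the image
    return True, "ok", (x, y, label)

def action_validity_rate(replies, width: int, height: int) -> float:
    """Fraction of model replies that produce a usable tool call."""
    if not replies:
        return 0.0
    ok = sum(validate_click_action(r, width, height)[0] for r in replies)
    return ok / len(replies)

# Toy replies in the assumed syntax:
replies = [
    "<think>lesion center</think> <click x=120 y=84 label=positive>",
    "I would click somewhere near the liver.",             # malformed
    "<click x=999 y=20 label=negative>",                    # out of bounds for 512x512
]
print(action_validity_rate(replies, width=512, height=512))  # -> 0.333...
```

A systematically low validity rate, or one that drops on unfamiliar modalities, would be exactly the kind of new failure mode the premise rules out.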
What would settle it
Run IBISAgent on a held-out set of medical images containing unseen pathologies or imaging modalities and check whether its mask accuracy falls below that of jointly fine-tuned baselines.
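That test reduces to a per-image comparison of mask quality on a held-out, out-of-domain split. A minimal harness using the standard Dice coefficient is sketched below; the dataset iterator and the two model callables are placeholders, since neither the review nor the abstract specifies them.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient between two binary masks of identical shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def compare_on_ood(samples, agentic_model, joint_baseline):
    """`samples` yields (image, instruction, gt_mask) drawn from unseen pathologies
    or modalities. Both model arguments are callables returning a binary mask and
    stand in for IBISAgent-style inference and a jointly fine-tuned baseline."""
    agent_scores, baseline_scores = [], []
    for image, instruction, gt in samples:
        agent_scores.append(dice(agentic_model(image, instruction), gt))
        baseline_scores.append(dice(joint_baseline(image, instruction), gt))
    return {
        "agentic_mean_dice": float(np.mean(agent_scores)),
        "baseline_mean_dice": float(np.mean(baseline_scores)),
        "delta": float(np.mean(agent_scores) - np.mean(baseline_scores)),
    }
```

A negative `delta` on the out-of-domain split, with the agentic model behind the jointly fine-tuned baseline, is the outcome that would count against the core claim.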
read the original abstract
Recent research on medical MLLMs has gradually shifted its focus from image-level understanding to fine-grained, pixel-level comprehension. Although segmentation serves as the foundation for pixel-level understanding, existing approaches face two major challenges. First, they introduce implicit segmentation tokens and require simultaneous fine-tuning of both the MLLM and external pixel decoders, which increases the risk of catastrophic forgetting and limits generalization to out-of-domain scenarios. Second, most methods rely on single-pass reasoning and lack the capability to iteratively refine segmentation results, leading to suboptimal performance. To overcome these limitations, we propose a novel agentic MLLM, named IBISAgent, that reformulates segmentation as a vision-centric, multi-step decision-making process. IBISAgent enables MLLMs to generate interleaved reasoning and text-based click actions, invoke segmentation tools, and produce high-quality masks without architectural modifications. By iteratively performing multi-step visual reasoning on masked image features, IBISAgent naturally supports mask refinement and promotes the development of pixel-level visual reasoning capabilities. We further design a two-stage training framework consisting of cold-start supervised fine-tuning and agentic reinforcement learning with tailored, fine-grained rewards, enhancing the model's robustness in complex medical referring and reasoning segmentation tasks. Extensive experiments demonstrate that IBISAgent consistently outperforms both closed-source and open-source SOTA methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents IBISAgent, a novel agentic multimodal large language model for universal biomedical object referring and segmentation. It reformulates segmentation as a vision-centric multi-step decision-making process in which the MLLM generates interleaved reasoning and text-based click actions to invoke external segmentation tools, enabling iterative mask refinement without architectural modifications to the MLLM or pixel decoders. A two-stage training framework is proposed consisting of cold-start supervised fine-tuning followed by agentic reinforcement learning with tailored fine-grained rewards. The paper claims that extensive experiments demonstrate consistent outperformance over closed-source and open-source SOTA methods on complex medical referring and reasoning segmentation tasks.
Significance. If the empirical results hold, this work would be significant for advancing pixel-level visual reasoning in medical MLLMs. By avoiding simultaneous fine-tuning of the MLLM and external decoders, the approach reduces the risk of catastrophic forgetting and improves generalization to out-of-domain scenarios. The agentic formulation with text-based tool invocation and RL-based iterative refinement offers a new paradigm for segmentation that could be adapted to other vision-language tasks requiring precise localization.
major comments (2)
- §4 Experiments: the central claim of consistent outperformance over SOTA methods is presented without any specific metrics, baselines, ablation studies, error bars, or dataset splits, making it impossible to verify the empirical support for the agentic framework.
- §3.2 Agentic Reinforcement Learning: the load-bearing assumption that text-based click actions reliably control external segmentation tools lacks quantitative validation, such as action validity rates, refinement success rates, or performance drops on out-of-domain scans, so the reported gains cannot be attributed to the proposed reasoning process rather than to the base segmenter.
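One concrete way to report the refinement evidence this comment asks for is a refinement success rate: the fraction of multi-step episodes whose final mask beats the first-step mask. The definition and threshold below are an illustrative sketch, not a metric defined by the paper.

```python
def refinement_success_rate(per_episode_dice, min_gain: float = 0.0) -> float:
    """`per_episode_dice` is a list of episodes, each a list of Dice scores recorded
    after every tool call in order. An episode counts as a successful refinement
    when its final Dice exceeds its first-step Dice by more than `min_gain`."""
    episodes = [d for d in per_episode_dice if len(d) >= 2]
    if not episodes:
        return 0.0
    successes = sum(d[-1] - d[0] > min_gain for d in episodes)
    return successes / len(episodes)

# Toy example: Dice tracked after each click in three episodes.
print(refinement_success_rate([
    [0.62, 0.71, 0.78],   # improved across steps
    [0.80, 0.79],         # refinement made the mask slightly worse
    [0.55, 0.55, 0.68],   # improved after a stalled step
]))  # -> 0.666...
```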
minor comments (1)
- The abstract would be strengthened by including at least one key quantitative result (e.g., Dice score improvement) to support the outperformance claim.
Simulated Author's Rebuttal
We are grateful to the referee for their thorough review and constructive suggestions. These comments have prompted us to enhance the clarity and completeness of our experimental reporting and methodological validations. We provide point-by-point responses below and have updated the manuscript accordingly.
read point-by-point responses
- Referee: §4 Experiments: the central claim of consistent outperformance over SOTA methods is presented without any specific metrics, baselines, ablation studies, error bars, or dataset splits, making it impossible to verify the empirical support for the agentic framework.
  Authors: We agree that the experimental section would benefit from a more explicit and upfront presentation of these elements to facilitate verification. In the revised manuscript, we have restructured §4 to begin with a summary of key metrics, included comprehensive tables listing all baselines and results with error bars, detailed the dataset splits and ablation studies, and added cross-references to ensure the support for the agentic framework is clearly verifiable. revision: yes
- Referee: §3.2 Agentic Reinforcement Learning: the load-bearing assumption that text-based click actions reliably control external segmentation tools lacks quantitative validation, such as action validity rates, refinement success rates, or performance drops on out-of-domain scans, so the reported gains cannot be attributed to the proposed reasoning process rather than to the base segmenter.
  Authors: This observation is well-taken, as explicit validation of tool-invocation reliability is crucial for the claims. We have augmented the revised §3.2 and §4 with quantitative results, including action validity rates, refinement success rates across iterations, and comparative performance on out-of-domain data. These additions confirm that the iterative reasoning contributes meaningfully beyond the base segmentation tool. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper presents IBISAgent as a new agentic MLLM framework that reformulates segmentation as multi-step visual reasoning with interleaved text-based click actions and external tool invocation, trained via a two-stage process of cold-start SFT followed by RL with custom fine-grained rewards. These components are introduced as architectural and training innovations without any equations, parameters, or results that reduce by construction to fitted inputs from the same data or to self-citations. The central performance claims are framed as empirical outcomes from experiments on medical referring and segmentation tasks rather than tautological restatements. No load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the derivation; the method is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-grained RL rewards (see the sketch after this ledger)
axioms (1)
- domain assumption: External segmentation tools can be invoked reliably through text-based click actions without any architectural modification to the underlying MLLM.
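The single free parameter above, the shape of the fine-grained RL rewards, is not spelled out in the abstract. The sketch below is therefore only a plausible composition, assuming a format term for parsable actions, a per-step Dice-gain term, and a terminal mask-quality term; the weights are invented for illustration.

```python
def step_reward(action_valid: bool, dice_before: float, dice_after: float,
                w_format: float = 0.1, w_gain: float = 1.0) -> float:
    """Reward for a single tool call: a small bonus (or penalty) for emitting a
    parsable action, plus the possibly negative Dice gain the call produced."""
    r = w_format if action_valid else -w_format
    return r + w_gain * (dice_after - dice_before)

def episode_reward(step_records, final_dice: float, w_final: float = 2.0) -> float:
    """`step_records` is a list of (action_valid, dice_before, dice_after) tuples;
    the quality of the final mask dominates through `w_final`."""
    return sum(step_reward(*rec) for rec in step_records) + w_final * final_dice

# Toy episode: two valid clicks lift Dice from 0.60 to 0.75; final Dice is 0.75.
print(episode_reward([(True, 0.60, 0.70), (True, 0.70, 0.75)], final_dice=0.75))  # -> ~1.85
```

Whatever the actual reward shape, the agentic RL stage is where this choice enters; it determines whether the model is rewarded for the final mask, for each intermediate improvement, or for both.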
Reference graph
Works this paper leans on
- [1] Fan Bai, Yuxin Du, Tiejun Huang, Max Q-H Meng, and Bo Zhao. M3d: Advancing 3d medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578, 2024.
- [2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
- [3] Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. Advances in Neural Information Processing Systems, 37:6833–6859, 2024.
- [4] Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Ruifei Zhang, Zhenyang Cai, Ke Ji, et al. Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale. arXiv preprint arXiv:2406.19280, 2024.
- [5] Yi-Chia Chen, Wei-Hua Li, Cheng Sun, Yu-Chiang Frank Wang, and Chu-Song Chen. Sam4mllm: Enhance multimodal large language model for referring expression segmentation. In European Conference on Computer Vision, pages 323–340. Springer, 2024.
- [6] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
- [7] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [8] Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
- [9] Jingcheng Hu, Yinmin Zhang, Qi Han, Daxin Jiang, Xiangyu Zhang, and Heung-Yeung Shum. Open-reasoner-zero: An open source approach to scaling up reinforcement learning on the base model. arXiv preprint arXiv:2503.24290, 2025.
- [10] Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024.
- [11] Jiaqi Huang, Zunnan Xu, Jun Zhou, Ting Liu, Yicheng Xiao, Mingwen Ou, Bowen Ji, Xiu Li, and Kehong Yuan. Sam-r1: Leveraging sam for reward feedback in multimodal segmentation via reinforcement learning. arXiv preprint arXiv:2505.22596, 2025.
- [12] Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, and Yehui Yang. Towards a multimodal large language model with pixel-level insight for biomedicine, 2025.
- [13] Xiaoshuang Huang, Lingdong Shen, Jia Liu, Fangxin Shang, Hongxiang Li, Haifeng Huang, and Yehui Yang. Towards a multimodal large language model with pixel-level insight for biomedicine. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3779–3787, 2025.
- [14] Yu Huang, Zelin Peng, Yichen Zhao, Piao Yang, Xiaokang Yang, and Wei Shen. Medseg-r: Reasoning segmentation in medical images with multimodal large language models.
- [15] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
- [16] Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146, 2019.
- [17] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
- [18] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9579–9589, 2024.
- [19] Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10, 2018.
- [20] Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent. arXiv preprint arXiv:2407.02483, 2024.
- [21] Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36:28541–28564.
- [22] Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In 2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021.
- [23] Ye Liu, Zongyang Ma, Junfu Pu, Zhongang Qi, Yang Wu, Ying Shan, and Chang Wen Chen. Unipixel: Unified object referring and segmentation for pixel-level visual reasoning. arXiv preprint arXiv:2509.18094, 2025.
- [24] Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520, 2025.
- [25] Yuqi Liu, Tianyuan Qu, Zhisheng Zhong, Bohao Peng, Shu Liu, Bei Yu, and Jiaya Jia. Visionreasoner: Unified visual perception and reasoning via reinforcement learning. arXiv preprint arXiv:2505.12081, 2025.
- [26] Jun Ma, Yuting He, Feifei Li, Lin Han, Chenyu You, and Bo Wang. Segment anything in medical images. Nature Communications, 15(1), 2024.
- [27] Jun Ma, Zongxin Yang, Sumin Kim, Bihui Chen, Mohammed Baharoon, Adibvafa Fallahpour, Reza Asakereh, Hongwei Lyu, and Bo Wang. Medsam2: Segment anything in 3d medical images and videos. arXiv preprint arXiv:2504.03600, 2025.
- [28]
- [29] Accessed: 2025-09-17.
- [30] Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248–.
- [31] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [32] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos.
- [33] Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. Pixellm: Pixel reasoning with large multimodal model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26374–26383, 2024.
- [34] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025.
- [35] Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025.
- [36] Mennatullah Siam. Pixfoundation: Are we heading in the right direction with pixel-level vision foundation models? arXiv preprint arXiv:2502.04192, 2025.
- [37] Amber L. Simpson, Michela Antonelli, Spyridon Bakas, Michel Bilello, Keyvan Farahani, Bram van Ginneken, Annette Kopp-Schneider, Bennett A. Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M. Summers, Patrick Bilic, Patrick F. Christ, Richard K. G. Do, Marc Gollub, Jennifer Golia-Pernicka, Stephan H. Heckers, William R. Jarnagin, et al., 2019.
- [38] Haoran Sun, Yankai Jiang, Wenjie Lou, Yujie Zhang, Wenjie Li, Lilong Wang, Mianxin Liu, Lei Liu, and Xiaosong Wang. Enhancing step-by-step and verifiable medical reasoning in mllms. arXiv preprint arXiv:2506.16962, 2025.
- [39] Qinyue Tong, Ziqian Lu, Jun Liu, Yangming Zheng, and Zheming Lu. Medisee: Reasoning-based pixel-level perception in medical images. arXiv preprint arXiv:2504.11008.
- [40] Quoc-Huy Trinh, Minh-Van Nguyen, Jung Zeng, Ulas Bagci, and Debesh Jha. Prs-med: Position reasoning segmentation with vision-language model in medical imaging, 2025.
- [41] Guoxin Wang, Jun Zhao, Xinyi Liu, Yanbo Liu, Xuyang Cao, Chao Li, Zhuoyun Liu, Qintian Sun, Fangru Zhou, Haoqiang Xing, and Zhenhong Yang. Citrus-v: Advancing medical foundation models with unified medical image grounding for clinical reasoning, 2025.
- [42] Linshan Wu, Yuxiang Nie, Sunan He, Jiaxin Zhuang, Luyang Luo, Neeraj Mahboobani, Varut Vardhanabhuti, Ronald Cheong Kin Chan, Yifan Peng, Pranav Rajpurkar, et al. Unibiomed: A universal foundation model for grounded biomedical image interpretation. arXiv preprint arXiv:2504.21336, 2025.
- [43] Ning Xu, Brian Price, Scott Cohen, Jimei Yang, and Thomas S Huang. Deep interactive object selection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 373–381, 2016.
- [44] Weiwen Xu, Hou Pong Chan, Long Li, Mahani Aljunied, Ruifeng Yuan, Jianyu Wang, Chenghao Xiao, Guizhen Chen, Chaoqun Liu, Zhaodonghui Li, et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv preprint arXiv:2506.07044.
- [45] Zhonghao Yan, Muxi Diao, Yuxuan Yang, Jiayuan Xu, Kaizhou Zhang, Ruoyan Jing, Lele Yang, Yanxi Liu, Kongming Liang, and Zhanyu Ma. Medreasoner: Reinforcement learning drives reasoning grounding from clinical thought to pixel-level precision, 2025.
- [46] Senqiao Yang, Tianyuan Qu, Xin Lai, Zhuotao Tian, Bohao Peng, Shu Liu, and Jiaya Jia. Lisa++: An improved baseline for reasoning segmentation with large language model. arXiv preprint arXiv:2312.17240, 2023.
- [47] Theodore Zhao, Yu Gu, Jianwei Yang, Naoto Usuyama, Ho Hin Lee, Tristan Naumann, Jianfeng Gao, Angela Crabtree, Jacob Abel, Christine Moung-Wen, et al. Biomedparse: a biomedical foundation model for image parsing of everything everywhere all at once. arXiv preprint arXiv:2405.12971, 2024.
- [48–80] Prompt-template excerpts from the paper's appendix rather than bibliographic entries: Group A, initialization prompts in 7 categories (Direct Commands, Goal-Oriented, Initialization Intent, Mixed & Conversational, Polite Requests, Question-Based, Concise & Technical), and Group B, refinement prompts in 6 categories (Next Step, Error Hint, Concise Commands, Improve & Correct, Check & Verify, Polite/Conversational), instantiated with examples such as "The object of interest is the Liver. Please start." and "Let's make the outline for the Liver more precise."