Learning Where to Look While Tracking Instruments in Robot-assisted Surgery
Pith reviewed 2026-05-25 12:57 UTC · model grok-4.3
The pith
An end-to-end multitask model with a shared encoder, two task-specific decoders, a batch-Wasserstein loss, and a soft attention module is reported to outperform prior segmentation and saliency methods on most evaluation metrics on the MICCAI robotic instrument segmentation dataset.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Compared with state-of-the-art segmentation and saliency models, our model performs better on most of the evaluation metrics on the MICCAI robotic instrument segmentation dataset.
Load-bearing premise
The two-phase training schedule with 'poly' loss weighting is sufficient for the segmentation and attention-prediction tasks to converge stably at the same time. The abstract itself notes that obtaining convergence of both tasks in the same epoch is generally difficult in multitask optimization.
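To make the premise concrete, here is a minimal sketch of what a 'poly' loss weight could look like. It assumes the weight follows the familiar polynomial decay `(1 - epoch / max_epochs) ** power` used for 'poly' learning-rate schedules. The function names, the `power=0.9` default, and the choice to apply the weight to the saliency term are all hypothetical and are not taken from the paper.

```python
def poly_weight(epoch: int, max_epochs: int, base: float = 1.0, power: float = 0.9) -> float:
    """Polynomially decay a loss weight from `base` (epoch 0) to 0 (epoch max_epochs).

    Assumption: same functional form as the common 'poly' learning-rate schedule.
    """
    if max_epochs <= 0:
        raise ValueError("max_epochs must be positive")
    # Clamp the epoch so the weight stays within [0, base].
    progress = min(max(epoch, 0), max_epochs) / max_epochs
    return base * (1.0 - progress) ** power


def combined_loss(seg_loss: float, sal_loss: float, epoch: int, max_epochs: int) -> float:
    """Hypothetical joint objective: segmentation loss plus a poly-weighted saliency loss."""
    return seg_loss + poly_weight(epoch, max_epochs) * sal_loss


if __name__ == "__main__":
    for epoch in (0, 5, 10):
        print(epoch, poly_weight(epoch, max_epochs=10))
```

Under this assumed form, one task's contribution fades as training progresses, which is one plausible way to keep two objectives from competing late in training. The actual schedule and the split into two training phases would have to be checked against the paper.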
Original abstract
Directing of the task-specific attention while tracking instrument in surgery holds great potential in robot-assisted intervention. For this purpose, we propose an end-to-end trainable multitask learning (MTL) model for real-time surgical instrument segmentation and attention prediction. Our model is designed with a weight-shared encoder and two task-oriented decoders and optimized for the joint tasks. We introduce batch-Wasserstein (bW) loss and construct a soft attention module to refine the distinctive visual region for efficient saliency learning. For multitask optimization, it is always challenging to obtain convergence of both tasks in the same epoch. We deal with this problem by adopting 'poly' loss weight and two phases of training. We further propose a novel way to generate task-aware saliency map and scanpath of the instruments on MICCAI robotic instrument segmentation dataset. Compared to the state of the art segmentation and saliency models, our model outperforms most of the evaluation metrics.
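The abstract does not define the batch-Wasserstein (bW) loss, so the following is only an illustration of the underlying distance, not the paper's formulation. It computes the Wasserstein-1 distance between two discrete distributions on a one-dimensional grid with unit spacing, which equals the sum of absolute differences between their cumulative sums. Flattening a saliency map to one dimension and averaging over a batch are simplifying assumptions, and every function name here is hypothetical.

```python
from itertools import accumulate
from typing import List, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale non-negative values so they sum to 1 (a discrete distribution)."""
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return [v / total for v in values]


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Wasserstein-1 distance between two histograms on the same unit-spaced 1D grid."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same length")
    cdf_p = accumulate(normalize(p))
    cdf_q = accumulate(normalize(q))
    # In 1D, W1 is the accumulated absolute gap between the two CDFs.
    return sum(abs(a - b) for a, b in zip(cdf_p, cdf_q))


def mean_batch_distance(preds: Sequence[Sequence[float]],
                        targets: Sequence[Sequence[float]]) -> float:
    """Hypothetical batch reduction: average the per-sample distances."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("batches must be non-empty and of equal size")
    return sum(wasserstein_1d(p, t) for p, t in zip(preds, targets)) / len(preds)
```

Unlike a pointwise loss, this distance grows with how far probability mass has to move, so a predicted saliency peak that is slightly misplaced is penalized less than one that is far from the target.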
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
free parameters (1)
- poly loss weight schedule
axioms (1)
- domain assumption: A weight-shared encoder can extract features that are simultaneously useful for both instrument segmentation and attention prediction.