Learning Where to Look While Tracking Instruments in Robot-assisted Surgery

Hongliang Ren; Mobarakol Islam; Yueyuan Li

arxiv: 1907.00214 · v1 · pith:Q4CVINNBnew · submitted 2019-06-29 · 💻 cs.CV

Pith reviewed 2026-05-25 12:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords: attention, instrument, learning, model, saliency, segmentation, instruments, loss

The pith

An end-to-end multitask model with shared encoder, separate decoders, batch-Wasserstein loss, and soft attention module reports better performance than prior segmentation and saliency methods on the MICCAI robotic instrument dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work builds a single neural network that handles two related jobs during robot-assisted surgery videos. One part of the network finds the exact shape of the instruments. Another part predicts where the visual focus should be directed for the task. Both parts share the same early layers that pull out visual features from the images. To improve the attention prediction, the authors add a loss based on batch-Wasserstein distance and a soft attention block that highlights useful image regions. Training is done in two phases with a changing weight schedule called poly loss weighting to keep both tasks improving together. The model is tested on a public dataset of robotic surgery scenes and is reported to beat earlier single-task models on most standard measures for segmentation accuracy and saliency prediction. The approach aims to make the system focus only on the parts of the scene that matter for tracking instruments rather than processing the entire frame equally.
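
The batch-Wasserstein loss is named here but not defined. As a rough illustration of the underlying idea only, the sketch below computes the 1-D Wasserstein-1 distance between two histograms over the same unit-spaced bins and averages it over a batch. The function names `wasserstein_1d` and `batch_mean_wasserstein`, the 1-D histogram setting, and the batch averaging are assumptions made for illustration and may differ from the authors' formulation.

```python
def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on the same unit-spaced bins.

    Both inputs are normalized to sum to 1; the distance is the summed
    absolute gap between their cumulative distributions.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("histograms must have positive mass")
    cdf_gap = 0.0
    distance = 0.0
    for a, b in zip(p, q):
        cdf_gap += a / sum_p - b / sum_q  # running difference of the two CDFs
        distance += abs(cdf_gap)
    return distance


def batch_mean_wasserstein(preds, targets):
    """Hypothetical batch reduction: mean of per-sample distances (an assumption)."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("batches must be non-empty and of equal size")
    return sum(wasserstein_1d(p, q) for p, q in zip(preds, targets)) / len(preds)
```

Unlike a bin-wise loss, this distance grows with how far mass has to move, which is one reason a Wasserstein-style loss can suit saliency maps where a near miss should cost less than a distant one.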

Core claim

Compared to the state of the art segmentation and saliency models, our model outperforms most of the evaluation metrics on the MICCAI robotic instrument segmentation dataset.

Load-bearing premise

The two-phase training schedule with poly loss weighting is sufficient to achieve stable convergence for both the segmentation and attention-prediction tasks at the same time, as stated when the abstract notes the general difficulty of multitask optimization.

Figures

Figures reproduced from arXiv: 1907.00214 by Hongliang Ren, Mobarakol Islam, Yueyuan Li.

**Figure 1.** Our proposed MTL model. It has a shared encoder and task-oriented decoders for segmentation and saliency prediction.

2.1 Attention Module (AM). We design a light attention module to suppress irrelevant regions and emphasize salient features. It contains a global pooling layer followed by a convolution block and a sigmoid layer to extract the global features, and multiplies with the original input to refine the fea…
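
The attention module described in the excerpt above (global pooling, convolution, sigmoid, multiplication with the input) can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: it assumes global average pooling, treats the convolution block as a single 1x1 convolution (a linear map across channels on the pooled vector), and uses hypothetical names `attention_module`, `weights`, and `bias`.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def attention_module(features, weights, bias):
    """Rescale each channel of `features` by a gate computed from global context.

    features: list of C channels, each an H x W list of lists of floats.
    weights:  C x C matrix standing in for a 1x1 convolution (assumption).
    bias:     list of C floats.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in features]
    # A 1x1 convolution on a 1x1 spatial map is a linear map across channels.
    gates = [
        sigmoid(sum(w * p for w, p in zip(weights[c], pooled)) + bias[c])
        for c in range(len(features))
    ]
    # Multiply the original input by the per-channel gates.
    return [[[g * v for v in row] for row in ch] for ch, g in zip(features, gates)]
```

With all-zero weights and biases every gate equals 0.5, so each feature value is halved; learned weights would instead suppress some channels and preserve others.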

**Figure 2.** Proposed modules: (a) scSE-Decoder, (b) Boundary Refinement (BR), (c) Attention Module, (d) AM-Decoder.

2.4 Network Architecture. Our multitask model consists of the shared encoder and task-oriented decoders as illustrated in …

**Figure 3.** Saliency map and scanpath generation using instrument movement and size.

2.6 Generation of saliency map. The context of our saliency map is top-bottom attention. A saliency map is usually generated from a fixation map, which is manually annotated by eye tracker [9] or mouse click [8]. We simulate the clicking process by locating fixation points only in the wrist and clasper parts of the instruments. A temporal weig…
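
One common way to turn fixation points into a saliency map is to place a weighted Gaussian at each point and normalize the result. The sketch below illustrates that idea under loud assumptions: the Gaussian spread `sigma`, the per-point weights, and the rule of ordering the scanpath by descending weight are all hypothetical, since the excerpt above is cut off before it specifies the weighting.

```python
import math


def saliency_map(height, width, fixations, sigma=2.0):
    """Build a saliency map from weighted fixation points.

    fixations: list of (row, col, weight) tuples; the weights are an
    assumed stand-in for whatever weighting the source actually uses.
    Returns an H x W list of lists scaled so the peak value is 1.0.
    """
    sal = [[0.0] * width for _ in range(height)]
    for r0, c0, w in fixations:
        for r in range(height):
            for c in range(width):
                d2 = (r - r0) ** 2 + (c - c0) ** 2
                sal[r][c] += w * math.exp(-d2 / (2.0 * sigma ** 2))
    peak = max(max(row) for row in sal)
    if peak > 0:
        sal = [[v / peak for v in row] for row in sal]
    return sal


def scanpath(fixations):
    """Order fixation points by descending weight (assumed ranking rule)."""
    ranked = sorted(fixations, key=lambda f: f[2], reverse=True)
    return [(r, c) for r, c, _ in ranked]
```

For example, `scanpath([(0, 0, 0.2), (3, 3, 0.9)])` visits `(3, 3)` first because it carries the larger weight.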

**Figure 4.** Input, ground truth annotations, saliency map and type segmentation generated by our model and other models of the same image are shown.

… saliency and segmentation as in equation 1. In phase II, we fine-tune the earliest converged model for a task by emphasizing the loss of the remaining task. To do this, we reduce the loss factor (λseg or λsal) of the converged task by using ‘poly’ learning rate policy [1…

**Figure 5.** (a) Visualization of type and binary segmentation. (b) Diagram of the accuracy of the top-one scanpath and whole scanpath prediction.

4 Discussion and Conclusion. In this work, we present a real-time multitask learning model to predict segmentation and scanpath of the surgical instruments during surgery. We introduce and generate task-oriented attention guidance to train the system where to look during robot…

Original abstract

Directing of the task-specific attention while tracking instrument in surgery holds great potential in robot-assisted intervention. For this purpose, we propose an end-to-end trainable multitask learning (MTL) model for real-time surgical instrument segmentation and attention prediction. Our model is designed with a weight-shared encoder and two task-oriented decoders and optimized for the joint tasks. We introduce batch-Wasserstein (bW) loss and construct a soft attention module to refine the distinctive visual region for efficient saliency learning. For multitask optimization, it is always challenging to obtain convergence of both tasks in the same epoch. We deal with this problem by adopting 'poly' loss weight and two phases of training. We further propose a novel way to generate task-aware saliency map and scanpath of the instruments on MICCAI robotic instrument segmentation dataset. Compared to the state of the art segmentation and saliency models, our model outperforms most of the evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies a shared-encoder multitask setup with batch-Wasserstein loss and two-phase poly training to surgical instrument segmentation plus attention, but supplies no numbers or convergence checks so the outperformance claim stays untestable from the abstract.

The main thing to know is that this work combines a weight-shared encoder with separate decoders for instrument segmentation and attention prediction, adds a batch-Wasserstein loss and soft attention module, and uses a two-phase poly-weighted schedule to handle the usual multitask convergence problem. It then reports better numbers than prior segmentation and saliency models on the MICCAI robotic instrument dataset and shows how to produce task-aware saliency maps and scanpaths from the outputs. That combination is presented as new for this domain.

The architecture choices are sensible for a real-time surgical setting and the use of a public dataset is the right move. The authors are straightforward about the difficulty of joint optimization and describe a concrete schedule to address it.

The soft spots are exactly where the stress-test note flags them. The abstract gives no numerical scores, no baseline names, no split details, and no error bars, so there is no way to judge whether the claimed gains are real or meaningful. The two-phase schedule is offered as the solution to convergence, yet the text supplies no loss curves, per-task metrics at the phase switch, or ablation that removes the schedule. Without those, it is impossible to know whether both tasks actually reach usable performance together or whether one simply dominates. That is a load-bearing gap for the central claim.

The paper is aimed at researchers who build vision systems for robot-assisted surgery or who adapt saliency and segmentation models to constrained medical domains. A reader already working on the MICCAI benchmark might extract the architecture and training recipe even if the results need re-checking. It is worth sending to peer review because the problem is well-motivated, the components are standard but sensibly assembled, and the dataset is public; a referee can ask for the missing numbers and ablations without starting from zero.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the proposed training schedule and loss terms for balancing two tasks; these are introduced without independent verification in the abstract and rely on standard assumptions about neural network feature sharing.

free parameters (1)

poly loss weight schedule
Used to balance the two tasks during the two-phase training; its specific form is chosen to address convergence issues.
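
The schedule can be sketched under stated assumptions. The excerpt under Figure 4 says that in phase II the loss factor (λseg or λsal) of the already converged task is reduced with a 'poly' policy. The sketch below uses the common poly form, (1 - step/total_steps) raised to a power, and a weighted sum of the two task losses. The exponent 0.9, the base weight of 1.0, the weighted-sum objective, and all function names are assumptions for illustration, not values taken from the paper.

```python
def poly_factor(step, total_steps, power=0.9):
    """Poly decay factor (1 - step / total_steps) ** power, clamped to [0, 1]."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return (1.0 - frac) ** power


def phase_two_weights(step, total_steps, converged="seg", base=1.0, power=0.9):
    """Return (lambda_seg, lambda_sal) with the converged task's factor decayed.

    The remaining task keeps the base weight, so its loss is emphasized
    more and more as training proceeds.
    """
    decayed = base * poly_factor(step, total_steps, power)
    if converged == "seg":
        return decayed, base
    if converged == "sal":
        return base, decayed
    raise ValueError("converged must be 'seg' or 'sal'")


def joint_loss(loss_seg, loss_sal, lambda_seg, lambda_sal):
    """Weighted sum of the two task losses (assumed form of the joint objective)."""
    return lambda_seg * loss_seg + lambda_sal * loss_sal
```

Halfway through phase II with `power=1.0` and segmentation already converged, `phase_two_weights(50, 100, "seg", power=1.0)` gives weights of 0.5 for segmentation and 1.0 for saliency, so the gradient is increasingly driven by the task that still needs to improve.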

axioms (1)

Domain assumption: a weight-shared encoder can extract features that are simultaneously useful for both instrument segmentation and attention prediction.
Invoked by the design choice of a single encoder feeding two task-oriented decoders.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 5 internal anchors

[1] Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019)

[2] Chaurasia, A., Culurciello, E.: Linknet: Exploiting encoder representations for efficient semantic segmentation. In: 2017 IEEE Visual Communications and Image Processing (VCIP). pp. 1–4. IEEE (2017)

[3] Chen, Z., Zhao, Z., Cheng, X.: Surgical instruments tracking based on deep learning with lines detection and spatio-temporal context. In: Chinese Automation Congress (CAC), 2017. pp. 2711–2714. IEEE (2017)

[4] Dvornik, N., Shmelkov, K., Mairal, J., Schmid, C.: Blitznet: A real-time deep network for scene understanding. In: Proceedings of the IEEE international conference on computer vision. pp. 4154–4162 (2017)

[5] Frogner, C., Zhang, C., Mobahi, H., Araya, M., Poggio, T.A.: Learning with a wasserstein loss. In: Advances in Neural Information Processing Systems. pp. 2053–2061 (2015)

[6] García-Peraza-Herrera, L.C., Li, W., Fidon, L., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Vander Poorten, E., Stoyanov, D., Vercauteren, T., et al.: Toolnet: Holistically-nested real-time segmentation of robotic surgical tools. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5717–5722. IEEE (2017)

[7] Islam, M., Atputharuban, D.A., Ramesh, R., Ren, H.: Real-time instrument segmentation in robotic surgery using auxiliary supervised deep adversarial learning. IEEE Robotics and Automation Letters (2019)

[8] Jiang, M., Huang, S., Duan, J., Zhao, Q.: Salicon: Saliency in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1072–1080 (2015)

[9] Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: 2009 IEEE 12th international conference on computer vision. pp. 2106–

[10] Liu, N., Han, J.: A deep spatial contextual long-term recurrent convolutional network for saliency detection. IEEE Transactions on Image Processing 27(7), 3264–3274 (2018)

[11] Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015)

[12] Nekrasov, V., Dharmasiri, T., Spek, A., Drummond, T., Shen, C., Reid, I.: Real-time joint semantic segmentation and depth estimation using asymmetric annotations. arXiv preprint arXiv:1809.04766 (2018)

[13] Ngu, J.C.Y., Tsang, C.B.S., Koh, D.C.S.: The da vinci xi: a review of its capabilities, versatility, and potential role in robotic colorectal surgery. Robotic Surgery: Research and Reviews 4, 77–85 (2017)

[14] Palazzi, A., Abati, D., Calderara, S., Solera, F., Cucchiara, R.: Predicting the driver’s focus of attention: the dr(eye)ve project. IEEE transactions on pattern analysis and machine intelligence (2018)

[15] Pan, J., Ferrer, C.C., McGuinness, K., O’Connor, N.E., Torres, J., Sayrol, E., Giro-i Nieto, X.: Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081 (2017)

[16] Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4353–4361 (2017)

[17] Roy, A.G., Navab, N., Wachinger, C.: Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 421–429. Springer (2018)

[18] Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I.: Automatic instrument segmentation in robot-assisted surgery using deep learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 624–628. IEEE (2018)

[19] Tian, Z., Shen, C., He, T., Yan, Y.: Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. arXiv preprint arXiv:1903.02120 (2019)

[20] Zhao, Z., Voros, S., Weng, Y., Chang, F., Li, R.: Tracking-by-detection of surgical instruments in minimally invasive surgery via the convolutional neural network deep learning-based method. Computer Assisted Surgery 22(sup1), 26–35 (2017)
