arxiv: 1907.00214 · v1 · pith:Q4CVINNBnew · submitted 2019-06-29 · 💻 cs.CV

Learning Where to Look While Tracking Instruments in Robot-assisted Surgery

Pith reviewed 2026-05-25 12:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords: attention, instrument, learning, model, saliency, segmentation, instruments, loss

The pith

An end-to-end multitask model with a shared encoder, separate task decoders, a batch-Wasserstein loss, and a soft attention module is reported to outperform prior segmentation and saliency methods on most evaluation metrics on the MICCAI robotic instrument segmentation dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The work builds a single neural network that handles two related jobs during robot-assisted surgery videos. One part of the network finds the exact shape of the instruments. Another part predicts where the visual focus should be directed for the task. Both parts share the same early layers that pull out visual features from the images. To improve the attention prediction, the authors add a loss based on batch-Wasserstein distance and a soft attention block that highlights useful image regions. Training is done in two phases with a changing weight schedule called poly loss weighting to keep both tasks improving together. The model is tested on a public dataset of robotic surgery scenes and is reported to beat earlier single-task models on most standard measures for segmentation accuracy and saliency prediction. The approach aims to make the system focus only on the parts of the scene that matter for tracking instruments rather than processing the entire frame equally.
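To make the "loss based on batch-Wasserstein distance" more concrete, here is a minimal sketch of the standard Wasserstein-1 distance between two 1-D histograms, averaged over a batch. This page does not give the paper's exact batch-Wasserstein formulation, so the function names `wasserstein_1d` and `batch_mean_distance`, the 1-D simplification, and the batch averaging are all assumptions made only for illustration.

```python
import numpy as np

def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two histograms on the same unit-spaced bins."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalize so both histograms are probability distributions.
    p = p / p.sum()
    q = q / q.sum()
    # In 1-D the distance equals the summed absolute difference of the CDFs.
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def batch_mean_distance(pred_batch, target_batch):
    """Hypothetical batch-level loss: mean 1-D distance over paired samples."""
    return float(np.mean([wasserstein_1d(p, q)
                          for p, q in zip(pred_batch, target_batch)]))

# Toy batch: the first pair differs by a two-bin shift, the second is identical.
preds = [[1, 0, 0], [0, 1, 0]]
targets = [[0, 0, 1], [0, 1, 0]]
loss = batch_mean_distance(preds, targets)
```

Unlike a per-pixel loss, this distance grows with how far probability mass has to move, which is one common motivation for using Wasserstein-style losses on saliency maps.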

Core claim

Compared to the state of the art segmentation and saliency models, our model outperforms most of the evaluation metrics on the MICCAI robotic instrument segmentation dataset.

Load-bearing premise

The two-phase training schedule with poly loss weighting is sufficient to achieve stable convergence for both the segmentation and attention-prediction tasks at the same time. The abstract itself notes that convergence of both tasks in the same epoch is a general difficulty of multitask optimization.

Figures

Figures reproduced from arXiv: 1907.00214 by Hongliang Ren, Mobarakol Islam, Yueyuan Li.

Figure 1
Figure 1: Our proposed MTL model. It has a shared encoder and task-oriented decoders for segmentation and saliency prediction. 2.1 Attention Module (AM): We design a light attention module to suppress irrelevant regions and emphasize salient features. It contains a global pooling layer followed by a convolution block and a sigmoid layer to extract the global features, and multiplies with the original input to refine the fea…
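The attention module described in the excerpt above (global pooling, a convolution block, a sigmoid, then multiplication with the original input) can be sketched in NumPy as follows. The excerpt is cut off, so the details here are assumptions: the convolution block is reduced to a single 1x1 convolution (a matrix multiply on the pooled vector), the gating is applied per channel, and the names `attention_module`, `weight`, and `bias` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_module(features, weight, bias):
    """Gate a (C, H, W) feature map with globally pooled channel statistics.

    weight has shape (C, C) and bias has shape (C,); both stand in for the
    learned parameters of the (assumed) 1x1 convolution block.
    """
    # Global average pooling: one descriptor per channel.
    pooled = features.mean(axis=(1, 2))
    # A 1x1 convolution on a 1x1 map is a matrix multiply; sigmoid maps to (0, 1).
    gate = sigmoid(weight @ pooled + bias)
    # Multiply with the original input to rescale each channel.
    return features * gate[:, None, None]

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8))
w = rng.normal(size=(4, 4))
b = np.zeros(4)
refined = attention_module(feats, w, b)
```

Because the gate lies strictly between 0 and 1, the module can only attenuate channels, which matches the stated goal of suppressing irrelevant regions while leaving salient features comparatively strong.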
Figure 2
Figure 2: Proposed modules: (a) scSE-Decoder, (b) Boundary Refinement (BR), (c) Attention Module, (d) AM-Decoder. 2.4 Network Architecture: Our multitask model consists of the shared encoder and task-oriented decoders as illustrated in …
Figure 3
Figure 3: Saliency map and scanpath generation using instruments movement and size. 2.6 Generation of saliency map: The context of our saliency map is top-bottom attention. Saliency map is usually generated from fixation map which is manually annotated by eye tracker [9] or mouse click [8]. We simulate the clicking process by locating fixation points only in the wrist and clasper parts of instruments. A temporal weig…
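The step from fixation points to a saliency map can be illustrated with a small sketch that places a Gaussian at each simulated fixation and normalizes the result. This is a common way to turn a fixation map into a saliency map, not necessarily the authors' procedure. The source text is truncated where it begins to describe a temporal weighting, so the `weights` argument is only a generic placeholder, and the function name, `sigma` value, and point coordinates are hypothetical.

```python
import numpy as np

def saliency_from_fixations(shape, points, sigma=10.0, weights=None):
    """Sum one Gaussian per (row, col) fixation point and scale the peak to 1."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    if weights is None:
        weights = np.ones(len(points))
    sal = np.zeros(shape, dtype=float)
    for (py, px), wt in zip(points, weights):
        dist_sq = (ys - py) ** 2 + (xs - px) ** 2
        sal += wt * np.exp(-dist_sq / (2.0 * sigma ** 2))
    peak = sal.max()
    return sal / peak if peak > 0 else sal

# Hypothetical fixation points standing in for a wrist and a clasper location.
fixations = [(20, 30), (40, 50)]
sal_map = saliency_from_fixations((64, 64), fixations, sigma=5.0,
                                  weights=[1.0, 0.5])
```

In this toy setup the more heavily weighted point becomes the global maximum of the map; a scanpath could then be read off by ordering the fixation points by their weights.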
Figure 4
Figure 4: Input, ground truth annotations, saliency map and type segmentation generated by our model and other models of the same image are shown. […] saliency and segmentation as in equation 1. In phase II, we fine-tune the earliest converged model for a task by emphasizing the loss of the remaining task. To do this, we reduce the loss factor (λseg or λsal) of the converged task by using ‘poly’ learning rate policy [1…
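The phase II idea in the excerpt above, decaying the loss factor of the already converged task with a 'poly' policy, might look like the sketch below. The widely used poly learning-rate form `base * (1 - step / max_step) ** power` is applied to a loss weight here. Equation 1 is not reproduced on this page, so the weighted-sum form of the joint loss, the `power=0.9` default, and all function names are assumptions.

```python
def poly_weight(base, step, max_step, power=0.9):
    """Decay a loss weight from `base` to 0 following the 'poly' policy."""
    step = min(max(step, 0), max_step)  # clamp to the valid range
    return base * (1.0 - step / max_step) ** power

def joint_loss(loss_seg, loss_sal, lam_seg, lam_sal):
    """Assumed weighted-sum form of the multitask objective."""
    return lam_seg * loss_seg + lam_sal * loss_sal

# Phase II example: suppose segmentation converged first, so its factor is
# decayed while the saliency factor stays fixed (loss values are made up).
max_step = 100
history = []
for step in (0, 50, 100):
    lam_seg = poly_weight(1.0, step, max_step)
    history.append(joint_loss(0.2, 0.8, lam_seg, 1.0))
```

As the converged task's factor shrinks, gradients from the remaining task dominate the shared encoder, which is the stated purpose of the second training phase.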
Figure 5
Figure 5: (a) Visualization of type and binary segmentation. (b) Diagram of the accuracy of the top-one scanpath and whole scanpath prediction. 4 Discussion and Conclusion: In this work, we present a real-time multitask learning model to predict segmentation and scanpath of the surgical instruments during surgery. We introduce and generate task-oriented attention guidance to train the system where to look during robot…

Original abstract

Directing of the task-specific attention while tracking instrument in surgery holds great potential in robot-assisted intervention. For this purpose, we propose an end-to-end trainable multitask learning (MTL) model for real-time surgical instrument segmentation and attention prediction. Our model is designed with a weight-shared encoder and two task-oriented decoders and optimized for the joint tasks. We introduce batch-Wasserstein (bW) loss and construct a soft attention module to refine the distinctive visual region for efficient saliency learning. For multitask optimization, it is always challenging to obtain convergence of both tasks in the same epoch. We deal with this problem by adopting `poly' loss weight and two phases of training. We further propose a novel way to generate task-aware saliency map and scanpath of the instruments on MICCAI robotic instrument segmentation dataset. Compared to the state of the art segmentation and saliency models, our model outperforms most of the evaluation metrics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the effectiveness of the proposed training schedule and loss terms for balancing two tasks; these are introduced without independent verification in the abstract and rely on standard assumptions about neural network feature sharing.

free parameters (1)
  • poly loss weight schedule
    Used to balance the two tasks during the two-phase training; its specific form is chosen to address convergence issues.
axioms (1)
  • domain assumption: A weight-shared encoder can extract features that are simultaneously useful for both instrument segmentation and attention prediction.
    Invoked by the design choice of a single encoder feeding two task-oriented decoders.

pith-pipeline@v0.9.0 · 5691 in / 1340 out tokens · 45236 ms · 2026-05-25T12:57:46.609479+00:00 · methodology

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1]

    2017 Robotic Instrument Segmentation Challenge

    Allan, M., Shvets, A., Kurmann, T., Zhang, Z., Duggal, R., Su, Y.H., Rieke, N., Laina, I., Kalavakonda, N., Bodenstedt, S., et al.: 2017 robotic instrument segmentation challenge. arXiv preprint arXiv:1902.06426 (2019)

  2. [2]

    In: 2017 IEEE Visual Communications and Image Processing (VCIP)

    Chaurasia, A., Culurciello, E.: Linknet: Exploiting encoder representations for efficient semantic segmentation. In: 2017 IEEE Visual Communications and Image Processing (VCIP). pp. 1–4. IEEE (2017)

  3. [3]

    In: Chinese Automation Congress (CAC), 2017

    Chen, Z., Zhao, Z., Cheng, X.: Surgical instruments tracking based on deep learning with lines detection and spatio-temporal context. In: Chinese Automation Congress (CAC), 2017. pp. 2711–2714. IEEE (2017)

  4. [4]

    In: Proceedings of the IEEE international conference on computer vision

    Dvornik, N., Shmelkov, K., Mairal, J., Schmid, C.: Blitznet: A real-time deep network for scene understanding. In: Proceedings of the IEEE international conference on computer vision. pp. 4154–4162 (2017)

  5. [5]

    In: Advances in Neural Information Processing Systems

    Frogner, C., Zhang, C., Mobahi, H., Araya, M., Poggio, T.A.: Learning with a wasserstein loss. In: Advances in Neural Information Processing Systems. pp. 2053–2061 (2015)

  6. [6]

    In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

    García-Peraza-Herrera, L.C., Li, W., Fidon, L., Gruijthuijsen, C., Devreker, A., Attilakos, G., Deprest, J., Vander Poorten, E., Stoyanov, D., Vercauteren, T., et al.: Toolnet: Holistically-nested real-time segmentation of robotic surgical tools. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 5717–5722. IEEE (2017)

  7. [7]

    IEEE Robotics and Automation Letters (2019)

    Islam, M., Atputharuban, D.A., Ramesh, R., Ren, H.: Real-time instrument segmentation in robotic surgery using auxiliary supervised deep adversarial learning. IEEE Robotics and Automation Letters (2019)

  8. [8]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Jiang, M., Huang, S., Duan, J., Zhao, Q.: Salicon: Saliency in context. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 1072–1080 (2015)

  9. [9]

    In: 2009 IEEE 12th international conference on computer vision

    Judd, T., Ehinger, K., Durand, F., Torralba, A.: Learning to predict where humans look. In: 2009 IEEE 12th international conference on computer vision. pp. 2106–

  10. [10]

    IEEE Transactions on Image Processing 27(7), 3264–3274 (2018)

    Liu, N., Han, J.: A deep spatial contextual long-term recurrent convolutional network for saliency detection. IEEE Transactions on Image Processing 27(7), 3264–3274 (2018)

  11. [11]

    ParseNet: Looking Wider to See Better

    Liu, W., Rabinovich, A., Berg, A.C.: Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579 (2015)

  12. [12]

    Real-Time Joint Semantic Segmentation and Depth Estimation Using Asymmetric Annotations

    Nekrasov, V., Dharmasiri, T., Spek, A., Drummond, T., Shen, C., Reid, I.: Real-time joint semantic segmentation and depth estimation using asymmetric annotations. arXiv preprint arXiv:1809.04766 (2018)

  13. [13]

    Robotic Surgery: Research and Reviews 4, 77–85 (2017)

    Ngu, J.C.Y., Tsang, C.B.S., Koh, D.C.S.: The da vinci xi: a review of its capabilities, versatility, and potential role in robotic colorectal surgery. Robotic Surgery: Research and Reviews 4, 77–85 (2017)

  14. [14]

    IEEE transactions on pattern analysis and machine intelligence (2018)

    Palazzi, A., Abati, D., Calderara, S., Solera, F., Cucchiara, R.: Predicting the driver’s focus of attention: the dr (eye) ve project. IEEE transactions on pattern analysis and machine intelligence (2018)

  15. [15]

    SalGAN: Visual Saliency Prediction with Generative Adversarial Networks

    Pan, J., Ferrer, C.C., McGuinness, K., O’Connor, N.E., Torres, J., Sayrol, E., Giro-i Nieto, X.: Salgan: Visual saliency prediction with generative adversarial networks. arXiv preprint arXiv:1701.01081 (2017)

  16. [16]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Peng, C., Zhang, X., Yu, G., Luo, G., Sun, J.: Large kernel matters–improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4353–4361 (2017)

  17. [17]

    In: International Conference on Medical Image Computing and Computer-Assisted Intervention

    Roy, A.G., Navab, N., Wachinger, C.: Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 421–429. Springer (2018)

  18. [18]

    In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA)

    Shvets, A.A., Rakhlin, A., Kalinin, A.A., Iglovikov, V.I.: Automatic instrument segmentation in robot-assisted surgery using deep learning. In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA). pp. 624–628. IEEE (2018)

  19. [19]

    Decoders Matter for Semantic Segmentation: Data-Dependent Decoding Enables Flexible Feature Aggregation

    Tian, Z., Shen, C., He, T., Yan, Y.: Decoders matter for semantic segmentation: Data-dependent decoding enables flexible feature aggregation. arXiv preprint arXiv:1903.02120 (2019)

  20. [20]

    Computer Assisted Surgery 22(sup1), 26–35 (2017)

    Zhao, Z., Voros, S., Weng, Y., Chang, F., Li, R.: Tracking-by-detection of surgical instruments in minimally invasive surgery via the convolutional neural network deep learning-based method. Computer Assisted Surgery 22(sup1), 26–35 (2017)