Multi-Task Recurrent Convolutional Network with Correlation Loss for Surgical Video Analysis

Chi-Wing Fu; Hao Chen; Huaxia Li; Jing Qin; Pheng-Ann Heng; Qi Dou; Yueming Jin

arxiv: 1907.06099 · v1 · pith:GWSKEUWMnew · submitted 2019-07-13 · 💻 cs.CV · cs.LG · eess.IV

Yueming Jin, Huaxia Li, Qi Dou, Hao Chen, Jing Qin, Chi-Wing Fu, Pheng-Ann Heng

Pith reviewed 2026-05-24 21:47 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · eess.IV

keywords multi-task learning · surgical video analysis · tool presence detection · phase recognition · correlation loss · recurrent convolutional network · Cholec80 dataset

The pith

A multi-task recurrent network with correlation loss improves both surgical tool detection and phase recognition by exploiting their relatedness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MTRCNet-CL, an end-to-end model whose two branches share early feature encoders to extract general visual features and keep separate, task-specific higher layers. The phase recognition branch uses an LSTM to capture temporal dependencies, and a novel correlation loss minimizes the divergence between the two branches' predictions to model task relatedness. Combining low-level feature sharing with high-level prediction correlation is meant to encourage interactions that benefit both tool presence detection and phase recognition. Experiments on the Cholec80 dataset report results above the prior state of the art: 89.1% mAP for tool detection and 87.4% F1 for phase recognition.

Core claim

The MTRCNet-CL model jointly solves tool presence detection and phase recognition through three components: shared early feature encoders, task-specific higher layers with an LSTM in the phase branch, and a correlation loss that minimizes the divergence between the two branches' predictions. Together these exploit the well-defined relatedness between the two tasks in surgical videos.

What carries the argument

The multi-task recurrent convolutional network with correlation loss (MTRCNet-CL), which shares low-level visual feature encoders across branches and applies a correlation loss to align high-level task predictions.
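
As a rough illustration of what minimizing the divergence between the two branches' predictions could look like, the sketch below compares the phase branch's distribution with a phase distribution derived from the tool branch. The abstract does not specify the divergence or how tool predictions are mapped into phase space, so the KL divergence, its direction, and the `tool_to_phase` matrix are assumptions made here for illustration, not the authors' formulation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def correlation_loss(phase_logits, tool_probs, tool_to_phase):
    """Divergence between the phase branch and a phase guess derived from tools.

    phase_logits:  raw phase scores from the phase branch (length P).
    tool_probs:    per-tool presence probabilities from the tool branch (length T).
    tool_to_phase: hypothetical P x T weight matrix mapping tools to phase scores
                   (an assumption of this sketch, not taken from the paper).
    """
    phase_pred = softmax(phase_logits)
    mapped_logits = [sum(w * t for w, t in zip(row, tool_probs))
                     for row in tool_to_phase]
    phase_from_tools = softmax(mapped_logits)
    return kl_divergence(phase_pred, phase_from_tools)
```

The value is zero when the two distributions agree and grows as they diverge, so adding it to the training objective penalizes frames where the tool branch and the phase branch tell inconsistent stories.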

If this is right

Tool presence detection reaches 89.1% mAP, exceeding the prior 81.0% on Cholec80 by 8.1 points.
Phase recognition reaches 87.4% F1 score, exceeding the prior 84.5% on Cholec80 by 2.9 points.
Low-level feature sharing and high-level prediction correlation together encourage beneficial interactions between the tasks.
The same end-to-end architecture consistently exceeds multiple state-of-the-art single-task methods on a large surgical video dataset.
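
For readers unfamiliar with the two metrics quoted above, here is a minimal sketch of how mean average precision and F1 are commonly computed. The paper's exact evaluation protocol (for example, any interpolation used for AP, or how per-phase F1 scores are averaged) is not stated in this summary, so the function names and conventions below are illustrative assumptions.

```python
def average_precision(scores, labels):
    """AP for one tool class: mean precision at each rank holding a positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(scores_per_class, labels_per_class):
    """mAP: unweighted mean of the per-class AP values."""
    aps = [average_precision(s, l)
           for s, l in zip(scores_per_class, labels_per_class)]
    return sum(aps) / len(aps)

def f1_score(predicted, actual, positive):
    """F1 for one phase label, treating `positive` as the class of interest."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == positive and a == positive)
    fp = sum(1 for p, a in zip(predicted, actual) if p == positive and a != positive)
    fn = sum(1 for p, a in zip(predicted, actual) if p != positive and a == positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

mAP suits tool presence because several tools can be visible in one frame (a multi-label problem), while F1 suits phase recognition because each frame belongs to exactly one phase.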

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The correlation loss could be applied to other pairs of related tasks in medical video analysis where one task provides temporal structure for another.
Performance gains may shrink on datasets where surgical procedures lack the strong sequential definition assumed here.
Replacing the divergence term with a learned alignment module might further reduce interference if the tasks become less correlated.

Load-bearing premise

The two tasks are highly correlated in clinical practice because the surgical process is well-defined, so minimizing prediction divergence between branches improves both without introducing harmful interference.

What would settle it

Train the same architecture with the correlation loss removed and measure whether mAP for tool presence and F1 for phase recognition both fall below the reported 89.1% and 87.4% on the Cholec80 test set.
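
This control experiment can be expressed as a single weight on the correlation term. The sketch below assumes the training objective is a weighted sum of the two task losses and the correlation loss, which the abstract does not spell out; `corr_weight`, `total_loss`, and `both_metrics_drop` are hypothetical names introduced here.

```python
# Figures reported for the full model on Cholec80 (from the abstract).
REPORTED = {"mAP": 89.1, "F1": 87.4}

def total_loss(tool_loss, phase_loss, corr_loss, corr_weight):
    """Assumed objective: both task losses plus a weighted correlation term."""
    return tool_loss + phase_loss + corr_weight * corr_loss

def both_metrics_drop(ablated, reference=REPORTED):
    """True if the ablated run scores below the reference on both mAP and F1."""
    return all(ablated[name] < reference[name] for name in ("mAP", "F1"))
```

Setting `corr_weight` to 0 keeps the shared encoder and both branches intact while removing only the correlation term, so any drop measured by `both_metrics_drop` can be attributed to that term rather than to feature sharing or temporal modeling.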

Original abstract

Surgical tool presence detection and surgical phase recognition are two fundamental yet challenging tasks in surgical video analysis and also very essential components in various applications in modern operating rooms. While these two analysis tasks are highly correlated in clinical practice as the surgical process is well-defined, most previous methods tackled them separately, without making full use of their relatedness. In this paper, we present a novel method by developing a multi-task recurrent convolutional network with correlation loss (MTRCNet-CL) to exploit their relatedness to simultaneously boost the performance of both tasks. Specifically, our proposed MTRCNet-CL model has an end-to-end architecture with two branches, which share earlier feature encoders to extract general visual features while holding respective higher layers targeting for specific tasks. Given that temporal information is crucial for phase recognition, long-short term memory (LSTM) is explored to model the sequential dependencies in the phase recognition branch. More importantly, a novel and effective correlation loss is designed to model the relatedness between tool presence and phase identification of each video frame, by minimizing the divergence of predictions from the two branches. Mutually leveraging both low-level feature sharing and high-level prediction correlating, our MTRCNet-CL method can encourage the interactions between the two tasks to a large extent, and hence can bring about benefits to each other. Extensive experiments on a large surgical video dataset (Cholec80) demonstrate outstanding performance of our proposed method, consistently exceeding the state-of-the-art methods by a large margin (e.g., 89.1% v.s. 81.0% for the mAP in tool presence detection and 87.4% v.s. 84.5% for F1 score in phase recognition). The code can be found on our project website.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes MTRCNet-CL, a multi-task recurrent CNN with shared early encoders, task-specific higher layers, LSTM in the phase branch, and a novel correlation loss that minimizes prediction divergence between tool-presence and phase branches. It reports state-of-the-art results on Cholec80 (89.1% mAP tool detection, 87.4% F1 phase recognition) and attributes gains to joint low-level feature sharing plus high-level prediction correlation.

Significance. If the correlation loss is shown to contribute beyond shared features, the work would provide a concrete mechanism for exploiting clinical task relatedness in surgical video analysis and would be strengthened by the stated code release.

major comments (1)

[Experiments] Experiments section: no ablation retains the dual-branch LSTM architecture and shared encoder while setting the correlation-loss weight to zero. Without this control, the reported margins (89.1% vs 81.0% mAP; 87.4% vs 84.5% F1) cannot be attributed specifically to the high-level correlation term rather than multi-task feature sharing or temporal modeling alone.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment regarding the need for a targeted ablation study. We agree that isolating the contribution of the correlation loss requires the specific control experiment described and will add it to the revised manuscript.

Point-by-point responses

Referee: [Experiments] Experiments section: no ablation retains the dual-branch LSTM architecture and shared encoder while setting the correlation-loss weight to zero. Without this control, the reported margins (89.1% vs 81.0% mAP; 87.4% vs 84.5% F1) cannot be attributed specifically to the high-level correlation term rather than multi-task feature sharing or temporal modeling alone.

Authors: We agree that the current set of experiments does not include an ablation that keeps the dual-branch architecture (shared early encoders plus LSTM in the phase branch) while setting the correlation-loss weight exactly to zero. In the revised manuscript we will add this control experiment on Cholec80, reporting the resulting mAP and F1 scores. This will allow direct quantification of the incremental benefit attributable to the high-level correlation term beyond multi-task feature sharing and temporal modeling.

revision: yes

Circularity Check

0 steps flagged

No circularity: empirical multi-task architecture evaluated on public dataset

Full rationale

The paper defines an architecture (shared encoders + task-specific branches + LSTM + correlation loss) and reports empirical results on Cholec80. No equations, fitted parameters, or self-citations reduce any claimed result to its inputs by construction. The correlation loss is an explicit design choice whose contribution is tested via overall performance comparisons rather than being presupposed. The derivation chain is therefore self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim depends on task correlation and effectiveness of the new loss term; abstract-only view limits enumeration of all hyperparameters.

free parameters (1)

correlation loss weight
Balancing hyperparameter for the divergence term between branches; value and tuning not described in abstract.

axioms (1)

domain assumption: Surgical tool presence detection and phase recognition tasks are highly correlated because the surgical process is well-defined.
Invoked explicitly in abstract to motivate joint modeling and correlation loss.

