SurgicalMamba: Dual-Path SSD with State Regramming for Online Surgical Phase Recognition

Sukju Oh; Sukkyu Sun

arxiv: 2605.14889 · v2 · pith:OSXGKRYSnew · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-20 20:56 UTC · model grok-4.3

keywords: surgical phase recognition · online video recognition · state space models · dual-path architecture · real-time prediction · medical video analysis · causal sequence modeling · phase segmentation

The pith

SurgicalMamba extends structured state-space models with dual paths, intensity warping, and channel rotations to recognize surgical phases online at constant per-frame cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a causal model for online surgical phase recognition that commits to a phase label at every frame using only past video context. Standard video models either let cost grow with sequence length or advance state uniformly without adapting to irregular phase transitions or correlated visual channels in narrow surgical scenes. SurgicalMamba keeps cost at O(d) per frame by adding three compatible modifications to the underlying state-space duality: a dual-path block that splits long- and short-term recurrence, intensity-modulated stepping that warps the slow path's rate according to information content, and state regramming that introduces cross-channel mixing through learned Cayley rotations. These changes produce higher accuracy and phase-level Jaccard scores on seven public benchmarks while running at 238.74 fps on a single GPU. A reader would care because reliable real-time phase tracking supports context-aware tools in operating rooms without requiring future frames or unbounded memory.

Core claim

SurgicalMamba is a causal surgical phase recognition (SPR) model built on Mamba2's structured state-space duality that holds per-frame cost at O(d). It introduces a dual-path SSD block separating long- and short-term regimes at the recurrent state level, intensity-modulated stepping as a continuous-time warp that adapts the slow path to phase-relevant information, and state regramming as a per-chunk Cayley rotation that enables cross-channel mixing. The learned rotation planes exhibit phase-aligned structure without direct supervision. The model reaches new state-of-the-art accuracy and Jaccard under strict online evaluation across seven benchmarks while sustaining high frame rates on a single GPU.
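To make the "constant per-frame cost" and "two temporal regimes" ideas concrete, here is a minimal sketch of a causal update that keeps two fixed-size recurrent states, one slow and one fast. It is not the paper's SSD block: the exponential-moving-average form, the decay constants, the additive read-out, and all names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DualPathState:
    """Two fixed-size recurrent states; their size never grows with elapsed frames."""
    slow: List[float]
    fast: List[float]


def step(state: DualPathState, x: List[float],
         slow_decay: float = 0.99, fast_decay: float = 0.5) -> List[float]:
    """One causal update per channel: h <- decay * h + (1 - decay) * x.

    The cost is O(d) for d = len(x), independent of how many frames came before.
    The decay constants and the additive read-out are illustrative assumptions.
    """
    for i, xi in enumerate(x):
        state.slow[i] = slow_decay * state.slow[i] + (1.0 - slow_decay) * xi
        state.fast[i] = fast_decay * state.fast[i] + (1.0 - fast_decay) * xi
    return [s + f for s, f in zip(state.slow, state.fast)]


def run(frames: List[List[float]]) -> DualPathState:
    """Stream frames one at a time, using only past context at every step."""
    d = len(frames[0])
    state = DualPathState(slow=[0.0] * d, fast=[0.0] * d)
    for x in frames:
        step(state, x)
    return state
```

After ten identical frames the fast state has almost converged to the input while the slow state has barely moved, which is the kind of long-term versus short-term separation the dual-path design aims for.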

What carries the argument

Dual-path SSD block with intensity-modulated stepping and state regramming, which separates temporal regimes at the state level, warps dynamics to information intensity, and adds cross-channel mixing via Cayley rotations in the recurrence.

If this is right

Per-frame computation stays bounded at O(d) no matter how many frames have elapsed.
The slow path's effective update rate self-adjusts to moments carrying phase-defining visual information.
Cross-channel interactions arise automatically through rotation planes that align with surgical workflow phases.
Accuracy and phase-level Jaccard improve over prior online methods on standard benchmarks without sacrificing speed.
Internal state representations become interpretable as phase signatures without any extra supervision signal.
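The two metrics named above can be sketched for a single video as follows. This is a plain frame-wise computation with no boundary tolerance; whether it matches the paper's strict protocol in every detail (for example how scores are averaged across videos) is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Sequence


def accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Fraction of frames whose predicted phase equals the ground truth."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def phase_jaccard(pred: Sequence[int], true: Sequence[int]) -> Dict[int, float]:
    """Per-phase intersection over union of frame sets, for phases in the ground truth."""
    scores: Dict[int, float] = {}
    for phase in sorted(set(true)):
        inter = sum(p == phase and t == phase for p, t in zip(pred, true))
        union = sum(p == phase or t == phase for p, t in zip(pred, true))
        scores[phase] = inter / union  # union >= 1 because the phase occurs in `true`
    return scores


def mean_jaccard(pred: Sequence[int], true: Sequence[int]) -> float:
    """Average of the per-phase Jaccard scores."""
    scores = phase_jaccard(pred, true)
    return sum(scores.values()) / len(scores)
```

A phase that is missed entirely contributes a Jaccard of 0, so the mean Jaccard penalizes dropped phases far more than frame accuracy does.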

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same three modifications could transfer to other long-sequence medical video tasks such as tool tracking or gesture detection.
Real-time phase outputs could feed directly into decision-support systems that anticipate upcoming steps in a procedure.
The unsupervised phase alignment in the rotation planes suggests a route to automatic discovery of workflow patterns across different surgical specialties.
Combining the constant-cost recurrence with multi-task heads might allow joint prediction of phases, tools, and anatomy from the same backbone.

Load-bearing premise

The three added components stay fully compatible with the base state-space duality and introduce neither instability nor hidden costs when operating causally on long surgical videos with non-uniform timing.

What would settle it

A test on a new surgical video dataset containing sequences longer than 50,000 frames with more extreme timing irregularity or stronger inter-channel correlations than the current benchmarks, checking whether accuracy or Jaccard gains disappear while per-frame latency remains constant.

Figures

Figures reproduced from arXiv: 2605.14889 by Sukju Oh, Sukkyu Sun.

**Figure 1.** Two of SurgicalMamba’s core mechanisms. (A) Intensity-modulated temporal stepping (λ). A learned per-frame scalar λ (green) modulates the temporal dynamics of the SSM, enabling video-specific adaptation. Near a phase transition (vertical dashed line), λ rises sharply, which in turn drives the effective SSM decay dA (blue) down. The reduced dA shrinks the contribution of accumulated past state to the curren…

**Figure 2.** Overview of SurgicalMamba. Top: The model takes a stream of surgical frames {x_{t−3}, x_{t−2}, x_{t−1}, x_t}, extracts per-frame visual features through a partially frozen ConvNeXt backbone, projects them to the model dimension via a visual projection, processes them through K stacked Dual-Path Surgical Mamba blocks, and produces phase predictions through an Out Head. Each Dual-Path Surgical Mamba block contains a S…

**Figure 3.** Hyperparameter sensitivity on Cholec80. We sweep the rotation rank r (top), the chunk size Lc (middle), and the state dimension N (bottom) while keeping the remaining two hyperparameters fixed at the default (r=16, Lc=64, N=64), marked by vertical dotted lines. Solid blue and dashed red curves denote relaxed and strict evaluation protocols, respectively; shaded regions show one standard deviation across th…

**Figure 4.** Phase prediction on Cholec80 video 41. From top: ground truth, SurgicalMamba (Ours), MTTR-Net, and DACAT. SurgicalMamba recovers all phases with stable predictions inside each phase and tight transition boundaries, while MTTR-Net misses the CleanCoag phase entirely.

**Figure 5.** Chunk-to-chunk cosine similarity of state-regramming rotation planes on a Cholec80 test video (1 = same plane, 0 = orthogonal). Side bars mark ground-truth phase membership. Bright block-diagonal structure aligned with phase boundaries shows that each phase receives its own rotation basis, with sharp re-orientation at transitions. This is the empirical counterpart to the conceptual mechanism illustrated in…

**Figure 6.** Per-chunk rotation angles on a Cholec80 test video. Maximum (dashed), mean (solid), and minimum (dotted) angle over the r = 16 planes per head, with SSM heads partitioned into three groups by trajectory similarity. The spread within each head (max ∼ 105–125°, min near 0°) reflects a division of labor between transformative and near-identity planes; trajectories are nearly flat across the procedure, so p…

**Figure 7.** Per-frame intensity λ(t) (top) and effective decay dA = exp(A · ∆t · (1 + λ)) (bottom) on a Cholec80 test video. Shaded colors denote ground-truth phases; blue band is the 10–90 percentile across SSM heads, dark line the mean. λ stays near zero during sustained phases and spikes at phase boundaries; dA dips correspondingly at transitions and holds a plateau near 0.92 within phases. Forgetting is engaged on…
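The decay formula quoted in the Figure 7 caption can be evaluated directly to see how a spike in λ speeds up forgetting. The sketch below assumes A < 0 and ∆t > 0; the value of A · ∆t is chosen here only so that dA equals the 0.92 plateau at λ = 0, and the nonzero λ values are made up for illustration rather than taken from the paper.

```python
import math


def effective_decay(a: float, dt: float, lam: float) -> float:
    """dA = exp(A * dt * (1 + lam)), the formula quoted in the Figure 7 caption.

    With A < 0 and dt > 0, a larger intensity lam gives a smaller dA,
    so the accumulated state is forgotten faster.
    """
    if a >= 0 or dt <= 0:
        raise ValueError("expected A < 0 and dt > 0 for a decaying state")
    return math.exp(a * dt * (1.0 + lam))


# Hypothetical values: pick A * dt so that dA = 0.92 when lam = 0,
# matching the within-phase plateau mentioned in the caption.
A, DT = math.log(0.92), 1.0

for lam in (0.0, 1.0, 4.0):  # the nonzero lam values are made-up spikes
    print(f"lam={lam:.1f}  dA={effective_decay(A, DT, lam):.4f}")
```

Because 1 + λ multiplies the exponent, dA at intensity λ is the plateau value raised to the power 1 + λ: under these assumed numbers, λ = 4 takes the decay from 0.92 down to roughly 0.66.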

Original abstract

Online surgical phase recognition (SPR) underpins context-aware operating-room systems and requires committing to a prediction at every frame from past context alone. Surgical video poses three demands that natural-video recognizers do not jointly address: procedures span tens of thousands of frames, time flows non-uniformly as long routine stretches are punctuated by brief phase-defining transitions, and the visual domain is narrow so backbone features are strongly correlated across channels. Existing recognizers either let per-frame cost grow with elapsed length, or hold cost bounded but advance state at a uniform rate with channel-independent dynamics, leaving the latter two demands unaddressed. We present SurgicalMamba, a causal SPR model built on Mamba2's structured state-space duality (SSD) that holds per-frame cost at O(d). It introduces three SSD-compatible components, each targeting one demand: a dual-path SSD block that separates long- and short-term regimes at the level of recurrent state; intensity-modulated stepping, a continuous-time time-warp that adapts the slow path's effective rate to phase-relevant information; and state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence. The learned rotation planes inherit a phase-aligned structure without any direct supervision, offering an interpretable internal signature of surgical workflow. Across seven public SPR benchmarks, SurgicalMamba reaches state-of-the-art accuracy and phase-level Jaccard under strict online evaluation: 94.6%/82.7% on Cholec80 (+0.7 pp/+2.2 pp over the strongest prior) and 89.5%/68.9% on AutoLaparo (+1.7 pp/+2.0 pp), at 238.74 fps on a single GPU. Ablations isolate the contribution of each component. The code is publicly available at https://github.com/sukjuoh/Surgical-Mamba.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SurgicalMamba layers dual-path SSD, intensity-modulated stepping, and per-chunk Cayley rotations onto Mamba2 for online surgical phase recognition and reports modest SOTA gains with public code, though the rotation's impact on long-sequence stability is the part that still needs checking.

Colleague, the main thing to know is that this paper takes Mamba2's structured state-space duality and adds three pieces aimed at surgical video: a dual-path block that splits long-term and short-term state handling, intensity-modulated stepping that warps the time scale based on content, and state regramming that uses per-chunk Cayley rotations to mix channels inside the otherwise axis-aligned recurrence. They show accuracy and Jaccard improvements on seven benchmarks under strict online rules, including 94.6%/82.7% on Cholec80 and 89.5%/68.9% on AutoLaparo, while running at 238 fps on one GPU, with ablations and released code.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces SurgicalMamba, a causal online surgical phase recognition model built on Mamba2's structured state-space duality (SSD). It proposes three SSD-compatible extensions: a dual-path SSD block separating long- and short-term recurrent regimes, intensity-modulated stepping as a continuous-time warp adapting the slow path to phase-relevant information, and state regramming via per-chunk Cayley rotation to introduce cross-channel mixing in the otherwise axis-aligned recurrence. The model reports state-of-the-art accuracy and phase-level Jaccard index on seven public benchmarks under strict online causal evaluation (e.g., 94.6%/82.7% on Cholec80 and 89.5%/68.9% on AutoLaparo), with 238.74 fps inference on a single GPU. Ablations isolate each component, the learned rotations exhibit emergent phase-aligned structure, and code is released publicly.

Significance. If the central claims hold, the work advances efficient online SPR by jointly addressing long surgical sequences, non-uniform phase timing, and channel correlations while preserving Mamba2's linear per-frame cost. The public code, component ablations, and interpretable internal signatures strengthen reproducibility and utility for context-aware OR systems.

major comments (1)

[State regramming description and any associated recurrence equations] The state regramming component (per-chunk Cayley rotation for cross-channel mixing) is presented as fully compatible with Mamba2 SSD duality and O(d) recurrence. No derivation or analysis is supplied showing that the modified transition operator still admits the original closed-form solution, preserves linear complexity, or keeps hidden-state norms bounded over 10k+ frame sequences in the causal online regime. This is load-bearing for the claim that the three extensions solve the stated video demands without hidden costs or instability, as the reported FPS and accuracy gains rest on this assumption.

minor comments (1)

[Abstract] The abstract states results across seven benchmarks but provides detailed numbers only for Cholec80 and AutoLaparo; a compact summary table of all seven would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their insightful comments on our manuscript. We address the major comment regarding the state regramming component in detail below, and we will incorporate the necessary clarifications and derivations in the revised version.

Referee: [State regramming description and any associated recurrence equations] The state regramming component (per-chunk Cayley rotation for cross-channel mixing) is presented as fully compatible with Mamba2 SSD duality and O(d) recurrence. No derivation or analysis is supplied showing that the modified transition operator still admits the original closed-form solution, preserves linear complexity, or keeps hidden-state norms bounded over 10k+ frame sequences in the causal online regime. This is load-bearing for the claim that the three extensions solve the stated video demands without hidden costs or instability, as the reported FPS and accuracy gains rest on this assumption.

Authors: We agree that a formal derivation would strengthen the presentation. The per-chunk Cayley rotation is applied as a post-recurrence mixing step on the hidden state after each chunk of frames is processed. Because the Cayley transform of a skew-symmetric matrix is orthogonal, it preserves the Euclidean norm of the state vector, so the rotation itself cannot inflate the state norm however long the sequence runs. Furthermore, since the rotation is constant within each chunk and the underlying SSD recurrence is linear, the overall transition can still be expressed in closed form by composing the rotation with the original SSD solution at chunk boundaries. We do not expect this composition to increase the per-frame complexity beyond O(d), as the rotation is a d × d matrix multiplication performed only once per chunk, so its cost is amortized over the frames of that chunk (the Figure 3 caption gives a default chunk size of Lc = 64). In the revision, we will add this analysis to Section 3, including the updated recurrence equations, and report norm stability plots in the appendix to confirm no instability over 10k+ frames.

revision: yes
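The norm-preservation part of this argument is easy to check numerically. The sketch below uses the closed form of the Cayley transform for a single 2 × 2 skew-symmetric generator, which is a plane rotation, and applies one such rotation per chunk to a small state vector. The choice of planes, the per-chunk parameter, and the function names are hypothetical; the check covers only the rotation and says nothing about whether the composed recurrence keeps the SSD closed form, which is the referee's main question.

```python
import math
from typing import List, Tuple


def cayley_2x2(a: float) -> Tuple[Tuple[float, float], Tuple[float, float]]:
    """Cayley transform Q = (I - S)^{-1} (I + S) of S = [[0, -a], [a, 0]].

    In closed form Q is a rotation by theta = 2 * atan(a), so Q^T Q = I.
    """
    c = (1.0 - a * a) / (1.0 + a * a)
    s = 2.0 * a / (1.0 + a * a)
    return ((c, -s), (s, c))


def rotate_plane(h: List[float], i: int, j: int, a: float) -> None:
    """Mix channels i and j of the state h in place with a Cayley rotation."""
    (c, minus_s), (s, _) = cayley_2x2(a)
    hi, hj = h[i], h[j]
    h[i] = c * hi + minus_s * hj
    h[j] = s * hi + c * hj


def norm(h: List[float]) -> float:
    """Euclidean norm of the state vector."""
    return math.sqrt(sum(v * v for v in h))


# Apply one rotation per "chunk" for many chunks and watch the norm.
state = [3.0, 4.0, 0.0, 12.0]           # Euclidean norm 13
for chunk in range(10_000):
    a = 0.3 * math.sin(chunk)           # hypothetical per-chunk parameter
    rotate_plane(state, chunk % 3, 3, a)
print(norm(state))                       # stays at 13 up to rounding error
```

In this sketch each plane rotation touches only two channels, so a rotation built from r planes costs O(r) per chunk rather than the O(d²) of a dense d × d multiply. Whether the paper's rank-r rotation is implemented this way is not stated in the material above.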

Circularity Check

0 steps flagged

No circularity: architecture components defined independently of reported metrics

The paper defines SurgicalMamba via three explicit SSD-compatible extensions (dual-path block, intensity-modulated stepping, state regramming via per-chunk Cayley rotation) whose functional forms are stated directly in the abstract and do not reference the final accuracy numbers or Jaccard scores. No equation reduces a claimed prediction to a fitted parameter, and the central performance claims are presented as empirical outcomes on public benchmarks rather than derived quantities. The Mamba2 SSD base is cited as external prior work with no overlapping authorship indicated. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The work inherits the O(d) per-frame cost and structured duality properties of Mamba2 as background; the three new components are algorithmic additions rather than new physical entities or fitted constants beyond standard neural-network training.

axioms (1)

[standard math] Mamba2 structured state-space duality maintains O(d) per-frame cost under causal recurrence.
Invoked when the abstract states that the model holds per-frame cost at O(d) while adding the three components.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation: unclear
Paper passage: state regramming, a per-chunk Cayley rotation that opens cross-channel mixing in the otherwise axis-aligned SSM recurrence

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_strictMono_of_one_lt
Relation: unclear
Paper passage: intensity-modulated stepping (λ): a continuous-time time-warp that adapts the slow path’s effective rate

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation: unclear
Paper passage: Mamba2’s structured state-space duality (SSD) that holds per-frame cost at O(d)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

[1] Andrei team. Surgical workflow analysis in the SensorOR (EndoVis 2017 sub-challenge participant). https://endovissub2017-workflow.grand-challenge.org/,

[2] arXiv preprint arXiv:2401.11174. A. Banino, J. Balaguer, and C. Blundell. PonderNet: Learning to ponder. arXiv preprint arXiv:2107.05407,

[3] S. Bodenstedt, M. Wagner, D. Katic, P. Mietkowski, B. Mayer, H. Kenngott, B. Müller-Stich, R. Dillmann, and S. Speidel. Unsupervised temporal context learning using convolutional neural networks for laparoscopic workflow analysis. arXiv preprint arXiv:1702.03684,

[4] R. Cao, J. Wang, and Y.-H. Liu. SR-Mamba: Effective surgical phase recognition with state space model. arXiv preprint arXiv:2407.08333,

[5] Y. Chen, X. Zhang, S. Hu, X. Han, Z. Liu, and M. Sun. Stuffed mamba: Oversized states lead to the inability to forget. arXiv preprint arXiv:2410.07145,

[6] X. Ding, X. Yan, Z. Wang, W. Zhao, J. Zhuang, X. Xu, and X. Li. Less is more: Surgical phase recognition from timestamp supervision. IEEE Transactions on Medical Imaging, 42(6):1897–1910,

[7] Dylan team. Surgical workflow analysis in the SensorOR (EndoVis 2017 sub-challenge participant). https://endovissub2017-workflow.grand-challenge.org/,

[8] I. Funke, D. Rivoir, and S. Speidel. Metrics matter in surgical phase recognition. arXiv preprint arXiv:2305.13961,

[9] A. Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983,

[10] Y. Jin, Y. Long, C. Chen, Z. Zhao, Q. Dou, and P.-A. Heng. Temporal memory relation network for workflow recognition from surgical video. IEEE Transactions on Medical Imaging, 40(7):1911–1923,

[11] Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, and Y. Liu. VMamba: Visual state space model. arXiv preprint arXiv:2401.10166,

[12] J. Ma, F. Li, and B. Wang. U-Mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722,

[13] Robin team. Surgical workflow analysis in the SensorOR (EndoVis 2017 sub-challenge participant). https://endovissub2017-workflow.grand-challenge.org/,

[14] doi: 10.1145/3204949.3208137. A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy. Endonet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging, 36(1):86–97, 2016a. A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. De Mathelin, and N. Padoy. Workshop and challenges on mo...

[15] Y. Wang, Y. Chen, J. Yan, J. Lu, and X. Sun. MemMamba: Rethinking memory patterns in state space model. arXiv preprint arXiv:2510.03279,

[16] Z. Wang, J.-Q. Zheng, Y. Zhang, G. Cui, and L. Li. Mamba-UNet: UNet-like pure visual mamba for medical image segmentation. arXiv preprint arXiv:2402.05079,
