HyKey: Hyperspectral Keypoint Detection and Matching in Minimally Invasive Surgery

Alexander Saikia; Chiara Di Vece; Chloe He; Danail Stoyanov; Joao Ramalhinho; Sierra Bonilla; Sophia Bano; Tobias Czempiel; Zhehua Mao

arxiv: 2604.17446 · v1 · submitted 2026-04-19 · 💻 cs.CV

Pith reviewed 2026-05-10 05:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords: hyperspectral imaging, keypoint detection, minimally invasive surgery, 3D reconstruction, feature matching, pose estimation, convolutional neural network

The pith

A hyperspectral imaging model for keypoint detection in surgery achieves higher matching accuracy than standard RGB methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops HyKey, a neural network that processes hyperspectral image cubes to find and match keypoints in surgical scenes. It shows that incorporating spectral data helps overcome the poor texture and lighting issues common in minimally invasive surgery. This leads to better performance than traditional RGB methods on tasks like matching accuracy and camera pose estimation. A reader would care because accurate 3D reconstruction can improve surgical guidance and augmented reality tools.

Core claim

HyKey is a hybrid 3D-2D convolutional neural network that jointly extracts spatial-spectral features from HSI cubes. Trained on a robotically-acquired dual-camera RGB-HSI laparoscopic dataset of ex-vivo organs using synthetic homographic augmentation and epipolar geometry constraints, the model outperforms RGB baselines such as SuperPoint and ALIKE, reaching 96.62% mean matching accuracy and 67.18% mean average accuracy at 10 degrees on pose estimation.
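
As a rough illustration of what "hybrid 3D-2D" means in terms of tensor shapes, the sketch below traces a cube through a few 3D convolutions, a reshape that folds the remaining spectral axis into channels, and a 2D convolution. The band count, layer widths, and kernel sizes are hypothetical values chosen for the example; the paper's actual configuration is not given here.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def hybrid_3d_2d_shapes(bands, height, width, conv3d_layers, conv2d_layers):
    """Trace shapes through 3D convs, a spectral-to-channel reshape, then 2D convs.

    conv3d_layers: list of (out_channels, spectral_kernel, spatial_kernel)
    conv2d_layers: list of (out_channels, spatial_kernel)
    Spatial axes use 'same' padding (odd kernels); the spectral axis is unpadded.
    All layer settings are hypothetical, not taken from the paper.
    """
    channels, depth = 1, bands
    shapes = [("input cube", (channels, depth, height, width))]
    for out_ch, k_spec, k_sp in conv3d_layers:
        depth = conv_out(depth, k_spec)
        height = conv_out(height, k_sp, padding=k_sp // 2)
        width = conv_out(width, k_sp, padding=k_sp // 2)
        channels = out_ch
        shapes.append(("conv3d", (channels, depth, height, width)))
    # Fold the remaining spectral depth into the channel axis for 2D processing.
    channels = channels * depth
    shapes.append(("reshape to 2D", (channels, height, width)))
    for out_ch, k_sp in conv2d_layers:
        height = conv_out(height, k_sp, padding=k_sp // 2)
        width = conv_out(width, k_sp, padding=k_sp // 2)
        channels = out_ch
        shapes.append(("conv2d", (channels, height, width)))
    return shapes


shapes = hybrid_3d_2d_shapes(
    bands=16, height=64, width=64,
    conv3d_layers=[(8, 7, 3), (16, 5, 3)],
    conv2d_layers=[(64, 3)],
)
for name, shape in shapes:
    print(name, shape)
```

The point of the reshape step is that the 3D layers learn joint spectral-spatial filters first, while the later 2D layers operate on an ordinary feature map that a keypoint or descriptor head can consume.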

What carries the argument

HyKey, a hybrid 3D-2D convolutional neural network that jointly extracts spatial-spectral features from hyperspectral imaging cubes, trained with synthetic homographic augmentation and epipolar geometry constraints.
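
Synthetic homographic augmentation supplies ground-truth correspondences for free: a keypoint in the original image maps to a known location in the warped copy. The sketch below shows that mapping for a hand-written 3x3 homography. The function names, the matrix, and the image size are assumptions made for illustration and do not come from the paper or its repository.

```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


def warp_keypoints(H, keypoints, width, height):
    """Return (source, target) pairs whose warped position stays inside the image."""
    pairs = []
    for kp in keypoints:
        u, v = apply_homography(H, kp)
        if 0 <= u < width and 0 <= v < height:
            pairs.append((kp, (u, v)))
    return pairs


# Hypothetical warp: a pure translation by (+10, -5) pixels.
H = [[1.0, 0.0, 10.0],
     [0.0, 1.0, -5.0],
     [0.0, 0.0, 1.0]]
keypoints = [(20.0, 30.0), (2.0, 3.0), (100.0, 50.0)]
pairs = warp_keypoints(H, keypoints, width=128, height=96)
print(pairs)
```

Keypoints that leave the frame after warping are dropped, since they have no valid target to supervise against.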

If this is right

Spectral-spatial feature discrimination improves robustness in texture-poor surgical environments.
Higher matching accuracy supports more reliable pose estimation for 3D reconstruction.
The approach enables enhanced monocular 3D reconstruction without relying solely on RGB texture.
Consistent metric gains across evaluation settings indicate broader utility for surgical visualization tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Single-camera monocular deployment could become viable if spectral cues replace stereo geometry.
Real-time in-vivo testing on human tissue would reveal whether perfusion changes degrade the observed gains.
The model might combine with other spectral bands or modalities to further enrich feature sets.

Load-bearing premise

The assumption that performance gains observed on ex-vivo organs using synthetic homographic augmentations and dual-camera epipolar constraints will generalize to in-vivo human surgery with real-time constraints, variable tissue perfusion, and single-camera monocular use.

What would settle it

A direct comparison of HyKey against RGB baselines on in-vivo human surgical footage that shows no improvement or lower matching accuracy under monocular conditions.

Original abstract

Purpose: 3D reconstruction in minimally invasive surgery (MIS) enables enhanced surgical guidance through improved visualisation, tool tracking, and augmented reality. However, traditional RGB-based keypoint detection and matching pipelines struggle with surgical challenges, such as poor texture and complex illumination. We investigate whether using snapshot hyperspectral imaging (HSI) can provide improved results on keypoint detection and matching surgical scenes.

Methods: We developed HyKey, a HYperspectral KEYpoint detection and description model made up of a hybrid 3D-2D convolutional neural network that jointly extracts spatial-spectral features from HSI. The model was trained using synthetic homographic augmentation and epipolar geometry constraints on a robotically-acquired dual-camera RGB-HSI laparoscopic dataset of ex-vivo organs with calibrated camera poses. We benchmarked performance against established RGB-based methods, including SuperPoint and ALIKE.

Results: Our HSI-based model outperformed RGB baselines on registered RGB frames, achieving 96.62% mean matching accuracy and 67.18% mean average accuracy at 10 degree on pose estimation, demonstrating consistent improvements across multiple evaluation metrics.

Conclusion: Integrating spectral information from an HSI cube offers a promising approach for robust monocular 3D reconstruction in MIS, addressing limitations of texture-poor surgical environments through enhanced spectral-spatial feature discrimination. Our model and dataset are available at https://github.com/alexsaikia/HyKey-Hyperspectral-Keypoint-Detection

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HyKey gets solid gains on ex-vivo HSI keypoint matching over RGB baselines, but the jump to monocular in-vivo use remains untested.

The main thing to know is that this paper builds a hybrid 3D-2D CNN called HyKey to detect and match keypoints from hyperspectral image cubes in minimally invasive surgery, and it reports clear improvements over SuperPoint and ALIKE on their dataset. The architecture is new in this domain because it jointly processes the spectral and spatial dimensions rather than treating HSI as just extra channels or running separate 2D networks. They trained on a robotically captured dual-camera RGB-HSI set of ex-vivo organs with homographic augmentations and epipolar constraints, then released both the model and data. That release and the direct benchmark numbers (96.62% mean matching accuracy, 67.18% mAA@10° on pose) are the practical contributions worth noting. The work is aimed squarely at the surgical vision community and gives a concrete starting point for anyone trying to leverage spectral information where texture is poor. It deserves peer review because the empirical comparison is there and the clinical motivation is real, even if the experiments stay narrow.

The soft spots are straightforward. Everything is ex-vivo with a calibrated dual-camera rig and synthetic warps; the paper does not show results on live tissue deformation, perfusion shifts, specular highlights, or true single-camera monocular capture. The abstract mentions evaluation on registered RGB frames, which is fine for a controlled test but leaves the monocular real-time claim without direct support. There are also no reported ablations isolating the spectral contribution, no training-split details, and no statistical significance on the gains. Those gaps are fixable but they limit how far the current numbers can be taken.

Overall this is a useful incremental piece for the subfield rather than a broad advance, and a referee would likely ask for the missing generalization checks and component tests.

Referee Report

4 major / 2 minor

Summary. The manuscript presents HyKey, a hyperspectral keypoint detection and matching model using a hybrid 3D-2D CNN for minimally invasive surgery applications. It is trained on a custom robotically-acquired ex-vivo dual-camera RGB-HSI dataset of organs using synthetic homographic augmentations and epipolar constraints from calibrated poses. The model is benchmarked against RGB-based methods such as SuperPoint and ALIKE, reporting a mean matching accuracy of 96.62% and a mean average accuracy of 67.18% at 10 degrees for pose estimation on registered RGB frames. The authors conclude that incorporating spectral information improves robustness in texture-poor surgical scenes and release the model and dataset publicly.

Significance. Should the empirical results prove robust, this contribution highlights the utility of hyperspectral imaging for enhancing keypoint matching and 3D reconstruction in challenging surgical environments. By providing an open dataset and implementation, the work facilitates reproducibility and further exploration of spectral-spatial features in medical computer vision. It offers a concrete step toward addressing limitations of standard RGB pipelines in MIS.

major comments (4)

Results section: The performance metrics (96.62% mean matching accuracy and 67.18% mAA@10°) lack accompanying information on the test set size, variance across trials, or statistical significance tests against the RGB baselines, which is necessary to substantiate the central claim of consistent improvements.
Methods section: Training relies on a dual-camera setup with synthetic homographies and epipolar geometry; however, the paper does not include experiments or discussion on adapting the approach to monocular single-camera scenarios or handling real in-vivo deformations and perfusion variations, which are critical for the claimed applicability to MIS.
Experiments section: No ablation experiments are reported that isolate the effect of the spectral dimension in the hybrid 3D-2D CNN (e.g., comparing to a 2D-only variant), making it unclear whether the performance gains stem specifically from hyperspectral data rather than other architectural or training choices.
Evaluation protocol: The HSI model is evaluated on 'registered RGB frames'; the methods section should explicitly describe the input processing pipeline for this comparison to ensure the benchmark is fair and the model is not inadvertently using HSI-specific information during testing.
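
To make the matching metric in these comments concrete, here is a minimal sketch of mean matching accuracy under one common definition: the fraction of predicted matches that land within a pixel threshold of their ground-truth location, averaged over image pairs. The 3-pixel threshold, the data layout, and the function names are assumptions; the paper may use a different threshold or average over several thresholds.

```python
import math


def matching_accuracy(matches, threshold_px):
    """Fraction of matches within threshold_px of ground truth.

    matches: list of (predicted_xy, ground_truth_xy) tuples for one image pair.
    """
    if not matches:
        return 0.0
    correct = sum(1 for pred, gt in matches if math.dist(pred, gt) <= threshold_px)
    return correct / len(matches)


def mean_matching_accuracy(image_pairs, threshold_px=3.0):
    """Average the per-pair matching accuracy over all image pairs."""
    per_pair = [matching_accuracy(m, threshold_px) for m in image_pairs]
    return sum(per_pair) / len(per_pair)


# Toy data (made up): two image pairs with two matches each.
pair_a = [((10.0, 10.0), (10.0, 12.0)), ((50.0, 40.0), (58.0, 40.0))]
pair_b = [((5.0, 5.0), (5.0, 5.0)), ((7.0, 9.0), (8.0, 9.0))]
mma = mean_matching_accuracy([pair_a, pair_b], threshold_px=3.0)
print(mma)  # 0.75
```

Reporting the per-pair values alongside the mean would also give the variance figures that the first major comment asks for.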

minor comments (2)

Abstract: The term 'mean average accuracy at 10 degree' should be expanded as mean average accuracy (mAA) at a 10° threshold for pose estimation to improve clarity.
Conclusion: The phrasing on applicability to 'robust monocular 3D reconstruction in MIS' should be qualified to align with the ex-vivo dual-camera experimental scope.
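
For readers unfamiliar with the mAA@10° notation in the first minor comment, the sketch below computes it under one common definition from image-matching benchmarks: the share of image pairs whose pose error falls at or below each integer threshold from 1° to 10°, averaged over those thresholds. Whether the paper uses exactly this threshold grid is an assumption, and the error values are invented for the example.

```python
def mean_average_accuracy(pose_errors_deg, max_threshold_deg=10):
    """Average, over thresholds 1..max_threshold_deg, of the fraction of
    pose errors at or below each threshold (assumed definition of mAA)."""
    accuracies = []
    for threshold in range(1, max_threshold_deg + 1):
        hits = sum(1 for err in pose_errors_deg if err <= threshold)
        accuracies.append(hits / len(pose_errors_deg))
    return sum(accuracies) / len(accuracies)


# Made-up angular pose errors (degrees) for four image pairs.
errors = [0.5, 4.0, 12.0, 9.5]
maa = mean_average_accuracy(errors)
print(round(maa, 4))  # 0.45
```

Because the score integrates over thresholds, a pair with a 0.5° error contributes at every threshold while a pair at 9.5° contributes only at the last one, so mAA rewards precise poses more than a single cut-off at 10° would.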

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating the changes we will make to strengthen the manuscript.

Referee: Results section: The performance metrics (96.62% mean matching accuracy and 67.18% mAA@10°) lack accompanying information on the test set size, variance across trials, or statistical significance tests against the RGB baselines, which is necessary to substantiate the central claim of consistent improvements.

Authors: We agree that these details are necessary to rigorously support our claims. In the revised manuscript, we will report the exact size of the test set (number of images and image pairs), include standard deviations or variance measures across multiple evaluation runs or data splits, and add statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing HyKey against the RGB baselines. These updates will be incorporated into the Results section.

revision: yes

Referee: Methods section: Training relies on a dual-camera setup with synthetic homographies and epipolar geometry; however, the paper does not include experiments or discussion on adapting the approach to monocular single-camera scenarios or handling real in-vivo deformations and perfusion variations, which are critical for the claimed applicability to MIS.

Authors: Our current work uses an ex-vivo dual-camera dataset, and the model processes individual HSI cubes, making it inherently suitable for monocular deployment at inference. However, we lack in-vivo data and thus cannot conduct new experiments on real deformations or perfusion. In the revision, we will expand the Discussion to describe how the trained model can be applied in monocular single-camera pipelines (using the learned features without stereo input) and to explicitly discuss the limitations of ex-vivo data regarding tissue deformation and perfusion, along with future work directions.

revision: partial

Referee: Experiments section: No ablation experiments are reported that isolate the effect of the spectral dimension in the hybrid 3D-2D CNN (e.g., comparing to a 2D-only variant), making it unclear whether the performance gains stem specifically from hyperspectral data rather than other architectural or training choices.

Authors: We will add a new ablation study to the Experiments section. This will include training and evaluating a 2D-only variant of the network (by converting 3D convolutions to 2D and handling spectral bands separately) under the same training protocol and dataset splits. Direct comparison of this variant against the full hybrid model will isolate the contribution of the spectral dimension.

revision: yes

Referee: Evaluation protocol: The HSI model is evaluated on 'registered RGB frames'; the methods section should explicitly describe the input processing pipeline for this comparison to ensure the benchmark is fair and the model is not inadvertently using HSI-specific information during testing.

Authors: We will revise the Methods and Experiments sections to provide a clear description of the evaluation pipeline. The registered RGB frames refer to the RGB images aligned to the HSI cubes using the calibrated dual-camera poses; this registration is used only to establish ground-truth correspondences and poses for metric computation. The HyKey model receives the full HSI cube as input during testing, while RGB baselines receive only the corresponding RGB channels from the same frames. No additional HSI information is provided to the baselines, and the HSI model does not access RGB-only data in its forward pass. A flowchart illustrating the distinct input paths will be added for clarity.

revision: yes

Circularity Check

0 steps flagged

Empirical ML evaluation with independent training constraints and external baselines

The paper describes training a hybrid 3D-2D CNN on ex-vivo HSI data using standard synthetic homographic augmentations and epipolar geometry from a calibrated dual-camera rig, then reports matching accuracy and pose-estimation mAA against independent RGB baselines (SuperPoint, ALIKE). No derivation, equation, or 'prediction' reduces to its own fitted inputs by construction; the reported metrics are measured on held-out registered frames and are not statistically forced by the training losses. Self-citations, if present, are not load-bearing for the central empirical claim. The work is self-contained against external benchmarks.
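
The epipolar supervision mentioned in this rationale can be illustrated with a small check: given a known relative pose between two views, a correct correspondence should satisfy x2^T E x1 = 0, where E is the essential matrix. The sketch below assumes the convention X2 = R X1 + t and works in normalised image coordinates (camera intrinsics already removed). It shows the geometric constraint only and is not the paper's loss function.

```python
def skew(t):
    """Cross-product matrix [t]_x for a 3-vector t."""
    tx, ty, tz = t
    return [[0.0, -tz, ty],
            [tz, 0.0, -tx],
            [-ty, tx, 0.0]]


def matmul(A, B):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def essential_matrix(R, t):
    """E = [t]_x R, assuming the second view satisfies X2 = R X1 + t."""
    return matmul(skew(t), R)


def epipolar_residual(E, x1, x2):
    """Return x2^T E x1 for normalised image points x1, x2 given as (x, y)."""
    p1 = (x1[0], x1[1], 1.0)
    p2 = (x2[0], x2[1], 1.0)
    Ep1 = [sum(E[i][k] * p1[k] for k in range(3)) for i in range(3)]
    return sum(p2[i] * Ep1[i] for i in range(3))


# Hypothetical pose: no rotation, unit translation along the x axis.
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = (1.0, 0.0, 0.0)
E = essential_matrix(R, t)

# 3D point (0.5, 0.2, 2.0) seen in view 1, and (1.5, 0.2, 2.0) in view 2.
good = epipolar_residual(E, (0.25, 0.1), (0.75, 0.1))
bad = epipolar_residual(E, (0.25, 0.1), (0.75, 0.3))
print(good, bad)
```

A true correspondence gives a residual near zero, while a mismatched point gives a clearly non-zero value, which is what lets calibrated poses act as a training signal without manual keypoint labels.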

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of supervised CNN training and the representativeness of the ex-vivo dataset; no new physical entities or ad-hoc constants are introduced beyond typical deep-learning hyperparameters.

axioms (1)

domain assumption: Standard supervised learning assumptions hold for the hybrid CNN on the provided dataset.
Invoked implicitly in the training description using synthetic augmentations and epipolar constraints.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages

[1] Lu, G., Fei, B.: Medical hyperspectral imaging: a review. Journal of biomedical optics 19(1) (2014)
[2] Clancy, N.T., Jones, G., Maier-Hein, L., Elson, D.S., Stoyanov, D.: Surgical spectral imaging. Medical image analysis 63, 101699 (2020)
[3] Ali, H.M., Xiao, Y., Kersten-Oertel, M.: Surgical hyperspectral imaging: a systematic review. Computer Assisted Surgery 30(1), 2546819 (2025)
[4] Ullman, S.: The interpretation of structure from motion. Proceedings of the Royal Society of London. Series B. Biological Sciences 203(1153), 405–426 (1979)
[5] Tomasi, C., Kanade, T.: Shape and motion from image streams under orthography: a factorization method. International journal of computer vision 9(2), 137–154 (1992)
[6] Durrant-Whyte, H., Bailey, T.: Simultaneous localization and mapping: part i. IEEE robotics & automation magazine 13(2), 99–110 (2006)
[7] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International journal of computer vision 60(2), 91–110 (2004)
[8] Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (surf). Computer vision and image understanding 110(3), 346–359 (2008)
[9] Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: Orb: An efficient alternative to sift or surf. In: 2011 International Conference on Computer Vision, pp. 2564–2571 (2011). IEEE
[10] DeTone, D., Malisiewicz, T., Rabinovich, A.: Superpoint: Self-supervised interest point detection and description. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 224–236 (2018)
[11] Zhao, X., Wu, X., Miao, J., Chen, W., Chen, P.C., Li, Z.: Alike: Accurate and lightweight keypoint detection and descriptor extraction. IEEE Transactions on Multimedia 25, 3101–3112 (2022)
[12] Zhao, X., Wu, X., Chen, W., Chen, P.C., Xu, Q., Li, Z.: Aliked: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE Transactions on Instrumentation and Measurement 72, 1–16 (2023)
[13] Ma, T., Xing, Y., Gong, D., Lin, Z., Li, Y., Jiang, J., He, S.: A deep learning-based hyperspectral keypoint representation method and its application for 3d reconstruction. IEEE Access 10, 85266–85277 (2022)
[14] Roy, S.K., Krishna, G., Dubey, S.R., Chaudhuri, B.B.: Hybridsn: Exploring 3-d–2-d cnn feature hierarchy for hyperspectral image classification. IEEE geoscience and remote sensing letters 17(2), 277–281 (2019)
[15] Noshiri, N., Beck, M.A., Bidinosti, C.P., Henry, C.J.: A comprehensive review of 3d convolutional neural network-based classification techniques of diseased and defective crops using non-uav-based hyperspectral images. Smart Agricultural Technology 5, 100316 (2023)
[16] Saikia, A., Di Vece, C., Bonilla, S., He, C., Magbagbeola, M., Mennillo, L., Czempiel, T., Bano, S., Stoyanov, D.: Robotic arm platform for multi-view image acquisition and 3d reconstruction in minimally invasive surgery. IEEE Robotics and Automation Letters (2025)
[17] Kumar, V., Singh, R.S., Rambabu, M., Dua, Y.: Deep learning for hyperspectral image classification: A survey. Computer Science Review 53, 100658 (2024)
[18] Tejasree, G., Agilandeeswari, L.: An extensive review of hyperspectral image classification and prediction: techniques and challenges. Multimedia Tools and Applications 83(34), 80941–81038 (2024)
[19] Balntas, V., Lenc, K., Vedaldi, A., Mikolajczyk, K.: Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5173–5182 (2017)
[20] Barath, D., Noskova, J., Ivashechkin, M., Matas, J.: Magsac++, a fast, reliable and accurate robust estimator. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1304–1312 (2020)
