Multi-scale Cell Instance Segmentation with Keypoint Graph based Bounding Boxes

Bo Liu; Daniel J. Hoeppner; Dimitris N. Metaxas; Hui Qu; Jingru Yi; Pengxiang Wu; Qiaoying Huang

arxiv: 1907.09140 · v2 · pith:Y3FPT5V5new · submitted 2019-07-22 · 💻 cs.CV

Jingru Yi, Pengxiang Wu, Qiaoying Huang, Hui Qu, Bo Liu, Daniel J. Hoeppner, Dimitris N. Metaxas

Pith reviewed 2026-05-24 18:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords cell instance segmentation · keypoint detection · bounding box · graph grouping · touching cells · instance segmentation

The pith

A keypoint graph groups five detected points per cell into bounding boxes that then guide instance segmentation inside those boxes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that cell instance segmentation improves when bounding boxes are derived from grouped keypoints rather than from anchor-based detectors or direct pixel labeling. Most prior methods either segment without boxes and therefore merge touching cells or rely on anchors that suffer class imbalance. The new pipeline first locates five pre-defined cell points, connects them through a graph to assign points to distinct instances, extracts the resulting boxes, and runs segmentation only on the cropped feature maps. This yields higher accuracy on two cell datasets that differ in shape and crowding. A sympathetic reader would care because the approach directly targets the failure mode of merged instances that limits many biomedical imaging pipelines.

Core claim

We first detect the five pre-defined points of a cell via keypoints detection. Then we group these points according to a keypoint graph and subsequently extract the bounding box for each cell. Finally, cell segmentation is performed on feature maps within the bounding boxes, producing superior results compared with other instance segmentation techniques on two cell datasets.
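The grouping step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the five points are a cell's center plus the four corners of its bounding box, and that each detected corner carries an offset vector pointing toward its cell's center (loosely mirroring the group offset map g(x) named in the Figure 1 caption). The function name `group_boxes`, the `max_dist` tolerance, and the data layout are hypothetical.

```python
from math import hypot

# Assumed keypoint layout: one center plus four box corners per cell.
CORNER_TYPES = ("tl", "tr", "bl", "br")


def group_boxes(centers, corners, max_dist=5.0):
    """Group corner keypoints around center keypoints and return boxes.

    centers: list of (x, y) center detections.
    corners: dict mapping a corner type to a list of (x, y, dx, dy),
             where (dx, dy) is the predicted offset toward the cell center.
    Returns a list of (x_min, y_min, x_max, y_max), one per complete group.
    """
    boxes = []
    for cx, cy in centers:
        members = []
        for kind in CORNER_TYPES:
            best, best_d = None, max_dist
            for x, y, dx, dy in corners.get(kind, []):
                # Distance between the corner's voted center and this center.
                d = hypot(x + dx - cx, y + dy - cy)
                if d <= best_d:
                    best, best_d = (x, y), d
            if best is None:
                break  # a corner type is missing: skip this incomplete cell
            members.append(best)
        else:
            xs = [p[0] for p in members]
            ys = [p[1] for p in members]
            boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return boxes
```

In this sketch two touching cells share an edge, yet their corners vote for different centers, so each cell keeps its own box. A center without a full set of four corners is dropped rather than guessed.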

What carries the argument

The keypoint graph that groups the five detected points into per-cell bounding boxes before segmentation occurs inside those boxes.

If this is right

Touching cells are separated more reliably than by methods that segment without boxes.
Class imbalance problems of anchor-based detectors are avoided.
The same pipeline works across cell datasets that have visibly different shapes.
Segmentation computation is restricted to the interior of the extracted boxes.
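The last point, restricting segmentation to the box interior, amounts to cropping the feature map before the mask head sees it. A minimal sketch under stated assumptions: the feature map is a plain 2-D list, the box uses inclusive pixel coordinates, and `crop_to_box` is a hypothetical helper, not a function from the released code.

```python
def crop_to_box(feature_map, box):
    """Return the part of a 2-D feature map that lies inside a box.

    feature_map: list of rows, indexed as feature_map[y][x].
    box: (x_min, y_min, x_max, y_max) with inclusive coordinates.
    Coordinates outside the map are clamped to its border.
    """
    x_min, y_min, x_max, y_max = box
    height, width = len(feature_map), len(feature_map[0])
    x0, y0 = max(0, x_min), max(0, y_min)
    x1, y1 = min(width - 1, x_max), min(height - 1, y_max)
    return [row[x0:x1 + 1] for row in feature_map[y0:y1 + 1]]
```

Only the cropped region would then be passed to the per-cell segmentation step, so the work per instance scales with the box area rather than the image area.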

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same five-point plus graph construction could be tested on non-cell objects that have consistent landmark positions.
If the graph grouping step is made differentiable, end-to-end training might further reduce errors on crowded scenes.
The bounding-box restriction might also cut memory use in high-resolution whole-slide images.

Load-bearing premise

Five pre-defined keypoints can be detected reliably on every cell and the graph will group them correctly even when cells touch or overlap.

What would settle it

On a held-out set of images containing many overlapping cells, the method produces more merged instances than a strong anchor-based box segmentation baseline.
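One way such a test could count merged instances is sketched below. The definition is an assumption made for illustration: a predicted instance is called merged when it covers at least `min_overlap` of the area of two or more ground-truth instances. The function name and threshold are hypothetical and do not come from the paper.

```python
from collections import defaultdict


def count_merged(pred, gt, min_overlap=0.5):
    """Count predicted instances that swallow two or more ground-truth cells.

    pred, gt: 2-D lists of integer instance ids, with 0 as background.
    """
    gt_area = defaultdict(int)
    overlap = defaultdict(int)  # (pred_id, gt_id) -> shared pixel count
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if g:
                gt_area[g] += 1
                if p:
                    overlap[(p, g)] += 1
    covered = defaultdict(int)  # pred_id -> number of GT cells it covers
    for (p, g), n in overlap.items():
        if n / gt_area[g] >= min_overlap:
            covered[p] += 1
    return sum(1 for n in covered.values() if n >= 2)
```

Running the same counter on the outputs of both methods over the held-out images would give the comparison described above.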

Figures

Figures reproduced from arXiv: 1907.09140 by Bo Liu, Daniel J. Hoeppner, Dimitris N. Metaxas, Hui Qu, Jingru Yi, Pengxiang Wu, Qiaoying Huang.

**Figure 1.** Multi-scale cell instance segmentation framework. We use a ResNet-50 Conv1-4 [4] as the backbone network. The framework contains two branches: (a) keypoints detection branch and (b) individual cell segmentation branch. The keypoint module outputs the heatmap h(x), single offset map s(x), and group offset map g(x) that will be used for bounding box generation. x represents a 2-D position in the map, y is a…

**Figure 2.** Qualitative cell instance segmentation results on neural cells (top two rows) and cell nuclei (bottom two rows). We compare our instance segmentation method with DCAN [1], CosineEmbedding [9] and Mask R-CNN [3]. The white dotted circle shows an example where our method separates the touching cells.

**Figure 3.** Comparison between individual cell segmentation from feature map s1 (seg s1) and from the individual cell segmentation branch (seg branch). The left four columns are neural cells. The right four columns are cell nuclei. The yellow arrows point to the over-segmentations produced by seg s1.

**Figure 4.** Visualization of heatmap predictions and keypoint groups overlaid on the input images. We show the heatmaps at four scales si, i = 1, 2, 3, 4. The circles illustrate an example in which a large cell is unrecognized at scale s1 but is captured at scale s4.

Original abstract

Most existing methods handle cell instance segmentation problems directly without relying on additional detection boxes. These methods generally fails to separate touching cells due to the lack of global understanding of the objects. In contrast, box-based instance segmentation solves this problem by combining object detection with segmentation. However, existing methods typically utilize anchor box-based detectors, which would lead to inferior instance segmentation performance due to the class imbalance issue. In this paper, we propose a new box-based cell instance segmentation method. In particular, we first detect the five pre-defined points of a cell via keypoints detection. Then we group these points according to a keypoint graph and subsequently extract the bounding box for each cell. Finally, cell segmentation is performed on feature maps within the bounding boxes. We validate our method on two cell datasets with distinct object shapes, and empirically demonstrate the superiority of our method compared to other instance segmentation techniques. Code is available at: https://github.com/yijingru/KG_Instance_Segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's keypoint-graph pipeline for generating boxes before cell segmentation is a concrete alternative to anchors or direct masks, but the abstract supplies no metrics or ablations to support the superiority claim.

The main thing to know is that this paper describes a three-stage box-based pipeline for cell instance segmentation: detect five pre-defined keypoints per cell, group them via a keypoint graph to form bounding boxes, then run segmentation inside those boxes. The intent is to handle touching cells better than direct segmentation methods and avoid class imbalance from anchor-box detectors. It releases code, which is straightforward to check.

Referee Report

2 major / 2 minor

Summary. The paper proposes a box-based cell instance segmentation pipeline: detect five pre-defined keypoints per cell, group them with a keypoint graph to extract per-cell bounding boxes, then run segmentation inside those boxes. It argues this avoids the touching-cell failures of direct instance segmentation and the class-imbalance problems of anchor-based detectors, and reports empirical superiority on two cell datasets with different shapes.

Significance. If the quantitative claims hold, the keypoint-graph approach could provide a practical way to obtain instance boxes without anchors, potentially improving separation of touching cells in dense biomedical images. Code release is a positive factor for reproducibility.

major comments (2)

[Abstract] Abstract: the central claim of superiority over other instance segmentation techniques is stated without any quantitative metrics (AP, Dice, IoU, etc.), error bars, ablation results, or breakdown by touching-cell subsets, making it impossible to evaluate the asserted performance gains.
[Method] Pipeline description (method section): the load-bearing assumptions that five keypoints are reliably detected on every cell and that the keypoint graph correctly clusters points even when cells touch or overlap are presented without per-stage recall/precision numbers or failure analysis on dense/overlapping regions; any systematic error here would directly produce incorrect or merged boxes and negate the claimed advantage.
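For reference, two of the metrics named in the first comment can be computed per mask pair as sketched below. This is a generic illustration on binary masks stored as 2-D lists; it is not the paper's evaluation protocol, and returning 1.0 for two empty masks is a convention assumed here.

```python
def iou_and_dice(mask_a, mask_b):
    """Return (IoU, Dice) for two binary masks given as 2-D lists of 0/1."""
    inter = area_a = area_b = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            area_a += 1 if a else 0
            area_b += 1 if b else 0
            inter += 1 if (a and b) else 0
    union = area_a + area_b - inter
    if union == 0:
        return 1.0, 1.0  # assumed convention: two empty masks match perfectly
    return inter / union, 2 * inter / (area_a + area_b)
```

Reporting these per instance, and separately on touching-cell subsets, is the kind of breakdown the comment asks for.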

minor comments (2)

[Abstract] Abstract: grammatical error ('These methods generally fails' should be 'fail').
[Title] Title refers to 'Multi-scale' but the abstract and pipeline description do not clarify where or how multi-scale processing is applied.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to strengthen the presentation of our results and intermediate validation. We address each major comment below and will incorporate revisions accordingly.

Referee: [Abstract] Abstract: the central claim of superiority over other instance segmentation techniques is stated without any quantitative metrics (AP, Dice, IoU, etc.), error bars, ablation results, or breakdown by touching-cell subsets, making it impossible to evaluate the asserted performance gains.

Authors: We agree that the abstract would be strengthened by including quantitative support for the superiority claims. The current abstract emphasizes the methodological novelty and high-level empirical demonstration but omits specific numbers. In the revised version we will add key metrics (AP, Dice, IoU) with comparisons to baselines, note the presence of error bars or standard deviations from our experiments, and briefly reference the touching-cell performance gains shown in the main text and supplementary material.

Revision: yes

Referee: [Method] Pipeline description (method section): the load-bearing assumptions that five keypoints are reliably detected on every cell and that the keypoint graph correctly clusters points even when cells touch or overlap are presented without per-stage recall/precision numbers or failure analysis on dense/overlapping regions; any systematic error here would directly produce incorrect or merged boxes and negate the claimed advantage.

Authors: The referee correctly identifies that the reliability of the two core stages is central to the approach. While the manuscript reports strong end-to-end instance segmentation results on two datasets, it does not isolate recall/precision for keypoint detection or graph clustering, nor does it provide a dedicated failure analysis on dense/overlapping regions. We will add these per-stage metrics and a short failure-case discussion in the revised method section to directly address potential systematic errors.

Revision: yes

Circularity Check

0 steps flagged

Empirical pipeline with no self-referential derivations

The paper proposes a sequential computer-vision pipeline (keypoint detection of five fixed points, graph-based grouping, box extraction, then mask prediction inside boxes) and validates it empirically on two external cell datasets. No equations, fitted parameters, or self-citations are described that would reduce any reported performance metric to a definition or input by construction. The central claims rest on standard detection and segmentation stages whose correctness is assessed by external benchmarks rather than internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The pipeline implicitly assumes reliable keypoint detection and graph connectivity rules, but these are not quantified or justified in the provided text.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages

[1] Chen, H., Qi, X., Yu, L., Heng, P.A.: Dcan: deep contour-aware networks for accurate gland segmentation. In: CVPR. pp. 2487–2496 (2016)

[2] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. IJCV 88(2), 303–338 (2010)

[3] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: ICCV. pp. 2961–2969 (2017)

[4] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)

[5] Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. In: ECCV. pp. 734–750 (2018)

[6] Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. In: CVPR. pp. 2359–2367 (2017)

[7] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV. pp. 483–499. Springer (2016)

[8] Papandreou, G., Zhu, T., Chen, L.C., Gidaris, S., Tompson, J., Murphy, K.: Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In: ECCV. pp. 269–286 (2018)

[9] Payer, C., Štern, D., Neff, T., Bischof, H., Urschler, M.: Instance segmentation and tracking with cosine embeddings and recurrent hourglass networks. In: MICCAI. pp. 3–11. Springer (2018)

[10] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)

[11] Schmidt, U., Weigert, M., Broaddus, C., Myers, G.: Cell detection with star-convex polygons. In: MICCAI. pp. 265–273. Springer (2018)

[12] Yi, J., Wu, P., Jiang, M., Huang, Q., Hoeppner, D.J., Metaxas, D.N.: Attentive neural cell instance segmentation. Medical image analysis (2019)
