SkeletonNet: Shape Pixel to Skeleton Pixel
Pith reviewed 2026-05-25 10:35 UTC · model grok-4.3
The pith
A U-Net with an HED-style decoder extracts skeleton pixels from shape pixels of 89 objects by fusing four side layers through a dilation convolution and reaches 0.77 F1 on test data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We use a U-Net model with an encoder-decoder structure. Unlike the plain decoder of the traditional U-Net, our decoder follows the HED (holistically-nested edge detection) architecture: it introduces four side layers and fuses them through one dilated convolutional layer to connect broken links in the skeleton. The proposed architecture achieves an F1 score of 0.77 on the test data.
What carries the argument
The HED-style decoder that introduces four side layers fused through a single dilation convolutional layer to reconnect broken skeleton segments.
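To make the mechanism concrete, here is a minimal 1D sketch of why fusing side outputs through a dilated kernel can respond inside a gap that an undilated kernel of the same size cannot span. Everything in it is an assumption made for illustration: the function names, the equal fusion weights, the hand-picked kernel, and the toy side outputs are hypothetical and are not taken from the paper, whose layers are 2D and learned.

```python
# Toy 1D illustration only: the weights, sizes, and fusion rule below are
# assumptions, not values taken from the paper.

def fuse_side_outputs(side_outputs, weights):
    """Per-position weighted sum of equally long side-output rows."""
    length = len(side_outputs[0])
    return [
        sum(w * row[i] for w, row in zip(weights, side_outputs))
        for i in range(length)
    ]


def dilated_conv1d(signal, kernel, dilation):
    """Zero-padded, same-length 1D cross-correlation with a dilation factor."""
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - half) * dilation  # taps are spaced `dilation` apart
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilation):
    """Span, in pixels, covered by one dilated kernel."""
    return (kernel_size - 1) * dilation + 1


# Four hypothetical side outputs for one skeleton row with a gap at indices 3 to 5.
side_outputs = [
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1, 1],
]
fused = fuse_side_outputs(side_outputs, [0.25, 0.25, 0.25, 0.25])
kernel = [0.5, 0.0, 0.5]  # hand-picked: averages the two outer taps

plain = dilated_conv1d(fused, kernel, dilation=1)
dilated = dilated_conv1d(fused, kernel, dilation=2)
print(plain[4], dilated[4])  # response at the centre of the gap
```

With dilation 1 the three-tap kernel covers 3 pixels and sees only gap pixels at index 4, so its response there is zero. With dilation 2 it covers 5 pixels, reaches the skeleton on both sides, and produces a nonzero response. Whether the trained network behaves this way on real data is exactly what the report says has not been demonstrated.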
If this is right
- The architecture converts shape pixels into skeleton pixels for the 89 objects in the first track of the challenge.
- The model reaches an F1 score of 0.77 when tested on the competition test data.
- The side-layer fusion step is intended to restore connectivity in the extracted skeletons.
- The same encoder-decoder design can be trained end-to-end on the supplied dataset for this pixel-to-pixel task.
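For reference, the F1 score mentioned above can be computed pixel-wise as sketched below. This uses the standard definition, F1 = 2TP / (2TP + FP + FN). The challenge's exact evaluation protocol (thresholding, per-image versus pooled averaging) is not stated on this page, so the function name `pixel_f1` and the zero-positives convention are assumptions.

```python
def pixel_f1(pred, truth):
    """Pixel-wise F1 between two binary masks given as lists of 0/1 rows."""
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # skeleton pixel correctly predicted
            elif p and not t:
                fp += 1      # predicted skeleton where there is none
            elif t and not p:
                fn += 1      # missed skeleton pixel
    denom = 2 * tp + fp + fn
    # Assumed convention: score 0.0 when neither mask has any positive pixel.
    return 2 * tp / denom if denom else 0.0


# Made-up masks: 4 true skeleton pixels, 3 found, 1 false positive.
truth = [
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
pred = [
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
print(pixel_f1(pred, truth))
```

Because skeletons are one pixel wide, a prediction shifted by a single pixel counts as both a false positive and a false negative under this strict definition, which is one reason a lone F1 value is hard to interpret.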
Where Pith is reading between the lines
- The same decoder modification might be tried on other dense-prediction problems such as edge detection or thin-structure segmentation.
- Running the network on object classes outside the original 89 would test whether the connectivity benefit transfers.
- An ablation that removes the dilation fusion layer and measures the change in F1 would quantify how much the side-layer step contributes.
Load-bearing premise
Adding four side layers fused via a single dilation convolutional layer will reliably connect broken skeleton links on the provided competition dataset.
What would settle it
Evaluating the trained model on a new collection of shape images that contain known gaps in the ground-truth skeletons and checking whether those gaps remain unconnected in the output.
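One simple proxy for this check is to count connected components in the predicted skeleton: a gap that stays open splits the skeleton into more pieces. The sketch below assumes 8-connectivity and binary masks stored as lists of 0/1 rows; the function name `count_components` is hypothetical, and component counting is a stand-in for, not a replacement of, a gap-by-gap comparison.

```python
from collections import deque


def count_components(mask):
    """Count 8-connected components of nonzero pixels in a binary mask."""
    rows = len(mask)
    cols = len(mask[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            count += 1                      # found an unvisited component
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:                    # breadth-first flood fill
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# Made-up outputs: one with an open gap, one where the gap is bridged.
broken = [[1, 1, 0, 1, 1]]
bridged = [[1, 1, 1, 1, 1]]
print(count_components(broken), count_components(bridged))
```

If the number of components in the output matches the number in a gap-free reference skeleton, the gaps were bridged; a higher count indicates links that remain broken.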
Original abstract
Deep Learning for Geometric Shape Understating has organized a challenge for extracting different kinds of skeletons from the images of different objects. This competition is organized in association with CVPR 2019. There are three different tracks of this competition. The present manuscript describes the method used to train the model for the dataset provided in the first track. The first track aims to extract skeleton pixels from the shape pixels of 89 different objects. For the purpose of extracting the skeleton, a U-net model which is comprised of an encoder-decoder structure has been used. In our proposed architecture, unlike the plain decoder in the traditional Unet, we have designed the decoder in the format of HED architecture, wherein we have introduced 4 side layers and fused them to one dilation convolutional layer to connect the broken links of the skeleton. Our proposed architecture achieved the F1 score of 0.77 on test data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes a U-Net variant for the first track of the CVPR 2019 Deep Learning for Geometric Shape Understanding challenge. The architecture modifies the decoder to an HED-style structure with four side layers fused through a single dilation convolutional layer, with the goal of connecting broken skeleton links; the reported result is an F1 score of 0.77 on the held-out test set for skeleton pixel extraction from shape images of 89 object classes.
Significance. A validated improvement in skeleton connectivity via side-layer fusion could be useful for shape analysis tasks, but the single scalar F1 score on one competition dataset provides limited insight into broader applicability or the contribution of the claimed architectural element.
major comments (2)
- [Abstract] Abstract: the assertion that the four side layers fused via one dilation convolutional layer connect broken skeleton links is presented without any ablation study, baseline comparison to a plain U-Net decoder, or quantitative justification. Because the central claim attributes the F1=0.77 result to this design choice, the absence of controls means the performance cannot be tied to the proposed modification.
- [Abstract] Abstract: no error bars, cross-validation details, training/validation split information, or comparison against other skeletonization methods are supplied, so the reported F1 score stands as an isolated number whose reliability and improvement over standard approaches cannot be assessed.
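As an illustration of the kind of error bar the second comment asks for, a percentile bootstrap over per-image F1 scores is one common option. The sketch below is an assumption about how this could be done, not something the manuscript reports: the function name `bootstrap_ci` is hypothetical and the example scores are made up purely to exercise the code.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n  # resample with replacement
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


# Made-up per-image F1 scores, for illustration only (not data from the paper).
example_scores = [0.70, 0.82, 0.75, 0.79, 0.68, 0.81, 0.77, 0.74]
low, high = bootstrap_ci(example_scores)
print(low, high)
```

A bootstrap of this kind only needs per-image scores on a labelled split, so it could be run on an internal validation set even when the official test labels are held by the organizers.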
minor comments (2)
- [Abstract] Typo: 'Understating' should read 'Understanding'.
- The manuscript provides no information on loss function, optimizer, data augmentation, or implementation framework, all of which are required for reproducibility of the reported F1 score.
Simulated Author's Rebuttal
We appreciate the referee's comments highlighting the need for stronger justification of the architectural choices and additional experimental details. We address each major comment below.
Point-by-point responses
- Referee: [Abstract] Abstract: the assertion that the four side layers fused via one dilation convolutional layer connect broken skeleton links is presented without any ablation study, baseline comparison to a plain U-Net decoder, or quantitative justification. Because the central claim attributes the F1=0.77 result to this design choice, the absence of controls means the performance cannot be tied to the proposed modification.
  Authors: We agree that the manuscript does not contain an ablation study or direct baseline comparison to a standard U-Net decoder, which limits the ability to quantitatively attribute the result to the side-layer fusion and dilation convolution. The design draws from the HED architecture's demonstrated utility in multi-scale feature extraction for edge-like tasks, with the dilation layer added specifically to address skeleton connectivity. To strengthen the claim, we will incorporate a baseline U-Net comparison in the revised manuscript.
  Revision: yes
- Referee: [Abstract] Abstract: no error bars, cross-validation details, training/validation split information, or comparison against other skeletonization methods are supplied, so the reported F1 score stands as an isolated number whose reliability and improvement over standard approaches cannot be assessed.
  Authors: The F1 score of 0.77 is the official result on the competition's fixed held-out test set for the 89-class shape skeletonization track. Internal training/validation splits were performed but omitted due to the concise competition-report format; no cross-validation or error bars were computed because evaluation is determined by the organizers. We will expand the methods section in revision to detail the training procedure and any internal splits. Comparisons to non-deep-learning skeletonization methods (e.g., morphological thinning) can be added for context, though the challenge emphasizes learned approaches.
  Revision: yes
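For context on the morphological thinning baseline mentioned in the response, the sketch below implements the classical Zhang-Suen thinning algorithm in plain Python. The rebuttal does not name a specific thinning method, so choosing Zhang-Suen here is an assumption, as are the function name and the list-of-rows mask format (values must be the integers 0 and 1).

```python
def zhang_suen_thinning(mask):
    """Thin a binary mask (list of 0/1 integer rows) with Zhang-Suen thinning."""
    img = [list(row) for row in mask]  # work on a copy
    rows, cols = len(img), len(img[0])
    # Neighbours P2..P9, clockwise starting from the pixel directly above.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1),
               (1, 0), (1, -1), (0, -1), (-1, -1)]

    def neighbours(r, c):
        # Pixels outside the image are treated as background.
        return [
            img[r + dr][c + dc]
            if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
            for dr, dc in offsets
        ]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of each pass
            to_delete = []
            for r in range(rows):
                for c in range(cols):
                    if not img[r][c]:
                        continue
                    p = neighbours(r, c)
                    p2, p4, p6, p8 = p[0], p[2], p[4], p[6]
                    b = sum(p)  # number of foreground neighbours
                    # Number of 0 -> 1 transitions around the pixel.
                    a = sum(1 for i in range(8)
                            if p[i] == 0 and p[(i + 1) % 8] == 1)
                    if step == 0:
                        cond = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        cond = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= b <= 6 and a == 1 and cond:
                        to_delete.append((r, c))
            # Deletions are applied together after each sub-iteration.
            for r, c in to_delete:
                img[r][c] = 0
            if to_delete:
                changed = True
    return img


# Made-up shape: a 3x5 filled block inside a 5x7 image.
block = [[1 if 1 <= r <= 3 and 1 <= c <= 5 else 0 for c in range(7)]
         for r in range(5)]
for row in zhang_suen_thinning(block):
    print("".join("#" if v else "." for v in row))
```

Thinning of this kind needs no training data, which makes it a cheap reference point; it also tends to shorten branches near their ends, so its output will not match annotated skeletons exactly.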
Circularity Check
No circularity; empirical F1 on held-out test data
The manuscript describes a U-Net variant with HED-style side layers for skeleton extraction and reports an F1 score of 0.77 on competition test data. No derivation chain, equations, fitted-parameter predictions, or self-citation load-bearing steps exist. The performance number is an external held-out metric, not a quantity defined by construction from the model's own inputs or prior self-citations. The paper is self-contained against external benchmarks.