
arxiv: 1907.08895 · v1 · pith:7DTM7NPSnew · submitted 2019-07-21 · 💻 cs.CV · eess.IV

An Efficient 3D CNN for Action/Object Segmentation in Video

Pith reviewed 2026-05-24 18:57 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords video object segmentation · 3D CNN · action segmentation · separable convolution · encoder-decoder · pyramid pooling · temporal aggregation

The pith

An end-to-end 3D CNN aggregates spatial and temporal information for efficient video object and action segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that a 3D convolutional network can jointly handle spatial appearance and temporal motion in videos for segmentation tasks using a single encoder-decoder structure. A sympathetic reader would care because most prior work splits these into two separate streams, which increases complexity for video processing. The key innovation is applying 3D separable convolutions in the pyramid pooling and decoder stages to cut down on the heavy computation typical of 3D CNNs. The model is also extended with a classifier for action labels. If the approach works, it suggests a simpler path to accurate video understanding with lower resource use.

Core claim

The central claim is that an end-to-end encoder-decoder style 3D CNN can aggregate spatial and temporal information simultaneously for video object segmentation, and that 3D separable convolution in the pyramid pooling module and decoder dramatically reduces operations while maintaining performance. The framework extends to video action segmentation by adding an extra classifier, with experiments showing superior performance compared to state-of-the-art on several video datasets.

What carries the argument

3D separable convolution in the pyramid pooling module and decoder of the encoder-decoder 3D CNN, which factors standard 3D operations to lower computational cost while processing video volumes.
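
To make the cost argument concrete, here is a minimal sketch that counts multiply-accumulate operations (MACs) for a single layer. It assumes the separable variant is a depthwise 3D convolution followed by a 1x1x1 pointwise convolution, with stride 1 and "same" padding; the summary above does not spell out the exact factorization the authors use, so treat this as one plausible reading. The function names and the layer sizes are illustrative, not taken from the paper.

```python
from math import prod


def conv3d_macs(c_in, c_out, kernel, out_size):
    """MACs for a standard 3D convolution.

    kernel and out_size are (time, height, width) tuples.
    """
    return c_in * c_out * prod(kernel) * prod(out_size)


def separable_conv3d_macs(c_in, c_out, kernel, out_size):
    """MACs for a depthwise 3D conv followed by a 1x1x1 pointwise conv.

    Assumed factorization; not confirmed by the text above.
    """
    depthwise = c_in * prod(kernel) * prod(out_size)
    pointwise = c_in * c_out * prod(out_size)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 256 -> 256 channels, 3x3x3 kernel, 8x28x28 output.
    args = (256, 256, (3, 3, 3), (8, 28, 28))
    full = conv3d_macs(*args)
    sep = separable_conv3d_macs(*args)
    print(f"standard:  {full:,} MACs")
    print(f"separable: {sep:,} MACs")
    print(f"ratio:     {sep / full:.4f}")
```

Under this assumed factorization the cost ratio is 1/c_out + 1/(k_t * k_h * k_w), which is roughly 4% for 256 output channels and a 3x3x3 kernel. Whether accuracy survives that reduction is a separate, empirical question.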

If this is right

  • Video object segmentation becomes feasible with a unified 3D model instead of separate spatial and motion streams.
  • The number of operations drops significantly in the pyramid pooling and decoder parts.
  • The same model architecture supports both object segmentation and action classification in videos.
  • Performance exceeds previous methods on standard video segmentation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If separable convolutions preserve accuracy, similar efficiency gains could appear in other video analysis tasks like detection or tracking.
  • Real-time applications on edge devices might become practical due to the reduced operations.
  • Further testing on longer video sequences could reveal if the temporal aggregation scales effectively.

Load-bearing premise

Replacing full 3D convolutions with separable versions maintains equivalent segmentation accuracy.

What would settle it

Measure the difference in segmentation accuracy and runtime between versions using full 3D convolutions and separable 3D convolutions on a held-out video dataset.
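
The accuracy half of that comparison could be sketched as follows. This is a minimal illustration that assumes region IoU (Jaccard index) averaged over frames as the metric; the text above does not name the paper's evaluation metrics. The names `mask_iou`, `mean_iou`, and `accuracy_gap` are hypothetical, masks are flattened to 0/1 sequences for simplicity, and runtime measurement is left out.

```python
def mask_iou(pred, gt):
    """Intersection-over-union of two binary masks given as flat 0/1 sequences."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Two empty masks agree perfectly; define IoU as 1.0 in that case.
    return inter / union if union else 1.0


def mean_iou(preds, gts):
    """Average IoU over a sequence of (prediction, ground truth) frame pairs."""
    scores = [mask_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)


def accuracy_gap(full_preds, separable_preds, gts):
    """Mean IoU of the full-3D variant minus that of the separable variant."""
    return mean_iou(full_preds, gts) - mean_iou(separable_preds, gts)


if __name__ == "__main__":
    # Toy data: one 4-pixel frame, made up purely for illustration.
    gts = [[1, 1, 0, 0]]
    full_preds = [[1, 1, 0, 0]]
    separable_preds = [[1, 0, 0, 0]]
    print("IoU gap (full - separable):", accuracy_gap(full_preds, separable_preds, gts))
```

A gap near zero on held-out videos, paired with a measured drop in operations, would support the premise; a large positive gap would undercut it.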

Figures

Figures reproduced from arXiv: 1907.08895 by Chen Chen, Mubarak Shah, Rahul Sukthankar, Rui Hou.

Figure 1: Comparison between standard 3D convolution, R2plus1D and 3D separable con…
Figure 2: The network architecture of our method for video object segmentation. It has three…
Figure 3: Qualitative results of the proposed (Ours) approach (red), ARP (yellow), LVO…
Figure 4: Action segmentation and detection results obtained by our method on the J-HMDB…
Original abstract

Convolutional Neural Network (CNN) based image segmentation has made great progress in recent years. However, video object segmentation remains a challenging task due to its high computational complexity. Most of the previous methods employ a two-stream CNN framework to handle spatial and motion features separately. In this paper, we propose an end-to-end encoder-decoder style 3D CNN to aggregate spatial and temporal information simultaneously for video object segmentation. To efficiently process video, we propose 3D separable convolution for the pyramid pooling module and decoder, which dramatically reduces the number of operations while maintaining the performance. Moreover, we also extend our framework to video action segmentation by adding an extra classifier to predict the action label for actors in videos. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for action and object segmentation compared to the state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an end-to-end encoder-decoder 3D CNN architecture for video object segmentation that aggregates spatial and temporal information simultaneously, in contrast to prior two-stream approaches. It introduces 3D separable convolutions within the pyramid pooling module and decoder to reduce the number of operations while asserting that performance is maintained. The framework is extended to video action segmentation via an added classifier, with claims of superior performance over state-of-the-art methods demonstrated through extensive experiments on several video datasets.

Significance. An efficient single-stream 3D CNN that simultaneously processes spatial-temporal features could be impactful for video segmentation if the efficiency gains are shown to be performance-neutral. The architectural choice of separable 3D convolutions for the pyramid pooling and decoder stages addresses a practical computational bottleneck. However, without quantitative validation of the performance-maintenance claim, the significance remains provisional.

major comments (1)
  1. [Abstract] The central assertion that 3D separable convolutions 'dramatically reduces the number of operations while maintaining the performance' is load-bearing for the efficiency contribution, yet the provided text supplies no ablation studies, operation counts, accuracy metrics, or direct comparisons against a full-3D-convolution baseline to substantiate the equivalence.
minor comments (1)
  1. [Abstract] Dataset names, evaluation metrics, and any reported quantitative improvements (e.g., mIoU deltas or operation reductions) are omitted, which would be needed to evaluate the 'superior performance' claim even at a high level.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Abstract] The central assertion that 3D separable convolutions 'dramatically reduces the number of operations while maintaining the performance' is load-bearing for the efficiency contribution, yet the provided text supplies no ablation studies, operation counts, accuracy metrics, or direct comparisons against a full-3D-convolution baseline to substantiate the equivalence.

    Authors: We agree that the abstract's efficiency claim requires direct substantiation. The full manuscript reports extensive experiments across multiple datasets showing competitive performance, but does not include a dedicated ablation isolating 3D separable convolutions against an otherwise identical full-3D-convolution baseline in the pyramid pooling module and decoder, nor the associated operation counts. We will add this ablation study, including FLOPs comparisons and accuracy metrics, to the revised manuscript.

    revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposal without derivation chain

Full rationale

The paper proposes an end-to-end 3D CNN encoder-decoder architecture for video object and action segmentation, introducing 3D separable convolutions in the pyramid pooling module and decoder to reduce operations. No equations, fitted parameters, predictions, or first-principles derivations are described that could reduce to inputs by construction. The work frames its contributions as empirical architectural choices validated on datasets, with no self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work. The central efficiency claim is an engineering assertion requiring external ablation evidence, but this is a correctness issue rather than circularity. The derivation chain is self-contained as a methods contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard CNN training and that separable 3D convolutions preserve representational power, which are domain assumptions in computer vision.

pith-pipeline@v0.9.0 · 5674 in / 964 out tokens · 22856 ms · 2026-05-24T18:57:13.512681+00:00 · methodology

