An Efficient 3D CNN for Action/Object Segmentation in Video

Chen Chen; Mubarak Shah; Rahul Sukthankar; Rui Hou

arxiv: 1907.08895 · v1 · pith:7DTM7NPSnew · submitted 2019-07-21 · 💻 cs.CV · eess.IV

Rui Hou, Chen Chen, Rahul Sukthankar, Mubarak Shah

Pith reviewed 2026-05-24 18:57 UTC · model grok-4.3

classification 💻 cs.CV eess.IV

keywords video object segmentation · 3D CNN · action segmentation · separable convolution · encoder-decoder · pyramid pooling · temporal aggregation

The pith

An end-to-end 3D CNN aggregates spatial and temporal information for efficient video object and action segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that a 3D convolutional network can jointly handle spatial appearance and temporal motion in videos for segmentation tasks using a single encoder-decoder structure. A sympathetic reader would care because most prior work splits these into two separate streams, which increases complexity for video processing. The key innovation is applying 3D separable convolutions in the pyramid pooling and decoder stages to cut down on the heavy computation typical of 3D CNNs. The model is also extended with a classifier for action labels. If the approach works, it suggests a simpler path to accurate video understanding with lower resource use.

Core claim

The central claim is that an end-to-end encoder-decoder style 3D CNN can aggregate spatial and temporal information simultaneously for video object segmentation, and that 3D separable convolution in the pyramid pooling module and decoder dramatically reduces operations while maintaining performance. The framework extends to video action segmentation by adding an extra classifier, with experiments showing superior performance compared to state-of-the-art on several video datasets.

What carries the argument

3D separable convolution in the pyramid pooling module and decoder of the encoder-decoder 3D CNN, which factors standard 3D operations to lower computational cost while processing video volumes.
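As a rough illustration of why factoring a 3D convolution lowers cost, the sketch below counts multiply-accumulate operations (MACs) for a single layer. It assumes a depthwise-plus-pointwise factorization, which is one common reading of "separable"; the abstract does not state the paper's exact factorization, and the function names and layer sizes here are hypothetical.

```python
def conv3d_macs(c_in, c_out, kernel, out_shape):
    """MACs for a standard 3D convolution (bias ignored)."""
    kt, kh, kw = kernel
    t, h, w = out_shape
    return c_in * c_out * kt * kh * kw * t * h * w


def separable_conv3d_macs(c_in, c_out, kernel, out_shape):
    """MACs for an assumed depthwise + pointwise (1x1x1) factorization."""
    kt, kh, kw = kernel
    t, h, w = out_shape
    # Depthwise step: one kt x kh x kw filter per input channel.
    depthwise = c_in * kt * kh * kw * t * h * w
    # Pointwise step: 1x1x1 convolution that mixes channels.
    pointwise = c_in * c_out * t * h * w
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer: 256 -> 256 channels, 3x3x3 kernel, 8x28x28 output.
    full = conv3d_macs(256, 256, (3, 3, 3), (8, 28, 28))
    sep = separable_conv3d_macs(256, 256, (3, 3, 3), (8, 28, 28))
    print(f"standard:  {full:,} MACs")
    print(f"separable: {sep:,} MACs")
    print(f"reduction: {full / sep:.1f}x")
```

Under this assumed factorization the reduction for a kernel with k taps is c_out · k / (k + c_out), which is about 24x for the hypothetical layer above. The count says nothing about whether accuracy is preserved; that remains an empirical question.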

If this is right

Video object segmentation becomes feasible with a unified 3D model instead of separate spatial and motion streams.
The number of operations drops significantly in the pyramid pooling and decoder parts.
The same model architecture supports both object segmentation and action classification in videos.
Performance exceeds previous methods on standard video segmentation benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If separable convolutions preserve accuracy, similar efficiency gains could appear in other video analysis tasks like detection or tracking.
Real-time applications on edge devices might become practical due to the reduced operations.
Further testing on longer video sequences could reveal if the temporal aggregation scales effectively.

Load-bearing premise

Replacing full 3D convolutions with separable versions maintains equivalent segmentation accuracy.

What would settle it

Measure the difference in segmentation accuracy and runtime between versions using full 3D convolutions and separable 3D convolutions on a held-out video dataset.
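A comparison of this kind could be set up along the lines of the sketch below. It assumes intersection-over-union (IoU) on binary masks as the accuracy measure and wall-clock time as the runtime measure; the abstract names neither a metric nor a timing protocol, so both choices, along with the helper names and toy masks, are illustrative assumptions.

```python
import time


def mask_iou(pred, truth):
    """IoU of two binary masks given as equally sized nested lists of 0/1."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly by convention.
    return 1.0 if union == 0 else inter / union


def mean_iou(pred_frames, truth_frames):
    """Average per-frame IoU over a clip."""
    scores = [mask_iou(p, t) for p, t in zip(pred_frames, truth_frames)]
    if not scores:
        raise ValueError("need at least one frame")
    return sum(scores) / len(scores)


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Toy stand-ins for the outputs of a full-3D and a separable model.
    truth = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    full_pred = [[[1, 1], [0, 0]], [[0, 1], [0, 1]]]
    sep_pred = [[[1, 0], [0, 0]], [[0, 1], [0, 1]]]
    full_score, full_time = timed(mean_iou, full_pred, truth)
    sep_score, sep_time = timed(mean_iou, sep_pred, truth)
    print(f"accuracy gap (full - separable): {full_score - sep_score:.3f}")
```

In a real test the two prediction lists would come from otherwise identical networks, and `timed` would wrap each model's forward pass rather than the metric.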

Figures

Figures reproduced from arXiv: 1907.08895 by Chen Chen, Mubarak Shah, Rahul Sukthankar, Rui Hou.

**Figure 2.** The network architecture of our method for video object segmentation. It has three …

**Figure 3.** Qualitative results of the proposed (Ours) approach (red), ARP (yellow), LVO …

**Figure 4.** Action segmentation and detection results obtained by our method on the J-HMDB …

Original abstract

Convolutional Neural Network (CNN) based image segmentation has made great progress in recent years. However, video object segmentation remains a challenging task due to its high computational complexity. Most of the previous methods employ a two-stream CNN framework to handle spatial and motion features separately. In this paper, we propose an end-to-end encoder-decoder style 3D CNN to aggregate spatial and temporal information simultaneously for video object segmentation. To efficiently process video, we propose 3D separable convolution for the pyramid pooling module and decoder, which dramatically reduces the number of operations while maintaining the performance. Moreover, we also extend our framework to video action segmentation by adding an extra classifier to predict the action label for actors in videos. Extensive experiments on several video datasets demonstrate the superior performance of the proposed approach for action and object segmentation compared to the state-of-the-art.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes a 3D encoder-decoder CNN with separable convolutions for efficient video object and action segmentation, but the abstract gives no numbers or ablations to back the performance claim.

The main takeaway is a single-stream 3D CNN that aggregates space and time for video segmentation, using separable 3D convolutions in the pyramid pooling module and decoder to cut operations. They also tack on a classifier for action labels. This is a direct response to the cost of two-stream methods and the complexity of video data. The design choice to go fully 3D and end-to-end is clear enough, and applying separability to those specific blocks is a practical efficiency step that builds on existing separable conv ideas.

The abstract states that this keeps performance while slashing operations, and it claims superior results on multiple datasets. That is the extent of what is shown. The soft spot is exactly the one flagged in the stress test: no quantitative comparison to a full 3D baseline, no ablation on the separable layers, no metrics, no error bars, and no dataset details. The equivalence between separable and full 3D is asserted but not demonstrated in the text available. Without those results the efficiency benefit cannot be taken as given.

This work would interest people already building efficient video models who want to see the architecture details. A reader hunting for new ideas might skim the design, but the missing evidence makes it hard to judge impact. I would not send it to peer review until the experiments are supplied and checked.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes an end-to-end encoder-decoder 3D CNN architecture for video object segmentation that aggregates spatial and temporal information simultaneously, in contrast to prior two-stream approaches. It introduces 3D separable convolutions within the pyramid pooling module and decoder to reduce the number of operations while asserting that performance is maintained. The framework is extended to video action segmentation via an added classifier, with claims of superior performance over state-of-the-art methods demonstrated through extensive experiments on several video datasets.

Significance. An efficient single-stream 3D CNN that simultaneously processes spatial-temporal features could be impactful for video segmentation if the efficiency gains are shown to be performance-neutral. The architectural choice of separable 3D convolutions for the pyramid pooling and decoder stages addresses a practical computational bottleneck. However, without quantitative validation of the performance-maintenance claim, the significance remains provisional.

major comments (1)

[Abstract] The central assertion that 3D separable convolutions 'dramatically reduces the number of operations while maintaining the performance' is load-bearing for the efficiency contribution, yet the provided text supplies no ablation studies, operation counts, accuracy metrics, or direct comparisons against a full-3D-convolution baseline to substantiate the equivalence.

minor comments (1)

[Abstract] Dataset names, evaluation metrics, and any reported quantitative improvements (e.g., mIoU deltas or operation reductions) are omitted, which would be needed to evaluate the 'superior performance' claim even at a high level.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the major comment point by point below.

Referee: [Abstract] The central assertion that 3D separable convolutions 'dramatically reduces the number of operations while maintaining the performance' is load-bearing for the efficiency contribution, yet the provided text supplies no ablation studies, operation counts, accuracy metrics, or direct comparisons against a full-3D-convolution baseline to substantiate the equivalence.

Authors: We agree that the abstract's efficiency claim requires direct substantiation. The full manuscript reports extensive experiments across multiple datasets showing competitive performance, but does not include a dedicated ablation isolating 3D separable convolutions against an otherwise identical full-3D-convolution baseline in the pyramid pooling module and decoder, nor the associated operation counts. We will add this ablation study, including FLOPs comparisons and accuracy metrics, to the revised manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity: architectural proposal without derivation chain

The paper proposes an end-to-end 3D CNN encoder-decoder architecture for video object and action segmentation, introducing 3D separable convolutions in the pyramid pooling module and decoder to reduce operations. No equations, fitted parameters, predictions, or first-principles derivations are described that could reduce to inputs by construction. The work frames its contributions as empirical architectural choices validated on datasets, with no self-citation load-bearing steps, uniqueness theorems, or ansatzes smuggled via prior work. The central efficiency claim is an engineering assertion requiring external ablation evidence, but this is a correctness issue rather than circularity. The derivation chain is self-contained as a methods contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes standard CNN training and that separable 3D convolutions preserve representational power, which are domain assumptions in computer vision.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages · 3 internal anchors

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.

[2] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[3] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.

[4] Jason Chang, Donglai Wei, and John W. Fisher III. A video representation using temporal superpixels. In IEEE Conference on Computer Vision and Pattern Recognition, 2013.

[5] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.

[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

[7] Kevin Duarte, Yogesh Rawat, and Mubarak Shah. Videocapsulenet: A simplified network for action detection. In Advances in Neural Information Processing Systems, pages 7610–7619, 2018.

[8] Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3664–3673, 2017.

[9] Alon Faktor and Michal Irani. Video segmentation by non-local consensus voting. In British Machine Vision Conference, volume 2, 2014.

[10] Katerina Fragkiadaki, Geng Zhang, and Jianbo Shi. Video segmentation by tracing discontinuities in a trajectory embedding. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[11] Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[12] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018.

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[14] Lifeng He, Yuyan Chao, Kenji Suzuki, and Kesheng Wu. Fast connected-component labeling. Pattern Recognition, 42(9):1977–1987, 2009.

[15] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2013.

[16] Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

[17] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.

[18] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.

[19] Yan Ke, Rahul Sukthankar, and Martial Hebert. Event detection in crowded videos. In Proceedings of the IEEE International Conference on Computer Vision, 2007.

[20] Margret Keuper, Bjoern Andres, and Thomas Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

[21] Yeong Jun Koh and Chang-Su Kim. Primary object segmentation in videos based on region augmentation and reduction. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[23] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553), 2015.

[24] Yong Jae Lee, Jaechul Kim, and Kristen Grauman. Key-segments for video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 2011.

[25] Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013.

[26] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.

[27] Jiasen Lu, Jason J Corso, et al. Human action segmentation with hierarchical supervoxel consistency. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[28] Kevis Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, and Luc Van Gool. Video object segmentation without temporal information. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2018.

[29] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.

[30] Anestis Papazoglou and Vittorio Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, 2013.

[31] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[32] Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2663–2672, 2017.

[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[34] Hou Rui, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

[35] Mennatullah Siam, Chen Jiang, Steven Lu, Laura Petrich, Mahmoud Gamal, Mohamed Elhoseiny, and Martin Jagersand. Video segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. arXiv preprint arXiv:1810.07733, 2018.

[36] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.

[37] Hongmei Song, Wenguan Wang, Sanyuan Zhao, Jianbing Shen, and Kin-Man Lam. Pyramid dilated deeper convlstm for video salient object detection. In The European Conference on Computer Vision (ECCV), September 2018.

[38] Pavel Tokmakov and Karteek Alahari. Learning Video Object Segmentation with Visual Memory. In Proceedings of the IEEE International Conference on Computer Vision, 2017.

[39] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[40] Pavel Tokmakov, Cordelia Schmid, and Karteek Alahari. Learning to segment moving objects. International Journal of Computer Vision, 127(3):282–301, 2019.

[41] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2015.

[42] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[43] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation. In The 2017 DAVIS Challenge on Video Object Segmentation - CVPR Workshops, 2017.

[44] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.

[45] Pingkun Yan, Saad M Khan, and Mubarak Shah. Learning 4D action feature models for arbitrary view action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2008.

[46] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.

[47] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[48] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.

[1] [1]

Segnet: A deep convo- lutional encoder-decoder architecture for image segmentation

Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convo- lutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence , 39(12):2481–2495, 2017. HOU ET AL.: AN EFFICIENT 3D CNN FOR ACTION/OBJECT SEGMENTA TION IN VIDEO 11

work page 2017

[2] [2]

One-shot video object segmentation

Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cre- mers, and Luc Van Gool. One-shot video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2017

work page 2017

[3] [3]

Quo vadis, action recognition? a new model and the Kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the Kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

work page 2017

[4] [4]

Fisher III

Jason Chang, Donglai Wei, and John W. Fisher III. A video representation using tem- poral superpixels. In IEEE Conference on Computer Vision Pattern Recognition, 2013

work page 2013

[5] [5]

Rethinking Atrous Convolution for Semantic Image Segmentation

Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Re- thinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs

Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018

work page 2018

[7] [7]

Videocapsulenet: A simpliﬁed net- work for action detection

Kevin Duarte, Yogesh Rawat, and Mubarak Shah. Videocapsulenet: A simpliﬁed net- work for action detection. In Advances in Neural Information Processing Systems , pages 7610–7619, 2018

work page 2018

[8] [8]

FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos

Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , pages 3664–3673, 2017

work page 2017

[9] [9]

Video segmentation by non-local consensus voting

Alon Faktor and Michal Irani. Video segmentation by non-local consensus voting. In British Machine and Vision Conference, volume 2, 2014

work page 2014

[10] [10]

Video segmentation by tracing discontinuities in a trajectory embedding

Katerina Fragkiadaki, Geng Zhang, and Jianbo Shi. Video segmentation by tracing discontinuities in a trajectory embedding. In IEEE Conference on Computer Vision and Pattern Recognition, 2012

work page 2012

[11] [11]

Actor and action video segmentation from a sentence

Kirill Gavrilyuk, Amir Ghodrati, Zhenyang Li, and Cees GM Snoek. Actor and action video segmentation from a sentence. In IEEE Conference on Computer Vision and Pattern Recognition, 2018

work page 2018

[12] [12]

Can spatiotemporal 3d CNNs retrace the history of 2d CNNs and ImageNet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018

Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3d CNNs retrace the history of 2d CNNs and ImageNet? In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6546–6555, 2018

work page 2018

[13] [13]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

work page 2016

[14] [14]

Fast connected-component labeling

Lifeng He, Yuyan Chao, Kenji Suzuki, and Kesheng Wu. Fast connected-component labeling. Pattern Recognition, 42(9):1977–1987, 2009. 12 HOU ET AL.: AN EFFICIENT 3D CNN FOR ACTION/OBJECT SEGMENTA TION IN VIDEO

work page 1977

[15] [15]

Jhuang, J

H. Jhuang, J. Gall, S. Zufﬁ, C. Schmid, and M. J. Black. Towards understanding action recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2013

work page 2013

[16] [16]

Action tubelet detector for spatio-temporal action localization

Vicky Kalogeiton, Philippe Weinzaepfel, Vittorio Ferrari, and Cordelia Schmid. Action tubelet detector for spatio-temporal action localization. In Proceedings of the IEEE International Conference on Computer Vision, 2017

work page 2017

[17] [17]

Large-scale video classiﬁcation with convolutional neural networks

Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classiﬁcation with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition , 2014

work page 2014

[18] [18]

The Kinetics Human Action Video Dataset

Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vi- jayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[19] [19]

Event detection in crowded videos

Yan Ke, Rahul Sukthankar, and Martial Hebert. Event detection in crowded videos. In Proceedings of the IEEE International Conference on Computer Vision , 2007

work page 2007

[20] [20]

Motion trajectory segmentation via minimum cost multicuts

Margret Keuper, Bjoern Andres, and Thomas Brox. Motion trajectory segmentation via minimum cost multicuts. In Proceedings of the IEEE International Conference on Computer Vision, 2015

work page 2015

[21] [21]

Primary object segmentation in videos based on region augmentation and reduction

Yeong Jun Koh and Chang-Su Kim. Primary object segmentation in videos based on region augmentation and reduction. In IEEE Conference on Computer Vision and Pattern Recognition, 2017

work page 2017

[22] [22]

ImageNet classiﬁcation with deep convolutional neural networks

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classiﬁcation with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012

work page 2012

[23] [23]

Deep learning

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553), 2015

[24]

Key-segments for video object segmentation

Yong Jae Lee, Jaechul Kim, and Kristen Grauman. Key-segments for video object segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 2011

[25]

Video segmentation by tracking many figure-ground segments

Fuxin Li, Taeyoung Kim, Ahmad Humayun, David Tsai, and James M Rehg. Video segmentation by tracking many figure-ground segments. In Proceedings of the IEEE International Conference on Computer Vision, pages 2192–2199, 2013

[26]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

[27]

Human action segmentation with hierarchical supervoxel consistency

Jiasen Lu, Jason J Corso, et al. Human action segmentation with hierarchical supervoxel consistency. In IEEE Conference on Computer Vision and Pattern Recognition, 2015

[28]

Video object segmentation without temporal information

Kevis Kokitsi Maninis, Sergi Caelles, Yuhua Chen, Jordi Pont-Tuset, and Luc Van Gool. Video object segmentation without temporal information. IEEE transactions on pattern analysis and machine intelligence, PP(99):1–1, 2018

[29]

Learning deconvolution network for semantic segmentation

Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015

[30]

Fast object segmentation in unconstrained video

Anestis Papazoglou and Vittorio Ferrari. Fast object segmentation in unconstrained video. In Proceedings of the IEEE International Conference on Computer Vision, 2013

[31]

A benchmark dataset and evaluation methodology for video object segmentation

F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2016

[32]

Learning video object segmentation from static images

Federico Perazzi, Anna Khoreva, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2663–2672, 2017

[33]

U-net: Convolutional networks for biomedical image segmentation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015

[34]

Tube convolutional neural network (T-CNN) for action detection in videos

Hou Rui, Chen Chen, and Mubarak Shah. Tube convolutional neural network (T-CNN) for action detection in videos. In Proceedings of the IEEE International Conference on Computer Vision, 2017

[35]

Video Object Segmentation using Teacher-Student Adaptation in a Human Robot Interaction (HRI) Setting

Mennatullah Siam, Chen Jiang, Steven Lu, Laura Petrich, Mahmoud Gamal, Mohamed Elhoseiny, and Martin Jagersand. Video segmentation using teacher-student adaptation in a human robot interaction (HRI) setting. arXiv preprint arXiv:1810.07733, 2018

[36]

Two-stream convolutional networks for action recognition in videos

Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014

[37]

Pyramid dilated deeper convlstm for video salient object detection

Hongmei Song, Wenguan Wang, Sanyuan Zhao, Jianbing Shen, and Kin-Man Lam. Pyramid dilated deeper convlstm for video salient object detection. In The European Conference on Computer Vision (ECCV), September 2018

[38]

Learning Video Object Segmentation with Visual Memory

Pavel Tokmakov and Karteek Alahari. Learning Video Object Segmentation with Visual Memory. In Proceedings of the IEEE International Conference on Computer Vision, 2017

[39]

Learning motion patterns in videos

Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. IEEE Conference on Computer Vision and Pattern Recognition, 2017

[40]

Learning to segment moving objects

Pavel Tokmakov, Cordelia Schmid, and Karteek Alahari. Learning to segment moving objects. International Journal of Computer Vision, 127(3):282–301, 2019

[41]

Learning spatiotemporal features with 3d convolutional networks

Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2015

[42]

A closer look at spatiotemporal convolutions for action recognition

Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2018

[43]

Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation

Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for the 2017 DAVIS challenge on video object segmentation. In The 2017 DAVIS Challenge on Video Object Segmentation-CVPR Workshops, 2017

[44]

Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification

Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018

[45]

Learning 4d action feature models for arbitrary view action recognition

Pingkun Yan, Saad M Khan, and Mubarak Shah. Learning 4d action feature models for arbitrary view action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7. IEEE, 2008

[46]

Multi-scale context aggregation by dilated convolutions

Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016

[47]

Beyond short snippets: Deep networks for video classification

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In IEEE Conference on Computer Vision and Pattern Recognition, 2015

[48]

Pyramid scene parsing network

Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2881–2890, 2017
