
arxiv: 1907.06876 · v1 · pith:JNTLONASnew · submitted 2019-07-16 · 💻 cs.CV · eess.IV

Separable Convolutional LSTMs for Faster Video Segmentation

Pith reviewed 2026-05-24 21:16 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords video segmentation · convLSTM · separable convolutions · semantic segmentation · temporal modeling · computational efficiency · flickering metric

The pith

ConvLSTM cells modified with separable convolutions enable faster video segmentation with comparable accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video segmentation benefits from recurrent units like convLSTMs to incorporate temporal information across frames, improving performance over single-image methods. However, these units add significant computational overhead, increasing inference time by up to 66 percent. The paper generalizes spatial and depthwise separable convolution techniques to the internal operations of convLSTMs to lower parameter counts and FLOPs. Tests across datasets confirm that the resulting networks run up to 15 percent faster on GPUs with only minor or no accuracy loss. The work also introduces a metric to quantify flickering pixels in output video sequences.

Core claim

By generalizing spatial and depthwise separable convolutions to convLSTM cells, the paper significantly reduces the parameter count and the required FLOPs. Segmentation approaches using these modified cells achieve similar or slightly worse accuracy but run up to 15 percent faster on a GPU than versions using standard convLSTM cells. A new evaluation metric measures flickering pixels in segmented video sequences.
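To make the claimed reduction concrete, the block below writes out a common convLSTM gate formulation and the textbook depthwise separable factorization of one of its convolutions. The notation is an assumption made for illustration (peephole terms omitted, symbols $D$, $P$, $k$, $C_{\text{in}}$, $C_{\text{out}}$ introduced here); the paper's own equations are not reproduced on this page and may differ.

```latex
% Common convLSTM update (assumed notation; * is convolution, \circ the Hadamard product)
\begin{align}
  i_t &= \sigma\!\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
  f_t &= \sigma\!\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
  o_t &= \sigma\!\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
  g_t &= \tanh\!\left(W_{xg} * X_t + W_{hg} * H_{t-1} + b_g\right) \\
  C_t &= f_t \circ C_{t-1} + i_t \circ g_t \\
  H_t &= o_t \circ \tanh(C_t)
\end{align}

% Depthwise separable replacement of one convolution W * X:
% a k x k depthwise kernel D_c per input channel, then a 1 x 1 pointwise mix P.
\begin{equation}
  (W * X)_o = \sum_{c} W_{o,c} * X_c
  \quad\longrightarrow\quad
  \sum_{c} P_{o,c}\,\bigl(D_c * X_c\bigr)
\end{equation}

% Weight count per convolution and the resulting ratio
\begin{equation}
  \frac{k^2 C_{\text{in}} + C_{\text{in}} C_{\text{out}}}{k^2 C_{\text{in}} C_{\text{out}}}
  = \frac{1}{C_{\text{out}}} + \frac{1}{k^2}
\end{equation}
```

A spatially separable variant would instead replace each $k \times k$ kernel with a $k \times 1$ convolution followed by a $1 \times k$ convolution, so the kernel footprint per channel pair drops from $k^2$ to $2k$ weights when the channel widths stay equal.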

What carries the argument

The modified convLSTM cells, in which spatial and depthwise separable convolutions replace the standard convolutions inside the gates.
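A rough way to see why the modified cells are lighter is to count convolution weights. The sketch below is illustrative only: the function names, the choice of four gates each with an input-to-state and a state-to-state convolution, and the intermediate channel width of the spatially separable variant are assumptions, not details taken from the paper. Biases are ignored.

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """One k x k depthwise kernel per input channel, then a 1 x 1 pointwise mix."""
    return k * k * c_in + c_in * c_out


def spatially_separable_params(k, c_in, c_out):
    """A k x 1 convolution followed by a 1 x k convolution.

    Assumption: the intermediate tensor keeps c_out channels.
    """
    return k * c_in * c_out + k * c_out * c_out


def convlstm_gate_params(conv_fn, k, c_x, c_h):
    """Four gates, each with an input-to-state and a state-to-state convolution."""
    return 4 * (conv_fn(k, c_x, c_h) + conv_fn(k, c_h, c_h))


def multiply_adds(params, height, width):
    """Multiply-adds for stride-1, same-padded convolutions on a height x width map."""
    return params * height * width


if __name__ == "__main__":
    k, c_x, c_h = 3, 64, 64  # hypothetical layer sizes
    variants = [
        ("standard", conv_params),
        ("depthwise separable", depthwise_separable_params),
        ("spatially separable", spatially_separable_params),
    ]
    for name, fn in variants:
        print(name, convlstm_gate_params(fn, k, c_x, c_h))
```

For the hypothetical sizes above (3 x 3 kernels, 64 input and 64 hidden channels), the depthwise separable count comes to about 12.7 percent of the standard count. Measured speed-ups are usually much smaller than such ratios suggest, since the recurrent cell is only one part of the network and GPU kernels for depthwise operations are not always equally efficient.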

If this is right

  • Video segmentation networks achieve similar performance with reduced computational complexity.
  • Inference time for each video frame decreases by up to 15 percent on GPU hardware.
  • The new flickering metric provides a quantitative way to evaluate temporal consistency in segmentations.
  • The approach maintains the core benefit of temporal modeling while lowering resource demands.
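The exact definition of the flickering metric is not given on this page, so the following is only one plausible reading, sketched to show what such a measure could look like: a pixel counts as flickering when its predicted class changes between consecutive frames although its ground-truth class stays the same. The function name `flicker_fraction` and the nested-list frame format are hypothetical.

```python
def flicker_fraction(pred_frames, gt_frames):
    """Share of ground-truth-stable pixels whose predicted class changes.

    pred_frames, gt_frames: lists of frames, each frame a 2D list of class ids.
    Illustrative definition only; not taken from the paper.
    """
    flickers = 0
    stable = 0
    for t in range(1, len(pred_frames)):
        for y, row in enumerate(gt_frames[t]):
            for x, gt_label in enumerate(row):
                # Skip pixels whose true class changed; a prediction change there is legitimate.
                if gt_label != gt_frames[t - 1][y][x]:
                    continue
                stable += 1
                if pred_frames[t][y][x] != pred_frames[t - 1][y][x]:
                    flickers += 1
    return flickers / stable if stable else 0.0


if __name__ == "__main__":
    gt = [[[0, 1], [1, 1]]] * 3
    pred = [
        [[0, 1], [1, 1]],
        [[0, 0], [1, 1]],  # top-right pixel flips to class 0
        [[0, 1], [1, 1]],  # and flips back
    ]
    print(flicker_fraction(pred, gt))
```

In this toy sequence one of four pixels flips in each of the two frame transitions, so the function returns 0.25. A variant that does not need ground truth would have to decide in some other way which pixels are expected to stay stable, for example by comparing the input images of consecutive frames.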

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The separable modification technique might extend to other recurrent units in video processing pipelines.
  • Speed gains could support real-time operation on embedded hardware for robotics applications.
  • The flickering metric might serve as a complementary benchmark for any temporal segmentation method.

Load-bearing premise

That the separable convolution replacements in convLSTM cells do not substantially impair the recurrent temporal modeling essential for video segmentation performance.

What would settle it

A direct comparison showing more than a slight drop in accuracy, or showing that the reported speed gains disappear on different hardware, would challenge the central claim.

Figures

Figures reproduced from arXiv: 1907.06876 by Andreas Pfeuffer, Klaus Dietmayer.

Figure 1: Comparison of the required model parameters, FLOPs and inference […]
Figure 2: Illustration of mean Flickering Pixels (mFP). First row: images of a video sequence; second row: corresponding ground-truth; third row: yielded […]
Figure 3: Illustration of mean Flickering Image Pixels (mFIP). First row: images of a video sequence; second row: corresponding segmentation map; third […]
Original abstract

Semantic Segmentation is an important module for autonomous robots such as self-driving cars. The advantage of video segmentation approaches compared to single image segmentation is that temporal image information is considered, and their performance increases due to this. Hence, single image segmentation approaches are extended by recurrent units such as convolutional LSTM (convLSTM) cells, which are placed at suitable positions in the basic network architecture. However, a major critique of video segmentation approaches based on recurrent neural networks is their large parameter count and their computational complexity, and so, their inference time of one video frame takes up to 66 percent longer than their basic version. Inspired by the success of the spatial and depthwise separable convolutional neural networks, we generalize these techniques for convLSTMs in this work, so that the number of parameters and the required FLOPs are reduced significantly. Experiments on different datasets show that the segmentation approaches using the proposed, modified convLSTM cells achieve similar or slightly worse accuracy, but are up to 15 percent faster on a GPU than the ones using the standard convLSTM cells. Furthermore, a new evaluation metric is introduced, which measures the amount of flickering pixels in the segmented video sequence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes to replace the standard convolutions inside convLSTM gates (both input-to-state and state-to-state) with spatial and depthwise separable convolutions, thereby reducing parameter count and FLOPs while preserving the overall video-segmentation architecture. Experiments are reported to show that the resulting models achieve accuracy comparable to (or only slightly below) unmodified convLSTM baselines while delivering up to 15 % GPU speed-up; a new “flickering-pixel” metric is also introduced to quantify temporal instability.

Significance. If the empirical parity claim holds under rigorous controls, the work supplies a practical, drop-in acceleration technique for recurrent video segmentation that could be directly useful for real-time robotics and autonomous-driving pipelines. The new flickering metric is a modest but welcome addition to the evaluation toolkit.

major comments (2)
  1. [Abstract and Experiments section] The central empirical claim (comparable accuracy with speed gain) rests on experiments whose description supplies neither dataset identities, baseline architectures, number of runs, error bars, nor statistical tests. Without these controls it is impossible to determine whether the reported parity is attributable to the separable convLSTM modification or to the backbone network.
  2. [§3] §3 (proposed separable convLSTM cell): the manuscript provides no analysis or ablation demonstrating that depthwise separable factorization inside the four gates preserves the temporal state propagation that justifies the use of convLSTMs. If cross-channel mixing is materially reduced, any observed accuracy parity could be an artifact of the spatial backbone rather than evidence that the recurrent component remains functional.
minor comments (2)
  1. [Abstract] The abstract states “up to 15 percent faster” without specifying the exact hardware, batch size, or input resolution used for the timing measurements.
  2. [§3] Notation for the separable convolution operators inside the LSTM gates is introduced without an explicit equation relating the factorized kernels to the original full convolution.
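The reporting gaps raised above (number of runs, error bars, timing conditions) can be addressed with a small measurement harness. The sketch below is a generic illustration, not the authors' protocol: `benchmark` and `relative_speedup` are hypothetical helpers, and the workloads in the usage section are stand-ins for a real forward pass.

```python
import statistics
import time


def benchmark(fn, warmup=5, runs=20):
    """Time fn() over several runs after a warm-up phase; results in milliseconds."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "runs": runs,
        "mean_ms": statistics.mean(samples),
        "std_ms": statistics.stdev(samples) if runs > 1 else 0.0,
    }


def relative_speedup(baseline_ms, candidate_ms):
    """Fraction of the baseline time saved by the candidate."""
    return (baseline_ms - candidate_ms) / baseline_ms


if __name__ == "__main__":
    # Placeholder workloads standing in for the two network variants.
    baseline = benchmark(lambda: sum(i * i for i in range(20000)))
    candidate = benchmark(lambda: sum(i * i for i in range(17000)))
    print(baseline, candidate)
    print(relative_speedup(baseline["mean_ms"], candidate["mean_ms"]))
```

When timing GPU inference, the callable passed to `benchmark` needs to block until the device has finished, because GPU work is typically launched asynchronously; how to do that depends on the framework. Reporting the mean and standard deviation together with the hardware, batch size, and input resolution would let readers judge a figure such as "up to 15 percent faster".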

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to improve experimental reporting and add supporting analysis.

point-by-point responses
  1. Referee: [Abstract and Experiments section] The central empirical claim (comparable accuracy with speed gain) rests on experiments whose description supplies neither dataset identities, baseline architectures, number of runs, error bars, nor statistical tests. Without these controls it is impossible to determine whether the reported parity is attributable to the separable convLSTM modification or to the backbone network.

    Authors: We agree that the experimental description requires greater explicitness for reproducibility. In the revised manuscript we will explicitly list the dataset identities, baseline architectures, number of runs, error bars, and any statistical tests performed. Because the backbone network is held identical between the standard convLSTM and separable-convLSTM variants, with the sole change being the factorization inside the convLSTM gates, the speed-up and accuracy results can be attributed to the proposed modification. revision: yes

  2. Referee: [§3] §3 (proposed separable convLSTM cell): the manuscript provides no analysis or ablation demonstrating that depthwise separable factorization inside the four gates preserves the temporal state propagation that justifies the use of convLSTMs. If cross-channel mixing is materially reduced, any observed accuracy parity could be an artifact of the spatial backbone rather than evidence that the recurrent component remains functional.

    Authors: We acknowledge that an explicit ablation would strengthen the claim that the recurrent dynamics are preserved. While the gate structure, recurrent connections, and overall architecture remain unchanged, we will add an ablation study in the revision that examines the effect of the factorization on temporal state propagation (for example, by comparing hidden-state evolution metrics across variants). This will help confirm that the recurrent functionality is retained rather than being an artifact of the backbone. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical modification tested on external datasets

full rationale

The paper proposes applying spatial and depthwise separable convolutions to the gates of convLSTM cells as an engineering modification, then reports GPU runtime and accuracy on standard video segmentation datasets. No equations, fitted parameters, or self-citations are used to derive the performance claims; the reported speedups and accuracy parity are direct empirical measurements against unmodified baselines. The central claim therefore does not reduce to any input quantity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that separable convolutions preserve the essential recurrent behavior of convLSTMs; no free parameters or invented entities are stated in the abstract.

axioms (1)
  • [domain assumption] Convolutional LSTM cells can be modified with separable convolutions while retaining sufficient temporal modeling power for video segmentation.
    This premise is required for the generalization to deliver the claimed accuracy-speed trade-off.

pith-pipeline@v0.9.0 · 5728 in / 1142 out tokens · 25191 ms · 2026-05-24T21:16:17.915343+00:00 · methodology

