Separable Convolutional LSTMs for Faster Video Segmentation

Andreas Pfeuffer; Klaus Dietmayer

arxiv: 1907.06876 · v1 · pith:JNTLONASnew · submitted 2019-07-16 · 💻 cs.CV · eess.IV

Pith reviewed 2026-05-24 21:16 UTC · model grok-4.3

classification 💻 cs.CV eess.IV

keywords: video segmentation, convLSTM, separable convolutions, semantic segmentation, temporal modeling, computational efficiency, flickering metric

The pith

ConvLSTM cells modified with separable convolutions enable faster video segmentation with comparable accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video segmentation benefits from recurrent units like convLSTMs to incorporate temporal information across frames, improving performance over single-image methods. However, these units add significant computational overhead, increasing inference time by up to 66 percent. The paper generalizes spatial and depthwise separable convolution techniques to the internal operations of convLSTMs to lower parameter counts and FLOPs. Tests across datasets confirm that the resulting networks run up to 15 percent faster on GPUs with only minor or no accuracy loss. The work also introduces a metric to quantify flickering pixels in output video sequences.

Core claim

By generalizing spatial and depthwise separable convolutions to convLSTM cells, the number of parameters and required FLOPs are reduced significantly. Segmentation approaches using these modified cells achieve similar or slightly worse accuracy but are up to 15 percent faster on a GPU compared to standard convLSTM versions. A new evaluation metric measures flickering pixels in segmented video sequences.
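The parameter reduction can be illustrated with simple counting. The sketch below assumes a common implementation in which the four gate pre-activations come from one convolution over the concatenated input and hidden state; the function names, the channel sizes, and the intermediate width `c_mid` of the spatially separable variant are illustrative assumptions, not values from the paper. Biases are ignored.

```python
def standard_gate_params(c_in, c_hidden, k):
    """Weights of one k x k convolution over [input, hidden] producing 4 gates."""
    c = c_in + c_hidden
    return k * k * c * 4 * c_hidden


def depthwise_separable_gate_params(c_in, c_hidden, k):
    """Depthwise k x k filter per channel, then a 1 x 1 pointwise convolution."""
    c = c_in + c_hidden
    depthwise = k * k * c          # one k x k filter per input channel
    pointwise = c * 4 * c_hidden   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def spatially_separable_gate_params(c_in, c_hidden, k, c_mid=None):
    """A k x 1 convolution followed by a 1 x k convolution.

    c_mid (channels between the two convolutions) is a design choice that is
    assumed here; it defaults to the number of concatenated input channels.
    """
    c = c_in + c_hidden
    if c_mid is None:
        c_mid = c
    return k * c * c_mid + k * c_mid * 4 * c_hidden


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    for name, fn in [
        ("standard", standard_gate_params),
        ("depthwise separable", depthwise_separable_gate_params),
        ("spatially separable", spatially_separable_gate_params),
    ]:
        print(f"{name}: {fn(64, 64, 3)} weights")
```

With these assumed sizes the depthwise separable variant needs roughly a ninth of the weights of the standard cell. Whether that translates into the reported GPU speedup depends on the implementation, which is why the measured gain is much smaller than the parameter ratio suggests.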

What carries the argument

The modified convLSTM cells, in which spatial and depthwise separable convolutions replace the standard convolutions inside the gate computations.
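Written out, the replacement might look as follows. This is a sketch in the widely used peephole-free form of the convLSTM; the symbols W, D and P and the gate naming are notation chosen here for illustration and may differ from the paper's own equations.

```latex
% Peephole-free convLSTM with input X_t, hidden state H_{t-1}, cell state C_{t-1}.
% * denotes convolution, \odot the Hadamard product, \sigma the logistic sigmoid.
\begin{align}
i_t &= \sigma(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o) \\
g_t &= \tanh(W_{xg} * X_t + W_{hg} * H_{t-1} + b_g) \\
C_t &= f_t \odot C_{t-1} + i_t \odot g_t \\
H_t &= o_t \odot \tanh(C_t)
\end{align}
% Depthwise separable replacement of any k x k kernel W above:
% a per-channel k x k filter D (\circledast, needs amssymb) followed by a
% 1 x 1 pointwise kernel P that mixes channels.
\begin{equation}
W * X \;\longrightarrow\; P * (D \circledast X)
\end{equation}
% Spatially separable replacement: a k x 1 kernel followed by a 1 x k kernel.
\begin{equation}
W * X \;\longrightarrow\; W^{(1 \times k)} * \bigl(W^{(k \times 1)} * X\bigr)
\end{equation}
```

In the depthwise separable form, cross-channel mixing is carried entirely by the pointwise kernel P, which is the part of the construction the premise below depends on.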

If this is right

Video segmentation networks achieve similar performance with reduced computational complexity.
Inference time for each video frame decreases by up to 15 percent on GPU hardware.
The new flickering metric provides a quantitative way to evaluate temporal consistency in segmentations.
The approach maintains the core benefit of temporal modeling while lowering resource demands.
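To make the idea of a flickering measure concrete, here is a minimal sketch of one possible definition. It is an assumption for illustration and not the paper's mFP or mFIP: it counts pixels whose predicted label changes between consecutive frames although the ground-truth label stays the same. The function name and the nested-list frame format are hypothetical.

```python
def flicker_fraction(pred_frames, gt_frames):
    """Share of temporally stable pixels whose predicted label changes.

    pred_frames, gt_frames: lists of frames, each frame a list of rows of
    integer class labels. A pixel transition counts as stable when the
    ground-truth label is identical in frames t-1 and t; it counts as a
    flicker when the prediction nevertheless changes.
    """
    flickers = 0
    stable = 0
    for t in range(1, len(pred_frames)):
        rows = zip(pred_frames[t - 1], pred_frames[t],
                   gt_frames[t - 1], gt_frames[t])
        for p_prev_row, p_row, g_prev_row, g_row in rows:
            for p0, p1, g0, g1 in zip(p_prev_row, p_row, g_prev_row, g_row):
                if g0 == g1:
                    stable += 1
                    if p0 != p1:
                        flickers += 1
    return flickers / stable if stable else 0.0


if __name__ == "__main__":
    gt = [[[0, 1], [1, 1]]] * 3          # static scene, three frames
    pred = [
        [[0, 1], [1, 1]],
        [[0, 0], [1, 1]],                # top-right pixel flips to class 0
        [[0, 1], [1, 1]],                # and flips back
    ]
    print(flicker_fraction(pred, gt))
```

This variant needs ground truth for every frame. Datasets that annotate only some frames would need a label-free variant that compares predictions against image changes instead.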

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separable modification technique might extend to other recurrent units in video processing pipelines.
Speed gains could support real-time operation on embedded hardware for robotics applications.
The flickering metric might serve as a complementary benchmark for any temporal segmentation method.

Load-bearing premise

That the separable convolution replacements in convLSTM cells do not substantially impair the recurrent temporal modeling essential for video segmentation performance.

What would settle it

A direct comparison showing that accuracy degrades beyond slight levels or that the reported speed gains disappear when implemented on different hardware would challenge the central claim.
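A check of the speed claim on other hardware could start from a small timing harness like the one below. It is a sketch under stated assumptions: `median_runtime_ms` and `percent_faster` are hypothetical helpers, "faster" is read as the relative reduction in per-frame runtime, and the callables passed in are stand-ins for a real forward pass. On a GPU the callable would also have to wait for the device to finish, since kernel launches are asynchronous.

```python
import statistics
import time


def median_runtime_ms(fn, warmup=3, repeats=20):
    """Median wall-clock time of fn() in milliseconds after warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def percent_faster(baseline_ms, modified_ms):
    """Relative reduction in runtime of the modified model, in percent."""
    return 100.0 * (baseline_ms - modified_ms) / baseline_ms


if __name__ == "__main__":
    # Stand-in workloads; replace with one forward pass per model variant.
    baseline = median_runtime_ms(lambda: sum(range(200_000)))
    modified = median_runtime_ms(lambda: sum(range(170_000)))
    print(f"{percent_faster(baseline, modified):.1f} percent faster")
```

Reporting the median over repeated runs, together with the hardware, batch size and input resolution, would make results from different setups comparable.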

Figures

Figures reproduced from arXiv: 1907.06876 by Andreas Pfeuffer, Klaus Dietmayer.

**Figure 2.** Illustration of mean Flickering Pixels (mFP). First row: images of a video sequence; second row: corresponding ground-truth; third row: yielded …

**Figure 3.** Illustration of mean Flickering Image Pixels (mFIP). First row: images of a video sequence; second row: corresponding segmentation map; third …

Original abstract

Semantic Segmentation is an important module for autonomous robots such as self-driving cars. The advantage of video segmentation approaches compared to single image segmentation is that temporal image information is considered, and their performance increases due to this. Hence, single image segmentation approaches are extended by recurrent units such as convolutional LSTM (convLSTM) cells, which are placed at suitable positions in the basic network architecture. However, a major critique of video segmentation approaches based on recurrent neural networks is their large parameter count and their computational complexity, and so, their inference time of one video frame takes up to 66 percent longer than their basic version. Inspired by the success of the spatial and depthwise separable convolutional neural networks, we generalize these techniques for convLSTMs in this work, so that the number of parameters and the required FLOPs are reduced significantly. Experiments on different datasets show that the segmentation approaches using the proposed, modified convLSTM cells achieve similar or slightly worse accuracy, but are up to 15 percent faster on a GPU than the ones using the standard convLSTM cells. Furthermore, a new evaluation metric is introduced, which measures the amount of flickering pixels in the segmented video sequence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies depthwise separable convolutions inside convLSTM gates to cut parameters and get modest GPU speedups in video segmentation, but the abstract leaves the accuracy and temporal modeling claims thinly supported.

The core move is taking the spatial and depthwise separable convolution trick and dropping it into the four gates of a convLSTM. That directly reduces the parameter count and FLOPs for the recurrent part of a video segmentation network. They also introduce a flickering-pixel metric to quantify temporal consistency, which is a useful addition for this task. Those are the concrete things the work contributes on top of prior separable-convolution work and standard convLSTM video models.

The reported outcome is accuracy that stays similar or drops only slightly while inference speeds up by up to 15 percent on GPU. That is a practical efficiency tweak worth noting for anyone already using convLSTMs in robotics pipelines.

The main weakness is that the abstract supplies no dataset names, no baseline architectures, no error bars, and no ablation on whether the separable factorization actually preserves the cross-channel mixing needed for the recurrent state updates. Without those details the claim that temporal modeling remains intact rests on a single qualitative phrase. The stress-test point about reduced cross-channel mixing in the gates is therefore still open; if the full paper does not address it with targeted ablations, the accuracy numbers could be driven more by the backbone than by the modified cells.

This is an incremental engineering paper aimed at practitioners who already run recurrent video models and want lower latency. It is worth sending to a serious referee because the construction is straightforward to reproduce and the speed claim is easy to check, even though the current write-up needs more experimental grounding before the accuracy parity can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes to replace the standard convolutions inside convLSTM gates (both input-to-state and state-to-state) with spatial and depthwise separable convolutions, thereby reducing parameter count and FLOPs while preserving the overall video-segmentation architecture. Experiments are reported to show that the resulting models achieve accuracy comparable to (or only slightly below) unmodified convLSTM baselines while delivering up to 15 % GPU speed-up; a new “flickering-pixel” metric is also introduced to quantify temporal instability.

Significance. If the empirical parity claim holds under rigorous controls, the work supplies a practical, drop-in acceleration technique for recurrent video segmentation that could be directly useful for real-time robotics and autonomous-driving pipelines. The new flickering metric is a modest but welcome addition to the evaluation toolkit.

major comments (2)

[Abstract and Experiments section] The central empirical claim (comparable accuracy with speed gain) rests on experiments whose description supplies neither dataset identities, baseline architectures, number of runs, error bars, nor statistical tests. Without these controls it is impossible to determine whether the reported parity is attributable to the separable convLSTM modification or to the backbone network.
[§3] (Proposed separable convLSTM cell) The manuscript provides no analysis or ablation demonstrating that depthwise separable factorization inside the four gates preserves the temporal state propagation that justifies the use of convLSTMs. If cross-channel mixing is materially reduced, any observed accuracy parity could be an artifact of the spatial backbone rather than evidence that the recurrent component remains functional.

minor comments (2)

[Abstract] The abstract states “up to 15 percent faster” without specifying the exact hardware, batch size, or input resolution used for the timing measurements.
[§3] Notation for the separable convolution operators inside the LSTM gates is introduced without an explicit equation relating the factorized kernels to the original full convolution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and will revise the manuscript to improve experimental reporting and add supporting analysis.

Referee: [Abstract and Experiments section] The central empirical claim (comparable accuracy with speed gain) rests on experiments whose description supplies neither dataset identities, baseline architectures, number of runs, error bars, nor statistical tests. Without these controls it is impossible to determine whether the reported parity is attributable to the separable convLSTM modification or to the backbone network.

Authors: We agree that the experimental description requires greater explicitness for reproducibility. In the revised manuscript we will explicitly list the dataset identities, baseline architectures, number of runs, error bars, and any statistical tests performed. Because the backbone network is held identical between the standard convLSTM and separable-convLSTM variants, with the sole change being the factorization inside the convLSTM gates, the speed-up and accuracy results can be attributed to the proposed modification.
revision: yes

Referee: [§3] (Proposed separable convLSTM cell) The manuscript provides no analysis or ablation demonstrating that depthwise separable factorization inside the four gates preserves the temporal state propagation that justifies the use of convLSTMs. If cross-channel mixing is materially reduced, any observed accuracy parity could be an artifact of the spatial backbone rather than evidence that the recurrent component remains functional.

Authors: We acknowledge that an explicit ablation would strengthen the claim that the recurrent dynamics are preserved. While the gate structure, recurrent connections, and overall architecture remain unchanged, we will add an ablation study in the revision that examines the effect of the factorization on temporal state propagation (for example, by comparing hidden-state evolution metrics across variants). This will help confirm that the recurrent functionality is retained rather than being an artifact of the backbone.
revision: yes

Circularity Check

0 steps flagged

No circularity; empirical modification tested on external datasets

The paper proposes applying spatial and depthwise separable convolutions to the gates of convLSTM cells as an engineering modification, then reports GPU runtime and accuracy on standard video segmentation datasets. No equations, fitted parameters, or self-citations are used to derive the performance claims; the reported speedups and accuracy parity are direct empirical measurements against unmodified baselines. The central claim therefore does not reduce to any input quantity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that separable convolutions preserve the essential recurrent behavior of convLSTMs; no free parameters or invented entities are stated in the abstract.

axioms (1)

Domain assumption: Convolutional LSTM cells can be modified with separable convolutions while retaining sufficient temporal modeling power for video segmentation.
This premise is required for the generalization to deliver the claimed accuracy-speed trade-off.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 14 internal anchors

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[2] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.

[3] François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.

[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. CoRR, abs/1604.01685, 2016.

[5] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.

[6] Juergen T. Geiger, Zixing Zhang, Felix Weninger, Björn Schuller, and Gerhard Rigoll. Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling.

[7] Alex Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.

[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(9):1735–1780, November 1997.

[9] Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. CoRR, abs/1704.04861, 2017.

[10] Oleksii Kuchaiev and Boris Ginsburg. Factorization tricks for LSTM networks. CoRR, abs/1703.10722, 2017.

[11] Andrew Lavin. Fast algorithms for convolutional neural networks. CoRR, abs/1509.09308, 2015.

[12] Xiaoyu Liu. Deep convolutional and LSTM neural networks for acoustic modelling in automatic speech recognition.

[13] Andreas Pfeuffer and Klaus Dietmayer. Robust semantic segmentation in adverse weather conditions by means of sensor data fusion. In 2019 22nd International Conference on Information Fusion (FUSION 2019), Ottawa, Canada, July 2019.

[14] Andreas Pfeuffer, Karina Schulz, and Klaus Dietmayer. Semantic segmentation of video sequences with convolutional LSTMs. In 2019 IEEE Intelligent Vehicles Symposium (IV), pages 1253–1259, 2019.

[15] Seyed shahabeddin Nabavi, Mrigank Rochan, Yang, and Wang. Future semantic segmentation with convolutional LSTM, 07 2018.

[16] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. CoRR, abs/1605.06211, 2016.

[17] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. CoRR, abs/1506.04214, 2015.

[18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.

[19] Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.

[20] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. CoRR, abs/1409.4842, 2014.

[21] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the Inception architecture for computer vision. CoRR, abs/1512.00567, 2015.

[22] Sepehr Valipour, Mennatullah Siam, Martin Jägersand, and Nilanjan Ray. Recurrent fully convolutional networks for video segmentation. 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 29–36, 2017.

[23] E. E. Yurdakul and Y. Yemez. Semantic segmentation of RGBD videos with recurrent fully convolutional neural networks. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 367–374, Oct 2017.

[24] Hengshuang Zhao, Xiaojuan Qi, Xiaoyong Shen, Jianping Shi, and Jiaya Jia. ICNet for real-time semantic segmentation on high-resolution images. CoRR, abs/1704.08545, 2017.

[25] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. CoRR, abs/1612.01105, 2016.
