Single Level Feature-to-Feature Forecasting with Deformable Convolutions

Josip \v{S}ari\'c; Marin Or\v{s}i\'c; Sacha Vra\v{z}i\'c; Sini\v{s}a \v{S}egvi\'c; Ton\'ci Antunovi\'c

arxiv: 1907.11475 · v1 · pith:5TDG7OK5new · submitted 2019-07-26 · 💻 cs.CV

Single Level Feature-to-Feature Forecasting with Deformable Convolutions

Josip \v{S}ari\'c , Marin Or\v{s}i\'c , Ton\'ci Antunovi\'c , Sacha Vra\v{z}i\'c , Sini\v{s}a \v{S}egvi\'c This is my paper

Pith reviewed 2026-05-24 15:52 UTC · model grok-4.3

classification 💻 cs.CV

keywords semantic segmentation forecastingdeformable convolutionsfeature-to-feature predictionautonomous drivingfuture frame anticipationCityscapesvideo prediction

0 comments

The pith

Deformable convolutions enable single-level feature forecasting of future semantic segmentations by operating only on coarse abstract features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a method for predicting semantic segmentation maps of future frames in driving videos by performing forecasting directly on high-level features rather than on pixels or low-level details. It relies on a segmentation backbone that omits lateral connections during upsampling so that all prediction effort stays at the coarsest, most abstract feature level. Deformable convolutions replace standard ones to let a single feature map capture multiple distinct motion patterns at once. Experiments demonstrate that this combination yields higher accuracy than regular or dilated convolutions while adding almost no parameters. The approach reaches state-of-the-art results on the Cityscapes validation set for forecasts nine frames ahead.

Core claim

The central claim is that feature-to-feature forecasting expressed with deformable convolutions increases modeling power for varied motion patterns within one feature map; when the forecasting is restricted to the most abstract features at very coarse resolution by removing lateral connections in the upsampling path, the resulting models outperform their regular and dilated counterparts on future semantic segmentation while keeping the parameter count nearly unchanged.

What carries the argument

Single-level feature-to-feature forecasting performed with deformable convolutions on the coarsest abstract features of a segmentation model that lacks lateral connections in its upsampling path.

If this is right

Deformable convolutions allow one feature map to represent several different motion patterns simultaneously.
Forecasting at the coarsest abstract level suffices to produce accurate future semantic segmentations.
The method achieves state-of-the-art accuracy on Cityscapes validation for nine-timestep forecasts.
The parameter overhead remains minimal compared with dilated or regular convolution baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same single-level deformable forecasting could be applied to other dense prediction tasks such as depth or optical flow anticipation.
Operating only at coarse resolution may reduce the compute needed for real-time deployment in decision-making systems.
The separation of high-level forecasting from low-level detail synthesis suggests a modular architecture for longer-horizon video prediction.

Load-bearing premise

Restricting all forecasting to the single coarsest abstract feature level is both necessary and sufficient for accurate future semantic segmentation.

What would settle it

A controlled experiment in which the identical architecture with standard convolutions or with lateral upsampling connections produces equal or higher accuracy on the Cityscapes nine-step forecasting task would falsify the central design choice.

Figures

Figures reproduced from arXiv: 1907.11475 by Josip \v{S}ari\'c, Marin Or\v{s}i\'c, Sacha Vra\v{z}i\'c, Sini\v{s}a \v{S}egvi\'c, Ton\'ci Antunovi\'c.

**Figure 1.** Figure 1: Structural diagram of the employed single-frame model (a) and the proposed compound model for forecasting semantic segmentation (b). The two models share the ResNet-18 feature extractor (yellow) and the upsampling path (green, blue). with respect to ground truth labels. We update the F2F parameters by averaging gradients from F2F L2 loss and the backpropagated cross-entropy loss. 3.2 Proposed Feature-to-Fe… view at source ↗

**Figure 2.** Figure 2: plots mIoU results from the third section of [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Short-term semantic segmentation forecasts (0.18 s into the future) for 3 sequences. The columns show i) the last observed frame, ii) the future frame, iii) the groundtruth segmentation, iv) our oracle, and v) our semantic segmentation forecast [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗

**Figure 4.** Figure 4: Mid-term semantic segmentation predictions (0.5 s into the future) for 5 sequences. The columns show i) the last observed frame, ii) the future frame, iii) the ground truth segmentation, iv) our oracle, and v) our semantic segmentation forecast. Effective receptive field. We express the effective receptive field by measuring partial derivation of log-max-softmax [19] with respect to the four input images… view at source ↗

**Figure 5.** Figure 5: Effective receptive field of mid-term forecast in 4 sequences. Columns show the four input frames, the future frame t+9 and the corresponding semantic segmentation forecast. We show pixels with the strongest gradient of log-max-softmax (red dots) in a hand-picked pixel (green dot) w.r.t. the each of the input frames. pixels from the last two frames. This is in accordance with mid-term experiments from [PI… view at source ↗

**Figure 1.** Figure 1: shows additional qualitative results obtained by our mid-term model based on ResNet-18 and DeformF2F-8. Row 1 showcases the ability of inpainting by considering a wider context. A car on the right is making the turn and unoccludes the part of the future frame which has not been visible in previous frames. The model correctly reconstructs the scene by forecasting a feasible [PITH_FULL_IMAGE:figures/full_fi… view at source ↗

**Figure 2.** Figure 2: Comparison of effective receptive fields for F2F models with dilated (top) and deformable (bottom) convolutions on two mid-term sequences. Columns show the four input frames, the future frame t+9 and the corresponding semantic segmentation forecast. We show pixels with the strongest gradient of log-max-softmax (red dots) in a hand-picked pixel (green dot) w.r.t. the each of the input frames [PITH_FULL_IMA… view at source ↗

**Figure 3.** Figure 3: Distribution of the average mean-square-error of F2F predictions on Cityscapes val. Darker colors correspond to lower values. Largest errors occur around the horizon. 2 Increasing the Capacity of the Single-frame Model We investigate the influence of the single-frame model capacity on the accuracy of the semantic segmentation forecast. We conduct short-term experiments with our standard F2F model and a str… view at source ↗

read the original abstract

Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such design ensures that the forecasting addresses only the most abstract features on a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power due to being able to represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state of the art performance on the Cityscapes validation set when forecasting nine timesteps into the future.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Deformable convolutions improve coarse single-level feature forecasting over regular and dilated versions on Cityscapes, but the SOTA claim rests on an unverified assumption about the weaker base model.

read the letter

This paper takes a single-level approach to forecasting semantic features for future frames in driving videos. It uses a segmentation model without lateral connections so that the forecasting module works only on coarse, abstract features, and replaces standard convolutions with deformable ones to better model motion. The new element is expressing the forecasting as deformable convolution, which allows different motion patterns in one feature map. The results indicate that this version beats both regular and dilated convolution versions on Cityscapes while keeping parameter count low. That is a clean, practical finding. The soft spot is the comparison for the state-of-the-art claim. The no-lateral-connections design probably reduces the base segmentation performance compared to the skip-connection models common in other work. If the numbers are measured against those stronger baselines, the headline result mixes forecasting improvement with a weaker starting accuracy. The abstract does not provide an ablation or direct control for this, so the central claim rests on an assumption that may not hold. There are also no error bars and the nine-timestep horizon could have been chosen after inspection. This is the kind of paper that would interest people working on anticipation for autonomous driving. It shows clear thinking on how to simplify the forecasting problem and tests it empirically on an external benchmark. It deserves peer review to sort out the baseline fairness and to see if the gains are robust.

Referee Report

3 major / 1 minor

Summary. The paper proposes single-level feature-to-feature forecasting for future semantic segmentation in driving scenes. It employs a segmentation backbone without lateral connections in the upsampling path so that forecasting operates exclusively on coarse, abstract features; deformable convolutions are used to model diverse motion patterns within feature maps. Experiments on Cityscapes report that deformable-convolution variants outperform regular and dilated counterparts with minimal parameter overhead and achieve state-of-the-art performance when forecasting nine timesteps ahead.

Significance. If the performance claims hold under controlled comparisons, the work would demonstrate that deformable convolutions can increase modeling capacity for motion without substantial parameter cost, while the single-level coarse-feature design simplifies the forecasting task. The approach is directly relevant to autonomous-driving anticipation and could be extended to other video-prediction settings.

major comments (3)

[Abstract] Abstract: the state-of-the-art claim for nine-timestep forecasting on the Cityscapes validation set is presented without error bars, explicit baseline model specifications, or quantitative comparison to prior forecasting methods that employ segmentation backbones with lateral/skip connections. This information is required to establish whether the reported gains arise from the deformable-convolution forecasting module or from a weaker base segmentation accuracy.
[Abstract] Abstract, paragraph 2: the design decision to remove lateral connections is justified by the claim that it 'ensures that the forecasting addresses only the most abstract features on a very coarse resolution,' yet no ablation or side-by-side evaluation against an otherwise identical model with skip connections is provided. Because the SOTA result is measured against external baselines that typically retain lateral connections, this architectural restriction is load-bearing for the central performance claim and must be controlled for.
[Abstract] The manuscript reports that deformable-convolution models 'outperform their regular and dilated counterparts' but supplies no numerical tables, per-timestep metrics, or statistical significance tests for the Cityscapes experiments. Without these data it is impossible to assess the magnitude or reliability of the reported improvement.

minor comments (1)

[Abstract] The abstract states that the method 'minimally increas[es] the number of parameters,' but no concrete parameter counts or FLOPs comparisons are supplied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on our work. We address each major comment point-by-point below. Where the manuscript requires additional data or controls, we will revise accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the state-of-the-art claim for nine-timestep forecasting on the Cityscapes validation set is presented without error bars, explicit baseline model specifications, or quantitative comparison to prior forecasting methods that employ segmentation backbones with lateral/skip connections. This information is required to establish whether the reported gains arise from the deformable-convolution forecasting module or from a weaker base segmentation accuracy.

Authors: We agree that the abstract would benefit from additional context. In the revision we will add error bars (from repeated runs with different seeds), explicitly name the baseline models, and insert a compact quantitative comparison table against prior methods that retain skip connections. All variants in our experiments share the identical segmentation backbone, so differences are attributable to the forecasting module rather than base accuracy. revision: yes
Referee: [Abstract] Abstract, paragraph 2: the design decision to remove lateral connections is justified by the claim that it 'ensures that the forecasting addresses only the most abstract features on a very coarse resolution,' yet no ablation or side-by-side evaluation against an otherwise identical model with skip connections is provided. Because the SOTA result is measured against external baselines that typically retain lateral connections, this architectural restriction is load-bearing for the central performance claim and must be controlled for.

Authors: We acknowledge the absence of a direct ablation on skip connections. The single-level coarse-feature design is a deliberate modeling choice to simplify the forecasting task. In the revised manuscript we will add an ablation that trains otherwise identical models with and without lateral connections, reporting the resulting forecasting accuracy to isolate the contribution of the architectural restriction. revision: yes
Referee: [Abstract] The manuscript reports that deformable-convolution models 'outperform their regular and dilated counterparts' but supplies no numerical tables, per-timestep metrics, or statistical significance tests for the Cityscapes experiments. Without these data it is impossible to assess the magnitude or reliability of the reported improvement.

Authors: We will expand the experimental section to include full numerical tables showing per-timestep mIoU for all three convolution variants, together with the corresponding parameter counts. Where multiple runs are available we will also report statistical significance. These tables will be referenced from the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain; claims rest on external benchmark evaluation.

full rationale

The paper presents an architectural design (segmentation backbone without lateral connections) and a forecasting module (deformable convolutions for feature-to-feature prediction) evaluated on the external Cityscapes validation set using standard metrics. No equations, fitted parameters, or self-citations reduce the SOTA claim or method to inputs by construction. The design choice is an explicit modeling decision to restrict forecasting to coarse abstract features, not a self-referential loop. The performance claim is falsifiable against external baselines and does not rely on prior author work for its core justification.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical superiority of deformable convolutions for feature forecasting and on the assumption that coarse single-level features suffice; no new physical or mathematical axioms are introduced.

free parameters (1)

deformable convolution offset learning rate and kernel size
Standard hyperparameters of the deformable convolution layers that are tuned during training and directly affect motion modeling capacity.

axioms (1)

domain assumption Standard backpropagation and SGD training converge to a useful local minimum for the forecasting task.
Implicit in all reported neural-network experiments; invoked when claiming outperformance over baselines.

pith-pipeline@v0.9.0 · 5698 in / 1301 out tokens · 20967 ms · 2026-05-24T15:52:07.011279+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 10 internal anchors

[1]

Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods

Bhattacharyya, A., Fritz, M., Schiele, B.: Bayesian prediction of future street scenes using synthetic likelihoods. arXiv preprint arXiv:1810.00746 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

https://github.com/ open-mmlab/mmdetection (2018)

Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: mmdetection. https://github.com/ open-mmlab/mmdetection (2018)

work page 2018
[3]

IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2018)

Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se- mantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2018)

work page 2018
[4]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

work page 2016
[5]

In: ICCV

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo- lutional networks. In: ICCV. pp. 764–773 (2017)

work page 2017
[6]

In: 2009 IEEE conference on computer vision and pattern recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large- scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)

work page 2009
[7]

In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016

Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fu- sion for video action recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 1933–1941 (2016)

work page 2016
[8]

In: Proceedings of the IEEE international conference on computer vision

He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)

work page 2017
[9]

In: NIPS

Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS. pp. 2017–2025 (2015)

work page 2017
[10]

In: Advances in Neural Information Processing Systems

Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: Advances in Neural Information Processing Systems. pp. 6915–6924 (2017)

work page 2017
[11]

In: Proceedings of the 34th International Conference on Machine Learning-Volume 70

Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1771–1779. JMLR. org (2017)

work page 2017
[12]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014
[13]

Panoptic Segmentation

Kirillov, A., He, K., Girshick, R., Rother, C., Doll´ ar, P.: Panoptic segmentation. arXiv preprint arXiv:1801.00868 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Efficient Ladder-style DenseNets for Semantic Segmentation of Large Images

Kreˇ so, I., Krapac, J.,ˇSegvi´ c, S.: Eﬃcient ladder-style densenets for semantic seg- mentation of large images. arXiv preprint arXiv:1905.05661 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1905
[15]

The handbook of brain theory and neural networks 3361(10), 1995 (1995)

LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10), 1995 (1995)

work page 1995
[16]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)

work page 2017
[17]

In: Proceedings of the European Con- ference on Computer Vision (ECCV)

Luc, P., Couprie, C., Lecun, Y., Verbeek, J.: Predicting future instance segmenta- tion by forecasting convolutional features. In: Proceedings of the European Con- ference on Computer Vision (ECCV). pp. 584–599 (2018) 14 Josip ˇSari´ c, Marin Orˇ si´ c, Ton´ ci Antunovi´ c, Sacha Vraˇ zi´ c, and Siniˇ saˇSegvi´ c

work page 2018
[18]

In: Proceedings of the IEEE International Conference on Computer Vision

Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 648–657 (2017)

work page 2017
[19]

In: Advances in neural information processing systems

Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the eﬀective receptive ﬁeld in deep convolutional neural networks. In: Advances in neural information processing systems. pp. 4898–4906 (2016)

work page 2016
[20]

Deep multi-scale video prediction beyond mean square error

Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[21]

BMVC (2018)

Nabavi, S.S., Rochan, M., Wang, Y.: Future semantic segmentation with convolu- tional lstm. BMVC (2018)

work page 2018
[22]

In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images

Orˇ si´ c, M., Kreˇ so, I., Bevandi´ c, P.,ˇSegvi´ c, S.: In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. arXiv preprint arXiv:1903.08469 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1903
[23]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical ﬂow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8934–8943 (2018)

work page 2018
[24]

Recurrent Flow-Guided Semantic Forecasting

Terwilliger, A.M., Brazil, G., Liu, X.: Recurrent ﬂow-guided semantic forecasting. arXiv preprint arXiv:1809.08318 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

Anticipating Visual Representations from Unlabeled Video

Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023 2 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[26]

In: International Conference on Image Analysis and Processing

Vukoti´ c, V., Pintea, S.L., Raymond, C., Gravier, G., van Gemert, J.C.: One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network. In: International Conference on Image Analysis and Processing. pp. 140–151. Springer (2017)

work page 2017
[27]

In: Advances in neural information processing systems

Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convo- lutional lstm network: A machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems. pp. 802–810 (2015)

work page 2015
[28]

In: CVPR

Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation in street scenes. In: CVPR. pp. 3684–3692 (2018)

work page 2018
[29]

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015
[30]

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)

Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)

work page 2017
[31]

Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. arXiv preprint arXiv:1811.11168 (2018) Single Level Feature-to-Feature Forecasting with Deformable Convolutions - Supplementary Material Josip ˇSari´ c1, Marin Orˇ si´ c1, Ton´ ci Antunovi´ c2, Sacha Vraˇ zi´ c2, and Siniˇ sa ˇSegvi´ c1 1 University of Zagreb, Facu...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[1] [1]

Bayesian Prediction of Future Street Scenes using Synthetic Likelihoods

Bhattacharyya, A., Fritz, M., Schiele, B.: Bayesian prediction of future street scenes using synthetic likelihoods. arXiv preprint arXiv:1810.00746 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[2] [2]

https://github.com/ open-mmlab/mmdetection (2018)

Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., Loy, C.C., Lin, D.: mmdetection. https://github.com/ open-mmlab/mmdetection (2018)

work page 2018

[3] [3]

IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2018)

Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se- mantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelli- gence 40(4), 834–848 (2018)

work page 2018

[4] [4]

In: Proceedings of the IEEE conference on computer vision and pattern recognition

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

work page 2016

[5] [5]

In: ICCV

Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convo- lutional networks. In: ICCV. pp. 764–773 (2017)

work page 2017

[6] [6]

In: 2009 IEEE conference on computer vision and pattern recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large- scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)

work page 2009

[7] [7]

In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016

Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fu- sion for video action recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 1933–1941 (2016)

work page 2016

[8] [8]

In: Proceedings of the IEEE international conference on computer vision

He, K., Gkioxari, G., Doll´ ar, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 2961–2969 (2017)

work page 2017

[9] [9]

In: NIPS

Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS. pp. 2017–2025 (2015)

work page 2017

[10] [10]

In: Advances in Neural Information Processing Systems

Jin, X., Xiao, H., Shen, X., Yang, J., Lin, Z., Chen, Y., Jie, Z., Feng, J., Yan, S.: Predicting scene parsing and motion dynamics in the future. In: Advances in Neural Information Processing Systems. pp. 6915–6924 (2017)

work page 2017

[11] [11]

In: Proceedings of the 34th International Conference on Machine Learning-Volume 70

Kalchbrenner, N., van den Oord, A., Simonyan, K., Danihelka, I., Vinyals, O., Graves, A., Kavukcuoglu, K.: Video pixel networks. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. pp. 1771–1779. JMLR. org (2017)

work page 2017

[12] [12]

Adam: A Method for Stochastic Optimization

Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

work page internal anchor Pith review Pith/arXiv arXiv 2014

[13] [13]

Panoptic Segmentation

Kirillov, A., He, K., Girshick, R., Rother, C., Doll´ ar, P.: Panoptic segmentation. arXiv preprint arXiv:1801.00868 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Efficient Ladder-style DenseNets for Semantic Segmentation of Large Images

Kreˇ so, I., Krapac, J.,ˇSegvi´ c, S.: Eﬃcient ladder-style densenets for semantic seg- mentation of large images. arXiv preprint arXiv:1905.05661 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1905

[15] [15]

The handbook of brain theory and neural networks 3361(10), 1995 (1995)

LeCun, Y., Bengio, Y., et al.: Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361(10), 1995 (1995)

work page 1995

[16] [16]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Lin, T.Y., Doll´ ar, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)

work page 2017

[17] [17]

In: Proceedings of the European Con- ference on Computer Vision (ECCV)

Luc, P., Couprie, C., Lecun, Y., Verbeek, J.: Predicting future instance segmenta- tion by forecasting convolutional features. In: Proceedings of the European Con- ference on Computer Vision (ECCV). pp. 584–599 (2018) 14 Josip ˇSari´ c, Marin Orˇ si´ c, Ton´ ci Antunovi´ c, Sacha Vraˇ zi´ c, and Siniˇ saˇSegvi´ c

work page 2018

[18] [18]

In: Proceedings of the IEEE International Conference on Computer Vision

Luc, P., Neverova, N., Couprie, C., Verbeek, J., LeCun, Y.: Predicting deeper into the future of semantic segmentation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 648–657 (2017)

work page 2017

[19] [19]

In: Advances in neural information processing systems

Luo, W., Li, Y., Urtasun, R., Zemel, R.: Understanding the eﬀective receptive ﬁeld in deep convolutional neural networks. In: Advances in neural information processing systems. pp. 4898–4906 (2016)

work page 2016

[20] [20]

Deep multi-scale video prediction beyond mean square error

Mathieu, M., Couprie, C., LeCun, Y.: Deep multi-scale video prediction beyond mean square error. arXiv preprint arXiv:1511.05440 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[21] [21]

BMVC (2018)

Nabavi, S.S., Rochan, M., Wang, Y.: Future semantic segmentation with convolu- tional lstm. BMVC (2018)

work page 2018

[22] [22]

In Defense of Pre-trained ImageNet Architectures for Real-time Semantic Segmentation of Road-driving Images

Orˇ si´ c, M., Kreˇ so, I., Bevandi´ c, P.,ˇSegvi´ c, S.: In defense of pre-trained imagenet architectures for real-time semantic segmentation of road-driving images. arXiv preprint arXiv:1903.08469 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1903

[23] [23]

In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

Sun, D., Yang, X., Liu, M.Y., Kautz, J.: Pwc-net: Cnns for optical ﬂow using pyramid, warping, and cost volume. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8934–8943 (2018)

work page 2018

[24] [24]

Recurrent Flow-Guided Semantic Forecasting

Terwilliger, A.M., Brazil, G., Liu, X.: Recurrent ﬂow-guided semantic forecasting. arXiv preprint arXiv:1809.08318 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

Anticipating Visual Representations from Unlabeled Video

Vondrick, C., Pirsiavash, H., Torralba, A.: Anticipating the future by watching unlabeled video. arXiv preprint arXiv:1504.08023 2 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[26] [26]

In: International Conference on Image Analysis and Processing

Vukoti´ c, V., Pintea, S.L., Raymond, C., Gravier, G., van Gemert, J.C.: One-step time-dependent future video frame prediction with a convolutional encoder-decoder neural network. In: International Conference on Image Analysis and Processing. pp. 140–151. Springer (2017)

work page 2017

[27] [27]

In: Advances in neural information processing systems

Xingjian, S., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.c.: Convo- lutional lstm network: A machine learning approach for precipitation nowcasting. In: Advances in neural information processing systems. pp. 802–810 (2015)

work page 2015

[28] [28]

In: CVPR

Yang, M., Yu, K., Zhang, C., Li, Z., Yang, K.: Denseaspp for semantic segmentation in street scenes. In: CVPR. pp. 3684–3692 (2018)

work page 2018

[29] [29]

Multi-Scale Context Aggregation by Dilated Convolutions

Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

work page internal anchor Pith review Pith/arXiv arXiv 2015

[30] [30]

In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)

Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J.: Pyramid scene parsing network. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (July 2017)

work page 2017

[31] [31]

Zhu, X., Hu, H., Lin, S., Dai, J.: Deformable convnets v2: More deformable, better results. arXiv preprint arXiv:1811.11168 (2018) Single Level Feature-to-Feature Forecasting with Deformable Convolutions - Supplementary Material Josip ˇSari´ c1, Marin Orˇ si´ c1, Ton´ ci Antunovi´ c2, Sacha Vraˇ zi´ c2, and Siniˇ sa ˇSegvi´ c1 1 University of Zagreb, Facu...

work page internal anchor Pith review Pith/arXiv arXiv 2018