Single Level Feature-to-Feature Forecasting with Deformable Convolutions
Pith reviewed 2026-05-24 15:52 UTC · model grok-4.3
The pith
Deformable convolutions enable single-level feature forecasting of future semantic segmentations by operating only on coarse abstract features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim has two parts. First, expressing feature-to-feature forecasting with deformable convolutions increases modeling power, because a single feature map can then represent varied motion patterns. Second, when forecasting is restricted to the most abstract features at very coarse resolution (by removing lateral connections from the upsampling path), the resulting models outperform their regular and dilated counterparts on future semantic segmentation while keeping the parameter count nearly unchanged.
What carries the argument
Single-level feature-to-feature forecasting performed with deformable convolutions on the coarsest abstract features of a segmentation model that lacks lateral connections in its upsampling path.
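To make the mechanism concrete, here is a minimal pure-Python sketch of a single-channel 3x3 deformable convolution evaluated at one output location. Each kernel tap reads the input at its regular grid position plus a fractional offset, using bilinear interpolation. The function names, the zero-padding convention, and the toy feature map are assumptions chosen for illustration; this is not the paper's implementation.

```python
import math

def bilinear(feat, y, x):
    """Sample a 2D list `feat` at fractional (y, x); outside the map counts as zero."""
    h, w = len(feat), len(feat[0])
    y0, x0 = math.floor(y), math.floor(x)
    total = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                total += weight * feat[yy][xx]
    return total

def deform_conv_at(feat, weights, offsets, cy, cx):
    """3x3 deformable convolution response at output location (cy, cx).

    weights: 9 kernel weights in row-major order.
    offsets: 9 (dy, dx) pairs, one per kernel tap; all zeros gives a regular conv.
    """
    taps = [(ky, kx) for ky in (-1, 0, 1) for kx in (-1, 0, 1)]
    out = 0.0
    for w, (ky, kx), (dy, dx) in zip(weights, taps, offsets):
        out += w * bilinear(feat, cy + ky + dy, cx + kx + dx)
    return out

# Toy feature map (hypothetical): a single bright cell at row 2, column 3.
feat = [[0.0] * 5 for _ in range(5)]
feat[2][3] = 1.0

center_only = [0.0] * 4 + [1.0] + [0.0] * 4  # kernel that reads only the centre tap
no_offsets = [(0.0, 0.0)] * 9
shift_right = [(0.0, 1.0)] * 9               # every tap looks one cell to the right
shift_half = [(0.0, 0.5)] * 9                # fractional offset, interpolated

regular = deform_conv_at(feat, center_only, no_offsets, 2, 2)
shifted = deform_conv_at(feat, center_only, shift_right, 2, 2)
half = deform_conv_at(feat, center_only, shift_half, 2, 2)
```

With zero offsets the centre tap misses the bright cell; with a one-cell offset the same kernel picks it up. In a deformable layer the offsets are predicted separately for every output location, which is how one feature map can follow several motion patterns at once.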
If this is right
- Deformable convolutions allow one feature map to represent several different motion patterns simultaneously.
- Forecasting at the coarsest abstract level suffices to produce accurate future semantic segmentations.
- The method achieves state-of-the-art accuracy on Cityscapes validation for nine-timestep forecasts.
- The parameter overhead remains minimal compared with dilated or regular convolution baselines.
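The parameter claim can be sanity-checked with simple arithmetic. The sketch below counts weights for a regular k x k convolution and for a deformable one whose offsets come from an extra k x k convolution with 2*k*k output channels (offsets only, no modulation mask). Dilation changes the sampling grid but adds no weights. The channel width of 128 is an assumed value for illustration, not a figure from the paper.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Number of learnable parameters in a k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)

def deform_conv_params(c_in, c_out, k, bias=True):
    """Main conv weights plus an offset-predicting conv with 2*k*k output channels."""
    offset_branch = conv_params(c_in, 2 * k * k, k, bias=True)
    return conv_params(c_in, c_out, k, bias) + offset_branch

C = 128  # assumed channel width, for illustration only
regular = conv_params(C, C, 3)
deformable = deform_conv_params(C, C, 3)
overhead = (deformable - regular) / regular
print(regular, deformable, f"{overhead:.1%}")
```

Under these assumptions the relative overhead per layer is roughly 2*k*k / c_out, so it shrinks as layers get wider. The paper's actual counts depend on its layer widths, which this page does not give.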
Where Pith is reading between the lines
- The same single-level deformable forecasting could be applied to other dense prediction tasks such as depth or optical flow anticipation.
- Operating only at coarse resolution may reduce the compute needed for real-time deployment in decision-making systems.
- The separation of high-level forecasting from low-level detail synthesis suggests a modular architecture for longer-horizon video prediction.
Load-bearing premise
Restricting all forecasting to the single coarsest abstract feature level is both necessary and sufficient for accurate future semantic segmentation.
What would settle it
A controlled experiment in which an otherwise identical architecture with standard convolutions, or with lateral connections in the upsampling path, reaches equal or higher accuracy on the Cityscapes nine-step forecasting task would falsify the central design choice.
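Accuracy in such a comparison would typically be reported as mean intersection-over-union (mIoU). A minimal sketch, assuming predictions and labels are flat lists of class indices and that label 255 marks ignored pixels; the function name and the toy data are hypothetical.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that occur in the prediction or the ground truth."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not count
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        return float("nan")  # no labeled pixels at all
    return sum(ious) / len(ious)

# Toy data (hypothetical): six pixels, three classes, one ignored pixel.
target = [0, 0, 1, 1, 2, 255]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(pred, target, num_classes=3)
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. Running the same function on each forecast horizon would give the per-timestep numbers needed for a side-by-side comparison.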
Original abstract
Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such design ensures that the forecasting addresses only the most abstract features on a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power due to being able to represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state of the art performance on the Cityscapes validation set when forecasting nine timesteps into the future.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes single-level feature-to-feature forecasting for future semantic segmentation in driving scenes. It employs a segmentation backbone without lateral connections in the upsampling path so that forecasting operates exclusively on coarse, abstract features; deformable convolutions are used to model diverse motion patterns within feature maps. Experiments on Cityscapes report that deformable-convolution variants outperform regular and dilated counterparts with minimal parameter overhead and achieve state-of-the-art performance when forecasting nine timesteps ahead.
Significance. If the performance claims hold under controlled comparisons, the work would demonstrate that deformable convolutions can increase modeling capacity for motion without substantial parameter cost, while the single-level coarse-feature design simplifies the forecasting task. The approach is directly relevant to autonomous-driving anticipation and could be extended to other video-prediction settings.
major comments (3)
- [Abstract] Abstract: the state-of-the-art claim for nine-timestep forecasting on the Cityscapes validation set is presented without error bars, explicit baseline model specifications, or quantitative comparison to prior forecasting methods that employ segmentation backbones with lateral/skip connections. This information is required to establish whether the reported gains arise from the deformable-convolution forecasting module or from a weaker base segmentation accuracy.
- [Abstract] Abstract, paragraph 2: the design decision to remove lateral connections is justified by the claim that it 'ensures that the forecasting addresses only the most abstract features on a very coarse resolution,' yet no ablation or side-by-side evaluation against an otherwise identical model with skip connections is provided. Because the SOTA result is measured against external baselines that typically retain lateral connections, this architectural restriction is load-bearing for the central performance claim and must be controlled for.
- [Abstract] The manuscript reports that deformable-convolution models 'outperform their regular and dilated counterparts' but supplies no numerical tables, per-timestep metrics, or statistical significance tests for the Cityscapes experiments. Without these data it is impossible to assess the magnitude or reliability of the reported improvement.
minor comments (1)
- [Abstract] The abstract states that the method 'minimally increas[es] the number of parameters,' but no concrete parameter counts or FLOPs comparisons are supplied.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our work. We address each major comment point by point below. Where the manuscript requires additional data or controls, we will revise accordingly.
- Referee: [Abstract] Abstract: the state-of-the-art claim for nine-timestep forecasting on the Cityscapes validation set is presented without error bars, explicit baseline model specifications, or quantitative comparison to prior forecasting methods that employ segmentation backbones with lateral/skip connections. This information is required to establish whether the reported gains arise from the deformable-convolution forecasting module or from a weaker base segmentation accuracy.
Authors: We agree that the abstract would benefit from additional context. In the revision we will add error bars (from repeated runs with different seeds), explicitly name the baseline models, and insert a compact quantitative comparison table against prior methods that retain skip connections. All variants in our experiments share the identical segmentation backbone, so differences are attributable to the forecasting module rather than base accuracy. revision: yes
- Referee: [Abstract] Abstract, paragraph 2: the design decision to remove lateral connections is justified by the claim that it 'ensures that the forecasting addresses only the most abstract features on a very coarse resolution,' yet no ablation or side-by-side evaluation against an otherwise identical model with skip connections is provided. Because the SOTA result is measured against external baselines that typically retain lateral connections, this architectural restriction is load-bearing for the central performance claim and must be controlled for.
Authors: We acknowledge the absence of a direct ablation on skip connections. The single-level coarse-feature design is a deliberate modeling choice to simplify the forecasting task. In the revised manuscript we will add an ablation that trains otherwise identical models with and without lateral connections, reporting the resulting forecasting accuracy to isolate the contribution of the architectural restriction. revision: yes
- Referee: [Abstract] The manuscript reports that deformable-convolution models 'outperform their regular and dilated counterparts' but supplies no numerical tables, per-timestep metrics, or statistical significance tests for the Cityscapes experiments. Without these data it is impossible to assess the magnitude or reliability of the reported improvement.
Authors: We will expand the experimental section to include full numerical tables showing per-timestep mIoU for all three convolution variants, together with the corresponding parameter counts. Where multiple runs are available we will also report statistical significance. These tables will be referenced from the abstract. revision: yes
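The error bars promised in the rebuttal amount to a mean and a sample standard deviation over seeds. A minimal sketch using Python's `statistics` module; the variant names and mIoU values are made-up placeholders, not results from the paper.

```python
import statistics

def summarize(runs):
    """Return (mean, sample standard deviation) of per-seed scores."""
    return statistics.mean(runs), statistics.stdev(runs)

# Placeholder per-seed mIoU values; NOT results from the paper.
runs_by_variant = {
    "variant_a": [50.0, 51.0, 52.0],
    "variant_b": [53.0, 54.0, 55.0],
}

for name, runs in runs_by_variant.items():
    mean, std = summarize(runs)
    print(f"{name}: {mean:.1f} +/- {std:.1f} mIoU over {len(runs)} seeds")
```

With only a handful of seeds the standard deviation is itself noisy, so a table would ideally state the number of runs next to each entry.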
Circularity Check
No circularity in derivation chain; claims rest on external benchmark evaluation.
The paper presents an architectural design (segmentation backbone without lateral connections) and a forecasting module (deformable convolutions for feature-to-feature prediction) evaluated on the external Cityscapes validation set using standard metrics. No equations, fitted parameters, or self-citations reduce the SOTA claim or method to inputs by construction. The design choice is an explicit modeling decision to restrict forecasting to coarse abstract features, not a self-referential loop. The performance claim is falsifiable against external baselines and does not rely on prior author work for its core justification.
Axiom & Free-Parameter Ledger
free parameters (1)
- deformable convolution offset learning rate and kernel size
axioms (1)
- Domain assumption: standard backpropagation and SGD training converge to a useful local minimum for the forecasting task.
Paper authors: Josip Šarić, Marin Oršić, Tonći Antunović, Sacha Vražić, and Siniša Šegvić.