Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

Heechul Yun; Qitao Weng

arXiv: 2605.29138 · v1 · pith:XHT2RQZFnew · submitted 2026-05-27 · cs.RO · cs.AI · cs.LG · cs.SY · eess.SY

Pith reviewed 2026-06-29 11:20 UTC · model grok-4.3

keywords: multi-resolution CNN · autonomous driving · latency-accuracy tradeoff · CARLA simulator · end-to-end learning · batch normalization · runtime adaptation

The pith

A multi-resolution CNN selects input scale at runtime to improve safety metrics under a latency budget in autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the latency-optimal input resolution for a driving network changes with scene context and available compute, making any single fixed-resolution model suboptimal. It introduces an end-to-end CNN for the CARLA simulator that supports multiple resolutions through per-resolution batch normalization, permitting runtime selection of the best scale within a given latency limit. Resolution retargeting allows the network to be trained across resolutions even without access to the original training dataset. When evaluated on CARLA routes, the approach produces consistent reductions in lane invasions, red-light infractions, and collisions relative to fixed-resolution baselines.

Core claim

A convolutional neural network equipped with per-resolution batch normalization and resolution retargeting can choose its input resolution dynamically at inference time to respect a latency budget, resulting in lower rates of lane invasions, red-light violations, and collisions across CARLA driving routes than any single fixed-resolution counterpart.
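The runtime selection step in this claim can be illustrated with a minimal sketch. It assumes a table of profiled inference latencies per input scale and picks the largest scale that fits the budget. The function name `select_resolution`, the `PROFILE` table, and its numbers are hypothetical and not taken from the paper. The "largest feasible scale" policy is itself a simplifying assumption, since the paper argues that the best scale also depends on scene context.

```python
from typing import Dict, Optional


def select_resolution(latency_ms: Dict[float, float], budget_ms: float) -> Optional[float]:
    """Return the largest input scale whose profiled latency fits the budget.

    Returns None when no supported scale meets the budget.
    """
    feasible = [scale for scale, lat in latency_ms.items() if lat <= budget_ms]
    return max(feasible) if feasible else None


# Hypothetical profiled latencies in milliseconds, keyed by input scale.
PROFILE = {0.5: 12.0, 0.75: 21.0, 1.0: 35.0}

print(select_resolution(PROFILE, budget_ms=25.0))  # 0.75
print(select_resolution(PROFILE, budget_ms=5.0))   # None
```

A context-aware selector would replace the `max` rule with a policy that also takes the scene into account; this sketch only covers the latency constraint.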

What carries the argument

Per-resolution batch normalization: activations are normalized separately for each supported input resolution, which supports stable training and runtime scale selection.
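As a rough illustration of the idea, the sketch below keeps one set of running statistics per supported scale for a single scalar feature. It is written in plain Python for readability and is not the paper's implementation: the class name `PerResolutionBatchNorm`, the momentum value, the use of the biased batch variance, and the omission of learnable affine parameters are all assumptions made here.

```python
import math
from typing import Dict, List


class PerResolutionBatchNorm:
    """Batch normalization with separate running statistics per input scale."""

    def __init__(self, scales: List[float], momentum: float = 0.1, eps: float = 1e-5):
        self.momentum = momentum
        self.eps = eps
        # One running mean/variance per scale; all other weights would be shared.
        self.running_mean: Dict[float, float] = {s: 0.0 for s in scales}
        self.running_var: Dict[float, float] = {s: 1.0 for s in scales}

    def forward(self, x: List[float], scale: float, training: bool) -> List[float]:
        if training:
            # Normalize with batch statistics and update only this scale's running stats.
            mean = sum(x) / len(x)
            var = sum((v - mean) ** 2 for v in x) / len(x)
            m = self.momentum
            self.running_mean[scale] = (1 - m) * self.running_mean[scale] + m * mean
            self.running_var[scale] = (1 - m) * self.running_var[scale] + m * var
        else:
            # At inference, use the statistics stored for the selected scale.
            mean = self.running_mean[scale]
            var = self.running_var[scale]
        return [(v - mean) / math.sqrt(var + self.eps) for v in x]


bn = PerResolutionBatchNorm([0.5, 1.0])
out_small = bn.forward([1.0, 3.0], scale=0.5, training=True)
out_full = bn.forward([10.0, 30.0], scale=1.0, training=True)
print(bn.running_mean)  # the two scales now hold different running means
```

Switching the input scale at runtime then only changes which statistics the normalization layers read, which is what makes selection cheap.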

Load-bearing premise

Per-resolution batch normalization combined with resolution retargeting enables effective multi-resolution training and runtime selection without introducing significant accuracy loss or training instability compared to standard single-resolution training.

What would settle it

Evaluating the multi-resolution model on identical CARLA routes against the strongest fixed-resolution baseline and observing no reduction in lane invasions, red-light infractions, or collisions.
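This check can be sketched for a single safety metric as follows. The helper names `strongest_baseline` and `shows_reduction` and all per-route counts are hypothetical placeholders, not results from the paper. The sketch compares means only and makes no claim about statistical significance.

```python
from statistics import mean
from typing import Dict, List


def strongest_baseline(baselines: Dict[str, List[float]]) -> str:
    """Name of the fixed-resolution baseline with the lowest mean per-route count."""
    return min(baselines, key=lambda name: mean(baselines[name]))


def shows_reduction(multi_res: List[float], baselines: Dict[str, List[float]]) -> bool:
    """True if the multi-resolution model beats the strongest fixed baseline on the mean."""
    best = strongest_baseline(baselines)
    return mean(multi_res) < mean(baselines[best])


# Placeholder per-route lane-invasion counts on identical routes (illustrative only).
fixed = {"scale_0.5": [3.0, 5.0], "scale_1.0": [2.0, 4.0]}
multi = [1.0, 2.0]

print(strongest_baseline(fixed))      # scale_1.0
print(shows_reduction(multi, fixed))  # True
```

The same comparison would be repeated for red-light infractions and collisions; a `False` result on all three would be the outcome described above.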

Figures

Figures reproduced from arXiv: 2605.29138 by Heechul Yun, Qitao Weng.

**Figure 2.** Baseline network architecture. Our only modification to the policy network is to the convolutional backbone (ResNet-34), which we extend to support multiple input scales via per-resolution batch normalization. Note that the segmentation head is also extended to support multiple input scales, but it is only used during training for auxiliary learning, as in the original setup.

**Figure 3.** Traffic-light infractions vs. input scale at a 50 ms control period, …

**Figure 4.** Lane invasions per route (mean ± sd) at a 50 ms control period, with and without 50 ms of injected delay. Accompanying text (E. Runtime Latency-Accuracy Tradeoffs): In this experiment, we investigate the potential benefits of our approach by dynamically switching resolutions depending on the environment, compared to the fixed-resolution baselines. Here, we assume an “oracle”, which decides which resolution is ideal for a given environm…

**Figure 5.** Success and collision rates vs. injected delay (…

Original abstract

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a workable multi-resolution CNN for CARLA driving that switches input scales at runtime via per-resolution batch norm and retargeting, but lacks the per-scale accuracy checks needed to show the gains aren't just from uneven training.

The main thing here is a single CNN trained to handle several input resolutions for CARLA end-to-end driving. Per-resolution batch normalization keeps separate stats for each scale, and resolution retargeting lets them train across scales without the original dataset. That pair of tricks is the concrete engineering step they contribute.

The approach directly tackles the fact that the best resolution changes with scene and available compute. They run the model in the simulator and report fewer lane invasions, red-light runs, and collisions than fixed-resolution baselines. That matches the practical need in autonomous driving where both prediction quality and end-to-end delay count.

The soft spot is the missing comparison to single-resolution models trained from scratch at each fixed scale. The abstract claims no big accuracy drop or training trouble, yet supplies no ablation tables or per-resolution metrics. Without those numbers it is hard to tell whether the safety improvements come from the multi-resolution design or simply from the way the shared backbone was trained. The stress-test note flags exactly this gap, and the abstract does not close it.

This is for people who build real-time perception stacks for vehicles or robots and need to adapt compute on the fly. A reader who wants to try the retargeting trick or the per-resolution norm layers could get something usable out of it.

It is worth sending for peer review. The problem is real, the methods are specific, and the simulator results are a reasonable starting point. The authors will need to add the per-scale comparisons and training details before the claims can be taken as settled.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes a multi-resolution end-to-end CNN for the CARLA urban driving task that uses per-resolution batch normalization and resolution retargeting to support runtime input-scale selection under latency constraints. It claims that this architecture yields consistent improvements in per-route safety metrics (lane invasions, red-light infractions, collisions) relative to fixed-resolution baselines.

Significance. If the multi-resolution model truly matches or exceeds the per-scale accuracy of dedicated single-resolution models, the approach would provide a practical mechanism for context-aware latency-accuracy adaptation in real-time cyber-physical systems. The resolution-retargeting technique for training without the original dataset is a pragmatic engineering contribution.

major comments (1)

[Abstract] The headline safety-metric gains rest on the assumption that the shared multi-resolution model, when evaluated at any chosen scale, performs at least as well as a model trained from scratch at that exact scale. No ablation, per-resolution accuracy table, or direct comparison of multi-resolution weights against single-resolution counterparts at identical input resolutions is supplied; without this evidence the observed gains cannot be attributed to the multi-resolution capability rather than unequal training effort or baseline construction.

minor comments (1)

The abstract states that results were obtained in CARLA but provides no information on the number of routes, evaluation protocol, statistical significance, or variance across runs.
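The kind of reporting this comment asks for could look like the sketch below, which summarizes a per-route metric across repeated runs as mean and sample standard deviation. The function name `summarize` and the input values are hypothetical and not drawn from the paper.

```python
from statistics import mean, stdev
from typing import List, Tuple


def summarize(per_route_counts: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of a per-route metric across runs."""
    if len(per_route_counts) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return mean(per_route_counts), stdev(per_route_counts)


# Placeholder counts from three repeated runs of one route (illustrative only).
m, s = summarize([2.0, 4.0, 6.0])
print(f"{m:.2f} ± {s:.2f}")  # 4.00 ± 2.00
```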

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment correctly identifies a missing comparison that is needed to strengthen the attribution of results to the multi-resolution design. We will revise the manuscript accordingly.

Referee: [Abstract] The headline safety-metric gains rest on the assumption that the shared multi-resolution model, when evaluated at any chosen scale, performs at least as well as a model trained from scratch at that exact scale. No ablation, per-resolution accuracy table, or direct comparison of multi-resolution weights against single-resolution counterparts at identical input resolutions is supplied; without this evidence the observed gains cannot be attributed to the multi-resolution capability rather than unequal training effort or baseline construction.

Authors: We agree that a direct per-resolution comparison is required to substantiate the claims. The current manuscript does not include an ablation table contrasting the multi-resolution model (evaluated at each scale) against dedicated single-resolution models trained from scratch at the same resolutions. In the revised manuscript we will add this ablation, reporting lane invasions, red-light infractions, and collisions for both the shared model and the scale-specific baselines at each input resolution. This will allow readers to verify that performance is preserved or improved by the multi-resolution architecture (including per-resolution batch normalization) rather than differences in training effort.

Revision: yes

Circularity Check

0 steps flagged

No circularity; empirical evaluation of multi-resolution architecture

The paper proposes a multi-resolution CNN using per-resolution batch normalization and resolution retargeting, then evaluates it empirically in the CARLA simulator against fixed-resolution baselines. All load-bearing claims (safety metric improvements) rest on reported simulator runs rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are invoked that could reduce to the inputs by construction. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based solely on abstract; no free parameters, axioms, or invented entities are specified in the provided text.
