Can Attribution Predict Risk? From Multi-View Attribution to Planning Risk Signals in End-to-End Autonomous Driving
Pith reviewed 2026-05-08 13:15 UTC · model grok-4.3
The pith
Statistics from hierarchical attribution of six-view images predict planning risks in end-to-end autonomous driving.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A hierarchical attribution procedure optimized for L2 consistency with the original planned trajectory produces three statistics—attribution entropy, within-camera spatial variance, and cross-camera Gini coefficient—that serve as predictive signals for planning risk; these signals achieve a Spearman correlation of 0.30 ± 0.07 with trajectory error and an AUROC of 0.77 ± 0.04 for collision detection across BridgeAD, UniAD, and GenAD, generalize to held-out scenes, and remain stable when an alternative attribution method is substituted.
What carries the argument
The hierarchical coarse-to-fine attribution strategy that searches candidate regions across all six camera views and refines them under an L2 consistency objective with the planned trajectory.
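The paper gives no code, but the described search admits a compact reading: score candidate regions by how much the planned trajectory moves (in L2) when the region is perturbed, keep the most influential regions from a coarse grid over all six views, then re-score finer cells inside them. The sketch below is a minimal illustration under assumed interfaces; the `planner` callable, the grid sizes, and the zero-masking perturbation are stand-ins for illustration, not the authors' implementation.

```python
import numpy as np

def coarse_to_fine_attribution(planner, images, coarse=8, fine=2, keep=16):
    """Hypothetical sketch: images is a (6, H, W, 3) six-view input and
    planner(images) returns the planned trajectory as a (T, 2) array.
    Returns a (6, H, W) attribution map."""
    ref_traj = planner(images)
    views, H, W, _ = images.shape
    attribution = np.zeros((views, H, W))
    th, tw = H // coarse, W // coarse          # coarse tile size

    # Coarse stage: blank out one large tile at a time in every view and
    # measure the L2 shift of the planned trajectory.
    tiles, scores = [], []
    for v in range(views):
        for y in range(0, H, th):
            for x in range(0, W, tw):
                masked = images.copy()
                masked[v, y:y + th, x:x + tw] = 0
                scores.append(np.linalg.norm(planner(masked) - ref_traj))
                tiles.append((v, y, x))

    # Keep the tiles whose removal perturbs the trajectory the most.
    for idx in np.argsort(scores)[::-1][:keep]:
        v, y, x = tiles[idx]
        fh, fw = th // fine, tw // fine        # fine cell size
        # Fine stage: re-score smaller cells inside the retained tile.
        for fy in range(y, y + th, fh):
            for fx in range(x, x + tw, fw):
                masked = images.copy()
                masked[v, fy:fy + fh, fx:fx + fw] = 0
                attribution[v, fy:fy + fh, fx:fx + fw] = np.linalg.norm(
                    planner(masked) - ref_traj)
    return attribution
```

The greedy tile ranking above is only a placeholder for whatever selection rule the paper actually uses; the intent is to show where the L2 consistency term enters as the scoring objective.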
If this is right
- The three attribution statistics can act as risk monitors that operate without any auxiliary safety model.
- The same signals apply across multiple distinct end-to-end planners and survive changes in scene distribution.
- Attribution can localize which camera views and image regions most influence a given trajectory prediction.
- The approach stays effective when the underlying attribution algorithm is replaced by another baseline method.
Where Pith is reading between the lines
- These risk signals could be fed back into the planner at inference time to trigger conservative trajectory adjustments when attribution entropy or cross-camera imbalance is high.
- The framework offers a route to post-hoc auditing of deployed driving systems by logging which visual regions were decisive in each planned maneuver.
- Similar attribution-derived statistics might transfer to other continuous-output perception-planning pipelines such as robotic manipulation or drone navigation.
Load-bearing premise
The optimized attribution maps actually capture the visual evidence the planner uses rather than merely satisfying the consistency objective by coincidence.
What would settle it
A drop of the reported correlations to near zero, or a failure of the statistics to flag collisions, when the same procedure is run on a fresh driving dataset collected with different cameras or model training protocols.
Original abstract
End-to-end autonomous driving models generate future trajectories from multi-view inputs, improving system integration but introducing opaque decisions and hard-to-localize risks. Existing methods either rely on auxiliary monitoring models or generate textual explanations, but are decoupled from the planning process and fail to reveal the visual evidence underlying trajectory generation. While attribution offers a direct alternative, planning differs from image classification by taking six-view camera images as input and predicting continuous multi-step trajectories, requiring attribution to capture both critical views and regions and their influence on outputs. Moreover, whether attribution maps can support risk identification remains underexplored. To address this, we propose a hierarchical attribution framework for end-to-end planning. Specifically, using L2 consistency with the original trajectory as the objective, we design a coarse-to-fine region attribution strategy that searches candidate regions across the full six-view input and refines attribution within them. We further extract three attribution statistics as predictive signals for planning risk, including attribution entropy to measure how concentrated the planner's reliance is over the joint visual space, within-camera spatial variance to characterize how spread out the attribution is within each view, and cross-camera Gini coefficient to quantify how unevenly attribution is distributed across the six cameras. Experiments on BridgeAD, UniAD, and GenAD show that these statistics correlate with planning risk, achieving Spearman correlations of $0.30 \pm 0.07$ with trajectory error and AUROC of $0.77 \pm 0.04$ for collision detection. The signal generalizes to held-out scenes with negligible degradation and remains stable under an alternative attribution baseline.
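As a concrete reading of the three statistics, the sketch below computes them from six per-camera attribution maps. The exact normalizations (treating the pooled attribution as a probability distribution for the entropy, attribution-weighted variance of pixel coordinates per view, and the standard Gini formula over per-camera attribution mass) are assumptions made for illustration; the paper's definitions may differ in detail.

```python
import numpy as np

def attribution_statistics(maps):
    """maps: array of shape (6, H, W) with non-negative attribution values,
    one map per camera. Returns the three risk signals under assumed
    normalizations."""
    maps = np.asarray(maps, dtype=float)
    total = maps.sum() + 1e-12

    # 1. Attribution entropy over the joint visual space: treat all pixels
    #    of all views as a single probability distribution.
    p = (maps / total).ravel()
    entropy = float(-np.sum(p * np.log(p + 1e-12)))

    # 2. Within-camera spatial variance: per view, the variance of pixel
    #    coordinates weighted by that view's attribution, then averaged.
    H, W = maps.shape[1:]
    ys, xs = np.mgrid[0:H, 0:W]
    variances = []
    for m in maps:
        w = m / (m.sum() + 1e-12)
        cy, cx = (w * ys).sum(), (w * xs).sum()
        variances.append((w * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum())
    spatial_variance = float(np.mean(variances))

    # 3. Cross-camera Gini coefficient over the per-camera attribution mass.
    mass = np.sort(maps.reshape(len(maps), -1).sum(axis=1))
    n = len(mass)
    gini = float(((2 * np.arange(1, n + 1) - n - 1) * mass).sum()
                 / (n * mass.sum() + 1e-12))

    return entropy, spatial_variance, gini
```

The three numbers summarize, respectively, how concentrated the planner's reliance is over the joint visual space, how spread out it is within each view, and how unevenly it is split across the six cameras.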
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that a hierarchical attribution framework for end-to-end autonomous driving planners (BridgeAD, UniAD, GenAD) can extract risk-predictive signals from multi-view camera inputs. Attribution maps are generated via coarse-to-fine search optimized for L2 consistency with the original predicted trajectory; three statistics are then derived—attribution entropy (concentration over joint visual space), within-camera spatial variance, and cross-camera Gini coefficient (unevenness across views)—and shown to correlate with planning risk, yielding Spearman 0.30 ± 0.07 with trajectory error and AUROC 0.77 ± 0.04 for collision detection. The signals generalize to held-out scenes with negligible degradation and remain stable under an alternative attribution baseline.
Significance. If the statistics prove independent of the consistency objective, the work would supply a direct, planning-coupled mechanism for surfacing visual risk evidence in opaque end-to-end models, a meaningful advance over decoupled monitors or textual explanations. Credit is due for the multi-model evaluation, explicit held-out generalization test, and stability check against an alternative baseline; these elements strengthen the empirical case. The moderate correlation magnitudes, however, position the signals as potentially useful complements rather than standalone predictors.
major comments (2)
- [Abstract, method description] The L2 consistency objective explicitly searches regions whose perturbation preserves the original trajectory. Because the three statistics are extracted from these maps, any measure sensitive to output magnitude or spatial spread can correlate with trajectory error by construction. The reported Spearman 0.30 ± 0.07 and AUROC 0.77 ± 0.04 therefore require an explicit ablation that removes or replaces the L2 term (e.g., output-agnostic perturbation or gradient-based attribution without consistency) to confirm that the statistics reflect genuine visual risk evidence rather than an artifact of tying attribution to the very trajectory whose quality is measured.
- [Experiments] The generalization claim states 'negligible degradation' on held-out scenes, yet no quantitative delta (e.g., change in Spearman or AUROC) or scene count is supplied. Without these numbers or a description of the scene filtering criteria, it is impossible to judge whether the stability result is robust or sensitive to post-hoc selection.
minor comments (1)
- [Abstract] Abstract: the ±0.07 and ±0.04 intervals are reported without stating whether they reflect standard error, standard deviation, or bootstrap; clarify the exact statistic and number of runs or scenes underlying them.
Simulated Author's Rebuttal
We are grateful to the referee for the constructive comments, which have helped us identify areas for improvement in the manuscript. Below, we provide detailed responses to each major comment and indicate the revisions we will make.
Point-by-point responses
Referee [Abstract, method description]: the L2 consistency objective explicitly searches regions whose perturbation preserves the original trajectory. Because the three statistics are extracted from these maps, any measure sensitive to output magnitude or spatial spread can correlate with trajectory error by construction. The reported Spearman 0.30 ± 0.07 and AUROC 0.77 ± 0.04 therefore require an explicit ablation that removes or replaces the L2 term (e.g., output-agnostic perturbation or gradient-based attribution without consistency) to confirm the statistics reflect genuine visual risk evidence rather than an artifact of tying attribution to the very trajectory whose quality is measured.
Authors: We appreciate the referee's concern regarding potential artifacts from the L2 consistency objective. Although the hierarchical search is designed to identify regions that influence the planner's output, we acknowledge that an explicit ablation is warranted to strengthen the claim. In the revised manuscript, we will add an ablation study using gradient-based attribution methods (such as Integrated Gradients) applied directly without the L2 consistency optimization; a minimal sketch of such a baseline appears after these responses. We will report the resulting Spearman correlations and AUROCs to demonstrate that the risk-predictive signals are not solely due to the consistency term. revision: yes
Referee [Experiments]: the generalization claim states 'negligible degradation' on held-out scenes, yet no quantitative delta (e.g., change in Spearman or AUROC) or scene count is supplied. Without these numbers or a description of scene filtering criteria, it is impossible to judge whether the stability result is robust or sensitive to post-hoc selection.
Authors: We thank the referee for pointing out this lack of detail. We will revise the Experiments section to include the quantitative results for the held-out scenes, specifically the changes in Spearman correlation and AUROC, the number of scenes used, and a clear description of the scene selection and filtering criteria to ensure transparency and allow proper evaluation of the generalization. revision: yes
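For concreteness, the gradient-based baseline promised in the first response could look like the sketch below: a plain Integrated Gradients-style attribution of the planned trajectory's L2 norm with respect to the six-view input, with no consistency-driven region search. The PyTorch interface, the `planner` callable, and the choice of the trajectory norm as the scalar target are assumptions for illustration, not the authors' code.

```python
import torch

def integrated_gradients_attribution(planner, images, steps=32):
    """images: (6, 3, H, W) tensor of the six camera views.
    Attributes the scalar L2 norm of the planned trajectory to each pixel
    by integrating gradients along a straight path from a black baseline.
    This is a generic IG sketch, not the paper's hierarchical method."""
    baseline = torch.zeros_like(images)
    total_grads = torch.zeros_like(images)
    for alpha in torch.linspace(0.0, 1.0, steps):
        # Interpolate between the baseline and the real input.
        x = (baseline + alpha * (images - baseline)).requires_grad_(True)
        traj = planner(x)                 # assumed to return (T, 2) waypoints
        scalar = traj.norm()              # scalar proxy for the planning output
        grad, = torch.autograd.grad(scalar, x)
        total_grads += grad
    ig = (images - baseline) * total_grads / steps
    return ig.abs().sum(dim=1)            # per-pixel attribution per view: (6, H, W)
```

Comparing the Spearman and AUROC obtained from such maps against those from the consistency-optimized maps is the ablation the referee asks for.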
Circularity Check
No circularity: risk-signal statistics are post-hoc empirical correlates, not definitional or fitted reductions.
full rationale
The derivation proceeds by first computing hierarchical attribution maps via an L2-consistency objective that identifies regions whose perturbation leaves the model's own trajectory unchanged, then extracting three descriptive statistics (entropy, within-camera variance, cross-camera Gini) from those maps, and finally measuring their Spearman correlation and AUROC against independent external risk labels (trajectory error versus ground-truth and collision events). Because the reported 0.30 ± 0.07 correlation and 0.77 ± 0.04 AUROC are measured quantities on held-out scenes rather than quantities that are algebraically identical to the L2 objective or to any fitted parameter, and because the paper additionally verifies stability under an alternative attribution baseline, the central claim does not reduce to its inputs by construction. No self-citation, ansatz smuggling, or renaming of known results is required for the reported numbers.
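Operationally, the check described here reduces to correlating per-scene statistics with externally measured risk labels. A minimal sketch, assuming per-scene arrays are already available (the synthetic data below is purely illustrative; scipy and scikit-learn supply the two metrics):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# Placeholder per-scene arrays: one attribution statistic (e.g. entropy),
# the planner's trajectory error vs. ground truth, and a binary collision flag.
rng = np.random.default_rng(0)
stat = rng.normal(size=500)
traj_error = 0.3 * stat + rng.normal(size=500)    # synthetic, for illustration only
collision = (stat + rng.normal(size=500) > 1.2).astype(int)

rho, p_value = spearmanr(stat, traj_error)        # rank correlation with risk
auroc = roc_auc_score(collision, stat)            # collision detection score

print(f"Spearman rho = {rho:.2f} (p = {p_value:.1e}), AUROC = {auroc:.2f}")
```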
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: L2 consistency with the original trajectory serves as a valid objective for guiding region attribution.
Reference graph
Works this paper leans on
- [1] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
- [2] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
- [3] Wenchao Sun, Xuewu Lin, Yining Shi, Chuang Zhang, Haoran Wu, and Sifa Zheng. SparseDrive: End-to-end autonomous driving via sparse scene representation. In IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801, 2025.
- [4] Bozhou Zhang, Nan Song, Xin Jin, and Li Zhang. Bridging past and future: End-to-end autonomous driving with historical prediction and planning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6854–6863, 2025.
- [5] Shi Xuan Leong, Caleb E Griesbach, Rui Zhang, Kourosh Darvish, Yuchi Zhao, Abhijoy Mandal, Yunheng Zou, Han Hao, Varinia Bernales, and Alán Aspuru-Guzik. Steering towards safe self-driving laboratories. Nature Reviews Chemistry, 9(10):707–722, 2025.
- [6] Jinhao Liang, Chaopeng Tan, Longhao Yan, Jingyuan Zhou, Guodong Yin, and Kaidi Yang. Interaction-aware trajectory prediction for safe motion planning in autonomous driving: A transformer-transfer learning approach. IEEE Transactions on Intelligent Transportation Systems, 2025.
- [7] Ashok Elluswamy. Tesla FSD v12: End-to-end neural network for autonomous driving. Tesla AI Day Presentation, 2024.
- [8] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):10164–10183, 2024.
- [9] Anton Kuznietsov, Balint Gyevnar, Cheng Wang, Steven Peters, and Stefano V Albrecht. Explainable AI for safe and trustworthy autonomous driving: A systematic review. IEEE Transactions on Intelligent Transportation Systems, 25(12):19342–19364, 2024.
- [10] Licheng Wen, Xuemeng Yang, Daocheng Fu, Xiaofeng Wang, Pinlong Cai, Xin Li, Tao Ma, Yingxuan Li, Linran Xu, Dengke Shang, et al. On the road with GPT-4V(ision): Explorations of utilizing visual-language model as autonomous driving agent. In ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
- [11] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision, pages 256–274, 2024.
- [12] Zhenhua Xu, Yan Bai, Yujia Zhang, Zhuoling Li, Fei Xia, Kwan-Yee K Wong, Jianqiang Wang, and Hengshuang Zhao. DriveGPT4-V2: Harnessing large language model capabilities for enhanced closed-loop autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 17261–17270, 2025.
- [13] Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026.
- [14] Mert Keser, Gesina Schwalbe, Azarm Nowzad, and Alois Knoll. Interpretable model-agnostic plausibility verification for 2D object detectors using domain-invariant concept bottleneck models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3891–3900, 2023.
- [15] Miles Turpin, Julian Michael, Ethan Perez, and Samuel Bowman. Language models don’t always say what they think: Unfaithful explanations in chain-of-thought prompting. In Advances in Neural Information Processing Systems, pages 74952–74965, 2023.
- [16] Fazl Barez, Tung-Yu Wu, Iván Arcuschin, Michael Lan, Vincent Wang, Noah Siegel, Nicolas Collignon, Clement Neo, Isabelle Lee, Alasdair Paren, Adel Bibi, Robert Trager, Damiano Fornasiere, John Yan, Yanai Elazar, and Yoshua Bengio. Chain-of-thought is not explainability. Oxford AI Governance Initiative (AIGI), 2025.
- [17] Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are VLMs ready for autonomous driving? An empirical study from the reliability, data and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6585–6597, 2025.
- [18] Shuguang Wang, Qian Zhou, Kui Wu, Dapeng Wu, Wei-Bin Lee, and Jianping Wang. Redoubt: Duo safety validation for autonomous vehicle motion planning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [19] P Antonante, S Veer, K Leung, X Weng, L Carlone, and M Pavone. Task-aware risk estimation of perception failures for autonomous vehicle. In Robotics: Science and Systems (RSS), 2023.
- [20] Lukas Westhofen, Christian Neurohr, Tjark Koopmann, Martin Butz, Barbara Schütt, Fabian Utesch, Birte Neurohr, Christian Gutenkunst, and Eckard Böde. Criticality metrics for automated driving: A review and suitability analysis of the state of the art. Archives of Computational Methods in Engineering, 30(1):1–35, 2023.
- [21] Vitali Petsiuk, Abir Das, and Kate Saenko. RISE: Randomized input sampling for explanation of black-box models. In British Machine Vision Conference, page 151, 2018.
- [22] Ruoyu Chen, Hua Zhang, Siyuan Liang, Jingzhi Li, and Xiaochun Cao. Less is more: Fewer interpretable region via submodular subset selection. In The Twelfth International Conference on Learning Representations, 2024.
- [23] Ruoyu Chen, Siyuan Liang, Jingzhi Li, Shiming Liu, Maosen Li, Zhen Huang, Hua Zhang, and Xiaochun Cao. Interpreting object-level foundation models via visual precision search. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 30042–30052, 2025.
- [24] David Nigenda, Zohar Karnin, Muhammad Bilal Zafar, Raghu Ramesha, Alan Tan, Michele Donini, and Krishnaram Kenthapadi. Amazon SageMaker Model Monitor: A system for real-time insights into deployed machine learning models. In Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 3671–3681, 2022.
- [25] Ziming Wang, Changwu Huang, and Xin Yao. Feature attribution explanation to detect harmful dataset shift. In International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2023.
- [26] Thomas Decker, Alexander Koebler, Michael Lebacher, Ingo Thon, Volker Tresp, and Florian Buettner. Explanatory model monitoring to understand the effects of feature shifts on performance. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 550–561, 2024.
- [27] Muhammad Monjurul Karim, Yu Li, and Ruwen Qin. Toward explainable artificial intelligence for early anticipation of traffic accidents. Transportation Research Record, 2676(6):743–755, 2022.
- [28] Luca Cultrera, Lorenzo Seidenari, Federico Becattini, Pietro Pala, and Alberto Del Bimbo. Explaining autonomous driving by learning end-to-end visual attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 340–341, 2020.
- [29] Lukas Hacker and Jörg Seewig. Insufficiency-driven DNN error detection in the context of SOTIF on traffic sign recognition use case. IEEE Open Journal of Intelligent Transportation Systems, 4:58–70, 2023.
- [30] Andrea Stocco, Paulo J Nunes, Marcelo d’Amorim, and Paolo Tonella. ThirdEye: Attention maps for safe autonomous driving systems. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–12, 2022.
- [31] Ruoyu Chen, Xiaoqing Guo, Kangwei Liu, Siyuan Liang, Shiming Liu, Qunli Zhang, Hua Zhang, Laiyuan Wang, and Xiaochun Cao. Where MLLMs attend and what they rely on: Explaining autoregressive token generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2026.
- [32] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020.
- [33] Wenzhao Zheng, Ruiqi Song, Xianda Guo, Chenming Zhang, and Long Chen. GenAD: Generative end-to-end autonomous driving. In European Conference on Computer Vision, pages 87–104, 2024.
- [34] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024.
- [35] Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, and Andreas Geiger. PlanT: Explainable planning transformers via object-level representations. In CoRL 2022 Workshop on Learning, Perception, and Abstraction for Long-Horizon Planning, 2022.
- [36] Shelley Zixin Shu, Aurélie Pahud de Mortanges, Alexander Poellinger, Dwarikanath Mahapatra, and Mauricio Reyes. Informer: Interpretability founded monitoring of medical image deep learning models. In International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging, pages 215–224, 2024.
- [37] Radhakrishna Achanta, Appu Shaji, Kevin Smith, Aurelien Lucchi, Pascal Fua, and Sabine Süsstrunk. SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11):2274–2282, 2012.
- [38] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
- [39] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(11):12878–12895, 2023.
- [40] Hao Shao, Letian Wang, Ruobing Chen, Steven L. Waslander, Hongsheng Li, and Yu Liu. ReasonNet: End-to-end driving with temporal and global reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13723–13733, 2023.
- [41] Kashyap Chitta, Aditya Prakash, and Andreas Geiger. NEAT: Neural attention fields for end-to-end autonomous driving. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 15773–15783, 2021.
- [42] Melih Barsbey, Lucas Prieto, Stefanos Zafeiriou, and Tolga Birdal. Large learning rates simultaneously achieve robustness to spurious correlations and compressibility. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2055–2066, 2025.
- [43] Jinjin Gu and Chao Dong. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9199–9208, 2021.
- [44] Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14864–14873, 2024.
- [45] Jiang-Tian Zhai, Ze Feng, Jinhao Du, Yongqiang Mao, Jiang-Jiang Liu, Zichang Tan, Yifu Zhang, Xiaoqing Ye, and Jingdong Wang. Rethinking the open-loop evaluation of end-to-end autonomous driving in nuScenes. arXiv preprint arXiv:2305.10430, 2023.