pith. machine review for the scientific record.

arxiv: 2604.11400 · v1 · submitted 2026-04-13 · 💻 cs.RO · cs.CV

Recognition: unknown

EagleVision: A Multi-Task Benchmark for Cross-Domain Perception in High-Speed Autonomous Racing

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:01 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords autonomous racing · 3D object detection · trajectory prediction · cross-domain transfer · LiDAR perception · benchmark · high-speed dynamics

The pith

Cross-domain pretraining from urban and racing datasets enhances 3D detection and trajectory prediction in high-speed autonomous racing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmark to measure how well perception models transfer across urban, simulator, and real racing domains under high-speed conditions. It finds that pretraining on urban data lifts detection performance above training from scratch on racing data, that adding real racing data as an intermediate step gives the strongest results when adapting to a new racing test set, and that models trained on one racing dataset can forecast trajectories more accurately on another racing test set than models trained directly on the target data. This matters because high-speed racing creates fast relative motions and large domain shifts that standard urban datasets miss, so mapping which pretraining paths close the gap supports more reliable perception in extreme settings. The work supplies standardized annotations and a common evaluation protocol to make such comparisons possible.

Core claim

The central claim is that the EagleVision benchmark, built from newly annotated LiDAR frames across the Indy Autonomous Challenge, the A2RL competition, and simulator data, demonstrates through a dataset-centric transfer framework that (i) urban pretraining improves 3D detection over scratch training on racing data, (ii) intermediate pretraining on real racing data yields the strongest transfer to new racing environments, and (iii) Indy-trained models outperform direct in-domain training on A2RL test sequences for trajectory prediction because of wider motion-distribution coverage.

What carries the argument

The dataset-centric transfer framework that standardizes LiDAR data from urban, simulator, and real racing sources under one evaluation protocol and measures how pretraining sequences affect detection and prediction in high-dynamic conditions.
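
A minimal sketch of what such a transfer-path comparison could look like, assuming placeholder training and evaluation helpers; nothing below reproduces the paper's actual detectors, schedules, or domain splits:

```python
"""Illustrative-only sketch of a dataset-centric transfer comparison.

The helpers below are placeholders: nothing here reproduces the paper's
actual detectors, training schedules, or domain definitions.
"""

def train_stage(model_state, domain):
    # Placeholder: in practice this would (pre)train a LiDAR 3D detector
    # on frames from the given domain (urban, simulator, or racing).
    return {"initialized_from": model_state, "trained_on": domain}

def evaluate_detection(model_state, test_split):
    # Placeholder: in practice this would compute NDS / mAP on the
    # held-out racing test split under the common evaluation protocol.
    return 0.0

def run_transfer_path(path, target_train, target_test):
    """Train along a sequence of source domains, adapt to the target
    domain, and score the result on the target test split."""
    model = None
    for domain in path:                       # e.g. ["urban", "indy"]
        model = train_stage(model, domain)
    model = train_stage(model, target_train)  # final adaptation step
    return evaluate_detection(model, target_test)

# Pretraining paths of the kind compared in the paper (names are illustrative).
paths = {
    "scratch":           [],
    "urban_pretrain":    ["urban"],
    "urban_then_sim":    ["urban", "sim"],
    "urban_then_racing": ["urban", "indy"],   # intermediate real-racing data
}
scores = {name: run_transfer_path(p, "a2rl_train", "a2rl_test")
          for name, p in paths.items()}
```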

Load-bearing premise

The newly created 3D bounding box annotations for the Indy and A2RL datasets are accurate and consistent enough to support reliable cross-domain comparisons.

What would settle it

Independent re-annotation of the same frames producing bounding boxes that reverse the observed performance ordering between urban-pretrained, simulator-adapted, and racing-intermediate models on the A2RL test set.

Figures

Figures reproduced from arXiv: 2604.11400 by Dzmitry Tsetserukou, Jiaqi Huang, Jorge Dias, Majid Khonji, Murad Mebrahtu, Ren Jin, Yujia Yue, Zakhar Yagudin.

Figure 1: Autonomous racing platform operating under real …
Figure 2: Overview of the proposed benchmark. (a) 3D detec…
Figure 3: LiDAR point cloud visualization of 3D detection …
Figure 4: Representative trajectory prediction examples from …
Original abstract

High-speed autonomous racing presents extreme perception challenges, including large relative velocities and substantial domain shifts from conventional urban-driving datasets. Existing benchmarks do not adequately capture these high-dynamic conditions. We introduce EagleVision, a unified LiDAR-based multi-task benchmark for 3D detection and trajectory prediction in high-speed racing, providing newly annotated 3D bounding boxes for the Indy Autonomous Challenge dataset (14,893 frames) and the A2RL Real competition dataset (1,163 frames), together with 12,000 simulator-generated annotated frames, all standardized under a common evaluation protocol. Using a dataset-centric transfer framework, we quantify cross-domain generalization across urban, simulator, and real racing domains. Urban pretraining improves detection over scratch training (NDS 0.72 vs. 0.69), while intermediate pretraining on real racing data achieves the best transfer to A2RL (NDS 0.726), outperforming simulator-only adaptation. For trajectory prediction, Indy-trained models surpass in-domain A2RL training on A2RL test sequences (FDE 0.947 vs. 1.250), highlighting the role of motion-distribution coverage in cross-domain forecasting. EagleVision enables systematic study of perception generalization under extreme high-speed dynamics. The dataset and benchmark are publicly available at https://avlab.io/EagleVision
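
For orientation, the abstract's FDE numbers refer to final displacement error, conventionally the Euclidean distance between the predicted and ground-truth positions at the last forecast step (ADE is the average over all steps). A minimal sketch of that standard computation, independent of the paper's models and data:

```python
import numpy as np

def displacement_errors(pred, gt):
    """Conventional trajectory-prediction metrics for a single agent.

    pred, gt: arrays of shape (T, 2) with (x, y) positions over the
    prediction horizon. Returns (ADE, FDE): the mean per-step Euclidean
    error and the error at the final timestep.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    per_step = np.linalg.norm(pred - gt, axis=-1)  # Euclidean error per step
    return float(per_step.mean()), float(per_step[-1])

# Toy example: straight-line ground truth vs. a prediction offset by 0.5 m.
gt = np.stack([np.arange(5.0), np.zeros(5)], axis=1)
pred = gt + np.array([0.0, 0.5])
ade, fde = displacement_errors(pred, gt)
print(f"ADE={ade:.3f} m, FDE={fde:.3f} m")  # 0.500 m for both in this toy case
```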

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EagleVision, a LiDAR-based multi-task benchmark for 3D object detection and trajectory prediction in high-speed autonomous racing. It contributes newly annotated 3D bounding boxes for the Indy Autonomous Challenge dataset (14,893 frames) and A2RL Real competition dataset (1,163 frames), plus 12,000 simulator frames, all under a common protocol. Using a dataset-centric transfer framework, it reports cross-domain results: urban pretraining yields NDS 0.72 vs. 0.69 for scratch training on detection; intermediate pretraining on real racing data achieves best transfer to A2RL (NDS 0.726); and Indy-trained models outperform in-domain A2RL training on A2RL test sequences for trajectory prediction (FDE 0.947 vs. 1.250). The datasets and benchmark are released publicly.

Significance. If the annotations are shown to be reliable, this benchmark would be a useful addition for studying perception generalization under high-dynamic conditions that differ from urban driving. The public release of standardized datasets across urban, simulator, and real-racing domains enables reproducible follow-on work. The empirical observation that real racing pretraining outperforms simulator adaptation, and that broader motion coverage from Indy data improves forecasting on A2RL, provides concrete, falsifiable starting points for domain-adaptation research in robotics.

major comments (1)
  1. [Abstract / Dataset creation] Abstract and dataset description: All reported metric deltas (NDS 0.72 vs. 0.69; FDE 0.947 vs. 1.250) are computed directly on the newly supplied 3D bounding-box labels for Indy and A2RL. The manuscript states only that the boxes were “newly annotated” and “standardized under a common protocol,” with no description of the annotation pipeline, LiDAR calibration, motion-compensation procedure, inter-annotator agreement statistics, or quantitative validation (e.g., comparison to an off-the-shelf detector or error rates under high-speed sparsity). Because label noise or domain-specific bias could produce the observed transfer gains, this information is required to attribute results to domain shift rather than annotation artifacts.
minor comments (2)
  1. [Experimental setup] The manuscript should report statistical significance (e.g., confidence intervals or p-values) for the small metric improvements and provide the exact model architectures, training hyperparameters, and data splits used in the transfer experiments (a bootstrap sketch of such intervals follows this list).
  2. [Results] Figure and table captions should explicitly state the evaluation protocol (e.g., how NDS is computed across domains) and note any differences in sensor characteristics between the three data sources.
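
On the first minor comment, one common way to attach uncertainty to small metric deltas is a percentile bootstrap over per-frame or per-sequence scores. The sketch below uses synthetic values purely for illustration and is not drawn from the paper:

```python
import numpy as np

def bootstrap_ci(per_sample_scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean metric.

    per_sample_scores: per-frame or per-sequence values of a metric
    (e.g. per-trajectory FDE). Returns (mean, (lower, upper)).
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_sample_scores, dtype=float)
    n = scores.shape[0]
    # Resample with replacement and collect the mean of each resample.
    means = np.array([scores[rng.integers(0, n, size=n)].mean()
                      for _ in range(n_boot)])
    lo, hi = np.quantile(means, [alpha / 2.0, 1.0 - alpha / 2.0])
    return float(scores.mean()), (float(lo), float(hi))

# Synthetic per-sequence FDE values, for illustration only.
rng = np.random.default_rng(42)
fde_transfer = rng.normal(0.95, 0.20, size=200)
fde_in_domain = rng.normal(1.25, 0.20, size=200)
for label, values in [("transfer", fde_transfer), ("in-domain", fde_in_domain)]:
    mean, (lo, hi) = bootstrap_ci(values)
    print(f"{label}: mean FDE {mean:.3f} [{lo:.3f}, {hi:.3f}] (95% CI)")
```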

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the benchmark's potential contribution and for highlighting the need for greater transparency in dataset construction. We address the major comment below and will revise the manuscript to incorporate the requested details.

Point-by-point responses
  1. Referee: [Abstract / Dataset creation] Abstract and dataset description: All reported metric deltas (NDS 0.72 vs. 0.69; FDE 0.947 vs. 1.250) are computed directly on the newly supplied 3D bounding-box labels for Indy and A2RL. The manuscript states only that the boxes were “newly annotated” and “standardized under a common protocol,” with no description of the annotation pipeline, LiDAR calibration, motion-compensation procedure, inter-annotator agreement statistics, or quantitative validation (e.g., comparison to an off-the-shelf detector or error rates under high-speed sparsity). Because label noise or domain-specific bias could produce the observed transfer gains, this information is required to attribute results to domain shift rather than annotation artifacts.

    Authors: We agree that the current manuscript provides insufficient detail on the annotation process, which is necessary to rule out label noise or bias as alternative explanations for the reported transfer gains. In the revised version we will add a dedicated subsection under Dataset Creation that describes: the semi-automatic annotation pipeline (initial proposals from an off-the-shelf detector followed by human refinement), vehicle-specific LiDAR calibration and extrinsic parameters, motion-compensation procedures that account for ego-velocity during high-speed sweeps, inter-annotator agreement metrics (mean IoU and label-consistency rates across multiple annotators), and quantitative validation results including precision-recall curves against held-out manual labels and error statistics stratified by speed and point-cloud sparsity. These additions will allow readers to assess label quality directly and strengthen the attribution of performance differences to domain shift. revision: yes
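
For reference, a minimal sketch of the kind of inter-annotator mean-IoU check the rebuttal promises, simplified here to axis-aligned bird's-eye-view boxes; the paper's actual tooling, and any yaw or height handling, is not described in this review:

```python
import numpy as np

def bev_iou_axis_aligned(box_a, box_b):
    """IoU of two axis-aligned bird's-eye-view boxes given as
    (center_x, center_y, length, width).

    Simplified for illustration: real racing boxes would also carry yaw
    and height, which a production agreement check must handle.
    """
    ax0, ay0 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax1, ay1 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx0, by0 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx1, by1 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_l = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_l
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0.0 else 0.0

def mean_inter_annotator_iou(boxes_annotator_1, boxes_annotator_2):
    """Mean IoU over box pairs that two annotators assigned to the same object."""
    ious = [bev_iou_axis_aligned(a, b)
            for a, b in zip(boxes_annotator_1, boxes_annotator_2)]
    return float(np.mean(ious)) if ious else 0.0

# Toy example: the same car labeled with a slight positional disagreement.
print(mean_inter_annotator_iou([(0.0, 0.0, 5.0, 2.0)],
                               [(0.3, 0.1, 5.0, 2.0)]))  # ≈ 0.81
```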

Circularity Check

0 steps flagged

No circularity in benchmark construction or transfer results

Full rationale

The paper's core contributions are the release of newly annotated datasets (Indy 14,893 frames, A2RL 1,163 frames, plus simulator data) under a common protocol and the reporting of empirical NDS/FDE numbers obtained via standard supervised training and cross-domain transfer on held-out test splits. No equations, self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation of the reported deltas (e.g., urban pretrain NDS 0.72 vs scratch 0.69). All performance figures are computed directly from the supplied labels and models; the chain is externally falsifiable on the public benchmark and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark and transfer-learning study with no mathematical derivations, new physical entities, or theoretical axioms; all results derive from standard supervised learning on annotated point clouds.

pith-pipeline@v0.9.0 · 5572 in / 1115 out tokens · 52504 ms · 2026-05-10T16:01:11.491554+00:00 · methodology


Reference graph

Works this paper leans on

24 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    Vision meets robotics: The kitti dataset,

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The kitti dataset,” The international journal of robotics research, vol. 32, no. 11, pp. 1231–1237, 2013

  2. [2]

    nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11621–11631

  3. [3]

    Scalability in perception for autonomous driving: Waymo open dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

  4. [4]

    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

    B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes et al., “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” arXiv preprint arXiv:2301.00493, 2023

  5. [5]

    Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation,

    Q. Xu et al., “Spg: Unsupervised domain adaptation for 3d object detection via semantic point generation,” in ICCV Workshops, 2021

  6. [6]

    Gpa-3d: Geometry-aware prototype alignment for unsupervised domain adaptive 3d object detection,

    Z. Li, J. Guo, T. Cao et al., “Gpa-3d: Geometry-aware prototype alignment for unsupervised domain adaptive 3d object detection,” arXiv, 2023

  7. [7]

    Vectornet: Encoding hd maps and agent dynamics from vectorized representation,

    J. Gao, X. Yuan et al., “Vectornet: Encoding hd maps and agent dynamics from vectorized representation,” in CVPR, 2020

  8. [8]

    Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying,

    S. Shi, L. Jiang, D. Dai, and B. Schiele, “Mtr++: Multi-agent motion prediction with symmetric scene modeling and guided intention querying,” arXiv, 2023

  9. [9]

    Fadet: A multi-sensor 3d object detection network based on local featured attention,

    Z. Guo, Z. Yagudin, S. Asfaw, A. Lykov, and D. Tsetserukou, “Fadet: A multi-sensor 3d object detection network based on local featured attention,” in 2025 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2025, pp. 202–208

  10. [10]

    3d object detection for autonomous driving: A survey,

    R. Qian, X. Lai, and X. Li, “3d object detection for autonomous driving: A survey,” arXiv, 2022

  11. [11]

    A survey on deep-learning-based lidar 3d object detection,

    S. Alaba et al., “A survey on deep-learning-based lidar 3d object detection,” PMC Free Article, 2022

  12. [12]

    Vlm-auto: Vlm-based autonomous driving assistant with human-like behavior and understanding for complex road scenes,

    Z. Guo, Z. Yagudin, A. Lykov, M. Konenkov, and D. Tsetserukou, “Vlm-auto: Vlm-based autonomous driving assistant with human-like behavior and understanding for complex road scenes,” in 2024 2nd International Conference on Foundation and Large Language Models (FLLM), 2024, pp. 501–507

  13. [13]

    Racecar-the dataset for high-speed autonomous racing,

    A. Kulkarni, J. Chrosniak, E. Ducote, F. Sauerbeck, A. Saba, U. Chirimar, J. Link, M. Behl, and M. Cellina, “Racecar-the dataset for high-speed autonomous racing,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2023, pp. 11458–11463

  14. [14]

    Betty dataset: A multi-modal dataset for full-stack autonomy,

    M. Nye, A. Raji, A. Saba, E. Erlich, R. Exley, A. Goyal, A. Matros, R. Misra, M. Sivaprakasam, M. Bertogna et al., “Betty dataset: A multi-modal dataset for full-stack autonomy,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 2453–2460

  15. [15]

    Ms3d++: Ensemble of experts for multi-source unsupervised domain adaptation in 3d object detection,

    D. Tsai, J. S. Berrio, M. Shan et al., “Ms3d++: Ensemble of experts for multi-source unsupervised domain adaptation in 3d object detection,” arXiv, 2023

  16. [16]

    Bev-dg: Cross-modal learning under bird’s-eye view for domain generalization,

    M. Li et al., “Bev-dg: Cross-modal learning under bird’s-eye view for domain generalization,” in ICCV, 2023

  17. [17]

    Sim-to-real adversarial domain adaptation for 3d object detection,

    M. Wozniak, M. Hansson, and P. Jensfelt, “Sim-to-real adversarial domain adaptation for 3d object detection,” in CVPR, 2024

  18. [18]

    Metdrive: Multimodal end-to-end autonomous driving with temporal guidance,

    Z. Guo, X. Lin, Z. Yagudin, A. Lykov, Y. Wang, Y. Li, and D. Tsetserukou, “Metdrive: Multimodal end-to-end autonomous driving with temporal guidance,” in 2025 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2025, pp. 6027–6032

  19. [19]

    Sustechpoints: 3d point cloud annotation platform,

    SUSTechPOINTS Development Team, “Sustechpoints: 3d point cloud annotation platform,” https://github.com/naurril/SUSTechPOINTS/tree/dev-auto-annotate, 2023

  20. [20]

    Indy autonomous challenge,

    Indy Autonomous Challenge, “Indy autonomous challenge,” https://www.indyautonomouschallenge.com, 2023, accessed: 2024

  21. [21]

    Abu dhabi autonomous racing league (a2rl),

    A2RL, “Abu dhabi autonomous racing league (a2rl),” https://a2rl.io, 2023, accessed: 2024

  22. [22]

    Scalability in perception for autonomous driving: Waymo open dataset,

    P. Sun, H. Kretzschmar et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  23. [23]

    Pointpillars: Fast encoders for object detection from point clouds,

    A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “Pointpillars: Fast encoders for object detection from point clouds,” in CVPR, 2019

  24. [24]

    Centerpoint: Center-based 3d object detection and tracking,

    T. Yin, X. Zhou, and P. Krahenbuhl, “Centerpoint: Center-based 3d object detection and tracking,” in CVPR, 2021