FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception
Pith reviewed 2026-05-10 16:28 UTC · model grok-4.3
The pith
FishRoPE reparameterizes attention to angular separation in spherical coordinates to adapt frozen vision foundation models to fisheye geometry.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FishRoPE is a projective rotary position embedding that reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. This, together with Low-Rank Adaptation on a frozen DINOv2 backbone, adapts vision foundation models to fisheye geometry and achieves state-of-the-art performance on WoodScape 2D detection at 54.3 mAP and SynWoodScapes BEV segmentation at 65.1 mIoU.
What carries the argument
FishRoPE, the Fisheye Rotary Position Embedding that reparameterizes attention to angular separation in spherical coordinates instead of pixel distance.
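To make the mechanism concrete, here is a minimal sketch of an angular rotary embedding of this kind. It is our illustration, not the paper's implementation: the equidistant model r = f·θ, the even split of head dimensions between θ and φ, and all function names are assumptions.

```python
import numpy as np

def pixel_to_angles(u, v, cx, cy, f):
    """Map a fisheye pixel to (theta, phi) under the equidistant
    model r = f * theta (an assumption; a real system would use the
    calibrated projection, e.g. a Kannala-Brandt polynomial)."""
    x, y = u - cx, v - cy
    r = np.hypot(x, y)
    theta = r / f               # incidence angle from the optical axis
    phi = np.arctan2(y, x)      # azimuth around the axis
    return theta, phi

def rotary(q, pos, base=10000.0):
    """Standard 1-D rotary embedding: each feature pair
    (q[2i], q[2i+1]) is rotated by pos * base**(-2i/d)."""
    d = q.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    ang = pos * freqs
    cos, sin = np.cos(ang), np.sin(ang)
    q2 = q.reshape(-1, 2)
    out = np.empty_like(q2)
    out[:, 0] = q2[:, 0] * cos - q2[:, 1] * sin
    out[:, 1] = q2[:, 0] * sin + q2[:, 1] * cos
    return out.reshape(d)

def angular_rope(q, theta, phi):
    """Rotate half the head dimension by theta and half by phi, so
    the attention logit q.k depends only on (dtheta, dphi), i.e. on
    angular separation rather than pixel distance. Near the optical
    axis theta is proportional to pixel radius, so this collapses to
    an ordinary (polar) RoPE up to scale."""
    d = q.shape[-1] // 2
    return np.concatenate([rotary(q[:d], theta), rotary(q[d:], phi)])
```

The relative-position property carries over unchanged from standard RoPE: shifting both tokens by the same (Δθ, Δφ) leaves the attention logit invariant.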
If this is right
- Achieves state-of-the-art 54.3 mAP on WoodScape 2D object detection.
- Achieves state-of-the-art 65.1 mIoU on SynWoodScapes BEV segmentation.
- Introduces negligible computational overhead and is architecture-agnostic.
- Naturally reduces to the standard rotary position embedding formulation under pinhole geometry.
- Enables feature transfer from pinhole-trained models to fisheye without task-specific pretraining.
Where Pith is reading between the lines
- The approach could be generalized to other camera models with non-linear projections by deriving appropriate coordinate reparameterizations for attention.
- It highlights the importance of geometric consistency in positional encodings for cross-domain transfer in computer vision.
- Future work might explore integrating FishRoPE with multi-view fusion techniques for improved omnidirectional 3D perception.
- Validation on additional real-world fisheye datasets would test if the gains hold beyond the synthetic and specific benchmarks used.
Load-bearing premise
Reparameterizing attention to angular separation in spherical coordinates suffices to align geometrically inconsistent features from pinhole-trained models with fisheye images.
What would settle it
If a baseline using standard position embeddings with the same LoRA adaptation matches or exceeds FishRoPE's performance on the WoodScape and SynWoodScapes benchmarks, the claim that angular reparameterization is key would be falsified.
Original abstract
Vision foundation models (VFMs) and Bird's Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present \ours, a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate \ours on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes FishRoPE, a lightweight adaptation framework for vision foundation models (VFMs) on fisheye imagery. It freezes a DINOv2 backbone, applies LoRA for feature transfer without task-specific pretraining, and introduces Fisheye Rotary Position Embeddings that reparameterize self- and cross-attention to operate on angular separation in spherical coordinates rather than pixel distance. The method is architecture-agnostic, adds negligible overhead, and reduces to standard RoPE under pinhole geometry. It reports SOTA results of 54.3 mAP on WoodScape 2D detection and 65.1 mIoU on SynWoodScapes BEV segmentation.
Significance. If the results and geometric consistency claims hold, the work provides a practical route to deploy existing pinhole-trained VFMs on omnidirectional fisheye perception without full retraining or large fisheye datasets, which is valuable for autonomous driving applications. The architecture-agnostic design, explicit reduction to standard RoPE, and emphasis on frozen backbones with minimal adaptation are clear strengths that could enable broader adoption.
major comments (2)
- [Abstract / Method] The central claim, that reparameterizing only self- and cross-attention to angular separation in spherical coordinates renders pinhole-trained DINOv2 features geometrically consistent on fisheye inputs, is load-bearing but not supported by isolating evidence. The frozen backbone's patch embedding and early layers still operate directly on the distorted pixel grid (trained under rectilinear assumptions) and receive only light LoRA adaptation; no experiments, feature visualizations, or alignment metrics demonstrate that attention reparameterization alone propagates consistency or overcomes the local receptive-field misalignment with the fisheye projection model.
- [§4 Experiments] The SOTA claims of 54.3 mAP and 65.1 mIoU are presented without error bars, standard deviations across runs, or ablation tables separating FishRoPE from LoRA. This makes it impossible to determine whether the reported gains arise from the angular reparameterization or from other factors, directly undermining the sufficiency argument for the attention-only modification.
minor comments (2)
- [§4.1] The selection and train/test splits of WoodScape and SynWoodScapes are not justified in §4.1; adding a brief rationale or reference to standard splits would improve reproducibility.
- [Method] Notation for the spherical coordinate mapping and the exact form of the angular-separation rotary embedding could be clarified with an explicit equation in the Method section to make the reduction to standard RoPE fully transparent.
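One plausible explicit form of the equation the second minor comment asks for, offered as our reconstruction rather than the paper's actual notation (the projection function r(θ) and the per-axis frequency assignment are assumptions):

```latex
% Spherical reparameterization of a pixel (u, v) with principal
% point (c_x, c_y) and fisheye projection r(\theta):
\theta = r^{-1}\!\Bigl(\sqrt{(u - c_x)^2 + (v - c_y)^2}\Bigr),
\qquad
\phi = \operatorname{atan2}(v - c_y,\, u - c_x).
% Rotary rotation applied to feature pair i at frequency \omega_i,
% with half the pairs driven by \theta and half by \phi:
R_i(\alpha) =
\begin{pmatrix}
\cos \omega_i \alpha & -\sin \omega_i \alpha \\
\sin \omega_i \alpha & \cos \omega_i \alpha
\end{pmatrix},
\qquad \alpha \in \{\theta, \phi\},
% so the attention logit q^{\top} k depends only on
% (\Delta\theta, \Delta\phi). Under the pinhole model
% r(\theta) = f \tan\theta \approx f\,\theta near the axis,
% (\theta, \phi) becomes a rescaled polar pixel coordinate and the
% embedding reduces to standard RoPE.
```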
Simulated Author's Rebuttal
We thank the referee for the constructive and insightful comments. We address each major point below, providing our strongest honest defense while committing to revisions that strengthen the evidence and statistical rigor of the manuscript.
Point-by-point responses
- Referee: [Abstract / Method] The central claim, that reparameterizing only self- and cross-attention to angular separation in spherical coordinates renders pinhole-trained DINOv2 features geometrically consistent on fisheye inputs, is load-bearing but not supported by isolating evidence. The frozen backbone's patch embedding and early layers still operate directly on the distorted pixel grid (trained under rectilinear assumptions) and receive only light LoRA adaptation; no experiments, feature visualizations, or alignment metrics demonstrate that attention reparameterization alone propagates consistency or overcomes the local receptive-field misalignment with the fisheye projection model.
Authors: We agree that isolating evidence would strengthen the central claim. The patch embeddings and early frozen layers extract local descriptors from the distorted grid, but these descriptors remain largely useful as they capture appearance rather than global geometry. LoRA provides targeted adaptation to these features for fisheye inputs, while FishRoPE ensures that subsequent self- and cross-attention layers operate on angular separations in spherical coordinates, thereby enforcing geometric consistency at the level where spatial relationships are integrated. This design choice is supported by the explicit reduction to standard RoPE under pinhole geometry and the architecture-agnostic nature of the method. However, we acknowledge the absence of direct isolating experiments in the current version. In the revision we will add: (i) attention map visualizations contrasting standard RoPE and FishRoPE on fisheye inputs, and (ii) quantitative alignment metrics (e.g., reprojection error of known 3D landmarks) comparing the two while holding LoRA fixed. These additions will demonstrate how consistency propagates through the attention layers. revision: yes
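The quantitative alignment metric the response promises in (ii) could take a form like the following sketch. The equidistant model r = f·θ stands in for the paper's actual calibration, and all function names are ours.

```python
import numpy as np

def project_equidistant(P, f, cx, cy):
    """Project a 3-D camera-frame point to a fisheye pixel under the
    equidistant model r = f * theta (an assumed calibration)."""
    X, Y, Z = P
    theta = np.arctan2(np.hypot(X, Y), Z)
    phi = np.arctan2(Y, X)
    r = f * theta
    return np.array([cx + r * np.cos(phi), cy + r * np.sin(phi)])

def unproject_equidistant(p, f, cx, cy):
    """Invert the projection to a unit viewing ray."""
    x, y = p[0] - cx, p[1] - cy
    r = np.hypot(x, y)
    theta = r / f
    phi = np.arctan2(y, x)
    return np.array([np.sin(theta) * np.cos(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(theta)])

def angular_error(p_pred, P_true, f, cx, cy):
    """Angle (radians) between the predicted pixel's viewing ray and
    the true landmark direction -- the kind of reprojection-based
    alignment metric proposed above."""
    ray = unproject_equidistant(p_pred, f, cx, cy)
    d = P_true / np.linalg.norm(P_true)
    return np.arccos(np.clip(ray @ d, -1.0, 1.0))
```

Averaging this error over known 3D landmarks, with LoRA held fixed and only the position embedding swapped, would quantify how much of the geometric alignment is attributable to FishRoPE.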
- Referee: [§4 Experiments] The SOTA claims of 54.3 mAP and 65.1 mIoU are presented without error bars, standard deviations across runs, or ablation tables separating FishRoPE from LoRA. This makes it impossible to determine whether the reported gains arise from the angular reparameterization or from other factors, directly undermining the sufficiency argument for the attention-only modification.
Authors: We concur that error bars and isolating ablations are necessary for rigorous evaluation. The reported figures reflect our primary experimental runs. In the revised manuscript we will rerun all benchmarks across multiple random seeds (minimum three) and report means with standard deviations for both 54.3 mAP and 65.1 mIoU. We will also insert a dedicated ablation table that includes: (a) frozen DINOv2 + LoRA with standard RoPE, and (b) frozen DINOv2 + LoRA + FishRoPE. This comparison will isolate the contribution of the angular reparameterization from LoRA adaptation alone, directly addressing whether the performance gains are attributable to FishRoPE. revision: yes
Circularity Check
No circularity: FishRoPE is a direct geometric reparameterization
Full rationale
The paper presents FishRoPE as an explicit reparameterization of rotary position embeddings to operate on angular separation in spherical coordinates rather than pixel distance, with the explicit property that it naturally reduces to standard RoPE under pinhole geometry. This reduction is a designed consistency property of the formulation, not a derived claim obtained by fitting parameters to target metrics or by self-referential definition. The adaptation framework (frozen DINOv2 + LoRA) and reported results (54.3 mAP on WoodScape, 65.1 mIoU on SynWoodScapes) are empirical evaluations on external benchmarks; no load-bearing derivation step in the abstract or described method reduces by construction to its own inputs or to a self-citation chain. The approach is stated to be architecture-agnostic with negligible overhead, and the derivation chain remains self-contained rather than resting on its own conclusions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: attention mechanisms can be reparameterized using angular distances derived from fisheye projection geometry.
invented entities (1)
- FishRoPE (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Athwale et al. FishBEV: Distortion-resilient bird's eye view segmentation with surround-view fisheye cameras. arXiv preprint arXiv:2509.13681, 2025.
- [2] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.
- [3] Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-BEV: What really matters for multi-sensor BEV perception? In ICRA, 2023.
- [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
- [5] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, 2024.
- [6] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
- [7] Hang Ji et al. RoPETR: Improving temporal camera-only 3D detection by integrating enhanced rotary position embedding. arXiv preprint arXiv:2504.12643, 2025.
- [8] Juho Kannala and Sami S. Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE TPAMI, 28(8):1335–1340, 2006.
- [9] Varun Ravi Kumar, Stefan Milz, Christian Witt, Martin Simon, Karl Amende, Johannes Petzold, Senthil Yogamani, and Timo Pech. Near-field depth estimation using monocular fisheye camera: A semi-supervised learning approach using sparse LiDAR data. In CVPR Workshops, 2018.
- [10] Varun Ravi Kumar, Stefan Milz, Christian Witt, Martin Simon, Karl Amende, Johannes Pfeuffer, Hazem Rashed, and Senthil Yogamani. OmniDet: Surround view cameras based multi-task visual perception network for autonomous driving. IEEE RA-L, 2021.
- [11] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
- [12] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
- [13] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
- [14] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position embedding transformation for multi-camera 3D object detection. In ECCV, 2022.
- [15] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
- [16] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
- [17] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In ECCV, 2020.
- [18] Hazem Rashed, Eslam Mohamed, Ganesh Sistu, Varun Ravi Kumar, Ciarán Eising, Ahmad El-Sallab, and Senthil Yogamani. Generalized object detection on fisheye cameras for autonomous driving: Dataset, representations and baseline. In WACV, 2021.
- [19] B. Ravi Kiran et al. Bridging perspectives: Foundation model guided BEV maps for 3D object detection and tracking. arXiv preprint arXiv:2510.10287, 2025.
- [20] Ekta U. Samani, Feng Tao, Harshavardhan R. Dasari, Sihao Ding, and Ashis G. Banerjee. F2BEV: Bird's eye view generation from surround-view fisheye camera images for automated driving. In IROS, 2023.
- [21] Jonas Schramm, Niclas Vödisch, Kaan Petek, B. Ravi Kiran, Senthil Yogamani, Wolfram Burgard, and Abhinav Valada. BEVCar: Camera-radar fusion for BEV map and object segmentation. In IROS, 2024.
- [22] Ahmed Rida Sekkat, Yohan Dupuis, Varun Ravi Kumar, Hazem Rashed, Senthil Yogamani, Luka Music, et al. SynWoodScapes: Synthetic fisheye dataset for autonomous driving. In CVPR Workshops, 2022.
- [23] Ganesh Sistu and Senthil Yogamani. FisheyeDetNet: Object detection on fisheye surround view camera systems for automated driving. arXiv preprint arXiv:2404.13443, 2024.
- [24] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [25] Michal Uricár, David Hurych, Pavel Krizek, and Senthil Yogamani. Challenges in designing datasets and validation for autonomous driving. In VISAPP, 2019.
- [26] Lucie Yahiaoui, Jonathan Horgan, Brian Deegan, Senthil Yogamani, Ciarán Hughes, and Patrick Denny. Overview and empirical analysis of ISP parameter tuning for visual perception in autonomous driving. Journal of Imaging, 5(10):78, 2019.
- [27] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O'Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In ICCV, 2019.
- [28] Senthil Yogamani, David Unger, Venkatraman Narayanan, and Varun Ravi Kumar. FisheyeBEVSeg: Surround view fisheye cameras based bird's-eye view segmentation for autonomous driving. In CVPR Workshops, 2024.
- [29] Senthil Yogamani et al. DaF-BEVSeg: Distortion-aware fisheye camera based bird's eye view segmentation with occlusion reasoning. arXiv preprint arXiv:2404.06352, 2024.
- [30] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
- [31] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.