pith. machine review for the scientific record.

arxiv: 2604.10391 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

FishRoPE: Projective Rotary Position Embeddings for Omnidirectional Visual Perception

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:28 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords fisheye cameras · rotary position embeddings · vision foundation models · omnidirectional vision · BEV segmentation · object detection · LoRA adaptation · autonomous vehicle perception

The pith

FishRoPE reparameterizes attention to angular separation in spherical coordinates to adapt frozen vision foundation models to fisheye geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision foundation models assume rectilinear pinhole camera geometry, but fisheye cameras introduce severe radial distortion that breaks this assumption. The paper establishes that reparameterizing rotary position embeddings to compute attention based on angular separation in spherical coordinates, rather than pixel distances, allows these models to handle fisheye inputs consistently. Combined with lightweight LoRA fine-tuning on a frozen DINOv2 backbone, this transfers rich self-supervised features to fisheye perception tasks without requiring task-specific pretraining or large annotated datasets. A reader would care because fisheye cameras are standard in autonomous vehicles for their wide surround coverage, making efficient adaptation crucial where full retraining is impractical. The method delivers state-of-the-art results on 2D detection and bird's-eye-view segmentation benchmarks.

Core claim

FishRoPE is a projective rotary position embedding that reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. This, together with Low-Rank Adaptation on a frozen DINOv2 backbone, adapts vision foundation models to fisheye geometry and achieves state-of-the-art performance on WoodScape 2D detection at 54.3 mAP and SynWoodScapes BEV segmentation at 65.1 mIoU.

What carries the argument

FishRoPE, the Fisheye Rotary Position Embedding that reparameterizes attention to angular separation in spherical coordinates instead of pixel distance.
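The mechanism can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: it uses the simple equidistant fisheye model r = f·θ in place of the paper's Kannala–Brandt projection, and rotates feature pairs by the geometric incidence angle instead of a token index, so that attention scores depend only on angular separation.

```python
import numpy as np

def pixel_to_angles(u, v, cx, cy, f):
    """Map a pixel to spherical angles under the equidistant fisheye
    model r = f * theta (a simplification of the Kannala-Brandt
    projection the paper uses). Returns (theta, phi): incidence angle
    from the optical axis and azimuth around it."""
    du, dv = u - cx, v - cy
    r = np.hypot(du, dv)
    theta = r / f
    phi = np.arctan2(dv, du)
    return theta, phi

def rotary_embed(x, angle, base=10000.0):
    """Standard RoPE rotation of a feature vector x: channel pair i is
    rotated by angle / base^(2i/d). The FishRoPE-style change is that
    `angle` is a geometric quantity, not a pixel or token index."""
    d = x.shape[-1]
    i = np.arange(d // 2)
    freqs = angle / (base ** (2 * i / d))
    cos, sin = np.cos(freqs), np.sin(freqs)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The attention score between a rotated query and key then depends only
# on the angular separation (here 0.2 rad), not on absolute positions.
q = rotary_embed(np.ones(8), angle=0.3)
k = rotary_embed(np.ones(8), angle=0.5)
score = q @ k
```

Because the rotation is orthogonal, q(θ₁)·k(θ₂) is a function of θ₂ − θ₁ alone; this is the relative-position property that the angular reparameterization carries over from standard RoPE.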

If this is right

  • Achieves state-of-the-art 54.3 mAP on WoodScape 2D object detection.
  • Achieves state-of-the-art 65.1 mIoU on SynWoodScapes BEV segmentation.
  • Introduces negligible computational overhead and is architecture-agnostic.
  • Naturally reduces to the standard rotary position embedding formulation under pinhole geometry.
  • Enables feature transfer from pinhole-trained models to fisheye without task-specific pretraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The approach could be generalized to other camera models with non-linear projections by deriving appropriate coordinate reparameterizations for attention.
  • It highlights the importance of geometric consistency in positional encodings for cross-domain transfer in computer vision.
  • Future work might explore integrating FishRoPE with multi-view fusion techniques for improved omnidirectional 3D perception.
  • Validation on additional real-world fisheye datasets would test if the gains hold beyond the synthetic and specific benchmarks used.

Load-bearing premise

Reparameterizing attention to angular separation in spherical coordinates suffices to align geometrically inconsistent features from pinhole-trained models with fisheye images.

What would settle it

If a baseline using standard position embeddings with the same LoRA adaptation matches or exceeds FishRoPE's performance on the WoodScape and SynWoodScapes benchmarks, the claim that angular reparameterization is key would be falsified.

Figures

Figures reproduced from arXiv: 2604.10391 by Bala Murali Manoghar Sai Sudhakar, Mudit Jain, Pratik Likhar, Rahul Ahuja, Senthil Yogamani, Varun Ravi Kumar, Venkatraman Narayanan.

Figure 1
Figure 1: Architecture overview. FishRoPE comprises (1) a frozen DINOv2 backbone with LoRA adaptation for multi-scale feature extraction from fisheye images, (2) a FishRoPE-enhanced feature encoder that embeds fisheye-aware angular geometry into self-attention, and (3) task-specific heads for 2D detection and BEV segmentation via the inverse Kannala–Brandt (KB) projection [8].
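The inverse Kannala–Brandt projection the caption refers to recovers an incidence angle θ from an image radius r by inverting the KB radial polynomial [8]. A minimal numeric sketch, with illustrative coefficients rather than any WoodScape calibration values:

```python
def kb_forward(theta, k=(1.0, -0.05, 0.01)):
    """Kannala-Brandt radial projection, truncated to three terms:
    r(theta) = k1*theta + k2*theta^3 + k3*theta^5.
    Coefficients are illustrative, not calibrated values."""
    return k[0] * theta + k[1] * theta**3 + k[2] * theta**5

def kb_inverse(r, k=(1.0, -0.05, 0.01), iters=20):
    """Invert r(theta) by Newton's method: recover the incidence angle
    from the image radius, as the BEV view transformation requires."""
    theta = r  # initial guess: identity (pinhole small-angle limit)
    for _ in range(iters):
        f = kb_forward(theta, k) - r
        df = k[0] + 3 * k[1] * theta**2 + 5 * k[2] * theta**4
        theta = theta - f / df
    return theta
```

The round trip kb_inverse(kb_forward(theta)) returns theta to high precision for moderate field angles, which is the consistency the view transformation depends on.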
Figure 2
Figure 2: Qualitative detection comparisons on WoodScape. FishRoPE correctly localizes a peripheral pedestrian that the baseline misses, illustrating the benefit of angular position encoding at high incidence angles.
read the original abstract

Vision foundation models (VFMs) and Bird's Eye View (BEV) representation have advanced visual perception substantially, yet their internal spatial representations assume the rectilinear geometry of pinhole cameras. Fisheye cameras, widely deployed on production autonomous vehicles for their surround-view coverage, exhibit severe radial distortion that renders these representations geometrically inconsistent. At the same time, the scarcity of large-scale fisheye annotations makes retraining foundation models from scratch impractical. We present \ours, a lightweight framework that adapts frozen VFMs to fisheye geometry through two components: a frozen DINOv2 backbone with Low-Rank Adaptation (LoRA) that transfers rich self-supervised features to fisheye without task-specific pretraining, and Fisheye Rotary Position Embedding (FishRoPE), which reparameterizes the attention mechanism in the spherical coordinates of the fisheye projection so that both self-attention and cross-attention operate on angular separation rather than pixel distance. FishRoPE is architecture-agnostic, introduces negligible computational overhead, and naturally reduces to the standard formulation under pinhole geometry. We evaluate \ours on WoodScape 2D detection (54.3 mAP) and SynWoodScapes BEV segmentation (65.1 mIoU), where it achieves state-of-the-art results on both benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes FishRoPE, a lightweight adaptation framework for vision foundation models (VFMs) on fisheye imagery. It freezes a DINOv2 backbone, applies LoRA for feature transfer without task-specific pretraining, and introduces Fisheye Rotary Position Embeddings that reparameterize self- and cross-attention to operate on angular separation in spherical coordinates rather than pixel distance. The method is architecture-agnostic, adds negligible overhead, and reduces to standard RoPE under pinhole geometry. It reports SOTA results of 54.3 mAP on WoodScape 2D detection and 65.1 mIoU on SynWoodScapes BEV segmentation.

Significance. If the results and geometric consistency claims hold, the work provides a practical route to deploy existing pinhole-trained VFMs on omnidirectional fisheye perception without full retraining or large fisheye datasets, which is valuable for autonomous driving applications. The architecture-agnostic design, explicit reduction to standard RoPE, and emphasis on frozen backbones with minimal adaptation are clear strengths that could enable broader adoption.

major comments (2)
  1. [Abstract / Method] Abstract and Method description: the central claim that reparameterizing only self- and cross-attention to angular separation in spherical coordinates renders pinhole-trained DINOv2 features geometrically consistent on fisheye inputs is load-bearing but unsupported by isolating evidence. The frozen backbone's patch embedding and early layers still operate directly on the distorted pixel grid (trained under rectilinear assumptions) and receive only light LoRA adaptation; no experiments, feature visualizations, or alignment metrics demonstrate that attention reparameterization alone propagates consistency to overcome local receptive-field misalignment with the fisheye projection model.
  2. [§4] §4 (Experiments): the SOTA claims of 54.3 mAP and 65.1 mIoU are presented without error bars, standard deviations across runs, or ablation tables separating FishRoPE from LoRA. This makes it impossible to determine whether the reported gains arise from the angular reparameterization or from other factors, directly undermining the sufficiency argument for the attention-only modification.
minor comments (2)
  1. [§4.1] The selection and train/test splits of WoodScape and SynWoodScapes are not justified in §4.1; adding a brief rationale or reference to standard splits would improve reproducibility.
  2. [Method] Notation for the spherical coordinate mapping and the exact form of the angular-separation rotary embedding could be clarified with an explicit equation in the Method section to make the reduction to standard RoPE fully transparent.
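The explicit equation this comment asks for might take the following form. This is a reconstruction from the abstract's description, not the paper's actual notation:

```latex
% Standard RoPE on channel pair i (token positions m, n): attention
% depends only on the offset,
\langle R(m\theta_i)\,q,\; R(n\theta_i)\,k \rangle
  = q^{\top} R\!\left((n-m)\,\theta_i\right) k .
% A FishRoPE-style substitution indexes the rotation by the spherical
% incidence angle of each patch under the fisheye projection, so scores
% depend on angular separation:
\langle R(\theta_p\,\omega_i)\,q,\; R(\theta_q\,\omega_i)\,k \rangle
  = q^{\top} R\!\left((\theta_q - \theta_p)\,\omega_i\right) k .
% Pinhole limit: \theta \approx r/f is linear in pixel radius r, so the
% formulation reduces to the standard pixel-indexed RoPE.
```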

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below, providing our strongest honest defense while committing to revisions that strengthen the evidence and statistical rigor of the manuscript.

read point-by-point responses
  1. Referee: [Abstract / Method] Abstract and Method description: the central claim that reparameterizing only self- and cross-attention to angular separation in spherical coordinates renders pinhole-trained DINOv2 features geometrically consistent on fisheye inputs is load-bearing but unsupported by isolating evidence. The frozen backbone's patch embedding and early layers still operate directly on the distorted pixel grid (trained under rectilinear assumptions) and receive only light LoRA adaptation; no experiments, feature visualizations, or alignment metrics demonstrate that attention reparameterization alone propagates consistency to overcome local receptive-field misalignment with the fisheye projection model.

    Authors: We agree that isolating evidence would strengthen the central claim. The patch embeddings and early frozen layers extract local descriptors from the distorted grid, but these descriptors remain largely useful as they capture appearance rather than global geometry. LoRA provides targeted adaptation to these features for fisheye inputs, while FishRoPE ensures that subsequent self- and cross-attention layers operate on angular separations in spherical coordinates, thereby enforcing geometric consistency at the level where spatial relationships are integrated. This design choice is supported by the explicit reduction to standard RoPE under pinhole geometry and the architecture-agnostic nature of the method. However, we acknowledge the absence of direct isolating experiments in the current version. In the revision we will add: (i) attention map visualizations contrasting standard RoPE and FishRoPE on fisheye inputs, and (ii) quantitative alignment metrics (e.g., reprojection error of known 3D landmarks) comparing the two while holding LoRA fixed. These additions will demonstrate how consistency propagates through the attention layers. revision: yes

  2. Referee: [§4] §4 (Experiments): the SOTA claims of 54.3 mAP and 65.1 mIoU are presented without error bars, standard deviations across runs, or ablation tables separating FishRoPE from LoRA. This makes it impossible to determine whether the reported gains arise from the angular reparameterization or from other factors, directly undermining the sufficiency argument for the attention-only modification.

    Authors: We concur that error bars and isolating ablations are necessary for rigorous evaluation. The reported figures reflect our primary experimental runs. In the revised manuscript we will rerun all benchmarks across multiple random seeds (minimum three) and report means with standard deviations for both 54.3 mAP and 65.1 mIoU. We will also insert a dedicated ablation table that includes: (a) frozen DINOv2 + LoRA with standard RoPE, and (b) frozen DINOv2 + LoRA + FishRoPE. This comparison will isolate the contribution of the angular reparameterization from LoRA adaptation alone, directly addressing whether the performance gains are attributable to FishRoPE. revision: yes
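The reporting format the rebuttal commits to can be sketched directly. The scores below are hypothetical placeholders for three-seed runs, not results from the paper:

```python
import statistics

def summarize_runs(name, scores):
    """Report mean and sample standard deviation across seeds, the
    format promised for the revised ablation table."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)
    return f"{name}: {mean:.1f} +/- {std:.1f}"

# Hypothetical three-seed runs for the two promised ablation rows:
rows = {
    "DINOv2 + LoRA + standard RoPE": [51.9, 52.4, 52.1],
    "DINOv2 + LoRA + FishRoPE":      [54.1, 54.3, 54.5],
}
for name, scores in rows.items():
    print(summarize_runs(name, scores))
```

If the FishRoPE row's mean exceeds the standard-RoPE row's by more than the combined spread, the gain is attributable to the angular reparameterization rather than LoRA alone.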

Circularity Check

0 steps flagged

No circularity: FishRoPE is a direct geometric reparameterization

full rationale

The paper presents FishRoPE as an explicit reparameterization of rotary position embeddings to operate on angular separation in spherical coordinates rather than pixel distance, with the explicit property that it naturally reduces to standard RoPE under pinhole geometry. This reduction is a designed consistency property of the formulation, not a derived claim obtained by fitting parameters to target metrics or by self-referential definition. The adaptation framework (frozen DINOv2 + LoRA) and reported results (54.3 mAP on WoodScape, 65.1 mIoU on SynWoodScapes) are empirical evaluations on external benchmarks; no load-bearing derivation step in the abstract or described method reduces by construction to the inputs or to a self-citation chain. The approach is stated to be architecture-agnostic with negligible overhead, confirming the derivation chain remains self-contained and independent.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the domain assumption that angular separation in spherical coordinates adequately models fisheye projection for attention; no free parameters or new entities are explicitly introduced beyond the method itself.

axioms (1)
  • domain assumption Attention mechanisms can be reparameterized using angular distances derived from fisheye projection geometry
    Invoked to justify why FishRoPE transfers features from pinhole-trained models
invented entities (1)
  • FishRoPE no independent evidence
    purpose: Reparameterized rotary position embedding for fisheye spherical coordinates
    New method component introduced to handle radial distortion

pith-pipeline@v0.9.0 · 5571 in / 1329 out tokens · 37713 ms · 2026-05-10T16:28:59.817026+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

31 extracted references · 6 canonical work pages

  1. [1] A. Athwale et al. FishBEV: Distortion-resilient bird's eye view segmentation with surround-view fisheye cameras. arXiv preprint arXiv:2509.13681, 2025.
  2. [2] Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. In ICLR, 2024.
  3. [3] Adam W. Harley, Zhaoyuan Fang, Jie Li, Rares Ambrus, and Katerina Fragkiadaki. Simple-BEV: What really matters for multi-sensor BEV perception? In ICRA, 2023.
  4. [4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  5. [5] Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In ECCV, 2024.
  6. [6] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  7. [7] Hang Ji et al. RoPETR: Improving temporal camera-only 3D detection by integrating enhanced rotary position embedding. arXiv preprint arXiv:2504.12643, 2025.
  8. [8] Juho Kannala and Sami S. Brandt. A generic camera model and calibration method for conventional, wide-angle, and fish-eye lenses. IEEE TPAMI, 28(8):1335–1340, 2006.
  9. [9] Varun Ravi Kumar, Stefan Milz, Christian Witt, Martin Simon, Karl Amende, Johannes Petzold, Senthil Yogamani, and Timo Pech. Near-field depth estimation using monocular fisheye camera: A semi-supervised learning approach using sparse lidar data. In CVPR Workshop, 2018.
  10. [10] Varun Ravi Kumar, Stefan Milz, Christian Witt, Martin Simon, Karl Amende, Johannes Pfeuffer, Hazem Rashed, and Senthil Yogamani. OmniDet: Surround view cameras based multi-task visual perception network for autonomous driving. IEEE RAL, 2021.
  11. [11] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. BEVFormer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In ECCV, 2022.
  12. [12] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
  13. [13] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In ICCV, 2017.
  14. [14] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position embedding transformation for multi-camera 3D object detection. In ECCV, 2022.
  15. [15] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In ICCV, 2021.
  16. [16] Maxime Oquab, Timothée Darcet, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2024.
  17. [17] Jonah Philion and Sanja Fidler. Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D. In ECCV, 2020.
  18. [18] Hazem Rashed, Eslam Mohamed, Ganesh Sistu, Varun Ravi Kumar, Ciaran Eising, Ahmad El-Sallab, and Senthil Yogamani. Generalized object detection on fisheye cameras for autonomous driving: Dataset, representations and baseline. In WACV, 2021.
  19. [19] B. Ravi Kiran et al. Bridging perspectives: Foundation model guided BEV maps for 3D object detection and tracking. arXiv preprint arXiv:2510.10287, 2025.
  20. [20] Ekta U. Samani, Feng Tao, Harshavardhan R. Dasari, Sihao Ding, and Ashis G. Banerjee. F2BEV: Bird's eye view generation from surround-view fisheye camera images for automated driving. In IROS, 2023.
  21. [21] Jonas Schramm, Niclas Vödisch, Kaan Petek, B. Ravi Kiran, Senthil Yogamani, Wolfram Burgard, and Abhinav Valada. BEVCar: Camera-radar fusion for BEV map and object segmentation. In IROS, 2024.
  22. [22] Ahmed Rida Sekkat, Yohan Dupuis, Varun Ravi Kumar, Hazem Rashed, Senthil Yogamani, Luka Music, et al. SynWoodScapes: Synthetic fisheye dataset for autonomous driving. In CVPR Workshops, 2022.
  23. [23] Ganesh Sistu and Senthil Yogamani. FisheyeDetNet: Object detection on fisheye surround view camera systems for automated driving. arXiv preprint arXiv:2404.13443, 2024.
  24. [24] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  25. [25] Michal Uricár, David Hurych, Pavel Krizek, and Senthil Yogamani. Challenges in designing datasets and validation for autonomous driving. In Proceedings of the International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP), 2019.
  26. [26] Lucie Yahiaoui, Jonathan Horgan, Brian Deegan, Senthil Yogamani, Ciarán Hughes, and Patrick Denny. Overview and empirical analysis of ISP parameter tuning for visual perception in autonomous driving. Journal of Imaging, 5(10):78, 2019.
  27. [27] Senthil Yogamani, Ciarán Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O'Dea, Michal Uricár, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In ICCV, 2019.
  28. [28] Senthil Yogamani, David Unger, Venkatraman Narayanan, and Varun Ravi Kumar. FisheyeBEVSeg: Surround view fisheye cameras based bird's-eye view segmentation for autonomous driving. In CVPR Workshops, 2024.
  29. [29] Senthil Yogamani et al. DaF-BEVSeg: Distortion-aware fisheye camera based bird's eye view segmentation with occlusion reasoning. arXiv preprint arXiv:2404.06352, 2024.
  30. [30] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  31. [31] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. In ICLR, 2021.