pith. sign in

arxiv: 2607.02479 · v1 · pith:N663ZKXPnew · submitted 2026-07-02 · 💻 cs.CV

EAGLE-360: Embodied Active Global-to-Local Exploration in 360^circ

Pith reviewed 2026-07-03 14:48 UTC · model grok-4.3

classification 💻 cs.CV
keywords 360 panoramic searchmultimodal large language modelsactive visual explorationglobal-to-local reasoningpositional encoding adaptationembodied agentspanoramic dataset
0
0 comments X

The pith

Adapting positional encoding and global priors lets MLLMs search 360-degree panoramas by starting wide then narrowing locally.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to show that standard multimodal models fail at 360-degree visual search because they cannot handle polar distortion and continuous cylindrical wraps, forcing inefficient local-only scans. It builds a large dataset of panoramas and dialogues, then trains the model with a rolling coordinate shift in positional encoding plus supervised and policy optimization steps so the model reasons from an initial full view before refining. If correct, this would replace myopic exploration with error-recovering global-to-local steps that keep targets in context even when they leave the current frame. Readers would care because the change directly improves accuracy and efficiency for any agent operating in immersive or robotic panoramic settings.

Core claim

EAGLE-360 claims that leveraging global priors for an initial holistic perspective, iteratively narrowing the search space, and adapting RoPE Rolling to model continuous cylindrical topologies allows MLLMs to overcome panoramic distortion; when combined with SFT and GRPO training on the 14,000+ panorama and 70,000+ dialogue dataset, this elicits spatial reasoning and tool use that produces nearly an eight-fold accuracy gain over the base model and markedly better exploration efficiency.

What carries the argument

RoPE Rolling, the coordinate-shifting positional encoding adapted to continuous cylindrical topologies, which carries the global-to-local reasoning pipeline by letting the model maintain awareness across the full panorama rather than isolated local views.

If this is right

  • Agents maintain global context and recover from cases where a target leaves the current local view.
  • Exploration avoids rigid initialization and exhaustive local scanning by using holistic priors to shrink the search space step by step.
  • The training pipeline produces both spatial reasoning and tool-calling behavior from the VQA dialogues.
  • Efficiency gains appear as fewer steps are needed to locate targets compared with fragmented local methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-first narrowing pattern could be tested in other continuous domains such as spherical medical scans or full-sphere astronomy data.
  • Real-world robotic deployment would reveal whether the learned reasoning transfers when camera motion and lighting vary beyond the static panorama dataset.
  • If the accuracy lift holds, the approach supplies a template for adding cylindrical-aware positional encodings to other multimodal tasks that wrap around a closed surface.

Load-bearing premise

That the RoPE Rolling adaptation together with SFT and GRPO training on the constructed dataset will overcome panoramic distortion and cylindrical topology problems to produce reliable spatial reasoning.

What would settle it

Running the base model and the trained EAGLE-360 model on the same held-out set of 360-degree images and finding that accuracy does not rise by a large factor when RoPE Rolling and the global-to-local pipeline are removed.

Figures

Figures reproduced from arXiv: 2607.02479 by Jianwen Sun, Jingtao Xu, Yawei Luo, Yi Yang, Zizhuo Lin.

Figure 1
Figure 1. Figure 1: Overview of Target Object Search in Panoramic Images. Left: Conventional MLLMs struggle with inherent 360◦ properties, such as polar distortion and edge continuity. Top Right: Previous panoramic search methods rely on fragmented local exploration. Burdened by fixed initialization and rigid rotations without a panoramic prior, they suffer from poor error recovery and lack fine-grained grounding capabilities… view at source ↗
Figure 2
Figure 2. Figure 2: The overall pipeline of the proposed EAGLE-360 framework. Our method adopts an embodied active exploration paradigm that transitions from a global panorama to a precise local view. Stage 1: Global Perception processes the input 360◦ panoramic image utilizing a novel RoPE Rolling Mechanism. This mechanism explicitly addresses the left-right boundary continuity issue, enabling the model to effectively unders… view at source ↗
Figure 3
Figure 3. Figure 3: Analysis of multi-turn reasoning dynamics. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A training data sample from kuula (Kuula, 2026) 19 [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: A training data sample from Matterport3D ( [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: A training data sample from 2D-3D-S (Armeni et al., 2017) 21 [PITH_FULL_IMAGE:figures/full_fig_p021_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: A inference case of EAGLE-360 22 [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: A inference case of EAGLE-360 23 [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: A inference case of EAGLE-360 24 [PITH_FULL_IMAGE:figures/full_fig_p024_9.png] view at source ↗
read the original abstract

While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360$^\circ$ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes EAGLE-360, an Embodied Active Global-to-Local Exploration framework for 360° visual search with MLLMs. It adapts RoPE Rolling to handle panoramic distortion and cylindrical topologies, constructs a dataset of 14,000+ 4K panoramas and 70,000+ VQA dialogues, and applies SFT followed by GRPO to train models for global-to-local reasoning and tool use. The central empirical claim is that this yields nearly an 8-fold accuracy gain over the base model while improving exploration efficiency, establishing new SOTA performance.

Significance. If the reported gains prove robust across baselines, metrics, and held-out data, the work would advance active panoramic search by moving beyond myopic local viewpoints toward global-prior-guided strategies. The dataset and RoPE Rolling adaptation could serve as useful resources for future embodied vision research.

major comments (1)
  1. [Experiments / Results] The central claim of an 8-fold accuracy increase rests on experimental results whose details (baselines, metrics, error bars, dataset splits, and statistical significance) are not verifiable from the provided abstract; this directly affects assessment of whether the RoPE Rolling + SFT+GRPO pipeline overcomes the stated limitations.
minor comments (2)
  1. Clarify the exact definition and implementation of 'RoPE Rolling' (coordinate-shifting mechanism) with reference to the original RoPE formulation to allow reproduction.
  2. The abstract states 'extensive experiments' but should explicitly list the evaluation environments, target detection criteria, and comparison methods in the main text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for greater experimental transparency. We address the single major comment below and will incorporate revisions to strengthen verifiability of the reported results.

read point-by-point responses
  1. Referee: [Experiments / Results] The central claim of an 8-fold accuracy increase rests on experimental results whose details (baselines, metrics, error bars, dataset splits, and statistical significance) are not verifiable from the provided abstract; this directly affects assessment of whether the RoPE Rolling + SFT+GRPO pipeline overcomes the stated limitations.

    Authors: We agree that an abstract alone cannot supply the full experimental protocol required for independent verification. The complete manuscript (Sections 4 and 5) specifies the baselines (vanilla MLLMs, local-viewpoint search agents, and prior panoramic methods), evaluation metrics (target detection accuracy and exploration steps), error bars from repeated runs with different seeds, the train/validation/test splits of the 14k-panorama EAGLE-360 dataset, and statistical significance via paired t-tests. Nevertheless, to make these details immediately accessible, we will revise the abstract to include a concise statement of the evaluation protocol and add an early consolidated results table. These changes directly respond to the concern while preserving the manuscript's existing technical content. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes an empirical ML framework: adaptation of RoPE Rolling for panoramic topologies, construction of a new EAGLE-360 dataset with VQA dialogues, and a training pipeline of SFT followed by GRPO to elicit spatial reasoning. All load-bearing claims rest on experimental performance metrics (accuracy gains, efficiency) rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations are presented that reduce to inputs by construction, and the central results are externally falsifiable via the reported benchmarks on the constructed dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities detailed beyond the high-level proposal of the framework and dataset.

axioms (1)
  • domain assumption Adapting RoPE Rolling positional encoding will seamlessly model continuous cylindrical topologies in panoramas
    Invoked as the architectural solution to panoramic properties without further justification in abstract.
invented entities (1)
  • EAGLE-360 framework no independent evidence
    purpose: To enable global-to-local embodied exploration in 360 environments
    New proposed system without independent evidence outside the paper.

pith-pipeline@v0.9.1-grok · 5849 in / 1245 out tokens · 27199 ms · 2026-07-03T14:48:09.899926+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Joint 2d-3d-semantic data for indoor scene understanding.ArXiv, abs/1702.01105. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhi- fang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025a. Qwen3-vl tech...

  2. [2]

    In2023 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 8042–8052

    Fedseg: Class-heterogeneous federated learn- ing for semantic segmentation. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 8042–8052. OpenAI. 2024. Hello gpt-4o. Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020. Reverie: Remote embodied vi- sual referring ex...

  3. [3]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Panosplatt3r: Leveraging perspective pretrain- ing for generalized unposed wide-baseline panorama reconstruction. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 28959–28969. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun- Mei Song, Mingchuan Zhang, Y . K. Li, Yu Wu, and Daya Guo. 2024. Deepseekmath: Pushing ...

  4. [4]

    Active-o3: Empowering multimodal large lan- guage models with active perception via grpo.ArXiv, abs/2505.21457. 11 A Code and Reproducibility To facilitate reproducibility and strictly adhere to the double-blind review process, we have anonymized and open-sourced the complete im- plementation ofEAGLE-360. The repository in- cludes the core code for the Ro...

  5. [5]

    is prompted to select exactly four spatially discriminative objects per panorama, subject to two hard constraints:spatial uniqueness(each selected object must appear in exactly one cubemap face; semantically equivalent instances across different faces are treated as duplicates) andcontent speci- ficity(generic structural elements such as floor, wall, ceil...

  6. [6]

    contributes 84%, 2D-3D-Semantics (Armeni et al., 2017) contributes 12.4%, and Kuula (Kuula,

  7. [7]

    type": "function

    contributes 3.6% in both subsets. Stage 5: Rotation-Based Data Augmentation. To enlarge the training set and improve rotational invariance, we apply three horizontal rotations— +90◦, +180◦, and +270◦—to every training panorama. Each rotation is implemented as a pixel-level horizontal roll of the ERP image by a fraction {1/4,1/2,3/4} of the image width, wh...

  8. [8]

    If you decide to call a tool, output ONLY the tool call and STOP immediately

  9. [9]

    DO NOT provide the final answer in the same response as the tool call

  10. [10]

    The tool will return a projected image, and you should wait for it before giving your answer

  11. [11]

    Note that when calling the tool, the pa- rameters should not be the same as those used previously. # Response Format • If you can answer directly with- out tools: <think>...</think> <answer>...</answer> • If you need to use a tool: <think>...</think> <tool_call>...</tool_call> (STOP HERE, wait for tool result) • After receiving the tool result and ready t...

  12. [12]

    For example, describe ’two chairs’ simply as ’chairs’

    Do not use numbers to describe the quan- tity of objects; use the plural form directly. For example, describe ’two chairs’ simply as ’chairs’

  13. [13]

    left" or

    Don’t describe items using horizontal di- rectional words such as "left" or "right"

  14. [14]

    a wooden dining table

    Ensure the correctness of the object de- scription and try to describe objects that you are 100% certain about. If an object is difficult to identify or cannot be described in words, do not describe it. Return your answer as a single Python list of strings. For example: ["a wooden dining table", "a hanging pendant light", "a large window with sheer curtai...

  15. [15]

    Unique Location:Select only items that appear in ONE direction. You must per- form a semantic check; for example, if a ’wooden chair’ is in the ’left’ view and a ’chair’ is in the ’front’ view, you cannot select it because the concept of a ’chair’ exists in multiple views

  16. [16]

    front": [

    Unique Items:Do not select generic background elements. Avoid items like ’sky’, ’clouds’, ’ground’, ’floor’, ’walls’, ’ceiling’, or ’water’. Focus on specific, identifiable objects that are salient to the scene. Return your answer as a Python Dictionary in standard JSON format. Example Input: { "front": ["a stone kitchen island", ...], ... } Example Outpu...