EAGLE-360: Embodied Active Global-to-Local Exploration in 360^circ
Pith reviewed 2026-07-03 14:48 UTC · model grok-4.3
The pith
Adapting positional encoding and global priors lets MLLMs search 360-degree panoramas by starting wide then narrowing locally.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EAGLE-360 claims that leveraging global priors for an initial holistic perspective, iteratively narrowing the search space, and adapting RoPE Rolling to model continuous cylindrical topologies allows MLLMs to overcome panoramic distortion; when combined with SFT and GRPO training on the 14,000+ panorama and 70,000+ dialogue dataset, this elicits spatial reasoning and tool use that produces nearly an eight-fold accuracy gain over the base model and markedly better exploration efficiency.
What carries the argument
RoPE Rolling, the coordinate-shifting positional encoding adapted to continuous cylindrical topologies, which carries the global-to-local reasoning pipeline by letting the model maintain awareness across the full panorama rather than isolated local views.
If this is right
- Agents maintain global context and recover from cases where a target leaves the current local view.
- Exploration avoids rigid initialization and exhaustive local scanning by using holistic priors to shrink the search space step by step.
- The training pipeline produces both spatial reasoning and tool-calling behavior from the VQA dialogues.
- Efficiency gains appear as fewer steps are needed to locate targets compared with fragmented local methods.
Where Pith is reading between the lines
- The same global-first narrowing pattern could be tested in other continuous domains such as spherical medical scans or full-sphere astronomy data.
- Real-world robotic deployment would reveal whether the learned reasoning transfers when camera motion and lighting vary beyond the static panorama dataset.
- If the accuracy lift holds, the approach supplies a template for adding cylindrical-aware positional encodings to other multimodal tasks that wrap around a closed surface.
Load-bearing premise
That the RoPE Rolling adaptation together with SFT and GRPO training on the constructed dataset will overcome panoramic distortion and cylindrical topology problems to produce reliable spatial reasoning.
What would settle it
Running the base model and the trained EAGLE-360 model on the same held-out set of 360-degree images and finding that accuracy does not rise by a large factor when RoPE Rolling and the global-to-local pipeline are removed.
Figures
read the original abstract
While Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in standard visual understanding, adapting them for active visual search in 360$^\circ$ panoramic environments exposes fundamental limitations. Specifically, standard MLLMs struggle to effectively model inherent panoramic properties, such as severe polar distortion and continuous cylindrical topologies, which significantly degrades target detection accuracy. Consequently, existing panoramic search methods attempt to compensate by relying heavily on fragmented local viewpoints. Burdened by rigid initialization and a lack of global panoramic priors, these approaches suffer from myopic, inefficient exploration and struggle with robust error recovery when targets are out of view. To overcome these challenges, we propose EAGLE-360, a novel Embodied Active Global-to-Local Exploration framework. Rather than performing exhaustive local searches, EAGLE-360 leverages global priors to establish an initial holistic perspective, iteratively reasoning and progressively narrowing the search space. Architecturally, we adapt RoPE Rolling, a coordinate-shifting positional encoding mechanism, to seamlessly model the continuous topologies of panoramas. To facilitate this paradigm, we construct the large-scale EAGLE-360 dataset, comprising 14,000+ 4K panoramas and 70,000+ rounds of high-quality VQA dialogues. By employing a training pipeline that integrates Supervised Fine-Tuning (SFT) with Group Relative Policy Optimization (GRPO), we effectively elicit complex spatial reasoning and tool-calling capabilities. Extensive experiments demonstrate that EAGLE-360 establishes a new state-of-the-art for 360$^\circ$ visual search, achieving nearly an 8-fold increase in accuracy over the base model while significantly enhancing exploration efficiency.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes EAGLE-360, an Embodied Active Global-to-Local Exploration framework for 360° visual search with MLLMs. It adapts RoPE Rolling to handle panoramic distortion and cylindrical topologies, constructs a dataset of 14,000+ 4K panoramas and 70,000+ VQA dialogues, and applies SFT followed by GRPO to train models for global-to-local reasoning and tool use. The central empirical claim is that this yields nearly an 8-fold accuracy gain over the base model while improving exploration efficiency, establishing new SOTA performance.
Significance. If the reported gains prove robust across baselines, metrics, and held-out data, the work would advance active panoramic search by moving beyond myopic local viewpoints toward global-prior-guided strategies. The dataset and RoPE Rolling adaptation could serve as useful resources for future embodied vision research.
major comments (1)
- [Experiments / Results] The central claim of an 8-fold accuracy increase rests on experimental results whose details (baselines, metrics, error bars, dataset splits, and statistical significance) are not verifiable from the provided abstract; this directly affects assessment of whether the RoPE Rolling + SFT+GRPO pipeline overcomes the stated limitations.
minor comments (2)
- Clarify the exact definition and implementation of 'RoPE Rolling' (coordinate-shifting mechanism) with reference to the original RoPE formulation to allow reproduction.
- The abstract states 'extensive experiments' but should explicitly list the evaluation environments, target detection criteria, and comparison methods in the main text.
Simulated Author's Rebuttal
We thank the referee for their review and for highlighting the need for greater experimental transparency. We address the single major comment below and will incorporate revisions to strengthen verifiability of the reported results.
read point-by-point responses
-
Referee: [Experiments / Results] The central claim of an 8-fold accuracy increase rests on experimental results whose details (baselines, metrics, error bars, dataset splits, and statistical significance) are not verifiable from the provided abstract; this directly affects assessment of whether the RoPE Rolling + SFT+GRPO pipeline overcomes the stated limitations.
Authors: We agree that an abstract alone cannot supply the full experimental protocol required for independent verification. The complete manuscript (Sections 4 and 5) specifies the baselines (vanilla MLLMs, local-viewpoint search agents, and prior panoramic methods), evaluation metrics (target detection accuracy and exploration steps), error bars from repeated runs with different seeds, the train/validation/test splits of the 14k-panorama EAGLE-360 dataset, and statistical significance via paired t-tests. Nevertheless, to make these details immediately accessible, we will revise the abstract to include a concise statement of the evaluation protocol and add an early consolidated results table. These changes directly respond to the concern while preserving the manuscript's existing technical content. revision: yes
Circularity Check
No significant circularity identified
full rationale
The paper describes an empirical ML framework: adaptation of RoPE Rolling for panoramic topologies, construction of a new EAGLE-360 dataset with VQA dialogues, and a training pipeline of SFT followed by GRPO to elicit spatial reasoning. All load-bearing claims rest on experimental performance metrics (accuracy gains, efficiency) rather than any closed-form derivation, fitted parameter renamed as prediction, or self-citation chain. No equations are presented that reduce to inputs by construction, and the central results are externally falsifiable via the reported benchmarks on the constructed dataset.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Adapting RoPE Rolling positional encoding will seamlessly model continuous cylindrical topologies in panoramas
invented entities (1)
-
EAGLE-360 framework
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Joint 2d-3d-semantic data for indoor scene understanding.ArXiv, abs/1702.01105. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhi- fang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, and 45 others. 2025a. Qwen3-vl tech...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
In2023 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 8042–8052
Fedseg: Class-heterogeneous federated learn- ing for semantic segmentation. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recog- nition (CVPR), pages 8042–8052. OpenAI. 2024. Hello gpt-4o. Yuankai Qi, Qi Wu, Peter Anderson, Xin Wang, William Yang Wang, Chunhua Shen, and Anton van den Hengel. 2020. Reverie: Remote embodied vi- sual referring ex...
2024
-
[3]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Panosplatt3r: Leveraging perspective pretrain- ing for generalized unposed wide-baseline panorama reconstruction. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision (ICCV), pages 28959–28969. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Jun- Mei Song, Mingchuan Zhang, Y . K. Li, Yu Wu, and Daya Guo. 2024. Deepseekmath: Pushing ...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
Active-o3: Empowering multimodal large lan- guage models with active perception via grpo.ArXiv, abs/2505.21457. 11 A Code and Reproducibility To facilitate reproducibility and strictly adhere to the double-blind review process, we have anonymized and open-sourced the complete im- plementation ofEAGLE-360. The repository in- cludes the core code for the Ro...
work page internal anchor Pith review Pith/arXiv arXiv 2048
-
[5]
is prompted to select exactly four spatially discriminative objects per panorama, subject to two hard constraints:spatial uniqueness(each selected object must appear in exactly one cubemap face; semantically equivalent instances across different faces are treated as duplicates) andcontent speci- ficity(generic structural elements such as floor, wall, ceil...
2024
-
[6]
contributes 84%, 2D-3D-Semantics (Armeni et al., 2017) contributes 12.4%, and Kuula (Kuula,
2017
-
[7]
type": "function
contributes 3.6% in both subsets. Stage 5: Rotation-Based Data Augmentation. To enlarge the training set and improve rotational invariance, we apply three horizontal rotations— +90◦, +180◦, and +270◦—to every training panorama. Each rotation is implemented as a pixel-level horizontal roll of the ERP image by a fraction {1/4,1/2,3/4} of the image width, wh...
-
[8]
If you decide to call a tool, output ONLY the tool call and STOP immediately
-
[9]
DO NOT provide the final answer in the same response as the tool call
-
[10]
The tool will return a projected image, and you should wait for it before giving your answer
-
[11]
Note that when calling the tool, the pa- rameters should not be the same as those used previously. # Response Format • If you can answer directly with- out tools: <think>...</think> <answer>...</answer> • If you need to use a tool: <think>...</think> <tool_call>...</tool_call> (STOP HERE, wait for tool result) • After receiving the tool result and ready t...
-
[12]
For example, describe ’two chairs’ simply as ’chairs’
Do not use numbers to describe the quan- tity of objects; use the plural form directly. For example, describe ’two chairs’ simply as ’chairs’
-
[13]
left" or
Don’t describe items using horizontal di- rectional words such as "left" or "right"
-
[14]
a wooden dining table
Ensure the correctness of the object de- scription and try to describe objects that you are 100% certain about. If an object is difficult to identify or cannot be described in words, do not describe it. Return your answer as a single Python list of strings. For example: ["a wooden dining table", "a hanging pendant light", "a large window with sheer curtai...
-
[15]
Unique Location:Select only items that appear in ONE direction. You must per- form a semantic check; for example, if a ’wooden chair’ is in the ’left’ view and a ’chair’ is in the ’front’ view, you cannot select it because the concept of a ’chair’ exists in multiple views
-
[16]
front": [
Unique Items:Do not select generic background elements. Avoid items like ’sky’, ’clouds’, ’ground’, ’floor’, ’walls’, ’ceiling’, or ’water’. Focus on specific, identifiable objects that are salient to the scene. Return your answer as a Python Dictionary in standard JSON format. Example Input: { "front": ["a stone kitchen island", ...], ... } Example Outpu...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.