NaviMaster: Learning a Unified Policy for GUI and Embodied Navigation Tasks

Jingyu Gong; Min Wang; Wentao Yan; Xin Tan; Xuhong Wang; Yuan Xie; Zhihao Luo; Zhizhong Zhang

arxiv: 2508.02046 · v4 · submitted 2025-08-04 · 💻 cs.RO · cs.LG

Zhihao Luo, Wentao Yan, Jingyu Gong, Min Wang, Zhizhong Zhang, Xuhong Wang, Yuan Xie, Xin Tan

Pith reviewed 2026-05-19 01:28 UTC · model grok-4.3

keywords: unified navigation policy · GUI navigation · embodied navigation · Markov Decision Process · reinforcement learning · trajectory collection · distance-aware reward · data mixing

The pith

A single policy trained on mixed GUI and embodied trajectories unifies navigation across digital interfaces and physical robots.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that GUI navigation and embodied navigation can be treated as instances of the same Markov Decision Process. It introduces NaviMaster, which collects trajectories for both using one visual-target pipeline, trains a shared reinforcement learning agent on the combined data, and applies a distance-aware reward to guide efficient behavior. Experiments show this unified agent beats prior specialized systems on out-of-domain GUI tasks, spatial affordance prediction, and embodied navigation benchmarks. The approach reduces the need for separate engineering per domain and improves generalization through data mixing.

Core claim

NaviMaster is the first unified agent that formulates both GUI navigation and embodied navigation as Markov Decision Processes, generates trajectories for both via a single visual-target collection pipeline, trains one reinforcement learning policy on the mixed dataset, and uses a novel distance-aware reward, resulting in superior performance on out-of-domain benchmarks for GUI navigation, spatial affordance prediction, and embodied navigation.

What carries the argument

The visual-target trajectory collection pipeline combined with a unified MDP formulation and distance-aware reward that enables training a single policy on data from both GUI and embodied domains.
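One way to picture the shared MDP framing is a single transition record that both domains fill in. The sketch below is illustrative only: the class names, field names, action kinds, and file names are hypothetical and are not taken from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Action:
    kind: str                                    # hypothetical label, e.g. "click" or "move_to"
    point: Optional[Tuple[float, float]] = None  # predicted visual target in image coordinates


@dataclass(frozen=True)
class Transition:
    domain: str       # "gui" or "embodied"
    observation: str  # placeholder id for the image the policy sees
    instruction: str  # natural-language goal
    action: Action
    reward: float


def episode_return(steps: List[Transition], gamma: float = 1.0) -> float:
    """Discounted sum of rewards over one episode."""
    total, discount = 0.0, 1.0
    for step in steps:
        total += discount * step.reward
        discount *= gamma
    return total


# Both domains use the same record type; only the contents differ.
gui_step = Transition("gui", "screenshot_0.png", "open settings",
                      Action("click", (120.0, 48.0)), 0.5)
robot_step = Transition("embodied", "frame_0.png", "go to the door",
                        Action("move_to", (310.0, 200.0)), 1.0)
```

Because both steps are plain `Transition` values, a single training loop could consume them without branching on the domain, which is the property the unification argument relies on.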

If this is right

Mixing data from GUI and embodied sources improves generalization to new tasks in each domain.
The distance-aware reward supports efficient learning without separate reward engineering for each setting.
Ablation results confirm that the unified training strategy and data mixing contribute measurably to the gains.
The same policy achieves strong results on spatial affordance prediction alongside the two navigation tasks.
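As a rough illustration of the data mixing mentioned in the first point, the sketch below draws a training batch from two trajectory pools at a chosen ratio. The function name, the `gui_fraction` parameter, and sampling with replacement are all assumptions made for this example; the paper's actual mixing strategy is not described on this page.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mix_trajectories(gui: Sequence[str], embodied: Sequence[str],
                     gui_fraction: float = 0.5, size: Optional[int] = None,
                     seed: int = 0) -> List[Tuple[str, str]]:
    """Sample a shuffled batch of (domain, trajectory) pairs at a fixed ratio."""
    if not 0.0 <= gui_fraction <= 1.0:
        raise ValueError("gui_fraction must be in [0, 1]")
    if size is None:
        size = len(gui) + len(embodied)
    n_gui = round(size * gui_fraction)
    n_emb = size - n_gui
    if (n_gui > 0 and not gui) or (n_emb > 0 and not embodied):
        raise ValueError("cannot sample from an empty pool")
    rng = random.Random(seed)
    # Sampling with replacement (an assumption) lets a small pool fill its share.
    batch = [("gui", t) for t in rng.choices(gui, k=n_gui)]
    batch += [("embodied", t) for t in rng.choices(embodied, k=n_emb)]
    rng.shuffle(batch)
    return batch


batch = mix_trajectories(["g1", "g2"], ["e1", "e2", "e3"],
                         gui_fraction=0.25, size=8)
```

Varying `gui_fraction` while holding `size` fixed is one way to separate the effect of the mix from the effect of raw data volume.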

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The unification suggests that other screen-based and physical interaction tasks could be folded into the same framework with minimal additional design.
If negative transfer remains absent at larger scales, developers could maintain one navigation model instead of separate GUI and robot systems.
Hybrid scenarios that combine screen use and physical movement, such as controlling a robot via a tablet interface, become a direct next test case.

Load-bearing premise

A single visual-target trajectory pipeline and shared MDP formulation can produce high-quality training data for both GUI and embodied tasks without major domain-specific changes or harmful interference between the two.

What would settle it

Training the same policy on the mixed dataset yields lower success rates on GUI navigation benchmarks than a GUI-only baseline, or lower rates on embodied navigation benchmarks than an embodied-only baseline.
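This criterion can be written as a small check. The sketch below is a minimal version, assuming success rates are available per domain as plain floats; the function name, the `tolerance` parameter, and the numbers in the usage lines are made up for illustration and are not results from the paper.

```python
from typing import Dict, List


def negative_transfer(joint: Dict[str, float], single: Dict[str, float],
                      tolerance: float = 0.0) -> List[str]:
    """Return the domains where the joint policy trails its single-domain baseline.

    `joint[d]` is the mixed-data policy's success rate on domain d and
    `single[d]` is the rate of a policy trained on domain d alone.
    """
    return sorted(d for d in single if joint[d] < single[d] - tolerance)


# Illustrative numbers only, not taken from the paper.
joint_rates = {"gui": 0.62, "embodied": 0.40}
single_rates = {"gui": 0.60, "embodied": 0.45}
flagged = negative_transfer(joint_rates, single_rates)
```

With these made-up inputs, `flagged` contains only `"embodied"`. A nonzero `tolerance` could absorb run-to-run noise, though choosing its value would need variance estimates from repeated runs.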

Original abstract

Recent advances in Graphical User Interface (GUI) and embodied navigation have driven progress, yet these domains have largely evolved in isolation, with disparate datasets and training paradigms. In this paper, we observe that both tasks can be formulated as Markov Decision Processes (MDPs), suggesting a foundational principle for their unification. Hence, we present NaviMaster, the first unified agent capable of unifying GUI navigation and embodied navigation within a single framework. Specifically, NaviMaster (i) proposes a visual-target trajectory collection pipeline that generates trajectories for both GUI and embodied tasks using a single formulation; (ii) employs a unified reinforcement learning framework on the mixed data to improve generalization; and (iii) designs a novel distance-aware reward to ensure efficient learning from the trajectories. Through extensive experiments on out-of-domain benchmarks, NaviMaster is shown to outperform state-of-the-art agents in GUI navigation, spatial affordance prediction, and embodied navigation. Ablation studies further demonstrate the efficacy of our unified training strategy, data mixing strategy, and reward design. Our code, data, and checkpoints are available at https://iron-boyy.github.io/navimaster-page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NaviMaster gives a concrete empirical test of unifying GUI and embodied navigation under one policy via shared MDP framing, mixed trajectories, and a distance-aware reward, with released code that lets others check the results.

NaviMaster shows that both GUI navigation and embodied navigation can be cast as MDPs and trained together in one reinforcement learning setup. The paper introduces a visual-target trajectory collection pipeline that works for both domains, runs joint training on the mixed data, and adds a distance-aware reward to guide learning. The authors report gains over prior specialized agents on out-of-domain GUI benchmarks, spatial affordance prediction, and embodied navigation tasks, plus ablations that isolate the effects of the unified training, the data mix, and the reward term. Releasing the code, data, and checkpoints is a clear plus for anyone who wants to reproduce or extend the work.

Referee Report

2 major / 2 minor

Summary. The paper claims that GUI navigation and embodied navigation can both be cast as MDPs and presents NaviMaster as the first unified agent. It introduces a single visual-target trajectory collection pipeline, trains a shared policy via reinforcement learning on mixed GUI and embodied data, and uses a novel distance-aware reward. Experiments on out-of-domain benchmarks reportedly show outperformance over prior SOTA methods in GUI navigation, spatial affordance prediction, and embodied navigation; ablations are said to confirm the value of unified training, data mixing, and the reward design. Code, data, and checkpoints are released.

Significance. If the empirical results hold and the unification avoids negative transfer, the work would be significant for bridging two previously separate navigation domains under a shared MDP and training paradigm, potentially enabling more generalist agents that operate across digital interfaces and physical environments. The open release of code, data, and checkpoints is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Ablation studies] Ablation studies (and the associated tables/figures): while the paper reports that unified training and data mixing improve performance, it does not present a direct per-domain comparison of the joint policy against separately trained GUI-only and embodied-only policies that use identical trajectory data, the same distance-aware reward, and the same MDP formulation. Without this control, it remains possible that the reported gains reflect data volume rather than successful unification and that negative transfer occurs on at least one domain given the substantial differences in observation spaces (2-D pixel vs. 3-D sensor), action spaces (discrete clicks vs. locomotion primitives), and reward scales.
[Method (reward design)] Section describing the distance-aware reward and the unified MDP: the claim that this reward ensures efficient learning across domains would be strengthened by an explicit analysis showing that the reward formulation does not require domain-specific scaling or clipping; otherwise the unification may still embed hidden per-domain engineering.

minor comments (2)

[Introduction] The introduction and abstract refer to 'out-of-domain benchmarks' without immediately listing the concrete datasets and metrics used for each task (GUI, affordance, embodied); moving a concise summary table or bullet list to the introduction would improve readability.
[Preliminaries / Method] Notation for the shared MDP components (state, action, reward) should be introduced once with a single table or diagram rather than re-defined separately for GUI and embodied sections to avoid any appearance of domain-specific re-engineering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully addressed each major comment below and revised the paper to strengthen the presentation of our results and method.

Referee: [Ablation studies] Ablation studies (and the associated tables/figures): while the paper reports that unified training and data mixing improve performance, it does not present a direct per-domain comparison of the joint policy against separately trained GUI-only and embodied-only policies that use identical trajectory data, the same distance-aware reward, and the same MDP formulation. Without this control, it remains possible that the reported gains reflect data volume rather than successful unification and that negative transfer occurs on at least one domain given the substantial differences in observation spaces (2-D pixel vs. 3-D sensor), action spaces (discrete clicks vs. locomotion primitives), and reward scales.

Authors: We appreciate this valid point regarding the need for a controlled comparison. Our existing ablations isolate the contributions of unified training and data mixing by comparing against ablated variants of the joint policy. To directly address the concern about data volume versus unification benefits and potential negative transfer, we have added new experiments in the revised manuscript. These train separate GUI-only and embodied-only policies using identical trajectory data, the same distance-aware reward, and the same MDP formulation. The results, presented in a new Table 6 and Section 4.3, show that the unified policy matches or exceeds the performance of the domain-specific policies on both GUI and embodied benchmarks, with no observed negative transfer despite differences in observation and action spaces. This supports the value of the shared policy and mixed-data training.
revision: yes
Referee: [Method (reward design)] Section describing the distance-aware reward and the unified MDP: the claim that this reward ensures efficient learning across domains would be strengthened by an explicit analysis showing that the reward formulation does not require domain-specific scaling or clipping; otherwise the unification may still embed hidden per-domain engineering.

Authors: We thank the referee for this suggestion to clarify the domain-agnostic nature of the reward. The distance-aware reward uses a normalized formulation based on the relative distance to the visual target, applied uniformly without per-domain adjustments. In the revised manuscript, we have added an explicit analysis in Section 3.3 that reports reward statistics and distributions for both GUI and embodied tasks using identical hyperparameters. This confirms that the same reward function operates effectively across domains without requiring scaling or clipping, supporting the unification claim.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical unification via shared MDP and training pipeline

The paper observes that both GUI and embodied navigation can be cast as MDPs, then introduces a visual-target trajectory pipeline, unified RL training on mixed data, and a distance-aware reward. All load-bearing claims are validated through out-of-domain benchmarks, ablations on joint vs. separate training, and comparisons to SOTA agents. No equation or result reduces by construction to a fitted parameter, self-citation, or renamed input; the unification is presented as an engineering outcome whose efficacy is measured externally rather than assumed tautologically.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides almost no explicit free parameters, axioms, or invented entities; the unification rests on the unstated assumption that both domains share a compatible MDP structure and that mixed training does not degrade performance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "designs a novel distance-aware reward... R_G(i, j) = 1 - d_j / θ_d if d_j < θ_d... d_j = sqrt((x̂_j - x_i)² + (ŷ_j - y_i)²)"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "both tasks can be formulated as Markov Decision Processes (MDP)... unified reinforcement learning framework on the mix data"
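The reward fragment quoted in this section can be turned into a short sketch. It implements only the branch that is visible in the quote, 1 - d / θ_d when the Euclidean distance d is below the threshold θ_d. The quote elides the remaining branch, so returning 0 there is an assumption, as are the function name and the reading of the hatted coordinates as the predicted point.

```python
import math
from typing import Tuple


def distance_aware_reward(pred: Tuple[float, float],
                          target: Tuple[float, float],
                          theta_d: float) -> float:
    """Reward that decays linearly with distance inside a threshold radius."""
    if theta_d <= 0:
        raise ValueError("theta_d must be positive")
    # d = sqrt((x_hat - x)^2 + (y_hat - y)^2), as in the quoted passage.
    d = math.hypot(pred[0] - target[0], pred[1] - target[1])
    if d < theta_d:
        return 1.0 - d / theta_d
    # ASSUMPTION: the quoted fragment does not show this branch; 0 is a placeholder.
    return 0.0


reward = distance_aware_reward((3.0, 4.0), (0.0, 0.0), theta_d=10.0)
```

Here the predicted point is 5 units from the target and the threshold is 10, so `reward` is 0.5. An exact hit gives 1.0, and the value falls toward 0 as the prediction approaches the threshold radius.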

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.