AdaTracker: Learning Adaptive In-Context Policy for Cross-Embodiment Active Visual Tracking

Churan Wang; Fangwei Zhong; Haijun Liu; Hao Chen; Jinzhu Han; Kui Wu; Si Liu; Yizhou Wang; Zhoujun Li

arxiv: 2604.20305 · v1 · submitted 2026-04-22 · 💻 cs.RO

AdaTracker: Learning Adaptive In-Context Policy for Cross-Embodiment Active Visual Tracking

Kui Wu , Hao Chen , Jinzhu Han , Haijun Liu , Churan Wang , Yizhou Wang , Zhoujun Li , Si Liu

show 1 more author

Fangwei Zhong

This is my paper

Pith reviewed 2026-05-10 00:31 UTC · model grok-4.3

classification 💻 cs.RO

keywords cross-embodiment learningactive visual trackingin-context policyembodiment context encoderzero-shot adaptationrobot morphologiescontext-aware policyauxiliary objectives

0 comments

The pith

A single learned policy can track visual targets across unseen robot bodies by encoding embodiment constraints from interaction history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that active visual tracking does not require separate models for each robot shape. Instead, an Embodiment Context Encoder reads a short history of actions and observations to infer the hidden physical limits of the current body. This inferred context then modulates a shared Context-Aware Policy so that the same network produces suitable control commands for any new morphology without further training. If the approach holds, one model could serve many different platforms, cutting the cost of retraining and improving sample efficiency. The authors add two auxiliary losses to keep the context estimate accurate and temporally stable.

Core claim

AdaTracker trains an Embodiment Context Encoder that extracts a compact representation of embodiment-specific constraints directly from a short trajectory of past states and actions. This representation is fed as additional input to a Context-Aware Policy, allowing the policy to generate embodiment-appropriate tracking actions in a zero-shot manner. Two auxiliary objectives—one for context reconstruction and one for temporal consistency—are optimized jointly to make the inferred context reliable.

What carries the argument

Embodiment Context Encoder that infers morphology constraints from interaction history and dynamically modulates the Context-Aware Policy.

If this is right

One policy network suffices for active tracking on robots with widely different kinematics and dynamics.
New robot platforms can be used immediately without collecting embodiment-specific training data.
Sample efficiency improves because the model reuses experience across morphologies rather than starting from scratch for each.
Zero-shot adaptation becomes possible in both simulation and real-world hardware.
Auxiliary context and consistency losses stabilize performance when embodiment identification is noisy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same history-based context mechanism could be tested on other control tasks such as navigation or manipulation where body limits affect action feasibility.
If history proves insufficient for some morphologies, an explicit embodiment description or brief calibration phase might be needed as a fallback.
Scaling the approach to very high-dimensional or deformable bodies would require checking whether the context encoder still captures the relevant constraints in a low-dimensional latent space.

Load-bearing premise

A short history of past interactions always contains enough information to correctly identify the physical constraints of a previously unseen robot body.

What would settle it

Deploy the trained model on a new robot whose motion limits cannot be inferred from any short interaction history (for example, a robot with an unobservable joint range that only appears after many steps) and measure whether zero-shot tracking success drops to the level of a non-adaptive baseline.

Figures

Figures reproduced from arXiv: 2604.20305 by Churan Wang, Fangwei Zhong, Haijun Liu, Hao Chen, Jinzhu Han, Kui Wu, Si Liu, Yizhou Wang, Zhoujun Li.

**Figure 1.** Figure 1: AdaTracker enables a single, unified policy to perform Embodied Visual Tracking (EVT) across heterogeneous robotic platforms with diverse viewpoints and motion dynamics, achieving cross-embodiment generalization and robust real-world tracking without retraining or recalibration. configuration. Consequently, a policy trained for a wheeled robot often fails catastrophically when deployed on a highaltitude d… view at source ↗

**Figure 2.** Figure 2: Our framework learns a single, unified policy capable of zero-shot transfer across diverse robot morphologies. The Embodiment Context Encoder [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 1.** Figure 1: During evaluation, the target person follows an [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗

read the original abstract

Realizing active visual tracking with a single unified model across diverse robots is challenging, as the physical constraints and motion dynamics vary drastically from one platform to another. Existing approaches typically train separate models for each embodiment, leading to poor scalability and limited generalization. To address this, we propose AdaTracker, an adaptive in-context policy learning framework that robustly tracks targets on diverse robot morphologies. Our key insight is to explicitly model embodiment-specific constraints through an Embodiment Context Encoder, which infers embodiment-specific constraints from history. This contextual representation dynamically modulates a Context-Aware Policy, enabling it to infer optimal control actions for unseen embodiments in a zero-shot manner. To enhance robustness, we introduce two auxiliary objectives to ensure accurate context identification and temporal consistency. Experiments in both simulation and the real world demonstrate that AdaTracker significantly outperforms state-of-the-art methods in cross-embodiment generalization, sample efficiency, and zero-shot adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaTracker gives a tested way to do zero-shot cross-embodiment active tracking by encoding robot constraints from interaction history.

read the letter

The main point is that this paper shows a working in-context approach for active visual tracking that adapts to new robot bodies without retraining. They add an Embodiment Context Encoder that reads history to infer morphology-specific limits, then uses that to modulate the policy on the fly. Two auxiliary losses push the encoder toward accurate context and temporal stability. That setup is the concrete new piece compared to standard per-embodiment training or generic meta-learning for tracking.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes AdaTracker, an adaptive in-context policy learning framework for active visual tracking that operates across diverse robot embodiments. It introduces an Embodiment Context Encoder to infer embodiment-specific constraints from interaction history, which dynamically modulates a Context-Aware Policy to enable zero-shot adaptation to unseen morphologies. Two auxiliary objectives are added to promote accurate context identification and temporal consistency. The approach is evaluated in simulation across multiple robot platforms with held-out morphologies for zero-shot testing, plus real-world transfer experiments, claiming superior cross-embodiment generalization, sample efficiency, and adaptation relative to state-of-the-art methods.

Significance. If the experimental results hold, the work would be significant for robotics by offering a scalable alternative to per-embodiment model training for visual tracking tasks such as surveillance or navigation. The explicit use of in-context learning to model morphology constraints from history represents a targeted approach to generalization. Credit is given for the simulation experiments that directly test zero-shot transfer on held-out embodiments, the real-world validation, and the ablations on auxiliary objectives that address the assumption that interaction history suffices to identify unseen morphology constraints.

minor comments (3)

[Abstract] The abstract asserts outperformance and zero-shot capability but does not include any quantitative metrics, specific baselines, or numerical results; while the full experiments section supplies these, including one or two key numbers in the abstract would improve immediate readability.
[§3.1] §3.1: The input representation to the Embodiment Context Encoder (e.g., exact history length, state features included, and handling of variable-length sequences across embodiments) is described at a high level; a concrete example or pseudocode would clarify how the encoder processes raw interaction trajectories.
[Experiments] Table 2 (or equivalent results table): Reporting standard deviations or confidence intervals across random seeds for the success rate and tracking error metrics would strengthen the statistical reliability of the cross-embodiment and ablation comparisons.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of AdaTracker and the recommendation for minor revision. The provided summary accurately reflects the manuscript's contributions to zero-shot cross-embodiment active visual tracking via in-context learning.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core derivation introduces an Embodiment Context Encoder that processes interaction history to produce a contextual representation, which then modulates a Context-Aware Policy for zero-shot control on unseen morphologies. Two auxiliary objectives are added explicitly for context accuracy and temporal consistency. These elements are trained end-to-end and evaluated via held-out embodiment experiments in simulation plus real-world transfer, providing external validation rather than any definitional reduction. No equations equate a claimed prediction to a fitted input by construction, no load-bearing self-citations justify uniqueness, and no ansatz or renaming of known results is presented as a first-principles derivation. The architecture follows standard in-context adaptation patterns without internal circular equivalence to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

Only the abstract is available; the ledger is therefore incomplete and based on high-level description.

free parameters (1)

neural network weights in policy and encoder
Standard deep learning parameters fitted during training on simulation data.

axioms (1)

domain assumption Interaction history contains sufficient information to identify embodiment-specific constraints
Invoked by the design of the Embodiment Context Encoder.

invented entities (1)

Embodiment Context Encoder no independent evidence
purpose: Infer embodiment-specific constraints from history to modulate the policy
New component introduced to enable zero-shot adaptation.

pith-pipeline@v0.9.0 · 5479 in / 1153 out tokens · 45459 ms · 2026-05-10T00:31:33.867626+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

[1]

Ad-vat+: An asymmetric dueling mechanism for learning and understanding visual active tracking,

F. Zhong, P. Sun, W. Luo, T. Yan, and Y . Wang, “Ad-vat+: An asymmetric dueling mechanism for learning and understanding visual active tracking,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1467–1482, 2019

work page 2019
[2]

Trackvla: Embodied visual tracking in the wild,

S. Wang, J. Zhang, M. Li, J. Liu, A. Li, K. Wu, F. Zhong, J. Yu, Z. Zhang, and H. Wang, “Trackvla: Embodied visual tracking in the wild,” inCoRL, pp. 4139–4164, PMLR, 2025

work page 2025
[3]

End-to-end active object tracking and its real-world deployment via reinforcement learning,

W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y . Wang, “End-to-end active object tracking and its real-world deployment via reinforcement learning,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 42, no. 6, pp. 1317–1332, 2020

work page 2020
[4]

Coarse-to-fine uav target tracking with deep reinforcement learning,

W. Zhang, K. Song, X. Rong, and Y . Li, “Coarse-to-fine uav target tracking with deep reinforcement learning,”IEEE Transactions on Automation Science and Engineering, vol. 16, no. 4, pp. 1522–1530, 2018

work page 2018
[5]

Anti-distractor active object tracking in 3d environments,

M. Xi, Y . Zhou, Z. Chen, W. Zhou, and H. Li, “Anti-distractor active object tracking in 3d environments,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 6, pp. 3697–3707, 2021

work page 2021
[6]

Pose-assisted multi-camera collaboration for active object tracking,

J. Li, J. Xu, F. Zhong, X. Kong, Y . Qiao, and Y . Wang, “Pose-assisted multi-camera collaboration for active object tracking,” inAAAI, vol. 34, pp. 759–766, 2020

work page 2020
[7]

Enhancing continuous control of mobile robots for end-to-end visual active tracking,

A. Devo, A. Dionigi, and G. Costante, “Enhancing continuous control of mobile robots for end-to-end visual active tracking,”Robotics and Autonomous Systems, vol. 142, p. 103799, 2021

work page 2021
[8]

Rspt: reconstruct surroundings and predict trajectory for generalizable active object track- ing,

F. Zhong, X. Bi, Y . Zhang, W. Zhang, and Y . Wang, “Rspt: reconstruct surroundings and predict trajectory for generalizable active object track- ing,” inAAAI, pp. 3705–3714, 2023

work page 2023
[9]

Conservative q-learning for offline reinforcement learning,

A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,”NeurIPS, vol. 33, pp. 1179–1191, 2020

work page 2020
[10]

Empowering embodied visual tracking with visual foundation models and offline rl,

F. Zhong, K. Wu, H. Ci, C. Wang, and H. Chen, “Empowering embodied visual tracking with visual foundation models and offline rl,” inECCV, pp. 139–155, 2024

work page 2024
[11]

Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0,

A. O’Neill, Rehman,et al., “Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0,” inICRA, pp. 6892–6903, 2024

work page 2024
[12]

Body transformer: Leveraging robot embodiment for policy learning,

C. Sferrazza, D.-M. Huang, F. Liu, J. Lee, and P. Abbeel, “Body transformer: Leveraging robot embodiment for policy learning,” in CoRL, 2024

work page 2024
[13]

AMAGO: Scalable in-context reinforce- ment learning for adaptive agents,

J. Grigsby, L. Fan, and Y . Zhu, “AMAGO: Scalable in-context reinforce- ment learning for adaptive agents,” inICLR, 2024

work page 2024
[14]

Icrt: In-context imitation learning via next- token prediction,

M. Fu, H. Huang,et al., “Icrt: In-context imitation learning via next- token prediction,” inICRA, 2025

work page 2025
[15]

Less is more: Token context-aware learning for object tracking,

C. Xu, B. Zhong, Q. Liang, Y . Zheng, G. Li, and S. Song, “Less is more: Token context-aware learning for object tracking,” inAAAI, pp. 8824– 8832, 2025

work page 2025
[16]

Context-aware dynamics model for generalization in model-based reinforcement learning,

K. Lee, Y . Seo, S. Lee, H. Lee, and J. Shin, “Context-aware dynamics model for generalization in model-based reinforcement learning,” in ICML, pp. 5757–5766, PMLR, 2020

work page 2020
[17]

Goal-conditioned reinforcement learning: Problems and solutions

M. Liu, M. Zhu, and W. Zhang, “Goal-conditioned reinforcement learning: Problems and solutions,”arXiv preprint arXiv:2201.08299, 2022

work page arXiv 2022
[18]

Towards distraction- robust active visual tracking,

F. Zhong, P. Sun, W. Luo, T. Yan, and Y . Wang, “Towards distraction- robust active visual tracking,” inICML, pp. 12782–12792, PMLR, 2021

work page 2021
[19]

arXiv preprint arXiv:2305.06558 (2023)

Y . Cheng, L. Li, Y . Xu, X. Li, Z. Yang, W. Wang, and Y . Yang, “Segment and track anything,”arXiv preprint arXiv:2305.06558, 2023

work page arXiv 2023
[20]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inCVPR, pp. 770–778, 2016

work page 2016
[21]

Gradient-based learning applied to document recognition,

Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998

work page 1998
[22]

Unrealcv: Virtual worlds for computer vision,

W. Qiu, F. Zhong, Y . Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y . Wang, and A. Yuille, “Unrealcv: Virtual worlds for computer vision,” inACM MM, pp. 1221–1224, 2017

work page 2017
[23]

Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai,

F. Zhong, K. Wu, C. Wang, H. Chen, H. Ci, Z. Li, and Y . Wang, “Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai,” inICCV, pp. 5769–5779, 2025

work page 2025
[24]

SAM 2: Segment anything in images and videos,

N. Ravi, V . Gabeur,et al., “SAM 2: Segment anything in images and videos,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[25]

Grounding dino: Marrying dino with grounded pre- training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su,et al., “Grounding dino: Marrying dino with grounded pre- training for open-set object detection,” inECCV, pp. 38–55, Springer, 2024

work page 2024

[1] [1]

Ad-vat+: An asymmetric dueling mechanism for learning and understanding visual active tracking,

F. Zhong, P. Sun, W. Luo, T. Yan, and Y . Wang, “Ad-vat+: An asymmetric dueling mechanism for learning and understanding visual active tracking,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1467–1482, 2019

work page 2019

[2] [2]

Trackvla: Embodied visual tracking in the wild,

S. Wang, J. Zhang, M. Li, J. Liu, A. Li, K. Wu, F. Zhong, J. Yu, Z. Zhang, and H. Wang, “Trackvla: Embodied visual tracking in the wild,” inCoRL, pp. 4139–4164, PMLR, 2025

work page 2025

[3] [3]

End-to-end active object tracking and its real-world deployment via reinforcement learning,

W. Luo, P. Sun, F. Zhong, W. Liu, T. Zhang, and Y . Wang, “End-to-end active object tracking and its real-world deployment via reinforcement learning,”IEEE Transactions on Pattern Analysis and Machine Intelli- gence, vol. 42, no. 6, pp. 1317–1332, 2020

work page 2020

[4] [4]

Coarse-to-fine uav target tracking with deep reinforcement learning,

W. Zhang, K. Song, X. Rong, and Y . Li, “Coarse-to-fine uav target tracking with deep reinforcement learning,”IEEE Transactions on Automation Science and Engineering, vol. 16, no. 4, pp. 1522–1530, 2018

work page 2018

[5] [5]

Anti-distractor active object tracking in 3d environments,

M. Xi, Y . Zhou, Z. Chen, W. Zhou, and H. Li, “Anti-distractor active object tracking in 3d environments,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 6, pp. 3697–3707, 2021

work page 2021

[6] [6]

Pose-assisted multi-camera collaboration for active object tracking,

J. Li, J. Xu, F. Zhong, X. Kong, Y . Qiao, and Y . Wang, “Pose-assisted multi-camera collaboration for active object tracking,” inAAAI, vol. 34, pp. 759–766, 2020

work page 2020

[7] [7]

Enhancing continuous control of mobile robots for end-to-end visual active tracking,

A. Devo, A. Dionigi, and G. Costante, “Enhancing continuous control of mobile robots for end-to-end visual active tracking,”Robotics and Autonomous Systems, vol. 142, p. 103799, 2021

work page 2021

[8] [8]

Rspt: reconstruct surroundings and predict trajectory for generalizable active object track- ing,

F. Zhong, X. Bi, Y . Zhang, W. Zhang, and Y . Wang, “Rspt: reconstruct surroundings and predict trajectory for generalizable active object track- ing,” inAAAI, pp. 3705–3714, 2023

work page 2023

[9] [9]

Conservative q-learning for offline reinforcement learning,

A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,”NeurIPS, vol. 33, pp. 1179–1191, 2020

work page 2020

[10] [10]

Empowering embodied visual tracking with visual foundation models and offline rl,

F. Zhong, K. Wu, H. Ci, C. Wang, and H. Chen, “Empowering embodied visual tracking with visual foundation models and offline rl,” inECCV, pp. 139–155, 2024

work page 2024

[11] [11]

Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0,

A. O’Neill, Rehman,et al., “Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration0,” inICRA, pp. 6892–6903, 2024

work page 2024

[12] [12]

Body transformer: Leveraging robot embodiment for policy learning,

C. Sferrazza, D.-M. Huang, F. Liu, J. Lee, and P. Abbeel, “Body transformer: Leveraging robot embodiment for policy learning,” in CoRL, 2024

work page 2024

[13] [13]

AMAGO: Scalable in-context reinforce- ment learning for adaptive agents,

J. Grigsby, L. Fan, and Y . Zhu, “AMAGO: Scalable in-context reinforce- ment learning for adaptive agents,” inICLR, 2024

work page 2024

[14] [14]

Icrt: In-context imitation learning via next- token prediction,

M. Fu, H. Huang,et al., “Icrt: In-context imitation learning via next- token prediction,” inICRA, 2025

work page 2025

[15] [15]

Less is more: Token context-aware learning for object tracking,

C. Xu, B. Zhong, Q. Liang, Y . Zheng, G. Li, and S. Song, “Less is more: Token context-aware learning for object tracking,” inAAAI, pp. 8824– 8832, 2025

work page 2025

[16] [16]

Context-aware dynamics model for generalization in model-based reinforcement learning,

K. Lee, Y . Seo, S. Lee, H. Lee, and J. Shin, “Context-aware dynamics model for generalization in model-based reinforcement learning,” in ICML, pp. 5757–5766, PMLR, 2020

work page 2020

[17] [17]

Goal-conditioned reinforcement learning: Problems and solutions

M. Liu, M. Zhu, and W. Zhang, “Goal-conditioned reinforcement learning: Problems and solutions,”arXiv preprint arXiv:2201.08299, 2022

work page arXiv 2022

[18] [18]

Towards distraction- robust active visual tracking,

F. Zhong, P. Sun, W. Luo, T. Yan, and Y . Wang, “Towards distraction- robust active visual tracking,” inICML, pp. 12782–12792, PMLR, 2021

work page 2021

[19] [19]

arXiv preprint arXiv:2305.06558 (2023)

Y . Cheng, L. Li, Y . Xu, X. Li, Z. Yang, W. Wang, and Y . Yang, “Segment and track anything,”arXiv preprint arXiv:2305.06558, 2023

work page arXiv 2023

[20] [20]

Deep residual learning for image recognition,

K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” inCVPR, pp. 770–778, 2016

work page 2016

[21] [21]

Gradient-based learning applied to document recognition,

Y . LeCun, L. Bottou, Y . Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998

work page 1998

[22] [22]

Unrealcv: Virtual worlds for computer vision,

W. Qiu, F. Zhong, Y . Zhang, S. Qiao, Z. Xiao, T. S. Kim, Y . Wang, and A. Yuille, “Unrealcv: Virtual worlds for computer vision,” inACM MM, pp. 1221–1224, 2017

work page 2017

[23] [23]

Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai,

F. Zhong, K. Wu, C. Wang, H. Chen, H. Ci, Z. Li, and Y . Wang, “Unrealzoo: Enriching photo-realistic virtual worlds for embodied ai,” inICCV, pp. 5769–5779, 2025

work page 2025

[24] [24]

SAM 2: Segment anything in images and videos,

N. Ravi, V . Gabeur,et al., “SAM 2: Segment anything in images and videos,” inThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[25] [25]

Grounding dino: Marrying dino with grounded pre- training for open-set object detection,

S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, Q. Jiang, C. Li, J. Yang, H. Su,et al., “Grounding dino: Marrying dino with grounded pre- training for open-set object detection,” inECCV, pp. 38–55, Springer, 2024

work page 2024