AdaTracker: Learning Adaptive In-Context Policy for Cross-Embodiment Active Visual Tracking
Pith reviewed 2026-05-10 00:31 UTC · model grok-4.3
The pith
A single learned policy can track visual targets across unseen robot bodies by encoding embodiment constraints from interaction history.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaTracker trains an Embodiment Context Encoder that extracts a compact representation of embodiment-specific constraints directly from a short trajectory of past states and actions. This representation is fed as additional input to a Context-Aware Policy, allowing the policy to generate embodiment-appropriate tracking actions in a zero-shot manner. Two auxiliary objectives—one for context reconstruction and one for temporal consistency—are optimized jointly to make the inferred context reliable.
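As a rough illustration of the data flow described above, and not the paper's architecture, the sketch below pools a short, variable-length history of (state, action) pairs into a fixed-size context vector and appends it to the policy input. Everything named here is an assumption made for illustration: `make_projection`, `encode_context`, `policy_input`, the random linear projection standing in for learned weights, and mean pooling over time. The source does not specify the encoder's layers or the history length.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Step = Tuple[Sequence[float], Sequence[float]]  # (state, action); hypothetical layout


def make_projection(in_dim: int, out_dim: int, seed: int = 0) -> List[Vector]:
    """Random stand-in for learned encoder weights (assumption, not from the paper)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]


def encode_context(history: Sequence[Step], weights: List[Vector]) -> Vector:
    """Project each (state, action) step linearly, then average over time."""
    context = [0.0] * len(weights)
    for state, action in history:
        step = list(state) + list(action)
        for i, row in enumerate(weights):
            context[i] += sum(w * x for w, x in zip(row, step))
    n = max(len(history), 1)  # an empty history yields a zero context
    return [c / n for c in context]


def policy_input(observation: Sequence[float], context: Vector) -> Vector:
    """Concatenate the current observation with the inferred context."""
    return list(observation) + context


# Two hypothetical embodiments that differ in how far one action moves the state.
weights = make_projection(in_dim=3, out_dim=4)
slow_robot = [((0.1 * t, 0.0), (1.0,)) for t in range(5)]
fast_robot = [((0.5 * t, 0.0), (1.0,)) for t in range(8)]

z_slow = encode_context(slow_robot, weights)
z_fast = encode_context(fast_robot, weights)
x = policy_input([0.2, -0.3], z_slow)
```

Averaging over time is used here only because it accepts histories of different lengths without padding; a learned sequence model would serve the same role.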
What carries the argument
Embodiment Context Encoder that infers morphology constraints from interaction history and dynamically modulates the Context-Aware Policy.
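The summary does not say how the context "modulates" the policy. One common way to condition a network on a side input is feature-wise linear modulation (FiLM), in which the context produces a per-feature scale and shift. The sketch below shows only that generic technique, with assumed shapes and hand-picked weights. `film`, `w_gamma`, and `w_beta` are hypothetical names, and nothing here is confirmed to match AdaTracker's mechanism.

```python
from typing import List, Sequence


def dot(row: Sequence[float], vec: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(r * v for r, v in zip(row, vec))


def film(
    features: Sequence[float],
    context: Sequence[float],
    w_gamma: Sequence[Sequence[float]],
    w_beta: Sequence[Sequence[float]],
) -> List[float]:
    """Scale and shift each policy feature using the context vector."""
    gamma = [1.0 + dot(row, context) for row in w_gamma]  # scale; identity at zero context
    beta = [dot(row, context) for row in w_beta]          # shift; zero at zero context
    return [g * f + b for g, f, b in zip(gamma, features, beta)]


features = [1.0, -2.0]                # hypothetical intermediate policy features
w_gamma = [[0.5, 0.0], [0.0, 0.5]]    # hand-picked stand-ins for learned weights
w_beta = [[0.0, 1.0], [1.0, 0.0]]

neutral = film(features, [0.0, 0.0], w_gamma, w_beta)    # features pass through unchanged
modulated = film(features, [1.0, 2.0], w_gamma, w_beta)  # context rescales and shifts them
```

With a zero context the features pass through unchanged, so in this formulation the context acts as a correction on top of a shared policy.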
If this is right
- One policy network suffices for active tracking on robots with widely different kinematics and dynamics.
- New robot platforms can be used immediately without collecting embodiment-specific training data.
- Sample efficiency improves because the model reuses experience across morphologies rather than starting from scratch for each.
- Zero-shot adaptation becomes possible in both simulation and real-world hardware.
- Auxiliary context and consistency losses stabilize performance when embodiment identification is noisy.
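The two auxiliary objectives mentioned above could be combined with the main tracking loss roughly as follows. This is a sketch under stated assumptions. The reconstruction target is taken to be a vector of ground-truth embodiment parameters (plausibly available in simulation). The consistency term penalizes changes between consecutive context vectors within one episode. The weights `w_recon` and `w_cons` are arbitrary. The paper's exact loss forms are not given in this summary.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def temporal_consistency(contexts: Sequence[Sequence[float]]) -> float:
    """Average squared change between consecutive context vectors."""
    if len(contexts) < 2:
        return 0.0
    pairs = list(zip(contexts[:-1], contexts[1:]))
    return sum(mse(p, q) for p, q in pairs) / len(pairs)


def total_loss(task: float, recon: float, cons: float,
               w_recon: float = 1.0, w_cons: float = 0.1) -> float:
    """Weighted sum of the tracking loss and the two auxiliary terms."""
    return task + w_recon * recon + w_cons * cons


# Illustrative values only; none of these numbers come from the paper.
contexts = [[0.0, 1.0], [0.0, 1.0], [0.2, 1.0]]   # context at three time steps
predicted_params = [0.5, 2.0]                     # decoded from the context
true_params = [0.5, 1.0]                          # hypothetical embodiment parameters

recon = mse(predicted_params, true_params)
cons = temporal_consistency(contexts)
loss = total_loss(task=1.0, recon=recon, cons=cons)
```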
Where Pith is reading between the lines
- The same history-based context mechanism could be tested on other control tasks such as navigation or manipulation where body limits affect action feasibility.
- If history proves insufficient for some morphologies, an explicit embodiment description or brief calibration phase might be needed as a fallback.
- Scaling the approach to very high-dimensional or deformable bodies would require checking whether the context encoder still captures the relevant constraints in a low-dimensional latent space.
Load-bearing premise
A short history of past interactions always contains enough information to correctly identify the physical constraints of a previously unseen robot body.
What would settle it
Deploy the trained model on a new robot whose motion limits cannot be inferred from any short interaction history (for example, a robot with an unobservable joint range that only appears after many steps) and measure whether zero-shot tracking success drops to the level of a non-adaptive baseline.
Original abstract
Realizing active visual tracking with a single unified model across diverse robots is challenging, as the physical constraints and motion dynamics vary drastically from one platform to another. Existing approaches typically train separate models for each embodiment, leading to poor scalability and limited generalization. To address this, we propose AdaTracker, an adaptive in-context policy learning framework that robustly tracks targets on diverse robot morphologies. Our key insight is to explicitly model embodiment-specific constraints through an Embodiment Context Encoder, which infers embodiment-specific constraints from history. This contextual representation dynamically modulates a Context-Aware Policy, enabling it to infer optimal control actions for unseen embodiments in a zero-shot manner. To enhance robustness, we introduce two auxiliary objectives to ensure accurate context identification and temporal consistency. Experiments in both simulation and the real world demonstrate that AdaTracker significantly outperforms state-of-the-art methods in cross-embodiment generalization, sample efficiency, and zero-shot adaptation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes AdaTracker, an adaptive in-context policy learning framework for active visual tracking that operates across diverse robot embodiments. It introduces an Embodiment Context Encoder to infer embodiment-specific constraints from interaction history, which dynamically modulates a Context-Aware Policy to enable zero-shot adaptation to unseen morphologies. Two auxiliary objectives are added to promote accurate context identification and temporal consistency. The approach is evaluated in simulation across multiple robot platforms with held-out morphologies for zero-shot testing, plus real-world transfer experiments, claiming superior cross-embodiment generalization, sample efficiency, and adaptation relative to state-of-the-art methods.
Significance. If the experimental results hold, the work would be significant for robotics by offering a scalable alternative to per-embodiment model training for visual tracking tasks such as surveillance or navigation. The explicit use of in-context learning to model morphology constraints from history represents a targeted approach to generalization. Credit is given for the simulation experiments that directly test zero-shot transfer on held-out embodiments, the real-world validation, and the ablations on auxiliary objectives that address the assumption that interaction history suffices to identify unseen morphology constraints.
minor comments (3)
- [Abstract] The abstract asserts outperformance and zero-shot capability but does not include any quantitative metrics, specific baselines, or numerical results; while the full experiments section supplies these, including one or two key numbers in the abstract would improve immediate readability.
- [§3.1] The input representation to the Embodiment Context Encoder (e.g., exact history length, state features included, and handling of variable-length sequences across embodiments) is described at a high level; a concrete example or pseudocode would clarify how the encoder processes raw interaction trajectories.
- [Experiments] Table 2 (or equivalent results table): Reporting standard deviations or confidence intervals across random seeds for the success rate and tracking error metrics would strengthen the statistical reliability of the cross-embodiment and ablation comparisons.
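For the last comment, per-seed results can be summarized with a few lines of standard-library Python. The sketch below reports the mean, the sample standard deviation, and a normal-approximation 95% confidence interval. The `summarize` helper is hypothetical and the success rates are made-up placeholders, not figures from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(values: Sequence[float], z: float = 1.96) -> Tuple[float, float, Tuple[float, float]]:
    """Return mean, sample std (n - 1), and a normal-approximation CI."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    half_width = z * std / math.sqrt(len(values))
    return mean, std, (mean - half_width, mean + half_width)


# Placeholder success rates for five random seeds (illustrative only).
success_rates = [0.80, 0.84, 0.78, 0.82, 0.86]
mean, std, (ci_low, ci_high) = summarize(success_rates)
print(f"success rate: {mean:.3f} +/- {std:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

With only a handful of seeds, the normal approximation understates the uncertainty; a Student-t critical value (about 2.78 for five seeds) gives a wider and more honest interval.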
Simulated Author's Rebuttal
We thank the referee for the positive assessment of AdaTracker and the recommendation for minor revision. The provided summary accurately reflects the manuscript's contributions to zero-shot cross-embodiment active visual tracking via in-context learning.
Circularity Check
No significant circularity detected in derivation chain
The paper's core derivation introduces an Embodiment Context Encoder that processes interaction history to produce a contextual representation, which then modulates a Context-Aware Policy for zero-shot control on unseen morphologies. Two auxiliary objectives are added explicitly for context accuracy and temporal consistency. These elements are trained end-to-end and evaluated via held-out embodiment experiments in simulation plus real-world transfer, providing external validation rather than any definitional reduction. No equations equate a claimed prediction to a fitted input by construction, no load-bearing self-citations justify uniqueness, and no ansatz or renaming of known results is presented as a first-principles derivation. The architecture follows standard in-context adaptation patterns without internal circular equivalence to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- neural network weights in policy and encoder
axioms (1)
- domain assumption: Interaction history contains sufficient information to identify embodiment-specific constraints
invented entities (1)
- Embodiment Context Encoder (no independent evidence)