BFMTrack: Latent Sequence Optimization for Physics-Based Motion Tracking with Behavioral Foundation Models
Pith reviewed 2026-06-26 00:06 UTC · model grok-4.3
The pith
Latent Sequence Optimization lets Behavioral Foundation Models track arbitrary motion sequences in physics simulation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our approach combines simulation rollouts with a policy gradient update to optimize over a sequence of latents, extending the capabilities of BFMs toward precise motion tracking without requiring reward engineering and tuning. To guide the optimization toward smooth, coherent latent trajectories, we model the latent sequence using temporally correlated noise.
What carries the argument
Latent Sequence Optimization (LSO), which treats a sequence of BFM latents as optimizable variables, updates them via policy gradients computed from physics simulation rollouts, and regularizes them with temporally correlated noise.
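The mechanism described above can be illustrated with a toy sketch. This is not the authors' implementation: `toy_rollout_return` is a hypothetical stand-in for a physics rollout through a BFM policy, the AR(1) process is only one possible choice of temporally correlated noise, and every hyperparameter value (`sigma`, `lr`, `rho`, sample count) is an illustrative assumption. The update is a score-function-style (evolution-strategies) estimate; because the perturbations are correlated in time, it amounts to a gradient step preconditioned by the noise covariance, which favors smooth changes to the latent sequence.

```python
import numpy as np


def ar1_noise(rng, n_samples, horizon, dim, rho):
    """Sample AR(1) noise, correlated along time, with unit marginal variance."""
    eps = rng.standard_normal((n_samples, horizon, dim))
    noise = np.empty_like(eps)
    noise[:, 0] = eps[:, 0]
    for t in range(1, horizon):
        noise[:, t] = rho * noise[:, t - 1] + np.sqrt(1.0 - rho**2) * eps[:, t]
    return noise


def toy_rollout_return(latents, target):
    """Hypothetical stand-in for a simulation rollout: negative mean squared error.

    A real system would decode each latent through the BFM policy, step the
    physics simulator, and score the resulting motion against the target.
    """
    return -float(np.mean((latents - target) ** 2))


def optimize_latent_sequence(target, iters=200, n_samples=32,
                             sigma=0.3, lr=0.5, rho=0.9, seed=0):
    """Optimize the mean of a latent sequence with correlated perturbations."""
    rng = np.random.default_rng(seed)
    horizon, dim = target.shape
    mean = np.zeros((horizon, dim))  # one latent per time step
    for _ in range(iters):
        noise = ar1_noise(rng, n_samples, horizon, dim, rho)
        candidates = mean + sigma * noise
        returns = np.array([toy_rollout_return(c, target) for c in candidates])
        advantages = returns - returns.mean()  # baseline reduces variance
        # Return-weighted average of the perturbations (ascent direction).
        grad = np.einsum("n,ntd->td", advantages, noise) / (n_samples * sigma)
        mean = mean + lr * grad
    return mean
```

Calling `optimize_latent_sequence` on a smooth target array of shape `(horizon, dim)` should reduce the toy tracking error relative to the all-zero initial sequence; the sketch keeps `sigma` constant and therefore does not model any variance schedule.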
If this is right
- BFMs can now solve dense motion tracking tasks that previously required hand-crafted rewards.
- Sparse keyframe specifications become sufficient to generate full-body physically valid motions.
- Tracked behaviors transfer directly to real humanoid hardware without additional tuning.
- The same latent space supports both time-invariant goals and time-varying tracking objectives.
Where Pith is reading between the lines
- If LSO works across different BFMs, it may reduce the need to train separate controllers for each new tracking task.
- Extending the noise model to non-stationary correlations could handle motions with abrupt style changes.
- The approach suggests that other sequence-based control problems in robotics could be reframed as latent optimization rather than reward design.
Load-bearing premise
The BFM latent space already contains enough structure that optimizing a sequence of latents under temporal correlation will produce motions matching a given target trajectory.
What would settle it
Running LSO on a highly dynamic target motion such as a backflip and observing that the resulting simulated trajectory deviates substantially in joint angles or root position from the target at multiple keyframes.
Original abstract
Behavioral Foundation Models (BFMs) offer a promising path toward universal physics-based character control by organizing a rich repertoire of physically plausible behaviors into a latent space, guided by a large-scale motion dataset. While these models excel at time-invariant tasks, such as goal-reaching and state-based reward optimization, their latent space does not directly support time-varying objectives, such as tracking a motion sequence. For tracking, existing heuristics rely on moving-window-averaging that fails to capture the nuances of highly dynamic motions. In this work, we propose a novel Latent Sequence Optimization (LSO) to address these shortcomings. Our approach combines simulation rollouts with a policy gradient update to optimize over a sequence of latents, extending the capabilities of BFMs toward precise motion tracking without requiring reward engineering and tuning. To guide the optimization toward smooth, coherent latent trajectories, we model the latent sequence using temporally correlated noise. We validate our approach across dense tracking, sparse keyframing, and direct deployment onto a real humanoid robot.
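To make the stated failure mode of moving-window averaging concrete, here is a minimal sketch under one assumed reading of the heuristic: each time step receives the average of the per-frame latents inside a forward-looking window. The function name, the window convention, and the one-dimensional example are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np


def moving_window_average(per_frame_latents, window):
    """Average each latent with the following `window - 1` latents.

    The window is truncated at the end of the sequence, so the last
    entries are averaged over fewer frames.
    """
    latents = np.asarray(per_frame_latents, dtype=float)
    out = np.empty_like(latents)
    for t in range(latents.shape[0]):
        out[t] = latents[t:t + window].mean(axis=0)
    return out


# A one-frame spike stands in for a brief, highly dynamic event.
spike = np.zeros((10, 1))
spike[5, 0] = 1.0
smoothed = moving_window_average(spike, window=4)
```

With a window of four frames the spike's peak falls from 1.0 to 0.25 and is smeared over the preceding frames, which is the kind of detail loss the abstract attributes to averaging on fast motions.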
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BFMTrack, whose core contribution is Latent Sequence Optimization (LSO), a method that extends Behavioral Foundation Models (BFMs) from time-invariant tasks to time-varying motion tracking. LSO optimizes sequences of latent variables by combining simulation rollouts with policy-gradient updates, and models the latent sequence with temporally correlated noise to promote smooth trajectories, without reward engineering or tuning. The approach is validated on dense tracking, sparse keyframing, and direct deployment to a real humanoid robot.
Significance. If the empirical results hold, the work would be moderately significant for physics-based character control: it provides a falsifiable, simulation-driven procedure to repurpose pre-trained BFM latent spaces for tracking without per-task reward design. The integration of standard policy gradients with temporally correlated latent regularization is a direct extension rather than a fundamental theoretical advance, but successful real-robot transfer would strengthen the practical utility of BFMs.
major comments (2)
- [Experiments / Validation] The central claim that LSO produces coherent trajectories matching target motions rests on the assumption that the pre-trained BFM latent space, when optimized over sequences with temporally correlated noise, yields smooth and accurate tracking. The manuscript should provide a quantitative ablation (e.g., in the experiments section) comparing LSO against moving-window averaging on highly dynamic motions to demonstrate that the regularization actually resolves the stated failure mode.
- [Method / LSO] The policy-gradient update on latent sequences is described as parameter-free in spirit, yet the temporally correlated noise model introduces hyperparameters (correlation length, variance schedule). The paper should clarify in the LSO formulation whether these are fixed across all experiments or tuned per task, as this affects the claim of 'without requiring reward engineering and tuning.'
minor comments (2)
- [Method] Notation for the latent sequence and the noise process should be introduced with explicit equations early in the method section to improve readability.
- [Real-robot experiments] The real-robot deployment section would benefit from reporting quantitative tracking error metrics (e.g., joint-angle RMSE or end-effector deviation) alongside qualitative success, to allow direct comparison with simulation results.
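The joint-angle RMSE suggested in the last minor comment could be computed along the following lines. This is a sketch, assuming angles are given in radians as arrays of shape `(frames, joints)`; the function name and the choice to wrap differences into [-pi, pi) are illustrative and do not describe how the paper evaluates tracking.

```python
import numpy as np


def joint_angle_rmse(tracked, target):
    """Root-mean-square joint-angle error in radians.

    Differences are wrapped into [-pi, pi) so that angles on either side
    of the +/-pi boundary are not counted as a near-full-turn error.
    """
    diff = np.asarray(tracked, dtype=float) - np.asarray(target, dtype=float)
    diff = (diff + np.pi) % (2.0 * np.pi) - np.pi
    return float(np.sqrt(np.mean(diff ** 2)))
```

Reporting the same function's output for simulated and real-robot trajectories would give the direct comparison the comment asks for.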
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. We address each major comment below and will update the manuscript accordingly.
- Referee: [Experiments / Validation] The central claim that LSO produces coherent trajectories matching target motions rests on the assumption that the pre-trained BFM latent space, when optimized over sequences with temporally correlated noise, yields smooth and accurate tracking. The manuscript should provide a quantitative ablation (e.g., in the experiments section) comparing LSO against moving-window averaging on highly dynamic motions to demonstrate that the regularization actually resolves the stated failure mode.
  Authors: We agree that a direct quantitative ablation on highly dynamic motions would strengthen the validation of the temporally correlated noise regularization. In the revised manuscript we will add this comparison in the experiments section, reporting tracking error and smoothness metrics against moving-window averaging on selected highly dynamic sequences from the test set.
  Revision: yes
- Referee: [Method / LSO] The policy-gradient update on latent sequences is described as parameter-free in spirit, yet the temporally correlated noise model introduces hyperparameters (correlation length, variance schedule). The paper should clarify in the LSO formulation whether these are fixed across all experiments or tuned per task, as this affects the claim of 'without requiring reward engineering and tuning.'
  Authors: The correlation length and variance schedule of the temporally correlated noise are fixed to the same values for all experiments and tasks; they were chosen once from statistics of the pre-training motion dataset and held constant thereafter. We will add an explicit statement to this effect in the LSO formulation section of the revised manuscript to reinforce the minimal-tuning claim.
  Revision: yes
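The rebuttal does not say which dataset statistics were used to fix the correlation length. One plausible procedure, shown purely as an assumption, is to estimate the lag-1 autocorrelation of latent (or motion-feature) sequences and convert it to a correlation length under an AR(1) model, where the autocorrelation at lag k is rho**k = exp(-k / tau). Both function names are hypothetical.

```python
import numpy as np


def lag1_autocorrelation(sequence):
    """Pooled lag-1 autocorrelation of a (time, dim) sequence."""
    x = np.asarray(sequence, dtype=float)
    x = x - x.mean(axis=0)  # remove the per-dimension mean
    numerator = np.sum(x[1:] * x[:-1])
    denominator = np.sum(x * x)
    return float(numerator / denominator)


def correlation_length(rho):
    """Correlation length tau (in time steps) such that rho == exp(-1 / tau)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return float(-1.0 / np.log(rho))
```

Computing these values once on the pre-training data and reusing them for every task would match the fixed-hyperparameter claim, although the actual selection procedure may differ.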
Circularity Check
No significant circularity in derivation chain
The paper introduces LSO as a direct combination of simulation rollouts and policy-gradient optimization over BFM latent sequences, regularized by temporally correlated noise. No load-bearing step reduces a claimed result to a fitted input, self-defined quantity, or self-citation chain by construction. The approach is presented as an extension of pre-existing BFM capabilities, with validation on external tasks (dense tracking, keyframing, real-robot deployment) that remain falsifiable outside any internal fit. This is the most common honest outcome for a methods paper whose central procedure does not collapse to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Behavioral Foundation Models organize a rich repertoire of physically plausible behaviors into a latent space, guided by a large-scale motion dataset.