BFMTrack: Latent Sequence Optimization for Physics-Based Motion Tracking with Behavioral Foundation Models

Agon Serifi; David Müller; Espen Knoop; Moritz Bächer; Ruben Grandia; Sammy Christen; Thomas Rupf

arxiv: 2606.25056 · v1 · pith:NTMZR2TWnew · submitted 2026-06-23 · 💻 cs.RO

Thomas Rupf, Agon Serifi, David Müller, Sammy Christen, Ruben Grandia, Espen Knoop, Moritz Bächer

Pith reviewed 2026-06-26 00:06 UTC · model grok-4.3

classification 💻 cs.RO

keywords Behavioral Foundation Models · Latent Sequence Optimization · Physics-based character control · Motion tracking · Humanoid robotics · Policy gradient methods · Latent space optimization

The pith

Latent Sequence Optimization lets Behavioral Foundation Models track arbitrary motion sequences in physics simulation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Behavioral Foundation Models organize physically plausible behaviors into a latent space from large motion datasets, yet their time-invariant design does not directly support tracking time-varying motion sequences. Existing heuristics such as moving-window averaging lose detail on dynamic motions. The paper proposes Latent Sequence Optimization, which runs simulation rollouts and applies policy gradient updates directly to a sequence of latent variables. Temporally correlated noise regularizes the sequence to keep trajectories smooth and coherent. The method is shown to handle dense tracking, sparse keyframing, and transfer to a physical humanoid without reward engineering.

Core claim

Our approach combines simulation rollouts with a policy gradient update to optimize over a sequence of latents, extending the capabilities of BFMs toward precise motion tracking without requiring reward engineering and tuning. To guide the optimization toward smooth, coherent latent trajectories, we model the latent sequence using temporally correlated noise.

What carries the argument

Latent Sequence Optimization (LSO), which treats a sequence of BFM latents as optimizable variables, updates them via policy gradients computed from physics simulation rollouts, and regularizes them with temporally correlated noise.
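
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the physics rollout is replaced by a hypothetical toy objective (`toy_rollout_return`), the temporally correlated noise is assumed to be a unit-variance AR(1) process with a made-up coefficient `rho`, and the update is a plain score-function (REINFORCE-style) estimate with normalized advantages. All function names and default values are illustrative.

```python
import numpy as np

def correlated_noise(rng, n_samples, horizon, dim, rho=0.9):
    """Unit-variance AR(1) noise along the time axis (an assumed noise model)."""
    white = rng.standard_normal((n_samples, horizon, dim))
    eps = np.empty_like(white)
    eps[:, 0] = white[:, 0]
    for t in range(1, horizon):
        eps[:, t] = rho * eps[:, t - 1] + np.sqrt(1.0 - rho**2) * white[:, t]
    return eps

def toy_rollout_return(latents, target):
    """Hypothetical stand-in for a physics rollout; higher return is better."""
    return -float(np.mean((latents - target) ** 2))

def optimize_latent_sequence(z_init, target, iters=300, n_samples=64,
                             sigma=0.1, lr=0.02, seed=0):
    """Optimize a (horizon, dim) latent sequence with correlated perturbations."""
    rng = np.random.default_rng(seed)
    z = np.array(z_init, dtype=float)
    horizon, dim = z.shape
    for _ in range(iters):
        eps = correlated_noise(rng, n_samples, horizon, dim)
        returns = np.array([toy_rollout_return(z + sigma * e, target) for e in eps])
        adv = (returns - returns.mean()) / (returns.std() + 1e-8)
        # Advantage-weighted perturbations give a score-function ascent direction.
        # Using eps directly (no inverse covariance) preconditions the gradient
        # by the noise covariance, so each update is itself smooth in time.
        grad = np.einsum("n,ntd->td", adv, eps) / (n_samples * sigma)
        z += lr * grad
    return z

if __name__ == "__main__":
    steps = np.linspace(0.0, 2.0 * np.pi, 20)
    target = np.sin(steps)[:, None] * np.ones((1, 3))
    z0 = np.zeros((20, 3))
    z = optimize_latent_sequence(z0, target)
    print("initial error:", np.mean((z0 - target) ** 2))
    print("final error:  ", np.mean((z - target) ** 2))
```

In a real setting the toy objective would be replaced by a simulated rollout of the latent-conditioned policy scored against the reference motion; that part is outside what this page describes.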

If this is right

BFMs can now solve dense motion tracking tasks that previously required hand-crafted rewards.
Sparse keyframe specifications become sufficient to generate full-body physically valid motions.
Tracked behaviors transfer directly to real humanoid hardware without additional tuning.
The same latent space supports both time-invariant goals and time-varying tracking objectives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If LSO works across different BFMs, it may reduce the need to train separate controllers for each new tracking task.
Extending the noise model to non-stationary correlations could handle motions with abrupt style changes.
The approach suggests that other sequence-based control problems in robotics could be reframed as latent optimization rather than reward design.

Load-bearing premise

The BFM latent space already contains enough structure that optimizing a sequence of latents under temporal correlation will produce motions matching a given target trajectory.

What would settle it

Running LSO on a highly dynamic target motion such as a backflip and observing that the resulting simulated trajectory deviates substantially in joint angles or root position from the target at multiple keyframes.
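
One way to make this check concrete is to measure joint-angle and root-position error at each keyframe and count how many keyframes exceed a tolerance. The sketch below assumes joint angles in radians and root positions in meters stored as arrays indexed by frame; the tolerances and the "at least two keyframes" rule are illustrative choices, not values from the paper.

```python
import numpy as np

def keyframe_deviation(sim_joints, ref_joints, sim_root, ref_root, keyframes):
    """Per-keyframe joint-angle RMSE (rad) and root-position error (m)."""
    k = np.asarray(keyframes)
    # Wrap angle differences into [-pi, pi) so full turns do not count as error.
    dq = (sim_joints[k] - ref_joints[k] + np.pi) % (2.0 * np.pi) - np.pi
    joint_err = np.sqrt(np.mean(dq ** 2, axis=-1))
    root_err = np.linalg.norm(sim_root[k] - ref_root[k], axis=-1)
    return joint_err, root_err

def tracking_fails(joint_err, root_err, joint_tol=0.3, root_tol=0.2, min_bad=2):
    """True if at least `min_bad` keyframes exceed either (hypothetical) tolerance."""
    bad = (joint_err > joint_tol) | (root_err > root_tol)
    return int(bad.sum()) >= min_bad
```

The thresholds would need to be chosen per character and motion; the point of the sketch is only that "deviates substantially at multiple keyframes" can be turned into a reproducible pass/fail criterion.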

Figures

Figures reproduced from arXiv: 2606.25056 by Agon Serifi, David Müller, Espen Knoop, Moritz Bächer, Ruben Grandia, Sammy Christen, Thomas Rupf.

**Figure 1.** Our method repurposes a trained Behavioral Foundation Model (BFM) for motion tracking by optimizing a temporal sequence of latents to approximate…

**Figure 2.** Overview. In Stage I, we train a latent-conditioned policy via the Forward-Backward framework. In Stage II, we map the goal keyframes into the latent space via the backward map 𝐵, further optimizing the latent sequence to improve the temporal coherence and tracking accuracy.
Adjacent paper text (Section 3, Overview): Problem Statement. We focus on tracking reference motion on physics-based characters using a control policy. More specific…

**Figure 3.** Lookahead Horizon on ER Baseline. Tracking performance peaks at a window size of 𝐿 = 5 and degrades for larger horizons, illustrating the coherence-precision trade-off discussed in Tab. 2.
Panel labels in the extracted text: Reference, LSO, ER (L=1), ER (L=8).

**Figure 4.** Qualitative Comparison. Comparison of LSO against ER (L=1) and ER (L=8) for tracking a kicking reference motion. A short horizon lacks the necessary context to prepare for the kick, exhibiting a lag, whereas the long horizon averages too far ahead, leading to a premature kicking motion.

**Figure 5.** Sparse Keyframe Tracking. Zero-shot initializations (lighter bars, behind) versus their optimized variants (darker bars, in front). Each column corresponds to a different keyframe density. LSO with correlated noise achieves the best trade-off between tracking accuracy and motion quality. Lower is better.

**Figure 6.** Qualitative Sparse Motion Tracking. LSO accurately tracks a sparse keyframe sequence of an underlying, highly dynamic and contact-rich motion like a roll. See supplemental video for the full dynamic sequence.
Adjacent paper text (Section 7.4, Robot Deployment): We demonstrate qualitatively that our method can be transferred to a real robotic platform. See the supplemental video for results. The robot can execute dynamic and contact-rich …

**Figure 7.** Keyframe Distribution. Distribution of keyframe spacings across the AMASS test set for varying minimum temporal spacing parameters Δ𝑡 using the DoG extraction heuristic. Box plots display the median and interquartile range (IQR), with whiskers extending to 1.5× the IQR.
Adjacent paper text (Section C, BFM Training Details): Our BFM training follows the FB-CPR algorithm of Tirinzoni et al. [2025] with the massively parallel training setup…
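
The lookahead-window baseline that Figures 3 and 4 compare against can be illustrated with a short sketch. The exact formulation of the ER baseline is not given on this page, so the code below only assumes the general idea stated in the abstract: each control step uses the average of per-frame latents over the next 𝐿 frames. The per-frame latents are taken as given (in the paper they would come from embedding goal frames); `moving_window_latents` is a hypothetical name.

```python
import numpy as np

def moving_window_latents(per_frame_z, horizon):
    """Average each frame's latent with the next `horizon - 1` frames.

    per_frame_z: array of shape (T, d), one latent per reference frame.
    The window is truncated at the end of the sequence.
    """
    per_frame_z = np.asarray(per_frame_z, dtype=float)
    T = len(per_frame_z)
    out = np.empty_like(per_frame_z)
    for t in range(T):
        out[t] = per_frame_z[t:min(t + horizon, T)].mean(axis=0)
    return out

if __name__ == "__main__":
    # A step change at frame 10 stands in for a sudden event such as a kick.
    z = np.zeros((20, 1))
    z[10:] = 1.0
    print(moving_window_latents(z, 1)[8:12, 0])  # no lookahead: change appears at frame 10
    print(moving_window_latents(z, 8)[3:8, 0])   # long lookahead: change leaks into earlier frames
```

With a long window the averaged latent starts reacting several frames before the event, which is consistent with the premature kick described for ER (L=8); with a window of one frame there is no anticipation at all.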

Original abstract

Behavioral Foundation Models (BFMs) offer a promising path toward universal physics-based character control by organizing a rich repertoire of physically plausible behaviors into a latent space, guided by a large-scale motion dataset. While these models excel at time-invariant tasks, such as goal-reaching and state-based reward optimization, their latent space does not directly support time-varying objectives, such as tracking a motion sequence. For tracking, existing heuristics rely on moving-window-averaging that fails to capture the nuances of highly dynamic motions. In this work, we propose a novel Latent Sequence Optimization (LSO) to address these shortcomings. Our approach combines simulation rollouts with a policy gradient update to optimize over a sequence of latents, extending the capabilities of BFMs toward precise motion tracking without requiring reward engineering and tuning. To guide the optimization toward smooth, coherent latent trajectories, we model the latent sequence using temporally correlated noise. We validate our approach across dense tracking, sparse keyframing, and direct deployment onto a real humanoid robot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LSO gives BFMs a workable way to handle time-varying tracking by optimizing full latent sequences with simulation rollouts and policy gradients plus correlated noise.

The main takeaway is that this paper shows how to extend Behavioral Foundation Models beyond their usual time-invariant tasks. The authors introduce Latent Sequence Optimization, which runs simulation rollouts, applies policy gradient updates across a sequence of latents, and regularizes with temporally correlated noise to produce smooth trajectories. This directly targets the limitation of moving-window averaging for dynamic motion tracking.

What the work does well is keep the approach simple and avoid reward engineering. The validation covers dense tracking, sparse keyframing, and real-robot deployment on a humanoid, which gives the claims some practical grounding. The stress-test note confirms no internal inconsistencies or unsupported leaps in the optimization setup, so the central idea holds together on its own terms.

Soft spots are limited. The method still depends on repeated simulation rollouts, which could scale poorly for very long sequences or high-dimensional motions, though the paper does not flag this as a blocker. Without detailed ablations or error metrics in the provided summary, it is hard to judge how much the noise model contributes versus the base optimization. That said, neither gap looks load-bearing, and the argument does not look circular.

This paper is aimed at researchers in physics-based character animation and humanoid control who already work with foundation models. A reader interested in practical extensions of latent-space control will get value from the concrete procedure and robot results. It deserves a serious referee because the extension is falsifiable and builds on existing components without obvious flaws.

Referee Report

2 major / 2 minor

Summary. The paper introduces BFMTrack, which proposes Latent Sequence Optimization (LSO) to extend Behavioral Foundation Models (BFMs) from time-invariant tasks to time-varying motion tracking. LSO optimizes sequences of latent variables by combining simulation rollouts with policy-gradient updates, regularized via temporally correlated noise to promote smooth trajectories, without reward engineering or tuning. The approach is validated on dense tracking, sparse keyframing, and direct deployment to a real humanoid robot.

Significance. If the empirical results hold, the work would be moderately significant for physics-based character control: it provides a falsifiable, simulation-driven procedure to repurpose pre-trained BFM latent spaces for tracking without per-task reward design. The integration of standard policy gradients with temporally correlated latent regularization is a direct extension rather than a fundamental theoretical advance, but successful real-robot transfer would strengthen the practical utility of BFMs.

major comments (2)

[Experiments / Validation] The central claim that LSO produces coherent trajectories matching target motions rests on the assumption that the pre-trained BFM latent space, when optimized over sequences with temporally correlated noise, yields smooth and accurate tracking. The manuscript should provide a quantitative ablation (e.g., in the experiments section) comparing LSO against moving-window averaging on highly dynamic motions to demonstrate that the regularization actually resolves the stated failure mode.
[Method / LSO] The policy-gradient update on latent sequences is described as parameter-free in spirit, yet the temporally correlated noise model introduces hyperparameters (correlation length, variance schedule). The paper should clarify in the LSO formulation whether these are fixed across all experiments or tuned per task, as this affects the claim of 'without requiring reward engineering and tuning.'

minor comments (2)

[Method] Notation for the latent sequence and the noise process should be introduced with explicit equations early in the method section to improve readability.
[Real-robot experiments] The real-robot deployment section would benefit from reporting quantitative tracking error metrics (e.g., joint-angle RMSE or end-effector deviation) alongside qualitative success, to allow direct comparison with simulation results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. We address each major comment below and will update the manuscript accordingly.

Referee: [Experiments / Validation] The central claim that LSO produces coherent trajectories matching target motions rests on the assumption that the pre-trained BFM latent space, when optimized over sequences with temporally correlated noise, yields smooth and accurate tracking. The manuscript should provide a quantitative ablation (e.g., in the experiments section) comparing LSO against moving-window averaging on highly dynamic motions to demonstrate that the regularization actually resolves the stated failure mode.

Authors: We agree that a direct quantitative ablation on highly dynamic motions would strengthen the validation of the temporally correlated noise regularization. In the revised manuscript we will add this comparison in the experiments section, reporting tracking error and smoothness metrics against moving-window averaging on selected highly dynamic sequences from the test set.

Revision: yes

Referee: [Method / LSO] The policy-gradient update on latent sequences is described as parameter-free in spirit, yet the temporally correlated noise model introduces hyperparameters (correlation length, variance schedule). The paper should clarify in the LSO formulation whether these are fixed across all experiments or tuned per task, as this affects the claim of 'without requiring reward engineering and tuning.'

Authors: The correlation length and variance schedule of the temporally correlated noise are fixed to the same values for all experiments and tasks; they were chosen once from statistics of the pre-training motion dataset and held constant thereafter. We will add an explicit statement to this effect in the LSO formulation section of the revised manuscript to reinforce the minimal-tuning claim.

Revision: yes
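
To make "correlation length" and "variance" concrete, a standard textbook example may help. This page does not state which noise model the paper uses, so the following is only an assumed illustration: a stationary first-order autoregressive process with coefficient ρ and marginal variance σ², where the symbols ρ, σ, τ, and ξ are introduced here and do not come from the paper.

```latex
\epsilon_0 \sim \mathcal{N}(0, \sigma^2 I), \qquad
\epsilon_t = \rho\,\epsilon_{t-1} + \sigma\sqrt{1-\rho^2}\,\xi_t, \qquad
\xi_t \sim \mathcal{N}(0, I), \quad 0 < \rho < 1

\operatorname{Cov}(\epsilon_t, \epsilon_{t+k})
  = \sigma^2 \rho^{|k|} I
  = \sigma^2 e^{-|k|/\tau} I,
\qquad \tau = -\frac{1}{\ln \rho}
```

Under this assumed model, ρ = 0.9 gives τ = −1/ln 0.9 ≈ 9.5 steps, so perturbations roughly ten frames apart are still noticeably correlated, while ρ → 0 recovers independent per-frame noise. Fixing ρ and σ once, as the authors' response describes, would therefore fix both the smoothness and the scale of the exploration noise across tasks.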

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces LSO as a direct combination of simulation rollouts and policy-gradient optimization over BFM latent sequences, regularized by temporally correlated noise. No load-bearing step reduces a claimed result to a fitted input, self-defined quantity, or self-citation chain by construction. The approach is presented as an extension of pre-existing BFM capabilities, with validation on external tasks (dense tracking, keyframing, real-robot deployment) that remain falsifiable outside any internal fit. This is the most common honest outcome for a methods paper whose central procedure does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; the central claim rests on the unverified effectiveness of sequence-level latent optimization inside an existing BFM. No free parameters, axioms, or invented entities are explicitly quantified in the provided text.

axioms (1)

Domain assumption: Behavioral Foundation Models organize a rich repertoire of physically plausible behaviors into a latent space guided by a large-scale motion dataset.
Source: core premise stated in the first sentence of the abstract.

Reference graph

Works this paper leans on

5 extracted references · 4 canonical work pages · 1 internal anchor

[1]

Marco Bagatella, Matteo Pirotta, Ahmed Touati, Alessandro Lazaric, and Andrea Tirinzoni

Skel-betweener: a neural motion rig for interactive motion authoring. ACM Transactions on Graphics (TOG) 43, 6 (2024), 1–11. Marco Bagatella, Matteo Pirotta, Ahmed Touati, Alessandro Lazaric, and Andrea Tirinzoni. 2026a. TD-JEPA: Latent-predictive Representations for Zero-Shot Reinforcement Learning. In The Fourteenth International Conference on Learning...

work page doi:10.15607/rss.2024.xx.103 2024
[2]

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control

SONIC: Supersizing Motion Tracking for Natural Humanoid Whole-Body Control. arXiv preprint arXiv:2511.07820 (2025). Naureen Mahmood, Nima Ghorbani, Nikolaus F. Troje, Gerard Pons-Moll, and Michael J. Black. 2019. AMASS: Archive of Motion Capture as Surface Shapes. In International Conference on Computer Vision. 5442–5451. Viktor Makoviychuk, Lukasz Wawrzynia...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1145/3197517.3201311 2025
[3]

Rubner, C

Fast Imitation via Behavior Foundation Models. In The Twelfth International Conference on Learning Representations. Yossi Rubner, Carlo Tomasi, and Leonidas Guibas. 2000. The Earth Mover’s Distance as a Metric for Image Retrieval. International Journal of Computer Vision 40 (11 2000), 99–121. doi:10.1023/A:1026543900054 Thomas Rupf, Marco Bagatella, Marin Vl...

work page doi:10.1023/a:1026543900054 2000
[4]

In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings

CALM: Conditional Adversarial Latent Models for Directable Virtual Characters. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Proceedings. 1–9. arXiv:2305.02195 doi:10.1145/3588432.3591541 Jens Timmer and Michel Koenig. 1995. On generating power law noise. Astronomy and Astrophysics, v. 300, p. 707300 (19...

work page doi:10.1145/3588432.3591541 1995
[5]

with the massively parallel training setup of Li et al. [2026]. Table 3 reports parameters that differ from these works; all other settings are identical. Table 3. BFM Training Parameters. BFM training parameters that differ from Tirinzoni et al. [2025] and Li et al. [2026]. Parameter SMPL Lima Episode length 𝑇 300 500 Seeding steps (random actions) 8 500 Fa...

2026
