pith. machine review for the scientific record.

arxiv: 2412.02125 · v2 · submitted 2024-12-03 · 💻 cs.AI · cs.LG


Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies

Authors on Pith: no claims yet
keywords: goal · latent control · frozen policy · policies · preference · prompts

Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass the limitations of discrete text prompts, we formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy. We propose Preference Goal Tuning (PGT), a framework that optimizes this latent control variable to align the induced trajectory distribution with task preferences. Unlike standard fine-tuning that updates policy parameters, PGT keeps the policy frozen and updates only the latent goal using a trajectory-level preference objective. This approach essentially searches for the optimal conditioning input that maximizes the likelihood of preferred behaviors while suppressing undesirable ones. We evaluate PGT on the Minecraft SkillForge benchmark across 17 tasks. With minimal data, PGT achieves average relative improvements of 72.0% and 81.6% on two foundation policies, consistently outperforming expert-crafted prompts. Crucially, by decoupling task alignment (latent goal) from physical dynamics (frozen policy), PGT surpasses full fine-tuning by 13.4% in out-of-distribution settings, demonstrating superior robustness and generalization.
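The core idea in the abstract — keep the policy frozen and gradient-update only the latent goal under a trajectory-level preference objective — can be sketched in a few lines. This is a toy illustration, not the paper's implementation: the linear scoring function, the Bradley-Terry-style pairwise loss, and all shapes and names below are assumptions standing in for the real foundation policies and trajectory data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen policy stand-in: trajectory score s(z, traj) = sum_t phi_t . (W z),
# where W is the policy's (frozen) goal-conditioning map and z is the latent goal.
# (Toy linear model; the paper's policies are Minecraft foundation models.)
W = rng.normal(size=(8, 4))       # frozen parameters: never updated
z = rng.normal(size=4)            # latent goal embedding: the ONLY trainable variable

pref = rng.normal(size=(10, 8))   # per-step features of a preferred trajectory
disp = rng.normal(size=(10, 8))   # per-step features of a dispreferred trajectory

def margin(z):
    """Preference margin: score(preferred) - score(dispreferred) under goal z."""
    return (pref.sum(axis=0) - disp.sum(axis=0)) @ (W @ z)

def pgt_step(z, lr=0.1):
    """One gradient step on the pairwise preference loss
    L(z) = -log sigmoid(margin(z)), updating z only (W stays frozen)."""
    sig = 1.0 / (1.0 + np.exp(-margin(z)))
    grad = -(1.0 - sig) * W.T @ (pref.sum(axis=0) - disp.sum(axis=0))
    return z - lr * grad

m0 = margin(z)
for _ in range(50):
    z = pgt_step(z)
print(margin(z) > m0)  # True: the tuned goal makes preferred behavior more likely
```

Because only `z` moves, each step provably increases the margin (the update adds a positive multiple of `W.T @ v` along the scoring direction), which mirrors the paper's claim that task alignment can be carried entirely by the conditioning input while the policy's dynamics stay untouched.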

This paper has not been read by Pith yet.

discussion (0)
