MAPL: Multi-Objective Preference Learning for Robot Locomotion

Joseph Campbell; Muhan Lin; Shuyang Shi; Xiyue Chen

arxiv: 2606.25398 · v1 · pith:64IO6GUKnew · submitted 2026-06-24 · 💻 cs.RO


Pith reviewed 2026-06-25 21:09 UTC · model grok-4.3

classification 💻 cs.RO

keywords preference learning · robot locomotion · LLM rewards · reinforcement learning · quadruped robots · multi-objective learning · reward design


The pith

MAPL learns quadruped locomotion policies from LLM preferences over multiple generic objectives, and the resulting policies match or exceed those trained with expert-designed rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the bottleneck of manual reward engineering in reinforcement learning for robot locomotion. It proposes MAPL, which prompts a large language model to compare trajectories independently along several terrain-invariant language criteria. These comparisons train a multi-head scoring model whose outputs are aggregated into a single reward signal used to optimize policies. Experiments across four quadruped environments demonstrate that the resulting policies perform comparably to or better than those trained with hand-tuned expert rewards.

Core claim

MAPL prompts a large language model to compare trajectories independently along semantically meaningful criteria, using generic language descriptions that are terrain-invariant and require little domain expertise. These objective-wise preferences are used to train a multi-head preference scoring model, whose outputs are aggregated to form a scalar reward for policy optimization. Across four quadruped locomotion environments, MAPL trains policies using only LLM-generated preferences and achieves performance comparable to or better than expert-designed rewards, while eliminating task-specific reward engineering.

What carries the argument

Multi-head preference scoring model trained on objective-wise LLM comparisons that aggregates outputs into a scalar reward signal.
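To make the mechanism concrete, here is a minimal sketch of one way per-objective preferences could train separate score heads and then be combined into a scalar reward. It assumes a Bradley-Terry likelihood for each head and a weighted sum for the aggregation; the function names, the objective names, and the weights are illustrative assumptions, not the authors' implementation.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def prob_second_preferred(score_first: float, score_second: float) -> float:
    """Bradley-Terry probability that the second trajectory is preferred."""
    return sigmoid(score_second - score_first)


def preference_loss(score_first: float, score_second: float, label: float) -> float:
    """Cross-entropy for one objective.

    label = 0 means the first trajectory is preferred, 1 means the second.
    """
    p = prob_second_preferred(score_first, score_second)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def multi_head_loss(
    scores_first: Dict[str, float],
    scores_second: Dict[str, float],
    labels: Dict[str, float],
) -> float:
    """Sum of independent per-objective losses (one term per head)."""
    return sum(
        preference_loss(scores_first[k], scores_second[k], labels[k]) for k in labels
    )


def aggregate_reward(head_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Assumed aggregation: weighted sum of head scores (weights are hypothetical)."""
    return sum(weights[k] * head_scores[k] for k in weights)
```

Each head only ever sees labels for its own objective, so a disagreement between objectives on the same trajectory pair is kept as a signal rather than collapsed into a single judgment.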

If this is right

Locomotion policies can be trained without any task-specific reward equations or domain-expert tuning.
The same set of generic language criteria can be reused across different quadruped environments and terrains.
Multiple competing objectives in locomotion behavior are captured through separate preference heads rather than a single overall judgment.
Policy optimization proceeds with a learned reward derived entirely from language-model comparisons instead of hand-crafted terms.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on other robot morphologies or non-locomotion tasks where multiple objectives must be balanced.
Consistency of the LLM preferences might be checked by repeating comparisons with different models or prompt phrasings.
The aggregation step from multi-head scores to scalar reward could be varied to measure sensitivity of final policy quality.
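The consistency check suggested in this list could be sketched as a simple agreement rate between two sets of labels for the same trajectory pairs, for example from two models or two prompt phrasings. The function name and the example labels are hypothetical; this is one possible measure, not something the paper reports.

```python
from typing import Sequence


def agreement_rate(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of trajectory pairs on which two label sets agree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if not labels_a:
        raise ValueError("need at least one labeled pair")
    matches = sum(1 for a, b in zip(labels_a, labels_b) if a == b)
    return matches / len(labels_a)


# Hypothetical labels for four pairs from two prompt phrasings.
rate = agreement_rate([0, 1, 2, 1], [0, 1, 1, 1])
```

A low rate on one objective but not on others would point at that objective's wording rather than at the LLM as a whole.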

Load-bearing premise

Independent LLM comparisons along generic, terrain-invariant criteria produce preference signals whose aggregation yields rewards that produce locomotion policies comparable in quality to those from expert-designed rewards.

What would settle it

Training policies with MAPL rewards and expert rewards on a new quadruped environment or robot morphology and observing that MAPL policies underperform on standard locomotion metrics would falsify the comparability result.

Figures

Figures reproduced from arXiv: 2606.25398 by Joseph Campbell, Muhan Lin, Shuyang Shi, Xiyue Chen.

**Figure 1.** MAPL consists of two iterative steps. Scoring Model Training: (1) Sample trajectory pairs from replay buffer. (2) Query LLM to obtain multiobjective pairwise preferences. (3) Train a transformer-based preference scoring model for each objective. Policy Training: (1) Sample state-action pairs by rolling out policy. (2) Compute scores for each state using the preference scoring model. (3) Aggregate scores t…

**Figure 2.** Evaluation terrains where the quadruped robots are trained to walk.

**Figure 3.** Velocity training curves of MAPL and baseline methods across diverse terrains. All results are averaged over five independent random seeds. The …

**Figure 4.** All results are averaged over three independent random seeds for the ablation studies. Solid lines denote the mean training performance, and shaded …

**Figure 5.** Velocity tracking error of the trained policies across different methods. The error is computed under given forward command velocities, where …

**Figure 6.** Velocity learning curves of all methods with and without the potential difference. The results are run over 5 seeds, and the moving average with …
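The caption to Figure 6 mentions "the potential difference" without defining it on this page. One common reading, sketched below as an assumption, treats the learned score of each state as a potential and uses the discounted difference between consecutive scores as the per-step reward. The function name and the `gamma` parameter are illustrative and may not match the paper's formulation.

```python
from typing import List, Sequence


def potential_difference_rewards(
    scores: Sequence[float], gamma: float = 1.0
) -> List[float]:
    """Per-step reward r_t = gamma * score[t + 1] - score[t].

    `scores` holds one learned score per visited state; the result has one
    fewer entry than `scores`.
    """
    return [gamma * scores[t + 1] - scores[t] for t in range(len(scores) - 1)]


# With gamma = 1 the rewards telescope: their sum is last score minus first.
rewards = potential_difference_rewards([0.0, 0.5, 0.2])
```

Under this reading a constant offset in the scores cancels out, so only changes in the score along a trajectory are rewarded.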

**Figure 7.** The robot height, roll, pitch distribution over 20 random steps rolled …

**Figure 8.** Evaluation rewards under different λ settings for each preference objective. Results are averaged over three random seeds with a smoothing window of 100. The shaded areas indicate the standard error.

… a tight clustering even closer to the origin, demonstrating improved stability even without expert-crafted terms and manual tuning. In contrast, LLM rewards without preference modeling exhibit higher variance …
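The captions describe curves averaged over seeds, smoothed with a moving window, and shaded by standard error. A minimal sketch of that post-processing follows; it assumes a trailing window and the sample standard deviation divided by the square root of the seed count, neither of which is specified on this page.

```python
import math
from typing import List, Sequence, Tuple


def moving_average(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early entries average over what is available."""
    out: List[float] = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        out.append(running / min(i + 1, window))
    return out


def mean_and_standard_error(
    per_seed: Sequence[Sequence[float]],
) -> Tuple[List[float], List[float]]:
    """Per-step mean and standard error across seeds (needs at least 2 seeds)."""
    n = len(per_seed)
    if n < 2:
        raise ValueError("standard error needs at least two seeds")
    means: List[float] = []
    sems: List[float] = []
    for step_values in zip(*per_seed):
        m = sum(step_values) / n
        var = sum((x - m) ** 2 for x in step_values) / (n - 1)  # sample variance
        means.append(m)
        sems.append(math.sqrt(var / n))
    return means, sems


smoothed = moving_average([1.0, 2.0, 3.0, 4.0], window=2)
means, sems = mean_and_standard_error([[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]])
```

With only three or five seeds the standard error is itself a noisy estimate, which is worth keeping in mind when two shaded bands overlap.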

Original abstract

Reward design remains a major bottleneck in reinforcement learning for robot locomotion, where successful policies often depend on carefully tuned, task-specific reward functions. Preference-based reinforcement learning offers an alternative, but existing LLM-based methods typically ask for a single overall judgment between behaviors, making it difficult to capture the multiple competing objectives that underlie high-quality locomotion. We present Multi-Objective AI-Informed Preference Learning (MAPL), a framework that learns locomotion rewards from high-level natural language objectives rather than manually engineered reward equations. MAPL prompts a large language model to compare trajectories independently along semantically meaningful criteria, using generic language descriptions that are terrain-invariant and require little domain expertise. These objective-wise preferences are used to train a multi-head preference scoring model, whose outputs are aggregated to form a scalar reward for policy optimization. Across four quadruped locomotion environments, MAPL trains policies using only LLM-generated preferences and achieves performance comparable to or better than expert-designed rewards, while eliminating task-specific reward engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MAPL uses LLM multi-objective preferences to train a multi-head scorer for locomotion rewards, but the abstract gives no results so the aggregation claim stays untested.


The new piece is eliciting separate LLM judgments on distinct criteria (velocity, energy, stability) rather than one overall preference, then training a multi-head model whose outputs get aggregated into a scalar reward. This targets the known issue that single-judgment methods struggle with competing locomotion objectives.

It frames the reward-design bottleneck cleanly and picks generic, terrain-invariant language prompts that need little robotics knowledge. That setup is a straightforward extension of existing preference learning and avoids some manual tuning.

The main gap is evidence. The abstract claims performance comparable to or better than expert rewards across four quadruped environments, yet supplies no numbers, baselines, variance, or protocol. Without those, the claim cannot be checked. The aggregation step is also the least secure part: independent LLM scores on broad criteria may not recover the precise weightings that expert rewards use for dynamics trade-offs, and nothing in the description shows why the chosen aggregation would preserve them.

This is for people working on RL reward design or LLM-assisted robotics. A reader already in that area could get value from the method description if the experiments hold up. It deserves a serious referee to check the data and the aggregation details rather than a desk reject.

Referee Report

3 major / 2 minor

Summary. The paper introduces MAPL, a framework that prompts an LLM to compare locomotion trajectories independently along multiple semantically meaningful, terrain-invariant criteria; these preferences train a multi-head scoring model whose outputs are aggregated into a scalar reward for policy optimization in RL. The central claim is that, across four quadruped environments, policies trained solely from these LLM-generated preferences achieve performance comparable to or better than those obtained from expert-designed rewards, while eliminating task-specific reward engineering.

Significance. If the empirical results and the aggregation step are validated, the work would offer a concrete route to replacing manual multi-objective reward engineering with high-level language specifications in robot locomotion, a persistent bottleneck in RL for legged systems. The multi-criterion preference elicitation is a clear technical contribution over single-judgment LLM preference methods.

major comments (3)

[Abstract] Abstract (final sentence): the performance claim that MAPL policies are 'comparable to or better than expert-designed rewards' is stated without any quantitative results, baselines, statistical details, or experimental protocol, so it is impossible to verify whether the data support the claim as written.
[Method] Method section (aggregation of multi-head scores): no derivation, sensitivity analysis, or ablation is supplied showing that independent LLM judgments on generic criteria, once aggregated, recover the precise trade-offs among competing objectives (e.g., velocity vs. energy vs. stability) that expert rewards encode; this step is load-bearing for the claim that LLM-based rewards match expert performance.
[Experiments] Experiments section: the abstract asserts results 'across four quadruped locomotion environments' yet supplies no table or figure reporting the actual metrics, variance, or statistical tests that would allow assessment of whether the LLM-derived rewards truly match or exceed the expert baselines.

minor comments (2)

[Method] The description of the multi-head model architecture and the exact aggregation function (linear, learned, etc.) should be stated with equations for reproducibility.
[Method] Clarify whether the LLM criteria are fixed across all terrains or adapted; the claim of 'terrain-invariant' descriptions needs an explicit list or example in the text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract, method, and experiments. We address each major comment point by point below.


Referee: [Abstract] Abstract (final sentence): the performance claim that MAPL policies are 'comparable to or better than expert-designed rewards' is stated without any quantitative results, baselines, statistical details, or experimental protocol, so it is impossible to verify whether the data support the claim as written.

Authors: We agree the abstract claim would be more verifiable with quantitative support. In revision we will update the final sentence to reference key metrics (e.g., normalized returns and success rates) and the evaluation protocol across environments and random seeds.
Revision: yes

Referee: [Method] Method section (aggregation of multi-head scores): no derivation, sensitivity analysis, or ablation is supplied showing that independent LLM judgments on generic criteria, once aggregated, recover the precise trade-offs among competing objectives (e.g., velocity vs. energy vs. stability) that expert rewards encode; this step is load-bearing for the claim that LLM-based rewards match expert performance.

Authors: The current manuscript describes the multi-head model and a simple weighted aggregation but lacks the requested analysis. We will add a derivation of the aggregation, a sensitivity study on weights, and an ablation comparing aggregated LLM scores against expert reward trade-offs.
Revision: yes

Referee: [Experiments] Experiments section: the abstract asserts results 'across four quadruped locomotion environments' yet supplies no table or figure reporting the actual metrics, variance, or statistical tests that would allow assessment of whether the LLM-derived rewards truly match or exceed the expert baselines.

Authors: The experiments section contains tables and figures with per-environment metrics; however, to improve clarity we will insert a consolidated main-body table reporting means, standard deviations over multiple seeds, and direct statistical comparisons to the expert-designed reward baselines.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external LLM judgments


The paper describes training a multi-head scoring model from LLM-generated per-criterion preferences on generic language criteria, then aggregating outputs into a scalar reward for RL policy optimization. This chain depends on external LLM comparisons rather than any fitted parameter of the target performance claim or self-citation that reduces the result to its inputs by construction. No equations, uniqueness theorems, or ansatzes are shown to be smuggled or self-defined. The reported performance comparability is presented as an empirical outcome across environments, not a definitional equivalence. This is the normal case of a self-contained empirical method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient detail to identify concrete free parameters, axioms, or invented entities; the framework description mentions an LLM, multi-head model, and aggregation step but gives no equations or implementation specifics.



APPENDIX

A. Reward Model Hyperparameter

The hyperparameters of the Transformer-based reward model are tuned manually. The details are listed below.

Hyperparameters (Transformer Reward Model):
Embedding Dimension: 256
Nu...
Velocity Preference Prompt:

You are a robotics engineer specializing in analyzing and comparing the trajectories of a Unitree Go2 Robot Dog. You will be given two trajectories of the Unitree Go2 Robot Dog, and you need to decide which trajectory is better. A trajectory will include the following information:

"The base linear velocity" (m/s): Current Unitree Go2 robot dog x and y velocity, you will need that information to decide if the robot dog is moving forward, the shape of this term is (6, 2), where 6 is the time length, and 2 is x and y linear velocity respectively
"The base angular velocity" (rad/s): Current Unitree Go2 robot dog yaw velocity, you will need that information to decide if the robot dog is turning, the shape of this term is (6, 1)
"The commands" (m/s, m/s, rad/s): The desired x, y, yaw velocity of Unitree Go2 robot dog, you will need that information to decide if the robot dog is following the commands, the shape of this term is (6, 3)
"The feet contacts" [front left, front right, rear left, rear right]: The contact boolean values of the four feet on the ground. 1 means touching the ground while 0 means in the air. You only need that information for considering Gait pattern consistency. The shape of this term is (6, 4)

The negative sign only represents direction of velocity

Decision Rules
1) Linear velocity tracking (primary criterion) At each timestep, compare the robot’s actual x–y linear velocity to the commanded x–y velocity. Smaller deviations correspond to better tracking, and preference should decrease smoothly as deviations increase. Very small errors should stil...
2) Yaw (angular) velocity tracking (secondary criterion) Evaluate how closely the robot’s yaw rate follows the commanded yaw rate across timesteps in the same smooth and continuous manner
3) Gait pattern consistency (conditional criterion) The robot is encouraged to use a diagonal gait (trot): Front-left and rear-right feet tend to be in the same contact state at the same time. Front-right and rear-left feet tend to be in the same contact state at the same time.

Overall preference rule
The final decision must consider both linear and angular ...

If the first trajectory is better than the second trajectory, the preference value for this pair of trajectory is 0
If the second trajectory is better than the first trajectory, the preference value for this pair of trajectory is 1

You should analyze each pair of trajectory independently, do not refer to previous results. Please return a Python list of preference values. You must output ONLY a JSON array, do not output anything else, do not hallucinate or make up data, strictly follow the decision rules

Stability Preference Prompt:

You are a robotics engineer specializing in analyzing and comparing the trajectories of a Unitree Go2 Robot. You will be given two trajectories of the Unitree Go2 Robot, each trajectory consisting of continuous 6 states, and you need to decide which trajectory is better given 6 states in the trajectory. A trajectory will inclu...

"The base height" (m): The z position (height) of the robot base torso. Shape (6,1), one value per step. Height should stay close to 0.34 m with minimal fluctuation
"The vertical linear velocity" (m/s): The z-axis component of the base linear velocity. Shape (6, 1). Values should be near 0 to avoid bouncing
"The roll/pitch angular velocity" (rad/s): The base angular rates around x (roll) and y (pitch). Shape (6, 2). Magnitudes should be close to 0 to avoid rocking.

Decision Rules
1) Base height consistency The base height should stay very close to 0.34 m throughout the trajectory. Even small deviations should noticeably reduce preference, and larger deviation...

Smoothness Preference Prompt:

You are a robotics engineer specializing in analyzing and comparing the trajectories of a Unitree Go2 Robot. You will be given two trajectories of the Unitree Go2 Robot, each trajectory consisting of continuous 6 states, and you need to decide which trajectory is better given these states in the trajectory. A trajectory will ...

"sum of joint action change ∆u" (unitless): The total accumulated change in joint action commands across the entire trajectory. This value is already computed as the sum of per-step, per-joint action differences between consecutive action commands. The shape for this term is (6, 1)
"stumble": The number of foot slipping, scraping, or skidding, where the foot is not properly supporting the robot’s weight but is experiencing strong lateral forces. The shape for this term is (6, 1)

Decision Rules
1) Action Smoothness: We prefer the trajectory with smaller overall joint action change.
2) Stumble avoidance: Trajectories with fewer or weake...

If the trajectory 0 is better, the preference value should be 0
If the trajectory 1 is better, the preference value should be 1
If the two trajectories are equally preferable, the preference value for this pair of trajectory is 2

You should analyze each pair of trajectory independently, do not refer to previous results Please return a Python list of preference values. You must output ONLY a JSON array, do not output anything else, do not hallucinate or make up data, strictly follow the decision rules
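The prompts ask the LLM to return only a JSON array of preference values, with 0 for the first trajectory, 1 for the second, and 2 for a tie where the prompt allows it. A minimal sketch of validating such a reply and turning it into training targets might look like the following. The function names are hypothetical, and mapping a tie to a soft target of 0.5 is an assumption borrowed from common preference-learning practice, not something stated on this page.

```python
import json
from typing import List

VALID_LABELS = {0, 1, 2}  # 0: first better, 1: second better, 2: tie


def parse_preferences(reply: str, expected_pairs: int) -> List[int]:
    """Parse an LLM reply that should be a JSON array of preference values.

    Raises ValueError on malformed JSON, wrong length, or unknown labels.
    """
    data = json.loads(reply.strip())
    if not isinstance(data, list) or len(data) != expected_pairs:
        raise ValueError("reply must be a JSON array with one value per pair")
    for value in data:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"non-integer preference value: {value!r}")
        if value not in VALID_LABELS:
            raise ValueError(f"unknown preference value: {value!r}")
    return data


def to_soft_targets(labels: List[int]) -> List[float]:
    """Map labels to the probability that the second trajectory is preferred.

    Assumption: a tie (label 2) becomes 0.5.
    """
    mapping = {0: 0.0, 1: 1.0, 2: 0.5}
    return [mapping[label] for label in labels]


labels = parse_preferences("[0, 1, 2]", expected_pairs=3)
targets = to_soft_targets(labels)
```

Rejecting a malformed reply outright, rather than repairing it, keeps noisy labels out of the scoring model at the cost of an occasional repeated query.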

[1] [1]

RMA: Rapid Motor Adaptation for Legged Robots

A. Kumar, Z. Fu, D. Pathak, and J. Malik, “Rma: Rapid motor adaptation for legged robots,”arXiv preprint arXiv:2107.04034, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[2] [2]

Learning quadrupedal locomotion over challenging terrain,

J. Lee, J. Hwangbo, L. Wellhausen, V . Koltun, and M. Hutter, “Learning quadrupedal locomotion over challenging terrain,”Science robotics, vol. 5, no. 47, p. eabc5986, 2020

2020

[3] G. B. Margolis and P. Agrawal, “Walk these ways: Tuning robot control for generalization with multiplicity of behavior,” in Conference on Robot Learning. PMLR, 2023, pp. 22–31.

[4] G. B. Margolis, G. Yang, K. Paigwar, T. Chen, and P. Agrawal, “Rapid locomotion via reinforcement learning,” The International Journal of Robotics Research, vol. 43, no. 4, pp. 572–587, 2024.

[5] J. Skalse, N. Howe, D. Krasheninnikov, and D. Krueger, “Defining and characterizing reward gaming,” Advances in Neural Information Processing Systems, vol. 35, pp. 9460–9471, 2022.

[6] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, “Deep reinforcement learning from human preferences,” Advances in Neural Information Processing Systems, vol. 30, 2017.

[7] J. Park, Y. Seo, J. Shin, H. Lee, P. Abbeel, and K. Lee, “SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning,” arXiv preprint arXiv:2203.10050, 2022.

[8] K. Lee, L. Smith, and P. Abbeel, “PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training,” arXiv preprint arXiv:2106.05091, 2021.

[9] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” arXiv preprint arXiv:2310.12931, 2023.

[10] Z. K. Heng, Z. Zhao, T. Wu, Y. Wang, M. Wu, Y. Wang, and H. Dong, “Boosting universal LLM reward design through heuristic reward observation space evolution,” arXiv preprint arXiv:2504.07596, 2025.

[11] Y. Zeng, Y. Mu, and L. Shao, “Learning reward for robot skills using large language models via self-alignment,” arXiv preprint arXiv:2405.07162, 2024.

[12] X. Wang, K. Lee, K. Hakhamaneshi, P. Abbeel, and M. Laskin, “Skill preferences: Learning to extract and execute robotic skills from human feedback,” in Conference on Robot Learning. PMLR, 2022, pp. 1259–1268.

[13] R. Wang, D. Zhao, Z. Yuan, T. Shao, G. Chen, D. Kao, S. Hong, and B.-C. Min, “PRIMT: Preference-based reinforcement learning with multimodal feedback and trajectory synthesis from foundation models,” arXiv preprint arXiv:2509.15607, 2025.

[14] Y. Wang, Z. Sun, J. Zhang, Z. Xian, E. Biyik, D. Held, and Z. Erickson, “RL-VLM-F: Reinforcement learning from vision language foundation model feedback,” arXiv preprint arXiv:2402.03681, 2024.

[15] P. Zhou, Y. Zhou, Q. K. Luu, S. Han, H. Zhang, B. Huang, Y. Li, A. Ajoudani, Z. Xu, and Y. She, “Learning tactile-aware quadrupedal loco-manipulation policies,” arXiv preprint arXiv:2604.27224, 2026.

[16] A. Wilson, A. Fern, and P. Tadepalli, “A Bayesian approach for policy learning from trajectory preference queries,” Advances in Neural Information Processing Systems, vol. 25, 2012.

[17] H. Lee, S. Phatale, H. Mansoor, T. Mesnard, J. Ferret, K. Lu, C. Bishop, E. Hall, V. Carbune, A. Rastogi et al., “RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback,” arXiv preprint arXiv:2309.00267, 2023.

[18] Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon et al., “Constitutional AI: Harmlessness from AI feedback,” arXiv preprint arXiv:2212.08073, 2022.

[19] T. Yu, H. Zhang, Q. Li, Q. Xu, Y. Yao, D. Chen, X. Lu, G. Cui, Y. Dang, T. He et al., “RLAIF-V: Open-source AI feedback leads to super GPT-4V trustworthiness,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 19 985–19 995.

[20] L. Li, Z. Xie, M. Li, S. Chen, P. Wang, L. Chen, Y. Yang, B. Wang, and L. Kong, “Silkie: Preference distillation for large visual language models,” arXiv preprint arXiv:2312.10665, 2023.

[21] S. Tu, J. Sun, Q. Zhang, X. Lan, and D. Zhao, “Online preference-based reinforcement learning with self-augmented feedback from large language model,” arXiv preprint arXiv:2412.16878, 2024.

[22] R. Wang, D. Zhao, Z. Yuan, I. Obi, and B.-C. Min, “PrefCLM: Enhancing preference-based reinforcement learning with crowdsourced large language models,” IEEE Robotics and Automation Letters, vol. 10, no. 3, pp. 2486–2493, 2025.

[23] S. Venkataraman, Y. Wang, Z. Wang, N. S. Ravie, Z. Erickson, and D. Held, “Real-world offline reinforcement learning from vision language model feedback,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 13 452–13 459.

[24] P. Jian, X. Wei, Y. Liu, S. A. Moore, M. M. Zavlanos, and B. Chen, “LAPP: Large language model feedback for preference-driven reinforcement learning,” arXiv preprint arXiv:2504.15472, 2025.

[25] Y. Kadokawa, J. Frey, T. Miki, T. Matsubara, and M. Hutter, “DAPPER: Discriminability-aware policy-to-policy preference-based reinforcement learning for query-efficient robot skill acquisition,” IEEE Robotics & Automation Magazine, 2026.

[26] R. S. Sutton, A. G. Barto et al., Reinforcement learning: An introduction. MIT Press Cambridge, 1998, vol. 1, no. 1.

[27] R. A. Bradley and M. E. Terry, “Rank analysis of incomplete block designs: I. the method of paired comparisons,” Biometrika, vol. 39, no. 3/4, pp. 324–345, 1952.

[28] C. Kim, J. Park, J. Shin, H. Lee, P. Abbeel, and K. Lee, “Preference transformer: Modeling human preferences using transformers for RL,” arXiv preprint arXiv:2303.00957, 2023.

[29] M. Lin, S. Shi, Y. Guo, B. Chalaki, V. Tadiparthi, E. M. Pari, S. Stepputtis, J. Campbell, and K. Sycara, “Navigating noisy feedback: Enhancing reinforcement learning with error-prone language models,” in Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.

[30] G. Swamy, C. Dann, R. Kidambi, Z. S. Wu, and A. Agarwal, “A minimaximalist approach to reinforcement learning from human feedback,” arXiv preprint arXiv:2401.04056, 2024.

[31] N. Rudin, D. Hoeller, P. Reist, and M. Hutter, “Learning to walk in minutes using massively parallel deep reinforcement learning,” in Conference on Robot Learning. PMLR, 2022, pp. 91–100.

[32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[33] C. Schwarke, M. Mittal, N. Rudin, D. Hoeller, and M. Hutter, “RSL-RL: A learning library for robotics research,” arXiv preprint arXiv:2509.10771, 2025.

APPENDIX

A. Reward Model Hyperparameter

The hyperparameters of the Transformer-based reward model are tuned manually. The details are listed below.

Hyperparameter    Transformer Reward Model
Embedding Dimension    256
Nu...

Velocity Preference Prompt: You are a robotics engineer specializing in analyzing and comparing the trajectories of a Unitree Go2 Robot Dog. You will be given two trajectories of the Unitree Go2 Robot Dog, and you need to decide which trajectory is better. A trajectory will include the following information:

“The base linear velocity” (m/s): Current Unitree Go2 robot dog x and y velocity, you will need that information to decide if the robot dog is moving forward, the shape of this term is (6, 2), where 6 is the time length, and 2 is x and y linear velocity respectively

“The base angular velocity” (rad/s): Current Unitree Go2 robot dog yaw velocity, you will need that information to decide if the robot dog is turning, the shape of this term is (6, 1)

“The commands” (m/s, m/s, rad/s): The desired x, y, yaw velocity of Unitree Go2 robot dog, you will need that information to decide if the robot dog is following the commands, the shape of this term is (6, 3)

“The feet contacts” [front left, front right, rear left, rear right]: The contact boolean values of the four feet on the ground. 1 means touching the ground while 0 means in the air. You only need that information for considering Gait pattern consistency. The shape of this term is (6, 4)

The negative sign only represents direction of velocity

Decision Rules

1) Linear velocity tracking (primary criterion) At each timestep, compare the robot’s actual x–y linear velocity to the commanded x–y velocity. Smaller deviations correspond to better tracking, and preference should decrease smoothly as deviations increase. Very small errors should stil...

Yaw (angular) velocity tracking (secondary criterion) Evaluate how closely the robot’s yaw rate follows the commanded yaw rate across timesteps in the same smooth and continuous manner

Gait pattern consistency (conditional criterion) The robot is encouraged to use a diagonal gait (trot): Front-left and rear-right feet tend to be in the same contact state at the same time. Front-right and rear-left feet tend to be in the same contact state at the same time.

Overall preference rule The final decision must consider both linear and angular ...

If the first trajectory is better than the second trajectory, the preference value for this pair of trajectory is 0

If the second trajectory is better than the first trajectory, the preference value for this pair of trajectory is 1

You should analyze each pair of trajectory independently, do not refer to previous results. Please return a Python list of preference values. You must output ONLY a JSON array, do not output anything else, do not hallucinate or make up data, strictly follow the decision rules
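The velocity prompt asks the LLM to weigh tracking error and gait consistency from raw arrays. As a rough illustration of the quantities those decision rules refer to, the sketch below computes them numerically. It is not part of MAPL, where the LLM makes the judgment directly. The function names `tracking_errors` and `trot_consistency`, the use of mean Euclidean error, and the agreement-fraction gait score are all assumptions chosen for illustration.

```python
import math


def tracking_errors(lin_vel, yaw_vel, commands):
    """Mean linear (x-y) and yaw tracking errors over one trajectory.

    lin_vel:  rows of [vx, vy]                  (shape (6, 2) in the prompt)
    yaw_vel:  rows of [yaw_rate]                (shape (6, 1))
    commands: rows of [vx_cmd, vy_cmd, yaw_cmd] (shape (6, 3))
    """
    steps = len(commands)
    # Euclidean distance between actual and commanded x-y velocity (an assumption).
    lin_err = sum(
        math.hypot(v[0] - c[0], v[1] - c[1]) for v, c in zip(lin_vel, commands)
    ) / steps
    yaw_err = sum(abs(w[0] - c[2]) for w, c in zip(yaw_vel, commands)) / steps
    return lin_err, yaw_err


def trot_consistency(contacts):
    """Fraction of diagonal-pair checks where both feet share a contact state.

    contacts: rows of [front_left, front_right, rear_left, rear_right],
    with 1 for ground contact and 0 for in the air.
    """
    agree = sum((fl == rr) + (fr == rl) for fl, fr, rl, rr in contacts)
    return agree / (2 * len(contacts))
```

A check of this kind could be used to spot-audit LLM labels on pairs where one trajectory is clearly better, without replacing the LLM comparison itself.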

Stability Preference Prompt: You are a robotics engineer specializing in analyzing and comparing the trajectories of a Unitree Go2 Robot. You will be given two trajectories of the Unitree Go2 Robot, each trajectory consisting of continuous 6 states, and you need to decide which trajectory is better given 6 states in the trajectory. A trajectory will inclu...

“The base height” (m): The z position (height) of the robot base torso. Shape (6,1), one value per step. Height should stay close to 0.34 m with minimal fluctuation

“The vertical linear velocity” (m/s): The z-axis component of the base linear velocity. Shape (6, 1). Values should be near 0 to avoid bouncing

“The roll/pitch angular velocity” (rad/s): The base angular rates around x (roll) and y (pitch). Shape (6, 2). Magnitudes should be close to 0 to avoid rocking.

Decision Rules

1) Base height consistency The base height should stay very close to 0.34 m throughout the trajectory. Even small deviations should noticeably reduce preference, and larger deviation...
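To make the three stability inputs concrete, here is a minimal sketch that summarizes one trajectory along the axes the prompt names: deviation from the 0.34 m target height, vertical bouncing, and roll/pitch rocking. The helper name `stability_summary`, the dictionary keys, and the choice of mean absolute values are assumptions for illustration. The prompt leaves the actual weighting to the LLM.

```python
import math


def stability_summary(base_height, vertical_vel, roll_pitch_vel, target_height=0.34):
    """Summarize one trajectory using the stability prompt's three inputs.

    base_height:    rows of [z]                      (shape (6, 1), metres)
    vertical_vel:   rows of [vz]                     (shape (6, 1), m/s)
    roll_pitch_vel: rows of [roll_rate, pitch_rate]  (shape (6, 2), rad/s)
    """
    steps = len(base_height)
    return {
        # Mean absolute deviation from the 0.34 m target height.
        "height_dev": sum(abs(h[0] - target_height) for h in base_height) / steps,
        # Mean absolute vertical velocity (bouncing).
        "bounce": sum(abs(v[0]) for v in vertical_vel) / steps,
        # Mean magnitude of the roll/pitch rate vector (rocking).
        "rocking": sum(math.hypot(w[0], w[1]) for w in roll_pitch_vel) / steps,
    }
```

Lower values on all three entries correspond to the behavior the prompt describes as preferable.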

Smoothness Preference Prompt: You are a robotics engineer specializing in analyzing and comparing the trajectories of a Unitree Go2 Robot. You will be given two trajectories of the Unitree Go2 Robot, each trajectory consisting of continuous 6 states, and you need to decide which trajectory is better given these states in the trajectory. A trajectory will ...

“sum of joint action change ∆u” (unitless): The total accumulated change in joint action commands across the entire trajectory. This value is already computed as the sum of per-step, per-joint action differences between consecutive action commands. The shape for this term is (6, 1)

“stumble”: The number of foot slipping, scraping, or skidding, where the foot is not properly supporting the robot’s weight but is experiencing strong lateral forces. The shape for this term is (6, 1)

Decision Rules

1) Action Smoothness: We prefer the trajectory with smaller overall joint action change.

2) Stumble avoidance: Trajectories with fewer or weake...

If the trajectory 0 is better, the preference value should be 0

If the trajectory 1 is better, the preference value should be 1

If the two trajectories are equally preferable, the preference value for this pair of trajectory is 2

You should analyze each pair of trajectory independently, do not refer to previous results Please return a Python list of preference values. You must output ONLY a JSON array, do not output anything else, do not hallucinate or make up data, strictly follow the decision rules
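Two mechanical pieces of the smoothness prompt lend themselves to a short sketch: the precomputed action-change input and the JSON array of labels the LLM must return. The code below is illustrative only. `sum_action_change` assumes absolute differences, which the prompt does not specify, and `parse_preferences` is a hypothetical validator that enforces the 0 / 1 / 2 label convention stated above.

```python
import json

VALID_LABELS = (0, 1, 2)  # 0: trajectory 0 better, 1: trajectory 1 better, 2: equal


def sum_action_change(actions):
    """Sum of per-step, per-joint differences between consecutive actions.

    actions: rows of joint action commands, one row per timestep.
    Absolute differences are an assumption made for this sketch.
    """
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum(abs(c - p) for p, c in zip(prev, curr))
    return total


def parse_preferences(raw_reply, num_pairs):
    """Parse and validate the LLM reply as one label per trajectory pair."""
    labels = json.loads(raw_reply)  # malformed JSON raises a ValueError subclass
    if not isinstance(labels, list) or len(labels) != num_pairs:
        raise ValueError("expected a JSON array with one label per trajectory pair")
    for label in labels:
        # Reject booleans and floats explicitly; only the integers 0, 1, 2 are valid.
        if type(label) is not int or label not in VALID_LABELS:
            raise ValueError(f"invalid preference label: {label!r}")
    return labels
```

Rejecting malformed replies up front keeps invalid labels out of the preference dataset. How MAPL itself handles such replies is not stated in this section.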