Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies
Pith reviewed 2026-05-23 08:17 UTC · model grok-4.3
The pith
Optimizing only a latent goal embedding lets a frozen policy match task preferences without any parameter updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Preference Goal Tuning keeps the policy frozen and updates only the latent goal embedding using a trajectory-level preference objective, achieving average relative improvements of 72.0% and 81.6% on two foundation policies across 17 Minecraft tasks while surpassing full fine-tuning by 13.4% in out-of-distribution settings.
What carries the argument
The latent goal embedding, used as a continuous control variable that is optimized by a trajectory-level preference objective while the policy parameters stay frozen.
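As a rough illustration of this idea, the sketch below tunes only a goal vector against a toy frozen policy. Everything in it is an assumption made for illustration: the linear-softmax policy, the DPO-style pairwise loss with the initial goal as reference, the hyperparameters `beta`, `lr`, and `steps`, and the synthetic trajectories. The paper's actual objective and policies may differ.

```python
import math
import random

random.seed(0)
STATE_DIM, GOAL_DIM, N_ACTIONS = 4, 3, 5

# Frozen toy policy: logits[a] = W_S[a] . state + W_G[a] . goal.
# These weights stand in for the frozen policy and are never updated.
W_S = [[random.gauss(0, 1) for _ in range(STATE_DIM)] for _ in range(N_ACTIONS)]
W_G = [[random.gauss(0, 1) for _ in range(GOAL_DIM)] for _ in range(N_ACTIONS)]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def action_log_probs(state, goal):
    logits = [dot(W_S[a], state) + dot(W_G[a], goal) for a in range(N_ACTIONS)]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return [l - log_z for l in logits]


def traj_logp_and_grad(goal, traj):
    """Log-likelihood of a trajectory [(state, action), ...] and its gradient w.r.t. goal."""
    logp, grad = 0.0, [0.0] * GOAL_DIM
    for state, action in traj:
        lp = action_log_probs(state, goal)
        logp += lp[action]
        for a in range(N_ACTIONS):
            coeff = (1.0 if a == action else 0.0) - math.exp(lp[a])
            for k in range(GOAL_DIM):
                grad[k] += coeff * W_G[a][k]
    return logp, grad


def preference_loss_and_grad(goal, ref_goal, preferred, rejected, beta=0.1):
    """DPO-style pairwise loss (an assumed form) and its gradient w.r.t. goal."""
    lp_w, g_w = traj_logp_and_grad(goal, preferred)
    lp_l, g_l = traj_logp_and_grad(goal, rejected)
    ref_w, _ = traj_logp_and_grad(ref_goal, preferred)
    ref_l, _ = traj_logp_and_grad(ref_goal, rejected)
    margin = beta * ((lp_w - ref_w) - (lp_l - ref_l))
    loss = math.log1p(math.exp(-margin))      # equals -log(sigmoid(margin))
    scale = -beta / (1.0 + math.exp(margin))  # d(loss)/d(margin) times beta
    return loss, [scale * (gw - gl) for gw, gl in zip(g_w, g_l)]


def tune_goal(init_goal, preferred, rejected, steps=200, lr=0.5):
    """Gradient descent on the goal embedding only; policy weights are untouched."""
    goal = list(init_goal)
    for _ in range(steps):
        _, grad = preference_loss_and_grad(goal, init_goal, preferred, rejected)
        goal = [g - lr * dg for g, dg in zip(goal, grad)]
    return goal


# Synthetic pair (hypothetical data): same states, different actions.
states = [[random.gauss(0, 1) for _ in range(STATE_DIM)] for _ in range(6)]
preferred = [(s, 0) for s in states]
rejected = [(s, 1) for s in states]
init_goal = [0.0] * GOAL_DIM

loss_before, _ = preference_loss_and_grad(init_goal, init_goal, preferred, rejected)
tuned_goal = tune_goal(init_goal, preferred, rejected)
loss_after, _ = preference_loss_and_grad(tuned_goal, init_goal, preferred, rejected)
print(f"loss before: {loss_before:.4f}, after: {loss_after:.4f}")
```

The only quantity that changes during tuning is `goal`; `W_S` and `W_G` play the role of the frozen policy. With a real network the gradient with respect to the goal would come from automatic differentiation rather than the hand-derived expression used here.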
If this is right
- PGT needs only minimal data to adapt a policy to new task preferences.
- The same frozen policy can be reused across many tasks by storing different learned goal embeddings.
- Out-of-distribution robustness exceeds that of standard fine-tuning on the Minecraft benchmark.
- Optimized latent goals consistently outperform expert-crafted text prompts across the reported tasks.
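The reuse pattern in the list above can be sketched as a small lookup from task name to tuned goal vector. This is a minimal sketch under stated assumptions: `GoalBank`, `toy_policy`, the task names, and the vectors are all hypothetical and do not come from the paper.

```python
from typing import Callable, Dict, List

Goal = List[float]
Policy = Callable[[List[float], Goal], int]  # (observation, goal) -> action index


class GoalBank:
    """Maps task names to tuned goal embeddings for one shared frozen policy."""

    def __init__(self, policy: Policy) -> None:
        self._policy = policy
        self._goals: Dict[str, Goal] = {}

    def register(self, task: str, goal: Goal) -> None:
        # Copy so later edits by the caller do not change the stored goal.
        self._goals[task] = list(goal)

    def act(self, task: str, observation: List[float]) -> int:
        return self._policy(observation, self._goals[task])


def toy_policy(observation: List[float], goal: Goal) -> int:
    """Hypothetical frozen policy: index of the largest observation * goal product."""
    scores = [o * g for o, g in zip(observation, goal)]
    return scores.index(max(scores))


bank = GoalBank(toy_policy)
bank.register("chop_tree", [1.0, 0.0, 0.0])   # made-up task name and vector
bank.register("mine_stone", [0.0, 0.0, 1.0])  # made-up task name and vector
obs = [0.5, 0.2, 0.9]
print(bank.act("chop_tree", obs), bank.act("mine_stone", obs))
```

Adding a task here costs one stored vector, which is the storage argument behind reusing a single frozen policy.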
Where Pith is reading between the lines
- Storing multiple goal embeddings could let one policy serve many distinct preference alignments without retraining.
- If the preference signal comes from human feedback or demonstrations, the method could lower the cost of adapting large agents in robotics or games.
- The approach might be tested on other goal-conditioned models whose embeddings can be treated as continuous controls.
- Whether performance holds when preferences become more complex or when the frozen policy is much larger is not addressed.
Load-bearing premise
That changing only the goal embedding can sufficiently alter the trajectory distribution induced by the frozen policy to satisfy arbitrary task preferences.
What would settle it
Run the identical out-of-distribution tasks with full fine-tuning given exactly the same preference data and training budget as PGT, then check whether the 13.4% performance gap remains or reverses.
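One way to check whether such a gap "remains or reverses" is a paired bootstrap over tasks, sketched below. The per-task scores are placeholders invented for this example and are not results from the paper; the function names and the choice of a percentile interval are likewise assumptions.

```python
import random


def relative_gap(pgt, fft):
    """Relative improvement of the mean PGT score over the mean fine-tuning score."""
    mean_pgt = sum(pgt) / len(pgt)
    mean_fft = sum(fft) / len(fft)
    return (mean_pgt - mean_fft) / mean_fft


def paired_bootstrap_ci(pgt, fft, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the relative gap, resampling tasks in pairs."""
    assert len(pgt) == len(fft)
    rng = random.Random(seed)
    n = len(pgt)
    gaps = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(relative_gap([pgt[i] for i in idx], [fft[i] for i in idx]))
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_resamples)]
    hi = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-task success rates, NOT taken from the paper.
pgt_scores = [0.62, 0.48, 0.71, 0.55, 0.80, 0.43]
fft_scores = [0.55, 0.45, 0.60, 0.50, 0.74, 0.40]

point = relative_gap(pgt_scores, fft_scores)
low, high = paired_bootstrap_ci(pgt_scores, fft_scores)
print(f"relative gap {point:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

If the interval computed on real matched-budget runs excluded zero, the gap would be unlikely to be a resampling artifact; an interval straddling zero would leave the question open.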
Original abstract
Goal-conditioned policies enable decision-making models to execute diverse behaviors based on specified goals, yet their downstream performance is often highly sensitive to the choice of instructions or prompts. To bypass the limitations of discrete text prompts, we formulate post-training adaptation as a latent control problem, where the goal embedding serves as a continuous control variable to modulate the behavior of a frozen policy. We propose Preference Goal Tuning (PGT), a framework that optimizes this latent control variable to align the induced trajectory distribution with task preferences. Unlike standard fine-tuning that updates policy parameters, PGT keeps the policy frozen and updates only the latent goal using a trajectory-level preference objective. This approach essentially searches for the optimal conditioning input that maximizes the likelihood of preferred behaviors while suppressing undesirable ones. We evaluate PGT on the Minecraft SkillForge benchmark across 17 tasks. With minimal data, PGT achieves average relative improvements of 72.0% and 81.6% on two foundation policies, consistently outperforming expert-crafted prompts. Crucially, by decoupling task alignment (latent goal) from physical dynamics (frozen policy), PGT surpasses full fine-tuning by 13.4% in out-of-distribution settings, demonstrating superior robustness and generalization.
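One plausible way to write down the objective the abstract describes is given below. The notation and the DPO-style pairwise form are assumptions for illustration; the abstract does not state the exact loss. Environment dynamics terms are omitted from the trajectory likelihood because they do not depend on the goal.

```latex
% Assumed notation: \pi_\theta is the frozen policy, g the latent goal,
% g_0 the initial goal, \tau^+ / \tau^- a preferred / rejected trajectory,
% \mathcal{D} the preference dataset, \beta a temperature.
\log \pi_\theta(\tau \mid g) = \sum_{t=1}^{T} \log \pi_\theta(a_t \mid s_t, g)

\mathcal{L}(g) = -\,\mathbb{E}_{(\tau^+,\tau^-)\sim\mathcal{D}}
  \left[ \log \sigma\!\left( \beta \left[
    \log \frac{\pi_\theta(\tau^+ \mid g)}{\pi_\theta(\tau^+ \mid g_0)}
    - \log \frac{\pi_\theta(\tau^- \mid g)}{\pi_\theta(\tau^- \mid g_0)}
  \right] \right) \right]

g^\star = \arg\min_{g} \mathcal{L}(g), \qquad \theta \ \text{held fixed}
```

Under this reading, the only free variable is $g$, which matches the abstract's description of searching for a conditioning input rather than updating parameters.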
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Preference Goal Tuning (PGT), a post-training method that formulates adaptation of goal-conditioned policies as optimization of a continuous latent goal embedding (the control variable) while keeping the policy parameters frozen. A trajectory-level preference objective is used to align the induced trajectory distribution with task preferences. On the Minecraft SkillForge benchmark across 17 tasks, PGT reports average relative improvements of 72.0% and 81.6% on two foundation policies, outperforming expert-crafted prompts, and a 13.4% gain over full fine-tuning in out-of-distribution settings.
Significance. If the central empirical claims hold with proper verification, PGT would demonstrate that latent control via preference optimization over goal embeddings can yield more robust generalization than parameter updates, offering an efficient alternative for adapting frozen policies without updating any policy parameters.
major comments (3)
- [Abstract] The concrete relative improvement figures (72.0%, 81.6%, and the 13.4% OOD gain over full fine-tuning) are stated without any accompanying experimental protocol, number of runs, variance estimates, statistical tests, or ablation results, leaving the load-bearing empirical claim only partially supported.
- [Method] Latent goal optimization: the claim that a trajectory-level preference objective applied solely to the goal embedding is sufficient to modulate the frozen policy's induced distribution rests on the unverified assumption that the embedding space contains points producing preferred behaviors and that the preference model can distinguish nearby embeddings; no analysis, controllability test, or coverage argument is provided to establish this property.
- [Experiments] The OOD superiority claim (13.4% over full fine-tuning) is measured on held-out tasks, but the manuscript provides no comparison of how the preference objective is evaluated or optimized across in-distribution vs. OOD regimes, nor any diagnostic showing that the selected goal embeddings actually alter behavior as intended rather than selecting from the original training distribution.
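A controllability diagnostic of the kind these comments ask for could be as simple as measuring how much the action distribution moves when the goal embedding is perturbed. The sketch below is one such probe under stated assumptions: `goal_sensitivity`, the two toy policies, and the choice of KL divergence with fixed-norm random perturbations are all hypothetical and are not described in the paper.

```python
import math
import random


def softmax(logits):
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def goal_sensitivity(policy, states, base_goal, radius, n_probes=32, seed=0):
    """Mean KL(pi(.|s, g) || pi(.|s, g + delta)) over states and random probes of norm radius."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n_probes):
        delta = [rng.gauss(0, 1) for _ in base_goal]
        norm = math.sqrt(sum(d * d for d in delta))
        probe = [g + radius * d / norm for g, d in zip(base_goal, delta)]
        for s in states:
            total += kl_divergence(policy(s, base_goal), policy(s, probe))
            count += 1
    return total / count


def responsive_policy(state, goal):
    """Hypothetical policy whose action logits depend on the goal."""
    return softmax([s + 2.0 * g for s, g in zip(state, goal)])


def deaf_policy(state, goal):
    """Hypothetical policy that ignores the goal entirely."""
    return softmax(list(state))


states = [[0.1, -0.3, 0.5], [1.0, 0.0, -1.0]]
base_goal = [0.0, 0.0, 0.0]
responsive_score = goal_sensitivity(responsive_policy, states, base_goal, radius=1.0)
deaf_score = goal_sensitivity(deaf_policy, states, base_goal, radius=1.0)
print(f"responsive: {responsive_score:.4f}, goal-ignoring: {deaf_score:.4f}")
```

A score near zero across radii would suggest the goal embedding has little leverage over behavior, which is the failure case the second and third comments worry about. A nonzero score shows only that behavior changes, not that it changes in the preferred direction.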
minor comments (2)
- [Method] Notation for the preference objective and goal embedding update rule should be introduced with explicit equations rather than prose descriptions.
- [Experiments] The Minecraft SkillForge benchmark tasks and the two foundation policies should be referenced with citations or a table of task definitions.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments. We provide point-by-point responses below and will make revisions to address the concerns raised.
- Referee: [Abstract] The concrete relative improvement figures (72.0%, 81.6%, and the 13.4% OOD gain over full fine-tuning) are stated without any accompanying experimental protocol, number of runs, variance estimates, statistical tests, or ablation results, leaving the load-bearing empirical claim only partially supported.
  Authors: We agree that the abstract would be strengthened by a reference to the experimental protocol. In the revised manuscript, we will update the abstract to briefly describe the evaluation setup and number of runs, and to note that variance estimates and statistical tests appear in the main body.
  Revision: yes
- Referee: [Method] Latent goal optimization: the claim that a trajectory-level preference objective applied solely to the goal embedding is sufficient to modulate the frozen policy's induced distribution rests on the unverified assumption that the embedding space contains points producing preferred behaviors and that the preference model can distinguish nearby embeddings; no analysis, controllability test, or coverage argument is provided to establish this property.
  Authors: The empirical results on 17 tasks demonstrate the method's effectiveness, but we acknowledge the lack of an explicit controllability analysis. We will add a controllability test and a coverage argument to the method section of the revision.
  Revision: yes
- Referee: [Experiments] The OOD superiority claim (13.4% over full fine-tuning) is measured on held-out tasks, but the manuscript provides no comparison of how the preference objective is evaluated or optimized across in-distribution vs. OOD regimes, nor any diagnostic showing that the selected goal embeddings actually alter behavior as intended rather than selecting from the original training distribution.
  Authors: We will revise the experiments section to include a direct comparison of the preference objective evaluation between ID and OOD regimes, as well as additional diagnostics such as behavior alteration visualizations to confirm the embeddings induce intended changes.
  Revision: yes
Circularity Check
No significant circularity; claims rest on external benchmark evaluations rather than internal reductions.
The paper formulates PGT as optimizing a latent goal embedding via a trajectory-level preference objective while keeping the policy frozen, then reports empirical gains (e.g., a 13.4% OOD improvement over fine-tuning) on the independent Minecraft SkillForge benchmark across 17 tasks. No equations or steps reduce by construction to fitted inputs, self-definitions, or self-citation chains; the controllability assumption is tested via external performance metrics rather than being presupposed in the derivation. This is the standard case of a self-contained empirical method whose validity is assessed outside its own fitted quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the behavior of a goal-conditioned policy can be modulated to arbitrary preferred trajectory distributions solely by optimizing its conditioning input while parameters remain fixed.