A Generalist Agent
Pith reviewed 2026-05-13 06:19 UTC · model grok-4.3
The pith
A single transformer with fixed weights can play Atari, caption images, chat, and control a robot arm by choosing the right output tokens
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens.
What carries the argument
A decoder-only transformer that tokenizes inputs and outputs from text, vision, and embodiment domains into one shared sequence space, allowing next-token prediction to produce actions or language as needed.
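A toy sketch of this shared-sequence idea (an illustration only, not the paper's implementation; the vocabulary offsets and helper names are hypothetical): tokens from different modalities share one integer vocabulary, and a loss mask restricts training to the positions the model must emit, here the action tokens.

```python
# Hypothetical vocabulary layout (offsets are illustrative, not the paper's):
# text, image-patch, and discretized-action tokens share one integer space.
TEXT_OFFSET, IMAGE_OFFSET, ACTION_OFFSET = 0, 32_000, 33_000

def encode_episode(steps):
    """Interleave per-timestep observation and action tokens into a single
    sequence, plus a mask marking the positions the model is trained to emit."""
    tokens, train_mask = [], []
    for obs_tokens, action_tokens in steps:
        tokens += obs_tokens
        train_mask += [False] * len(obs_tokens)    # observations: context only
        tokens += action_tokens
        train_mask += [True] * len(action_tokens)  # actions: prediction targets
    return tokens, train_mask

# One control timestep: two image-patch tokens, then one action token.
steps = [([IMAGE_OFFSET + 7, IMAGE_OFFSET + 42], [ACTION_OFFSET + 512])]
tokens, mask = encode_episode(steps)
print(tokens)  # → [32007, 32042, 33512]
print(mask)    # → [False, False, True]
```

Because all token types live in one sequence, a single next-token predictor trained with this mask can emit actions in control contexts and text in language contexts.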
Load-bearing premise
Training one transformer on a mixed collection of multi-modal and multi-embodiment data produces competent performance across domains without large negative interference between tasks.
What would settle it
A direct comparison showing that the joint model scores substantially lower on any single task than a model trained only on that task, or that the model frequently selects the wrong output modality for a given context.
read the original abstract
Inspired by progress in large-scale language modeling, we apply a similar approach towards building a single generalist agent beyond the realm of text outputs. The agent, which we refer to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report we describe the model and the data, and document the current capabilities of Gato.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Gato, a single transformer network with fixed weights that functions as a multi-modal, multi-task, multi-embodiment generalist policy. The same model processes a unified token sequence and, depending on context, outputs text, joint torques, button presses, or other actions to perform tasks including Atari gameplay, image captioning, dialogue, and real-robot block stacking.
Significance. If the reported capabilities hold under fuller scrutiny, the work provides concrete empirical support for scaling transformer architectures to generalist agents that operate across modalities and embodiments without task-specific heads or weights, which could accelerate progress toward unified, context-adaptive AI systems.
major comments (1)
- [Evaluation] The manuscript documents capabilities through training and testing but supplies only limited quantitative results, few ablations on data-mixture proportions, and few head-to-head baselines against specialist models; this weakens the support for the central claim that a single set of weights achieves competent performance across domains without major interference.
minor comments (2)
- Figure captions and axis labels in the results plots would benefit from explicit units and clearer distinction between modalities to improve readability.
- [Model] The description of the tokenization scheme for continuous control signals (joint torques) could be expanded with a short equation or pseudocode for reproducibility.
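For reference, the binning scheme described in the paper's supplementary material (clip continuous values to [−1, 1], then discretize into uniform-width bins, 1024 of them) can be sketched as follows; this is a minimal illustration, and the function names are hypothetical:

```python
def tokenize_continuous(values, num_bins=1024, low=-1.0, high=1.0):
    """Clip each value into [low, high], then discretize into uniform-width
    bins (1024 bins on [-1, 1], per the supplementary material)."""
    tokens = []
    for v in values:
        v = min(max(v, low), high)        # clip into [low, high]
        frac = (v - low) / (high - low)   # position in [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens

def detokenize(tokens, num_bins=1024, low=-1.0, high=1.0):
    """Map each token back to the center of its bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]

print(tokenize_continuous([-1.0, 0.0, 0.99999]))  # → [0, 512, 1023]
```

Round-tripping through `detokenize` recovers each value to within half a bin width (about 0.001 here), which bounds the quantization error of the action tokens.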
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address the single major comment on evaluation below.
read point-by-point responses
-
Referee: [Evaluation] The manuscript documents capabilities through training and testing but supplies only limited quantitative results, few ablations on data-mixture proportions, and few head-to-head baselines against specialist models; this weakens the support for the central claim that a single set of weights achieves competent performance across domains without major interference.
Authors: We acknowledge the evaluation is primarily demonstration-oriented across many tasks and modalities. The manuscript does report quantitative metrics (e.g., Atari scores, captioning BLEU, robotics success rates) and includes direct comparisons to specialist models on several benchmarks, showing that the single set of weights reaches competent performance without task-specific heads. We agree that additional ablations on data-mixture proportions would further substantiate the absence of major interference. In the revised version we will expand the evaluation section with further head-to-head baselines where data permits and include a brief discussion of the mixture ratios used during training. Fully exhaustive ablations remain computationally prohibitive, but the existing results already illustrate that a unified transformer can handle the reported range of embodiments and modalities without catastrophic forgetting or degradation.
Revision: partial
Circularity Check
No significant circularity
full rationale
The paper is an empirical report describing the training and evaluation of a single transformer (Gato) on a mixture of multi-modal and multi-embodiment data. It makes no first-principles derivations, uniqueness theorems, or mathematical predictions that could reduce to fitted inputs by construction. Capabilities are documented via direct training runs and task-specific testing; the central claim (one network handling text, Atari, robotics, etc., via context) follows from the architecture and data mixture without self-referential definitions or load-bearing self-citations. This is a standard empirical result with no circular steps.
Axiom & Free-Parameter Ledger
free parameters (2)
- Transformer size and hyperparameters
- Data mixture proportions
axioms (1)
- Domain assumption: A transformer can process tokenized sequences from multiple modalities and generate appropriate outputs for each.
Forward citations
Cited by 30 Pith papers
-
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
-
TokaMind for Power Grid: Cross-Domain Transfer from Fusion Plasma
TokaMind, pre-trained on MAST tokamak data, transfers to power grid PMU data for severe event classification with F1 0.837, where difficulty depends on grid topology and CSD indicators boost early-warning performance ...
-
Transformers Efficiently Perform In-Context Logistic Regression via Normalized Gradient Descent
Multi-layer transformers can implement in-context logistic regression by performing normalized gradient descent steps layer by layer, obtained via supervised training of a single attention layer followed by recurrent ...
-
KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis
KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.
-
Factorization Regret mediates compositional generalization in latent space
Factorization Regret measures how latent variable interactions affect performance, and RCCs enable learning them to achieve compositional generalization in partially observable tasks.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
-
Mastering Diverse Domains through World Models
DreamerV3 uses world models and robustness techniques to solve over 150 tasks across domains with a single configuration, including Minecraft diamond collection from scratch.
-
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
-
StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception
StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...
-
Geometric Pareto Control: Riemannian Gradient Flow of Energy Function via Lie Group Homotopy
Geometric Pareto Control embeds Pareto solutions in a Lie group submanifold and navigates via Riemannian gradient flow to achieve 100% feasibility and low suboptimality in control tasks without retraining.
-
RELO: Reinforcement Learning to Localize for Visual Object Tracking
RELO replaces handcrafted spatial priors with a reinforcement learning policy for target localization in visual tracking and reports 57.5% AUC on LaSOText without template updates.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning
Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.
-
$M^2$-VLA: Boosting Vision-Language Models for Generalizable Manipulation via Layer Mixture and Meta-Skills
M²-VLA shows that generalized VLMs can serve as direct backbones for robotic manipulation by selectively extracting task-critical features via Mixture of Layers and adding Meta Skill Modules for efficient trajectory learning.
-
Bridging MARL to SARL: An Order-Independent Multi-Agent Transformer via Latent Consensus
CMAT uses a transformer decoder to produce a high-level consensus vector in latent space, enabling simultaneous order-independent actions by all agents and optimization via single-agent PPO, with superior results on S...
-
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
-
ADAPTive Input Training for Many-to-One Pre-Training on Time-Series Classification
ADAPT is a new pre-training paradigm that aligns physical properties of time-series data to allow simultaneous training on 162 diverse classification datasets, achieving new state-of-the-art performance.
-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation
GR-2 pre-trains on web-scale videos then fine-tunes on robot data to reach 97.7% average success across over 100 manipulation tasks with strong generalization to new scenes and objects.
-
LLaVA-Video: Video Instruction Tuning With Synthetic Data
LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.
-
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
-
TD-MPC2: Scalable, Robust World Models for Continuous Control
TD-MPC2 scales an implicit world-model RL method to a 317M-parameter agent that masters 80 tasks across four domains with a single hyperparameter configuration.
-
PaLM-E: An Embodied Multimodal Language Model
PaLM-E is a single 562B-parameter multimodal model that performs embodied reasoning tasks like robotic manipulation planning and visual question answering by interleaving vision, state, and text inputs with positive t...
-
Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning
Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by int...
-
Towards Systematic Generalization for Power Grid Optimization Problems
A shared graph neural network framework jointly solves ACOPF and SCUC problems using physics constraints and shows improved generalization to unseen grid topologies.
-
Neural Computers
Neural Computers are introduced as a new machine form where computation, memory, and I/O are unified in a learned runtime state, with initial video-model experiments showing acquisition of basic interface primitives f...
-
UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning
UI-TARS-2 reaches 88.2 on Online-Mind2Web, 47.5 on OSWorld, 50.6 on WindowsAgentArena, and 73.3 on AndroidWorld while attaining 59.8 mean normalized score on a 15-game suite through multi-turn RL and scalable data generation.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
Bridging Perception and Action: A Lightweight Multimodal Meta-Planner Framework for Robust Earth Observation Agents
The LMMP framework improves tool-calling accuracy and task success rates for Earth observation agents by grounding plans in multimodal features and remote sensing expert knowledge via a two-stage training process.
-
The Biggest Risk of Embodied AI is Governance Lag
Governance lag in observing, regulating, and distributing embodied AI is presented as the primary risk, appearing in observational, institutional, and distributive forms.
Reference graph
Works this paper leans on
-
[1]
Maximum a posteriori policy optimisation
Abbas Abdolmaleki, Jost Tobias Springenberg, Yuval Tassa, Remi Munos, Nicolas Heess, and Martin Riedmiller. Maximum a posteriori policy optimisation. Preprint arXiv:1806.06920.
-
[2]
Quantifying Attention Flow in Transformers
Samira Abnar and Willem Zuidema. Quantifying attention flow in transformers. ACL, 2020. Preprint arXiv:2005.00928.
-
[3]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, et al. Do as i can, not as i say: Grounding language in robotic affordances.Preprint arXiv:2204.01691,
-
[4]
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo ...
-
[5]
Concrete Problems in AI Safety
Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety.Preprint arXiv:1606.06565,
-
[6]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization.Preprint arXiv:1607.06450,
- [7]
-
[8]
Distributed distributional deterministic policy gradients
Gabriel Barth-Maron, Matthew W Hoffman, David Budden, Will Dabney, Dan Horgan, Dhruva Tb, Alistair Muldal, Nicolas Heess, and Timothy Lillicrap. Distributed distributional deterministic policy gradients. Preprint arXiv:1804.08617,
-
[9]
Charles Beattie, Joel Z Leibo, Denis Teplyashin, Tom Ward, Marcus Wainwright, Heinrich Küttler, Andrew Lefrancq, Simon Green, Víctor Valdés, Amir Sadik, et al. DeepMind lab. Preprint arXiv:1612.03801,
-
[10]
On the Opportunities and Risks of Foundation Models
Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. Preprint arXiv:2108.07258,
-
[11]
Improving Language Models by Retrieving from Trillions of Tokens
Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retrieving from trillions of tokens.Preprint arXiv:2112.04426,
-
[12]
Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym.Preprint arXiv:1606.01540,
-
[13]
Language models are few-shot learners
Published in Transactions on Machine Learning Research (11/2022)
TB Brown, B Mann, N Ryder, M Subbiah, J Kaplan, P Dhariwal, A Neelakantan, P Shyam, G Sastry, A Askell, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, pp. 1877–1901, 2020.
-
[14]
Serkan Cabi, Sergio Gómez Colmenarejo, Alexander Novikov, Ksenia Konyushkova, Scott Reed, Rae Jeong, Konrad Zolna, Yusuf Aytar, David Budden, Mel Vecerik, et al. Scaling data-driven robotics with reward sketching and batch reinforcement learning.Preprint arXiv:1909.12200,
-
[15]
Annie S Chen, Suraj Nair, and Chelsea Finn. Learning generalizable robotic reward functions from “in-the-wild” human videos. Preprint arXiv:2103.16817, 2021a. Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence model...
-
[16]
Microsoft COCO Captions: Data Collection and Evaluation Server
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. Preprint arXiv:1504.00325,
-
[17]
BabyAI: A platform to study the sample efficiency of grounded language learning
Maxime Chevalier-Boisvert, Dzmitry Bahdanau, Salem Lahlou, Lucas Willems, Chitwan Saharia, Thien Huu Nguyen, and Yoshua Bengio. BabyAI: A platform to study the sample efficiency of grounded language learning. Preprint arXiv:1810.08272,
-
[18]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. PaLM: Scaling language modeling with pathways.Preprint arXiv:2204.02311,
-
[19]
Leveraging procedural generation to benchmark reinforcement learning
Karl Cobbe, Chris Hesse, Jacob Hilton, and John Schulman. Leveraging procedural generation to benchmark reinforcement learning. InInternational Conference on Machine Learning, pp. 2048–2056,
-
[20]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirec- tional transformers for language understanding.Preprint arXiv:1810.04805,
-
[21]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Un- terthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.Preprint arXiv:2010.11929,
-
[22]
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. Preprint arXiv:2004.07219.
-
[23]
Generalized Decision Transformer for Offline Hindsight Information Matching
Hiroki Furuta, Yutaka Matsuo, and Shixiang Shane Gu. Generalized decision transformer for offline hindsight information matching.Preprint arXiv:2111.10364,
-
[24]
Gaussian Error Linear Units (GELUs)
Dan Hendrycks and Kevin Gimpel. Gaussian error linear units (GELUs).Preprint arXiv:1606.08415,
-
[25]
Muesli: Combining Improvements in Policy Optimization
Matteo Hessel, Ivo Danihelka, Fabio Viola, Arthur Guez, Simon Schmitt, Laurent Sifre, Theophane Weber, David Silver, and Hado van Hasselt. Muesli: Combining improvements in policy optimization.Preprint arXiv:2104.06159,
-
[26]
Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.Preprint arXiv:2203.15556,
-
[27]
Deep networks with stochastic depth
Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Weinberger. Deep networks with stochastic depth. Preprint arXiv:1603.09382,
-
[28]
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,
Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents.Preprint arXiv:2201.07207,
-
[29]
David Yu-Tung Hui, Maxime Chevalier-Boisvert, Dzmitry Bahdanau, and Yoshua Bengio. BabyAI 1.1. Preprint arXiv:2007.12770.
-
[30]
Perceiver IO: A General Architecture for Structured Inputs & Outputs
Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, et al. Perceiver IO: A general architecture for structured inputs & outputs.Preprint arXiv:2107.14795,
-
[31]
Massively multilingual neural machine translation
Melvin Johnson, Orhan Firat, and Roee Aharoni. Massively multilingual neural machine translation. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3874–3884.
-
[32]
One Model to Learn Them All
Lukasz Kaiser, Aidan N Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, and Jakob Uszkoreit. One model to learn them all.Preprint arXiv:1706.05137,
-
[33]
Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.Preprint arXiv:2001.08361,
-
[34]
Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents.Preprint arXiv:2103.14659,
-
[35]
CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation.Preprint arXiv:1909.05858,
-
[36]
Adam: A Method for Stochastic Optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.Preprint arXiv:1412.6980,
-
[37]
Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. My body is a cage: the role of morphology in graph-based incompatible control.Preprint arXiv:2010.01856,
-
[38]
Alex X Lee, Coline Manon Devin, Jost Tobias Springenberg, Yuxiang Zhou, Thomas Lampe, Abbas Abdol- maleki, and Konstantinos Bousmalis. How to spend your robot time: Bridging kickstarting and offline reinforcement learning for vision-based robotic manipulation.Preprint arXiv:2205.03353,
-
[39]
Pre-trained Language Models for Interactive Decision-Making
Shuang Li, Xavier Puig, Chris Paxton, Yilun Du, Clinton Wang, Linxi Fan, Tao Chen, De-An Huang, Ekin Akyürek, Anima Anandkumar, Jacob Andreas, Igor Mordatch, Antonio Torralba, and Yuke Zhu. Pre-trained language models for interactive decision-making.Preprint arXiv:2202.01771, 2022a. Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser,...
-
[40]
Teaching Language Models to Support Answers with Verified Quotes
Jacob Menick, Maja Trebacz, Vladimir Mikulik, John Aslanides, Francis Song, Martin Chadwick, Mia Glaese, Susannah Young, Lucy Campbell-Gillingham, Geoffrey Irving, et al. Teaching language models to support answers with verified quotes. Preprint arXiv:2203.11147.
-
[41]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. WebGPT: Browser-assisted question-answering with human feedback.Preprint arXiv:2112.09332,
-
[42]
WaveNet: A Generative Model for Raw Audio
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. Preprint arXiv:1609.03499,
-
[43]
Pedro A Ortega, Markus Kunesch, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Joel Veness, Jonas Buchli, Jonas Degrave, Bilal Piot, Julien Perolat, et al. Shaking the foundations: delusions in sequence models for interaction and control.Preprint arXiv:2110.10819,
-
[44]
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Preprint arXiv:2203.02155,
-
[45]
The unsurprising effectiveness of pre-trained vision models for control
Simone Parisi, Aravind Rajeswaran, Senthil Purushwalkam, and Abhinav Gupta. The unsurprising effec- tiveness of pre-trained vision models for control.Preprint arXiv:2203.03580,
-
[46]
Massively Multilingual ASR: 50 Languages, 1 Model, 1 Billion Parameters
Vineel Pratap, Anuroop Sriram, Paden Tomasello, Awni Hannun, Vitaliy Liptchinsky, Gabriel Synnaeve, and Ronan Collobert. Massively multilingual ASR: 50 languages, 1 model, 1 billion parameters.Preprint arXiv:2007.03001,
-
[47]
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher.Preprint arXiv:2112.11446,
-
[48]
Can Wikipedia Help Offline Reinforcement Learning?
Machel Reid, Yutaro Yamada, and Shixiang Shane Gu. Can Wikipedia help offline reinforcement learning? Preprint arXiv:2201.12122,
-
[49]
Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progressive neural networks. Preprint arXiv:1606.04671.
-
[50]
Multitask prompted training enables zero-shot task generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonat...
-
[51]
One Big Net for Everything
Jürgen Schmidhuber. One big net for everything.Preprint arXiv:1802.08864,
-
[52]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve transformer. Preprint arXiv:2002.05202.
-
[53]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.Journal of Machine Learning Research, 15(56): 1929–1958,
-
[54]
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. DeepMind control suite.Preprint arXiv:1801.00690,
-
[55]
LaMDA: Language Models for Dialog Applications
Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. LaMDA: Language models for dialog applications. Preprint arXiv:2201.08239,
-
[56]
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision
Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. Preprint arXiv:2108.10904.
-
[57]
Finetuned Language Models Are Zero-Shot Learners
Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. Preprint arXiv:2109.01652.
-
[58]
Ethical and social risks of harm from Language Models
Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. Preprint arXiv:2112.04359.
-
[59]
Online Decision Transformer
Qinqing Zheng, Amy Zhang, and Aditya Grover. Online decision transformer.Preprint arXiv:2202.05607,
-
[60]
Offline learning from demonstrations and unlabeled experience
Konrad Zolna, Alexander Novikov, Ksenia Konyushkova, Caglar Gulcehre, Ziyu Wang, Yusuf Aytar, Misha Denil, Nando de Freitas, and Scott Reed. Offline learning from demonstrations and unlabeled experience. Preprint arXiv:2011.13885,
-
[61]
Supplementary Material A: Model Card
We present a model card for Gato in Table 4.
-
[62]
Table 4:Gato Model Card.We follow the framework proposed in (Mitchell et al., 2019). Model details Organization DeepMind Model Date May 2022 Model Type Transformer with ResNet patch embedding for multi-task, multi-modal behavior cloning. Model Version Initial release. Feedback on the Model reedscot@google.com Intended Uses Primary Intended Uses Learn to a...
-
[63]
Finally, they are discretized using bins of uniform width on the domain [−1, 1]
(If the floating-point tensor is in the action set, we do not need to compand the elements in the sequence because actions are only defined in the range [−1, 1] for all our environments.) All the elements are subsequently clipped so that they fall in the set [−1, 1]. Finally, they are discretized using bins of uniform width on the domain [−1, 1]. We use 1024 b...
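The clip-then-bin step described here can be sketched as follows. This is a minimal reading of the text, not the paper's code: the function name is ours, and the mu-law companding applied to observations (mentioned for non-action tensors) is omitted.

```python
import numpy as np

def discretize(values, num_bins=1024):
    """Clip continuous values to [-1, 1], then map each to one of
    num_bins uniform-width bins on that domain (1024 bins in the text)."""
    clipped = np.clip(values, -1.0, 1.0)
    # Shift [-1, 1] to [0, 2], scale to [0, num_bins], and floor.
    bins = np.floor((clipped + 1.0) / 2.0 * num_bins).astype(np.int64)
    # The right edge (exactly 1.0) would land in bin num_bins; fold it back.
    return np.minimum(bins, num_bins - 1)

print(discretize(np.array([-1.0, 0.0, 0.999, 2.0])).tolist())
# [0, 512, 1023, 1023]
```

Note that out-of-range inputs (here 2.0) are clipped before binning, matching the order of operations in the excerpt.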
-
[64]
(instead of LayerNorm (Ba et al., 2016)) normalization, and GELU (Hendrycks & Gimpel,
-
[65]
C Model Architecture. C.1 Transformer Hyperparameters. Table 5: Gato transformer hyperparameters
(instead of RELU) activation functions. C Model Architecture. C.1 Transformer Hyperparameters.
Table 5: Gato transformer hyperparameters.
Hyperparameter: Gato 1.18B / 364M / 79M
Transformer blocks: 24 / 12 / 8
Attention heads: 16 / 12 / 24
Layer width: 2048 / 1536 / 768
Feedforward hidden size: 8192 / 6144 / 3072
Key/value size: 128 / 128 / 32
Shared embedding: True
Layer normalization: Pr...
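One regularity worth noting in Table 5: at every scale the feedforward hidden size is four times the layer width, a common transformer convention. A quick sanity check (the dict layout and field names are our own shorthand):

```python
# Table 5 values; field names are our own shorthand, not from the paper.
CONFIGS = {
    "1.18B": {"blocks": 24, "heads": 16, "width": 2048, "ffw": 8192, "kv": 128},
    "364M":  {"blocks": 12, "heads": 12, "width": 1536, "ffw": 6144, "kv": 128},
    "79M":   {"blocks": 8,  "heads": 24, "width": 768,  "ffw": 3072, "kv": 32},
}
for name, c in CONFIGS.items():
    # Feedforward hidden size is 4x the layer width at every scale.
    assert c["ffw"] == 4 * c["width"], name
```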
-
[66]
with 32 groups instead of LayerNorm (Ba et al., 2016), and GELU (Hendrycks & Gimpel,
-
[67]
C.3 Position Encodings. After tokens are mapped into token embeddings, two position encodings are added to the token embeddings (when applicable) to provide temporal and spatial information to the model. These are described below. Figure 17: Patch position encodings. Calculating patch positi...
-
[68]
The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches total
We will follow the process with the patch highlighted in red on the left of the subfigure. The image is of resolution 80 × 64 and each patch is 16 × 16, meaning there are 5 × 4 = 20 patches total. The highlighted patch starts at pixel row interval [16, 32] and pixel column interval [32, 48]. Normalized, the row interval is therefore [0.25, 0.5] and the column in...
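The normalization arithmetic in this worked example (a 16 × 16 patch inside an 80 × 64 image) can be reproduced with a small helper; the function is ours and only covers the normalization step, not the later quantization or sampling of position indices that the excerpt truncates.

```python
def normalized_interval(start_px, patch_size, image_extent):
    """Normalized [start, end] interval of a patch along one image axis."""
    return start_px / image_extent, (start_px + patch_size) / image_extent

# Rows: the patch spans pixel rows [16, 32] of a 64-pixel-high image.
print(normalized_interval(16, 16, 64))   # (0.25, 0.5), as in the text
# Columns: pixel columns [32, 48] of an 80-pixel-wide image.
print(normalized_interval(32, 16, 80))   # (0.4, 0.6)
```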
-
[69]
each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1
during pretraining, where each of the transformer sub-layers (i.e. each Multi-Head Attention and Dense Feedforward layer) is skipped with a probability of 0.1.
Table 6: Learning rate schedule hyperparameters for the different model scales.
Hyperparameter: Gato 1.18B / 364M / 79M
Maximum Learning Rate: 1e-4 / 2e-4 / 1e-4
Minimum Learning Rate: 1e-5 / 2e-5 / 1e-5
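The sub-layer skipping described here is a form of stochastic depth. A minimal sketch, assuming each sub-layer sits inside a residual connection (the function name is ours):

```python
import random

def residual_sublayer(sublayer, x, skip_prob=0.1, training=True):
    """Apply a transformer sub-layer with its residual connection,
    skipping the sub-layer entirely with probability skip_prob during
    training (stochastic depth, rate 0.1 in the text)."""
    if training and random.random() < skip_prob:
        return x  # sub-layer skipped: residual stream passes through
    return x + sublayer(x)
```

With skip_prob = 0.1, roughly one sub-layer application in ten is replaced by the identity during pretraining; at evaluation (training=False) every sub-layer runs.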
-
[70]
Evaluation: We evaluate the agent every 100 learning steps
with a rate of 0.1. Evaluation: We evaluate the agent every 100 learning steps. Each evaluation reports the average of 10 runs of a given checkpoint. The moving average of 5 such scores is computed (to gather 50 runs together). The final fine-tuning performance is defined as the maximum of these smoothed scores. Datasets: We generated data for the fine-tuning tas...
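The smoothing-and-max procedure above can be written out directly; the helper name is ours:

```python
def final_finetune_score(per_checkpoint_means, window=5):
    """per_checkpoint_means[i] is the average return of 10 evaluation runs
    of checkpoint i (one evaluation per 100 learning steps). The final
    score is the maximum over moving averages of `window` consecutive
    entries, pooling 5 x 10 = 50 runs per smoothed point."""
    smoothed = [
        sum(per_checkpoint_means[i:i + window]) / window
        for i in range(len(per_checkpoint_means) - window + 1)
    ]
    return max(smoothed)

print(final_finetune_score([1, 2, 3, 4, 5, 4, 3]))  # 3.8
```

Taking the max of smoothed scores rather than the last score makes the reported number robust to late-training degradation, at the cost of a mild upward bias.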
-
[71]
We record approximately 20,000 random episodes generated by the agent during training
agent for 200M total environment steps. We record approximately 20,000 random episodes generated by the agent during training. F.2 Sokoban Sokoban is a planning problem (Racanière et al., 2017), in which the agent has to push boxes to target locations. Some of the moves are irreversible and consequently mistakes can render the puzzle unsolvable. Planning ...
-
[72]
We collect 100,000 episodes for each level
for more details about the bot. We collect 100,000 episodes for each level. (Footnote 3: Basic Math, Breakout, Crossbow, Darkchambers, Entombed, ET, Flag Capture, Human Cannonball, Klax, Laser Gates, Ms. Pac-Man, Solaris, Space War.) F.4 DeepMind Control Suite. The DeepMind Control Suite (Tunyasuvunako...
-
[73]
agent jointly on a set of 18 parent DM Lab levels that generate maps procedurally for each new episode. Data was collected by executing the agent on these 18 levels, as well as an additional set of 237 levels handcrafted to test a diverse set of skills. The 18 parent levels are characterized by high diversity of generated maps. The difference between the l...
-
[74]
Walker2d-v2, Humanoid-v2, and Hopper-v2. Each variant is a morphological modification of the original body: the set of morphologies is generated by enumerating all possible subsets of limbs, and keeping only thos... (Footnote 4: Available at https://neurips.cc/virtual/2021/workshop/21865#wse-detail-22801.)
-
[75]
We also implemented a parallel sampling scheme where all the action tokens are zeroed out in the input sequences during training, so we can sample all tokens corresponding to a robot action in a single model inference step instead of autoregressively, as is done in other domains. We found that the 1.18B parameter model was able to run on the hardware accel...
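A sketch of the input-side trick: because action-token inputs are zeroed during training, the model never conditions on earlier action tokens within a timestep, so at inference every action token can be predicted in a single forward pass. The function name and the embedding layout (one row per sequence position) are our assumptions:

```python
import numpy as np

def zero_action_inputs(token_embeddings, action_positions):
    """Zero the input embeddings at action-token positions, matching the
    training-time masking that permits one-shot (non-autoregressive)
    sampling of a full robot action."""
    out = token_embeddings.copy()
    out[action_positions] = 0.0
    return out

x = np.ones((4, 3))                          # 4 positions, embedding dim 3
mask = np.array([False, True, False, True])  # positions holding action tokens
y = zero_action_inputs(x, mask)
print(y.sum(axis=1).tolist())                # [3.0, 0.0, 3.0, 0.0]
```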
-
[76]
Table 7: Success rates of specialist Meta-World agent. Averaged over 500 evaluations
We evaluated the agent 500 times for each task. Table 7: Success rates of specialist Meta-World agent, averaged over 500 evaluations.
Task name: Success rate
assembly-v2: 0.980
basketball-v2: 0.964
bin-picking-v2: 0.954
box-close-v2: 0.958
button-press-topdown-v2: 0.996
button-press-topdown-wall-v2: 0.998
button-press-v2: 0.996
button-press-wall-v2: 1.000
coffee-button-...