Reinforcement Learning with Foundation Priors: Let the Embodied Agent Efficiently Learn on Its Own
Pith reviewed 2026-05-24 06:38 UTC · model grok-4.3
The pith
Foundation models can supply automatic rewards and guidance that let reinforcement learning agents master dexterous tasks in under an hour of real time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Foundation-guided Actor-Critic algorithm inserts priors from foundation models for policy, value, and success reward into the actor-critic loop, yielding automatic reward functions that let embodied agents explore more efficiently. Across five dexterous tasks on real robots, the method reaches an average success rate of 86 percent after one hour of real-time learning. Across eight Meta-world tasks, it reaches 100 percent success on seven of the eight in fewer than 100k frames, outperforming baselines trained for 1M frames with manually designed rewards. The framework is stated to be agnostic to the specific form of the foundation models and robust to noisy priors.
What carries the argument
The Foundation-guided Actor-Critic (FAC) algorithm, which inserts guidance and feedback from policy, value, and success-reward foundation models directly into the actor-critic training loop to generate automatic rewards.
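One way to picture how a value prior and a success signal could turn into an automatic reward is potential-based reward shaping (Ng, Harada, and Russell, 1999). The sketch below is a minimal illustration under that assumption, not the paper's stated formulation: `shaped_reward`, `toy_value_prior`, `success_bonus`, and the one-dimensional state are all hypothetical names and choices made for this example.

```python
from typing import Callable, Sequence

State = Sequence[float]


def shaped_reward(
    success: bool,
    state: State,
    next_state: State,
    value_prior: Callable[[State], float],
    gamma: float = 0.99,
    success_bonus: float = 1.0,
) -> float:
    """Sparse success reward plus a potential-based shaping term.

    The shaping term is gamma * V(s') - V(s). Using the value prior as the
    potential V is an assumption of this sketch.
    """
    sparse = success_bonus if success else 0.0
    shaping = gamma * value_prior(next_state) - value_prior(state)
    return sparse + shaping


def toy_value_prior(state: State) -> float:
    """Hypothetical value prior: negative distance of state[0] to a goal at 1.0."""
    return -abs(1.0 - state[0])


# Moving toward the goal earns a positive shaping reward, moving away a negative one.
r_toward = shaped_reward(False, [0.0], [0.5], toy_value_prior, gamma=1.0)
r_away = shaped_reward(False, [0.5], [0.0], toy_value_prior, gamma=1.0)
```

The appeal of this form is that the shaping term only redistributes reward along a trajectory, so the sparse success signal still defines what counts as solving the task.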
If this is right
- Robotic manipulation tasks become solvable with roughly one hour of real-time interaction instead of millions of samples.
- Reward engineering effort drops to near zero because success-reward signals come from the foundation model rather than manual design.
- The same FAC structure works across both real hardware and simulation without changes to the foundation-model components.
- Performance remains high even when the foundation-model priors contain noise, provided they are used as guidance inside the actor-critic update.
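The last point, using a noisy prior as guidance rather than as a hard constraint, can be illustrated with a regularized actor objective. This is a hedged sketch: the squared-distance penalty, the `weight` parameter, and the toy critic are assumptions chosen for illustration, since the summary above does not specify how the policy prior enters the update.

```python
def regularized_actor_loss(action, q_value, prior_action, weight):
    """Negative critic value plus a squared-distance penalty to the prior action."""
    penalty = sum((a - p) ** 2 for a, p in zip(action, prior_action))
    return -q_value + weight * penalty


def toy_critic(action):
    """Hypothetical critic whose best one-dimensional action is 0.8."""
    return -((action[0] - 0.8) ** 2)


def best_action(prior_action, weight, steps=100):
    """Grid-search the action in [0, 1] that minimizes the regularized loss."""
    candidates = [[i / steps] for i in range(steps + 1)]
    return min(
        candidates,
        key=lambda a: regularized_actor_loss(a, toy_critic(a), prior_action, weight),
    )[0]


# A prior that points at 0.2 pulls the chosen action away from the critic's
# optimum without overriding it; with weight 0 the prior is ignored entirely.
guided = best_action(prior_action=[0.2], weight=0.5)
unguided = best_action(prior_action=[0.2], weight=0.0)
```

In this toy setting the guided action lands between the prior's suggestion and the critic's optimum, which is the sense in which an inaccurate prior can bias exploration without dictating the final policy.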
Where Pith is reading between the lines
- The same prior-injection pattern could be tested on non-manipulation RL problems such as navigation or locomotion if comparable foundation models exist for those domains.
- Combining FAC with larger or more recent foundation models might further reduce the required interaction count, an extension the paper does not measure.
- If the priors prove stable across many tasks, the approach could support continual learning on a single robot without repeated manual reward redesign.
Load-bearing premise
Foundation models trained on non-robotic data continue to supply useful and stable priors when placed inside the actor-critic loop without any task-specific fine-tuning or manual re-weighting.
What would settle it
Running the FAC algorithm on a new dexterous manipulation task and observing success rates no higher than those of standard RL with hand-crafted rewards after the same number of real-robot trials would falsify the central claim.
Original abstract
Reinforcement learning (RL) is a promising approach for solving robotic manipulation tasks. However, it is challenging to apply the RL algorithms directly in the real world. For one thing, RL is data-intensive and typically requires millions of interactions with environments, which are impractical in real scenarios. For another, it is necessary to make heavy engineering efforts to design reward functions manually. To address these issues, we leverage foundation models in this paper. We propose Reinforcement Learning with Foundation Priors (RLFP) to utilize guidance and feedback from policy, value, and success-reward foundation models. Within this framework, we introduce the Foundation-guided Actor-Critic (FAC) algorithm, which enables embodied agents to explore more efficiently with automatic reward functions. The benefits of our framework are threefold: (1) sample efficient; (2) minimal and effective reward engineering; (3) agnostic to foundation model forms and robust to noisy priors. Our method achieves remarkable performances in various manipulation tasks on both real robots and in simulation. Across 5 dexterous tasks with real robots, FAC achieves an average success rate of 86% after one hour of real-time learning. Across 8 tasks in the simulated Meta-world, FAC achieves 100% success rates in 7/8 tasks under less than 100k frames (about 1-hour training), outperforming baseline methods with manual-designed rewards in 1M frames. We believe the RLFP framework can enable future robots to explore and learn autonomously in the physical world for more tasks. Visualizations and code are available at https://yewr.github.io/rlfp.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the Reinforcement Learning with Foundation Priors (RLFP) framework and the Foundation-guided Actor-Critic (FAC) algorithm, which incorporates outputs from policy, value, and success-reward foundation models into actor-critic RL updates. This is claimed to yield sample-efficient learning, minimal manual reward engineering, and robustness to noisy priors without task-specific fine-tuning. Empirical results include an 86% average success rate across 5 real-robot dexterous tasks after one hour of interaction and 100% success in 7/8 Meta-world tasks under 100k frames, outperforming baselines that use manual rewards over 1M frames.
Significance. If the integration of non-robotic foundation-model priors can be shown to be stable and task-agnostic as asserted, the work would meaningfully lower the barrier to real-world robotic RL by replacing hand-crafted rewards and improving exploration efficiency. The real-robot results are practically relevant; however, the absence of fusion details and controls leaves the attribution of gains to the claimed mechanism unverified.
major comments (3)
- [Methods (FAC algorithm)] The description of how foundation-model outputs (especially the success-reward signal) are fused into the actor-critic loss lacks any specification of scaling coefficients, normalization, or noise-filtering thresholds. This detail is load-bearing for the abstract's claim of being 'agnostic to foundation model forms and robust to noisy priors' and for the 'minimal engineering' guarantee.
- [Experiments] No ablation studies isolate the contribution of the policy prior, value prior, or success-reward prior, nor test sensitivity to any weighting or prompting choices. Without these, it is impossible to confirm that the reported 86% and 100% success rates do not rely on per-task engineering that would contradict the central 'minimal and effective reward engineering' claim.
- [Results] The headline success rates (86% real-robot average, 100% on 7/8 Meta-world tasks) are given without error bars, number of random seeds, or statistical significance tests against the manual-reward baselines. This prevents assessment of whether the gains are reliable or attributable to the foundation priors rather than implementation details.
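The ablations requested in the second comment can be enumerated mechanically. The sketch below is only an illustration of what such an experiment grid might look like; the configuration names, the `scales` values, and the dictionary layout are hypothetical and do not come from the paper.

```python
from itertools import product

# The three priors named in the review.
PRIORS = ("policy", "value", "success_reward")


def leave_one_out_configs(priors=PRIORS):
    """Return the full configuration plus one configuration per dropped prior."""
    configs = [{"name": "full", "enabled": frozenset(priors)}]
    for dropped in priors:
        configs.append(
            {"name": f"no_{dropped}", "enabled": frozenset(priors) - {dropped}}
        )
    return configs


def sensitivity_grid(priors=PRIORS, scales=(0.1, 1.0, 10.0)):
    """Return every assignment of a weight scale to each prior (hypothetical scales)."""
    return [dict(zip(priors, combo)) for combo in product(scales, repeat=len(priors))]


configs = leave_one_out_configs()  # full + three leave-one-out variants
grid = sensitivity_grid()          # 3 scales over 3 priors
```

Running every task under each entry of `configs` would isolate the contribution of each prior, and the entries of `grid` would probe whether the reported results depend on a particular weighting.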
minor comments (1)
- [Abstract] The abstract states that FAC 'outperforms baseline methods with manual-designed rewards in 1M frames' but does not name the baselines or report the exact frame count used by FAC for the comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our RLFP framework and FAC algorithm. We address each major comment below and commit to revisions that strengthen the manuscript's clarity, rigor, and support for its central claims.
- Referee: The description of how foundation-model outputs (especially the success-reward signal) are fused into the actor-critic loss lacks any specification of scaling coefficients, normalization, or noise-filtering thresholds. This detail is load-bearing for the abstract's claim of being 'agnostic to foundation model forms and robust to noisy priors' and for the 'minimal engineering' guarantee.
  Authors: We agree that the current manuscript provides only a high-level overview of the fusion process without explicit implementation details. In the revised version, we will expand the Methods section (specifically the FAC algorithm description) to include the exact loss formulation, scaling coefficients (lambda values for policy, value, and reward priors), normalization steps applied to each foundation model output, and any noise-filtering thresholds used. These additions will directly support the robustness and minimal-engineering claims.
  Revision: yes
- Referee: No ablation studies isolate the contribution of the policy prior, value prior, or success-reward prior, nor test sensitivity to any weighting or prompting choices. Without these, it is impossible to confirm that the reported 86% and 100% success rates do not rely on per-task engineering that would contradict the central 'minimal and effective reward engineering' claim.
  Authors: The manuscript does not contain ablation studies isolating each prior or testing sensitivity to weights and prompts. We acknowledge this gap weakens attribution of results to the framework. We will add a new subsection with ablations (removing one prior at a time, varying weights, and testing alternative prompts) on both simulation and real-robot tasks to demonstrate that performance does not depend on per-task tuning.
  Revision: yes
- Referee: The headline success rates (86% real-robot average, 100% on 7/8 Meta-world tasks) are given without error bars, number of random seeds, or statistical significance tests against the manual-reward baselines. This prevents assessment of whether the gains are reliable or attributable to the foundation priors rather than implementation details.
  Authors: We agree that statistical reporting is essential. The experiments used multiple seeds (5 for Meta-world, 3 for real robots), but these details and variance measures were omitted from the main text. In revision, we will update the Results section with error bars (standard deviation), explicit seed counts, and statistical significance tests (paired t-tests) against the manual-reward baselines to allow proper evaluation of reliability.
  Revision: yes
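The statistical reporting promised in the last response (standard deviation across seeds and a paired t-test against a baseline) can be sketched with the Python standard library alone. The function names are hypothetical, and the numbers in the example are made-up placeholders for illustration, not results from the paper.

```python
import math
import statistics


def mean_and_std(values):
    """Return the mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(method_scores, baseline_scores):
    """Paired t statistic for per-seed scores from two methods.

    Both sequences must be aligned by seed and contain at least two entries.
    """
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need two aligned sequences with at least two entries")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    spread = statistics.stdev(diffs)
    if spread == 0:
        raise ValueError("differences have zero variance; t statistic is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Made-up placeholder scores, one per seed (not results from the paper).
method_scores = [3.0, 5.0, 7.0]
baseline_scores = [1.0, 2.0, 3.0]

summary = mean_and_std(method_scores)
t_stat = paired_t_statistic(method_scores, baseline_scores)
```

The t statistic would then be compared against a t distribution with one fewer degree of freedom than the number of seeds; with only three to five seeds, the resulting test has limited power, which is worth stating alongside the error bars.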
Circularity Check
No significant circularity; empirical validation stands independently
The paper introduces the RLFP framework and FAC algorithm to leverage foundation model priors for RL in robotics. The central claims are supported by reported success rates on real dexterous tasks (86% average) and Meta-world tasks (100% in 7/8 under 100k frames). No equations or derivations are presented that reduce by construction to inputs or self-citations. The method is described as agnostic and robust, but validation relies on external performance metrics rather than internal self-definition or fitted predictions. This is a standard empirical contribution without load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: foundation models trained outside robotics supply useful policy, value, and success signals for manipulation tasks.
Forward citations
Cited by 3 Pith papers
- Beyond Pixels: Learning Invariant Rewards for Real-World Robotics From a Few Demonstrations
  A framework learns invariant symbolic reward functions from few demonstrations that generalize zero-shot to variations in robotic manipulation tasks.
- SLOPE: Optimistic Potential Landscape Shaping for Model-based Reinforcement Learning
  SLOPE improves MBRL in sparse reward settings by using optimistic distributional regression to build informative potential landscapes that provide better exploration gradients, outperforming baselines across 30+ tasks...
- Meta-Learning and Meta-Reinforcement Learning -- Tracing the Path towards DeepMind's Adaptive Agent
  A survey provides a task-based formalization of meta-learning and meta-RL while chronicling algorithms that lead to DeepMind's Adaptive Agent.
The total frames of the 8 tasks are 100k, except for the task bin-picking, which is set to 1M. Notably, we set the same camera view of all the tasks for consistency. On real robots, we set the ...
[Figure: bin-picking-v2 training curves; panels: actor_loss, critic_loss; x-axis: 0.0 to 2.0 (×1e5)]