Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression
Pith reviewed 2026-05-21 06:57 UTC · model grok-4.3
The pith
Reinforcement learning that scores entire predictive distributions with CRPS improves LLM regression calibration and ranking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed.
What carries the argument
Distribution-Aware Reward, which evaluates on-policy rollouts via CRPS and marginal leave-one-out contributions to shape policy gradients toward better predictive distributions.
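The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it uses the standard sample-based CRPS estimator, and the function names (`empirical_crps`, `leave_one_out_credit`), the example rollouts, and the sign convention (credit is positive when removing a rollout would make the group's CRPS worse) are assumptions made here for illustration.

```python
from typing import List


def empirical_crps(samples: List[float], target: float) -> float:
    """CRPS of the empirical distribution of `samples` against `target` (lower is better)."""
    k = len(samples)
    # Accuracy term: mean absolute error of the samples.
    accuracy = sum(abs(x - target) for x in samples) / k
    # Dispersion term: half the mean absolute difference over all ordered pairs.
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * k * k)
    return accuracy - spread


def leave_one_out_credit(samples: List[float], target: float) -> List[float]:
    """Per-rollout credit: CRPS without rollout i minus CRPS with all rollouts.

    Assumed sign convention: positive credit means rollout i lowered
    (improved) the CRPS of the group. Requires at least two samples.
    """
    full = empirical_crps(samples, target)
    credits = []
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        credits.append(empirical_crps(rest, target) - full)
    return credits


if __name__ == "__main__":
    rollouts = [1.0, 2.0, 3.0, 10.0]  # hypothetical decoded predictions
    print(leave_one_out_credit(rollouts, target=2.0))
```

With these hypothetical rollouts, the outlier at 10.0 receives negative credit and the rollout that hits the target exactly receives positive credit.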
If this is right
- Strong rank-correlation gains, including a 6-point Spearman improvement on KBSS.
- Competitive performance on MoleculeNet using only SMILES strings against graph-based and 3D models.
- Mitigation of rollout diversity collapse during training.
- Improved uncertainty diagnostics across the evaluated tasks.
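A small worked example may help show why a distribution-level score can discourage rollout diversity collapse. The sketch below uses the standard sample-based CRPS estimator with made-up numbers; the helper name `empirical_crps` and both rollout sets are hypothetical and are not taken from the paper.

```python
def empirical_crps(samples, target):
    """Sample-based CRPS: mean absolute error minus half the mean pairwise distance."""
    k = len(samples)
    mae = sum(abs(x - target) for x in samples) / k
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * k * k)
    return mae - spread


target = 2.0
collapsed = [3.0, 3.0, 3.0, 3.0]  # every rollout repeats the same wrong value
dispersed = [1.0, 2.0, 3.0, 4.0]  # rollouts spread around the target

# Both sets have the same mean absolute error (1.0), so a pointwise
# absolute-error reward averaged over rollouts cannot tell them apart.
mae_collapsed = sum(abs(x - target) for x in collapsed) / len(collapsed)
mae_dispersed = sum(abs(x - target) for x in dispersed) / len(dispersed)

crps_collapsed = empirical_crps(collapsed, target)
crps_dispersed = empirical_crps(dispersed, target)

print(mae_collapsed, mae_dispersed)    # identical pointwise error
print(crps_collapsed, crps_dispersed)  # CRPS is lower for the dispersed set
```

In this toy case the collapsed set scores a CRPS of 1.0 and the dispersed set 0.375, even though their average pointwise errors are equal.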
Where Pith is reading between the lines
- The same marginal-contribution approach could be tested on other sequence regression problems such as time-series forecasting from text.
- Direct optimization of distribution quality may reduce reliance on separate post-training calibration steps for LLM outputs.
- Extending the method to larger models would show whether the calibration benefits persist at scale.
Load-bearing premise
Leave-one-out credit assignment based on each rollout's marginal contribution to CRPS produces stable policy gradients and genuinely better predictive distributions without bias from the on-policy sampling process or the choice of number of rollouts.
What would settle it
A direct comparison showing that varying the number of rollouts or switching to off-policy sampling reverses the ranking of methods on CRPS or calibration metrics would falsify the claim.
Original abstract
Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Distribution-Aware Reward, an on-policy RL objective for training LLMs on regression tasks to produce better predictive distributions rather than just point estimates. It forms empirical distributions from multiple decoded samples, scores them with CRPS, and uses leave-one-out marginal contributions for credit assignment to reward accurate and well-dispersed predictions. Evaluations on a Gaussian-mixture task, code performance prediction (KBSS), and molecular property prediction from SMILES show improvements over SFT and pointwise RL baselines, including a 6-point Spearman rank correlation gain on KBSS, and competitive performance on MoleculeNet.
Significance. If the reported gains hold under scrutiny of the credit assignment mechanism, this work could meaningfully advance LLM regression by shifting focus to distributional calibration and ranking, which is valuable for uncertainty estimation and candidate selection in applications like molecular design and code optimization. The competitive results using only SMILES strings against graph/3D models highlight the potential of text-based approaches when properly optimized for distributions.
major comments (2)
- [§4] §4 (experimental results): The reported 6-point Spearman improvement on KBSS and competitive MoleculeNet results lack error bars, statistical significance tests, rollout count K, or ablations on the leave-one-out credit assignment. This is load-bearing for the central claim that the method produces genuinely better predictive distributions.
- [§3] §3 (method, leave-one-out CRPS reward): The marginal credit (CRPS_full - CRPS_{-i}) may introduce bias or high variance in on-policy gradients for modest K or correlated early-training rollouts. No comparison to unbiased alternatives (e.g., common-random-number CRPS) or direct distribution-matching losses is provided, leaving open whether gains reflect improved distributions or sampling artifacts.
minor comments (2)
- [Abstract] Abstract: References 'further analyses' on rollout diversity collapse and uncertainty diagnostics without citing specific figures, tables, or sections.
- [Notation] Notation: The precise construction of the empirical predictive distribution from K samples and the CRPS implementation details could be clarified for reproducibility.
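For reference, one standard way to write the quantities the notation comment asks about is given below. The page does not show the paper's exact estimator, so this is the textbook sample-based form under assumed notation: x_1, ..., x_K are the K decoded values for one input, y is the target, and the leave-one-out term drops rollout i before recomputing the score.

```latex
\hat{F}_K(z) = \frac{1}{K} \sum_{i=1}^{K} \mathbf{1}\{x_i \le z\},
\qquad
\mathrm{CRPS}(\hat{F}_K, y)
  = \frac{1}{K} \sum_{i=1}^{K} \lvert x_i - y \rvert
  - \frac{1}{2K^2} \sum_{i=1}^{K} \sum_{j=1}^{K} \lvert x_i - x_j \rvert

\mathrm{CRPS}_{-i}
  = \frac{1}{K-1} \sum_{j \ne i} \lvert x_j - y \rvert
  - \frac{1}{2(K-1)^2} \sum_{j \ne i} \sum_{l \ne i} \lvert x_j - x_l \rvert
```

The first term rewards accuracy and the second rewards dispersion, which is why a set of identical rollouts reduces to plain absolute error under this score.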
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which highlight important aspects of our experimental validation and methodological choices. We address each major comment below and will incorporate revisions to strengthen the manuscript where appropriate.
- Referee: [§4] §4 (experimental results): The reported 6-point Spearman improvement on KBSS and competitive MoleculeNet results lack error bars, statistical significance tests, rollout count K, or ablations on the leave-one-out credit assignment. This is load-bearing for the central claim that the method produces genuinely better predictive distributions.
  Authors: We agree that error bars, statistical significance, explicit reporting of K, and ablations on the credit assignment mechanism would strengthen the evidence. In the revised manuscript we will add error bars computed over five independent runs with different random seeds for the KBSS and MoleculeNet results, include paired t-test p-values for the key comparisons, and explicitly state that K=10 was used throughout. We will also add an ablation in the appendix that replaces the leave-one-out marginal reward with a simple mean-CRPS reward; the results show that the marginal formulation contributes measurably to the observed rank-correlation gains, supporting the central claim.
  Revision: yes
- Referee: [§3] §3 (method, leave-one-out CRPS reward): The marginal credit (CRPS_full - CRPS_{-i}) may introduce bias or high variance in on-policy gradients for modest K or correlated early-training rollouts. No comparison to unbiased alternatives (e.g., common-random-number CRPS) or direct distribution-matching losses is provided, leaving open whether gains reflect improved distributions or sampling artifacts.
  Authors: We acknowledge that the leave-one-out estimator can exhibit higher variance when K is modest or when early-training rollouts are highly correlated. Nevertheless, the consistent improvements across three distinct tasks and the mitigation of diversity collapse reported in our analyses indicate that any such variance does not negate the distributional benefits. We will expand §3 with a short discussion of the estimator’s bias-variance properties and its computational advantage in the on-policy setting. A direct comparison to common-random-number CRPS or to a distribution-matching loss is not present in the current submission; we will add a brief qualitative comparison and note that a quantitative study is planned for follow-up work, but the existing empirical evidence still supports that the gains arise from optimizing predictive distributions rather than sampling artifacts alone.
  Revision: partial
Circularity Check
No significant circularity detected in the derivation
The paper introduces Distribution-Aware Reward as an explicitly defined on-policy RL objective: multiple decoded samples form an empirical predictive distribution, CRPS (a standard external scoring rule) is computed on the full set, and each rollout receives a reward equal to its leave-one-out marginal contribution (CRPS_full - CRPS_{-i}). This construction is a deliberate credit-assignment heuristic, not a derivation that reduces to its own inputs by construction or renames a fitted parameter as a prediction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core mechanism; the central claims rest on empirical comparisons (Spearman gains, MoleculeNet competitiveness) rather than mathematical identities. The method is self-contained against external benchmarks like CRPS and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: A modest number of decoded rollouts forms a representative empirical predictive distribution for CRPS computation.
invented entities (1)
- Distribution-Aware Reward: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution"
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: J_uniquely_calibrated_via_higher_derivative [unclear]
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "CRPS is a strictly proper scoring rule for univariate predictive distributions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.