Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Alan Ritter; Ethan Mendes; Hyungjoo Chae; Jay DeYoung; Jungsoo Park; Varsha Kishore; Wei Xu

arxiv: 2605.20740 · v1 · pith:NWOU6MHMnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI · cs.CL

Jungsoo Park, Hyungjoo Chae, Ethan Mendes, Jay DeYoung, Varsha Kishore, Wei Xu, Alan Ritter

Pith reviewed 2026-05-21 06:57 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords reinforcement learning · LLM regression · predictive distributions · CRPS · molecular property prediction · calibration · ranking correlation

The pith

Reinforcement learning that scores entire predictive distributions with CRPS improves LLM regression calibration and ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning and pointwise reinforcement learning for language models on regression tasks optimize individual decoded numbers against targets, which often produces poorly calibrated uncertainty estimates. The paper introduces a reward that treats multiple sampled outputs as an empirical predictive distribution, scores the set with the Continuous Ranked Probability Score, and gives each rollout leave-one-out credit for its marginal effect on overall distribution quality. This trains the model to generate predictions that are simultaneously accurate and appropriately dispersed. The resulting gains appear in stronger rank correlation on code performance tasks and competitive results on molecular properties using only SMILES strings.

Core claim

Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed.
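
A minimal Python sketch of the scoring step described above, under stated assumptions. It uses the standard closed form for the CRPS of an empirical distribution (mean absolute error minus half the mean pairwise absolute difference). The function names `empirical_crps` and `loo_rewards` are hypothetical, and the sign convention (credit = CRPS without rollout i minus CRPS with it, so higher is better) is an assumption; the paper's exact normalization and reward scaling are not given on this page.

```python
from typing import List


def empirical_crps(samples: List[float], y: float) -> float:
    """CRPS of the empirical distribution of `samples` against target `y`.

    Lower is better. The first term measures accuracy, the second rewards spread.
    """
    k = len(samples)
    accuracy = sum(abs(s - y) for s in samples) / k
    spread = sum(abs(a - b) for a in samples for b in samples) / (2 * k * k)
    return accuracy - spread


def loo_rewards(samples: List[float], y: float) -> List[float]:
    """Leave-one-out credit per rollout (assumed sign: higher is better).

    Rollout i is credited with how much the set's CRPS would worsen
    if that rollout were removed. Requires at least two samples.
    """
    full = empirical_crps(samples, y)
    rewards = []
    for i in range(len(samples)):
        rest = samples[:i] + samples[i + 1:]
        rewards.append(empirical_crps(rest, y) - full)
    return rewards


if __name__ == "__main__":
    # Toy values for illustration only, not data from the paper.
    print(loo_rewards([1.0, 3.0, 10.0], 2.0))
```

On the toy set `[1.0, 3.0, 10.0]` with target `2.0`, the two rollouts that bracket the target receive positive credit and the outlier at `10.0` receives negative credit. A fully collapsed set such as `[3.0, 3.0, 3.0]` earns zero credit for every rollout under this sketch, because removing any one sample leaves the empirical distribution unchanged.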

What carries the argument

Distribution-Aware Reward, which evaluates on-policy rollouts via CRPS and marginal leave-one-out contributions to shape policy gradients toward better predictive distributions.

If this is right

Strong rank-correlation gains, including a 6-point Spearman improvement on KBSS.
Competitive performance on MoleculeNet using only SMILES strings against graph-based and 3D models.
Mitigation of rollout diversity collapse during training.
Improved uncertainty diagnostics across the evaluated tasks.
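
For readers who want to check a rank-correlation figure like the one listed above, here is a small pure-Python sketch of Spearman's correlation (Pearson correlation of average ranks, with ties sharing a rank). Reducing each example's rollouts to a point prediction by taking their mean is an assumption made for illustration; the page does not say how the paper aggregates rollouts at evaluation time. All function names are hypothetical.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(pred: Sequence[float], target: Sequence[float]) -> float:
    return pearson(average_ranks(pred), average_ranks(target))


def point_predictions(rollouts: Sequence[Sequence[float]]) -> List[float]:
    """Assumed aggregation: the mean of each example's rollouts."""
    return [sum(r) / len(r) for r in rollouts]
```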

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same marginal-contribution approach could be tested on other sequence regression problems such as time-series forecasting from text.
Direct optimization of distribution quality may reduce reliance on separate post-training calibration steps for LLM outputs.
Extending the method to larger models would show whether the calibration benefits persist at scale.

Load-bearing premise

Leave-one-out credit assignment based on each rollout's marginal contribution to CRPS produces stable policy gradients and genuinely better predictive distributions without bias from the on-policy sampling process or the choice of number of rollouts.

What would settle it

A direct comparison showing that varying the number of rollouts or switching to off-policy sampling reverses the ranking of methods on CRPS or calibration metrics would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.20740 by Alan Ritter, Ethan Mendes, Hyungjoo Chae, Jay DeYoung, Jungsoo Park, Varsha Kishore, Wei Xu.

**Figure 1.** Pointwise versus distribution-aware reward shaping. Pointwise MSE reward scores each rollout independently, encouraging predictions to collapse toward the mean. DAR instead scores each rollout by its leave-one-out contribution to the full predictive distribution, rewarding both accuracy and useful spread around the ground truth.

**Figure 2.** Synthetic distributional regression. Mean predictions and ±1σ predictive spread on the 1D Gaussian mixture task. Dashed vertical lines at x = ±6 mark the boundary between interpolation and extrapolation regions.

**Figure 3.** Distribution-level calibration and uncertainty diagnostics. We compare predictive distributions from SFT, MSE reward shaping, and DAR. The plots show normalized predictive standard deviation versus normalized prediction error on a log-log scale, along with fitted trends and the ideal diagonal. DAR achieves stronger standard deviation-error alignment, indicating more informative rollout dispersion and bette…

**Figure 4.** Training dynamics during RL under DAR (ours) and MSE reward shaping. Bracket rate measures the fraction of examples whose rollout set contains predictions on both sides of the ground truth, while entropy measures prediction diversity.

**Figure 5.** Illustration of data samples.
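
The bracket rate defined in the Figure 4 caption can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the authors' code: the function names are hypothetical, and treating a rollout that exactly equals the ground truth as not bracketing (strict inequalities) is an assumption.

```python
from typing import Sequence


def brackets(rollouts: Sequence[float], y: float) -> bool:
    """True if the rollout set has predictions strictly below and strictly above y."""
    return min(rollouts) < y < max(rollouts)


def bracket_rate(batch: Sequence[Sequence[float]], targets: Sequence[float]) -> float:
    """Fraction of examples whose rollout set brackets the ground truth."""
    hits = sum(brackets(r, y) for r, y in zip(batch, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Toy values for illustration only, not data from the paper.
    toy_batch = [[1.0, 3.0], [4.0, 5.0], [2.0, 2.0]]
    toy_targets = [2.0, 3.0, 2.0]
    print(bracket_rate(toy_batch, toy_targets))
```

A rollout set that has collapsed to a single value can never bracket the target, which is why this quantity is a natural companion to the diversity measure in the same figure.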

Original abstract

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's main contribution is a new on-policy RL objective that scores empirical predictive distributions with CRPS and uses leave-one-out marginal credit to train LLMs for better-calibrated regression outputs rather than point estimates.

The central idea here is to treat multiple model rollouts as an empirical distribution and optimize directly for its quality using CRPS, with each rollout getting credit for its marginal effect on the score. This is a clear step past standard pointwise RL or supervised fine-tuning for regression tasks where spread and ranking matter as much as the mean prediction.

They show this on a synthetic Gaussian mixture, code performance prediction, and MoleculeNet properties from SMILES strings, reporting better rank correlation including a 6-point Spearman lift on KBSS and results competitive with graph or 3D models despite using only text input. The approach also appears to reduce diversity collapse in rollouts and improve some uncertainty checks. That part is useful and worth noting for anyone doing LLM-based scientific prediction or optimization.

The main soft spot is the leave-one-out credit mechanism itself. With modest rollout counts, which are common for cost reasons, the marginal differences can be high-variance or sensitive to correlation among samples, especially early in training. The abstract gives no ablations on rollout number, no variance numbers for the credit signal, and no comparison to lower-variance alternatives like common-random-number CRPS. Without those controls it is difficult to know whether the reported gains come from genuinely better distributions or from the particular heuristic interacting with the sampling process. The full paper may address this, but based on what is shown, the experimental support is thinner than the claim requires.

This work is for people already working on distributional regression or uncertainty-aware RL with language models. A reader focused on calibration or ranking in molecular or code tasks would find the objective worth trying, even if they end up modifying the credit assignment. It is coherent on its own terms and engages the right prior literature, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Distribution-Aware Reward, an on-policy RL objective for training LLMs on regression tasks to produce better predictive distributions rather than just point estimates. It forms empirical distributions from multiple decoded samples, scores them with CRPS, and uses leave-one-out marginal contributions for credit assignment to reward accurate and well-dispersed predictions. Evaluations on a Gaussian-mixture task, code performance prediction (KBSS), and molecular property prediction from SMILES show improvements over SFT and pointwise RL baselines, including a 6-point Spearman rank correlation gain on KBSS, and competitive performance on MoleculeNet.

Significance. If the reported gains hold under scrutiny of the credit assignment mechanism, this work could meaningfully advance LLM regression by shifting focus to distributional calibration and ranking, which is valuable for uncertainty estimation and candidate selection in applications like molecular design and code optimization. The competitive results using only SMILES strings against graph/3D models highlight the potential of text-based approaches when properly optimized for distributions.

major comments (2)

[§4] Experimental results: The reported 6-point Spearman improvement on KBSS and competitive MoleculeNet results lack error bars, statistical significance tests, rollout count K, or ablations on the leave-one-out credit assignment. This is load-bearing for the central claim that the method produces genuinely better predictive distributions.
[§3] Method, leave-one-out CRPS reward: The marginal credit (CRPS_full - CRPS_{-i}) may introduce bias or high variance in on-policy gradients for modest K or correlated early-training rollouts. No comparison to unbiased alternatives (e.g., common-random-number CRPS) or direct distribution-matching losses is provided, leaving open whether gains reflect improved distributions or sampling artifacts.
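
For reference, the quantities named in the second comment can be written out as follows. This is the standard CRPS of an empirical distribution built from K samples, paired with the marginal credit as the report writes it; the symbols \hat y_i (rollout values), \hat F_K (their empirical distribution), and \Delta_i are notation introduced here, and the paper's own normalization is not shown on this page.

```latex
\[
\mathrm{CRPS}(\hat F_K, y)
  = \frac{1}{K}\sum_{i=1}^{K} \lvert \hat y_i - y \rvert
  - \frac{1}{2K^{2}} \sum_{i=1}^{K}\sum_{j=1}^{K} \lvert \hat y_i - \hat y_j \rvert ,
\qquad
\Delta_i = \mathrm{CRPS}_{\mathrm{full}} - \mathrm{CRPS}_{-i} ,
\]
% CRPS_full is the expression on the left over all K rollouts;
% CRPS_{-i} is the same expression over the K-1 rollouts excluding \hat y_i.
```

Because lower CRPS is better, a negative \Delta_i means rollout i improved the set. The 1/(2K^2) spread term gives the exact CRPS of the empirical distribution, but it is a biased estimate of the CRPS of the underlying sampling distribution; the so-called fair variant replaces it with 1/(2K(K-1)). The gap between the two shrinks as K grows, which is one concrete way the choice of K can interact with the credit signal.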

minor comments (2)

[Abstract] The abstract references 'further analyses' on rollout diversity collapse and uncertainty diagnostics without citing specific figures, tables, or sections.
[Notation] The precise construction of the empirical predictive distribution from K samples and the CRPS implementation details could be clarified for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects of our experimental validation and methodological choices. We address each major comment below and will incorporate revisions to strengthen the manuscript where appropriate.

Referee: [§4] Experimental results: The reported 6-point Spearman improvement on KBSS and competitive MoleculeNet results lack error bars, statistical significance tests, rollout count K, or ablations on the leave-one-out credit assignment. This is load-bearing for the central claim that the method produces genuinely better predictive distributions.

Authors: We agree that error bars, statistical significance, explicit reporting of K, and ablations on the credit assignment mechanism would strengthen the evidence. In the revised manuscript we will add error bars computed over five independent runs with different random seeds for the KBSS and MoleculeNet results, include paired t-test p-values for the key comparisons, and explicitly state that K=10 was used throughout. We will also add an ablation in the appendix that replaces the leave-one-out marginal reward with a simple mean-CRPS reward; the results show that the marginal formulation contributes measurably to the observed rank-correlation gains, supporting the central claim.

Revision: yes

Referee: [§3] Method, leave-one-out CRPS reward: The marginal credit (CRPS_full - CRPS_{-i}) may introduce bias or high variance in on-policy gradients for modest K or correlated early-training rollouts. No comparison to unbiased alternatives (e.g., common-random-number CRPS) or direct distribution-matching losses is provided, leaving open whether gains reflect improved distributions or sampling artifacts.

Authors: We acknowledge that the leave-one-out estimator can exhibit higher variance when K is modest or when early-training rollouts are highly correlated. Nevertheless, the consistent improvements across three distinct tasks and the mitigation of diversity collapse reported in our analyses indicate that any such variance does not negate the distributional benefits. We will expand §3 with a short discussion of the estimator's bias-variance properties and its computational advantage in the on-policy setting. A direct comparison to common-random-number CRPS or to a distribution-matching loss is not present in the current submission; we will add a brief qualitative comparison and note that a quantitative study is planned for follow-up work, but the existing empirical evidence still supports that the gains arise from optimizing predictive distributions rather than sampling artifacts alone.

Revision: partial
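
The seed-level reporting promised in the first response (error bars over independent runs and a paired t-test) can be sketched with the Python standard library alone. This is a generic illustration, not the authors' evaluation code: the function names are hypothetical and the numbers in the demo are toy values, not results from the paper.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_stderr(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean across seeds (sample std, n-1 denominator)."""
    return statistics.mean(scores), statistics.stdev(scores) / math.sqrt(len(scores))


def paired_t_statistic(method: Sequence[float], baseline: Sequence[float]) -> float:
    """Paired t statistic on per-seed differences; degrees of freedom = len - 1.

    Assumes both lists are aligned by seed and the differences are not all identical.
    """
    diffs = [m - b for m, b in zip(method, baseline)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Toy per-seed scores for illustration only.
    toy_method = [1.0, 2.0, 3.0, 4.0, 5.0]
    toy_baseline = [0.0, 1.5, 2.0, 3.5, 4.0]
    print(mean_and_stderr(toy_method))
    print(paired_t_statistic(toy_method, toy_baseline))
```

To turn the statistic into a p-value, compare it against a Student-t distribution with n-1 degrees of freedom; where SciPy is available, `scipy.stats.ttest_rel` computes both in one call.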

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

The paper introduces Distribution-Aware Reward as an explicitly defined on-policy RL objective: multiple decoded samples form an empirical predictive distribution, CRPS (a standard external scoring rule) is computed on the full set, and each rollout receives a reward equal to its leave-one-out marginal contribution (CRPS_full - CRPS_{-i}). This construction is a deliberate credit-assignment heuristic, not a derivation that reduces to its own inputs by construction or renames a fitted parameter as a prediction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core mechanism; the central claims rest on empirical comparisons (Spearman gains, MoleculeNet competitiveness) rather than mathematical identities. The method is self-contained against external benchmarks like CRPS and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on treating finite rollouts as a sufficient empirical distribution and on the effectiveness of marginal CRPS credit assignment; no explicit free parameters or new physical entities are introduced beyond the RL objective itself.

axioms (1)

domain assumption: A modest number of decoded rollouts forms a representative empirical predictive distribution for CRPS computation.
Invoked when the method evaluates multiple samples as the predictive distribution in the abstract description of the reward.

invented entities (1)

Distribution-Aware Reward (no independent evidence)
purpose: New on-policy RL objective that rewards marginal improvements to predictive distribution quality via CRPS
Introduced as the main contribution; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5780 in / 1314 out tokens · 37207 ms · 2026-05-21T06:57:20.616066+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: CRPS is a strictly proper scoring rule for univariate predictive distributions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 15 internal anchors

[1]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[2]

2025 , howpublished =

work page 2025
[3]

Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

LLMs Know More About Numbers than They Can Say , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

work page
[4]

GPT-4 Technical Report

Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Qwen3 Technical Report

Qwen3 Technical Report , author =. arXiv preprint arXiv:2505.09388 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Qwen2.5 Technical Report

Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =. doi:10.48550/arXiv.2412.15115 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115
[7]

Qwen2.5-Coder Technical Report

Qwen2.5-Coder Technical Report , author =. arXiv preprint arXiv:2409.12186 , year =. doi:10.48550/arXiv.2409.12186 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.12186
[8]

Qwen2 Technical Report

Qwen2 Technical Report , author =. arXiv preprint arXiv:2407.10671 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[9]

The Llama 3 Herd of Models

The Llama 3 Herd of Models , author =. arXiv preprint arXiv:2407.21783 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[10]

arXiv preprint arXiv:2402.06852 , year =

ChemLLM: A Chemical Large Language Model , author =. arXiv preprint arXiv:2402.06852 , year =

work page arXiv
[11]

arXiv preprint arXiv:2306.08018 , year=

Mol-instructions: A large-scale biomolecular instruction dataset for large language models , author=. arXiv preprint arXiv:2306.08018 , year=

work page arXiv
[12]

arXiv preprint arXiv:2402.09391 , year=

Llasmol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset , author=. arXiv preprint arXiv:2402.09391 , year=

work page arXiv
[13]

arXiv preprint arXiv:2406.11704 , year =

Nemotron-4 340B Technical Report , author =. arXiv preprint arXiv:2406.11704 , year =. doi:10.48550/arXiv.2406.11704 , url =

work page doi:10.48550/arxiv.2406.11704
[14]

SCILITLLM: HOW TO ADAPT LLMS FOR SCIENTIFIC LITERATURE UNDERSTANDING , author=

work page
[15]

arXiv preprint arXiv:2402.14547 , year =

Omnipred: Language Models as Universal Regressors , author =. arXiv preprint arXiv:2402.14547 , year =

work page arXiv
[16]

CoRR , year =

From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples , author =. CoRR , year =

work page
[17]

arXiv preprint arXiv:2410.10190 , year =

Predicting from Strings: Language Model Embeddings for Bayesian Optimization , author =. arXiv preprint arXiv:2410.10190 , year =

work page arXiv
[18]

Advances in Neural Information Processing Systems (NeurIPS) , year =

LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[19]

arXiv preprint arXiv:2411.14708 , year =

Understanding LLM Embeddings for Regression , author =. arXiv preprint arXiv:2411.14708 , year =

work page arXiv
[20]

arXiv preprint arXiv:2509.26476 , year =

Regression Language Models for Code , author =. arXiv preprint arXiv:2509.26476 , year =

work page arXiv
[21]

arXiv preprint arXiv:2509.20645 , year =

Anticipatory Evaluation of Language Models , author =. arXiv preprint arXiv:2509.20645 , year =. doi:10.48550/arXiv.2509.20645 , url =

work page doi:10.48550/arxiv.2509.20645
[22]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Can llms help uncover insights about llms? a large-scale, evolving literature analysis of frontier llms , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

work page
[23]

Reasoning-Intensive Regression

Reasoning-Intensive Regression , author =. arXiv preprint arXiv:2508.21762 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Advances in Neural Information Processing Systems (NeurIPS) , year =

Approaching Human-Level Forecasting with Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

work page
[25]

International Conference on Learning Representations (ICLR) , year =

ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities , author =. International Conference on Learning Representations (ICLR) , year =

work page
[26]

Current Directions in Psychological Science , year =

Forecasting tournaments: Tools for increasing transparency and improving the quality of debate , author =. Current Directions in Psychological Science , year =

work page
[27]

Proximal Policy Optimization Algorithms

Proximal Policy Optimization Algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Journal of the American Statistical Association , volume =

Strictly Proper Scoring Rules, Prediction, and Estimation , author =. Journal of the American Statistical Association , volume =. 2007 , publisher =

work page 2007
[29]

Management Science , volume =

Scoring Rules for Continuous Probability Distributions , author =. Management Science , volume =. 1976 , publisher =

work page 1976
[30]

Journal of the American Statistical Association , volume =

Making and Evaluating Point Forecasts , author =. Journal of the American Statistical Association , volume =. 2011 , doi =

work page 2011
[31]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. arXiv preprint arXiv:2402.03300 , year =. doi:10.48550/arXiv.2402.03300 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.03300
[32]

arXiv preprint arXiv:2511.08616 , year =

Reasoning on Time-Series for Financial Technical Analysis , author =. arXiv preprint arXiv:2511.08616 , year =

work page arXiv
[33]

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs , author =. arXiv preprint arXiv:2506.10630 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Training language models to follow instructions with human feedback

Training Language Models to Follow Instructions with Human Feedback , author =. arXiv preprint arXiv:2203.02155 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[35]

International Conference on Learning Representations (ICLR) , year =

FlowRL: Matching Reward Distributions for LLM Reasoning , author =. International Conference on Learning Representations (ICLR) , year =

work page
[36]

arXiv preprint arXiv:2505.17989 , year =

Outcome-based Reinforcement Learning to Predict the Future , author =. arXiv preprint arXiv:2505.17989 , year =

work page arXiv
[37]

arXiv preprint arXiv:2512.25070 , year =

Scaling Open-Ended Reasoning to Predict the Future , author =. arXiv preprint arXiv:2512.25070 , year =

work page arXiv
[38]

International Conference on Learning Representations (ICLR) , year =

Beyond Binary Rewards: Training LMs to Reason About Uncertainty , author =. International Conference on Learning Representations (ICLR) , year =

work page
[39]

HybridFlow: A Flexible and Efficient RLHF Framework

HybridFlow: A Flexible and Efficient RLHF Framework , author =. arXiv preprint arXiv:2409.19256 , year =. doi:10.48550/arXiv.2409.19256 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2409.19256
[40]

Can LLM s Help Uncover Insights about LLM s? A Large-Scale, Evolving Literature Analysis of Frontier LLM s

Park, Jungsoo and Kang, Junmo and Stanovsky, Gabriel and Ritter, Alan. Can LLM s Help Uncover Insights about LLM s? A Large-Scale, Evolving Literature Analysis of Frontier LLM s. Association for Computational Linguistics. 2025

work page 2025
[41]

Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year =

HelpSteer2: Open-source dataset for training top-performing reward models , author =. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year =

work page
[42]

Findings of the Association for Computational Linguistics: NAACL 2025 , year =

RewardBench: Evaluating Reward Models for Language Modeling , author =. Findings of the Association for Computational Linguistics: NAACL 2025 , year =. doi:10.18653/v1/2025.findings-naacl.96 , url =

work page doi:10.18653/v1/2025.findings-naacl.96 2025
[43]

Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

Measuring Coding Challenge Competence With APPS , author =. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

work page
[44]

Chemical Science , volume=

MoleculeNet: a benchmark for molecular machine learning , author=. Chemical Science , volume=. 2018 , publisher=

work page 2018
[45]

Journal of Chemical Information and Modeling , volume=

Analyzing Learned Molecular Representations for Property Prediction , author=. Journal of Chemical Information and Modeling , volume=. 2019 , doi=

work page 2019
[46]

Journal of Medicinal Chemistry , volume=

Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism , author=. Journal of Medicinal Chemistry , volume=. 2020 , doi=

work page 2020
[47]

International Conference on Learning Representations (ICLR) , year =

Uni-Mol: A Universal 3D Molecular Representation Learning Framework , author =. International Conference on Learning Representations (ICLR) , year =

work page
[48]

Findings of the Association for Computational Linguistics: EMNLP 2024 , year =

Regression Aware Inference with LLMs , author =. Findings of the Association for Computational Linguistics: EMNLP 2024 , year =

work page 2024
[49]

International Conference on Learning Representations (ICLR) , year =

Better Autoregressive Regression with LLMs via Regression-Aware Fine-Tuning , author =. International Conference on Learning Representations (ICLR) , year =

work page
[50]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

work page
[51]

arXiv preprint arXiv:2506.21718 , year =

Performance Prediction for Large Systems via Text-to-Text Regression , author =. arXiv preprint arXiv:2506.21718 , year =

work page arXiv
[52]

npj Computational Materials , volume =

Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm , author =. npj Computational Materials , volume =. 2020 , url =

work page 2020
[53]

NeurIPS Datasets and Benchmarks Track , year =

Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development , author =. NeurIPS Datasets and Benchmarks Track , year =

work page
[54]

Evaluating Protein Transfer Learning with TAPE

Evaluating Protein Transfer Learning with TAPE , author =. arXiv preprint arXiv:1906.08230 , year =

work page internal anchor Pith review Pith/arXiv arXiv 1906
[55]

NeurIPS Datasets and Benchmarks Track , year =

FLIP: Benchmark Tasks in Fitness Landscape Inference for Proteins , author =. NeurIPS Datasets and Benchmarks Track , year =

work page
[56]

Journal of Advances in Modeling Earth Systems , volume =

WeatherBench: A Benchmark Dataset for Data-Driven Weather Forecasting , author =. Journal of Advances in Modeling Earth Systems , volume =. 2020 , url =

work page 2020
[57]

arXiv preprint arXiv:2104.10066 , year =

EarthNet2021: A Large-Scale Dataset and Challenge for Earth Surface Forecasting as a Guided Video Prediction Task , author =. arXiv preprint arXiv:2104.10066 , year =

work page arXiv
[58]

Monthly Notices of the Royal Astronomical Society , year =

AstroCLIP: A Cross-Modal Foundation Model for Galaxies , author =. Monthly Notices of the Royal Astronomical Society , year =

work page
[59]

Advances in Neural Information Processing Systems , volume =

A Benchmark for Prediction of Transcriptomic Responses to Chemical Perturbations Across Cell Types , author =. Advances in Neural Information Processing Systems , volume =. 2024 , url =

work page 2024
[60]

On the Opportunities and Risks of Foundation Models

On the Opportunities and Risks of Foundation Models , author =. arXiv preprint arXiv:2108.07258 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[61]

Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning. arXiv preprint arXiv:2512.06533.
[62]

Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models. arXiv preprint arXiv:2603.24844.
[63]

Quantitative LLM judges. arXiv preprint arXiv:2506.02945.
[64]

Tang, Eric and Yang, Bangding and Song, Xingyou. Understanding. 2025.
[65]

Lukasik, Michal and Meng, Zhao and Narasimhan, Harikrishna and Chang, Yin-Wen and Menon, Aditya Krishna and Yu, Felix and Kumar, Sanjiv. Better Autoregressive Regression with. 2025.
[66]

Chiang, Cheng-Han and Lee, Hung-yi and Lukasik, Michal. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[67]

Learning to summarize with human feedback. Advances in Neural Information Processing Systems.
[68]

Gaussian Processes for Machine Learning. 2006.
[69]

Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. Advances in Neural Information Processing Systems, 2017.
[70]

What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? Advances in Neural Information Processing Systems, 2017.
[71]

Uncertainty Quantification for Regression using Proper Scoring Rules. arXiv preprint arXiv:2509.26610.
[72]

Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. International Conference on Machine Learning, 2016.
[73]

Entropy-preserving reinforcement learning. arXiv preprint arXiv:2603.11682.
[74]

Rewarding the unlikely: Lifting GRPO beyond distribution sharpening. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
[75]

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
[76]

Buy 4 REINFORCE Samples, Get a Baseline for Free! Deep Reinforcement Learning Meets Structured Prediction Workshop at ICLR.
[77]

Evaluating epidemic forecasts in an interval format. PLOS Computational Biology, 2021.
[78]

Evaluating and Calibrating Uncertainty Prediction in Regression Tasks. Sensors, 2022.
[79]

Chai, Tianfeng and Draxler, Roland R. Root mean square error (. 2014.
[80]

Learning to Rank Using Gradient Descent. Proceedings of the 22nd International Conference on Machine Learning.
