pith. sign in

arxiv: 2605.20740 · v1 · pith:NWOU6MHMnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI· cs.CL

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

Pith reviewed 2026-05-21 06:57 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords reinforcement learningLLM regressionpredictive distributionsCRPSmolecular property predictioncalibrationranking correlation
0
0 comments X

The pith

Reinforcement learning that scores entire predictive distributions with CRPS improves LLM regression calibration and ranking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard fine-tuning and pointwise reinforcement learning for language models on regression tasks optimize individual decoded numbers against targets, which often produces poorly calibrated uncertainty estimates. The paper introduces a reward that treats multiple sampled outputs as an empirical predictive distribution, scores the set with the Continuous Ranked Probability Score, and gives each rollout leave-one-out credit for its marginal effect on overall distribution quality. This trains the model to generate predictions that are simultaneously accurate and appropriately dispersed. The resulting gains appear in stronger rank correlation on code performance tasks and competitive results on molecular properties using only SMILES strings.

Core claim

Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed.

What carries the argument

Distribution-Aware Reward, which evaluates on-policy rollouts via CRPS and marginal leave-one-out contributions to shape policy gradients toward better predictive distributions.

If this is right

  • Strong rank-correlation gains, including a 6-point Spearman improvement on KBSS.
  • Competitive performance on MoleculeNet using only SMILES strings against graph-based and 3D models.
  • Mitigation of rollout diversity collapse during training.
  • Improved uncertainty diagnostics across the evaluated tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same marginal-contribution approach could be tested on other sequence regression problems such as time-series forecasting from text.
  • Direct optimization of distribution quality may reduce reliance on separate post-training calibration steps for LLM outputs.
  • Extending the method to larger models would show whether the calibration benefits persist at scale.

Load-bearing premise

Leave-one-out credit assignment based on each rollout's marginal contribution to CRPS produces stable policy gradients and genuinely better predictive distributions without bias from the on-policy sampling process or the choice of number of rollouts.

What would settle it

A direct comparison showing that varying the number of rollouts or switching to off-policy sampling reverses the ranking of methods on CRPS or calibration metrics would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.20740 by Alan Ritter, Ethan Mendes, Hyungjoo Chae, Jay DeYoung, Jungsoo Park, Varsha Kishore, Wei Xu.

Figure 1
Figure 1. Figure 1: Pointwise versus distribution-aware reward shaping. Pointwise MSE reward scores each rollout independently, encouraging predictions to collapse toward the mean. DAR instead scores each rollout by its leave-one-out contribution to the full predictive distribution, rewarding both accuracy and useful spread around the ground truth. samples from a shared predictive distribution ( [PITH_FULL_IMAGE:figures/full… view at source ↗
Figure 2
Figure 2. Figure 2: Synthetic distributional regression. Mean predictions and ±1σ predictive spread on the 1D Gaussian mixture task. Dashed vertical lines at x = ±6 mark the boundary between interpolation and extrapolation regions. function. Qualitatively, it also follows the target structure more closely, especially in the extrapolation region (6, 10). In contrast, SFT and MSE reward shaping show larger deviations, and all n… view at source ↗
Figure 3
Figure 3. Figure 3: Distribution-level calibration and uncertainty diagnostics. We compare predictive distributions from SFT, MSE reward shaping, and DAR. The plots show normalized predictive standard deviation versus normalized prediction error on a log-log scale, along with fitted trends and the ideal diagonal. DAR achieves stronger standard deviation-error alignment, indicating more informative rollout dispersion and bette… view at source ↗
Figure 4
Figure 4. Figure 4: Training dynamics during RL under DAR (ours) and MSE reward shaping. Bracket rate measures the fraction of examples whose rollout set contains predictions on both sides of the ground truth, while entropy measures prediction diversity [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of data samples Each component is Gaussian with shared heteroscedastic variance: P1(y | x) = N [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
read the original abstract

Large language models can predict real-valued quantities from heterogeneous inputs such as text, code, and molecular strings, but most training objectives score each decoded floating-point number independently, improving point estimates without ensuring calibrated predictive distributions. This limits applications requiring candidate ranking or uncertainty estimation. We introduce Distribution-Aware Reward, an on-policy reinforcement learning objective whose main contribution is to train language models to produce better predictive distributions for regression tasks, rather than only optimizing individual decoded outputs against scalar targets. Our method treats multiple decoded samples as an empirical predictive distribution, evaluates it with the Continuous Ranked Probability Score, and assigns leave-one-out credit based on each rollout's marginal contribution to distribution quality, rewarding predictions that are both accurate and appropriately dispersed. We evaluate our method on a controlled Gaussian-mixture task, code performance prediction, and molecular property prediction from SMILES strings. Across tasks, our method improves over supervised fine-tuning and pointwise reinforcement learning baselines, with strong rank-correlation gains, including a 6-point Spearman improvement on KBSS. On MoleculeNet, it uses only SMILES strings yet remains competitive with strong graph-based and 3D molecular models. Further analyses show that our method mitigates rollout diversity collapse and improves uncertainty diagnostics, suggesting that directly optimizing predictive distributions makes language model regression more robust and better calibrated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Distribution-Aware Reward, an on-policy RL objective for training LLMs on regression tasks to produce better predictive distributions rather than just point estimates. It forms empirical distributions from multiple decoded samples, scores them with CRPS, and uses leave-one-out marginal contributions for credit assignment to reward accurate and well-dispersed predictions. Evaluations on a Gaussian-mixture task, code performance prediction (KBSS), and molecular property prediction from SMILES show improvements over SFT and pointwise RL baselines, including a 6-point Spearman rank correlation gain on KBSS, and competitive performance on MoleculeNet.

Significance. If the reported gains hold under scrutiny of the credit assignment mechanism, this work could meaningfully advance LLM regression by shifting focus to distributional calibration and ranking, which is valuable for uncertainty estimation and candidate selection in applications like molecular design and code optimization. The competitive results using only SMILES strings against graph/3D models highlight the potential of text-based approaches when properly optimized for distributions.

major comments (2)
  1. [§4] §4 (experimental results): The reported 6-point Spearman improvement on KBSS and competitive MoleculeNet results lack error bars, statistical significance tests, rollout count K, or ablations on the leave-one-out credit assignment. This is load-bearing for the central claim that the method produces genuinely better predictive distributions.
  2. [§3] §3 (method, leave-one-out CRPS reward): The marginal credit (CRPS_full - CRPS_{-i}) may introduce bias or high variance in on-policy gradients for modest K or correlated early-training rollouts. No comparison to unbiased alternatives (e.g., common-random-number CRPS) or direct distribution-matching losses is provided, leaving open whether gains reflect improved distributions or sampling artifacts.
minor comments (2)
  1. [Abstract] Abstract: References 'further analyses' on rollout diversity collapse and uncertainty diagnostics without citing specific figures, tables, or sections.
  2. [Notation] Notation: The precise construction of the empirical predictive distribution from K samples and the CRPS implementation details could be clarified for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects of our experimental validation and methodological choices. We address each major comment below and will incorporate revisions to strengthen the manuscript where appropriate.

read point-by-point responses
  1. Referee: [§4] §4 (experimental results): The reported 6-point Spearman improvement on KBSS and competitive MoleculeNet results lack error bars, statistical significance tests, rollout count K, or ablations on the leave-one-out credit assignment. This is load-bearing for the central claim that the method produces genuinely better predictive distributions.

    Authors: We agree that error bars, statistical significance, explicit reporting of K, and ablations on the credit assignment mechanism would strengthen the evidence. In the revised manuscript we will add error bars computed over five independent runs with different random seeds for the KBSS and MoleculeNet results, include paired t-test p-values for the key comparisons, and explicitly state that K=10 was used throughout. We will also add an ablation in the appendix that replaces the leave-one-out marginal reward with a simple mean-CRPS reward; the results show that the marginal formulation contributes measurably to the observed rank-correlation gains, supporting the central claim. revision: yes

  2. Referee: [§3] §3 (method, leave-one-out CRPS reward): The marginal credit (CRPS_full - CRPS_{-i}) may introduce bias or high variance in on-policy gradients for modest K or correlated early-training rollouts. No comparison to unbiased alternatives (e.g., common-random-number CRPS) or direct distribution-matching losses is provided, leaving open whether gains reflect improved distributions or sampling artifacts.

    Authors: We acknowledge that the leave-one-out estimator can exhibit higher variance when K is modest or when early-training rollouts are highly correlated. Nevertheless, the consistent improvements across three distinct tasks and the mitigation of diversity collapse reported in our analyses indicate that any such variance does not negate the distributional benefits. We will expand §3 with a short discussion of the estimator’s bias-variance properties and its computational advantage in the on-policy setting. A direct comparison to common-random-number CRPS or to a distribution-matching loss is not present in the current submission; we will add a brief qualitative comparison and note that a quantitative study is planned for follow-up work, but the existing empirical evidence still supports that the gains arise from optimizing predictive distributions rather than sampling artifacts alone. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in the derivation

full rationale

The paper introduces Distribution-Aware Reward as an explicitly defined on-policy RL objective: multiple decoded samples form an empirical predictive distribution, CRPS (a standard external scoring rule) is computed on the full set, and each rollout receives a reward equal to its leave-one-out marginal contribution (CRPS_full - CRPS_{-i}). This construction is a deliberate credit-assignment heuristic, not a derivation that reduces to its own inputs by construction or renames a fitted parameter as a prediction. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked to justify the core mechanism; the central claims rest on empirical comparisons (Spearman gains, MoleculeNet competitiveness) rather than mathematical identities. The method is self-contained against external benchmarks like CRPS and does not exhibit self-definitional, fitted-input, or load-bearing self-citation patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on treating finite rollouts as a sufficient empirical distribution and on the effectiveness of marginal CRPS credit assignment; no explicit free parameters or new physical entities are introduced beyond the RL objective itself.

axioms (1)
  • domain assumption A modest number of decoded rollouts forms a representative empirical predictive distribution for CRPS computation
    Invoked when the method evaluates multiple samples as the predictive distribution in the abstract description of the reward.
invented entities (1)
  • Distribution-Aware Reward no independent evidence
    purpose: New on-policy RL objective that rewards marginal improvements to predictive distribution quality via CRPS
    Introduced as the main contribution; no independent evidence outside the paper is provided in the abstract.

pith-pipeline@v0.9.0 · 5780 in / 1314 out tokens · 37207 ms · 2026-05-21T06:57:20.616066+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 15 internal anchors

  1. [1]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  2. [2]

    2025 , howpublished =

  3. [3]

    Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

    LLMs Know More About Numbers than They Can Say , author=. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers) , pages=

  4. [4]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=

  5. [5]

    Qwen3 Technical Report

    Qwen3 Technical Report , author =. arXiv preprint arXiv:2505.09388 , year =

  6. [6]

    Qwen2.5 Technical Report

    Qwen2.5 Technical Report , author =. arXiv preprint arXiv:2412.15115 , year =. doi:10.48550/arXiv.2412.15115 , url =

  7. [7]

    Qwen2.5-Coder Technical Report

    Qwen2.5-Coder Technical Report , author =. arXiv preprint arXiv:2409.12186 , year =. doi:10.48550/arXiv.2409.12186 , url =

  8. [8]

    Qwen2 Technical Report

    Qwen2 Technical Report , author =. arXiv preprint arXiv:2407.10671 , year =

  9. [9]

    The Llama 3 Herd of Models

    The Llama 3 Herd of Models , author =. arXiv preprint arXiv:2407.21783 , year =

  10. [10]

    arXiv preprint arXiv:2402.06852 , year =

    ChemLLM: A Chemical Large Language Model , author =. arXiv preprint arXiv:2402.06852 , year =

  11. [11]

    arXiv preprint arXiv:2306.08018 , year=

    Mol-instructions: A large-scale biomolecular instruction dataset for large language models , author=. arXiv preprint arXiv:2306.08018 , year=

  12. [12]

    arXiv preprint arXiv:2402.09391 , year=

    Llasmol: Advancing large language models for chemistry with a large-scale, comprehensive, high-quality instruction tuning dataset , author=. arXiv preprint arXiv:2402.09391 , year=

  13. [13]

    arXiv preprint arXiv:2406.11704 , year =

    Nemotron-4 340B Technical Report , author =. arXiv preprint arXiv:2406.11704 , year =. doi:10.48550/arXiv.2406.11704 , url =

  14. [14]

    SCILITLLM: HOW TO ADAPT LLMS FOR SCIENTIFIC LITERATURE UNDERSTANDING , author=

  15. [15]

    arXiv preprint arXiv:2402.14547 , year =

    Omnipred: Language Models as Universal Regressors , author =. arXiv preprint arXiv:2402.14547 , year =

  16. [16]

    CoRR , year =

    From Words to Numbers: Your Large Language Model Is Secretly A Capable Regressor When Given In-Context Examples , author =. CoRR , year =

  17. [17]

    arXiv preprint arXiv:2410.10190 , year =

    Predicting from Strings: Language Model Embeddings for Bayesian Optimization , author =. arXiv preprint arXiv:2410.10190 , year =

  18. [18]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    LLM Processes: Numerical Predictive Distributions Conditioned on Natural Language , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  19. [19]

    arXiv preprint arXiv:2411.14708 , year =

    Understanding LLM Embeddings for Regression , author =. arXiv preprint arXiv:2411.14708 , year =

  20. [20]

    arXiv preprint arXiv:2509.26476 , year =

    Regression Language Models for Code , author =. arXiv preprint arXiv:2509.26476 , year =

  21. [21]

    arXiv preprint arXiv:2509.20645 , year =

    Anticipatory Evaluation of Language Models , author =. arXiv preprint arXiv:2509.20645 , year =. doi:10.48550/arXiv.2509.20645 , url =

  22. [22]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    Can llms help uncover insights about llms? a large-scale, evolving literature analysis of frontier llms , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  23. [23]

    Reasoning-Intensive Regression

    Reasoning-Intensive Regression , author =. arXiv preprint arXiv:2508.21762 , year =

  24. [24]

    Advances in Neural Information Processing Systems (NeurIPS) , year =

    Approaching Human-Level Forecasting with Language Models , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =

  25. [25]

    International Conference on Learning Representations (ICLR) , year =

    ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities , author =. International Conference on Learning Representations (ICLR) , year =

  26. [26]

    Current Directions in Psychological Science , year =

    Forecasting tournaments: Tools for increasing transparency and improving the quality of debate , author =. Current Directions in Psychological Science , year =

  27. [27]

    Proximal Policy Optimization Algorithms

    Proximal Policy Optimization Algorithms , author =. arXiv preprint arXiv:1707.06347 , year =

  28. [28]

    Journal of the American Statistical Association , volume =

    Strictly Proper Scoring Rules, Prediction, and Estimation , author =. Journal of the American Statistical Association , volume =. 2007 , publisher =

  29. [29]

    Management Science , volume =

    Scoring Rules for Continuous Probability Distributions , author =. Management Science , volume =. 1976 , publisher =

  30. [30]

    Journal of the American Statistical Association , volume =

    Making and Evaluating Point Forecasts , author =. Journal of the American Statistical Association , volume =. 2011 , doi =

  31. [31]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author =. arXiv preprint arXiv:2402.03300 , year =. doi:10.48550/arXiv.2402.03300 , url =

  32. [32]

    arXiv preprint arXiv:2511.08616 , year =

    Reasoning on Time-Series for Financial Technical Analysis , author =. arXiv preprint arXiv:2511.08616 , year =

  33. [33]

    Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs

    Time Series Forecasting as Reasoning: A Slow-Thinking Approach with Reinforced LLMs , author =. arXiv preprint arXiv:2506.10630 , year =

  34. [34]

    Training language models to follow instructions with human feedback

    Training Language Models to Follow Instructions with Human Feedback , author =. arXiv preprint arXiv:2203.02155 , year =

  35. [35]

    International Conference on Learning Representations (ICLR) , year =

    FlowRL: Matching Reward Distributions for LLM Reasoning , author =. International Conference on Learning Representations (ICLR) , year =

  36. [36]

    arXiv preprint arXiv:2505.17989 , year =

    Outcome-based Reinforcement Learning to Predict the Future , author =. arXiv preprint arXiv:2505.17989 , year =

  37. [37]

    arXiv preprint arXiv:2512.25070 , year =

    Scaling Open-Ended Reasoning to Predict the Future , author =. arXiv preprint arXiv:2512.25070 , year =

  38. [38]

    International Conference on Learning Representations (ICLR) , year =

    Beyond Binary Rewards: Training LMs to Reason About Uncertainty , author =. International Conference on Learning Representations (ICLR) , year =

  39. [39]

    HybridFlow: A Flexible and Efficient RLHF Framework

    HybridFlow: A Flexible and Efficient RLHF Framework , author =. arXiv preprint arXiv:2409.19256 , year =. doi:10.48550/arXiv.2409.19256 , url =

  40. [40]

    Can LLM s Help Uncover Insights about LLM s? A Large-Scale, Evolving Literature Analysis of Frontier LLM s

    Park, Jungsoo and Kang, Junmo and Stanovsky, Gabriel and Ritter, Alan. Can LLM s Help Uncover Insights about LLM s? A Large-Scale, Evolving Literature Analysis of Frontier LLM s. Association for Computational Linguistics. 2025

  41. [41]

    Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year =

    HelpSteer2: Open-source dataset for training top-performing reward models , author =. Advances in Neural Information Processing Systems (NeurIPS), Datasets and Benchmarks Track , year =

  42. [42]

    Findings of the Association for Computational Linguistics: NAACL 2025 , year =

    RewardBench: Evaluating Reward Models for Language Modeling , author =. Findings of the Association for Computational Linguistics: NAACL 2025 , year =. doi:10.18653/v1/2025.findings-naacl.96 , url =

  43. [43]

    Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

    Measuring Coding Challenge Competence With APPS , author =. Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks , year =

  44. [44]

    Chemical Science , volume=

    MoleculeNet: a benchmark for molecular machine learning , author=. Chemical Science , volume=. 2018 , publisher=

  45. [45]

    Journal of Chemical Information and Modeling , volume=

    Analyzing Learned Molecular Representations for Property Prediction , author=. Journal of Chemical Information and Modeling , volume=. 2019 , doi=

  46. [46]

    Journal of Medicinal Chemistry , volume=

    Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism , author=. Journal of Medicinal Chemistry , volume=. 2020 , doi=

  47. [47]

    International Conference on Learning Representations (ICLR) , year =

    Uni-Mol: A Universal 3D Molecular Representation Learning Framework , author =. International Conference on Learning Representations (ICLR) , year =

  48. [48]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , year =

    Regression Aware Inference with LLMs , author =. Findings of the Association for Computational Linguistics: EMNLP 2024 , year =

  49. [49]

    International Conference on Learning Representations (ICLR) , year =

    Better Autoregressive Regression with LLMs via Regression-Aware Fine-Tuning , author =. International Conference on Learning Representations (ICLR) , year =

  50. [50]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge , author =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  51. [51]

    arXiv preprint arXiv:2506.21718 , year =

    Performance Prediction for Large Systems via Text-to-Text Regression , author =. arXiv preprint arXiv:2506.21718 , year =

  52. [52]

    npj Computational Materials , volume =

    Benchmarking Materials Property Prediction Methods: The Matbench Test Set and Automatminer Reference Algorithm , author =. npj Computational Materials , volume =. 2020 , url =

  53. [53]

    NeurIPS Datasets and Benchmarks Track , year =

    Therapeutics Data Commons: Machine Learning Datasets and Tasks for Drug Discovery and Development , author =. NeurIPS Datasets and Benchmarks Track , year =

  54. [54]

    Evaluating Protein Transfer Learning with TAPE

    Evaluating Protein Transfer Learning with TAPE , author =. arXiv preprint arXiv:1906.08230 , year =

  55. [55]

    NeurIPS Datasets and Benchmarks Track , year =

    FLIP: Benchmark Tasks in Fitness Landscape Inference for Proteins , author =. NeurIPS Datasets and Benchmarks Track , year =

  56. [56]

    Journal of Advances in Modeling Earth Systems , volume =

    WeatherBench: A Benchmark Dataset for Data-Driven Weather Forecasting , author =. Journal of Advances in Modeling Earth Systems , volume =. 2020 , url =

  57. [57]

    arXiv preprint arXiv:2104.10066 , year =

    EarthNet2021: A Large-Scale Dataset and Challenge for Earth Surface Forecasting as a Guided Video Prediction Task , author =. arXiv preprint arXiv:2104.10066 , year =

  58. [58]

    Monthly Notices of the Royal Astronomical Society , year =

    AstroCLIP: A Cross-Modal Foundation Model for Galaxies , author =. Monthly Notices of the Royal Astronomical Society , year =

  59. [59]

    Advances in Neural Information Processing Systems , volume =

    A Benchmark for Prediction of Transcriptomic Responses to Chemical Perturbations Across Cell Types , author =. Advances in Neural Information Processing Systems , volume =. 2024 , url =

  60. [60]

    On the Opportunities and Risks of Foundation Models

    On the Opportunities and Risks of Foundation Models , author =. arXiv preprint arXiv:2108.07258 , year =

  61. [61]

    arXiv preprint arXiv:2512.06533 , year=

    Beyond Token-level Supervision: Unlocking the Potential of Decoding-based Regression via Reinforcement Learning , author=. arXiv preprint arXiv:2512.06533 , year=

  62. [62]

    arXiv preprint arXiv:2603.24844 , year=

    Reaching Beyond the Mode: RL for Distributional Reasoning in Language Models , author=. arXiv preprint arXiv:2603.24844 , year=

  63. [63]

    arXiv preprint arXiv:2506.02945 , year=

    Quantitative llm judges , author=. arXiv preprint arXiv:2506.02945 , year=

  64. [64]

    Understanding

    Tang, Eric and Yang, Bangding and Song, Xingyou , journal =. Understanding. 2025 , url =

  65. [65]

    Better Autoregressive Regression with

    Lukasik, Michal and Meng, Zhao and Narasimhan, Harikrishna and Chang, Yin-Wen and Menon, Aditya Krishna and Yu, Felix and Kumar, Sanjiv , booktitle =. Better Autoregressive Regression with. 2025 , url =

  66. [66]

    Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    Chiang, Cheng-Han and Lee, Hung-yi and Lukasik, Michal , editor =. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  67. [67]

    Advances in neural information processing systems , volume=

    Learning to summarize with human feedback , author=. Advances in neural information processing systems , volume=

  68. [68]

    2006 , publisher =

    Gaussian Processes for Machine Learning , author =. 2006 , publisher =

  69. [69]

    Advances in Neural Information Processing Systems , volume =

    Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , author =. Advances in Neural Information Processing Systems , volume =. 2017 , publisher =

  70. [70]

    Advances in Neural Information Processing Systems , volume =

    What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , author =. Advances in Neural Information Processing Systems , volume =. 2017 , publisher =

  71. [71]

    arXiv preprint arXiv:2509.26610 , year=

    Uncertainty Quantification for Regression using Proper Scoring Rules , author=. arXiv preprint arXiv:2509.26610 , year=

  72. [72]

    international conference on machine learning , pages=

    Dropout as a bayesian approximation: Representing model uncertainty in deep learning , author=. international conference on machine learning , pages=. 2016 , organization=

  73. [73]

    arXiv preprint arXiv:2603.11682 , year=

    Entropy-preserving reinforcement learning , author=. arXiv preprint arXiv:2603.11682 , year=

  74. [74]

    Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

    Rewarding the unlikely: Lifting grpo beyond distribution sharpening , author=. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing , pages=

  75. [75]

    Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

    Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs , author =. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , year =

  76. [76]

    Deep Reinforcement Learning Meets Structured Prediction Workshop at ICLR , year =

    Buy 4 REINFORCE Samples, Get a Baseline for Free! , author =. Deep Reinforcement Learning Meets Structured Prediction Workshop at ICLR , year =

  77. [77]

    PLOS Computational Biology , volume=

    Evaluating epidemic forecasts in an interval format , author=. PLOS Computational Biology , volume=. 2021 , publisher=

  78. [78]

    Sensors , volume=

    Evaluating and Calibrating Uncertainty Prediction in Regression Tasks , author=. Sensors , volume=. 2022 , publisher=

  79. [79]

    , journal=

    Chai, Tianfeng and Draxler, Roland R. , journal=. Root mean square error (. 2014 , publisher=

  80. [80]

    Proceedings of the 22nd International Conference on Machine Learning , pages=

    Learning to Rank Using Gradient Descent , author=. Proceedings of the 22nd International Conference on Machine Learning , pages=

Showing first 80 references.