OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
Pith reviewed 2026-05-18 05:50 UTC · model grok-4.3
The pith
Frontier language models produce inaccurate and overconfident probabilistic priors on real-world numerical estimation tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that LM-elicited priors on OpenEstimate tasks are often inaccurate relative to the true distribution of the quantity being estimated and are overconfident in their probability assignments; performance improves modestly with different uncertainty elicitation methods but shows little sensitivity to sampling strategy, reasoning effort, or prompt design.
What carries the argument
The OpenEstimate benchmark: a set of multi-domain numerical estimation tasks that require synthesis of background information into probabilistic priors, with evaluation against samples from the true distribution.
If this is right
- LM uncertainty reasoning can be quantified by direct comparison of elicited priors to observed data distributions.
- Elicitation phrasing affects calibration more than prompt wording or chain-of-thought length.
- Overconfidence persists across variations in sampling temperature and reasoning budget.
- The benchmark provides a stable platform for tracking progress on probabilistic estimation.
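One way to make the first point above concrete is an interval-coverage check: if a model's 95% credible intervals contain the ground-truth value far less than 95% of the time, its priors are overconfident. The sketch below assumes Gaussian priors given as (mu0, sigma0) pairs. The function name, the prior values, and the ground-truth values are hypothetical illustrations, not data from the paper, and the paper's own metrics may differ.

```python
from statistics import NormalDist


def interval_coverage(priors, truths, level=0.95):
    """Fraction of ground-truth values that fall inside the central
    credible interval of the matching Gaussian prior."""
    # Half-width of the central interval in units of sigma (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for (mu0, sigma0), truth in zip(priors, truths):
        if abs(truth - mu0) <= z * sigma0:
            hits += 1
    return hits / len(truths)


# Hypothetical elicited priors (mu0, sigma0) and hypothetical ground-truth means.
priors = [(175.0, 0.5), (60.0, 1.0), (0.30, 0.01), (12.0, 0.2)]
truths = [177.1, 60.4, 0.36, 12.1]

coverage = interval_coverage(priors, truths, level=0.95)
print(f"nominal 0.95, empirical {coverage:.2f}")
```

With these made-up numbers, two of the four intervals contain the target, so empirical coverage is 0.5 against a nominal 0.95. That gap is the pattern the review describes as overconfidence.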
Where Pith is reading between the lines
- Models might improve if trained explicitly to output calibrated distributions rather than point estimates.
- The same tasks could reveal whether access to external search or data retrieval reduces overconfidence.
- Deployment in domains like finance or medicine may need separate uncertainty-handling modules beyond current prompting.
Load-bearing premise
The benchmark problems are ones that humans can answer reliably with correct probabilistic estimates while LMs will struggle without special training.
What would settle it
A frontier model that produces priors closely matching the empirical distribution on most OpenEstimate tasks would falsify the claim of general inaccuracy and overconfidence.
Original abstract
Real-world settings where language models (LMs) are deployed -- in domains spanning healthcare, finance, and other forms of knowledge work -- require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces OpenEstimate, an extensible multi-domain benchmark for evaluating LLMs on numerical estimation tasks that require synthesizing background information and expressing predictions as probabilistic priors. Across six frontier LMs, LM-elicited priors are reported to be often inaccurate and overconfident; performance shows modest gains from elicitation method but is largely unaffected by sampling strategy, reasoning effort, or prompt design.
Significance. If the tasks are shown to have objective ground-truth distributions that humans can estimate reliably, the benchmark would address a genuine gap in LM evaluation by focusing on naturalistic uncertainty reasoning rather than well-defined problems. The extensibility and multi-domain coverage are strengths that could support future work on probabilistic estimation.
major comments (1)
- [Abstract and §1] Abstract and §1 (Introduction): The motivation explicitly rests on constructing 'natural problems involving uncertainty' for which 'LMs will struggle to produce correct answers, but which humans can answer reliably.' No human performance baselines, expert validation, or inter-rater reliability results are reported on the actual tasks. This is load-bearing for the central claim, because all accuracy and calibration metrics are computed against the authors' chosen ground truths; without evidence that these targets are reliably estimable by humans, poor LM performance could reflect task ambiguity rather than a specific deficit in uncertainty reasoning.
minor comments (3)
- [Methods] Methods section: Provide explicit details on task construction, statistical tests for cross-condition comparisons, error bars or confidence intervals on accuracy/calibration metrics, and any data exclusion rules. These are needed to evaluate the robustness of the claim that performance is largely unaffected by sampling strategy, reasoning effort, or prompt design.
- [Results] Results: Consider including per-model tables or figures with confidence intervals to support the reported modest effects from elicitation method.
- [Appendix] Reproducibility: Include all task prompts, examples, and ground-truth sources in an appendix or supplementary material.
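The request for confidence intervals on cross-condition comparisons could be met with a percentile bootstrap over tasks. The following is a minimal sketch under stated assumptions, not the authors' procedure: `bootstrap_ci` is a hypothetical helper, and the per-task differences in error between two prompt designs are invented for illustration.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample tasks with replacement and record the mean of each resample.
    stats = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


# Hypothetical per-task differences in absolute error (prompt A minus prompt B).
diffs = [0.02, -0.01, 0.00, 0.03, -0.02, 0.01, 0.00, -0.01, 0.02, -0.03]
lo, hi = bootstrap_ci(diffs)
print(f"95% CI for the mean difference: [{lo:.4f}, {hi:.4f}]")
```

An interval that spans zero would be consistent with the claim that a given manipulation has little effect, while still showing how large an effect the data could not rule out.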
Simulated Author's Rebuttal
We thank the referee for the detailed review and for identifying a key aspect of our motivation. We address the major comment below and outline targeted revisions to the manuscript.
-
Referee: [Abstract and §1] Abstract and §1 (Introduction): The motivation explicitly rests on constructing 'natural problems involving uncertainty' for which 'LMs will struggle to produce correct answers, but which humans can answer reliably.' No human performance baselines, expert validation, or inter-rater reliability results are reported on the actual tasks. This is load-bearing for the central claim, because all accuracy and calibration metrics are computed against the authors' chosen ground truths; without evidence that these targets are reliably estimable by humans, poor LM performance could reflect task ambiguity rather than a specific deficit in uncertainty reasoning.
Authors: We agree that explicit human performance baselines would strengthen the central motivation. The ground-truth distributions, however, are not subjective estimates but are derived directly from objective, publicly verifiable real-world data sources (government statistics, published reports, and empirical studies). The tasks require synthesizing background information that is generally accessible, and the evaluation measures LM deviation from these fixed targets rather than from human judgments. We will revise §1 and the abstract to more explicitly state the objective, data-driven nature of the ground truths and add a dedicated paragraph in the limitations section acknowledging the lack of human baselines while outlining plans for such validation in follow-up work. This addresses the concern without requiring new experiments in the current revision.
Revision: partial
Circularity Check
No circularity in empirical benchmark evaluation
This paper presents an empirical benchmark (OpenEstimate) for assessing LLMs on numerical estimation and uncertainty reasoning tasks drawn from real-world domains. It contains no derivations, equations, fitted parameters renamed as predictions, or first-principles results. Claims rest on direct experimental measurements of accuracy and calibration against author-chosen ground-truth distributions, with no self-definitional loops, load-bearing self-citations, or ansatzes smuggled via prior work. The methodology is self-contained as a standard evaluation study; performance differences are reported as observed outcomes rather than constructed equivalences. No steps reduce the central findings to inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: natural problems involving uncertainty can be designed such that LMs struggle to produce correct answers but humans can answer reliably.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
OPENESTIMATE... numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
**Consider the context**: Reflect on what {{variable}} represents and any relevant information you have about its population-level average
-
[2]
**Estimate parameters**: Based on your knowledge and context, determine appropriate values for: * $\mu_0$: your best estimate of the population mean * $\sigma_0$: the standard deviation that reflects your **uncertainty about $\mu$**, not the standard deviation of individual-level data
-
[3]
**Construct the prior**: Express the distribution in the form: $$ \mu \sim \mathcal{N}(\mu_0, \sigma_0^2) $$ where $\mu_0$ is your belief about the central tendency and $\sigma_0$ reflects the degree of confidence (epistemic uncertainty) in that belief
-
[4]
**Justify your choices**: Explain your reasoning for selecting each parameter, grounding it in evidence or plausible domain knowledge
-
[5]
**Explain confidence**: Discuss the level of confidence implied by your chosen $\sigma_0$, making sure this reflects uncertainty about the mean, not about individual values. --- ### Important Guidance * Do **not** base $\sigma_0$ on the variability **across individuals** in the population. * Do **not** confuse the standard deviation of individual measurements...
-
[6]
List known facts or context about the variable and its mean
-
[7]
Consider the plausible range of the **population mean**
-
[8]
Propose at least three possible pairs of $\mu_0$ and $\sigma_0$, representing different reasonable priors
-
[10]
Reflect on what different choices of $\sigma_0$ say about your confidence
-
[11]
Consider edge cases (very large or small $\sigma_0$) and what they would imply
-
[14]
Summarize your final choice and give a clear, reasoned justification. This detailed analysis helps ensure your prior is carefully reasoned and reflects proper statistical thinking. --- ### Final Answer Format After the analysis, return your prior in this format: ``` Prior Distribution for the mean: μ ~ N(μ_0, σ_0^2) <mean>[Your chosen μ_0 value]</mean> <std>[Y...
-
[15]
Consider the context: Reflect on what {{variable}} represents and any relevant information you have about it
-
[16]
Estimate parameters: Based on your knowledge and the context, determine appropriate α and β parameters for the Beta distribution. These values should encode your uncertainty about the true population proportion, not the variability of observed outcomes
-
[17]
Construct the prior: Express the prior distribution in the form p ~ Beta(α, β)
-
[18]
Justify your choices: Provide a clear explanation for why you selected the specific α and β parameters
-
[19]
Explain confidence: Discuss the level of confidence implied by your chosen parameters. Before providing your final answer, show your reasoning process by wrapping your analysis in <beta_prior_analysis> tags:
-
[20]
List known facts or context about the variable
-
[21]
State the possible range of the variable (typically 0 to 1 for proportions)
-
[22]
Propose at least three possible pairs of α and β parameters representing different reasonable priors
-
[23]
For each set: a. Compute the 68% and 95% credible intervals. b. Interpret what these intervals imply about your beliefs about the **mean**
-
[24]
Reflect on what different choices of α and β say about your confidence
-
[25]
Consider edge cases of α and β and what they would imply
-
[26]
Compare and evaluate the trade-offs of different options
-
[27]
Interpret the final confidence level implied by your chosen prior
-
[28]
Summarize your final choice and give a clear, reasoned justification. This analysis helps ensure a thorough and well-considered response. It’s acceptable for this section to be quite extensive. After your analysis, provide your final answer in the following format: Prior Distribution: p ~ Beta(α, β) <alpha>[Your chosen value]</alpha> <beta>[Your chosen value]<...
-
[29]
Consider the context of the variable, including its meaning and any relevant information that informs your beliefs
-
[30]
Estimate the following percentiles of the parameter’s true value: - 5th percentile (only a 5% chance the true value is below this) - 25th percentile - 50th percentile (median - your best estimate of the true value) - 75th percentile - 95th percentile (only a 5% chance the true value is above this)
-
[31]
Explain your reasoning behind these estimates. Begin your analysis by showing your thought process inside <parameter_estimation_process> tags. Include the following elements:
-
[32]
Explicitly state the type of parameter being estimated (e.g., population mean, proportion)
-
[33]
List any known facts or data points about the variable
-
[34]
Consider and list possible data sources or methods for estimating this parameter
-
[35]
Brainstorm factors that might influence the parameter’s value
-
[36]
Note potential biases or limitations in the available information
-
[37]
State any assumptions you’re making
-
[38]
Consider how the parameter might have changed over time or across different subgroups
-
[39]
Provide your quantile estimates with a brief explanation for each
-
[40]
Include relevant facts or context about the variable
-
[41]
Justify your choices
-
[42]
Emphasize population parameter uncertainty (not individual variability)
-
[43]
Reflect on what your estimate spread indicates about your certainty
-
[44]
Consider any plausible edge cases or alternative scenarios. After your analysis, provide your final answer in the following format: <q5>[5th percentile value]</q5> <q25>[25th percentile value]</q25> <q50>[50th percentile (median) value]</q50> <q75>[75th percentile value]</q75> <q95>[95th percentile value]</q95> <justification> [Brief summary of your reaso...
-
[45]
Consider the context of the variable, including what it represents and any relevant information or assumptions that inform your beliefs
-
[46]
Estimate the following quantities: - Best guess: your estimate of the most likely value of the population-level parameter (e.g., mean or proportion) - Standard deviation or variance: a numerical expression of your uncertainty about the true value not the variability across individual observations
-
[47]
Begin your analysis by showing your thought process inside <parameter_estimation_process> tags. Include the following elements: - Clearly state the type of parameter being estimated (e.g., population mean, true proportion). - List any known facts, data points, or previous estimates about the variable. - Consider possible data sources, analogous population...
-
[48]
After your analysis, provide your final answer in the following format: <mean>[Best guess for the true value]</mean> <std_dev>[Standard deviation representing your uncertainty]</std_dev> <justification> [Brief summary of your reasoning and what informed your estimates] </justification> <confidence_level> [Explanation of how confident or uncertain you are,...
-
[49]
Gaussian (Normal) Distribution Example: Variable: Average height of adult males in a country Units: Centimeters <mean>175</mean> <std_dev>2.5</std_dev> <justification> Based on global averages, previous studies in similar populations, and considering factors like nutrition and genetics. The standard deviation reflects uncertainty due to potential sampling...
-
[50]
Beta Distribution Example: Variable: Proportion of people who prefer tea over coffee in a city Units: Proportion (0 to 1) <mean>0.6</mean> <std_dev>0.05</std_dev> <justification> Estimated based on local cultural preferences, limited survey data, and comparison with similar cities. The standard deviation accounts for potential biases in available data and...
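The Beta example above reports a mean of 0.6 and a standard deviation of 0.05 rather than shape parameters. One common way to recover Beta(α, β) from those two numbers is moment matching. Whether the paper converts elicited values this way is an assumption here, and `beta_from_mean_std` is a hypothetical helper written for illustration.

```python
def beta_from_mean_std(mean, std):
    """Moment-matched Beta shape parameters for a given mean and std.

    Uses mean = a / (a + b) and var = mean * (1 - mean) / (a + b + 1).
    """
    if not 0.0 < mean < 1.0:
        raise ValueError("mean must lie strictly between 0 and 1")
    var = std ** 2
    if var <= 0.0 or var >= mean * (1.0 - mean):
        raise ValueError("std is not attainable by any Beta with this mean")
    total = mean * (1.0 - mean) / var - 1.0  # this is a + b
    return mean * total, (1.0 - mean) * total


# Values taken from the Beta distribution example above.
alpha, beta = beta_from_mean_std(0.6, 0.05)
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

For the example values this gives approximately Beta(57, 38), a prior roughly as concentrated as 95 pseudo-observations. Seeing the implied pseudo-count is one informal way to judge whether an elicited standard deviation is implausibly tight.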