hub
Verification of forecasts expressed in terms of probability
18 Pith papers cite this work.
citing papers explorer
- Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments
EQMs, sixty LLM-scored reasoning patterns, predict forecast accuracy at both item and person levels and outperform prior text-analysis methods in a large pre-registered tournament dataset.
- ForecastBench-Sim: A Simulated-World Forecasting Benchmark
ForecastBench-Sim is a simulated-world benchmark using Freeciv game rollouts to generate resolvable forecasting questions at arbitrary horizons with paired intervention worlds.
- The Impossibility of Eliciting Latent Knowledge
Proves that no behavior-dependent feedback training strategy can guarantee an honest agent for latent knowledge even with perfect training feedback.
- When Individually Calibrated Models Become Collectively Miscalibrated
Individually calibrated predictors become collectively miscalibrated under Brier-optimal strategic responses with positive belief correlations, but VCG aggregation restores dominant-strategy incentive compatibility and near-optimal performance.
- Inducing Artificial Uncertainty in Language Models
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
- CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction
CaliDist calibrates LLMs by scaling confidence according to how much predictions change under semantic distractors, cutting average ECE from 23% to 7% on seven NLU benchmarks across six models.
- CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
- Large Language Models Are Overconfident in Their Own Responses
Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.
- The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting
Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.
- Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction
Probabilistic bias correction doubles AI subseasonal forecast skill and wins a 2025 international competition by correcting biases in ECMWF models for pressure, temperature, and precipitation.
- Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking
SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.
- Deep Learning-Enabled Prediction of Geoeffective CMEs Using SOHO and SDO Observations
A CNN-based fusion model trained on multi-instrument solar observations predicts geoeffective CMEs, achieving mean TSS of 0.703 and Brier score of 0.095 via five-fold cross-validation.
- Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model
A knowledge-data dual paradigm using geomorphic priors and a tabular foundation model achieves baseline-level landslide susceptibility prediction accuracy with only 30% of typical data in tested regions.
- A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning
Single-seed CRPS estimates in limited-data BDL show high variance and peaks for heteroscedastic methods, with local variance correlating above 0.96 to single-seed error.
- Debiasing the Observed Fast Radio Burst Population with the CHIME/FRB Selection Function
Analysis of CHIME/FRB Catalog 2 with synthetic injections and a multidimensional selection function yields evidence for a slight downturn in the intrinsic scattering timescale distribution, though flat or rising distributions remain possible.
- Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography
A stacked video ensemble model distinguishes BAV from TAV on PLAX cine loops with an outer-CV F1 of 0.907, using Grad-CAM and SHAP for explainability.
- A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data
FDRS combines digit frequency tests, association metrics, entropy, KL divergence, and ML models to assign risk grades to numerical datasets, showing separation between normal and irregular simulated data with high AUC.
- Similarity-Distance-Magnitude Activations