ForecastBench-Sim is a simulated-world benchmark using Freeciv game rollouts to generate resolvable forecasting questions at arbitrary horizons with paired intervention worlds.
Verification of forecasts expressed in terms of probability
16 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Proves that no behavior-dependent feedback training strategy can guarantee an honest agent for latent knowledge even with perfect training feedback.
Individually calibrated predictors become collectively miscalibrated under Brier-optimal strategic responses with positive belief correlations, but VCG aggregation restores dominant-strategy incentive compatibility and near-optimal performance.
Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.
CaliDist calibrates LLMs by scaling confidence according to how much predictions change under semantic distractors, cutting average ECE from 23% to 7% on seven NLU benchmarks across six models.
CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.
Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than to identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.
Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.
Probabilistic bias correction doubles AI subseasonal forecast skill and wins a 2025 international competition by correcting biases in ECMWF models for pressure, temperature, and precipitation.
SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.
A knowledge-data dual paradigm using geomorphic priors and a tabular foundation model achieves baseline-level landslide susceptibility prediction accuracy with only 30% of typical data in tested regions.
Single-seed CRPS estimates in limited-data BDL show high variance and peaks for heteroscedastic methods, with local variance correlating above 0.96 to single-seed error.
Analysis of CHIME/FRB Catalog 2 with synthetic injections and a multidimensional selection function yields evidence for a slight downturn in the intrinsic scattering timescale distribution, though flat or rising distributions remain possible.
Stacked video ensemble model distinguishes BAV from TAV on PLAX cine loops with outer-CV F1 of 0.907 using Grad-CAM and SHAP for explainability.
FDRS combines digit frequency tests, association metrics, entropy, KL divergence, and ML models to assign risk grades to numerical datasets, showing separation between normal and irregular simulated data with high AUC.
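Several of the summaries above refer to the Brier score, the verification measure introduced in the hub work. As a minimal sketch, assuming the common binary form (the mean squared difference between each forecast probability and the 0/1 outcome), it can be computed as below. The function name `brier_score` is hypothetical, and the original multi-class formulation sums over all classes, which for two classes gives twice this value.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Binary Brier score: mean of (forecast probability - outcome)^2.

    probs    -- forecast probabilities that the event occurs, each in [0, 1]
    outcomes -- observed results, 1 if the event occurred and 0 otherwise
    Lower is better; 0.0 is a perfect forecast.
    """
    if len(probs) == 0 or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


# Illustrative inputs only, not data from any of the papers listed here.
score = brier_score([0.9, 0.2, 0.5], [1, 0, 1])
```

A forecaster who always reports 0.5 scores 0.25 under this form regardless of the outcomes, which is a common reference point when reading reported Brier scores.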
citing papers explorer
- When Individually Calibrated Models Become Collectively Miscalibrated
Individually calibrated predictors become collectively miscalibrated under Brier-optimal strategic responses with positive belief correlations, but VCG aggregation restores dominant-strategy incentive compatibility and near-optimal performance.