Verification of forecasts expressed in terms of probability

Glenn W. Brier · 1950 · DOI 10.1175/1520-0493(1950)078

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

EQMs, sixty LLM-scored reasoning patterns, predict forecast accuracy at both item and person levels and outperform prior text-analysis methods in a large pre-registered tournament dataset.

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

cs.AI · 2026-06-17 · unverdicted · novelty 7.0

ForecastBench-Sim is a simulated-world benchmark using Freeciv game rollouts to generate resolvable forecasting questions at arbitrary horizons with paired intervention worlds.

The Impossibility of Eliciting Latent Knowledge

cs.AI · 2026-06-10 · unverdicted · novelty 7.0

Proves that no behavior-dependent feedback training strategy can guarantee an honest agent for latent knowledge even with perfect training feedback.

When Individually Calibrated Models Become Collectively Miscalibrated

cs.LG · 2026-05-14 · conditional · novelty 7.0

Individually calibrated predictors become collectively miscalibrated under Brier-optimal strategic responses with positive belief correlations, but VCG aggregation restores dominant-strategy incentive compatibility and near-optimal performance.

Inducing Artificial Uncertainty in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

CaliDist calibrates LLMs by scaling confidence according to how much predictions change under semantic distractors, cutting average ECE from 23% to 7% on seven NLU benchmarks across six models.

CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.

Large Language Models Are Overconfident in Their Own Responses

cs.CL · 2026-06-02 · unverdicted · novelty 6.0

Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.

The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting

cs.GT · 2026-05-08 · unverdicted · novelty 6.0

Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.

Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction

cs.LG · 2026-04-17 · unverdicted · novelty 6.0

Probabilistic bias correction doubles AI subseasonal forecast skill and wins a 2025 international competition by correcting biases in ECMWF models for pressure, temperature, and precipitation.

Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking

cs.AI · 2026-05-26 · unverdicted · novelty 5.0

SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.

Deep Learning-Enabled Prediction of Geoeffective CMEs Using SOHO and SDO Observations

astro-ph.SR · 2026-05-23 · unverdicted · novelty 5.0

A CNN-based fusion model trained on multi-instrument solar observations predicts geoeffective CMEs, achieving mean TSS of 0.703 and Brier score of 0.095 via five-fold cross-validation.

Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model

cs.LG · 2026-04-28 · unverdicted · novelty 5.0

A knowledge-data dual paradigm using geomorphic priors and a tabular foundation model achieves baseline-level landslide susceptibility prediction accuracy with only 30% of typical data in tested regions.

A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning

cs.LG · 2026-04-25 · unverdicted · novelty 5.0

Single-seed CRPS estimates in limited-data BDL show high variance and peaks for heteroscedastic methods, with local variance correlating above 0.96 to single-seed error.

Debiasing the Observed Fast Radio Burst Population with the CHIME/FRB Selection Function

astro-ph.HE · 2026-06-24 · unverdicted · novelty 4.0

Analysis of CHIME/FRB Catalog 2 with synthetic injections and a multidimensional selection function yields evidence for a slight downturn in the intrinsic scattering timescale distribution, though flat or rising distributions remain possible.

Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography

cs.LG · 2026-05-13 · unverdicted · novelty 4.0

Stacked video ensemble model distinguishes BAV from TAV on PLAX cine loops with outer-CV F1 of 0.907 using Grad-CAM and SHAP for explainability.

A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data

cs.LG · 2026-06-05 · unverdicted · novelty 3.0

FDRS combines digit frequency tests, association metrics, entropy, KL divergence, and ML models to assign risk grades to numerical datasets, showing separation between normal and irregular simulated data with high AUC.

Similarity-Distance-Magnitude Activations

cs.LG · 2025-09-16

citing papers explorer

Showing 18 of 18 citing papers.

Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments · cs.CL · 2026-06-29 · unverdicted · none · ref 80
ForecastBench-Sim: A Simulated-World Forecasting Benchmark · cs.AI · 2026-06-17 · unverdicted · none · ref 1
The Impossibility of Eliciting Latent Knowledge · cs.AI · 2026-06-10 · unverdicted · none · ref 3
When Individually Calibrated Models Become Collectively Miscalibrated · cs.LG · 2026-05-14 · conditional · none · ref 47
Inducing Artificial Uncertainty in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 4
CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction · cs.LG · 2026-06-04 · unverdicted · none · ref 66
CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts · cs.CL · 2026-06-03 · unverdicted · none · ref 74
Large Language Models Are Overconfident in Their Own Responses · cs.CL · 2026-06-02 · unverdicted · none · ref 1
The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting · cs.GT · 2026-05-08 · unverdicted · none · ref 13
Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction · cs.LG · 2026-04-17 · unverdicted · none · ref 41
Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability: Separating Calibration from Ranking · cs.AI · 2026-05-26 · unverdicted · none · ref 5
Deep Learning-Enabled Prediction of Geoeffective CMEs Using SOHO and SDO Observations · astro-ph.SR · 2026-05-23 · unverdicted · none · ref 8
Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model · cs.LG · 2026-04-28 · unverdicted · none · ref 6
A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning · cs.LG · 2026-04-25 · unverdicted · none · ref 4
Debiasing the Observed Fast Radio Burst Population with the CHIME/FRB Selection Function · astro-ph.HE · 2026-06-24 · unverdicted · none · ref 63
Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography · cs.LG · 2026-05-13 · unverdicted · none · ref 19
A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data · cs.LG · 2026-06-05 · unverdicted · none · ref 47
Similarity-Distance-Magnitude Activations · cs.LG · 2025-09-16 · unreviewed · ref 4
