Mixed citation behavior. Most common role is background (60%).
citing papers explorer
-
Pandora's Regret: A Proper Scoring Rule for Evaluating Sequential Search
Pandora's Regret is a closed-form pairwise scoring rule derived from expected optimal search costs that elicits true probabilities and outperforms log loss, accuracy, and F1 at predicting diagnostic costs on MedMNIST models.
-
Calibrated Probability Forecast Sequences and Measure-Valued Martingales
Auto-calibration of forecast sequences equals measure-valued martingales, enabling a statistical test for calibration of updating predictions.
-
Measuring Judgment Quality in Natural-Language Explanations: Evidence from Forecasting Tournaments
EQMs, sixty LLM-scored reasoning patterns, predict forecast accuracy at both item and person levels and outperform prior text-analysis methods in a large pre-registered tournament dataset.
-
MACROCAST: A Vintage-Consistent Time Series Foundation Model for Real-Time Macroeconomic Forecasting
MACROCAST is the first leakage-free time series foundation model for real-time macroeconomic forecasting, trained exclusively on synthetic series and vintage data, outperforming AR(1), Chronos-2, BVAR, and DFM benchmarks on FRED-MD.
-
ForecastBench-Sim: A Simulated-World Forecasting Benchmark
ForecastBench-Sim is a simulated-world benchmark using Freeciv game rollouts to generate resolvable forecasting questions at arbitrary horizons with paired intervention worlds.
-
Expected Free Energy-based Planning as Variational Inference
EFE-based planning is formulated as variational free energy minimization with epistemic priors, decomposing into expected plan costs plus a complexity term.
-
Logistic Credibility with Temporal Decay: Extending Bühlmann-Straub for Commercial Lines
A logistic credibility model with data-driven temporal decay restores calibration slope to 1.00 and reduces exposure-weighted error by 38% versus standard Bühlmann-Straub on US commercial auto held-out data.
-
Proper Scoring Rules for Right-Censored Survival Data
A mapping of predictive distributions through the censoring mechanism yields proper right-censored versions of the CRPS, Brier score, energy score and other losses, with the marginalized form proven proper under conditional independent censoring.
-
What Type of Inference is Active Inference?
EFE-based active inference planning is characterized as VFE on an augmented model plus entropy and planning corrections, with a derived message-passing implementation and grid-world validation.
-
FinStressTS: A Parametric Synthetic Benchmark for Time-Series Forecasting in Finance
FinStressTS is a parametric synthetic benchmark with 30 environments across six mechanism families for evaluating point and probabilistic forecasting models on financial time series.
-
Stabilizing distribution-free probabilistic forecasts
Neural network-parameterized regression splines enable joint optimization of forecast quality and stability in distribution-free probabilistic time series models by penalizing dissimilarities from forecast updates.
-
Proper Scoring Rules for Agentic Uncertainty Quantification
Introduces Trajectory Proper Score (TPS) as a strictly proper family of trajectory-level scoring rules that elicits the complete prefix-conditioned success probability process.
-
Valid and Expressive Copulas for Irregular Multivariate Time Series
CopFITi is the first marginalization-consistent copula for irregular multivariate time series, using normalizing flows for marginals and a Gaussian mixture copula for dependencies to reach new state-of-the-art joint density modeling.
-
When Individually Calibrated Models Become Collectively Miscalibrated
Individually calibrated predictors become collectively miscalibrated under Brier-optimal strategic responses with positive belief correlations, but VCG aggregation restores dominant-strategy incentive compatibility and near-optimal performance.
-
Text-to-Distribution Prediction with Quantile Tokens and Neighbor Context
Quantile tokens inserted into LLM inputs combined with neighbor retrieval enable direct prediction of full distributions, yielding lower MAPE and narrower intervals than baselines on Airbnb and StackSample tasks.
-
Decision-Aligned Evaluation of Uncertainty Quantification
Introduces decision-alignment to evaluate uncertainty metrics against downstream decision utilities and proposes prior-weighted proper scoring rules that align better in benchmarks and case studies.
-
Restoring Incentive Compatibility in Two-Stage Energy Markets with Prosumers
Designs a leave-one-out contrastive scoring rule penalty to restore incentive compatibility for prosumers in two-stage energy markets under linear preferences.
-
Learning Dynamical Systems from Multiple Sparse Datasets: A Hierarchical Bayesian Modeling Approach
A hierarchical Bayesian framework pools information across sparse dynamical system datasets via a shared population distribution to improve parameter inference and prediction over unpooled approaches.
-
The Degeneracy Distillery
A method called the degeneracy distillery uses symbolic transformations to flatten the Fisher information matrix globally from simulations alone, identifying independent parameter combinations and reducing neural posterior estimation simulation budgets by up to 10x.
-
Hierarchical Bayes meets hierarchical forecasting: A flexible framework for level-focused forecasts
A Bayesian hierarchical model integrates coherence penalization and level-specific focus into forecasting estimation, yielding improved predictive accuracy on simulated and Australian tourism data.
-
To select or not to select: predictively consistent priors instead of model selection
Predictively consistent priors let complex Bayesian models match or beat the out-of-sample performance of selected simpler models across linear, logistic, and nonlinear examples without explicit selection.
-
Temporal Coarse-Graining of Multi-Sector Default Count Data Generates Posterior-Implied Copulas
A low-rank dynamic factor model with AR(1) latent states and binomial observations, when aggregated over time, generates horizon-dependent posterior-implied copulas that reproduce annual eigenvalue amplification on S&P sector default data and improve some forecast scores.
-
On the QUEST for Uncertainty Quantification via Highest Density Regions
QUEST measures uncertainty via the Lebesgue volume of highest-density regions of a distribution's support, evaluated at robustness parameter alpha, and claims to satisfy UQ axioms while outperforming variance and differential entropy on selective prediction tasks.
-
Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting
Tyan-WP is a pretrained wind power foundation model that outperforms site-specific TSMs and generic LTSMs in zero-shot ultra-short-term probabilistic forecasting on U.S. and U.K. sites via static embeddings and a PAMF module.
-
Mix, Don't Pick: Why Synthetic Corpus Composition Matters for Time Series Foundation Model Pretraining
An equal-weight mixture of synthetic generators matches or exceeds the best single generator for time series foundation model pretraining, and improves further with real data.
-
Scalable Uncertainty Quantification for Extreme Weather Forecasting via Empirical Neural Tangent Kernels
NTK-UQ produces 31-37% sharper 90% prediction intervals than split conformal prediction for extreme weather forecasts, with adaptive scaling via architecture-dependent eigenvalue truncation and ICA decomposition of last-layer features.
-
Probabilistic storyline attribution using machine learning
Distributional autoencoders trained on climate model simulations model full conditional distributions of European temperature fields to enable probabilistic storyline attribution, illustrated by higher intensities and probability ratios for a 2003-like heatwave in 2028 and 2053.
-
Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization
Multi-response training retains multiple responses per prompt to reduce uncertainty about the conditional output distribution, yielding improved distributional generalization especially in high response-diversity and low prompt-redundancy regimes.
-
Probabilistic Data-Driven Modelling of Astrophysical Transients: The Neural Process Family for Ultrafast and Class-Agnostic Light Curve Reconstruction with NightLANP
Attentive Neural Processes outperform Gaussian Processes and neural networks on light curve interpolation quality, feature recovery, calibration, and speed for 15 transient classes under realistic Rubin cadences.
-
Rashomon-Seeded Annealing for Robust Bayesian Inference in Factorial Designs
Rashomon-seeded annealing repurposes Rashomon sets as warm starts for annealed importance sampling to enable full posterior inference in factorial designs without exhaustive enumeration.
-
ECUAS$_n$: A family of metrics for principled evaluation of uncertainty-augmented systems
ECUAS_n is a parameterized family of proper scoring rules for jointly assessing prediction accuracy and uncertainty quality in automated decision systems.
-
A Penalty-Free Pipeline for Direct Quantum-Annealer Portfolio Optimization
A penalty-free pipeline samples an objective-only QUBO on D-Wave hardware and enforces cardinality classically, cutting chain-break fractions from 71-92% to at most 0.04% across tested equity and betting instances.
-
Improving ecological inference and uncertainty quantification from camera trap data through the fusion of AI confidences and manual annotations
A Bayesian data-fusion model combines AI predictions and manual labels from camera traps to yield improved ecological inference and uncertainty quantification for white-tailed deer body condition.
-
Scenario generation of intraday electricity price paths for optimal trading in continuous markets
A kernel-based regression model plus scenario generation from forecast errors and a new Support Vector Sorting step produces ensemble price trajectories that improve both statistical accuracy and trading profits over benchmarks on German intraday continuous market data.
-
Multi-Quantile Regression for Extreme Precipitation Downscaling
Q-SRDRN multi-quantile network with pinball loss and per-quantile heads detects extreme precipitation events up to 18 times more effectively than deterministic baselines while preserving augmentation benefits for the median.
-
The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting
Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.
-
Bayesian Modeling and Prediction of Generalized Contact Matrices
A Bayesian model for multi-feature contact matrices that uses tensor structures and contingency table theory to satisfy structural constraints and impute missing contact features, validated on simulations and US/German survey data.
-
Perturbation is All You Need for Extrapolating Language Models
Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.
-
Honest Reporting in Scored Oversight: True-KL0 Property via the Prekopa Principle
For heterogeneous power-p pseudospherical scoring rules with d ≤ 4, the True-KL0 property R(M,p,d) < 1 holds for all M > 1, establishing unconditional DSIC via a Prekopa-based log-concavity argument on the loss integral.
-
CERBERUS: A Three-Headed Decoder for Vertical Cloud Profiles
CERBERUS uses a three-headed encoder-decoder to predict zero-inflated probabilistic vertical radar reflectivity profiles from satellite and meteorological inputs.
-
HealDA: Highlighting the importance of initial errors in end-to-end AI weather forecasts
HealDA supplies ML-based initial conditions for AI weather models that produce forecasts trailing ERA5-initialized runs by less than one day of effective lead time, with the skill gap arising mainly from initial error size.
-
Otter Weather: Skillful and Computationally Efficient Medium-Range Weather Forecasting
Otter Weather is a spatiotemporal model that outperforms NWP baselines by 9.6% at 24h lead with under 3.5 A100-days training and extends efficiency gains to probabilistic forecasting via CRPS.
-
Reliability of Probabilistic Emulation of Physical Systems
CRPS-trained ensembles achieve better uncertainty reliability and speed than latent generative models for probabilistic emulation of 2D physical systems.
-
Variational Proximal Policy Optimization
VP2O maps PPO to SVGD in a MoE architecture using functional kernels and expert orthogonalization, claiming +179 ELO on Codeforces and 32% token reduction on AIME for a 33B/4B model.
-
When Should Forecasting Models Be Re-Specified? A Cost-Sensitive Trigger for Adaptive Model-Form Updating
A cost-sensitive trigger based on specification debt decides when to re-specify forecasting model forms, and on M4 data it matches full-update accuracy at 28% of the compute cost.
-
Controlling False Discovery in Arbitrarily Structured Hypothesis Spaces via Reproducing Kernels
A kernel-based regularized learning framework for FDR control that unifies arbitrary structures and supplies provably valid decision rules with likelihood-based tuning.
-
Soft Learning
Soft Learning optimally combines heterogeneous ML specialists via cross-validated non-negative least squares, achieving top performance on 70% of 37 datasets with formal guarantees and 72-435x CPU speedups over deep networks.
-
A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning
Single-seed CRPS estimates in limited-data BDL show high variance and peaks for heteroscedastic methods, with local variance correlating above 0.96 with single-seed error.
-
Unstable Rankings in Bayesian Deep Learning Evaluation
Bayesian deep learning method rankings are unstable at small sample sizes, dataset-dependent, and require uncertainty-aware evaluation using hierarchical models and minimum detectable difference curves.
-
Adaptive COVID-19 Trajectory Forecasting Using MAB-Inspired Ensemble Weighting
EXP3-based adaptive ensembles achieved the lowest mean weighted interval scores for COVID-19 incidence forecasts compared with individual models and simple ensemble baselines.