{"total":17,"items":[{"citing_arxiv_id":"2606.26334","ref_index":63,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Debiasing the Observed Fast Radio Burst Population with the CHIME/FRB Selection Function","primary_cat":"astro-ph.HE","submitted_at":"2026-06-24T19:21:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Analysis of CHIME/FRB Catalog 2 with synthetic injections and a multidimensional selection function yields evidence for a slight downturn in the intrinsic scattering timescale distribution, though flat or rising distributions remain possible.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18686","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ForecastBench-Sim: A Simulated-World Forecasting Benchmark","primary_cat":"cs.AI","submitted_at":"2026-06-17T04:52:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ForecastBench-Sim is a simulated-world benchmark using Freeciv game rollouts to generate resolvable forecasting questions at arbitrary horizons with paired intervention worlds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12268","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Impossibility of Eliciting Latent Knowledge","primary_cat":"cs.AI","submitted_at":"2026-06-10T16:11:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proves that no behavior-dependent feedback training strategy can guarantee an honest agent for latent knowledge even with perfect training 
feedback.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07128","ref_index":47,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A machine-learning-assisted progressive digit-randomness screening framework for detecting non-random patterns in raw numerical research data","primary_cat":"cs.LG","submitted_at":"2026-06-05T10:41:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"FDRS combines digit frequency tests, association metrics, entropy, KL divergence, and ML models to assign risk grades to numerical datasets, showing separation between normal and irregular simulated data with high AUC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05799","ref_index":66,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction","primary_cat":"cs.LG","submitted_at":"2026-06-04T07:27:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CaliDist calibrates LLMs by scaling confidence according to how much predictions change under semantic distractors, cutting average ECE from 23% to 7% on seven NLU benchmarks across six models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04661","ref_index":74,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"CRAFT: Cost-aware Refinement And Front-aware Tuning of Prompts","primary_cat":"cs.CL","submitted_at":"2026-06-03T09:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRAFT is a Pareto-front prompt optimizer that allocates scarce LLM validation calls to 
candidates near the current front using accuracy- and cost-oriented generators plus NSGA-II retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03437","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Large Language Models Are Overconfident in Their Own Responses","primary_cat":"cs.CL","submitted_at":"2026-06-02T10:20:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Instruction-tuned LLMs exhibit an ownership bias, assigning up to 26% higher confidence to their own responses than identical user-provided answers; reframing the answer as user input during elicitation reduces overconfidence by up to 26%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27712","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Prefix-Safe Bayesian Belief Tracking for LLM Reasoning Reliability:Separating Calibration from Ranking","primary_cat":"cs.AI","submitted_at":"2026-05-26T21:37:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SBBT separates Brier-score calibration gains from AUROC ranking gains in prefix-conditioned success estimation for LLM math reasoning, with structure-aware signals yielding up to +0.110 AUROC over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24748","ref_index":8,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Deep Learning-Enabled Prediction of Geoeffective CMEs Using SOHO and SDO 
Observations","primary_cat":"astro-ph.SR","submitted_at":"2026-05-23T21:54:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A CNN-based fusion model trained on multi-instrument solar observations predicts geoeffective CMEs, achieving mean TSS of 0.703 and Brier score of 0.095 via five-fold cross-validation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18858","ref_index":47,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Individually Calibrated Models Become Collectively Miscalibrated","primary_cat":"cs.LG","submitted_at":"2026-05-14T05:25:16+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Individually calibrated predictors become collectively miscalibrated under Brier-optimal strategic responses with positive belief correlations, but VCG aggregation restores dominant-strategy incentive compatibility and near-optimal performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13730","ref_index":19,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Robust and Explainable Bicuspid Aortic Valve Diagnosis Using Stacked Ensembles on Echocardiography","primary_cat":"cs.LG","submitted_at":"2026-05-13T16:10:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Stacked video ensemble model distinguishes BAV from TAV on PLAX cine loops with outer-CV F1 of 0.907 using Grad-CAM and SHAP for explainability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13595","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Inducing Artificial Uncertainty in 
Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-13T14:30:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07671","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Endogeneity of Miscalibration: Impossibility and Escape in Scored Reporting","primary_cat":"cs.GT","submitted_at":"2026-05-08T12:42:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Non-affine approval functions create unavoidable miscalibration in proper scoring rules for strategic agents, but step-function thresholds enable first-best screening without it, uniquely for the Brier score.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"as certifiers who choose how much information to reveal. Our framework complements Lizzeri's by studying what happens when the intermediary can misreport (not just withhold), disciplined by a proper scoring mechanism. 1.6 Related Literature Proper scoring rules and elicitation theory. The characterization of strictly proper scoring rules originates with de Finetti [17], Brier [13], McCarthy [44], and Savage [55]. The definitive modern treatment is Gneiting and Raftery [27]. Schervish [56] provides the general characterization linking properness to convex functions. Lambert et al. [37] uses conjugate duality to characterize elicitable properties of probability distributions. 
The connection between proper scoring rules and convex"},{"citing_arxiv_id":"2604.25196","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Knowledge-Data Dually Driven Paradigm for Accurate Landslide Susceptibility Prediction under Data-Scarce Conditions Using Geomorphic Priors and Tabular Foundation Model","primary_cat":"cs.LG","submitted_at":"2026-04-28T04:05:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A knowledge-data dual paradigm using geomorphic priors and a tabular foundation model achieves baseline-level landslide susceptibility prediction accuracy with only 30% of typical data in tested regions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"These algorithms were selected as they represent the most extensively validated conventional paradigms in current landslide susceptibility literature (Youssef et al., 2016; Chen et al., 2017). Detailed configurations for these baselines, alongside the rigorous mathematical formulations and justification for the complementary AUC-ROC (Reichenbach et al., 2018) and Brier Score (Brier, 1950) evaluation metrics, are provided in the Supporting Information A. 
The five repetitions effectively suppress variance attributable to any single random partition, rendering the reported performance metrics (AUC and Brier Score) highly objective and reproducible, and ultimately translating into susceptibility maps with smoother, more reliable spatial transitions."},{"citing_arxiv_id":"2604.23114","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Tale of Two Variances: When Single-Seed Benchmarks Fail in Bayesian Deep Learning","primary_cat":"cs.LG","submitted_at":"2026-04-25T02:52:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Single-seed CRPS estimates in limited-data BDL show high variance and peaks for heteroscedastic methods, with local variance correlating above 0.96 to single-seed error.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16238","ref_index":41,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction","primary_cat":"cs.LG","submitted_at":"2026-04-17T16:58:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Probabilistic bias correction doubles AI subseasonal forecast skill and wins a 2025 international competition by correcting biases in ECMWF models for pressure, temperature, and precipitation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"preparedness for hazards such as floods, heatwaves, droughts, storms, and cold snaps. Figure 5 summarizes the significant gains in week 4 extreme forecasting skill when PBC is applied to the dynamical ensemble from ECMWF. 
Here, we define extreme events as those verifying in the lowest or highest quintiles and measure skill using the Brier skill score (BSS) [41], a standard measure of improvement over a climatological baseline for a binary event. For each target variable and extreme, Figure 5a reports average global BSS over the years 2016-2024, while Figure 5b displays the spatial BSS distribution over the same time period. PBC substantially boosts the positive BSS of the raw ECMWF ensemble for precipitation extremes and again transforms its negative-skill temperature"},{"citing_arxiv_id":"2509.12760","ref_index":4,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Similarity-Distance-Magnitude Activations","primary_cat":"cs.LG","submitted_at":"2025-09-16T07:19:38+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}