hub Mixed citations
In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp
Mixed citation behavior. Most common role is method (59%).
citing papers explorer
-
EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
-
STRABLE: Benchmarking Tabular Machine Learning with Strings
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
-
Learning Causal Orderings for In-Context Tabular Prediction
TabOrder learns unsupervised causal variable orderings and enforces them with order-constrained attention for tabular prediction and imputation under distribution shifts.
-
TIDAL: Recovering Temporal Phase for Cloud Block Storage Placement from LLM-Derived Semantics
TIDAL recovers temporal phase signals from LLM-derived semantics of provisioning metadata to enable complementary CVD placement, reducing overload frequency by 79.1% on production traces.
-
TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data
TabPFN-MT is a multitask in-context learner for tabular data that sets a new state-of-the-art on deep multitask learning for datasets under 1000 samples while reducing inference cost from O(T) to O(1) passes.
-
Probabilistic Seasonal Streamflow Forecasting Across California's Sierra Nevada Watersheds with Agentic AI
An agentic AI workflow evolves an adaptive XGBoost quantile regression ensemble that reduces watershed-averaged forecast error by up to 29% versus California's operational forecasts for April-July runoff at 1-6 month leads across 23 Sierra Nevada sites.
-
Quantifying Sensitivity for Tree Ensembles: A symbolic and compositional approach
A compositional algebraic decision diagram algorithm quantifies sensitivity in decision tree ensembles with certified error and confidence bounds, outperforming model counters on benchmarks.
-
V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction
V4FinBench is a new million-record benchmark where imbalance-aware finetuned TabPFN matches or beats gradient boosting on long-horizon bankruptcy prediction while Llama-3-8B lags, with evidence of transferable patterns to US data.
-
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
-
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to fall back to the original model.
-
Beyond ECE: Calibrated Size Ratio, Risk Assessment, and Confidence-Weighted Metrics
Introduces Calibrated Size Ratio (CSR) and confidence-weighted metrics to better detect overconfidence risk and calibration issues beyond the limitations of ECE.
-
Probabilistic Spectral Reconstruction of Trans-Neptunian Objects from Sparse Photometry: A Framework for Taxonomy, Survey Optimization, and Outlier Detection
Probabilistic PCA latent-space model with Bayesian inference reconstructs TNO near-IR spectra from photometry, achieving 95% credible-interval coverage and supporting taxonomy plus survey optimization.
-
Preferences of a Voice-First Nation: Large-Scale Pairwise Evaluation and Preference Analysis for TTS in Indian Languages
A controlled pairwise evaluation framework for multilingual TTS in 10 Indic languages produces a preference leaderboard using Bradley-Terry modeling and SHAP analysis on 120K+ comparisons.
-
Lund Plane to Bloch (LP2B) Encoding for Object and Polarization Tagging with Quantum Jet Substructure
LP2B encoding converts Lund plane jet representations into Bloch sphere qubit states, enabling a QTTN that matches classical LundNet performance on polarization tagging and W/top tagging with three orders of magnitude fewer parameters and improved low-data regime results.
-
Spot-and-Scoot: Peeking Into Spot Instance Availability
Spot-and-Scoot collects binary and quantitative spot availability signals by canceling provisioning requests before instances run, achieving F1-macro scores up to 0.90 for current availability and 0.85 at 60-minute horizons across AWS and Azure.
-
Measurements of electroweak production of a photon in association with two jets in proton-proton collisions at $\sqrt{s}$ = 13 TeV
First observation, at greater than 5 sigma significance, of electroweak photon plus two jets production; the measured cross section of 202 fb is consistent with the standard model prediction of 177 fb.
-
Search for light pseudoscalar bosons, pair-produced in Higgs boson decays in the four-electron final state in proton-proton collisions at $\sqrt{s}$ = 13 TeV
No excess is observed; this first LHC search sets 95% CL upper limits on the H → AA → 4e branching fraction down to $10^{-5}$ for masses of 10-100 MeV and short lifetimes.
-
FeDa4Fair: Client-Level Federated Datasets for Fairness Evaluation
FeDa4Fair is a new library and benchmark for creating federated datasets with heterogeneous client-level biases to standardize evaluation of fairness methods in federated learning.
-
Proxy-Based Approximation of Shapley and Banzhaf Interactions
ProxySHAP approximates higher-order Shapley and Banzhaf interactions via tree proxies plus residual correction and a polynomial-time interventional TreeSHAP generalization for tree ensembles.
-
Toto 2.0: Time Series Forecasting Enters the Scaling Era
Toto 2.0 is a family of open time series foundation models that demonstrates reliable scaling and sets new state-of-the-art results on three forecasting benchmarks.
-
Set-Valued Policy Learning
The paper develops set-valued policies and conformal policy learning methods that output treatment sets with marginal coverage guarantees for robust decision-making under uncertainty.
-
Nonparametric inference for sublevel-set probabilities of conditional average treatment effect functions
Develops Grenander-type and debiased machine learning estimators for the sublevel-set probability curve of the CATE function, shown to be non-pathwise differentiable, along with its piecewise linear approximation.
-
Semantic Feature Segmentation for Interpretable Predictive Maintenance in Complex Systems
Semantic segmentation decomposes monitoring features into canonical and residual components that concentrate fault-predictive information while preserving operational meaning in predictive maintenance.
-
Predicting Channel Closures in the Lightning Network with Machine Learning
Simple MLPs using temporal and behavioral features from gossip data predict Lightning Network channel closure types better than temporal graph neural networks.
-
Plan Before You Trade: Inference-Time Optimization for RL Trading Agents
FPILOT optimizes pre-trained RL trading policies at inference time using forecasted price trajectories to improve portfolio allocations and risk-adjusted returns on the DJ30 benchmark.
-
Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE
Static malware classifiers learn packing artifacts and dataset composition biases rather than malicious semantics, as diagnosed by TRUSTEE interpretability across controlled dataset variations.
-
Predicting co-segregation in multicomponent alloys with solute-solute interactions
An extended dual-solute framework predicts co-segregation bounds in multicomponent alloys by machine-learning pairwise segregation energies that include solute-solute interactions and is validated on magnesium systems.
-
ACT: Anti-Crosstalk Learning for Cross-Sectional Stock Ranking via Temporal Disentanglement and Structural Purification
ACT disentangles temporal scales in stock sequences and purifies structural relations in graphs to achieve state-of-the-art cross-sectional stock ranking on CSI300 and CSI500 with up to 74.25% improvement.
-
Option Pricing on Noisy Intermediate-Scale Quantum Computers: A Quantum Neural Network Approach
A compact 2-qubit QNN approximates Black-Scholes-Merton option prices with usable accuracy when executed on multiple commercial NISQ quantum processors.
-
ParamBoost: Gradient Boosted Piecewise Cubic Polynomials
ParamBoost improves GAMs by fitting piecewise cubic polynomials via gradient boosting and supports constraints for continuity, monotonicity, convexity, and feature interactions.
-
AIBuildAI: An AI Agent for Automatically Building AI Models
AIBuildAI uses a manager agent and three LLM sub-agents to fully automate AI model development and achieves a 63.1% medal rate on MLE-Bench, matching experienced human engineers.
-
Natural Language Embeddings of Synthesis and Testing conditions Enhance Glass Dissolution Prediction
Natural language embeddings of synthesis and testing conditions improve ML predictions of glass dissolution rates and enable generalization to out-of-distribution compositions with new elements.
-
ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold
ReSS extracts decision paths from trees as scaffolds to guide LLM reasoning generation, fine-tunes the LLM on the resulting dataset with scaffold-invariant augmentation, and reports up to 10% gains on medical and financial tabular benchmarks with new faithfulness metrics.
-
Highly boosted dielectron identification in proton-proton collisions at $\sqrt{s}$ = 13 TeV
CMS develops two multivariate models for identifying boosted dielectrons with γ_L > 20, reporting 80% efficiency for two-track cases from J/ψ data and 60% for single-track from Z conversions, plus an energy correction.
-
KumoRFM-2: Scaling Foundation Models for Relational Learning
KumoRFM-2 pre-trains on synthetic and real relational data across row, column, foreign-key and cross-sample axes, injects task information early, and achieves up to 8% gains over supervised baselines on 41 benchmarks in few-shot and fine-tuned regimes while handling billion-scale datasets.
-
SemiCharmTag: a tool for Semileptonic Charm tagging
SemiCharmTag achieves a factor of ~4 signal-over-background improvement at 81% efficiency for Drell-Yan muons using secondary-vertex hadron tagging in LHCb proton-proton collision simulations.
-
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection
TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.
-
CivBench: Progress-Based Evaluation for LLMs' Strategic Decision-Making in Civilization V
CivBench trains models on turn-level states in Civilization V to predict victory probabilities, providing a progress-based evaluation of LLM strategic capabilities across 307 games with 7 models.
-
Automatic search for transiting planets in TESS-SPOC FFIs with RAVEN: over 100 newly validated planets and over 2000 vetted candidates
RAVEN validates 118 transiting planets and flags over 2000 candidates from four years of TESS data on 2.2 million stars.
-
Photometric Redshift PDFs via Neural Network Classification for DESI Legacy Imaging Surveys and Pan-STARRS
Neural network classification with CRPS optimization produces calibrated photometric redshift PDFs for DESI Legacy and Pan-STARRS data, achieving σ_NMAD of 0.0153 on LSDR10 and outperforming regression methods.
-
A Dataset for Automatic Vocal Mode Classification
A new multi-microphone annotated dataset for classifying Neutral, Curbing, Overdrive, and Edge vocal modes, on which ResNet18 baselines reach 81.3% balanced accuracy.
-
From Coordinates to Context: An LLM-Bootstrapped Semantic Encoding Framework for Privacy-Preserving Mobile Sensing Stress Recognition
A privacy-aware semantic encoding framework for GPS data in mobile stress recognition maintains performance comparable to non-private baselines while improving privacy by 2-3 times on the GeoLife dataset via LOSO validation.
-
Causal Additive Models with Unobserved Causal Paths and Backdoor Paths
Establishes sufficient conditions for causal direction identification in additive models with unobserved paths and introduces a sound, complete search algorithm.
-
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.
-
Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting
LLMs are highly sensitive to prompt formatting in few-shot settings, with accuracy varying by up to 76 points across formats; FormatSpread samples formats to report performance intervals without requiring access to model weights.
-
PACT: Reducing Alert Fatigue in Low-Prevalence SOC Streams with Triggered Active Learning
PACT reduces benign-normalized false-positive burden by 43% and 21% on AIT-ADS and BOTSv1 benchmarks versus a frozen baseline while issuing 3.8x–5.2x fewer analyst queries than random updating.
-
OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
-
ldmppr: Location Dependent Marked Point Processes in R
ldmppr is an R package providing tools to model, simulate from, and assess goodness-of-fit for location-dependent marked point processes.
-
Inferring stellar metallicity and elemental abundances from kinematic and spectroscopic data using machine learning -- Implications for exoplanet host stars
ML regressors trained on APOGEE DR17 red giants predict C, O, Mg, Si abundances from kinematics and [Fe/H] more accurately than [Fe/H] baseline, with external validation on HARPS FGK dwarfs and reproduction of Galactic chemical evolution trends.
-
An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training
An XGBoost ITS maps multimodal features from 10,360 videos to BARS scores and delivers layered feedback that produced significant skill gains (Cohen's d 0.39-0.90) in a 30-day study of 204 learners.