Chapman and Hall/CRC, New York (1994)
10 Pith papers cite this work, alongside 5,583 external citations. Polarity classification is still indexing.
Citing papers by year: 2026 (10)
Citing papers
- EnergyAgentBench: Benchmarking LLM Agents on Live Energy Infrastructure Data
  EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
- Benchmarking Sensor-Fault Robustness in Forecasting
  SensorFault-Bench is a new CPS-grounded benchmark showing that clean-MSE rankings of forecasting models often disagree with their robustness under standardized sensor-fault scenarios across four real datasets.
- Beyond Prediction Accuracy: Target-Space Recovery Profiles for Evaluating Model-Brain Alignment
  The authors propose target-space recovery profiles to diagnose which reproducible dimensions of fMRI brain responses are captured by model predictions, showing that accuracy alone can mask alignment mismatches in visual cortex.
- Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
  CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
- The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability
  Task-aligned supervised geometric stability predicts linear steerability with high accuracy while unsupervised stability detects representational drift earlier and with lower false alarms than CKA or Procrustes.
- OSS: Open Suturing Skills Vision-Based Assessment Challenge 2024-2025
  The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
- Beyond Seeing Is Believing: On Crowdsourced Detection of Audiovisual Deepfakes
  Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.
- Audio Video Verbal Analysis (AVVA) for Capturing Classroom Dialogues
  AVVA is a new framework adapting verbal analysis for classroom discourse with triangulation across ten steps and a four-criterion validation scheme for temporal stability, applied to 23 hours of recordings.
- Exploring climate change effects on concurrent floods and concurrent droughts via statistical deep learning
  The deep SPAR model shows concurrent floods and droughts becoming more likely in the Upper Danube by 2100 under high emissions, with changes in the dependence between catchments contributing substantially to the increase.
- Fast and principled equation discovery from chaos to climate
  Bayesian-ARGOS is a hybrid frequentist-Bayesian method that discovers equations from limited noisy observations more efficiently than SINDy or bootstrap-ARGOS while adding uncertainty quantification.