EnergyAgentBench is a new benchmark with 70 task variants that evaluates LLM agents on live energy data for datacenter siting, long-horizon optimization, and causal grid diagnosis.
An Introduction to the Bootstrap
10 Pith papers cite this work, alongside 5,583 external citations. Polarity classification is still indexing.
years: 2026 (10)
representative citing papers
SensorFault-Bench is a new CPS-grounded benchmark showing that clean-MSE rankings of forecasting models often disagree with their robustness under standardized sensor-fault scenarios across four real datasets.
The authors propose target-space recovery profiles to diagnose which reproducible dimensions of fMRI brain responses are captured by model predictions, showing that accuracy alone can mask alignment mismatches in visual cortex.
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
Task-aligned supervised geometric stability predicts linear steerability with high accuracy while unsupervised stability detects representational drift earlier and with lower false alarms than CKA or Procrustes.
The OSS Challenge provides benchmarks showing spatiotemporal video models excel at open suturing skill classification and OSATS scoring but struggle with keypoint tracking under occlusion.
Crowdsourced judgments reliably flag authentic videos but frequently miss manipulations and struggle to identify whether changes are audio-only, video-only, or both.
AVVA is a new framework adapting verbal analysis for classroom discourse with triangulation across ten steps and a four-criterion validation scheme for temporal stability, applied to 23 hours of recordings.
The deep SPAR model shows concurrent floods and droughts becoming more likely in the Upper Danube by 2100 under high emissions, with changes in the dependence between catchments contributing substantially to the increase.
Bayesian-ARGOS is a hybrid frequentist-Bayesian method that discovers equations from limited noisy observations more efficiently than SINDy or bootstrap-ARGOS while adding uncertainty quantification.
citing papers explorer
- Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
- Fast and principled equation discovery from chaos to climate
Bayesian-ARGOS is a hybrid frequentist-Bayesian method that discovers equations from limited noisy observations more efficiently than SINDy or bootstrap-ARGOS while adding uncertainty quantification.