FLOATBench is a tabular benchmark dataset with 582,120 fatigue labels from 19,404 OpenFAST simulations of three 22 MW FOWT towers, featuring alpha-shape regime partitioning and three evaluation protocols for surrogate models.
URL https://cacm.acm.org/research/datasheets-for-datasets/
Canonical reference. 91% of citing Pith papers cite this work as background.
roles: background 11
representative citing papers
FactoryNet is the first universal pretraining corpus for industrial time-series data with a shared S-E-F-C schema that supports cross-embodiment transfer and competitive anomaly detection.
UA-Legal-Bench is a new five-task benchmark for Ukrainian legal reasoning that demonstrates task-dependent few-shot prompting effects and the need for macro-F1 over accuracy on imbalanced classes.
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Prompts for public-sector LLMs encode value-laden decisions and should be governed through community-maintained Prompt Commons repositories with provenance, licensing, and moderation.
Presents a publicly available multilingual corpus of 1,122 customer service self-help documents in four Nordic languages totaling 274,599 words.
ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
MedVIGIL provides a 300-case evaluation suite with 2,556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.
Know2Guess is a contamination-aware multi-zone benchmark for evaluating LLM knowledge boundaries with explicit abstention expectations and dual parsers.
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms for detect/enforce/recover.
Presents the first formal Subjective Logic framework for uncertainty-aware assessment of dataset-level trustworthiness properties such as bias, evaluated on a traffic sign recognition dataset in centralized and federated settings.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
RoboLineage introduces an agent-native data lifecycle governance system that represents robot policy iteration steps as typed lineage artifacts to improve speed and auditability in real-robot workflows.
The paper presents a threat model, taxonomy, and six-dimension measurement framework for AI sandboxes to clarify valid testing claims for safety, security, and regulatory assurance.
Longitudinal study of 56,800 AI papers finds sixfold increase in code+data sharing from 2014-2024 with inferred reproducibility rising from 28% to 64%.
Instrumented data augments observations with mechanistic models, uncertainty, and counterfactuals to enable causal interventions via Pearl's do-operator in scientific machine learning.
citing papers explorer
Digital Twins Need Feedback
Bidirectional feedback between physical and virtual systems is the defining property of digital twins, serving as an organizing principle for multi-scale hierarchies in biological and social organization.