Canonical reference
URL: https://cacm.acm.org/research/datasheets-for-datasets/
91% of citing Pith papers cite this work as background.
Fields: cs.AI 8, cs.CY 5, cs.LG 4, cs.CL 3, cs.SE 3, cs.HC 2, cond-mat.mtrl-sci 1, cs.CR 1, cs.CV 1, eess.SP 1
Roles: background 11
Citing papers
- FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models
FactoryNet is the first universal pretraining corpus for industrial time-series data with a shared S-E-F-C schema that supports cross-embodiment transfer and competitive anomaly detection.
- ProactBench: Beyond What The User Asked For
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation
CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.
- OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that perform comparably to GPT-3 at one-seventh the carbon footprint, along with code and training logs.
- ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning
ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.
- ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
- Rollout Cards: A Reproducibility Standard for Agent Research
Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.
- Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
- MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
MedVIGIL provides a 300-case evaluation suite with 2,556 probes that measures silent failures of medical VLMs under broken visual evidence; the best model scores 69.2 on the composite metric, versus 83.3 for a human radiologist.
- Auditable Agents
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, and evidence integrity) and mechanisms to detect, enforce, and recover.
- Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias
Presents the first formal Subjective Logic framework for uncertainty-aware assessment of dataset-level trustworthiness properties such as bias, evaluated on a traffic sign recognition dataset in centralized and federated settings.
- PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued benefits from scaling, setting new state-of-the-art few-shot results on hundreds of benchmarks and outperforming average human performance on BIG-bench.
- Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems
Ontological Knowledge Blocks formalize regulatory obligations as 5-tuples linking RDF/OWL schemas, SHACL rules, evidence requirements and provenance, with a compiler enabling profile-based validation demonstrated in an HPC allocation scenario.
- The Quiet Path from Seemingly Minor Design Errors to Workplace AI Incidents
Empirical analysis of 1,524 AI incident reports shows 83% arise from worker-AI trait misalignments, with 74% of those traceable to developers prioritizing efficiency over precision or personalization.
- Beyond Model Readiness: Institutional Readiness for AI Deployment in Public Systems
Introduces the Institutional Alignment Readiness (IAR) framework, which uses five dimensions to evaluate whether institutions are ready to deploy AI in public systems, motivated by two anonymized education-sector cases.
- Voices in the Loop: Mapping Participatory AI
Builds a harmonized, geolocated atlas of participatory AI projects from existing and new sources, documenting their geographic concentration and finding that participation happens mostly at the problem-formulation and evaluation stages; the atlas comes with mechanisms for updates and governance.
- Evaluating Structured Documentation as a Tool for Reflexivity in Dataset Development
Finds that structured dataset documentation engages little with the major reflexivity themes of the FAccT literature, and proposes a new codebook and extended datasheet questions in response.
- How Time-Sensitive are IoBNT Networks? An Age of Information Perspective for In-Body Monitoring
IoBNT networks with blood-borne nanosensors deliver fresh biomarker data to external monitors within tens of seconds under realistic conditions, suiting tissue-level but not cellular-scale monitoring.
- Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry
Interviews at a semiconductor company reveal 16 collaboration and communication challenges in ML engineering teams, with unclear roles and responsibilities as the top issue, and identify effective mitigation practices under hardware-driven constraints.
- When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
Formalizes benchmarkless LLM safety scoring and validates it through an instrumental-validity chain of contrast separation, target-variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
- Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap
A comparative review of the terms of service (ToS) of AI coding tools shows that responsibility for code quality and compliance is shifted to users and that these policies are misaligned with autonomous agents; the paper also proposes a research roadmap.
- Reflections on Traceability for Visualization Research
Visualization researchers propose traceability (recording abundant annotated artifacts, reporting curated research threads, and enabling reading via interfaces) as a way to ensure rigor and transparency in inherently unreproducible design processes.
- AIMBio-Mat: An AI-Native FAIR Platform for Closed-Loop Materials Discovery and Biomedical Translation
AIMBio-Mat is a conceptual blueprint for an AI-native, FAIR, governance-aware decision layer that formulates biomedical-materials discovery as constrained multi-objective optimization under uncertainty.
- We Need Strong Preconditions For Using Simulations In Policy
Societal-scale LLM agent simulations for policy need three preconditions: not treating simulations of marginalized populations as neutral, requiring participation by the populations being simulated, and ensuring accountability, along with development and deployment reports.
- AI to Learn 2.0: A Deliverable-Oriented Governance Framework and Maturity Rubric for Opaque AI in Learning-Intensive Domains
AI to Learn 2.0 is a deliverable-oriented framework with a seven-dimension maturity rubric and capability-evidence ladder that permits opaque AI for exploration but requires final outputs to be auditable, transferable, and supported by human-attributable evidence.
- Human-aligned AI Model Cards with Weighted Hierarchy Architecture
Introduces CRAI-MCF, an eight-module framework distilling 217 parameters from 240 projects into a quantitative sufficiency criterion for cross-model LLM comparison grounded in Value Sensitive Design.
- Building a Regional Data-Centric Materials Science Ecosystem for Processing-Rich Materials Innovation in the Great Plains
Proposes a regional data-centric materials science ecosystem for the Great Plains, identifying five barriers to data sharing and outlining a staged roadmap illustrated by a high-purity germanium pilot.
- Causal state binding predicts action control in language agents
- Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study
- LLM Harms: A Taxonomy and Discussion