hub Canonical reference

URL https://cacm.acm.org/research/datasheets-for-datasets/

Gebru, T · 2021 · DOI 10.1145/3458723

Canonical reference. 91% of citing Pith papers cite this work as background.

49 Pith papers citing it

Background 91% of classified citations

citation-role summary

background 11

citation-polarity summary

background 10 unclear 1

representative citing papers

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue

cs.AI · 2026-05-25 · unverdicted · novelty 8.0

FLOATBench is a tabular benchmark dataset with 582,120 fatigue labels from 19,404 OpenFAST simulations of three 22 MW FOWT towers, featuring alpha-shape regime partitioning and three evaluation protocols for surrogate models.

FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models

cs.LG · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

FactoryNet is the first universal pretraining corpus for industrial time-series data with a shared S-E-F-C schema that supports cross-embodiment transfer and competitive anomaly detection.

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

UA-Legal-Bench is a new five-task benchmark for Ukrainian legal reasoning that demonstrates task-dependent few-shot prompting effects and the need for macro-F1 over accuracy on imbalanced classes.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.

MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.

Causal state binding predicts action control in language agents

cs.AI · 2026-05-10 · unverdicted · novelty 7.0 · 3 refs

Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.

ProactBench: Beyond What The User Asked For

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.

CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation

cs.AI · 2026-02-24 · unverdicted · novelty 7.0

CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Prompts for Public-Sector LLMs Should Be Governed as Commons

cs.CY · 2026-05-30 · unverdicted · novelty 6.0

Prompts for public-sector LLMs encode value-laden decisions and should be governed through community-maintained Prompt Commons repositories with provenance, licensing, and moderation.

Telenor Nordics Customer Service self-help corpus

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

Presents a publicly available multilingual corpus of 1,122 customer service self-help documents in four Nordic languages totaling 274,599 words.

ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

cs.CR · 2026-05-17 · conditional · novelty 6.0

Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

Rollout Cards: A Reproducibility Standard for Agent Research

cs.AI · 2026-05-12 · conditional · novelty 6.0

Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

cs.CV · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.

Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

Know2Guess is a contamination-aware multi-zone benchmark for evaluating LLM knowledge boundaries with explicit abstention expectations and dual parsers.

Auditable Agents

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms for detect/enforce/recover.

Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias

cs.LG · 2025-08-19 · unverdicted · novelty 6.0

Presents the first formal Subjective Logic framework for uncertainty-aware assessment of dataset-level trustworthiness properties such as bias, evaluated on a traffic sign recognition dataset in centralized and federated settings.

PaLM: Scaling Language Modeling with Pathways

cs.CL · 2022-04-05 · accept · novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

RoboLineage: Agent-Native Data Lifecycle Governance Across Robot Policy Iterations

cs.RO · 2026-06-20 · unverdicted · novelty 5.0

RoboLineage introduces an agent-native data lifecycle governance system that represents robot policy iteration steps as typed lineage artifacts to improve speed and auditability in real-robot workflows.

AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework

cs.CR · 2026-06-16 · unverdicted · novelty 5.0

The paper presents a threat model, taxonomy, and six-dimension measurement framework for AI sandboxes to clarify valid testing claims for safety, security, and regulatory assurance.

The Shift Toward Open and Reproducible AI Research

cs.AI · 2026-06-15 · unverdicted · novelty 5.0 · 2 refs

Longitudinal study of 56,800 AI papers finds sixfold increase in code+data sharing from 2014-2024 with inferred reproducibility rising from 28% to 64%.

Instrumented data for causal scientific machine learning

cs.LG · 2026-06-05 · unverdicted · novelty 5.0

Instrumented data augments observations with mechanistic models, uncertainty, and counterfactuals to enable causal interventions via Pearl's do-operator in scientific machine learning.

citing papers explorer

Showing 49 of 49 citing papers.

FLOATBench: A Dataset and Benchmark for Floating Offshore Wind Turbine Tower Fatigue cs.AI · 2026-05-25 · unverdicted · none · ref 40
FLOATBench is a tabular benchmark dataset with 582,120 fatigue labels from 19,404 OpenFAST simulations of three 22 MW FOWT towers, featuring alpha-shape regime partitioning and three evaluation protocols for surrogate models.
FactoryNet: A Large-Scale Dataset toward Industrial Time-Series Foundation Models cs.LG · 2026-05-09 · unverdicted · none · ref 8 · 2 links
FactoryNet is the first universal pretraining corpus for industrial time-series data with a shared S-E-F-C schema that supports cross-embodiment transfer and competitive anomaly detection.
UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning cs.CL · 2026-05-27 · unverdicted · none · ref 5
UA-Legal-Bench is a new five-task benchmark for Ukrainian legal reasoning that demonstrates task-dependent few-shot prompting effects and the need for macro-F1 over accuracy on imbalanced classes.
OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents cs.AI · 2026-05-27 · unverdicted · none · ref 10
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
MemLens: Benchmarking Multimodal Long-Term Memory in Large Vision-Language Models cs.CV · 2026-05-14 · unverdicted · none · ref 99
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
Causal state binding predicts action control in language agents cs.AI · 2026-05-10 · unverdicted · none · ref 24 · 3 links
Causal state binding is introduced as a framework that predicts action control in language agents, validated across large benchmarks and SWE-bench Lite where adding the measure raised issue-to-file hit@3 AUC from 0.873 to 0.935.
ProactBench: Beyond What The User Asked For cs.LG · 2026-05-09 · unverdicted · none · ref 110
ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
CausalReasoningBenchmark: A Real-World Benchmark for Disentangled Evaluation of Causal Identification and Estimation cs.AI · 2026-02-24 · unverdicted · none · ref 40
CausalReasoningBenchmark supplies 173 real-world queries that separately grade causal identification specifications and point estimates to expose distinct failure modes in automated causal systems.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 282
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Prompts for Public-Sector LLMs Should Be Governed as Commons cs.CY · 2026-05-30 · unverdicted · none · ref 14
Prompts for public-sector LLMs encode value-laden decisions and should be governed through community-maintained Prompt Commons repositories with provenance, licensing, and moderation.
Telenor Nordics Customer Service self-help corpus cs.CL · 2026-05-26 · unverdicted · none · ref 3
Presents a publicly available multilingual corpus of 1,122 customer service self-help documents in four Nordic languages totaling 274,599 words.
ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 7
ChronoMedKG builds a temporal biomedical KG with 460k evidence-linked triples across 13k diseases using LLM consensus and introduces the ChronoTQA benchmark showing RAG gains on time-sensitive questions.
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents cs.CR · 2026-05-17 · conditional · none · ref 100
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
Rollout Cards: A Reproducibility Standard for Agent Research cs.AI · 2026-05-12 · conditional · none · ref 1
Rollout cards preserve complete agent rollout records and declare the reporting rules behind scores, enabling reproducible evaluation where changing only the rule can alter success rates by over 20 percentage points.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation cs.AI · 2026-05-11 · unverdicted · none · ref 5
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence cs.CV · 2026-05-08 · unverdicted · none · ref 4 · 2 links
MedVIGIL provides a 300-case evaluation suite with 2556 probes that measures silent failures in medical VLMs under broken evidence, showing the best model at 69.2 on the composite score versus a human radiologist at 83.3.
Know2Guess: A Contamination-Aware Multi-Zone Benchmark for Knowledge-Boundary Evaluation in Large Language Models cs.CL · 2026-04-30 · unverdicted · none · ref 3
Know2Guess is a contamination-aware multi-zone benchmark for evaluating LLM knowledge boundaries with explicit abstention expectations and dual parsers.
Auditable Agents cs.AI · 2026-04-07 · unverdicted · none · ref 5
No agent system can be accountable without auditability, which requires five dimensions (action recoverability, lifecycle coverage, policy checkability, responsibility attribution, evidence integrity) and mechanisms for detect/enforce/recover.
Assessing Trustworthiness of AI Training Dataset using Subjective Logic -- A Use Case on Bias cs.LG · 2025-08-19 · unverdicted · none · ref 7
Presents the first formal Subjective Logic framework for uncertainty-aware assessment of dataset-level trustworthiness properties such as bias, evaluated on a traffic sign recognition dataset in centralized and federated settings.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 50
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
RoboLineage: Agent-Native Data Lifecycle Governance Across Robot Policy Iterations cs.RO · 2026-06-20 · unverdicted · none · ref 51
RoboLineage introduces an agent-native data lifecycle governance system that represents robot policy iteration steps as typed lineage artifacts to improve speed and auditability in real-robot workflows.
AI Sandboxes: A Threat Model, Taxonomy, and Measurement Framework cs.CR · 2026-06-16 · unverdicted · none · ref 50
The paper presents a threat model, taxonomy, and six-dimension measurement framework for AI sandboxes to clarify valid testing claims for safety, security, and regulatory assurance.
The Shift Toward Open and Reproducible AI Research cs.AI · 2026-06-15 · unverdicted · none · ref 28 · 2 links
Longitudinal study of 56,800 AI papers finds sixfold increase in code+data sharing from 2014-2024 with inferred reproducibility rising from 28% to 64%.
Instrumented data for causal scientific machine learning cs.LG · 2026-06-05 · unverdicted · none · ref 31
Instrumented data augments observations with mechanistic models, uncertainty, and counterfactuals to enable causal interventions via Pearl's do-operator in scientific machine learning.
Is US Defense Acquisition Ready to Acquire AI-Enabled Capabilities? Assessing the DoD Software Acquisition Pathway Through a Scenario-Based Policy Analysis cs.SE · 2026-06-05 · unverdicted · none · ref 19
Scenario-based analysis finds the DoD Software Acquisition Pathway offers a foundation for AI but leaves AI-specific controls for data and oversight distributed in supplemental documents rather than core program mechanisms.
ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree cs.CR · 2026-05-31 · accept · none · ref 12
Analysis of 67,453 OpenClaw skills shows three scanners overlap on at most 10.4% of combined positives, with 81.9% flagged by only one scanner and distinct profiles for malicious versus suspicious skills.
Ontological Knowledge Blocks: Executable Compliance and Profile-Based Validation for Trustworthy AI Systems cs.AI · 2026-05-22 · conditional · none · ref 6
Ontological Knowledge Blocks formalize regulatory obligations as 5-tuples linking RDF/OWL schemas, SHACL rules, evidence requirements and provenance, with a compiler enabling profile-based validation demonstrated in an HPC allocation scenario.
The Quiet Path from Seemingly Minor Design Errors to Workplace AI Incidents cs.HC · 2026-05-20 · unverdicted · none · ref 38
Empirical analysis of 1,524 AI incident reports shows 83% arise from worker-AI trait misalignments, with 74% of those traceable to developers prioritizing efficiency over precision or personalization.
Beyond Model Readiness: Institutional Readiness for AI Deployment in Public Systems cs.CY · 2026-05-17 · unverdicted · none · ref 12
Introduces the Institutional Alignment Readiness (IAR) framework with five dimensions to evaluate institutional deployment readiness for AI in public systems, motivated by two anonymized education-sector cases.
Voices in the Loop: Mapping Participatory AI cs.AI · 2026-05-16 · unverdicted · none · ref 21
Authors build a harmonized, geolocated atlas of participatory AI projects from existing and new sources, documenting geographic concentration and participation mostly at problem formulation and evaluation stages while providing update and governance mechanisms.
Evaluating Structured Documentation as a Tool for Reflexivity in Dataset Development cs.CY · 2026-05-11 · unverdicted · none · ref 60
Structured dataset documentation shows little engagement with major reflexivity themes from FAccT literature, leading to a new codebook and extended datasheet questions.
How Time-Sensitive are IoBNT Networks? An Age of Information Perspective for In-Body Monitoring eess.SP · 2026-05-11 · unverdicted · none · ref 46
IoBNT networks with blood-borne nanosensors deliver fresh biomarker data to external monitors within tens of seconds under realistic conditions, suiting tissue-level but not cellular-scale monitoring.
Exploring CoCo Challenges in ML Engineering Teams: Insights From the Semiconductor Industry cs.SE · 2026-05-08 · unverdicted · none · ref 6
Interviews in a semiconductor company reveal 16 collaboration and communication challenges in ML engineering teams, with unclear roles and responsibilities as the top issue, and list effective mitigation practices under hardware-driven constraints.
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels cs.LG · 2026-05-07 · unverdicted · none · ref 18
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
Accountable Agents in Software Engineering: An Analysis of Terms of Service and a Research Roadmap cs.SE · 2026-05-06 · unverdicted · none · ref 17
Comparative review of AI coding tool ToS shows responsibility for code quality and compliance shifted to users, with policy misalignment for autonomous agents, plus a research roadmap.
Reflections on Traceability for Visualization Research cs.HC · 2026-04-15 · conditional · none · ref 26
Visualization researchers propose traceability—recording abundant annotated artifacts, reporting curated research threads, and enabling reading via interfaces—as a way to ensure rigor and transparency in inherently unreproducible design processes.
Task Decomposition for Efficient Annotation cs.CL · 2026-06-23 · unverdicted · none · ref 223
Decomposing annotation tasks using centers from centering theory reduces aggregate inferential load via a degrees-of-freedom model and enables better sub-task allocation.
Digital Twins Need Feedback cs.LO · 2026-06-22 · unverdicted · none · ref 8
Bidirectional feedback between physical and virtual systems is the defining property of digital twins, serving as an organizing principle for multi-scale hierarchies in biological and social organization.
PaperMentor: A Human-Centered Multi-Agent Writing Tutor for AI Research Papers on Overleaf cs.CL · 2026-06-07 · unverdicted · none · ref 9
A multi-agent writing tutor for Overleaf that uses 12 agents and an expert skill library to generate inline comments, with a 14-user study reporting 90.6% actionable and 67.5% valid comments that outperform a GPT-5.2 baseline.
Are Algorithm Registers Transparent? Perspectives from Germany cs.CY · 2026-06-01 · unverdicted · none · ref 25
Audit of two German algorithm registers using checklists from a 2025 proposal finds they require adaptations to meet proposed transparency goals.
Methodology for Creating a Clinically Verified Dermoscopic Image Dataset cs.CV · 2026-05-24 · unverdicted · none · ref 15
Describes a methodology and the resulting dataset of 1,026 dermoscopic images with structured metadata and verified diagnostic labels for medical informatics research.
AIMBio-Mat: An AI-Native FAIR Platform for Closed-Loop Materials Discovery and Biomedical Translation physics.app-ph · 2026-05-20 · unverdicted · none · ref 26
AIMBio-Mat is a conceptual blueprint for an AI-native, FAIR, governance-aware decision layer that formulates biomedical-materials discovery as constrained multi-objective optimization under uncertainty.
Pluralistic-Alignment Urbanism: Operationalizing a Right to AI for Inclusive Public Space cs.CY · 2026-05-15 · unverdicted · none · ref 30
Introduces PAU as a governance architecture for municipal AI in public spaces, informed by case studies on subgroup-aware scaling (R2=0.89) and pluralistic preference data that treats neutrality as indeterminacy.
We Need Strong Preconditions For Using Simulations In Policy cs.CY · 2026-04-09 · unverdicted · none · ref 26
Societal-scale LLM agent simulations for policy need three preconditions: avoid neutral treatment of marginalized population simulations, require population participation, ensure accountability, plus development and deployment reports.
AI to Learn 2.0: A Deliverable-Oriented Governance Framework and Maturity Rubric for Opaque AI in Learning-Intensive Domains cs.AI · 2026-03-16 · conditional · none · ref 24
AI to Learn 2.0 is a deliverable-oriented framework with a seven-dimension maturity rubric and capability-evidence ladder that permits opaque AI for exploration but requires final outputs to be auditable, transferable, and supported by human-attributable evidence.
Human-aligned AI Model Cards with Weighted Hierarchy Architecture cs.SE · 2025-10-08 · unverdicted · none · ref 17
Introduces CRAI-MCF, an eight-module framework distilling 217 parameters from 240 projects into a quantitative sufficiency criterion for cross-model LLM comparison grounded in Value Sensitive Design.
Building a Regional Data-Centric Materials Science Ecosystem for Processing-Rich Materials Innovation in the Great Plains cond-mat.mtrl-sci · 2026-05-19 · unverdicted · none · ref 52
Proposes a regional data-centric materials science ecosystem for the Great Plains, identifying five barriers to data sharing and outlining a staged roadmap illustrated by a high-purity germanium pilot.
Evaluation of AI Ethics Tools in Language Models: A Developers' Perspective Case Study cs.CY · 2025-12-16 · unreviewed · ref 27
LLM Harms: A Taxonomy and Discussion cs.CY · 2025-12-05 · unreviewed · ref 17
