Human-aligned AI Model Cards with Weighted Hierarchy Architecture

arxiv: 2510.06989 · v3 · submitted 2025-10-08 · 💻 cs.SE

Pengyue Yang, Haolin Jin, Qingwen Zeng, Jiawen Wen, Harry Rao, Huaming Chen

Pith reviewed 2026-05-18 09:05 UTC · model grok-4.3

classification 💻 cs.SE

keywords model cards · responsible AI · value sensitive design · LLM documentation · AI model evaluation · documentation framework · cross-model comparison · human-aligned AI

The pith

CRAI-MCF replaces static model cards with an eight-module architecture that supports quantitative comparison of AI models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that inconsistent documentation hinders discovery and adoption of the growing number of LLMs and domain-specific models. It addresses this by introducing the Comprehensive Responsible AI Model Card Framework, derived from an empirical review of 240 open-source projects that yields 217 parameters. These parameters are organized into an eight-module structure grounded in value sensitive design, augmented by a quantitative sufficiency criterion. This setup allows direct, rigorous comparisons across models while balancing technical, ethical, and operational aspects. A sympathetic reader would care because improved documentation could lead to more confident and responsible model selection in practice.

Core claim

CRAI-MCF transitions from static disclosures to actionable, human-aligned documentation by distilling 217 parameters from 240 open-source projects into an eight-module, value-aligned architecture grounded in Value Sensitive Design, and by introducing a quantitative sufficiency criterion that enables rigorous cross-model comparison under a unified scheme while balancing technical, ethical, and operational dimensions.

What carries the argument

The eight-module value-aligned architecture that organizes the 217 parameters to support weighted, human-aligned evaluation and quantitative sufficiency checks.

If this is right

Practitioners can assess and select LLMs with greater operational confidence and integrity.
Models can be compared directly across technical, ethical, and operational dimensions using one scheme.
Documentation becomes actionable rather than purely descriptive, reducing underutilization.
The framework applies to both general LLMs and specialized domain models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Widespread use might prompt platforms to require quantitative fields in model listings.
The approach could extend to closed-source models if they supply equivalent parameter data.
Regulators could adopt the sufficiency criterion as a baseline for transparency requirements.

Load-bearing premise

The parameters extracted from an analysis of 240 open-source projects form a representative set that generalizes to the broader ecosystem of LLMs and domain-specific models.

What would settle it

A side-by-side study measuring whether teams using CRAI-MCF documentation select and adopt models more consistently or with higher satisfaction than teams using conventional static model cards.

Figures

Figures reproduced from arXiv: 2510.06989 by Haolin Jin, Harry Rao, Huaming Chen, Jiawen Wen, Pengyue Yang, Qingwen Zeng.

**Figure 1.** The VSD-anchored research pipeline. … activity within the prior 12 months (a commit, tagged release, or model-card update). De-duplication. We removed forks, mirrors, and template repositories (via GitHub network metadata and README heuristics), collapsed cross-platform duplicates (same canonical model family explicitly cross-referenced across GitHub/Hugging Face), and merged packaging-only variants. When …

**Figure 2.** Tree visualization of CRAI-MCF. The diagram highlights a subset of core parameters under the eight Level-0 modules, …

**Figure 3.** Conceptual integration of the five evaluation di…

**Figure 4.** Task-module documentation heatmap. Lighter/warmer cells denote higher coverage; darker/cooler cells denote lower coverage. Persistent dark bands on Feedback and, for many tasks, Broader Implications indicate ecosystem-level under-reporting of accountability and societal-risk information.

**Figure 5.** Three-domain case comparison. Bars report the …

Original abstract

The proliferation of Large Language Models (LLMs) has led to a burgeoning ecosystem of specialized, domain-specific models. While this rapid growth accelerates innovation, it has simultaneously created significant challenges in model discovery and adoption. Users struggle to navigate this landscape due to inconsistent, incomplete, and imbalanced documentation across platforms. Existing documentation frameworks, such as Model Cards and FactSheets, attempt to standardize reporting but are often static, predominantly qualitative, and lack the quantitative mechanisms needed for rigorous cross-model comparison. This gap exacerbates model underutilization and hinders responsible adoption. To address these shortcomings, we introduce the Comprehensive Responsible AI Model Card Framework (CRAI-MCF), a novel approach that transitions from static disclosures to actionable, human-aligned documentation. Grounded in Value Sensitive Design (VSD), CRAI-MCF is built upon an empirical analysis of 240 open-source projects, distilling 217 parameters into an eight-module, value-aligned architecture. Our framework introduces a quantitative sufficiency criterion to operationalize evaluation and enables rigorous cross-model comparison under a unified scheme. By balancing technical, ethical, and operational dimensions, CRAI-MCF empowers practitioners to efficiently assess, select, and adopt LLMs with greater confidence and operational integrity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proposes a quantitative model card framework distilled from 240 open-source projects but provides no validation or testing of its sufficiency criterion.

The main takeaway is that the authors analyzed 240 open-source AI projects to extract 217 parameters and built them into an eight-module weighted hierarchy for model cards. They call it CRAI-MCF and add a quantitative sufficiency criterion to allow better comparisons between models. This is new in the way it combines Value Sensitive Design with a specific modular structure and numbers for evaluation. Existing model cards are more free-form and qualitative, so this tries to make documentation more actionable and standardized for the growing number of specialized LLMs.

The paper does well in identifying the practical problem of inconsistent docs leading to poor adoption. The motivation section ties the framework back to real issues in model discovery. Grounding it in an empirical analysis of projects gives it some basis rather than pure theory.

Where it falls short is the absence of any validation. The description outlines the framework but provides no results from applying it to models or testing whether the sufficiency criterion actually helps with comparisons. Details on distilling the 217 parameters are missing, making it hard to see if the process was systematic or if biases crept in from the sample. The concern about whether 240 open-source projects represent the broader ecosystem is valid. Proprietary models or those in niche domains might have different priorities that aren't captured here, which could limit how general the architecture really is.

This paper is for people working on AI governance or tool-building for model selection. A practitioner looking for a template to improve their documentation might pick up some ideas, while a researcher could use it as a basis for further development. It deserves peer review because the proposal is concrete and builds on established work. With added experiments or case studies, it could become more solid. The authors show honest engagement with the literature on documentation frameworks.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes the Comprehensive Responsible AI Model Card Framework (CRAI-MCF) to address inconsistent documentation in the LLM ecosystem. Grounded in Value Sensitive Design, the framework is derived from an empirical analysis of 240 open-source projects that distills 217 parameters into an eight-module, value-aligned architecture. It introduces a quantitative sufficiency criterion intended to enable actionable, human-aligned documentation and rigorous cross-model comparison, moving beyond static and predominantly qualitative approaches such as Model Cards and FactSheets.

Significance. If the parameter set proves representative across domains, scales, and licensing regimes and the sufficiency criterion receives empirical validation, the framework could meaningfully improve model discoverability, selection, and responsible adoption by providing a unified, quantitative scheme. The explicit grounding in VSD and the attempt to balance technical, ethical, and operational dimensions are constructive elements; however, the current lack of validation data and process details substantially reduces the assessed significance.

major comments (3)

[§3 (Empirical Analysis and Parameter Distillation)] The central claim that an analysis of 240 open-source projects yields a representative set of 217 generalizable parameters for an eight-module architecture is load-bearing, yet the manuscript provides no details on project selection criteria, diversity metrics (domain, scale, licensing), coding process, or inter-rater reliability. This directly undermines evaluation of the representativeness assumption highlighted in the skeptic note.
[§5 (Quantitative Sufficiency Criterion and Evaluation)] The quantitative sufficiency criterion is presented as enabling rigorous cross-model comparison, but no validation data, testing protocol, or sufficiency threshold derivation is reported. Without these, the claim that CRAI-MCF supports actionable evaluation remains unsupported, consistent with the soundness assessment of 3.0.
[§4 (Framework Architecture)] The title references a 'Weighted Hierarchy Architecture,' yet the manuscript does not specify how weights are assigned, how the hierarchy is constructed from the 217 parameters, or how the eight modules are weighted relative to one another. This omission affects the operationalizability of the framework.

minor comments (2)

[Abstract] The abstract states that the framework 'balances technical, ethical, and operational dimensions' but does not indicate how balance is measured or enforced within the sufficiency criterion.
[§5] Notation for the sufficiency criterion (e.g., any formula or scoring function) should be introduced with explicit definitions and an example calculation for at least one model.
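
To make the request in the last minor comment concrete, here is a minimal sketch of what an example calculation could look like. It assumes, hypothetically, that the sufficiency criterion is a weighted coverage score: each module scores the weighted share of its parameters that a card documents, and the card score is a weighted mean over modules compared against a threshold. The paper as summarized here gives no formula, so the `Parameter` fields, the parameter names, all weights, and the 0.6 threshold are illustrative assumptions. Only the module names Feedback and Broader Implications are taken from the Figure 4 caption.

```python
from dataclasses import dataclass


@dataclass
class Parameter:
    name: str          # hypothetical parameter name
    weight: float      # hypothetical importance within its module
    documented: bool   # True if the model card fills in this parameter


def module_score(params):
    """Weighted share of documented parameters in one module, in [0, 1]."""
    total = sum(p.weight for p in params)
    if total == 0:
        return 0.0
    return sum(p.weight for p in params if p.documented) / total


def card_score(modules, module_weights):
    """Weighted mean of module scores, normalised by total module weight."""
    total = sum(module_weights[m] for m in modules)
    weighted = sum(module_weights[m] * module_score(ps) for m, ps in modules.items())
    return weighted / total


def is_sufficient(modules, module_weights, threshold=0.6):
    """Assumed sufficiency check: card score meets an illustrative threshold."""
    return card_score(modules, module_weights) >= threshold


# Two of the eight modules, with made-up parameters and weights.
card = {
    "Feedback": [
        Parameter("contact_channel", 1.0, True),
        Parameter("issue_tracker", 1.0, False),
    ],
    "Broader Implications": [
        Parameter("known_risks", 2.0, True),
        Parameter("misuse_cases", 1.0, True),
        Parameter("environmental_impact", 1.0, False),
    ],
}
weights = {"Feedback": 1.0, "Broader Implications": 3.0}

print(card_score(card, weights))     # 0.6875
print(is_sufficient(card, weights))  # True
```

Here Feedback scores 0.5 and Broader Implications scores 0.75, so the card score is (1 × 0.5 + 3 × 0.75) / 4 = 0.6875. A real criterion might instead require a minimum per module, which would change the verdict for cards that are strong overall but empty in one module.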

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments and recommendations. We address each of the major comments in detail below, indicating the revisions we plan to make to strengthen the manuscript.

Referee: [§3 (Empirical Analysis and Parameter Distillation)] The central claim that an analysis of 240 open-source projects yields a representative set of 217 generalizable parameters for an eight-module architecture is load-bearing, yet the manuscript provides no details on project selection criteria, diversity metrics (domain, scale, licensing), coding process, or inter-rater reliability. This directly undermines evaluation of the representativeness assumption highlighted in the skeptic note.

Authors: We acknowledge the importance of transparency in the empirical analysis. The manuscript currently summarizes the outcomes of the analysis but does not elaborate on the methodology. In the revised version, we will expand Section 3 to include: (1) explicit project selection criteria, (2) diversity metrics covering domain, scale, and licensing regimes, (3) a description of the coding and distillation process, and (4) inter-rater reliability statistics. These additions will allow readers to better assess the representativeness of the 217 parameters.
revision: yes

Referee: [§5 (Quantitative Sufficiency Criterion and Evaluation)] The quantitative sufficiency criterion is presented as enabling rigorous cross-model comparison, but no validation data, testing protocol, or sufficiency threshold derivation is reported. Without these, the claim that CRAI-MCF supports actionable evaluation remains unsupported, consistent with the soundness assessment of 3.0.

Authors: We agree that empirical validation is essential for the sufficiency criterion. The current presentation focuses on the derivation from the VSD-grounded analysis and the conceptual framework. For the revision, we will add a new subsection in §5 detailing the testing protocol, provide validation data from applying the criterion to selected LLMs, and explain the process for deriving the sufficiency threshold. We note that a comprehensive validation across all domains would be extensive and may be addressed in future work, but initial results will be included.
revision: yes

Referee: [§4 (Framework Architecture)] The title references a 'Weighted Hierarchy Architecture,' yet the manuscript does not specify how weights are assigned, how the hierarchy is constructed from the 217 parameters, or how the eight modules are weighted relative to one another. This omission affects the operationalizability of the framework.

Authors: We appreciate this observation regarding operationalizability. The weighted hierarchy is intended to reflect the relative importance derived from the empirical data. In the revised manuscript, we will provide explicit details on: the method for assigning weights to parameters and modules (e.g., based on occurrence frequency and VSD alignment scores), the construction of the hierarchy from the 217 parameters into the eight modules, and the relative weighting between modules. This will enhance the framework's practical applicability.
revision: yes
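
The last response names occurrence frequency as one possible basis for weights. The sketch below illustrates only that idea, assuming a parameter's weight is its share of all observed parameter mentions across the sampled projects. The function name and the sample data are hypothetical, and VSD alignment scores are left out because nothing in the material here defines them.

```python
from collections import Counter


def frequency_weights(observations):
    """Turn parameter mentions into normalised weights.

    `observations` holds one parameter name per project that documents it
    (an assumed input format). Weights sum to 1 when any mentions exist.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


# Hypothetical sample: three projects document a license, one documents known risks.
mentions = ["license", "license", "license", "known_risks"]
print(frequency_weights(mentions))  # {'license': 0.75, 'known_risks': 0.25}
```

Weights derived purely from frequency reward what projects already document, so rarely reported items such as those in the Feedback module would receive low weight. That is presumably why the response pairs frequency with a second, value-based signal.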

Circularity Check

0 steps flagged

No significant circularity; derivation rests on external empirical analysis

The paper derives the CRAI-MCF eight-module architecture and 217 parameters explicitly from an empirical analysis of 240 open-source projects, which functions as an independent external input rather than an internal self-definition, fitted prediction, or self-citation chain. No equations, ansatzes, or load-bearing self-references appear in the claims that would reduce the quantitative sufficiency criterion or value-aligned structure back to the paper's own outputs by construction. The central claim therefore remains self-contained against the stated external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper rests on the domain assumption that Value Sensitive Design provides the correct lens for model documentation and that the sampled projects are representative; it introduces the CRAI-MCF as a new organizing structure without external falsifiable tests mentioned.

axioms (1)

domain assumption: Value Sensitive Design is an appropriate foundational approach for creating human-aligned AI model documentation.
Explicitly stated as the grounding for the CRAI-MCF architecture in the abstract.

invented entities (1)

CRAI-MCF framework (no independent evidence)
purpose: To operationalize quantitative, value-aligned model cards with a sufficiency criterion
Newly introduced eight-module architecture distilled from project analysis.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Breath1024.lean period8 / 8-tick periodicity in reality_from_one_distinction echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

distilling 217 parameters into an eight-module, value-aligned architecture


[1] [1]

NIST AI. 2023. Artificial intelligence risk management framework (AI RMF 1.0). URL: https://nvlpubs. nist. gov/nistpubs/ai/nist. ai(2023), 100–1

work page 2023

[2] [2]

Chronos: Learning the Language of Time Series

Abdul Fatir Ansari, Lorenzo Stella, Caner Turkmen, Xiyuan Zhang, Pedro Mercado, Huibin Shen, Oleksandr Shchur, Syama Sundar Rangapuram, Sebas- tian Pineda Arango, Shubham Kapoor, Jasper Zschiegner, Danielle C. Maddix, Hao Wang, Michael W. Mahoney, Kari Torkkola, Andrew Gordon Wilson, Michael Bohlke-Schneider, and Yuyang Wang. 2024. Chronos: Learning the L...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

Varshney, Yunfeng Wei, and James Winter

Matthew Arnold, Rachel KE Bellamy, Michael Hind, Stephanie Houde, Samir Mehta, Aleksandra Mojsilović, Ravi Nair, Karthikeyan Natesan Ramamurthy, John Richards, Jason Tsay, Kush R. Varshney, Yunfeng Wei, and James Winter

work page

[4] [4]

Model cards for model reporting

FactSheets: Increasing Trust in AI Services through Supplier’s Declarations of Conformity. InProceedings of the Conference on Fairness, Accountability, and Transparency (FAccT). ACM, 77–87. doi:10.1145/3287560.3287596

work page doi:10.1145/3287560.3287596

[5] [5]

Sourav Banerjee, Ayushi Agarwal, and Eishkaran Singh. 2024. The Vulnera- bility of Language Model Benchmarks: Do They Accurately Reflect True LLM Performance? arXiv:2412.03597 [cs.CL] https://arxiv.org/abs/2412.03597

work page arXiv 2024

[6] [6]

Amna Batool, Didar Zowghi, and Muneera Bano. 2023. Responsible AI gover- nance: a systematic literature review.arXiv preprint arXiv:2401.10896(2023). Human-aligned AI Model Cards with Weighted Hierarchy Architecture ,

work page arXiv 2023

[7] [7]

Bender and Batya Friedman

Emily M. Bender and Batya Friedman. 2018. Data Statements for Natural Lan- guage Processing: Toward Mitigating System Bias and Enabling Better Science. InProceedings of the Conference on Fairness, Accountability, and Transparency (FAccT). ACM, 587–604. doi:10.1145/3287560.3287576

work page doi:10.1145/3287560.3287576 2018

[8] [8]

Rishi Bommasani, Hannah Hudson, Eric Klyman, et al. 2023. The Foundation Model Transparency Index.arXiv preprint arXiv:2310.12941(2023). https://arxiv. org/abs/2310.12941

work page arXiv 2023

[9] [9]

Joel Castaño, Silverio Martínez-Fernández, Xavier Franch, and Justus Bogner

work page

[10] [10]

InProceedings of the 21st International Conference on Mining Software Repositories

Analyzing the evolution and maintenance of ml models on hugging face. InProceedings of the 21st International Conference on Mining Software Repositories. 607–618

work page

[11] [11]

Kasia Chmielinski, Sarah Newman, Chris N Kranzinger, Michael Hind, Jen- nifer Wortman Vaughan, Margaret Mitchell, Julia Stoyanovich, Angelina McMillan-Major, Emily McReynolds, Kathleen Esfahany, et al. 2024. The CLeAR Documentation Framework for AI Transparency.Harvard Kennedy School Shoren- stein Center Discussion Paper(2024)

work page 2024

[12] [12]

2013.Statistical power analysis for the behavioral sciences

Jacob Cohen. 2013.Statistical power analysis for the behavioral sciences. routledge

work page 2013

[13] [13]

Jesse Dodge, Suchin Gururangan, Dallas Card, Roy Schwartz, and Noah A. Smith

work page

[14] [14]

InPro- ceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

Show Your Work: Improved Reporting of Experimental Results. InPro- ceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2185–2194. doi:10.18653/v1/D19-1224

work page doi:10.18653/v1/d19-1224

[15] [15]

John Estdale and Elli Georgiadou. 2018. Applying the ISO/IEC 25010 quality mod- els to software product. InEuropean Conference on Software Process Improvement. Springer, 492–503

work page 2018

[16] [16]

Kahn, and Alan Borning

Batya Friedman, Peter H. Kahn, and Alan Borning. 2002. Value Sensitive Design: Theory and Methods.University of Washington Technical Report02-12-01 (2002). https://vsdesign.org/

work page 2002

[17] [17]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. Datasheets for Datasets. Commun. ACM64, 12 (2021), 86–92. doi:10.1145/3458723

work page doi:10.1145/3458723 2021

[18] [18]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card.arXiv preprint arXiv:2410.21276(2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

International Organization for Standardization. 2024. ISO/IEC 25002:2024 Systems and software engineering — Systems and software Quality Requirements and Evaluation (SQuaRE) — Quality model overview and usage. https://www.iso. org/standard/78175.html Provides an overview and usage guidance for quality models within the SQuaRE series

work page 2024

[20] Benjamin Laufer, Hamidah Oderinwale, and Jon Kleinberg. 2025. Anatomy of a Machine Learning Ecosystem: 2 Million Models on Hugging Face. arXiv:2508.06811 [cs.SI] https://arxiv.org/abs/2508.06811

[21] Weixin Liang, Nazneen Rajani, Xinyu Yang, Ezinwanne Ozoani, Eric Wu, Yiqun Chen, Daniel Scott Smith, and James Zou. 2024. Systematic analysis of 32,111 AI model cards characterizes documentation practice in AI. Nature Machine Intelligence 6, 7 (2024), 744–753

[22] Rebecca Linke. 2017. Design thinking, explained. Ideas Made to Matter (2017)

[23] Yizhen Luo, Jiahuan Zhang, Siqi Fan, Kai Yang, Yushuai Wu, Mu Qiao, and Zaiqing Nie. 2023. BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine. arXiv:2308.09442 [cs.CE] https://arxiv.org/abs/2308.09442

[24] David R Mandel, Tonya L Hendriks, and Daniel Irwin. 2022. Policy for promoting analytic rigor in intelligence: professionals’ views and their psychological correlates. Intelligence and National Security 37, 2 (2022), 177–196

[25] Nestor Maslej, Loredana Fattorini, Raymond Perrault, Yolanda Gil, Vanessa Parli, Njenga Kariuki, Emily Capstick, Anka Reuel, Erik Brynjolfsson, John Etchemendy, Katrina Ligett, Terah Lyons, James Manyika, Juan Carlos Niebles, Yoav Shoham, Russell Wald, Tobi Walsh, Armin Hamrah, Lapo Santarlasci, Julia Betts Lotufo, Alexandra Rome, Andrew Shi, and Sukrut O...

[26] Margaret Mitchell, Simone Wu, Andrew Zaldivar, Parker Barnes, Ben Vasserman, Ben Hutchinson, Elena Spitzer, Inioluwa Deborah Raji, and Timnit Gebru. 2019. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency (FAccT). ACM, 220–229. doi:10.1145/3287560.3287596

[27] OECD. 2019. Forty-Two Countries Adopt New OECD Principles on Artificial Intelligence

[28] Yashothara Shanmugarasa, Ming Ding, Chamikara Mahawaga Arachchige, and Thierry Rakotoarivelo. 2025. SoK: The Privacy Paradox of Large Language Models: Advancements, Privacy Risks, and Mitigation. In Proceedings of the 20th ACM Asia Conference on Computer and Communications Security (ASIA CCS ’25). ACM, 425–441. doi:10.1145/3708821.3733888

[29] Jan Philip Wahle, Terry Ruas, Saif M Mohammad, Norman Meuschke, and Bela Gipp. 2023. AI usage cards: Responsibly reporting AI-generated content. In 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL). IEEE, 282–284

[30] Laura Weidinger, Jonathan Uesato, Maribeth Rauh, Christopher Griffin, Po-Sen Huang, John Mellor, William Cheng, Amelia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. 2021. Ethical and Social Risks of Harm from Language Models. arXiv preprint arXiv:2112.04359 (2021). https://arxiv.org/abs/2112.04359

[31] Justin D Weisz, Jessica He, Michael Muller, Gabriela Hoefer, Rachel Miles, and Werner Geyer. 2024. Design principles for generative AI applications. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems. 1–22

[32] Amy Winecoff and Miranda Bogen. 2025. Improving governance outcomes through AI documentation: Bridging theory and practice. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. 1–18

[33] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. 2023. FinGPT: Open-Source Financial Large Language Models. arXiv:2306.06031 [q-fin.ST] https://arxiv.org/abs/2306.06031

[34] Kai Zhang, Xiang Meng, Xue Yan, Jun Ji, Jun Liu, Hao Xu, Hao Zhang, Dong Liu, Jing Wang, Xiao Wang, Jian Gao, Yong Wang, Chang Shao, Wen Wang, Jie Li, Ming Zheng, Yu Yang, and Yue Tang. 2025. Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine. J Med Internet Res 27, 1 (2025), e59069. doi:10.2196/59069

[35] Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223 1, 2 (2023)