AdaQE-CG: Adaptive Query Expansion for Web-Scale Generative AI Model and Data Card Generation
Pith reviewed 2026-05-15 10:55 UTC · model grok-4.3
The pith
Adaptive query expansion generates model and data cards that outperform existing automated methods, exceed human-authored data cards, and approach human-level quality for model cards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaQE-CG combines two modules: Intra-Paper Extraction via Context-Aware Query Expansion, which iteratively refines queries to recover richer information directly from papers, and Inter-Card Completion using the MetaGAI Pool, which transfers semantically relevant content from similar cards in a curated collection. Together they produce documentation that outperforms prior automated approaches, exceeds human-authored data cards, and approaches human-level quality for model cards.
What carries the argument
The Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) module, which adapts queries based on paper context, paired with the Inter-Card Completion using the MetaGAI Pool (ICC-MP) module, which fills gaps by transferring content from similar cards.
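A minimal sketch of the IPE-QE idea described above: fold each answer back into the query and re-extract until nothing richer is recovered. The `ask_llm` stub, the field name, and the exact-match convergence rule are illustrative assumptions, not the paper's implementation.

```python
def ask_llm(query, paper_text):
    """Stand-in for an LLM extraction call: return the sentence of the
    paper that shares the most words with the query, or None."""
    best, best_overlap = None, 0
    q_words = set(query.lower().split())
    for sent in paper_text.split("."):
        overlap = len(q_words & set(sent.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sent.strip(), overlap
    return best

def ipe_qe(field, paper_text, max_rounds=3):
    """Context-aware query expansion: refine the query with each answer."""
    query = f"What is the {field} of this model?"
    answer = None
    for _ in range(max_rounds):
        new_answer = ask_llm(query, paper_text)
        if new_answer == answer:  # converged: no richer answer recovered
            break
        answer = new_answer
        # expansion step: fold the current answer back into the query
        query = f"{query} Context: {answer}"
    return answer

paper = ("We train a 7B transformer decoder. The training data is 2T tokens "
         "of web text. Evaluation uses MMLU")
print(ipe_qe("training data", paper))  # → The training data is 2T tokens of web text
```

A real system would replace `ask_llm` with an actual model call and use a semantic rather than exact-match stopping test.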
If this is right
- Dynamic query refinement handles varied paper structures better than fixed templates, reducing missing information in generated cards.
- Cross-card knowledge transfer from a pool of similar cards fills gaps in web-scale repositories with incomplete metadata.
- The MetaGAI-Bench benchmark enables reproducible evaluation of documentation quality across multiple dimensions.
- Higher-quality automated cards support more consistent and transparent documentation for generative AI systems.
- The approach scales to web-scale generation by combining intra-paper adaptation with inter-card completion.
Where Pith is reading between the lines
- The same iterative expansion and transfer pattern could apply to automated generation of other technical summaries in scientific literature.
- Integrating user corrections into the query refinement loop might further reduce omissions in future versions.
- Repositories with growing numbers of cards could see compounding quality gains as the pool for transfer improves over time.
- The method points toward hybrid systems where extraction from source documents and reuse from existing records work together for documentation tasks.
Load-bearing premise
Large language models can expand queries iteratively to extract accurate and complete information from papers without introducing hallucinations or omissions that reduce card quality.
What would settle it
A direct comparison on papers with independently verified ground-truth card content, testing whether the generated cards contain more factual errors or missing details than human-authored versions or prior methods.
Original abstract
Transparent and standardized documentation is essential for building trustworthy generative AI (GAI) systems. However, existing automated methods for generating model and data cards still face three major challenges: (i) static templates, as most systems rely on fixed query templates that cannot adapt to diverse paper structures or evolving documentation requirements; (ii) information scarcity, since web-scale repositories such as Hugging Face often contain incomplete or inconsistent metadata, leading to missing or noisy information; and (iii) lack of benchmarks, as the absence of standardized datasets and evaluation protocols hinders fair and reproducible assessment of documentation quality. To address these limitations, we propose AdaQE-CG, an Adaptive Query Expansion for Card Generation framework that combines dynamic information extraction with cross-card knowledge transfer. Its Intra-Paper Extraction via Context-Aware Query Expansion (IPE-QE) module iteratively refines extraction queries to recover richer and more complete information from scientific papers and repositories, while its Inter-Card Completion using the MetaGAI Pool (ICC-MP) module fills missing fields by transferring semantically relevant content from similar cards in a curated dataset. In addition, we introduce MetaGAI-Bench, the first large-scale, expert-annotated benchmark for evaluating GAI documentation. Comprehensive experiments across five quality dimensions show that AdaQE-CG substantially outperforms existing approaches, exceeds human-authored data cards, and approaches human-level quality for model cards. Code, prompts, and data are publicly available at: https://github.com/haoxuan-unt2024/AdaQE-CG.
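The ICC-MP completion step described in the abstract can be sketched as nearest-card transfer: fill a draft card's empty fields from the most similar card in a pool. The bag-of-words cosine similarity and the toy pool below are illustrative assumptions; the paper matches against its curated MetaGAI Pool with semantic representations.

```python
from collections import Counter
import math

def cosine(a, b):
    """Bag-of-words cosine similarity between two text descriptions."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def complete_card(card, pool):
    """Fill each empty field of `card` from the most similar pooled card."""
    desc = card.get("description", "")
    donor = max(pool, key=lambda c: cosine(desc, c.get("description", "")))
    return {k: v or donor.get(k, "") for k, v in card.items()}

pool = [
    {"description": "vision transformer for image classification",
     "license": "Apache-2.0", "intended_use": "image classification"},
    {"description": "decoder-only language model for chat",
     "license": "MIT", "intended_use": "conversational assistance"},
]
draft = {"description": "large language model for dialogue",
         "license": "", "intended_use": ""}
print(complete_card(draft, pool))
```

The draft's empty `license` and `intended_use` fields are filled from the chat-model card, the closer of the two by description similarity.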
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaQE-CG, an adaptive query expansion framework for generating model and data cards. It consists of IPE-QE for iterative context-aware extraction from papers and repositories, and ICC-MP for filling missing fields via transfer from a curated MetaGAI Pool. The work introduces MetaGAI-Bench as the first large-scale expert-annotated benchmark and reports experiments across five quality dimensions claiming substantial outperformance over existing methods, exceeding human-authored data cards, and approaching human-level model card quality. Code, prompts, and data are released publicly.
Significance. If the results hold, the work addresses a timely and important problem in trustworthy AI by improving automated documentation at web scale. The introduction of MetaGAI-Bench provides a much-needed standardized evaluation resource, and the public release of code and data is a clear strength that supports reproducibility. The adaptive, non-static approach to query expansion and cross-card transfer offers a practical advance over template-based methods.
Major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The claim of outperformance across five quality dimensions is central to the contribution, yet the manuscript provides no specification of the exact metrics, baselines, statistical tests, sample sizes, or error analysis used to support the headline results (outperforming baselines, exceeding human data cards, approaching human model-card quality). Without these details the experimental claims cannot be verified or reproduced.
- [Method (IPE-QE)] IPE-QE module (method description): The iterative context-aware query expansion is presented as recovering richer information, but the paper reports no quantitative audit of extraction fidelity such as precision/recall or hallucination rates against human gold labels on key fields (architecture, training data, metrics). This omission is load-bearing because downstream ICC-MP transfer cannot correct upstream factual errors, directly undermining the quality-score superiority claims on MetaGAI-Bench.
Minor comments (2)
- [Introduction and Method] Notation: The acronyms IPE-QE and ICC-MP are introduced without an explicit table or consistent first-use expansion in all sections, which reduces readability.
- [Related Work] References: Several recent works on data-card generation and LLM-based information extraction are cited only in passing; a more systematic comparison table would clarify novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and positive assessment of the work's significance. We address each major comment below and will revise the manuscript to provide the requested details and analyses.
Point-by-point responses
Referee: [Abstract and Experiments] Abstract and Experiments section: The claim of outperformance across five quality dimensions is central to the contribution, yet the manuscript provides no specification of the exact metrics, baselines, statistical tests, sample sizes, or error analysis used to support the headline results (outperforming baselines, exceeding human data cards, approaching human model-card quality). Without these details the experimental claims cannot be verified or reproduced.
Authors: We agree that the experimental claims require explicit specification for verifiability. The manuscript currently describes the five quality dimensions at a high level without detailing the exact metrics, baseline implementations, statistical tests, sample sizes, or error analysis. In the revised manuscript, we will expand the Experiments section to include precise metric definitions, full baseline details, statistical significance results, exact evaluation sample sizes, and an error analysis to support all headline claims. revision: yes
Referee: [Method (IPE-QE)] IPE-QE module (method description): The iterative context-aware query expansion is presented as recovering richer information, but the paper reports no quantitative audit of extraction fidelity such as precision/recall or hallucination rates against human gold labels on key fields (architecture, training data, metrics). This omission is load-bearing because downstream ICC-MP transfer cannot correct upstream factual errors, directly undermining the quality-score superiority claims on MetaGAI-Bench.
Authors: We acknowledge this is a substantive point. While end-to-end results on MetaGAI-Bench provide indirect evidence, the manuscript lacks a direct quantitative audit of IPE-QE extraction fidelity. We will revise the Method section (and add an appendix if needed) to report precision, recall, and hallucination rates for key fields such as architecture, training data, and metrics, evaluated against human gold labels on a sampled subset of MetaGAI-Bench. This will demonstrate upstream reliability and strengthen the overall claims. revision: yes
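The field-level fidelity audit the authors commit to can be sketched as set comparison of extracted facts against gold labels. The fact sets and the set-overlap scoring below are illustrative assumptions, not the paper's evaluation protocol.

```python
def audit(extracted, gold):
    """Per-field precision, recall, and hallucination rate.

    extracted, gold: dicts mapping field name -> set of atomic facts.
    """
    report = {}
    for field in gold:
        e, g = extracted.get(field, set()), gold[field]
        tp = len(e & g)
        precision = tp / len(e) if e else 0.0
        recall = tp / len(g) if g else 0.0
        # share of extracted facts unsupported by the gold label
        hallucination = 1.0 - precision if e else 0.0
        report[field] = {"precision": precision, "recall": recall,
                         "hallucination_rate": hallucination}
    return report

gold = {"architecture": {"transformer", "7B"},
        "training_data": {"2T tokens", "web text"}}
extracted = {"architecture": {"transformer", "7B", "mixture-of-experts"},
             "training_data": {"2T tokens"}}
for field, scores in audit(extracted, gold).items():
    print(field, scores)
```

Here the spurious "mixture-of-experts" fact lowers architecture precision to 2/3, and the missing "web text" fact lowers training-data recall to 0.5, exactly the failure modes the referee asks the authors to quantify.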
Circularity Check
No circularity: framework and benchmark are independently specified and evaluated
Full rationale
The paper presents AdaQE-CG as a two-module system (IPE-QE for iterative context-aware query expansion from papers and ICC-MP for transferring content from a curated MetaGAI pool) plus a new expert-annotated benchmark MetaGAI-Bench. Performance claims rest on comparative experiments across quality dimensions rather than any derivation that reduces outputs to fitted parameters or self-referential definitions. No equations appear, no predictions are obtained by fitting inputs within the same work, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The approach is therefore self-contained against external baselines and the introduced benchmark.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: LLMs can iteratively refine extraction queries to recover richer information without significant hallucination or omission.
- Domain assumption: Semantically similar cards in the MetaGAI Pool contain transferable content that correctly fills missing fields.