Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models
Large language models (LLMs) show significant potential in healthcare, prompting numerous benchmarks to evaluate their capabilities. However, concerns persist regarding the reliability of these benchmarks, which often lack clinical fidelity, robust data management, and safety-oriented evaluation metrics. To address these shortcomings, we introduce MedCheck, the first lifecycle-oriented assessment framework specifically designed for medical benchmarks. Our framework deconstructs a benchmark's development into five continuous stages, from design to governance, and provides a comprehensive checklist of 46 medically-tailored criteria. Using MedCheck, we conducted an in-depth empirical evaluation of 53 medical LLM benchmarks. Our analysis uncovers widespread, systemic issues, including a profound disconnect from clinical practice, a crisis of data integrity due to unmitigated contamination risks, and a systematic neglect of safety-critical evaluation dimensions like model robustness and uncertainty awareness. Based on these findings, MedCheck serves as both a diagnostic tool for existing benchmarks and an actionable guideline to foster a more standardized, reliable, and transparent approach to evaluating AI in healthcare.
Forward citations
Cited by 2 Pith papers
-
Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
Small open-weight language models can self-optimize prompts for clinical named entity recognition in dental notes, reaching micro F1 of 0.864 after DPO on Qwen2.5-14B.
-
Measuring What Matters: Benchmarking Generative, Multimodal, and Agentic AI in Healthcare
Healthcare AI benchmarks show high scores on medical exams but sharply lower performance on real clinical tasks such as documentation and decision support, indicating a need for better frameworks to measure reliability…
Reference graph
Works this paper leans on
-
[1]
Medec: A benchmark for medical error detection and correction in clinical notes. Preprint, arXiv:2501.03465.
-
[2]
Machine learning data practices through a data curation lens: An evaluation framework. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1055–1067.
-
[3]
A survey on evaluation of large language models. In Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 2: Short Papers), pages 88–109.
-
[4]
ER-Reason: A Benchmark Dataset for LLM Clinical Reasoning in the Emergency Room
Preprint, arXiv:2505.22919.
-
[5]
CliMedBench: A large-scale Chinese benchmark for evaluating medical large language models in clinical scenarios. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8428–8438, Miami, Florida, USA. Association for Computational Linguistics.
-
[6]
AgentClinic: A multimodal agent benchmark to evaluate AI in simulated clinical environments.
-
[7]
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI
-
[8]
In this equation, $N_{\text{disease}}$ represents the number of diseases in the ICD-11 standard, specifically the first 23 diseases, serving as the reference for the types of diseases considered. $N_{\text{department}}$ denotes the number of medical departments, typically the medical specialties included in the model's evaluation. The variable $N^{\text{benchmark}}_{\text{disease}}$ …
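The equation itself did not survive extraction; given these symbol definitions, a plausible reconstruction of the coverage ratios (an assumption, not the paper's verbatim formula) is:

$$
\text{Coverage}_{\text{disease}} = \frac{N^{\text{benchmark}}_{\text{disease}}}{N_{\text{disease}}},
\qquad
\text{Coverage}_{\text{department}} = \frac{N^{\text{benchmark}}_{\text{department}}}{N_{\text{department}}}
$$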
-
[9]
…did not mention the issue at all.
[Figure: Disease Type Coverage (scores 0–0.7) across benchmarks including AfriMed-QA, GMAI-MMBench, MedRisk, Asclepius, MVME, MMMU (Health & Medicine), VQA-Med, HealthBench, BRIDGE, OmniMedVQA, webMedQA, MedCalc-Bench, MedChain, MedJourney, CMB, CLIMB, MedMCQA, ReasonMed, COGNET-MD, CheXpert, CHBench, SLAKE, MediQ, MedExQA, MLEC-QA, LLMEval-Med, and PathMMU; a second panel is truncated.]
-
[10]
…also reported such metrics, but without clear discussion, or yielded inconclusive findings. The remaining 72% (38 out of 53) did not conduct internal consistency assessments. The absence of such analysis raises concerns regarding evaluation accuracy and fairness. Inconsistent benchmarks may obscure true model strengths or weaknesses across specific skill…
-
[11]
…provided mean and standard deviation across multiple runs, they did not employ formal statistical tests to compare model performance. These findings highlight a lack of rigorous statistical standards in current benchmark design. Incorporating structured multi-run strategies and statistical significance testing is critical to enhance the reliabili…
-
[12]
[Figure 34: Top 15 Benchmarks by Score in Phase IV, panel "Clarity of Evaluation Objectives" (scores 6–10.5): HealthBench, MEDIC, EHRNoteQA, BRIDGE, ClinicBench, MedAgentBench, AgentClinic, MedOdyssey, MedCalc-Bench, MVME, MedR-Bench, TrialPanorama, MedRisk, MedAgentsBench, and CHBench. A second panel (scores 9–18) lists MedS-Bench, MEDIC, BRIDGE, MedAgentBench, HealthBench, MedSafe…]
-
[13]
Clarity of Application Scenario
• Explanation: Benchmark developers should clearly describe the specific medical application scenarios the benchmark corresponds to and explain its potential value.
• Justification: Linking the benchmark to real-world application scenarios ensures that the results are meaningful. It facilitates the validation of the model's effec…
-
[14]
Uniqueness and Novelty
• Explanation: By comparing with relevant benchmarks, benchmark developers should explain their contributions and the uniqueness of the benchmark, such as filling a gap or proposing a new evaluation methodology.
• Justification: Demonstrating uniqueness establishes the necessity and justification of the new benchmark. It…
-
[15]
Target Capability of Evaluation
• Explanation: Benchmark developers should clearly define the capability of LLMs the benchmark intends to evaluate (e.g., text generation, multimodal understanding).
• Justification: Clearly defining the targeted capability helps clarify the scope of the benchmark, which can prevent misuse of the benchmark.
• Scoring:
– 0: Does not define the target LLM capability…
-
[16]
Medical Domain Coverage
• Explanation: Benchmark developers should clearly define the scope of the medical specialties of the benchmark, such as clinical departments, disease types, or task types.
• Justification: Clearly defining the medical scope helps users better understand the breadth and depth of the benchmark's coverage. It also help…
-
[17]
Demonstration of User Needs
• Explanation: The benchmark should reflect the core concerns and assessment needs of its target users, such as addressing specific clinical challenges or overcoming technical obstacles.
• Justification: An effective benchmark should serve the needs of users. Evidence of user needs justifies the development of the benchmark…
-
[18]
Domain Experts Involvement
• Explanation: Domain experts (e.g., physicians or clinical researchers) should be involved in the development of the benchmark.
• Justification: Due to the professionalism and rigor required in the medical field, the development of a benchmark must involve deep engagement from domain experts. Their expertise is fundamental to…
-
[19]
Authoritative Knowledge Sources
• Explanation: Benchmark developers must clearly specify which authoritative medical knowledge sources (e.g., clinical guidelines, textbooks, medical databases) the benchmark content is based on.
• Justification: Benchmark content should adhere to recognized, evidence-based medical knowledge sources. By using authoritati…
-
[20]
Medical Standards Alignment
• Explanation: Benchmarks should follow internationally or industry-recognized medical standards (e.g., ICD, SNOMED CT, LOINC) when medical terminology is involved.
• Justification: Adherence to medical standards ensures clinical relevance and consistency, facilitating integration and comparison that reflect real-world medic…
-
[21]
Validity of Core Metric
• Explanation: The core performance metric should be clearly defined and closely related to the clinical task or medical capability being assessed.
• Justification: Evaluation metrics directly shape the interpretation of results. Choosing a suitable metric and explaining its relevance ensures a shared understanding and comprehe…
-
[22]
Multi-dimensional Evaluation
• Explanation: Apart from correctness, benchmark developers should include evaluation of other important dimensions (e.g., safety, fluency).
• Justification: In the high-stakes medical domain, going beyond correctness is vital. Multi-dimensional evaluation offers a more comprehensive assessment of whether a mode…
-
[23]
Safety and Fairness Considerations
• Explanation: Benchmarks should include evaluation of potential risks (e.g., unsafe recommendations, toxicity) and bias (e.g., gender, ethnicity) in model outputs.
• Justification: Evaluating risks and fairness facilitates bias-free, safe, and equitable applications of LLMs in the medical domain, promoting respons…
-
[24]
Data Source Transparency and Traceability
• Explanation: Benchmark developers should clearly state the data source of the benchmark, along with relevant traceability information (e.g., data collection time frame, platforms).
• Justification: Clear and traceable data sources are critical to ensure transparency and ethical data usage, which is especiall…
-
[25]
Data Source Reliability
• Explanation: Benchmark developers should clearly describe the selection criteria and explain the reliability of the data source.
• Justification: Collecting data from unreliable sources may lead to invalid results for medical applications. By explaining the reliability of the data source, the credibility of the benchmark can be…
-
[26]
Data Authenticity
• Explanation: Benchmark developers should clearly specify whether the data comes from real-world scenarios, is synthetically generated, or is a mixture of both. For synthetically generated data, the construction process and the verification of its authenticity (e.g., expert review) should also be described.
• Justification: Real-world da…
-
[27]
Dataset Representativeness
• Explanation: The representativeness of key features (e.g., patient age, disease) of the dataset should be explained and statistically analyzed.
• Justification: A benchmark that lacks representativeness may lead to bias in evaluation, reducing the clinical relevance, generalizability, and fairness of the results.
• Scoring:
– …
-
[28]
Dataset Diversity
• Explanation: The benchmark should have diverse coverage and provide quantitative evidence of the diversity of diseases or medical departments covered; see the sketch after this entry.
• Justification: Ensuring the dataset covers a variety of cases helps comprehensively evaluate the model's generalization ability, reducing bias in the evaluation results.
• Scoring:
– 0: D…
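To make the "quantitative evidence" concrete, here is a minimal sketch (the labels, category list, and field names are hypothetical, not from the paper) that computes category coverage and normalized Shannon entropy over disease labels:

```python
import math
from collections import Counter

def diversity_stats(disease_labels, reference_categories):
    """Coverage ratio and normalized Shannon entropy of a label distribution."""
    counts = Counter(disease_labels)
    coverage = len(counts) / len(reference_categories)  # fraction of categories represented
    total = sum(counts.values())
    probs = [c / total for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    max_entropy = math.log2(len(reference_categories))  # uniform over all categories
    return coverage, entropy / max_entropy

# Hypothetical usage: one coarse disease label per benchmark item
labels = ["neoplasms", "circulatory", "circulatory", "respiratory", "neoplasms"]
chapters = ["neoplasms", "circulatory", "respiratory", "digestive", "nervous"]
print(diversity_stats(labels, chapters))  # -> (0.6, ~0.66)
```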
-
[29]
Data Cleaning and Standardization
• Explanation: The processes and steps of data preprocessing, including data cleaning and standardization, should be clearly described.
• Justification: It ensures that the final dataset is well-structured, enhancing reliability. It also allows a better understanding of the construction process of the benchmark, ensu…
-
[30]
Privacy Protection
• Explanation: If sensitive data is used, it should be de-identified. Methods of de-identification should be described, and compliance with relevant regulations (e.g., HIPAA) should be clearly stated; a toy illustration follows below.
• Justification: Real-world clinical data may contain patient information. It is essential to ensure that the data and privacy protectio…
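For illustration only, a toy rule-based de-identification pass; the regex patterns and placeholder scheme are assumptions, and a real pipeline would rely on validated de-identification tools plus expert audit rather than a few regexes:

```python
import re

# Simplified PHI patterns -- a real pipeline needs far broader coverage
# (names, addresses, record numbers) and human review.
PHI_PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def deidentify(note: str) -> str:
    """Replace matched PHI spans with category placeholders."""
    for label, pattern in PHI_PATTERNS.items():
        note = pattern.sub(f"[{label}]", note)
    return note

print(deidentify("Pt seen 03/14/2024, callback 555-867-5309."))
# -> "Pt seen [DATE], callback [PHONE]."
```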
-
[31]
Data Format Clarity and Consistency
• Explanation: The data in the dataset, including questions, cases, or task descriptions, should be written clearly and presented in a consistent format.
• Justification: A clear and consistent format is essential for standardized evaluation. Inconsistent formats may affect models' interpretation, compromising…
-
[32]
Data Review and Audit
• Explanation: The dataset construction process should include a review procedure with involvement of medical experts.
• Justification: A dataset construction process without a review mechanism is prone to errors. Careful review that involves medical experts can ensure the clinical relevance, professionalism, and reliability of the da…
-
[33]
Quality of Reference Answer
• Explanation: The benchmark should provide clear and accurate reference answers or scoring guidelines, and explain their construction and verification process (e.g., expert consensus).
• Justification: Clear reference answers or scoring guidelines ensure transparent and accurate evaluation. By explaining how reference…
-
[34]
Data Contamination Prevention
• Explanation: Benchmark developers should take actions to identify, address, and prevent potential data contamination issues; one common screen is sketched below.
• Justification: Data contamination may lead to inflated performance that reflects memorization instead of medical capability. Preventing and controlling potential con…
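A common contamination screen, shown here as an assumed illustration rather than the paper's method, is word n-gram overlap between benchmark items and a candidate training corpus:

```python
def ngrams(text: str, n: int = 13):
    """Set of word n-grams; 13-grams are a common contamination heuristic."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(item: str, corpus_ngrams: set, n: int = 13) -> bool:
    """Flag a benchmark item if any of its n-grams appears in the corpus."""
    return bool(ngrams(item, n) & corpus_ngrams)

# Hypothetical usage: precompute corpus_ngrams once over the training data,
# then screen every benchmark question before release.
corpus_ngrams = ngrams(open("training_corpus.txt").read())
flagged = [q for q in benchmark_questions if is_contaminated(q, corpus_ngrams)]
```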
-
[35]
User-friendliness of Evaluation Tools
• Explanation: Evaluation scripts or tools that are easy to obtain and use should be provided.
• Justification: This ensures that users can use the benchmark conveniently, promoting benchmark adoption and ensuring fair, transparent, and consistent evaluation.
• Scoring:
– 0: No evaluation scripts or tools are provided…
-
[36]
Technical Reproducibility
• Explanation: Tools and technical documentation (e.g., environment configuration, dependency versions) should be provided.
• Justification: The availability of evaluation tools and clear technical documentation ensures reproducibility of reported results, thereby enhancing the credibility of the benchmark.
• Scoring:
– 0…
-
[37]
Provision of Performance Baselines
• Explanation: Benchmark developers should provide multiple meaningful performance baselines, such as random baselines, baseline models, and human performance.
• Justification: Providing different performance baselines allows comparison against the model's performance, enabling a deeper understanding and better interpretabi…
-
[38]
Reasoning Process Evaluation
• Explanation: Apart from the final answer, the benchmark should include evaluation of the model's reasoning process or explanation abilities.
• Justification: In the medical domain, understanding the model's decision-making process is just as important as the final answer. Evaluating the reasoning process helps ensure that the…
-
[39]
Robustness Evaluation
• Explanation: Benchmarks should include testing of the model's stability and robustness (e.g., input perturbations, adversarial samples); a minimal harness is sketched below.
• Justification: In practical applications, models may encounter different variations of inputs. Robustness testing ensures the model's output is consistent and reliable under different conditi…
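A minimal sketch of such perturbation testing, assuming a hypothetical `model.answer()` interface; production robustness suites would use clinically validated perturbations rather than this toy word swap:

```python
import random

def perturb(question: str, seed: int = 0) -> str:
    """Light lexical perturbation: swap one randomly chosen adjacent word pair."""
    rng = random.Random(seed)
    words = question.split()
    if len(words) > 3:
        i = rng.randrange(len(words) - 2)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def robustness_rate(model, questions, n_variants: int = 5) -> float:
    """Fraction of items whose answer is unchanged across perturbed variants."""
    stable = 0
    for q in questions:
        base = model.answer(q)  # hypothetical model interface
        if all(model.answer(perturb(q, s)) == base for s in range(n_variants)):
            stable += 1
    return stable / len(questions)
```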
-
[40]
Generalization Capability Evaluation
• Explanation: The benchmark design should help evaluate the generalization capability of models to unseen data or scenarios (e.g., careful train/test splits, out-of-distribution testing).
• Justification: Due to the high variability in medical scenarios, assessing a model's generalization capability is essential for…
-
[41]
Uncertainty Evaluation
• Explanation: Benchmarks should include evaluation of the model's ability to recognize and express its own uncertainty (e.g., responding "I don't know"); one assumed scoring scheme is sketched below.
• Justification: Incorrect, overconfident answers can be dangerous in high-stakes medical applications. A model's ability to accurately identify and express uncer…
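One assumed way to operationalize uncertainty awareness in scoring (not prescribed by the paper): treat an honest abstention as neutral, so it is penalized less than a confident wrong answer:

```python
def abstention_aware_score(pred: str, gold: str,
                           wrong_penalty: float = -1.0,
                           abstain_score: float = 0.0) -> float:
    """+1 for correct, 0 for an honest abstention, -1 for a confident wrong answer."""
    if pred.strip().lower() in {"i don't know", "uncertain"}:
        return abstain_score
    return 1.0 if pred.strip().lower() == gold.strip().lower() else wrong_penalty
```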
-
[42]
Evaluation Flexibility
• Explanation: Evaluation code should have a modular interface and support different models, including closed-source models that require API access and local open-source models; a minimal adapter sketch follows below.
• Justification: It ensures that different types of models can be tested under the same interface.
• Scoring:
– 0: Supports only one type of model and is di…
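A minimal sketch of such a modular interface; the `APIModel`/`LocalModel` adapters and the `client.complete` call are hypothetical names, not any particular library's API:

```python
from abc import ABC, abstractmethod

class ModelAdapter(ABC):
    """Uniform interface so API-based and local models share one harness."""
    @abstractmethod
    def generate(self, prompt: str) -> str: ...

class APIModel(ModelAdapter):
    def __init__(self, client, model_name: str):
        self.client, self.model_name = client, model_name
    def generate(self, prompt: str) -> str:
        return self.client.complete(self.model_name, prompt)  # hypothetical client call

class LocalModel(ModelAdapter):
    def __init__(self, pipeline):
        self.pipeline = pipeline  # e.g., a local inference callable
    def generate(self, prompt: str) -> str:
        return self.pipeline(prompt)

def evaluate(model: ModelAdapter, items):
    """Run the same benchmark items through any adapter."""
    return [model.generate(item["prompt"]) for item in items]
```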
-
[43]
Knowledge and Skill Coverage
• Explanation: Evidence (e.g., expert evaluation) is provided to demonstrate that the benchmark content sufficiently covers the medical knowledge and skills it claims to assess.
• Justification: Sufficient coverage of the core medical competencies the benchmark aims to measure is the prerequisite for establishing content val…
-
[44]
Scenario Authenticity
• Explanation: The evaluation tasks of the benchmark should effectively simulate and mirror real-world medical scenarios.
• Justification: Ensuring that the evaluation task closely mirrors the targeted clinical practice in real-world scenarios enhances the relevance of the benchmark. It helps assess whether the model can be transferr…
-
[45]
Model Discrimination Ability
• Explanation: Benchmark developers should provide experimental data and analysis indicating that the benchmark can effectively distinguish the capabilities of different models.
• Justification: An effective benchmark should be capable of differentiating and distinguishing models of varying capabilities. It provides me…
-
[46]
Correlation with Clinical Performance
• Explanation: Benchmark developers should provide evidence that preliminarily explores the correlation between benchmark performance and the model's performance in actual clinical applications.
• Justification: Validating whether benchmark results meaningfully indicate the model's performance in real-…
-
[47]
Internal Consistency
• Explanation: For different sections or items of the benchmark evaluating the same capability, benchmark developers should demonstrate good internal consistency (e.g., Cronbach's α); a minimal computation is sketched below.
• Justification: It ensures that different sections or items reliably assess the same capability, enabling meaningful comparisons and interpretation o…
• Scoring:
– 0: Does not mention or conduct any internal consistency measurement
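For reference, a small self-contained computation of Cronbach's α over an items × respondents score matrix (a sketch; real analyses would use an established statistics package):

```python
def cronbach_alpha(scores):
    """Cronbach's alpha; scores[i][j] is item i's score for model/respondent j."""
    k = len(scores)      # number of items
    n = len(scores[0])   # number of respondents

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # sample variance

    item_var_sum = sum(var(item) for item in scores)
    totals = [sum(scores[i][j] for i in range(k)) for j in range(n)]
    return k / (k - 1) * (1 - item_var_sum / var(totals))

# Hypothetical usage: 3 items scored 0/1 for 4 models
print(cronbach_alpha([[1, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 0]]))  # ~0.56
```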
-
[48]
Statistical Significance Reporting
• Explanation: When comparing the performance of different models, statistical significance (e.g., confidence intervals and p-values) should be reported; a bootstrap sketch follows below.
• Justification: Statistical significance testing allows for a more informed interpretation of results by differentiating the performance differences between models…
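As one concrete option (an assumption, since the excerpt does not mandate a specific test), a percentile-bootstrap confidence interval over per-item scores:

```python
import random

def bootstrap_ci(scores, n_resamples: int = 10_000, alpha: float = 0.05):
    """Percentile bootstrap confidence interval for a mean benchmark score."""
    rng = random.Random(0)
    means = []
    for _ in range(n_resamples):
        sample = [rng.choice(scores) for _ in scores]  # resample with replacement
        means.append(sum(sample) / len(sample))
    means.sort()
    lo = means[int(alpha / 2 * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Hypothetical usage: per-item accuracies (0/1) for one model
print(bootstrap_ci([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]))
```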
-
[49]
Documentation Completeness
• Explanation: Benchmark developers should provide clear and comprehensive documentation of the benchmark, systematically describing the relevant details, including the design, objectives, scope, construction process, task definitions, and evaluation procedures.
• Justification: Complete and clear documentation helps u…
-
[50]
Clarity of Evaluation Guidelines
• Explanation: Benchmark developers should provide evaluation guidelines, with definitions of evaluation metrics, detailed scoring criteria, and easy-to-follow usage instructions for users to replicate the evaluation process.
• Justification: Clear evaluation guidelines help users better understand how model performanc…
-
[51]
Discussion of Limitations and Risks
• Explanation: The benchmark developers should openly discuss the limitations and potential social risks of the benchmark.
• Justification: Disclosing limitations and risks demonstrates scientific rigor and responsibility. This transparency helps prevent users from misunderstanding and misusing the benchmark, especi…
-
[52]
Peer Review
• Explanation: The benchmark and its corresponding paper were accepted at a peer-reviewed venue.
• Justification: Going through the peer review process means that the design, validity, and results of a benchmark have been rigorously evaluated. It enhances credibility and ensures quality.
• Scoring:
– 0: The benchmark has not been accepted at a peer-r…
-
[53]
Accessibility of Evaluation Code and Data
• Explanation: Access to the evaluation code and data, which can be shared within legal and ethical boundaries, is provided (e.g., on platforms like GitHub or Hugging Face), along with the applicable license.
• Justification: Accessible code and data are prerequisites for reproducibility. Moreover, it allows the c…
-
[54]
Usage and Citation Guidelines
• Explanation: Clear guidelines are provided to standardize benchmark usage, result reporting, and correct citation formats in academic papers or technical reports.
• Justification: Proper usage and citation guidelines help maintain academic integrity. They also promote standardized reporting of results, facilitating…
-
[55]
Update and Version Management
• …
-
[56]
Feedback Channel for Users
• Explanation: A feedback channel should be maintained for users to report problems, provide feedback, and make suggestions.
• Justification: Maintaining an effective feedback channel allows users to provide feedback when issues with the benchmark are discovered. This is crucial for continuously improving benchmark quality and f…
-
[57]
Long-term Maintenance Responsibility
• Explanation: The individuals, teams, or institutions responsible for the benchmark's long-term maintenance and development should be clearly stated.
• Justification: Clarifying who holds long-term responsibility reassures the community that the benchmark will be actively supported and improved, ensuring usability and credibility.
• Sc…