arxiv: 2603.04459 · v3 · pith:WE4OYEKYnew · submitted 2026-03-03 · 💻 cs.CR · cs.AI · cs.SE

Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Pith reviewed 2026-05-21 12:05 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.SE
keywords LLM safety · benchmarks · code quality · runnability · adoption factors · ethical considerations · repository analysis · jailbreak

The pith

LLM safety benchmark adoption tracks author prominence and basic runnability rather than code quality or ethical standards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a measurement study on 31 LLM safety benchmarks covering areas like prompt injection and jailbreaks, comparing them to 382 non-benchmark papers. It shows that most repositories fail to run out of the box, lack reliable setup instructions, and rarely address the ethical risks of publishing harmful examples. Adoption by the community links more closely to how prominent the authors are and whether the code executes at all than to measurable code quality or documentation standards. These shortfalls have stayed consistent over time and raise questions about the dependability of safety evaluations built on such tools.

Core claim

Only 39 percent of the benchmark repositories run without any modification, 16 percent supply flawless installation guides, and just 6 percent include ethical considerations even though they contain potentially harmful content. Adoption correlates with author prominence and code runnability but shows no relation to static code quality metrics such as Pylint scores or maintainability. These patterns hold across the study period without improvement, and some repositories make successful attack responses publicly available without warnings or controls.
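To make these percentages concrete, the sketch below back-computes which repository counts out of 31 would round to each reported figure. It assumes that all 31 benchmarks form the denominator, which this summary does not confirm; the paper may compute some rates over a smaller subset (for example, only repositories with available code). The function name is illustrative.

```python
# Assumption: every percentage is computed over all 31 benchmarks.
TOTAL_BENCHMARKS = 31

def counts_matching(percent: int, total: int = TOTAL_BENCHMARKS) -> list[int]:
    """Return every count k whose share k/total rounds to `percent`."""
    return [k for k in range(total + 1) if round(100 * k / total) == percent]

reported = {
    "runs without modification": 39,
    "flawless installation guide": 16,
    "ethical considerations": 6,
}

for label, percent in reported.items():
    matches = counts_matching(percent)
    print(f"{label}: {percent}% is consistent with {matches} of {TOTAL_BENCHMARKS}")
```

Under that assumption the three rates correspond to 12, 5, and 2 repositories, which shows how small the absolute numbers behind the headline percentages are.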

What carries the argument

Systematic measurement combining automated static analysis, over 220 person-hours of human runnability testing, and bibliometric analysis of adoption patterns across 31 benchmarks versus a control group of 382 papers.
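As an illustration of the kind of metric the automated static-analysis step produces, the following sketch approximates cyclomatic complexity using only Python's standard `ast` module. It is not the paper's tooling: the exact tools and configurations behind the reported Pylint, complexity, and maintainability numbers are not stated in this summary, and the function name and the set of counted node types are assumptions chosen for illustration.

```python
import ast

# Node types treated as decision points (an illustrative choice, not a standard).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.IfExp, ast.comprehension)

def approx_cyclomatic_complexity(source: str) -> int:
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` adds one extra path per additional operand.
            complexity += len(node.values) - 1
    return complexity

sample = '''
def classify(score, strict):
    if score > 0.9 and strict:
        return "high"
    for _ in range(3):
        if score < 0.1:
            return "low"
    return "mid"
'''
print(approx_cyclomatic_complexity(sample))
```

A dedicated analyzer would count more constructs (for example `match` statements or async loops), so numbers from this sketch should not be compared with those reported in the paper.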

If this is right

  • Downstream safety evaluations across papers may not be comparable when each requires ad-hoc code changes to run the benchmarks.
  • Repositories that expose unfiltered harmful content without warnings or access controls can serve as open resources for attacks.
  • The community does not reward higher coding standards or documentation when choosing which benchmarks to adopt.
  • Persistent deficiencies suggest that reliability and safety concerns in evaluations will continue unless practices change.
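One way a maintainer might self-check the point about unguarded harmful content is to scan a repository's README for an explicit warning or ethics statement. The sketch below is a keyword heuristic only: the phrase list and function name are hypothetical, and the paper's "ethical considerations" criterion was assessed by human reviewers, not by a rule like this.

```python
import re

# Hypothetical phrase list; a real audit would still need human review.
_WARNING_PATTERNS = [
    r"content\s+warning",
    r"ethic(s|al)\s+(statement|considerations?)",
    r"harmful\s+content",
    r"responsible\s+(use|disclosure)",
]

def has_ethics_notice(readme_text: str) -> bool:
    """Return True if the README mentions any warning-style phrase."""
    lowered = readme_text.lower()
    return any(re.search(pattern, lowered) for pattern in _WARNING_PATTERNS)

readme_with_notice = (
    "# Bench\n\n## Ethical Considerations\n"
    "This dataset contains harmful content."
)
readme_without = "# Bench\n\nRun `python eval.py` to reproduce the results."

print(has_ethics_notice(readme_with_notice))  # True
print(has_ethics_notice(readme_without))      # False
```

A match only shows that a phrase is present; it says nothing about whether access controls or adequate warnings actually exist.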

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Benchmark creators could gain faster adoption by prioritizing immediate runnability and visible author networks over internal code polish.
  • A shared quality checklist might shift selection incentives if later studies show it predicts higher uptake.
  • Similar gaps in repository standards likely appear in other evaluation-heavy areas such as general LLM capability testing.

Load-bearing premise

The 31 selected benchmarks and 382 control papers form a representative sample of LLM safety literature without bias from how they were identified.

What would settle it

Re-running the full static analysis, runnability tests, and adoption correlation on an independently chosen larger set of LLM safety benchmarks to check whether the reported percentages and lack of quality correlation persist.
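The correlation step of such a replication can be sketched with a small, dependency-free Spearman implementation. The data below are invented placeholders, not values from the paper, and the helper names are assumptions. The sketch omits p-values and does not handle constant inputs, both of which a statistics library would cover.

```python
from statistics import mean

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder data: made-up Pylint scores and citation counts for six repositories.
pylint_scores = [4.1, 7.8, 6.0, 9.2, 5.5, 8.4]
citations = [120, 15, 60, 33, 300, 48]
print(round(spearman_rho(pylint_scores, citations), 3))
```

Because the method works on ranks, it suits heavy-tailed quantities such as citation counts, where a few highly cited papers would dominate a Pearson correlation.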

Figures

Figures reproduced from arXiv: 2603.04459 by Junjie Chu, Michael Backes, Xinyue Shen, Yang Zhang, Ye Leng, Yun Shen.

Figure 1: Data collection pipeline.
Figure 2: Human-based evaluation results of code quality.
Figure 3: Human-based evaluation results of supplementary materials. Repositories without code or unavailable ones are labeled “Not …
Figure 4: PRISMA-style flow diagram for benchmark selection.
Figure 5: Average values of five influence-related metrics on benchmark and non-benchmark papers.
Figure 6: Distribution of the scientific fields that the LLM safety papers influence.
Figure 7: Average values of eight metrics related to the code repository quality on benchmark and non-benchmark papers. We have …
Figure 8: Average time to successfully run the example scripts
Figure 10: GitHub repository availability proportions.
Figure 11: A typical example of the general pattern we identify.
Figure 12: Spearman correlation ρ matrix of the influence metrics (those with p ≥ 0.05 are omitted).
Figure 13: Spearman correlation ρ matrices between the influence metrics and the potential quantitative factors. The unadjusted p-values on the left can be interpreted exploratively.
Figure 14: Spearman correlation ρ matrices between the influence metrics and the code repository quality metrics. The unadjusted p-values on the left can be interpreted exploratively.
Figure 15: Spearman correlation ρ matrices between the code repository quality metrics and the potential quantitative factors. The unadjusted p-values on the left can be interpreted exploratively.
Figure 16: Box plots of citation density by group. The red dashed lines represent the means.
Figure 17: Box plots of citation density and the status of extra modifications and runnable code. The red dashed lines represent the …
Figure 18: Box plots of citation density and various potential qualitative factors. The red dashed lines in the box plots represent the …
Figure 19: Box plots of Pylint score and various potential qualitative factors. The red dashed lines in the box plots represent the mean …
Original abstract

The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructures for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing potentially harmful content. These deficiencies persist across the study period with no significant improvement. Analyzing adoption factors, we find that benchmark adoption correlates with author prominence and code runnability, but not with code quality standards such as Pylint score and maintainability, suggesting that the community's benchmark selection does not reward higher coding standards. Based on these results, we identify potential safety and reliability concerns. Some safety benchmark repositories openly expose harmful content, such as successful jailbreak responses, without any ethical warning or access control, effectively serving as unguarded attack resources. Furthermore, when benchmarks require ad-hoc modifications to run, downstream safety evaluations across different papers may not be comparable. We present case studies illustrating these concrete consequences and propose a targeted checklist to help benchmark contributors improve code quality, documentation, and ethical practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic measurement study of 31 LLM safety benchmarks (prompt injection, jailbreak, hallucination) against 382 non-benchmark control papers. It combines automated static analysis, 220+ person-hours of manual runnability testing, and bibliometric analysis to quantify code quality and adoption factors. Key results: 39% of repositories run without modification, 16% have flawless installation guides, and only 6% include ethical notes. Adoption correlates with author prominence and runnability but shows no correlation with code-quality metrics such as Pylint score or maintainability. The work identifies safety risks from unguarded harmful content and proposes a contributor checklist.

Significance. If the sample is representative, the study supplies concrete, reproducible evidence on reproducibility failures and ethical gaps in LLM safety benchmarks, backed by extensive manual verification and bibliometric controls. The finding that adoption tracks prominence and runnability rather than quality standards, together with the explicit checklist, offers actionable guidance for the community. The combination of automated tools, large-scale human testing, and falsifiable quantitative claims (percentages, correlations) is a clear strength.

major comments (2)
  1. [§3] §3 (Methodology): The paper provides no search strategy, databases, keywords, date range, inclusion/exclusion rules, or justification for selecting the 31 benchmarks and 382 control papers. This detail is load-bearing for the central claim that adoption correlates with prominence and runnability rather than quality metrics, because an uncharacterized sample could mechanically produce the reported pattern through selection artifacts.
  2. [§4.3] §4.3 (Adoption analysis): The reported correlations (e.g., with author prominence and runnability) are presented without sensitivity checks for alternative sampling frames or controls for publication venue; if the 31 benchmarks over-represent prominent authors, the absence of correlation with Pylint/maintainability cannot be interpreted as a community-wide preference.
minor comments (2)
  1. [§2] §2: A brief comparison table of prior benchmark surveys would help situate the novelty of the 220+ person-hour manual evaluation.
  2. [Table 2] Table 2: Define all column abbreviations (e.g., “EA”, “Pylint”) in the caption so the table is self-contained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the manuscript's clarity and robustness.

Point-by-point responses
  1. Referee: [§3] §3 (Methodology): The paper provides no search strategy, databases, keywords, date range, inclusion/exclusion rules, or justification for selecting the 31 benchmarks and 382 control papers. This detail is load-bearing for the central claim that adoption correlates with prominence and runnability rather than quality metrics, because an uncharacterized sample could mechanically produce the reported pattern through selection artifacts.

    Authors: We agree that a more explicit description of the sampling process is necessary to support the reproducibility and validity of our findings. In the revised manuscript we will add a dedicated subsection to §3 that fully documents the search strategy. This will specify the databases queried (arXiv, ACL Anthology, and Google Scholar), the precise keywords and Boolean combinations used, the date range (2022–2024), the inclusion criteria (publicly available code repositories for prompt-injection, jailbreak, or hallucination benchmarks), the exclusion criteria (non-code-based evaluations, non-English papers, or works without GitHub links), and the rationale for arriving at the final counts of 31 benchmarks and 382 control papers. These additions will directly address the possibility of selection artifacts and strengthen the interpretation of the reported correlations. revision: yes

  2. Referee: [§4.3] §4.3 (Adoption analysis): The reported correlations (e.g., with author prominence and runnability) are presented without sensitivity checks for alternative sampling frames or controls for publication venue; if the 31 benchmarks over-represent prominent authors, the absence of correlation with Pylint/maintainability cannot be interpreted as a community-wide preference.

    Authors: We acknowledge the value of additional robustness checks. In the revision we will augment §4.3 with sensitivity analyses that (i) restrict the sample to benchmarks published in top-tier venues, (ii) include publication venue as a control variable in the regression models, and (iii) repeat the correlation tests on a venue-matched subset of the control group. These checks will help demonstrate that the observed pattern—adoption tracking prominence and runnability rather than static code-quality metrics—holds under alternative sampling frames and is not an artifact of over-representation of prominent authors. revision: yes
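The inclusion and exclusion rules described in the first response can be expressed as a simple filter, which also makes the selection auditable. This is a sketch under stated assumptions: the record fields, language code, and sample entries are hypothetical, and only the date range, topics, and criteria themselves come from the rebuttal text.

```python
from dataclasses import dataclass
from typing import Optional

# Topics named in the rebuttal's inclusion criteria.
ALLOWED_TOPICS = {"prompt injection", "jailbreak", "hallucination"}

@dataclass
class Paper:
    # All field names are illustrative assumptions.
    title: str
    year: int
    language: str
    topic: str
    github_url: Optional[str]
    is_code_based: bool

def is_eligible(paper: Paper) -> bool:
    """Apply the stated criteria: 2022-2024, English, on-topic, code-based, GitHub link."""
    return (
        2022 <= paper.year <= 2024
        and paper.language == "en"
        and paper.topic in ALLOWED_TOPICS
        and paper.is_code_based
        and paper.github_url is not None
    )

# Hypothetical candidate records, not entries from the study.
candidates = [
    Paper("Example benchmark A", 2023, "en", "jailbreak",
          "https://github.com/example/a", True),
    Paper("Example benchmark B", 2021, "en", "jailbreak",
          "https://github.com/example/b", True),
    Paper("Example benchmark C", 2024, "en", "hallucination", None, True),
]
eligible = [p.title for p in candidates if is_eligible(p)]
print(eligible)
```

Logging which rule rejected each record, rather than only the final list, would give the per-stage counts that a PRISMA-style flow diagram reports.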

Circularity Check

0 steps flagged

No significant circularity: empirical measurement study with direct observations

full rationale

This paper performs a systematic measurement study by selecting 31 LLM safety benchmarks and 382 control papers, then applying automated static analysis, human runnability testing, and bibliometric analysis to measure code quality, runnability, and adoption correlations. No derivations, equations, or first-principles predictions exist that reduce to the paper's own inputs by construction. The reported correlations (adoption with prominence/runnability, none with Pylint/maintainability) are computed directly from the sampled data without fitted parameters renamed as predictions or self-citation chains that bear the central load. The study is self-contained against external benchmarks of repository quality and adoption metrics, with no self-definitional, uniqueness-imported, or ansatz-smuggled steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on operational definitions of 'runnable without modification', 'flawless installation guide', and 'ethical consideration' that are applied during human testing; these definitions are not derived from prior literature but introduced for the study.
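An operational definition such as "runnable without modification" can be partly mechanized. The sketch below runs a script in a subprocess and treats a zero exit code within a time limit as "runs". This is an assumed proxy, not the paper's protocol, which relied on human testers applying their own definitions; the function name, timeout, and labels are illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def runs_unmodified(script: Path, timeout_s: int = 60) -> str:
    """Classify a script as 'runs', 'fails', or 'timeout' from one unmodified run."""
    try:
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    return "runs" if result.returncode == 0 else "fails"

# Demo with two throwaway scripts standing in for a repository's example script.
with tempfile.TemporaryDirectory() as tmp:
    ok = Path(tmp) / "ok.py"
    ok.write_text("print('hello')\n")
    broken = Path(tmp) / "broken.py"
    broken.write_text("import a_module_that_does_not_exist_xyz\n")
    print(runs_unmodified(ok), runs_unmodified(broken))
```

A zero exit code is a weak signal: a script can exit cleanly while producing wrong or empty results, which is one reason a human-applied definition may differ from this proxy.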

axioms (1)
  • domain assumption The selected 31 benchmarks and 382 control papers adequately represent the broader LLM safety literature.
    Selection criteria and search strategy are not specified in the provided abstract.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 1 Pith paper · 8 internal anchors

  1. [1]

    Pickard, Stephen G

    Kitchenham Barbara A., Lesley M. Pickard, Stephen G. MacDonell, and Martin J. Shepperd. What accuracy statistics really measure.IEE Proceedings-Software, 2001. 3

  2. [2]

    Ashok Agarwal, Damayanthi Durairajanayagam, Sindhuja Tatagari, Sandro C. Esteves, Avi Harlev, Ralf R Henkel, Shubhadeep Roychoudhury, Sheryl T Homa, Nicolás Gar- rido Puchalt, Ranjith Ramasamy, Ahmad Majzoub, Kim Dao Ly, Eva Tvrdá, Mourad Assidi, Kavindra Kumar Kesari, Reecha Sharma, Saleem Ali Banihani, Edmund Y Ko, Muhammad Muhammad Abu-Elmagd, Jaime Go...

  3. [3]

    Generated Data with Fake Privacy: Hidden Dangers of Fine- Tuning Large Language Models on Generated Data

    Atilla Akkus, Masoud Poorghaffar Aghdam, Mingjie Li, Junjie Chu, Michael Backes, Yang Zhang, and Sinem Sav. Generated Data with Fake Privacy: Hidden Dangers of Fine- Tuning Large Language Models on Generated Data. In USENIX Security Symposium (USENIX Security). USENIX,

  4. [4]

    Candice Alder, Candice Yu, Gerta Bardhoshi, and Bradley T. Erford. Counseling and values metastudy: An analysis of publication characteristics from 2000 to 2019.Counseling and Values, 2021. 15

  5. [5]

    Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Countermeasures

    Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Countermeasures. InIEEE Symposium on Security and Pri- vacy (S&P), pages 769–786, Piscataway, NJ, USA, 2022. IEEE. 12

  6. [6]

    What do we know about the h index?Journal of the American Society for In- formation Science, 2007

    Lutz Bornmann and Hans-Dieter Daniel. What do we know about the h index?Journal of the American Society for In- formation Science, 2007. 8

  7. [7]

    Wears, and Ellen Weber

    Michael Callaham, Robert L. Wears, and Ellen Weber. Journal Prestige, Publication Bias, and Other Characteris- tics Associated With Citation of Published Studies in Peer- Reviewed Journals.Journal of the American Medical Asso- ciation, 287(21):2847–2850, 2002. 15

  8. [8]

    Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel

    Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Lan- guage Models. InUSENIX Security Symposium (USENIX Security), pages 2633–2650. USENIX, 2021. 12

  9. [9]

    JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Se- hwag, Edgar Dobriban, Nicolas Flammarion, George J. Pap- pas, Florian Tramer, Hamed Hassani, and Eric Wong. Jail- breakBench: An Open Robustness Benchmark for Jailbreak- ing Large Language Models.CoRR abs/2404.01318, 2024. 5

  10. [10]

    BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements

    Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements. InAnnual Computer Security Applications Conference (ACSAC), pages 554–569. ACSAC, 2021. 12

  11. [11]

    JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring.CoRR abs/2508.20848, 2025

    Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, and Yang Zhang. JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring.CoRR abs/2508.20848, 2025. 1

  12. [12]

    Neeko: Model Hijacking Attacks Against Generative Adversarial Networks

    Junjie Chu, Yugeng Liu, Xinlei He, Michael Backes, Yang Zhang, and Ahmed Salem. Neeko: Model Hijacking Attacks Against Generative Adversarial Networks. InInternational Conference on Multimedia and Expo (ICME). IEEE, 2025. 1

  13. [13]

    JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

    Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. InAnnual Meeting of the Association for Computational Linguistics (ACL). ACL, 2025. 1, 2, 12

  14. [14]

    Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models

    Junjie Chu, Zeyang Sha, Michael Backes, and Yang Zhang. Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models. InConference on Empirical Methods in Natu- ral Language Processing (EMNLP), page 6584–6600. ACL,

  15. [15]

    Efficient Re- source Scheduling for Distributed Infrastructures Using Ne- gotiation Capabilities

    Junjie Chu, Prashant Singh, and Salman Toor. Efficient Re- source Scheduling for Distributed Infrastructures Using Ne- gotiation Capabilities. InIEEE International Conference on Cloud Computing (CLOUD). IEEE, 2023. 12

  16. [16]

    Routledge, 1988

    Jacob Cohen.Statistical power analysis for the behavioral sciences. Routledge, 1988. 4

  17. [17]

    A power primer.Psychological Bulletin, 1992

    Jacob Cohen. A power primer.Psychological Bulletin, 1992. 4

  18. [18]

    Proebsting

    Christian Collberg and Todd A. Proebsting. Repeatability in computer systems research.Communications of the ACM,

  19. [19]

    Corca AI.https://github.com/corca-ai/awesome- llm-security/, 2024. 1, 3

  20. [20]

    Teixeira da Silva

    Jaime A. Teixeira da Silva. The matthew effect im- pacts science and academic publishing by preferentially amplifying citations, metrics and status.Scientometrics, 126(6):5373–5377, 2021. 15

  21. [21]

    Teixeira da Silva and Aamir Raoof Memon

    Jaime A. Teixeira da Silva and Aamir Raoof Memon. CiteScore: A cite for sore eyes, or a valuable, transparent metric?Scientometrics, 2017. 8

  22. [22]

    Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cre- monesi, and D. Jannach. A Troubling Analysis of Repro- ducibility and Progress in Recommender Systems Research. ACM Transactions on Information Systems, 39:1–49, 2019. 15

  23. [23]

    Jail- breaker: Automated jailbreak across multiple large language model chatbots,

    Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots.CoRR abs/2307.08715, 2023. 12

  24. [24]

    Viewing computer science through ci- tation analysis: Salton and Bergmark Redux.Scientometrics, 125(1):271–287, 2020

    Sitaram Devarakonda, Dmitriy Korobskiy, Tandy Warnow, and George Chacko. Viewing computer science through ci- tation analysis: Salton and Bergmark Redux.Scientometrics, 125(1):271–287, 2020. 15

  25. [25]

    Nonparametric Pairwise Multiple Compar- isons in Independent Groups using Dunn’s Test.The Stata Journal, 2015

    Alexis Dinno. Nonparametric Pairwise Multiple Compar- isons in Independent Groups using Dunn’s Test.The Stata Journal, 2015. 6

  26. [26]

    CRC Press, 4th edition, 2007

    Eugene Edgington and Patrick Onghena.Randomization Tests. CRC Press, 4th edition, 2007. 16

  27. [27]

    Michael D. Ernst. Permutation methods: A basis for exact inference.Statistical Science, 19(4):676–685, 2004. 16

  28. [28]

    Falagas, Angeliki Zarkali, Drosos E

    Matthew E. Falagas, Angeliki Zarkali, Drosos E. Karageor- gopoulos, Vangelis Bardakas, and Michael N. Mavros. The impact of article length on the number of future citations: A bibliometric analysis of general medicine journals.PLOS ONE, 8(2):e49476, 2013. 8

  29. [29]

    Over-optimization of aca- demic publishing metrics: observing Goodhart’s Law in ac- tion.GigaScience, 2019

    Michael Fire and Carlos Guestrin. Over-optimization of aca- demic publishing metrics: observing Goodhart’s Law in ac- tion.GigaScience, 2019. 8

  30. [30]

    Statistical methods for research work- ers.Breakthroughs in statistics: Methodology and distribu- tion, pages 66–70, 1970

    Ronald Aylmer Fisher. Statistical methods for research work- ers.Breakthroughs in statistics: Methodology and distribu- tion, pages 66–70, 1970. 2

  31. [31]

    Citation analysis as a tool in journal eval- uation: Journals can be ranked by frequency and impact of citations for science policy studies.Science, 178(4060):471– 479, 1972

    Eugene Garfield. Citation analysis as a tool in journal eval- uation: Journals can be ranked by frequency and impact of citations for science policy studies.Science, 178(4060):471– 479, 1972. 15

  32. [32]

    1, 2, 13

    GitHub.https://docs.github.com/en/graphql/, 2024. 1, 2, 13

  33. [33]

    Good.Permutation, Parametric, and Bootstrap Tests of Hypotheses

    Phillip I. Good.Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer, 3rd edition, 2005. 16

  34. [34]

    1, 2, 13

    Google.https://scholar.google.com/, 2024. 1, 2, 13

  35. [35]

    Revisiting Inter-Class Maintainability Indica- tors

    Lena Gregor, Markus Schnappinger, and Alexander Pretschner. Revisiting Inter-Class Maintainability Indica- tors. InIEEE International Conference on Software Analy- sis, Evolution and Reengineering (SANER), pages 805–814, Piscataway, NJ, USA, 2023. IEEE. 8

  36. [36]

    Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A Comprehensive Analysis of Novel Prompt In- jection Threats to Application-Integrated Large Language Models.CoRR abs/2302.12173, 2023. 1, 2, 12

  37. [37]

    Odd Erik Gundersen, Yolanda Gil, and David W. Aha. On Reproducible AI Towards reproducible research, open sci- ence, and digital scholarship in AI publications.AI Maga- zine, 2019. 15

  38. [38]

    A Sys- tematic Analysis of User Evaluations in Security Research

    Peter Hamm, David Harborth, and Sebastian Pape. A Sys- tematic Analysis of User Evaluations in Security Research. InProceedings of the 14th International Conference on Availability, Reliability and Security, New York, NY , USA,

  39. [39]

    Association for Computing Machinery. 15, 16

  40. [40]

    Searching relevant papers for soft- ware engineering secondary studies: Semantic Scholar cov- erage and identification role.IET Software, 2021

    Abdelhakim Hannousse. Searching relevant papers for soft- ware engineering secondary studies: Semantic Scholar cov- erage and identification role.IET Software, 2021. 2, 13

  41. [41]

    H. T. Hayslett.Statistics. Elsevier, 2014. 3

  42. [42]

    Melinda Hess and Jeffrey D. Kromrey. Robust Confidence Intervals for Effect Sizes: A Comparative Study of Cohen’s d and Cliff’s Delta Under Non-normality and Heterogeneous Variances. Inannual meeting of the American Educational Research Association (AERA), pages 1–13. American Edu- cational Research Association, 2004. 3

  43. [43]

    Hogg and Elliot A

    Robert V . Hogg and Elliot A. Tanis.Probability and Statisti- cal Inference. Prentice Hall, 2010. 3

  44. [44]

    A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70,

    Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70,

  45. [45]

    Survey of Hallucination in Natural Language Generation.ACM Computing Surveys, 2023

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pas- cale Fung. Survey of Hallucination in Natural Language Generation.ACM Computing Surveys, 2023. 1, 2, 12

  46. [46]

    Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs.CoRR abs/2602.08621, 2026

    Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs.CoRR abs/2602.08621, 2026. 12

  47. [47]

    Adjacent Words, Divergent Intents: Jailbreaking Large Lan- guage Models via Task Concurrency

    Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent Words, Divergent Intents: Jailbreaking Large Lan- guage Models via Task Concurrency. InAnnual Conference on Neural Information Processing Systems (NeurIPS), 2025. 2, 12

  48. [48]

    Jones, Travis M

    Richard E. Jones, Travis M. Hughes, Kevin A. Lawson, and Gregory L Desilva. Citation analysis of the 100 most com- mon articles regarding distal radius fractures.Journal of Clinical Orthopaedics and Trauma, 81:73–75, 2017. 3

  49. [49]

    Research methodology used in the 50 most cited articles in the field of pediatrics: types of stud- ies that become citation classics.BMC Medical Research Methodology, 2020

    Antonia Jelicic Kadic, Tanja Kovacevic, Edita Runjic, Ana Simicic Majce, Josko Markic, Branka Polic, Julije Me- strovic, and Livia Puljak. Research methodology used in the 50 most cited articles in the field of pediatrics: types of stud- ies that become citation classics.BMC Medical Research Methodology, 2020. 3, 15

  50. [50]

    Graham, F.Q

    Rodney Michael Kinney, Chloe Anastasiades, Russell Au- thur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Is- 9 abel Cachola, Stefan Candra, Yoganand Chandrasekhar, Ar- man Cohan, Miles Crawford, Doug Downey, Jason Dunkel- berger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David W. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Koh...

  51. [51]

    Roger E. Kirk. Practical Significance: A Concept Whose Time Has Come.Educational and Psychological Measure- ment, 1996. 3

  52. [52]

    A meta-analysis of semantic classification of citations.Quantitative science studies, 2(4):1170–1215,

    Suchetha N Kunnath, Drahomira Herrmannova, David Pride, and Petr Knoth. A meta-analysis of semantic classification of citations.Quantitative science studies, 2(4):1170–1215,

  53. [53]

    The impact factor’s matthew effect: A natural experiment in bibliometrics.Jour- nal of the American Society for Information Science and Technology, 61(2):424–427, 2010

    Vincent Larivière and Yves Gingras. The impact factor’s matthew effect: A natural experiment in bibliometrics.Jour- nal of the American Society for Information Science and Technology, 61(2):424–427, 2010. 8

  54. [54]

    Are Smarter LLMs Safer? Exploring Safety- Reasoning Trade-offs in Prompting and Fine-Tuning.CoRR abs/2502.09673, 2025

    Ang Li, Yichuan Mo, Mingjie Li, Yifei Wang, and Yisen Wang. Are Smarter LLMs Safer? Exploring Safety- Reasoning Trade-offs in Prompting and Fine-Tuning.CoRR abs/2502.09673, 2025. 1

  55. [55]

    Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 12

  56. [56]

    Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6449–6464. ACL, 2023. 1, 2, 12

  57. [57]

    Mingjie Li, Wai-Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation. In International Conference on Learning Representations (ICLR), 2025. 1

  58. [58]

    Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860, 2023. 12

  59. [59]

    Yugeng Liu, Tianshuo Cong, Zhengyu Zhao, Michael Backes, Yun Shen, and Yang Zhang. Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models. CoRR abs/2308.07847, 2023. 1

  60. [60]

    Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella Béguelin. Analyzing Leakage of Personally Identifiable Information in Language Models. In IEEE Symposium on Security and Privacy (S&P), pages 346–363, Piscataway, NJ, USA, 2023. IEEE. 12

  61. [61]

    Guillermo Macbeth, Eugenia Razumiejczyk, and Rubén Daniel Ledesma. Cliff’s Delta Calculator: A non-parametric effect size program for two groups of observations. Universitas Psychologica, 2010. 3

  62. [62]

    Aniruddha Maiti, Sai Shi, and Slobodan Vucetic. An ablation study on the use of publication venue quality to rank computer science departments. Scientometrics, 2023. 8

  63. [63]

    Mario Malički, Ana Jerončić, IJsbrand Jan Aalbersberg, Lex Bouter, and Gerben Ter Riet. Systematic review and meta-analyses of studies analysing instructions to authors from 1987 to 2017. Nature Communications, 12(1):5840, 2021. 15

  64. [64]

    Stefano Mammola, Elena Piano, Alberto Doretto, Enrico Caprio, and Dan Chamberlain. Measuring the influence of non-scientific features on citations. Scientometrics, 127(7):4123–4137, 2022. 8

  65. [65]

    Philomena Marfo and Gabriel Asare Okyere. The accuracy of effect-size estimates under normals and contaminated normals in meta-analysis. Heliyon, 2019. 3

  66. [66]

    Bland J. Martin and Douglas G. Altman. Applying the right statistics: analyses of measurement studies. Ultrasound in Obstetrics and Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology, 2003. 1, 3

  67. [67]

    Edward J. Mascha and Thomas R. Vetter. Significance, Errors, Power, and Sample Size: The Blocking and Tackling of Statistics. Anesthesia & Analgesia, 2018. 3

  68. [68]

    T.J. McCabe. A Complexity Measure. IEEE Transactions on Software Engineering, 1976. 13

  69. [69]

    Francis McIntyre and F. N. David. Tables of the Ordinates and Probability Integral of the Distribution of the Correlation Coefficient in Small Samples. In Mathematics, Cambridge, United Kingdom, 1938. Cambridge University Press. 16

  70. [70]

    Patrick E. McKnight and Julius Najab. Kruskal-Wallis Test. The Corsini Encyclopedia of Psychology, 2010. 4

  71. [71]

    Patrick E. McKnight and Julius Najab. Mann–Whitney U Test. The SAGE Encyclopedia of Research Design, 2010. 3

  72. [72]

    Kane Meissel and Esther S. Yao. Using Cliff’s Delta as a Non-Parametric Effect Size Measure: An Accessible Web App and R Tutorial. Practical Assessment, Research, and Evaluation, 2024. 3

  73. [73]

    Robert K. Merton. The Matthew effect in science. Science, 159(3810):56–63, 1968. 4, 15

  74. [74]

    Meta AI. https://paperswithcode.com/api/v1/docs/,

  75. [75]

    Microsoft. https://learn.microsoft.com/en-us/visualstudio/code-quality/code-metrics-maintainability-index-range-and-meaning?view=vs-2022/, 2022. 13

  76. [76]

    Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8332–8347. ACL, 2022. 12

  77. [77]

    Nachar Nadim. The Mann-Whitney U: A test for assessing whether two independent samples come from the same distribution. Tutorials in Quantitative Methods for Psychology,

  78. [78]

    National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. National Academies Press (US), 2019. 15

  79. [79]

    Jason T. Newsom. Sample Size and Power for Regression. https://web.pdx.edu/~newsomj/ho_sample%20size.pdf, 2021. 16

  80. [80]

    Daniel Olszewski, Allison Lu, Carson Stillman, Kevin Warren, Cole Kitroser, Alejandro Pascual, Divyajyoti Ukirde, Kevin Butler, and Patrick Traynor. "Get in Researchers; We’re Measuring Reproducibility": A Reproducibility Study of Machine Learning Papers in Tier 1 Security Conferences. In ACM SIGSAC Conference on Computer and Communications Security (C...
