Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks

Junjie Chu; Michael Backes; Xinyue Shen; Yang Zhang; Ye Leng; Yun Shen

arXiv: 2603.04459 · v3 · pith:WE4OYEKYnew · submitted 2026-03-03 · cs.CR · cs.AI · cs.SE

Pith reviewed 2026-05-21 12:05 UTC · model grok-4.3

classification: cs.CR · cs.AI · cs.SE

keywords: LLM safety · benchmarks · code quality · runnability · adoption factors · ethical considerations · repository analysis · jailbreak

The pith

LLM safety benchmark adoption tracks author prominence and basic runnability rather than code quality or ethical standards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a measurement study on 31 LLM safety benchmarks covering areas like prompt injection and jailbreaks, comparing them to 382 non-benchmark papers. It shows that most repositories fail to run out of the box, lack reliable setup instructions, and rarely address the ethical risks of publishing harmful examples. Adoption by the community links more closely to how prominent the authors are and whether the code executes at all than to measurable code quality or documentation standards. These shortfalls have stayed consistent over time and raise questions about the dependability of safety evaluations built on such tools.

Core claim

Only 39 percent of the benchmark repositories run without any modification, 16 percent supply flawless installation guides, and just 6 percent include ethical considerations even though they contain potentially harmful content. Adoption correlates with author prominence and code runnability but shows no relation to static code quality metrics such as Pylint scores or maintainability. These patterns hold across the study period without improvement, and some repositories make successful attack responses publicly available without warnings or controls.
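The percentages above can be translated back into approximate repository counts. The sketch below assumes, hypothetically, that each percentage was computed over all 31 benchmarks and rounded to the nearest whole percent; the source does not state the denominators, so the resulting counts are an inference and not figures reported by the paper. The helper name `implied_counts` is made up for illustration.

```python
# Hypothetical helper (not from the paper): which whole-number counts out of
# `total` repositories round to the reported percentage?
def implied_counts(pct: int, total: int = 31) -> list[int]:
    return [k for k in range(total + 1) if round(100 * k / total) == pct]

# Percentages as reported in the core claim; the denominator of 31 is an assumption.
reported = {
    "runs without modification": 39,
    "flawless installation guide": 16,
    "ethical considerations": 6,
}

implied = {label: implied_counts(pct) for label, pct in reported.items()}
for label, counts in implied.items():
    print(f"{label}: {counts} of 31")
```

Under that assumption each percentage maps to a single count, which gives a sense of how few repositories sit behind the smaller figures.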

What carries the argument

Systematic measurement combining automated static analysis, over 220 person-hours of human runnability testing, and bibliometric analysis of adoption patterns across 31 benchmarks versus a control group of 382 papers.
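The adoption analysis is described elsewhere on this page in terms of Spearman correlation between influence metrics and candidate factors. As a minimal sketch of that statistic only, the following standard-library Python computes Spearman's ρ as the Pearson correlation of average ranks. The variable names and all numbers are made-up toy values, not data from the paper, and the paper's own tooling is not specified here.

```python
from statistics import mean

def average_ranks(values):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Toy values for six imaginary repositories (assumptions, NOT the paper's data).
citation_density = [0.5, 2.0, 1.2, 8.0, 3.3, 0.9]
author_h_index = [4, 20, 11, 60, 25, 7]
pylint_score = [7.1, 6.8, 7.1, 5.9, 8.4, 6.0]

rho_prominence = spearman_rho(citation_density, author_h_index)
rho_quality = spearman_rho(citation_density, pylint_score)
print(f"rho(citations, h-index) = {rho_prominence:.2f}")
print(f"rho(citations, pylint)  = {rho_quality:.2f}")
```

Because the statistic works on ranks, it is insensitive to the heavy right skew that citation and star counts typically show, which is one plausible reason to prefer it over Pearson correlation in this setting.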

If this is right

Downstream safety evaluations across papers may not be comparable when each requires ad-hoc code changes to run the benchmarks.
Repositories that expose unfiltered harmful content without warnings or access controls can serve as open resources for attacks.
The community does not reward higher coding standards or documentation when choosing which benchmarks to adopt.
Persistent deficiencies suggest that reliability and safety concerns in evaluations will continue unless practices change.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Benchmark creators could gain faster adoption by prioritizing immediate runnability and visible author networks over internal code polish.
A shared quality checklist might shift selection incentives if later studies show it predicts higher uptake.
Similar gaps in repository standards likely appear in other evaluation-heavy areas such as general LLM capability testing.

Load-bearing premise

The 31 selected benchmarks and 382 control papers form a representative sample of LLM safety literature without bias from how they were identified.

What would settle it

Re-running the full static analysis, runnability tests, and adoption correlation on an independently chosen larger set of LLM safety benchmarks to check whether the reported percentages and lack of quality correlation persist.
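One way to judge whether a replication "persists" is to ask how much a percentage measured on 31 repositories could move by sampling variation alone. The sketch below computes a Wilson score interval for a reported proportion. It is an illustration added here, not an analysis from the paper, and it assumes the 39 percent figure was computed over all 31 benchmarks.

```python
from math import sqrt

def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width

# Assumption: 39% runnable out of n = 31 benchmark repositories.
low, high = wilson_interval(0.39, 31)
print(f"95% Wilson interval: {low:.2f} to {high:.2f}")
```

With a sample this small the interval is wide, so a replication on a larger set could report a noticeably different percentage without contradicting the original measurement.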

Figures

Figures reproduced from arXiv: 2603.04459 by Junjie Chu, Michael Backes, Xinyue Shen, Yang Zhang, Ye Leng, Yun Shen.

**Figure 2.** Human-based evaluation results of code quality.

**Figure 3.** Human-based evaluation results of supplementary materials. Repositories without code or unavailable ones are labeled “Not…

**Figure 4.** PRISMA-style flow diagram for benchmark selection.

**Figure 5.** Average values of five influence-related metrics on benchmark and non-benchmark papers.

**Figure 6.** Distribution of the scientific fields that the LLM safety papers influence.

**Figure 7.** Average values of eight metrics related to the code repository quality on benchmark and non-benchmark papers. We have…

**Figure 8.** Average time to successfully run the example scripts

**Figure 10.** GitHub repository availability proportions.

**Figure 11.** A typical example of the general pattern we identify.

**Figure 12.** Spearman correlation ρ matrix of the influence metrics (those with p ≥ 0.05 are omitted).

**Figure 13.** Spearman correlation ρ matrices between the influence metrics and the potential quantitative factors. The unadjusted p-values on the left can be interpreted exploratively. Axis labels (truncated in the source): Pylint Score, Cyclomatic Complexity, Maintainability Index, Number of Static Errors, Reply Time (Hours), Last Commit Time (Days), Number of Commits, Commit Frequency, Citation Density, Citation Count, GitHub Star Density, GitHub Star Count, Scienti…

**Figure 14.** Spearman correlation ρ matrices between the influence metrics and the code repository quality metrics. The unadjusted p-values on the left can be interpreted exploratively. Axis labels (truncated in the source): Author Number, Institution Number, Area Number, Author H-Index (Top-1), Author Citation Count (Top-1), Institution CSRankings (Top-1), Institution ARWU (Top-1), Search Appearance Frequency, Pylint Score, Cyclomatic Complexity, Maintainability Inde…

**Figure 15.** Spearman correlation ρ matrices between the code repository quality metrics and the potential quantitative factors. The unadjusted p-values on the left can be interpreted exploratively.

**Figure 16.** Box plots of citation density by group. The red dashed lines represent the means.

**Figure 17.** Box plots of citation density and the status of extra modifications and runnable code. The red dashed lines represent the…

**Figure 18.** Box plots of citation density and various potential qualitative factors. The red dashed lines in the box plots represent the…

**Figure 19.** Box plots of Pylint score and various potential qualitative factors. The red dashed lines in the box plots represent the mean…

Original abstract

The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructures for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing potentially harmful content. These deficiencies persist across the study period with no significant improvement. Analyzing adoption factors, we find that benchmark adoption correlates with author prominence and code runnability, but not with code quality standards such as Pylint score and maintainability, suggesting that the community's benchmark selection does not reward higher coding standards. Based on these results, we identify potential safety and reliability concerns. Some safety benchmark repositories openly expose harmful content, such as successful jailbreak responses, without any ethical warning or access control, effectively serving as unguarded attack resources. Furthermore, when benchmarks require ad-hoc modifications to run, downstream safety evaluations across different papers may not be comparable. We present case studies illustrating these concrete consequences and propose a targeted checklist to help benchmark contributors improve code quality, documentation, and ethical practices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a systematic measurement study of 31 LLM safety benchmarks (prompt injection, jailbreak, hallucination) against 382 non-benchmark control papers. It combines automated static analysis, 220+ person-hours of manual runnability testing, and bibliometric analysis to quantify code quality and adoption factors. Key results: 39% of repositories run without modification, 16% have flawless installation guides, and only 6% include ethical notes. Adoption correlates with author prominence and runnability but shows no correlation with code-quality metrics such as Pylint score or maintainability. The work identifies safety risks from unguarded harmful content and proposes a contributor checklist.

Significance. If the sample is representative, the study supplies concrete, reproducible evidence on reproducibility failures and ethical gaps in LLM safety benchmarks, backed by extensive manual verification and bibliometric controls. The finding that adoption tracks prominence and runnability rather than quality standards, together with the explicit checklist, offers actionable guidance for the community. The combination of automated tools, large-scale human testing, and falsifiable quantitative claims (percentages, correlations) is a clear strength.

major comments (2)

§3 (Methodology): The paper provides no search strategy, databases, keywords, date range, inclusion/exclusion rules, or justification for selecting the 31 benchmarks and 382 control papers. This detail is load-bearing for the central claim that adoption correlates with prominence and runnability rather than quality metrics, because an uncharacterized sample could mechanically produce the reported pattern through selection artifacts.
§4.3 (Adoption analysis): The reported correlations (e.g., with author prominence and runnability) are presented without sensitivity checks for alternative sampling frames or controls for publication venue; if the 31 benchmarks over-represent prominent authors, the absence of correlation with Pylint/maintainability cannot be interpreted as a community-wide preference.

minor comments (2)

§2: A brief comparison table of prior benchmark surveys would help situate the novelty of the 220+ person-hour manual evaluation.
Table 2: Define all column abbreviations (e.g., “EA”, “Pylint”) in the caption so the table is self-contained.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the manuscript's clarity and robustness.

Referee: §3 (Methodology): The paper provides no search strategy, databases, keywords, date range, inclusion/exclusion rules, or justification for selecting the 31 benchmarks and 382 control papers. This detail is load-bearing for the central claim that adoption correlates with prominence and runnability rather than quality metrics, because an uncharacterized sample could mechanically produce the reported pattern through selection artifacts.

Authors: We agree that a more explicit description of the sampling process is necessary to support the reproducibility and validity of our findings. In the revised manuscript we will add a dedicated subsection to §3 that fully documents the search strategy. This will specify the databases queried (arXiv, ACL Anthology, and Google Scholar), the precise keywords and Boolean combinations used, the date range (2022–2024), the inclusion criteria (publicly available code repositories for prompt-injection, jailbreak, or hallucination benchmarks), the exclusion criteria (non-code-based evaluations, non-English papers, or works without GitHub links), and the rationale for arriving at the final counts of 31 benchmarks and 382 control papers. These additions will directly address the possibility of selection artifacts and strengthen the interpretation of the reported correlations.

Revision: yes

Referee: §4.3 (Adoption analysis): The reported correlations (e.g., with author prominence and runnability) are presented without sensitivity checks for alternative sampling frames or controls for publication venue; if the 31 benchmarks over-represent prominent authors, the absence of correlation with Pylint/maintainability cannot be interpreted as a community-wide preference.

Authors: We acknowledge the value of additional robustness checks. In the revision we will augment §4.3 with sensitivity analyses that (i) restrict the sample to benchmarks published in top-tier venues, (ii) include publication venue as a control variable in the regression models, and (iii) repeat the correlation tests on a venue-matched subset of the control group. These checks will help demonstrate that the observed pattern (adoption tracking prominence and runnability rather than static code-quality metrics) holds under alternative sampling frames and is not an artifact of over-representation of prominent authors.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: empirical measurement study with direct observations

This paper performs a systematic measurement study by selecting 31 LLM safety benchmarks and 382 control papers, then applying automated static analysis, human runnability testing, and bibliometric analysis to measure code quality, runnability, and adoption correlations. No derivations, equations, or first-principles predictions exist that reduce to the paper's own inputs by construction. The reported correlations (adoption with prominence/runnability, none with Pylint/maintainability) are computed directly from the sampled data without fitted parameters renamed as predictions or self-citation chains that bear the central load. The study is self-contained against external benchmarks of repository quality and adoption metrics, with no self-definitional, uniqueness-imported, or ansatz-smuggled steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on operational definitions of 'runnable without modification', 'flawless installation guide', and 'ethical consideration' that are applied during human testing; these definitions are not derived from prior literature but introduced for the study.
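To make the idea of an operational definition concrete, here is one possible way to encode "runnable without modification" as an automated check: execute a repository's example script as-is in a fresh interpreter and require a zero exit status within a time limit. This is a sketch under stated assumptions, not the paper's procedure (which relied on human testing and whose exact criteria are not given here). The function name `runs_unmodified`, the timeout, and the two throwaway scripts are all hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def runs_unmodified(script: Path, timeout_s: float = 60.0) -> bool:
    """Return True if `script`, run as-is in a fresh interpreter, exits with status 0."""
    try:
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=script.parent,      # run from the script's own directory
            capture_output=True,    # keep stdout/stderr out of the console
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False                # hanging counts as not runnable
    return result.returncode == 0

# Two throwaway scripts stand in for repository entry points (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good_example.py"
    good.write_text("print('ok')\n")
    bad = Path(tmp) / "bad_example.py"
    bad.write_text("import a_module_that_is_not_installed_xyz\n")
    outcomes = {p.name: runs_unmodified(p) for p in (good, bad)}

print(outcomes)
```

A check like this captures only the narrowest reading of runnability. It says nothing about whether dependencies were installable from the documented instructions or whether the output is correct, which is where human judgment enters.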

axioms (1)

Domain assumption: The selected 31 benchmarks and 382 control papers adequately represent the broader LLM safety literature.
Selection criteria and search strategy are not specified in the provided abstract.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs
cs.LG · 2026-04 · unverdicted · novelty 5.0

Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.

Reference graph

Works this paper leans on

129 extracted references · 129 canonical work pages · cited by 1 Pith paper · 8 internal anchors

[1]

Pickard, Stephen G

Kitchenham Barbara A., Lesley M. Pickard, Stephen G. MacDonell, and Martin J. Shepperd. What accuracy statistics really measure.IEE Proceedings-Software, 2001. 3

work page 2001
[2]

Ashok Agarwal, Damayanthi Durairajanayagam, Sindhuja Tatagari, Sandro C. Esteves, Avi Harlev, Ralf R Henkel, Shubhadeep Roychoudhury, Sheryl T Homa, Nicolás Gar- rido Puchalt, Ranjith Ramasamy, Ahmad Majzoub, Kim Dao Ly, Eva Tvrdá, Mourad Assidi, Kavindra Kumar Kesari, Reecha Sharma, Saleem Ali Banihani, Edmund Y Ko, Muhammad Muhammad Abu-Elmagd, Jaime Go...

work page 2016
[3]

Generated Data with Fake Privacy: Hidden Dangers of Fine- Tuning Large Language Models on Generated Data

Atilla Akkus, Masoud Poorghaffar Aghdam, Mingjie Li, Junjie Chu, Michael Backes, Yang Zhang, and Sinem Sav. Generated Data with Fake Privacy: Hidden Dangers of Fine- Tuning Large Language Models on Generated Data. In USENIX Security Symposium (USENIX Security). USENIX,

work page
[4]

Candice Alder, Candice Yu, Gerta Bardhoshi, and Bradley T. Erford. Counseling and values metastudy: An analysis of publication characteristics from 2000 to 2019.Counseling and Values, 2021. 15

work page 2000
[5]

Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Countermeasures

Eugene Bagdasaryan and Vitaly Shmatikov. Spinning Lan- guage Models: Risks of Propaganda-As-A-Service and Countermeasures. InIEEE Symposium on Security and Pri- vacy (S&P), pages 769–786, Piscataway, NJ, USA, 2022. IEEE. 12

work page 2022
[6]

What do we know about the h index?Journal of the American Society for In- formation Science, 2007

Lutz Bornmann and Hans-Dieter Daniel. What do we know about the h index?Journal of the American Society for In- formation Science, 2007. 8

work page 2007
[7]

Wears, and Ellen Weber

Michael Callaham, Robert L. Wears, and Ellen Weber. Journal Prestige, Publication Bias, and Other Characteris- tics Associated With Citation of Published Studies in Peer- Reviewed Journals.Journal of the American Medical Asso- ciation, 287(21):2847–2850, 2002. 15

work page 2002
[8]

Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-V oss, Katherine Lee, Adam Roberts, Tom B. Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting Training Data from Large Lan- guage Models. InUSENIX Security Symposium (USENIX Security), pages 2633–2650. USENIX, 2021. 12

work page 2021
[9]

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models

Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Se- hwag, Edgar Dobriban, Nicolas Flammarion, George J. Pap- pas, Florian Tramer, Hamed Hassani, and Eric Wong. Jail- breakBench: An Open Robustness Benchmark for Jailbreak- ing Large Language Models.CoRR abs/2404.01318, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements

Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, Qingni Shen, Zhonghai Wu, and Yang Zhang. BadNL: Back- door Attacks Against NLP Models with Semantic-preserving Improvements. InAnnual Computer Security Applications Conference (ACSAC), pages 554–569. ACSAC, 2021. 12

work page 2021
[11]

JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring.CoRR abs/2508.20848, 2025

Junjie Chu, Mingjie Li, Ziqing Yang, Ye Leng, Chenhao Lin, Chao Shen, Michael Backes, Yun Shen, and Yang Zhang. JADES: A Universal Framework for Jailbreak Assessment via Decompositional Scoring.CoRR abs/2508.20848, 2025. 1

work page arXiv 2025
[12]

Neeko: Model Hijacking Attacks Against Generative Adversarial Networks

Junjie Chu, Yugeng Liu, Xinlei He, Michael Backes, Yang Zhang, and Ahmed Salem. Neeko: Model Hijacking Attacks Against Generative Adversarial Networks. InInternational Conference on Multimedia and Expo (ICME). IEEE, 2025. 1

work page 2025
[13]

JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs

Junjie Chu, Yugeng Liu, Ziqing Yang, Xinyue Shen, Michael Backes, and Yang Zhang. JailbreakRadar: Comprehensive Assessment of Jailbreak Attacks Against LLMs. InAnnual Meeting of the Association for Computational Linguistics (ACL). ACL, 2025. 1, 2, 12

work page 2025
[14]

Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models

Junjie Chu, Zeyang Sha, Michael Backes, and Yang Zhang. Reconstruct Your Previous Conversations! Comprehensively Investigating Privacy Leakage Risks in Conversations with GPT Models. InConference on Empirical Methods in Natu- ral Language Processing (EMNLP), page 6584–6600. ACL,

work page
[15]

Efficient Re- source Scheduling for Distributed Infrastructures Using Ne- gotiation Capabilities

Junjie Chu, Prashant Singh, and Salman Toor. Efficient Re- source Scheduling for Distributed Infrastructures Using Ne- gotiation Capabilities. InIEEE International Conference on Cloud Computing (CLOUD). IEEE, 2023. 12

work page 2023
[16]

Routledge, 1988

Jacob Cohen.Statistical power analysis for the behavioral sciences. Routledge, 1988. 4

work page 1988
[17]

A power primer.Psychological Bulletin, 1992

Jacob Cohen. A power primer.Psychological Bulletin, 1992. 4

work page 1992
[18]

Proebsting

Christian Collberg and Todd A. Proebsting. Repeatability in computer systems research.Communications of the ACM,

work page
[19]

Corca AI.https://github.com/corca-ai/awesome- llm-security/, 2024. 1, 3

work page 2024
[20]

Teixeira da Silva

Jaime A. Teixeira da Silva. The matthew effect im- pacts science and academic publishing by preferentially amplifying citations, metrics and status.Scientometrics, 126(6):5373–5377, 2021. 15

work page 2021
[21]

Teixeira da Silva and Aamir Raoof Memon

Jaime A. Teixeira da Silva and Aamir Raoof Memon. CiteScore: A cite for sore eyes, or a valuable, transparent metric?Scientometrics, 2017. 8

work page 2017
[22]

Maurizio Ferrari Dacrema, Simone Boglio, Paolo Cre- monesi, and D. Jannach. A Troubling Analysis of Repro- ducibility and Progress in Recommender Systems Research. ACM Transactions on Information Systems, 39:1–49, 2019. 15

work page 2019
[23]

Jail- breaker: Automated jailbreak across multiple large language model chatbots,

Gelei Deng, Yi Liu, Yuekang Li, Kailong Wang, Ying Zhang, Zefeng Li, Haoyu Wang, Tianwei Zhang, and Yang Liu. Jail- breaker: Automated Jailbreak Across Multiple Large Lan- guage Model Chatbots.CoRR abs/2307.08715, 2023. 12

work page arXiv 2023
[24]

Viewing computer science through ci- tation analysis: Salton and Bergmark Redux.Scientometrics, 125(1):271–287, 2020

Sitaram Devarakonda, Dmitriy Korobskiy, Tandy Warnow, and George Chacko. Viewing computer science through ci- tation analysis: Salton and Bergmark Redux.Scientometrics, 125(1):271–287, 2020. 15

work page 2020
[25]

Nonparametric Pairwise Multiple Compar- isons in Independent Groups using Dunn’s Test.The Stata Journal, 2015

Alexis Dinno. Nonparametric Pairwise Multiple Compar- isons in Independent Groups using Dunn’s Test.The Stata Journal, 2015. 6

work page 2015
[26]

CRC Press, 4th edition, 2007

Eugene Edgington and Patrick Onghena.Randomization Tests. CRC Press, 4th edition, 2007. 16

work page 2007
[27]

Michael D. Ernst. Permutation methods: A basis for exact inference.Statistical Science, 19(4):676–685, 2004. 16

work page 2004
[28]

Falagas, Angeliki Zarkali, Drosos E

Matthew E. Falagas, Angeliki Zarkali, Drosos E. Karageor- gopoulos, Vangelis Bardakas, and Michael N. Mavros. The impact of article length on the number of future citations: A bibliometric analysis of general medicine journals.PLOS ONE, 8(2):e49476, 2013. 8

work page 2013
[29]

Over-optimization of aca- demic publishing metrics: observing Goodhart’s Law in ac- tion.GigaScience, 2019

Michael Fire and Carlos Guestrin. Over-optimization of aca- demic publishing metrics: observing Goodhart’s Law in ac- tion.GigaScience, 2019. 8

work page 2019
[30]

Statistical methods for research work- ers.Breakthroughs in statistics: Methodology and distribu- tion, pages 66–70, 1970

Ronald Aylmer Fisher. Statistical methods for research work- ers.Breakthroughs in statistics: Methodology and distribu- tion, pages 66–70, 1970. 2

work page 1970
[31]

Citation analysis as a tool in journal eval- uation: Journals can be ranked by frequency and impact of citations for science policy studies.Science, 178(4060):471– 479, 1972

Eugene Garfield. Citation analysis as a tool in journal eval- uation: Journals can be ranked by frequency and impact of citations for science policy studies.Science, 178(4060):471– 479, 1972. 15

work page 1972
[32]

1, 2, 13

GitHub.https://docs.github.com/en/graphql/, 2024. 1, 2, 13

work page 2024
[33]

Good.Permutation, Parametric, and Bootstrap Tests of Hypotheses

Phillip I. Good.Permutation, Parametric, and Bootstrap Tests of Hypotheses. Springer, 3rd edition, 2005. 16

work page 2005
[34]

1, 2, 13

Google.https://scholar.google.com/, 2024. 1, 2, 13

work page 2024
[35]

Revisiting Inter-Class Maintainability Indica- tors

Lena Gregor, Markus Schnappinger, and Alexander Pretschner. Revisiting Inter-Class Maintainability Indica- tors. InIEEE International Conference on Software Analy- sis, Evolution and Reengineering (SANER), pages 805–814, Piscataway, NJ, USA, 2023. IEEE. 8

work page 2023
[36]

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. More than you’ve asked for: A Comprehensive Analysis of Novel Prompt In- jection Threats to Application-Integrated Large Language Models.CoRR abs/2302.12173, 2023. 1, 2, 12

work page internal anchor Pith review Pith/arXiv arXiv 2023
[37]

Odd Erik Gundersen, Yolanda Gil, and David W. Aha. On Reproducible AI Towards reproducible research, open sci- ence, and digital scholarship in AI publications.AI Maga- zine, 2019. 15

work page 2019
[38]

A Sys- tematic Analysis of User Evaluations in Security Research

Peter Hamm, David Harborth, and Sebastian Pape. A Sys- tematic Analysis of User Evaluations in Security Research. InProceedings of the 14th International Conference on Availability, Reliability and Security, New York, NY , USA,

work page
[39]

Association for Computing Machinery. 15, 16

work page
[40]

Searching relevant papers for soft- ware engineering secondary studies: Semantic Scholar cov- erage and identification role.IET Software, 2021

Abdelhakim Hannousse. Searching relevant papers for soft- ware engineering secondary studies: Semantic Scholar cov- erage and identification role.IET Software, 2021. 2, 13

work page 2021
[41]

H. T. Hayslett.Statistics. Elsevier, 2014. 3

work page 2014
[42]

Melinda Hess and Jeffrey D. Kromrey. Robust Confidence Intervals for Effect Sizes: A Comparative Study of Cohen’s d and Cliff’s Delta Under Non-normality and Heterogeneous Variances. Inannual meeting of the American Educational Research Association (AERA), pages 1–13. American Edu- cational Research Association, 2004. 3

work page 2004
[43]

Hogg and Elliot A

Robert V . Hogg and Elliot A. Tanis.Probability and Statisti- cal Inference. Prentice Hall, 2010. 3

work page 2010
[44]

A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70,

Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70,

work page
[45]

Survey of Hallucination in Natural Language Generation.ACM Computing Surveys, 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pas- cale Fung. Survey of Hallucination in Natural Language Generation.ACM Computing Surveys, 2023. 1, 2, 12

work page 2023
[46]

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs.CoRR abs/2602.08621, 2026

Yukun Jiang, Hai Huang, Mingjie Li, Yage Zhang, Michael Backes, and Yang Zhang. Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs.CoRR abs/2602.08621, 2026. 12

work page arXiv 2026
[47]

Adjacent Words, Divergent Intents: Jailbreaking Large Lan- guage Models via Task Concurrency

Yukun Jiang, Mingjie Li, Michael Backes, and Yang Zhang. Adjacent Words, Divergent Intents: Jailbreaking Large Lan- guage Models via Task Concurrency. InAnnual Conference on Neural Information Processing Systems (NeurIPS), 2025. 2, 12

work page 2025
[48]

Jones, Travis M

Richard E. Jones, Travis M. Hughes, Kevin A. Lawson, and Gregory L Desilva. Citation analysis of the 100 most com- mon articles regarding distal radius fractures.Journal of Clinical Orthopaedics and Trauma, 81:73–75, 2017. 3

work page 2017
[49]

Research methodology used in the 50 most cited articles in the field of pediatrics: types of stud- ies that become citation classics.BMC Medical Research Methodology, 2020

Antonia Jelicic Kadic, Tanja Kovacevic, Edita Runjic, Ana Simicic Majce, Josko Markic, Branka Polic, Julije Me- strovic, and Livia Puljak. Research methodology used in the 50 most cited articles in the field of pediatrics: types of stud- ies that become citation classics.BMC Medical Research Methodology, 2020. 3, 15

work page 2020
[50]

Graham, F.Q

Rodney Michael Kinney, Chloe Anastasiades, Russell Au- thur, Iz Beltagy, Jonathan Bragg, Alexandra Buraczynski, Is- 9 abel Cachola, Stefan Candra, Yoganand Chandrasekhar, Ar- man Cohan, Miles Crawford, Doug Downey, Jason Dunkel- berger, Oren Etzioni, Rob Evans, Sergey Feldman, Joseph Gorney, David W. Graham, F.Q. Hu, Regan Huff, Daniel King, Sebastian Koh...

work page arXiv 2023
[51]

Roger E. Kirk. Practical Significance: A Concept Whose Time Has Come.Educational and Psychological Measure- ment, 1996. 3

work page 1996
[52]

A meta-analysis of semantic classification of citations.Quantitative science studies, 2(4):1170–1215,

Suchetha N Kunnath, Drahomira Herrmannova, David Pride, and Petr Knoth. A meta-analysis of semantic classification of citations.Quantitative science studies, 2(4):1170–1215,

work page
[53]

The impact factor’s matthew effect: A natural experiment in bibliometrics.Jour- nal of the American Society for Information Science and Technology, 61(2):424–427, 2010

Vincent Larivière and Yves Gingras. The impact factor’s matthew effect: A natural experiment in bibliometrics.Jour- nal of the American Society for Information Science and Technology, 61(2):424–427, 2010. 8

work page 2010
[54]

Are Smarter LLMs Safer? Exploring Safety- Reasoning Trade-offs in Prompting and Fine-Tuning.CoRR abs/2502.09673, 2025

Ang Li, Yichuan Mo, Mingjie Li, Yifei Wang, and Yisen Wang. Are Smarter LLMs Safer? Exploring Safety- Reasoning Trade-offs in Prompting and Fine-Tuning.CoRR abs/2502.09673, 2025. 1

[55]

Haoran Li, Dadi Guo, Wei Fan, Mingshi Xu, and Yangqiu Song. Multi-step Jailbreaking Privacy Attacks on ChatGPT. CoRR abs/2304.05197, 2023. 12

[56]

Junyi Li, Xiaoxue Cheng, Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6449–6464. ACL, 2023. 1, 2, 12

[57]

Mingjie Li, Wai-Man Si, Michael Backes, Yang Zhang, and Yisen Wang. SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation. In International Conference on Learning Representations (ICLR), 2025. 1

[58]

Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Yang Liu. Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study. CoRR abs/2305.13860, 2023. 12

[59]

Yugeng Liu, Tianshuo Cong, Zhengyu Zhao, Michael Backes, Yun Shen, and Yang Zhang. Robustness Over Time: Understanding Adversarial Examples’ Effectiveness on Longitudinal Versions of Large Language Models. CoRR abs/2308.07847, 2023. 1

[60]

Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella Béguelin. Analyzing Leakage of Personally Identifiable Information in Language Models. In IEEE Symposium on Security and Privacy (S&P), pages 346–363, Piscataway, NJ, USA, 2023. IEEE. 12

[61]

Guillermo Macbeth, Eugenia Razumiejczyk, and Rubén Daniel Ledesma. Cliff’s Delta Calculator: A non-parametric effect size program for two groups of observations. Universitas Psychologica, 2010. 3

[62]

Aniruddha Maiti, Sai Shi, and Slobodan Vucetic. An ablation study on the use of publication venue quality to rank computer science departments. Scientometrics, 2023. 8

[63]

Mario Malički, Ana Jerončić, IJsbrand Jan Aalbersberg, Lex Bouter, and Gerben Ter Riet. Systematic review and meta-analyses of studies analysing instructions to authors from 1987 to 2017. Nature communications, 12(1):5840, 2021. 15

[64]

Stefano Mammola, Elena Piano, Alberto Doretto, Enrico Caprio, and Dan Chamberlain. Measuring the influence of non-scientific features on citations. Scientometrics, 127(7):4123–4137, 2022. 8

[65]

Philomena Marfo and Gabriel Asare Okyere. The accuracy of effect-size estimates under normals and contaminated normals in meta-analysis. Heliyon, 2019. 3

[66]

Bland J. Martin and Douglas G. Altman. Applying the right statistics: analyses of measurement studies. Ultrasound in Obstetrics and Gynecology: The Official Journal of the International Society of Ultrasound in Obstetrics and Gynecology, 2003. 1, 3

[67]

Edward J. Mascha and Thomas R. Vetter. Significance, Errors, Power, and Sample Size: The Blocking and Tackling of Statistics. Anesthesia & Analgesia, 2018. 3

[68]

T.J. McCabe. A Complexity Measure. IEEE Transactions on Software Engineering, 1976. 13

[69]

Francis McIntyre and F. N. David. Tables of the Ordinates and Probability Integral of the Distribution of the Correlation Coefficient in Small Samples. In Mathematics, Cambridge, United Kingdom, 1938. Cambridge University Press. 16

[70]

Patrick E. McKight and Julius Najab. Kruskal-Wallis Test. The Corsini Encyclopedia of Psychology, 2010. 4

[71]

Patrick E. McKnight and Julius Najab. Mann–Whitney U Test. The SAGE Encyclopedia of Research Design, 2010. 3

[72]

Kane Meissel and Esther S. Yao. Using Cliff’s Delta as a Non-Parametric Effect Size Measure: An Accessible Web App and R Tutorial. Practical Assessment, Research, and Evaluation, 2024. 3

[73]

Robert K. Merton. The matthew effect in science. Science, 159(3810):56–63, 1968. 4, 15

[74]

Meta AI. https://paperswithcode.com/api/v1/docs/,

[75]

Microsoft. https://learn.microsoft.com/en-us/visualstudio/code-quality/code-metrics-maintainability-index-range-and-meaning?view=vs-2022/, 2022. 13

[76]

Fatemehsadat Mireshghallah, Kartik Goyal, Archit Uniyal, Taylor Berg-Kirkpatrick, and Reza Shokri. Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 8332–8347. ACL, 2022. 12

[77]

Nachar Nadim. The mann-whitney u: A test for assessing whether two independent samples come from the same distribution. Tutorials in quantitative Methods for Psychology,

[78]

National Academies of Sciences, Engineering, and Medicine. Reproducibility and Replicability in Science. National Academies Press (US), 2019. 15

[79]

Jason T. Newsom. Sample Size and Power for Regression. https://web.pdx.edu/~newsomj/ho_sample%20size.pdf, 2021. 16

[80]

Daniel Olszewski, Allison Lu, Carson Stillman, Kevin Warren, Cole Kitroser, Alejandro Pascual, Divyajyoti Ukirde, Kevin Butler, and Patrick Traynor. "Get in Researchers; We’re Measuring Reproducibility": A Reproducibility Study of Machine Learning Papers in Tier 1 Security Conferences. In ACM SIGSAC Conference on Computer and Communications Security (C...
