Benchmark of Benchmarks: Unpacking Influence and Code Repository Quality in LLM Safety Benchmarks
Pith reviewed 2026-05-21 12:05 UTC · model grok-4.3
The pith
LLM safety benchmark adoption tracks author prominence and basic runnability rather than code quality or ethical standards.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Only 39 percent of the benchmark repositories run without any modification, 16 percent supply flawless installation guides, and just 6 percent include ethical considerations even though they contain potentially harmful content. Adoption correlates with author prominence and code runnability but shows no relation to static code quality metrics such as Pylint scores or maintainability. These patterns hold across the study period without improvement, and some repositories make successful attack responses publicly available without warnings or controls.
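As a rough sanity check, the reported percentages can be converted into approximate repository counts. The sketch below assumes every percentage is computed over all 31 benchmark repositories, which this summary does not state explicitly; the dictionary keys and the helper name are illustrative.

```python
# Assumption: each reported share is taken over all 31 benchmark repositories.
TOTAL_REPOS = 31

# Shares as reported in the core claim.
reported = {
    "runs_without_modification": 0.39,
    "flawless_install_guide": 0.16,
    "ethical_considerations": 0.06,
}


def implied_count(share: float, total: int = TOTAL_REPOS) -> int:
    """Nearest whole number of repositories consistent with a reported share."""
    return round(share * total)


implied = {name: implied_count(share) for name, share in reported.items()}

for name, count in implied.items():
    # Recompute the percentage from the integer count to see if it rounds back.
    print(f"{name}: about {count} of {TOTAL_REPOS} ({count / TOTAL_REPOS:.0%})")
```

Under that assumption the shares correspond to roughly 12, 5, and 2 repositories, which shows how few repositories sit behind the smaller percentages.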
What carries the argument
Systematic measurement combining automated static analysis, over 220 person-hours of human runnability testing, and bibliometric analysis of adoption patterns across 31 benchmarks versus a control group of 382 papers.
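For readers unfamiliar with the static metrics named in the core claim, the sketch below computes two of them from raw measurements. It uses Pylint's default score expression and the 0 to 100 maintainability index as documented by Microsoft for Visual Studio. Whether the paper used exactly these variants is an assumption, and all input numbers are made up.

```python
import math


def pylint_score(error: int, warning: int, refactor: int, convention: int,
                 statements: int, fatal: int = 0) -> float:
    """Pylint's default evaluation: 10 minus a weighted message density."""
    if statements <= 0:
        raise ValueError("Pylint reports no score when there are no statements")
    if fatal:
        return 0.0
    penalty = (5 * error + warning + refactor + convention) / statements * 10
    return max(0.0, 10.0 - penalty)


def maintainability_index(halstead_volume: float, cyclomatic_complexity: float,
                          lines_of_code: int) -> float:
    """Maintainability index rescaled to 0..100 (Visual Studio variant)."""
    raw = (171
           - 5.2 * math.log(halstead_volume)
           - 0.23 * cyclomatic_complexity
           - 16.2 * math.log(lines_of_code))
    return max(0.0, raw * 100 / 171)


# Hypothetical measurements for a single repository.
print(pylint_score(error=2, warning=5, refactor=3, convention=10, statements=200))
print(maintainability_index(halstead_volume=1000.0, cyclomatic_complexity=10,
                            lines_of_code=100))
```

Both metrics depend only on the source text, so a repository can score well on them and still fail to run, which is consistent with the paper treating runnability as a separate, manually tested property.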
If this is right
- Downstream safety evaluations across papers may not be comparable when each requires ad-hoc code changes to run the benchmarks.
- Repositories that expose unfiltered harmful content without warnings or access controls can serve as open resources for attacks.
- The community does not reward higher coding standards or documentation when choosing which benchmarks to adopt.
- Persistent deficiencies suggest that reliability and safety concerns in evaluations will continue unless practices change.
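The documentation and ethics gaps listed above could be screened for automatically before a benchmark is released. The sketch below is a hypothetical heuristic, not the checklist proposed in the paper: the function name, keyword patterns, and dependency file names are all assumptions chosen for illustration, and a keyword match says nothing about whether the instructions actually work.

```python
import re

# Hypothetical keyword heuristics; a real audit would need human review.
INSTALL_PATTERN = re.compile(r"\b(install|installation|setup)\b", re.IGNORECASE)
ETHICS_PATTERN = re.compile(
    r"\b(ethic\w*|content warning|responsible use|harmful content)\b",
    re.IGNORECASE,
)
DEPENDENCY_FILES = {"requirements.txt", "environment.yml", "pyproject.toml", "setup.py"}


def audit_repo(files: dict[str, str]) -> dict[str, bool]:
    """Check a mapping of top-level file names to contents for basic hygiene."""
    readme = "\n".join(
        text for name, text in files.items() if name.lower().startswith("readme")
    )
    return {
        "has_readme": bool(readme.strip()),
        "mentions_installation": bool(INSTALL_PATTERN.search(readme)),
        "declares_dependencies": any(name in DEPENDENCY_FILES for name in files),
        "has_ethics_note": bool(ETHICS_PATTERN.search(readme)),
    }


# Made-up repository contents for demonstration.
example = {
    "README.md": "# Bench\n## Installation\npip install -r requirements.txt\n",
    "requirements.txt": "torch\n",
}
print(audit_repo(example))
```

A check like this can only flag missing sections; the paper's runnability figures come from people actually following the instructions, which no keyword scan replaces.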
Where Pith is reading between the lines
- Benchmark creators could gain faster adoption by prioritizing immediate runnability and visible author networks over internal code polish.
- A shared quality checklist might shift selection incentives if later studies show it predicts higher uptake.
- Similar gaps in repository standards likely appear in other evaluation-heavy areas such as general LLM capability testing.
Load-bearing premise
The 31 selected benchmarks and 382 control papers form a representative sample of LLM safety literature without bias from how they were identified.
What would settle it
Re-running the full static analysis, runnability tests, and adoption correlation on an independently chosen larger set of LLM safety benchmarks to check whether the reported percentages and lack of quality correlation persist.
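The correlation part of such a replication can be sketched with the standard library alone. This page does not say which statistic the paper uses, so the Spearman rank correlation and the permutation test below are assumptions, and the scores and adoption counts are invented for illustration.

```python
import random
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; inputs must not be constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation is Pearson correlation on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_perm=5000, seed=0):
    """Two-sided p-value from shuffling y to break any pairing with x."""
    rng = random.Random(seed)
    observed = abs(spearman(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical per-benchmark data: static quality score vs. adoption count.
quality_scores = [6.1, 7.4, 5.2, 8.8, 4.9, 7.0, 6.6, 9.1]
adoption_counts = [40, 12, 95, 30, 8, 61, 15, 22]
print(spearman(quality_scores, adoption_counts))
print(permutation_p_value(quality_scores, adoption_counts))
```

A rank-based statistic with a permutation test makes few distributional assumptions, which matters with a sample as small as 31 benchmarks. A replication would swap in the measured values and an independently chosen benchmark set.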
Original abstract
The rapid expansion of research in LLM safety presents challenges in tracking advancements, making benchmarks important evaluation infrastructures for identifying key trends and facilitating systematic comparisons. Yet no systematic assessment exists of their code quality and runnability, nor of what factors are associated with the community's adoption of certain benchmarks over others. To address this gap, we conduct a systematic measurement study of 31 LLM safety benchmarks (covering prompt injection, jailbreak, and hallucination) with 382 non-benchmark papers as a control group, combining automated static analysis, human runnability testing (220+ person-hours), and bibliometric analysis. We find that only 39% of benchmark repositories can run without modification, only 16% provide flawless installation guides, and a mere 6% include ethical considerations despite containing potentially harmful content. These deficiencies persist across the study period with no significant improvement. Analyzing adoption factors, we find that benchmark adoption correlates with author prominence and code runnability, but not with code quality standards such as Pylint score and maintainability, suggesting that the community's benchmark selection does not reward higher coding standards. Based on these results, we identify potential safety and reliability concerns. Some safety benchmark repositories openly expose harmful content, such as successful jailbreak responses, without any ethical warning or access control, effectively serving as unguarded attack resources. Furthermore, when benchmarks require ad-hoc modifications to run, downstream safety evaluations across different papers may not be comparable. We present case studies illustrating these concrete consequences and propose a targeted checklist to help benchmark contributors improve code quality, documentation, and ethical practices.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a systematic measurement study of 31 LLM safety benchmarks (prompt injection, jailbreak, hallucination) against 382 non-benchmark control papers. It combines automated static analysis, 220+ person-hours of manual runnability testing, and bibliometric analysis to quantify code quality and adoption factors. Key results: 39% of repositories run without modification, 16% have flawless installation guides, and only 6% include ethical notes. Adoption correlates with author prominence and runnability but shows no correlation with code-quality metrics such as Pylint score or maintainability. The work identifies safety risks from unguarded harmful content and proposes a contributor checklist.
Significance. If the sample is representative, the study supplies concrete, reproducible evidence on reproducibility failures and ethical gaps in LLM safety benchmarks, backed by extensive manual verification and bibliometric controls. The finding that adoption tracks prominence and runnability rather than quality standards, together with the explicit checklist, offers actionable guidance for the community. The combination of automated tools, large-scale human testing, and falsifiable quantitative claims (percentages, correlations) is a clear strength.
major comments (2)
- [§3] §3 (Methodology): The paper provides no search strategy, databases, keywords, date range, inclusion/exclusion rules, or justification for selecting the 31 benchmarks and 382 control papers. This detail is load-bearing for the central claim that adoption correlates with prominence and runnability rather than quality metrics, because an uncharacterized sample could mechanically produce the reported pattern through selection artifacts.
- [§4.3] §4.3 (Adoption analysis): The reported correlations (e.g., with author prominence and runnability) are presented without sensitivity checks for alternative sampling frames or controls for publication venue; if the 31 benchmarks over-represent prominent authors, the absence of correlation with Pylint/maintainability cannot be interpreted as a community-wide preference.
minor comments (2)
- [§2] §2: A brief comparison table of prior benchmark surveys would help situate the novelty of the 220+ person-hour manual evaluation.
- [Table 2] Table 2: Define all column abbreviations (e.g., “EA”, “Pylint”) in the caption so the table is self-contained.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the revisions we will make to improve the manuscript's clarity and robustness.
- Referee: [§3] §3 (Methodology): The paper provides no search strategy, databases, keywords, date range, inclusion/exclusion rules, or justification for selecting the 31 benchmarks and 382 control papers. This detail is load-bearing for the central claim that adoption correlates with prominence and runnability rather than quality metrics, because an uncharacterized sample could mechanically produce the reported pattern through selection artifacts.
  Authors: We agree that a more explicit description of the sampling process is necessary to support the reproducibility and validity of our findings. In the revised manuscript we will add a dedicated subsection to §3 that fully documents the search strategy. This will specify the databases queried (arXiv, ACL Anthology, and Google Scholar), the precise keywords and Boolean combinations used, the date range (2022–2024), the inclusion criteria (publicly available code repositories for prompt-injection, jailbreak, or hallucination benchmarks), the exclusion criteria (non-code-based evaluations, non-English papers, or works without GitHub links), and the rationale for arriving at the final counts of 31 benchmarks and 382 control papers. These additions will directly address the possibility of selection artifacts and strengthen the interpretation of the reported correlations.
  Revision: yes
- Referee: [§4.3] §4.3 (Adoption analysis): The reported correlations (e.g., with author prominence and runnability) are presented without sensitivity checks for alternative sampling frames or controls for publication venue; if the 31 benchmarks over-represent prominent authors, the absence of correlation with Pylint/maintainability cannot be interpreted as a community-wide preference.
  Authors: We acknowledge the value of additional robustness checks. In the revision we will augment §4.3 with sensitivity analyses that (i) restrict the sample to benchmarks published in top-tier venues, (ii) include publication venue as a control variable in the regression models, and (iii) repeat the correlation tests on a venue-matched subset of the control group. These checks will help demonstrate that the observed pattern (adoption tracking prominence and runnability rather than static code-quality metrics) holds under alternative sampling frames and is not an artifact of over-representation of prominent authors.
  Revision: yes
Circularity Check
No significant circularity: empirical measurement study with direct observations
This paper performs a systematic measurement study by selecting 31 LLM safety benchmarks and 382 control papers, then applying automated static analysis, human runnability testing, and bibliometric analysis to measure code quality, runnability, and adoption correlations. No derivations, equations, or first-principles predictions exist that reduce to the paper's own inputs by construction. The reported correlations (adoption with prominence/runnability, none with Pylint/maintainability) are computed directly from the sampled data without fitted parameters renamed as predictions or self-citation chains that bear the central load. The study is self-contained against external benchmarks of repository quality and adoption metrics, with no self-definitional, uniqueness-imported, or ansatz-smuggled steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected 31 benchmarks and 382 control papers adequately represent the broader LLM safety literature.
Forward citations
Cited by 1 Pith paper
- Pruning Unsafe Tickets: A Resource-Efficient Framework for Safer and More Robust LLMs. Pruning removes 'unsafe tickets' from LLMs via gradient-free attribution, reducing harmful outputs and jailbreak vulnerability with minimal utility loss.