pith. machine review for the scientific record.

arxiv: 2604.17803 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.LG

Recognition: unknown

Adversarial Arena: Crowdsourcing Data Generation through Interactive Competition


Pith reviewed 2026-05-10 04:52 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords adversarial arena · crowdsourcing · data generation · LLM fine-tuning · cybersecurity · safety alignment · multi-turn conversations · secure code generation

The pith

Framing data generation as an attacker-defender competition produces diverse multi-turn conversations that improve secure code generation after fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that organizing data creation as an interactive competition, in which some teams act as attackers crafting prompts and others as defenders generating responses, yields higher-quality and more diverse conversational datasets than traditional crowdsourcing or synthetic generation. This is particularly valuable in scarce domains such as multi-turn safety alignment for cybersecurity. By running such a competition with ten university teams, the authors collected nearly twenty thousand conversations; fine-tuning an open-source model on this data produced notable gains on two cybersecurity evaluation benchmarks.

Core claim

By framing data generation as an adversarial task between attacker bots creating prompts and defender bots generating responses in a competitive arena with multiple teams, the approach naturally produces diverse and complex multi-turn conversations. A competition involving 10 teams generated 19,683 such conversations focused on safety alignment of LLMs in cybersecurity. Fine-tuning an open-source model on this dataset resulted in an 18.47% improvement on CyberSecEval-Instruct and 29.42% on CyberSecEval-MITRE.

What carries the argument

Adversarial Arena: an interactive framework in which attacker bots generate prompts and defender bots produce responses, with competition between teams driving the creation of high-quality, diverse data.
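The mechanism can be pictured as a session loop pairing an attacker with a defender; the transcript of each session becomes a data point. A minimal sketch, assuming hypothetical `attacker_bot`/`defender_bot` interfaces (the paper's actual orchestrator API is not specified in the abstract):

```python
# Hypothetical sketch of one arena session; `next_prompt` and `respond`
# are assumed interfaces standing in for the team-built bots.
from dataclasses import dataclass, field

@dataclass
class Conversation:
    attacker: str
    defender: str
    turns: list = field(default_factory=list)  # alternating (prompt, response)

def run_session(attacker_bot, defender_bot, n_turns=3):
    """One attacker-defender session: the attacker crafts each prompt from
    the conversation so far, the defender responds; the transcript is the
    collected data point."""
    conv = Conversation(attacker=attacker_bot.name, defender=defender_bot.name)
    for _ in range(n_turns):
        prompt = attacker_bot.next_prompt(conv.turns)
        response = defender_bot.respond(prompt, conv.turns)
        conv.turns.append((prompt, response))
    return conv
```

Running many such sessions across all statically scheduled attacker-defender pairs is what yields the multi-turn corpus.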

Load-bearing premise

The benchmark improvements result specifically from the quality and diversity induced by the attacker-defender competition rather than from other factors like the total amount of data collected or the choice of participants.

What would settle it

A control experiment collecting a similar number of conversations without the adversarial competition structure and showing no comparable improvements upon fine-tuning would falsify the claim that the arena method is responsible for the gains.
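The decisive quantity in such a control experiment is the gap between the arena-trained model's benchmark delta and the control-trained model's delta. A minimal sketch, assuming (which the abstract does not confirm) that the reported percentages are relative improvements over the base model's score; all score values are illustrative:

```python
# Hypothetical sketch of the proposed control comparison; the "relative
# delta" reading of the reported percentages is an assumption.
def relative_delta(base_score, finetuned_score):
    """Percent improvement over the base model."""
    return 100.0 * (finetuned_score - base_score) / base_score

def arena_effect(base, arena_ft, control_ft):
    """Per-benchmark gain attributable to the arena format beyond a
    matched-volume non-adversarial dataset."""
    return {
        bench: relative_delta(base[bench], arena_ft[bench])
               - relative_delta(base[bench], control_ft[bench])
        for bench in base
    }
```

An `arena_effect` near zero on both benchmarks would falsify the claim; a large positive gap would support it.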

Figures

Figures reproduced from arXiv: 2604.17803 by Anna Gottardi, Anna Rumshisky, Desheng Zhang, Hangjie Shi, Lavina Vaz, Leslie Ball, Lucy Hu, Luke Dai, Maureen Murray, Michael Johnston, Prasoon Goyal, Rahul Gupta, Samyuth Sagi, Sankaranarayanan Ananthakrishnan, Sattvik Sahai, Shaohua Liu, Yao Lu.

Figure 1. Adversarial Arena Overview: attacker/defender pairs interact over several tournament …
Figure 2. T-SNE plots: conversations in the left plot are grouped by attackers, the middle plot is …
Figure 3. Orchestrator Architecture.
Figure 4. Vulnerable vs Malicious Sessions Across Tournaments.
Figure 5. Example Adversarial Conversation 1: a representative conversation between an attacker …
Figure 6. Example Adversarial Conversation 2: a conversation demonstrating a multi-step attack …
Figure 7. Visualizations of pairwise agreements between annotators, along with a histogram of inter…
original abstract

Post-training Large Language Models requires diverse, high-quality data which is rare and costly to obtain, especially in low resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena for building high quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactive competition between multiple teams naturally produces diverse and complex data. We validated this approach by conducting a competition with 10 academic teams from top US and European universities, each building attacker or defender bots. The competition, focused on safety alignment of LLMs in cybersecurity, generated 19,683 multi-turn conversations. Fine-tuning an open-source model on this dataset produced an 18.47% improvement in secure code generation on CyberSecEval-Instruct and 29.42% improvement on CyberSecEval-MITRE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Adversarial Arena, a framework that frames conversational data generation as an interactive competition between attacker and defender bots built by multiple independent teams. A competition involving 10 academic teams generated 19,683 multi-turn conversations focused on cybersecurity safety alignment; fine-tuning an open-source model on this dataset is reported to produce an 18.47% improvement on CyberSecEval-Instruct and a 29.42% improvement on CyberSecEval-MITRE for secure code generation.

Significance. If the reported gains can be causally linked to the adversarial competition format rather than data volume or participant expertise, the approach offers a scalable way to produce diverse, high-complexity multi-turn data in low-resource domains such as safety alignment. The concrete benchmark deltas and the scale of the crowdsourced dataset (19k conversations) suggest practical utility for post-training, provided the method is shown to outperform simpler collection strategies.

major comments (2)
  1. [Abstract] Abstract: The central empirical claim attributes the 18.47% and 29.42% benchmark improvements to the quality and diversity arising from the attacker-defender competition, yet the abstract (and, from context, the results section) provides no baseline model details, no matched-volume non-adversarial control dataset, and no description of how the 19,683 conversations were filtered or split. Without these, the incremental benefit of the arena format over simply collecting an equivalent volume of cybersecurity multi-turn data cannot be assessed.
  2. [Results] Results / Evaluation: The fine-tuning experiments report percentage improvements on CyberSecEval-Instruct and CyberSecEval-MITRE but omit the base model name, training hyperparameters, number of epochs, and any statistical significance tests or variance across runs. This leaves open whether the deltas exceed what would be obtained from a comparable volume of data generated by non-competitive prompting or from existing cybersecurity corpora.
minor comments (2)
  1. [Method] The manuscript does not specify how the 10 teams were assigned to attacker versus defender roles or whether any teams participated in both, which affects reproducibility of the data-generation protocol.
  2. [Data] Figure or table captions describing the generated conversations should include basic statistics (average turns per conversation, topic distribution) to allow readers to gauge diversity.
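The variance and significance reporting requested in major comment 2 could be satisfied cheaply even from a single run, by bootstrapping per-example benchmark outcomes. A minimal sketch, assuming (hypothetically) that each benchmark item yields a 0/1 pass outcome; the outcome values in any usage are illustrative, not from the paper:

```python
# Minimal bootstrap sketch for a confidence interval on a benchmark pass
# rate; per-example 0/1 outcomes are an assumed data format.
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap (1 - alpha) CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(
        sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

If the base model's score falls outside the fine-tuned model's interval on both benchmarks, the deltas are unlikely to be noise, though run-to-run training variance would still need separate treatment.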

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, indicating the revisions we will make to improve clarity and reproducibility while honestly noting limitations of the current study.

point-by-point responses
  1. Referee: [Abstract] Abstract: The central empirical claim attributes the 18.47% and 29.42% benchmark improvements to the quality and diversity arising from the attacker-defender competition, yet the abstract (and, from context, the results section) provides no baseline model details, no matched-volume non-adversarial control dataset, and no description of how the 19,683 conversations were filtered or split. Without these, the incremental benefit of the arena format over simply collecting an equivalent volume of cybersecurity multi-turn data cannot be assessed.

    Authors: We agree that the abstract is too concise and that the results section requires expansion for proper evaluation of the claims. In the revised manuscript we will update the abstract to name the base model and briefly note the data filtering and train/test split procedure. We will add a dedicated subsection in Results describing the full data processing pipeline, including any quality filters applied to the 19,683 conversations. A matched-volume non-adversarial control was not collected in this work; we will explicitly discuss this as a limitation, explain the design rationale for focusing on the adversarial format (iterative attack-defense dynamics and cross-team diversity), and suggest such a control as valuable future work. revision: partial

  2. Referee: [Results] Results / Evaluation: The fine-tuning experiments report percentage improvements on CyberSecEval-Instruct and CyberSecEval-MITRE but omit the base model name, training hyperparameters, number of epochs, and any statistical significance tests or variance across runs. This leaves open whether the deltas exceed what would be obtained from a comparable volume of data generated by non-competitive prompting or from existing cybersecurity corpora.

    Authors: We will revise the Results section to specify the exact base model, all training hyperparameters, the number of epochs, and any available run-to-run variance or statistical tests. These additions will make the experimental protocol fully reproducible and allow readers to assess the magnitude of the reported gains relative to the base model. We maintain that the multi-team adversarial setting produces data characteristics (complexity, diversity of attack vectors) that are difficult to replicate with simple non-competitive prompting, but we will add explicit language acknowledging that a direct volume-matched comparison remains an open question. revision: yes

standing simulated objections (unresolved)
  • Results from a matched-volume non-adversarial control dataset remain unreported; the authors deem this outside the scope of the original competition-based study.

Circularity Check

0 steps flagged

Empirical benchmark gains measured on external datasets exhibit no circularity

full rationale

The paper's central result is an empirical measurement: 10 teams generated 19,683 conversations via an attacker-defender competition, an open-source model was fine-tuned on the resulting dataset, and performance deltas of 18.47% and 29.42% were observed on the independent CyberSecEval-Instruct and CyberSecEval-MITRE benchmarks. No equations, fitted parameters, self-definitions, or self-citation chains are present that would reduce these measured outcomes to the inputs by construction. The derivation chain consists of a described data-generation procedure followed by standard fine-tuning and external evaluation; the reported improvements are falsifiable observations rather than tautological restatements of the competition setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that adversarial prompt-response competition yields higher-quality and more diverse data than standard methods; no free parameters or new invented entities are introduced in the abstract.

axioms (1)
  • domain assumption Interactive competition between attacker teams creating prompts and defender teams generating responses naturally produces diverse, complex, and high-quality multi-turn conversational data suitable for LLM safety alignment.
    This premise is invoked to justify why the competition format outperforms crowdsourcing or synthetic generation; it is not derived from prior results in the abstract.

pith-pipeline@v0.9.0 · 5516 in / 1333 out tokens · 28206 ms · 2026-05-10T04:52:13.530776+00:00 · methodology

