
arXiv: 2605.05415 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.AI · cs.CR

Information Theoretic Adversarial Training of Large Language Models

Pith reviewed 2026-05-08 17:25 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.CR
keywords adversarial training, large language models, distributionally robust optimization, f-divergence, robust alignment, attack success rate, information theoretic objectives

The pith

WARDEN reweights adversarial examples inside an f-divergence ball to cut attack success rates on large language models while keeping utility costs comparable to prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WARDEN, a distributionally robust training procedure that optimizes the worst-case loss inside an f-divergence neighborhood of the empirical data distribution and uses a dynamical parameter to emphasize harder adversarial prompts. If the approach works, it would let practitioners strengthen LLM alignment against evolving attacks without the prohibitive expense of earlier continuous adversarial techniques. A reader cares because current models still produce harmful outputs under novel prompts, and a scalable robustness method could make safer deployment feasible. The method converts the robust objective, via convex duality, into a log-sum-exp loss under the KL divergence that automatically upweights difficult examples during training.

Core claim

WARDEN optimizes the worst-case adversarial loss within an f-divergence ambiguity set around the empirical training distribution. Under the KL divergence this reduces to a log-sum-exp objective controlled by a dynamical reweighting parameter that automatically focuses on harder examples. The paper reports substantially lower attack success rates across multiple LLMs and attack settings, at computational and utility costs comparable to the CAT, CAPO, and MixAT baselines.
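To make the reweighting concrete, here is a minimal NumPy sketch of a KL-type log-sum-exp aggregate over per-sample adversarial losses. It assumes the textbook form λ·log((1/n)·Σ exp(ℓ_i/λ)), whose gradient weights each sample by softmax(ℓ_i/λ). The function names, the `lam` argument, and the example losses are illustrative assumptions; the paper's exact objective and its schedule for the parameter may differ.

```python
import numpy as np

def kl_dro_loss(losses, lam):
    """Log-sum-exp aggregate: lam * log(mean(exp(losses / lam)))."""
    z = np.asarray(losses, dtype=float) / lam
    m = z.max()  # subtract the max for numerical stability
    return lam * (m + np.log(np.mean(np.exp(z - m))))

def kl_dro_weights(losses, lam):
    """Implied per-sample weights: softmax(losses / lam)."""
    z = np.asarray(losses, dtype=float) / lam
    w = np.exp(z - z.max())
    return w / w.sum()

# Toy per-sample adversarial losses (made up for illustration).
losses = np.array([0.2, 0.5, 3.0])
print(kl_dro_loss(losses, lam=1.0), kl_dro_weights(losses, lam=1.0))
print(kl_dro_loss(losses, lam=100.0), losses.mean())
```

A small `lam` concentrates weight on the hardest example, while a large `lam` brings the aggregate close to the ordinary mean loss.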

What carries the argument

The f-divergence ambiguity set with dynamical reweighting that converts worst-case loss minimization into an automatic emphasis on difficult adversarial examples.

If this is right

  • Attack success rates drop substantially on the tested LLMs and attack types.
  • Model utility on standard tasks stays comparable to the CAT, CAPO, and MixAT baselines.
  • Training compute remains in line with existing continuous adversarial methods.
  • The framework supplies a new family of information-theoretic objectives for robust alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reweighting mechanism could apply to other distribution-shift problems beyond adversarial prompting.
  • Automatic focus on hard examples might lessen the need for manual design of attack prompts.
  • Whether the dynamical parameter remains stable when models grow much larger is worth direct testing.

Load-bearing premise

The f-divergence ball around the empirical training distribution covers the adversarial perturbations that LLMs actually encounter, and the reweighting parameter can be chosen without introducing instability or overfitting.

What would settle it

A new attack strategy or previously untested LLM where WARDEN produces higher attack success rates than CAT or CAPO at matched utility and compute.

Figures

Figures reproduced from arXiv: 2605.05415 by Elisa Bertino, Jason Pacheco, Jeremiah Birrell, Reza Ebrahimi, Rouzbeh Behnia, Yiwei Zhang.

Figure 1: Overview of WARDEN with adaptive DRO reweighting. Left: A base continuous adversarial training method generates embedding-space perturbations and per-sample adversarial losses. Right: WARDEN replaces uniform aggregation with an f-divergence DRO objective that dynamically upweights high-loss adversarial examples via a learnable or optimized dual variable λ_t. For KL divergence, the dual objective reduces to …
Figure 2: Sensitivity to the KL-DRO radius ϵ on Mistral-7B CAPO-WARDENO. Moderate values of ϵ improve robustness, with the lowest average ASR around ϵ = 0.1, while utility remains relatively stable across the evaluated range. Effect of the DRO radius ϵ. The radius ϵ controls how far the adversarial reweighting distribution may deviate from the empirical minibatch distribution. Smaller values keep the objective close …
Figure 3: Ablation of dual-variable handling strategies on Mistral-7B. The fixed variant holds …
Figure 4: Training dynamics under learnable and optimized dual-variable treatments on Mistral-7B.
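
The relation between the radius ϵ and the dual variable in the captions above can be illustrated with a small sketch. It assumes the standard KL-DRO optimality condition, under which the worst-case weights are softmax(ℓ_i/λ) and λ is chosen so that their KL divergence from the uniform minibatch distribution equals ϵ. The helper names, the bisection bounds, and the toy losses are hypothetical; this is not the paper's learnable or optimized update rule.

```python
import numpy as np

def tilted_weights(losses, lam):
    """Exponentially tilted weights: softmax(losses / lam)."""
    z = np.asarray(losses, dtype=float) / lam
    w = np.exp(z - z.max())
    return w / w.sum()

def kl_to_uniform(w):
    """KL(w || uniform) for a weight vector on n points."""
    n = len(w)
    nz = w > 0  # zero-weight entries contribute nothing
    return float(np.sum(w[nz] * np.log(w[nz] * n)))

def solve_lambda(losses, eps, lo=1e-3, hi=1e3, iters=100):
    """Bisection in log-space for the lam with KL(weights || uniform) = eps.

    Relies on the KL of the tilted weights shrinking as lam grows.
    Assumes eps is attainable for some lam in [lo, hi].
    """
    for _ in range(iters):
        mid = np.sqrt(lo * hi)
        if kl_to_uniform(tilted_weights(losses, mid)) > eps:
            lo = mid  # weights too peaked: increase lam
        else:
            hi = mid
    return float(np.sqrt(lo * hi))

# Toy per-sample adversarial losses (made up for illustration).
losses = np.array([0.2, 0.5, 1.0, 3.0])
for eps in (0.01, 0.1, 0.5):
    lam = solve_lambda(losses, eps)
    w = tilted_weights(losses, lam)
    print(eps, round(lam, 3), np.round(w, 3), round(float(w @ losses), 3))
```

In this toy setting a larger radius yields a smaller λ, more peaked weights, and a larger worst-case expected loss, which is consistent with the caption's description of ϵ as the allowed deviation from the minibatch distribution.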
Original abstract

Large language models (LLMs) remain vulnerable to adversarial prompting despite advances in alignment and safety, often exhibiting harmful behaviors under novel attack strategies. While adversarial training can improve robustness, existing approaches are computationally expensive and difficult to scale. Recent continuous adversarial training methods, such as Continuous adversarial training (CAT) and Continuous Adversarial Preference Optimization (CAPO), address this challenge by leveraging gradient-based perturbations in the embedding space, enabling more efficient and expressive attacks. Building on this paradigm, we propose WARDEN, a distributionally robust adversarial training framework for LLMs that dynamically reweights adversarial examples through an f -divergence ambiguity set around the empirical training distribution. Our method optimizes the worst-case adversarial loss within a divergence ball around the empirical data distribution, automatically emphasizing harder adversarial examples. Using the convex dual formulation, the objective reduces to a log-sum-exp form under the KL divergence, with a dynamical parameter controlling the strength of reweighting. This study leads to a new class of information-theoretic objectives that significantly reduce attack success rates while maintaining model utility. Across multiple LLMs and attack settings, WARDEN substantially reduces attack success rates with computational and utility costs comparable to CAT-, CAPO-, and MixAT-based baselines, making it a practical approach for scalable robust alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes WARDEN, a distributionally robust optimization (DRO) framework for adversarial training of large language models (LLMs). It uses an f-divergence ambiguity set around the empirical training distribution to optimize the worst-case adversarial loss, which under KL divergence reduces to a log-sum-exp objective with a dynamical reweighting parameter. The paper claims that this leads to significant reductions in attack success rates across multiple LLMs and attack settings, with costs comparable to baselines such as CAT, CAPO, and MixAT, while maintaining model utility.

Significance. If the empirical results are robust and the ambiguity set is appropriately chosen, this work offers a principled information-theoretic approach to scalable adversarial training for LLMs. It builds on continuous adversarial training methods by incorporating automatic reweighting of harder examples via DRO, potentially providing better robustness without excessive computational overhead. The provision of a convex dual formulation is a strength.

major comments (2)
  1. [Method description (abstract and §3)] The central claim that optimizing within the f-divergence ball yields robustness to practical adversarial prompts (e.g., jailbreaks) rests on the assumption that continuous embedding-space perturbations within the ball are representative of discrete, semantically coherent attacks. This is not obviously true, as the ball may not intersect the support of effective real-world attacks; additional justification or experiments showing coverage would be needed to support the robustness gains.
  2. [Experimental section] The abstract reports reductions in attack success rates but gives no details on statistical significance, the number of runs, or the attack success metric used to tune the dynamical reweighting parameter. Without these, it is difficult to rule out overfitting or circularity in the reported improvements.
minor comments (2)
  1. [Abstract] The phrasing 'This study leads to a new class of information-theoretic objectives' is vague; specify what the new class is and how it differs from existing DRO objectives.
  2. [Abstract] Typo: 'f -divergence' should be 'f-divergence'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful and constructive comments on our manuscript. We address each major comment point-by-point below and describe the revisions we will make to strengthen the paper.

  1. Referee: [Method description (abstract and §3)] The central claim that optimizing within the f-divergence ball yields robustness to practical adversarial prompts (e.g., jailbreaks) rests on the assumption that continuous embedding-space perturbations within the ball are representative of discrete, semantically coherent attacks. This is not obviously true, as the ball may not intersect the support of effective real-world attacks; additional justification or experiments showing coverage would be needed to support the robustness gains.

    Authors: We appreciate the referee highlighting this key assumption. Our approach directly extends prior continuous adversarial training methods (CAT and CAPO) that have already demonstrated practical effectiveness against discrete jailbreak attacks via embedding perturbations. The f-divergence ball provides a principled neighborhood for robust optimization, and our empirical results across multiple LLMs and attack settings show consistent reductions in attack success rates. To address the concern, we will revise the abstract and Section 3 to include additional justification, referencing the literature on continuous attacks and providing qualitative examples from our experiments where embedding perturbations yield semantically coherent prompt variations. This will better articulate the coverage without requiring entirely new experiments.

    revision: partial

  2. Referee: [Experimental section] The abstract reports reductions in attack success rates but gives no details on statistical significance, the number of runs, or the attack success metric used to tune the dynamical reweighting parameter. Without these, it is difficult to rule out overfitting or circularity in the reported improvements.

    Authors: We agree that greater experimental transparency is essential. In the revised manuscript, we will expand the experimental section (and update the abstract accordingly) to report the number of independent runs, include statistical significance measures such as standard deviations and confidence intervals for attack success rate reductions, and provide a detailed description of the attack success rate metric along with the exact validation procedure used to tune the dynamical reweighting parameter. This will explicitly demonstrate that tuning was performed on held-out data to avoid circularity or overfitting.

    revision: yes
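
One way to report the statistics promised here is sketched below: per-run attack success rate (ASR), its mean and sample standard deviation across runs, and a 95% Wilson score interval on the pooled counts. The function name and the counts are made-up placeholders for illustration, not results from the paper, and the authors may choose different statistics.

```python
import math

def asr_summary(successes, trials, z=1.96):
    """Summarize attack success over several runs.

    successes[i] / trials[i] is the ASR of run i (hypothetical inputs).
    Returns (mean ASR, sample std across runs, Wilson interval on pooled counts).
    """
    rates = [s / t for s, t in zip(successes, trials)]
    mean = sum(rates) / len(rates)
    # Sample standard deviation (n - 1 in the denominator); needs >= 2 runs.
    std = math.sqrt(sum((r - mean) ** 2 for r in rates) / (len(rates) - 1))
    # Wilson score interval for the pooled success proportion.
    n = sum(trials)
    p = sum(successes) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return mean, std, (center - half, center + half)

# Placeholder counts: 3 runs, 100 attack prompts each (not from the paper).
mean, std, ci = asr_summary([10, 12, 8], [100, 100, 100])
print(f"ASR {mean:.3f} +/- {std:.3f}, 95% Wilson CI ({ci[0]:.3f}, {ci[1]:.3f})")
```

The pooled Wilson interval treats all prompts as independent trials and ignores between-run variation, so it is best read alongside the across-run standard deviation rather than in place of it.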

Circularity Check

0 steps flagged

No circularity; standard DRO dual applied without reduction to inputs


The paper applies the convex dual of f-divergence DRO to produce a log-sum-exp objective with dynamical reweighting for adversarial LLM training. This is a standard mathematical reduction, independent of the target robustness claims and evaluation metrics. The ambiguity set and parameter are explicit modeling choices, not quantities fitted by construction to the reported attack success rates. No load-bearing self-citations, no self-definitional equations, and no renaming of known results as novel derivations appear in the abstract or in the derivation chain as described. Empirical gains are measured on separate attack benchmarks, so the derivation does not depend on the evaluation it is tested against.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the standard convex dual of distributionally robust optimization and the choice of f-divergence (specifically KL) to obtain the log-sum-exp form; one tunable dynamical parameter is introduced to control reweighting strength.

free parameters (1)
  • dynamical reweighting parameter
    Controls the strength of emphasis on harder adversarial examples in the log-sum-exp objective; its selection procedure is not detailed in the abstract.
axioms (1)
  • standard math: The convex dual formulation of the worst-case loss over an f-divergence ball reduces to a log-sum-exp expression under the KL divergence.
    Invoked to derive the practical training objective from the distributionally robust formulation.
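
For reference, the standard KL-DRO duality behind this axiom can be written as follows. The notation is an assumption: ℓ_i is the adversarial loss on sample i, P_n the empirical distribution on n samples, ϵ the radius, and λ the dual variable. The paper's own statement may carry additional terms or cover a general f.

```latex
\sup_{Q \,:\, \mathrm{KL}(Q \,\|\, P_n) \le \epsilon} \mathbb{E}_{Q}[\ell]
  \;=\; \inf_{\lambda > 0} \Big\{ \lambda \epsilon
  + \lambda \log \Big( \frac{1}{n} \sum_{i=1}^{n} e^{\ell_i / \lambda} \Big) \Big\},
\qquad
q_i^{\star} \;=\; \frac{e^{\ell_i / \lambda^{\star}}}{\sum_{j=1}^{n} e^{\ell_j / \lambda^{\star}}}
```

The second expression gives the worst-case weights at the optimal dual variable λ*, which is where the automatic upweighting of high-loss examples comes from.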

pith-pipeline@v0.9.0 · 5539 in / 1338 out tokens · 40304 ms · 2026-05-08T17:25:42.891034+00:00 · methodology

