pith. machine review for the scientific record.
Pith Number

pith:BPZWJ3U2

pith:2023:BPZWJ3U2IYPSLIDGVUX5K5S7AM
not attested · not anchored · not stored · refs resolved

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Aniruddha Saha, Avi Schwarzschild, Gowthami Somepalli, John Kirchenbauer, Jonas Geiping, Micah Goldblum, Neel Jain, Ping-yeh Chiang, Tom Goldstein, Yuxin Wen

Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.

arxiv:2309.00614 v2 · 2023-09-01 · cs.LG · cs.CL · cs.CR

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs

C2 weakest assumption

That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.

C3 one line summary

Baseline defenses, including perplexity-based detection, input preprocessing, and adversarial training, offer partial robustness to adversarial text attacks on LLMs, in part because weak discrete optimizers make adaptive attacks harder to mount.
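
As a rough illustration of the perplexity-based detection mentioned in the summary, the sketch below flags a prompt whose perplexity exceeds a cutoff. It is not the authors' implementation: it assumes per-token natural-log probabilities have already been obtained from some language model, and the function names and the threshold value are hypothetical.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp of the average negative log-likelihood per token.

    Assumes natural-log probabilities, one per token of the prompt.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def is_suspicious(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity is above a chosen threshold.

    The threshold is a hypothetical tuning parameter, not a value
    taken from the paper.
    """
    return perplexity(token_logprobs) > threshold


# Toy inputs: a fluent-looking prompt (each token has probability 0.5)
# and a gibberish-looking one (each token has probability 0.01).
fluent = [math.log(0.5)] * 4
gibberish = [math.log(0.01)] * 4
print(is_suspicious(fluent, threshold=5.0))
print(is_suspicious(gibberish, threshold=5.0))
```

In practice the cutoff would have to be calibrated on benign prompts, since a threshold that is too low also rejects unusual but harmless text.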

References

67 extracted · 67 resolved · 16 Pith anchors

[1] Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples 2018
[2] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback 2022 · arXiv:2204.05862
[3] Constitutional AI: Harmlessness from AI Feedback 2022 · arXiv:2212.08073
[4] Enhancing robustness of machine learning systems via data transformations 2018
[5] Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods 2017 · doi:10.1145/3128572.3140444

Cited by

30 papers in Pith

Receipt and verification
First computed: 2026-05-18T03:45:00.709211Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258

Aliases

arxiv: 2309.00614 · arxiv_version: 2309.00614v2 · doi: 10.48550/arxiv.2309.00614 · pith_short_12: BPZWJ3U2IYPS · pith_short_16: BPZWJ3U2IYPSLIDG · pith_short_8: BPZWJ3U2
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-09-01T17:59:44Z",
    "title_canon_sha256": "30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2309.00614",
    "kind": "arxiv",
    "version": 2
  }
}
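
The shell pipeline under "Verify this Pith Number yourself" can also be sketched offline in Python. The sketch mirrors the serialization used in that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 encoding) and applies it to the record as printed on this page. The names `canonical_sha256`, `RECORD`, and `EXPECTED` are illustrative, and it is an assumption that the record shown here is identical to what the API returns in `canonical_record`.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way the page's python3 one-liner does."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # deterministic key order
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as-is
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record copied from this page (assumed to match the API response).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa",
        "cross_cats_sorted": ["cs.CL", "cs.CR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2023-09-01T17:59:44Z",
        "title_canon_sha256": "30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75",
    },
    "schema_version": "1.0",
    "source": {"id": "2309.00614", "kind": "arxiv", "version": 2},
}

# Canonical hash as listed on this page.
EXPECTED = "0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258"

digest = canonical_sha256(RECORD)
print(digest)
print("matches listed canonical hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the order in which the record's fields are written does not affect the digest; only the values and the exact serialization rules do.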