Pith Number

pith:BPZWJ3U2

pith:2023:BPZWJ3U2IYPSLIDGVUX5K5S7AM

not attested · not anchored · not stored · refs resolved

Baseline Defenses for Adversarial Attacks Against Aligned Language Models

Aniruddha Saha, Avi Schwarzschild, Gowthami Somepalli, John Kirchenbauer, Jonas Geiping, Micah Goldblum, Neel Jain, Ping-yeh Chiang, Tom Goldstein, Yuxin Wen

Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.

arxiv:2309.00614 v2 · 2023-09-01 · cs.LG · cs.CL · cs.CR


Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
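The page does not spell out the merge algorithm, so the following is only a minimal sketch of how a deterministic merge over events could behave. The event fields (`at`, `field`, `value`), the tie-breaking rule, and the function names are all hypothetical, and signature checking is left out entirely. The point it illustrates is that a mirror receiving the same events in any order, or with duplicates, arrives at the same state.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content hash of an event; used here only as a stable tie-breaker."""
    blob = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(blob).hexdigest()


def merge(record: dict, events: list) -> dict:
    """Replay events over the record in a canonical order (hypothetical rule).

    Events are ordered by timestamp, then by content hash, so the result
    does not depend on the order in which a mirror received them.
    """
    state = dict(record)
    seen = set()
    for ev in sorted(events, key=lambda e: (e["at"], event_id(e))):
        eid = event_id(ev)
        if eid in seen:  # ignore exact duplicates
            continue
        seen.add(eid)
        state[ev["field"]] = ev["value"]  # last writer wins
    return state


# Hypothetical events, not taken from any real bundle.
events = [
    {"at": "2026-05-18T04:00:00Z", "field": "citations", "value": "open"},
    {"at": "2026-05-18T05:00:00Z", "field": "citations", "value": "closed"},
]
state_a = merge({"schema_version": "1.0"}, events)
state_b = merge({"schema_version": "1.0"}, list(reversed(events)) + events)
print(state_a == state_b)
```

Sorting on a total order before folding is what makes the replay independent of arrival order; any other total order would serve equally well as long as every mirror uses the same one.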

Claims

C1 strongest claim

the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs

C2 weakest assumption

That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.

C3 one line summary

Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

References

67 extracted · 67 resolved · 16 Pith anchors

[1] Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples 2018

[2] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback 2022 · arXiv:2204.05862

[3] Constitutional AI: Harmlessness from AI Feedback 2022 · arXiv:2212.08073

[4] Enhancing robustness of machine learning systems via data transformations 2018

[5] Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods 2017 · doi:10.1145/3128572.3140444

Cited by

30 papers in Pith

GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs

SAID: Safety-Aware Intent Defense via Prefix Probing for Large Language Models

Agentic AI Security: Threats, Defenses, Evaluation, and Open Challenges

Prompt Injection Attack to Tool Selection in LLM Agents

RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

Receipt and verification

First computed	2026-05-18T03:45:00.709211Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258

Aliases

arxiv: 2309.00614 · arxiv_version: 2309.00614v2 · doi: 10.48550/arxiv.2309.00614 · pith_short_12: BPZWJ3U2IYPS · pith_short_16: BPZWJ3U2IYPSLIDG · pith_short_8: BPZWJ3U2
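The short aliases listed above are prefixes of the full 26-character identifier, and for this record that identifier coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash. The sketch below reproduces that relationship. It is an observation about this one record rather than a documented rule of the Pith schema, and the function names are hypothetical.

```python
import base64

# Canonical hash as printed on this page.
CANONICAL_HASH = "0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258"


def pith_id_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of the digest and drop '=' padding."""
    raw = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_id: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the identifier."""
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


pith_id = pith_id_from_hash(CANONICAL_HASH)
print(pith_id)                 # BPZWJ3U2IYPSLIDGVUX5K5S7AM
print(short_aliases(pith_id))
```

Because shorter aliases are prefixes, they trade collision resistance for brevity; a resolver would presumably need the longer forms to disambiguate once two records share an 8-character prefix.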


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258
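The same canonicalization can be run offline, without curl or jq, on a copy of the canonical record. The sketch below uses the same `json.dumps` settings as the one-liner above (sorted keys, compact separators, `ensure_ascii=False`) and the record as printed on this page. The function name is hypothetical, and whether the printed digest equals the canonical hash depends on the page showing exactly the record that the resolver serves, so compare the output by hand rather than taking it for granted.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """SHA-256 over the compact, key-sorted UTF-8 JSON form of the record."""
    blob = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(blob).hexdigest()


# Canonical record copied from this page.
record = json.loads("""
{
  "metadata": {
    "abstract_canon_sha256": "712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa",
    "cross_cats_sorted": ["cs.CL", "cs.CR"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-09-01T17:59:44Z",
    "title_canon_sha256": "30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75"
  },
  "schema_version": "1.0",
  "source": {"id": "2309.00614", "kind": "arxiv", "version": 2}
}
""")

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Sorting keys and fixing the separators removes every formatting choice from the serialized bytes, which is why two parties holding differently indented copies of the record still compute the same digest.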

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa",
    "cross_cats_sorted": [
      "cs.CL",
      "cs.CR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-09-01T17:59:44Z",
    "title_canon_sha256": "30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2309.00614",
    "kind": "arxiv",
    "version": 2
  }
}