Pith Number
pith:BPZWJ3U2
pith:2023:BPZWJ3U2IYPSLIDGVUX5K5S7AM
not attested
not anchored
not stored
refs resolved
Baseline Defenses for Adversarial Attacks Against Aligned Language Models
Weak discrete optimizers and high optimization costs make baseline defenses effective against jailbreaking attacks on aligned language models.
arxiv:2309.00614 v2 · 2023-09-01 · cs.LG · cs.CL · cs.CR
Record completeness
1. Bitcoin timestamp
2. Internet Archive
3. Author claim
4. Citations
5. Replications
✓ Portable graph bundle live · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same
current state with the deterministic merge algorithm.
Claims
C1 · strongest claim
the weakness of existing discrete optimizers for text, combined with the relatively high costs of optimization, makes standard adaptive attacks more challenging for LLMs
C2 · weakest assumption
That the specific attacks and threat models tested are representative of practical, real-world jailbreaking attempts against deployed LLMs.
C3 · one line summary
Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.
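Of the defenses named in C3, perplexity-based detection is the simplest to illustrate. The following is a minimal sketch, not the paper's implementation: it assumes per-token log-probabilities have already been obtained from some language model, and the function names, sample values, and threshold are all hypothetical. It relies only on the standard definition of perplexity as the exponential of the mean negative log-likelihood.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean(log p)) over the tokens of a prompt."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def is_flagged(token_logprobs: Sequence[float], threshold: float) -> bool:
    """Flag a prompt whose perplexity exceeds the chosen threshold."""
    return perplexity(token_logprobs) > threshold


# Hypothetical natural-log probabilities, made up for illustration only.
fluent_prompt = [-1.0, -0.5, -1.5, -1.0]      # reads like ordinary text
gibberish_suffix = [-8.0, -9.0, -7.5, -8.5]   # reads like an optimized token string

THRESHOLD = 100.0  # hypothetical; a real filter would calibrate this on benign prompts

print(is_flagged(fluent_prompt, THRESHOLD))
print(is_flagged(gibberish_suffix, THRESHOLD))
```

The design choice in such a filter is the threshold: set too low, it rejects unusual but benign prompts; set too high, it lets low-fluency adversarial strings through.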
References
[1] Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples
[2] Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
[3] Constitutional AI: Harmlessness from AI Feedback
[4] Enhancing robustness of machine learning systems via data transformations
[5] Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
Receipt and verification
| First computed | 2026-05-18T03:45:00.709211Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/BPZWJ3U2IYPSLIDGVUX5K5S7AM \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258
Canonical record JSON
{
"metadata": {
"abstract_canon_sha256": "712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa",
"cross_cats_sorted": [
"cs.CL",
"cs.CR"
],
"license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
"primary_cat": "cs.LG",
"submitted_at": "2023-09-01T17:59:44Z",
"title_canon_sha256": "30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75"
},
"schema_version": "1.0",
"source": {
"id": "2309.00614",
"kind": "arxiv",
"version": 2
}
}
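
The shell one-liner above can also be reproduced offline against the record shown here. This is a minimal sketch that mirrors the same canonicalization (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes, then SHA-256); the names `canonical_sha256`, `RECORD_JSON`, and `EXPECTED` are illustrative and not part of any Pith API. It prints the computed digest and whether it equals the published canonical hash, without assuming the result.

```python
import hashlib
import json

# Published canonical hash, copied from the "Canonical hash" field above.
EXPECTED = "0bf364ee9a461f25a066ad2fd5765f0308a44f99511bb1d389d3ad1988a2a258"

# Canonical record, copied from the "Canonical record JSON" section above.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "712e24533eaa0f724504ea6f532c35cf63171f1c848d58739133e131d492cffa",
    "cross_cats_sorted": ["cs.CL", "cs.CR"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2023-09-01T17:59:44Z",
    "title_canon_sha256": "30c0897e6adbc00f6ac72b025b734528e1a6a01a369e621269727a5984e6ae75"
  },
  "schema_version": "1.0",
  "source": {"id": "2309.00614", "kind": "arxiv", "version": 2}
}
"""


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON a mirror happens to serve.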