{"paper":{"title":"BackFlush: Knowledge-Free Backdoor Detection and Elimination with Watermark Preservation in Large Language Models","license":"http://creativecommons.org/publicdomain/zero/1.0/","headline":"BackFlush detects unknown backdoors in LLMs by amplifying susceptibility and flushes them via embedding rotation while preserving watermarks and clean accuracy.","cross_cats":[],"primary_cat":"cs.CR","authors_text":"Amit Shukla, Jagadeesh Rachapudi, Praful Hambarde, Pranav Singh, Ritali Vatsi","submitted_at":"2026-04-15T10:56:08Z","abstract_excerpt":"In recent trends, one can observe Large Language Models (LLMs) are exposed to backdoor attacks where vicious triggers added during training or model editing to elicit harmful outputs on specific input patterns while maintaining clean performance on normal inputs. Legitimate watermarks used as ownership signatures share similar mechanisms to backdoors, creating a critical challenge: detecting and eliminating unknown backdoors without compromising watermark integrity. Existing defenses require prior knowledge of triggers or their payloads, depend on clean reference models, or sacrifice model uti"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"BackFlush achieves approximately 1% Attack Success Rate (ASR), approximately 99% clean accuracy (CACC), and preserved watermarking capabilities in the realm where no existing method simultaneously provides these alongside maintaining model utility comparable to clean baselines.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The Backdoor Flushing Phenomenon and Backdoor Susceptibility Amplification are assumed to hold generally for unknown backdoors across trigger types and architectures, and that RoPE Unlearning selectively removes backdoors without damaging watermarks.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"BackFlush detects unknown backdoors in LLMs by amplifying susceptibility and flushes them via embedding rotation while preserving watermarks and clean accuracy.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"be1888dbc68283a8717f182b6fd0121e3e1bff0cf2ad96cefc9c4714f98e61b1"},"source":{"id":"2605.12529","kind":"arxiv","version":1},"verdict":{"id":"a6acda57-9a81-4d3f-9ef2-e1829a30af86","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T20:57:57.694005Z","strongest_claim":"BackFlush achieves approximately 1% Attack Success Rate (ASR), approximately 99% clean accuracy (CACC), and preserved watermarking capabilities in the realm where no existing method simultaneously provides these alongside maintaining model utility comparable to clean baselines.","one_line_summary":"BackFlush detects backdoors via susceptibility amplification and eliminates them with RoPE unlearning to reach 1% ASR and 99% clean accuracy while preserving watermarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The Backdoor Flushing Phenomenon and Backdoor Susceptibility Amplification are assumed to hold generally for unknown backdoors across trigger types and architectures, and that RoPE Unlearning selectively removes backdoors without damaging watermarks.","pith_extraction_headline":"BackFlush detects unknown backdoors in LLMs by amplifying susceptibility and flushes them via embedding rotation while preserving watermarks and clean accuracy."},"references":{"count":42,"sample":[{"doi":"","year":2024,"title":"Large language models (llms): survey, technical frameworks, and future challenges,","work_id":"35f3f4f5-fc03-4736-be3f-a463b36fc1c0","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Look before you leap: An exploratory study of uncertainty measurement for large language models.arXiv preprint arXiv:2307.10236","work_id":"50e2761a-1893-4580-b1dd-fe617e8fd2c4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2025,"title":"Putting people in llms’ shoes: Generating better answers via question rewriter,","work_id":"27bedfcf-aa33-477f-8d99-a3d96ccfa8f0","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2024,"title":"Can multiple-choice questions really be useful in detecting the abilities of LLMs? , url=","work_id":"b97fd071-70e3-45d0-984d-70e56f8637cd","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Bid-lora: A parameter-efficient framework for continual learning and unlearning,","work_id":"2ff5649a-868a-4a56-8da7-4a8b37a657b9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":42,"snapshot_sha256":"baa8571ef1f4c15722e4c436494912d9bbcc6fed5bf838bb30e143f63f23d369","internal_anchors":8},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}