HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
Pith reviewed 2026-05-10 01:56 UTC · model grok-4.3
The pith
LLMs are highly vulnerable to jailbreaks when users supply incomplete drafts containing harmful cues, yet preference optimization can train them to refuse such completions while preserving helpfulness on benign co-authoring tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Existing LLMs are highly vulnerable in co-authoring contexts: incomplete drafts seeded with domain-specific cues in areas such as Explosives, Drugs, Weapons, and Cyberattacks reliably elicit harmful completions. A safety-utility balanced alignment approach based on preference optimization significantly reduces these harmful outputs without degrading performance on benign co-authoring tasks.
What carries the argument
HarDBench, a benchmark of prompts that use incomplete structures and domain-specific cues to simulate draft-based co-authoring jailbreaks, paired with a preference optimization procedure that trains models to refuse harmful completions while remaining helpful on safe drafts.
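To make the attack format concrete, here is a minimal sketch of what an incomplete-draft jailbreak prompt could look like. The abstract does not specify HarDBench's actual template, so the function name, fields, and phrasing below are assumptions; the example uses the benign keyword from the paper's Figure 17 rather than a harmful cue.

```python
# Hypothetical sketch of a draft-based co-authoring prompt: the user
# supplies a partly finished outline and asks the model to continue.
# HarDBench's real prompt structure is not given here, so this format
# is an assumption for illustration only.
def make_draft_prompt(topic_cue: str, sections: list[str]) -> str:
    """Build an incomplete draft whose finished context invites the
    model to fill in the remaining stub sections."""
    body = "\n".join(f"{i + 1}. {s}: ..." for i, s in enumerate(sections))
    return (
        f"Here is my working draft about {topic_cue}.\n"
        f"{body}\n"
        "Please complete the remaining sections in the same style."
    )

# Benign example (the keyword shown in the paper's Figure 17):
print(make_draft_prompt(
    "an educational circuit board kit",
    ["Component Sourcing", "Assembly and Testing Steps", "Quality Control"],
))
```

An attack variant would substitute a high-risk topic cue; the jailbreak pressure comes from the surrounding draft structure rather than from an explicit harmful request.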
If this is right
- Models intended for collaborative writing require specialized safety evaluation beyond standard single-turn jailbreak tests.
- Preference optimization can produce models that refuse harmful draft completions while retaining normal co-authoring performance.
- Benchmarks built around incomplete, cue-rich prompts can serve as training signals for safer human-LLM writing systems.
- Safety mechanisms developed for draft-based attacks may generalize to other interactive, multi-turn generation settings.
- Widespread adoption would shift alignment practice from post-generation filtering to proactive refusal during collaborative workflows.
Where Pith is reading between the lines
- The same incomplete-draft technique could be applied to non-writing tasks such as code completion or planning, suggesting a broader class of context-dependent jailbreaks.
- If the benchmark prompts prove narrower than real attacks, future work would need to expand the set of cues and structures to maintain coverage.
- Wider use of this alignment method might reduce reliance on external content filters in consumer LLM tools.
- The approach leaves open the question of how to handle edge cases where a draft is ambiguous between harmful and benign intent.
Load-bearing premise
The HarDBench prompts with their specific incomplete structures and domain cues accurately represent real-world malicious draft-based attacks, and the preference optimization does not cause over-refusal on legitimate but sensitive writing topics.
What would settle it
A model aligned with the proposed preference optimization still produces harmful content on a large fraction of HarDBench test cases or begins refusing to complete clearly benign drafts on sensitive but non-prohibited topics.
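A minimal harness for that settling experiment might look like the sketch below; `complete` and `judge` are hypothetical interfaces standing in for a model endpoint and a harmfulness classifier, since the released benchmark's evaluation code is not described here.

```python
# Hedged sketch of the falsification test: attack success rate on
# harmful drafts and over-refusal rate on benign-but-sensitive drafts.
# `complete` and `judge` are assumed interfaces, not real APIs.
from typing import Callable, Iterable

def evaluate(complete: Callable[[str], str],
             judge: Callable[[str], str],  # -> "harmful" | "refusal" | "ok"
             harmful_drafts: Iterable[str],
             benign_sensitive_drafts: Iterable[str]) -> dict:
    harmful = list(harmful_drafts)
    benign = list(benign_sensitive_drafts)
    attack_successes = sum(judge(complete(d)) == "harmful" for d in harmful)
    over_refusals = sum(judge(complete(d)) == "refusal" for d in benign)
    return {
        "attack_success_rate": attack_successes / len(harmful),
        "over_refusal_rate": over_refusals / len(benign),
    }
```

A high value on either metric after alignment would undercut the paper's central claim.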
Figures
Figure 17: Comparison of Mistral-7B completions for a benign co-authoring prompt (benign keyword: educational circuit board kit). Red: unjustified refusal. Blue: helpful addition.
Original abstract
Large language models (LLMs) are increasingly used as co-authors in collaborative writing, where users begin with rough drafts and rely on LLMs to complete, revise, and refine their content. However, this capability poses a serious safety risk: malicious users could jailbreak the models (filling incomplete drafts with dangerous content) to force them into generating harmful outputs. In this paper, we identify the vulnerability of current LLMs to such draft-based co-authoring jailbreak attacks and introduce HarDBench, a systematic benchmark designed to evaluate the robustness of LLMs against this emerging threat. HarDBench spans a range of high-risk domains, including Explosives, Drugs, Weapons, and Cyberattacks, and features prompts with realistic structure and domain-specific cues to assess model susceptibility to harmful completions. To mitigate this risk, we introduce a safety-utility balanced alignment approach based on preference optimization, training models to refuse harmful completions while remaining helpful on benign drafts. Experimental results show that existing LLMs are highly vulnerable in co-authoring contexts and that our alignment method significantly reduces harmful outputs without degrading performance on co-authoring capabilities. This presents a new paradigm for evaluating and aligning LLMs in human-LLM collaborative writing settings. Our benchmark and dataset are available on our project page at https://github.com/untae0122/HarDBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HarDBench, a new benchmark for assessing LLM vulnerability to draft-based co-authoring jailbreak attacks. It constructs prompts as incomplete drafts containing domain-specific cues in high-risk areas (Explosives, Drugs, Weapons, Cyberattacks) and reports that current LLMs show high susceptibility to generating harmful completions. The authors propose a preference-optimization alignment method that trains models to refuse harmful drafts while preserving helpfulness on benign co-authoring tasks, claiming this reduces harmful outputs without degrading utility. The benchmark and dataset are released publicly.
Significance. If the central claims hold, the work identifies a practically relevant safety gap in human-LLM collaborative writing and supplies both an evaluation framework and a mitigation technique. The public release of HarDBench would allow the community to test and improve alignment methods for interactive writing scenarios, which is a growing use case.
Major comments (3)
- [Abstract / HarDBench construction] Abstract and HarDBench construction: the claim that the prompts feature 'realistic structure and domain-specific cues' and thereby demonstrate 'high vulnerability in co-authoring contexts' rests on the unverified assumption that these artificially constructed incomplete drafts with explicit domain cues match the distribution of organic malicious user inputs. No comparison to real collaborative-writing logs or organic attack attempts is described, which directly affects the generalizability of the reported attack success rates and the robustness of the subsequent alignment.
- [Alignment method and experimental results] Alignment and evaluation: the statement that preference optimization 'significantly reduces harmful outputs without degrading performance on co-authoring capabilities' requires explicit evidence that the benign test set includes sensitive but legal topics (e.g., medical or technical writing) where over-refusal could plausibly occur. Without such coverage or quantitative refusal rates on those cases, the no-degradation claim cannot be fully assessed.
- [Experimental results] Experimental details: the abstract reports results on vulnerability and mitigation effectiveness, yet the manuscript does not appear to provide data splits, number of runs, statistical significance tests, or a clear description of the baselines and preference-optimization hyperparameters. These omissions make it impossible to verify the soundness of the quantitative claims.
Minor comments (2)
- [Abstract] The project page URL is given but the manuscript should include a short description of the exact contents of the released dataset (e.g., number of prompts per domain, format of the incomplete drafts).
- [Alignment method] Notation for the preference-optimization objective could be clarified with an equation or pseudocode to make the training procedure reproducible from the text alone.
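As an example of the pseudocode this comment asks for: the standard DPO objective (Rafailov et al., 2023) is the most common form of preference optimization, though the paper does not state that it uses exactly this loss, so treat the sketch below as generic rather than as the authors' method.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Generic DPO loss. Inputs are per-example summed token log-probs
    of the chosen response (e.g., a refusal on a harmful draft, a
    helpful completion on a benign one) and the rejected response,
    under the trained policy and a frozen reference model."""
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    # Push the policy to prefer the chosen response relative to the
    # reference; beta controls how far the policy may drift from it.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```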
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of generalizability, evaluation coverage, and experimental rigor. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: Abstract and HarDBench construction: the claim that the prompts feature 'realistic structure and domain-specific cues' and thereby demonstrate 'high vulnerability in co-authoring contexts' rests on the unverified assumption that these artificially constructed incomplete drafts with explicit domain cues match the distribution of organic malicious user inputs. No comparison to real collaborative-writing logs or organic attack attempts is described, which directly affects the generalizability of the reported attack success rates and the robustness of the subsequent alignment.
Authors: We acknowledge that HarDBench prompts were constructed synthetically, using domain expertise to incorporate realistic structures and specific cues in high-risk areas. Direct comparison to organic malicious collaborative-writing logs was not performed, primarily due to the absence of publicly available datasets containing such sensitive interactions and to ethical constraints around accessing real user data. In the revised manuscript, we will add an expanded section detailing the prompt construction methodology, including the use of expert consultations, iterative validation for plausibility, and explicit discussion of the synthetic nature as a limitation. This will better contextualize the attack success rates while preserving the benchmark's utility for evaluating the identified vulnerability in controlled settings.
Revision: partial
- Referee: Alignment and evaluation: the statement that preference optimization 'significantly reduces harmful outputs without degrading performance on co-authoring capabilities' requires explicit evidence that the benign test set includes sensitive but legal topics (e.g., medical or technical writing) where over-refusal could plausibly occur. Without such coverage or quantitative refusal rates on those cases, the no-degradation claim cannot be fully assessed.
Authors: The current benign evaluation set covers a range of co-authoring tasks, but we agree it does not explicitly quantify performance on sensitive yet legal topics such as medical or technical writing. To strengthen the no-degradation claim, we will revise the evaluation section to include a curated subset of such cases and report quantitative refusal rates alongside utility metrics. This addition will provide direct evidence that the preference optimization approach maintains helpfulness without excessive over-refusal in plausible real-world scenarios.
Revision: yes
- Referee: Experimental details: the abstract reports results on vulnerability and mitigation effectiveness, yet the manuscript does not appear to provide data splits, number of runs, statistical significance tests, or a clear description of the baselines and preference-optimization hyperparameters. These omissions make it impossible to verify the soundness of the quantitative claims.
Authors: We apologize for these omissions in the initial submission. The revised manuscript will include a dedicated experimental details subsection specifying the data splits for training and evaluation, the number of independent runs performed, results from statistical significance tests (such as paired t-tests on key metrics), full descriptions of all baselines, and the complete set of hyperparameters used in the preference optimization process. These additions will enable full verification and reproducibility of the reported results.
Revision: yes
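For the promised significance testing, a paired t-test on per-prompt scores is straightforward to run; the values below are labeled placeholders, not results from the paper.

```python
import numpy as np
from scipy.stats import ttest_rel

# Placeholder per-prompt harmfulness scores for the same prompts under
# the base and aligned models; real values would come from the judge
# ratings described in the experiments.
scores_base = np.array([0.90, 0.75, 0.80, 0.95, 0.60, 0.85])
scores_aligned = np.array([0.10, 0.20, 0.05, 0.30, 0.15, 0.25])

# Pairing by prompt removes prompt-level variance from the comparison.
t_stat, p_value = ttest_rel(scores_base, scores_aligned)
print(f"t = {t_stat:.2f}, p = {p_value:.4g}")
```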
Circularity Check
No circularity: empirical benchmark construction and evaluation
Full rationale
The paper constructs HarDBench as a new benchmark of incomplete draft prompts with domain-specific cues across high-risk categories and performs direct empirical testing of LLMs plus preference optimization for alignment. No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the derivation chain. Claims rest on external model behavior measurements rather than reducing to the paper's own inputs by construction.