Removing RLHF protections in GPT-4 via fine-tuning

Qiusi Zhan, Richard Fang, Rohan Bindu, Akul Gupta, Tatsunori Hashimoto, Daniel Kang · 2024 · DOI 10.18653/v1/2024.naacl-short.59

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

PROXYMIX learns a dynamic replay controller on a small proxy model and transfers it to a large target model, improving accuracy by 3.4 points and reducing forgetting by 3.5 points on LLaMA-3-8B continual tuning sequences.

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

cs.CR · 2026-05-24 · unverdicted · novelty 5.0

A lifecycle-based survey of LLM fine-tuning security that reviews attacks and defenses by intervention phase and reports unified empirical findings on model-dependent attack effectiveness and limited defense generalization.

citing papers explorer

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

cs.LG · 2026-05-29 · unverdicted · none · ref 50

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

cs.CR · 2026-05-24 · unverdicted · none · ref 83
