F., and Liu, L

Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu · 2024 · arXiv 2409.01586

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

read on arXiv browse 9 citing papers

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

cs.LG · 2025-08-28 · unverdicted · novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

physics.optics · 2026-05-11 · unverdicted · novelty 6.0

JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.

AlignCultura: Towards Culturally Aligned Large Language Models?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

Secure LLM Fine-Tuning via Safety-Aware Probing

cs.LG · 2025-05-22 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection

cs.LG · 2026-05-27 · unverdicted · novelty 5.0

SPARD defends LLMs from harmful fine-tuning attacks via alternating safety projections and relevance-diversity DPP data selection, reporting lowest attack success rates on GSM8K and OpenBookQA while keeping task accuracy.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 9 of 9 citing papers after filters.

One Step to the Side: Why Defenses Against Malicious Finetuning Fail Under Adaptive Adversaries cs.CR · 2026-05-14 · unverdicted · none · ref 16
A unified adaptive attack exploits the common weakness across 15 defenses against malicious fine-tuning, showing they only obscure rather than remove harmful model capabilities.
Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning cs.LG · 2025-08-28 · unverdicted · none · ref 19
TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.
Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models physics.optics · 2026-05-11 · unverdicted · none · ref 57
JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.
AlignCultura: Towards Culturally Aligned Large Language Models? cs.CL · 2026-04-21 · unverdicted · none · ref 35
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints cs.AI · 2026-04-14 · unverdicted · none · ref 17
Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.
Secure LLM Fine-Tuning via Safety-Aware Probing cs.LG · 2025-05-22 · unverdicted · none · ref 22
SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.
SPARD: Defending Harmful Fine-Tuning Attack via Safety Projection with Relevance-Diversity Data Selection cs.LG · 2026-05-27 · unverdicted · none · ref 7
SPARD defends LLMs from harmful fine-tuning attacks via alternating safety projections and relevance-diversity DPP data selection, reporting lowest attack success rates on GSM8K and OpenBookQA while keeping task accuracy.
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety cs.CR · 2025-02-02 · unverdicted · none · ref 121
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey cs.CR · 2024-09-26 · unverdicted · none · ref 66
Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

F., and Liu, L

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer