arXiv preprint arXiv:2409.01586 (2024)

Huang, T · 2024 · arXiv 2409.01586

7 Pith papers cite this work. Polarity classification is still in progress.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

cs.LG · 2025-08-28 · unverdicted · novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models

physics.optics · 2026-05-11 · unverdicted · novelty 6.0

JSON schema constraints improve LLM extraction of nested quantum cascade laser structures to 83.4% F1, delivering up to 24.1% gains for smaller models.

AlignCultura: Towards Culturally Aligned Large Language Models?

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

AlignCultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.

Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints

cs.AI · 2026-04-14 · unverdicted · novelty 6.0

Coupled constraints on weight updates in a safety subspace and regularization of SAE-identified safety features preserve LLM refusal behaviors during fine-tuning better than weight-only or activation-only methods.

Secure LLM Fine-Tuning via Safety-Aware Probing

cs.LG · 2025-05-22 · unverdicted · novelty 6.0

SAP locates safety-correlated directions via contrastive signals and perturbs hidden-state propagation with a lightweight probe to preserve safety while fine-tuning LLMs for task performance.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey

cs.CR · 2024-09-26 · unverdicted · novelty 2.0

Survey of harmful fine-tuning attacks on LLMs, their variants, defense strategies, mechanical analysis, and evaluation methodologies.

citing papers explorer

Showing 7 of 7 citing papers.

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning
cs.LG · 2025-08-28 · unverdicted · none · ref 19
Information Extraction of Nested Complex Structure of Quantum Cascade Lasers via Large Language Models
physics.optics · 2026-05-11 · unverdicted · none · ref 57
AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL · 2026-04-21 · unverdicted · none · ref 35
Preventing Safety Drift in Large Language Models via Coupled Weight and Activation Constraints
cs.AI · 2026-04-14 · unverdicted · none · ref 17
Secure LLM Fine-Tuning via Safety-Aware Probing
cs.LG · 2025-05-22 · unverdicted · none · ref 22
Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety
cs.CR · 2025-02-02 · unverdicted · none · ref 121
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
cs.CR · 2024-09-26 · unverdicted · none · ref 66
