Black-Box Behavioral Distillation Breaks Safety Alignment in Medical LLMs
4 Pith papers cite this work. Polarity classification is still indexing.
[Verdict timeline: 4 verdicts, all in 2026, all currently unverdicted]
Citing papers explorer: 4 representative citing papers
- Hiding in Plain Sight: Detectability-Aware Antidistillation of Reasoning Models
  TraceGuard formulates antidistillation as a detectability-constrained Stackelberg game and poisons sparsely located thought anchors via branching-token detection to degrade student models while preserving trace quality. (A hedged sketch of such a bilevel objective appears after this list.)
- CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
  CLR-voyance reformulates inpatient reasoning as a POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark. (A sketch of such a POMDP casting appears after this list.)
- ExecTune: Effective Steering of Black-Box LLMs with Guide Models
  ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost the executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on math and code tasks. (A sketch of the acceptance-sampling stage appears after this list.)
- Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
  LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at a 23.7% median violation rate versus 72.8% for open-weight models, indicating they are not yet safe for clinical use. (A sketch of the mean-versus-median aggregation appears after this list.)
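On the TraceGuard entry above: the summary names a detectability-constrained Stackelberg game without giving its form. A minimal sketch of what such a bilevel objective could look like follows; every symbol here (perturbation delta, poisoned traces T_delta, student S, detectability metric D, budget epsilon) is an illustrative assumption, not the paper's notation.

```latex
% Hypothetical leader-follower objective; notation is illustrative only.
\begin{aligned}
  \max_{\delta}\;\; & \mathcal{L}_{\mathrm{task}}\big(S^{*}(\delta)\big)
    && \text{leader: maximize the distilled student's task loss} \\
  \text{s.t.}\;\;   & S^{*}(\delta) \in \arg\min_{S}\;
      \mathcal{L}_{\mathrm{distill}}\big(S;\, T_{\delta}\big)
    && \text{follower: student distills from poisoned traces } T_{\delta} \\
                    & D\big(T_{\delta},\, T\big) \le \epsilon
    && \text{detectability budget: poisoned traces stay close to clean ones}
\end{aligned}
```

The branching-token idea in the summary would then live inside the perturbation: it touches only a sparse set of anchor tokens, which is what keeps D small.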
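On the CLR-voyance entry: "reformulates inpatient reasoning as a POMDP" is compact, so here is a sketch of how such a casting typically decomposes, using standard POMDP notation that is assumed here rather than taken from the paper.

```latex
% Hypothetical POMDP components for inpatient decision support (standard notation).
\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, R, \Omega, O, \gamma), \quad
\begin{cases}
  \mathcal{S} & \text{latent patient state, never directly observed} \\
  \mathcal{A} & \text{clinical actions: orders, tests, treatment and reasoning steps} \\
  \Omega,\ O  & \text{observations (notes, labs, vitals) and their likelihood model} \\
  R           & \text{outcome-aware reward scored by clinician-validated rubrics}
\end{cases}
```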
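On the ExecTune entry: the pipeline is only named (acceptance sampling, then SFT, then structure-aware RL). Below is a minimal, self-contained Python sketch of the acceptance-sampling stage; `generate_strategy` and `execute_with_llm` are invented stand-ins, not ExecTune's API, and the acceptance rate is made up for the demo.

```python
import random
from dataclasses import dataclass

@dataclass
class ExecResult:
    executable: bool  # did the black-box LLM follow the strategy end to end?
    correct: bool     # did the final answer match the reference?

def generate_strategy(task: str) -> str:
    """Stand-in for the guide model's strategy sampler (hypothetical)."""
    return f"plan #{random.randint(0, 999)} for: {task}"

def execute_with_llm(task: str, strategy: str) -> ExecResult:
    """Stand-in for running a strategy through a black-box LLM (hypothetical)."""
    ok = random.random() < 0.3  # pretend roughly 30% of strategies pass
    return ExecResult(executable=ok, correct=ok)

def acceptance_sample(tasks: list[str], k: int = 8) -> list[dict]:
    """Keep only (task, strategy) pairs the black-box LLM executes to a
    correct answer; accepted pairs would then seed supervised fine-tuning."""
    accepted = []
    for task in tasks:
        for _ in range(k):  # draw k candidate strategies per task
            strategy = generate_strategy(task)
            result = execute_with_llm(task, strategy)
            if result.executable and result.correct:
                accepted.append({"task": task, "strategy": strategy})
    return accepted

print(len(acceptance_sample(["sum 1..100", "reverse a list"], k=4)))
```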
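On the safety-benchmark entry: the summary mixes an overall mean (54.4%) with per-group medians (23.7% vs 72.8%), which are different aggregations of the same per-model rates. The sketch below shows that aggregation; the per-model numbers are placeholders invented for illustration, not the benchmark's measurements.

```python
import statistics

# Placeholder per-model violation rates (%); NOT the paper's data.
proprietary = {"prop_a": 20.1, "prop_b": 23.7, "prop_c": 35.0}
open_weight = {"open_x": 60.2, "open_y": 72.8, "open_z": 80.5}

all_rates = [*proprietary.values(), *open_weight.values()]
print(f"overall mean violation rate: {statistics.mean(all_rates):.1f}%")
print(f"proprietary median:          {statistics.median(proprietary.values()):.1f}%")
print(f"open-weight median:          {statistics.median(open_weight.values()):.1f}%")
```

Mean and median can diverge sharply when one group's rates sit far above the other's, which is the pattern the summary describes.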