A comprehensive survey on trustworthiness in reasoning with large language models
5 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (5)
Roles: background (1)
Polarities: background (1)
citing papers explorer
- Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries
  Swapping the reasoning trace prefill on unlearned weights can replicate or reverse the parser-split bypass gap, showing that the gap alone does not identify or rule out weight-level memorization.
- Pause or Fabricate? Training Language Models for Grounded Reasoning
  GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.
- From $P(y|x)$ to $P(y)$: Investigating Reinforcement Learning in Pre-train Space
  PreRL applies reward-driven updates to P(y) in pre-train space, uses Negative Sample Reinforcement to prune bad reasoning paths and boost reflection, and combines with standard RL in Dual Space RL to outperform baselines on reasoning tasks.
- Stop Tracking Me! Proactive Defense Against Attribute Inference Attack in LLMs
  TRACE-RPS drops LLM attribute inference accuracy from around 50% to below 5% via fine-grained anonymization plus a two-stage rejection optimization.
- Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework
  A 16-factor structured prompt framework strengthens CoT reasoning in LLMs for security analysis, yielding up to 40% reasoning gains in smaller models and stable accuracy improvements validated by human raters with Cohen's κ > 0.80.