Pith: machine review for the scientific record.

arxiv: 2204.05862 · v1 · submitted 2022-04-12 · 💻 cs.CL · cs.LG

Recognition: 3 theorem links

· Lean Theorem

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Pith reviewed 2026-05-10 13:30 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords RLHF · preference modeling · language model alignment · helpful and harmless assistant · reinforcement learning · human feedback · NLP evaluation · AI safety

The pith

Reinforcement learning from human feedback aligns language models to be helpful and harmless assistants while improving performance on NLP evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how preference modeling followed by RLHF can finetune language models so they follow human judgments on what counts as helpful and harmless. A sympathetic reader would care because the same process also raises scores on nearly every standard NLP benchmark instead of creating the expected capability trade-off. The work further demonstrates that this alignment can run in parallel with training on narrow skills such as coding or summarization, and that fresh human feedback collected on a weekly cycle keeps improving the models without restarting from scratch.

Core claim

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence.

What carries the argument

Preference models trained on human rankings of model outputs, followed by reinforcement learning that optimizes a policy against those models.
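
To make the first stage concrete, here is a minimal sketch of a pairwise preference loss of the kind commonly used to train preference models. The page does not state the loss, so the logistic (Bradley-Terry style) form is an assumption, and the function names and scores are purely illustrative.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Logistic loss on the score margin: log(1 + exp(-(chosen - rejected))).

    Assumed Bradley-Terry style objective; not taken from this page.
    """
    margin = score_chosen - score_rejected
    # Numerically stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def preference_probability(score_chosen: float, score_rejected: float) -> float:
    """Probability the model assigns to the human-preferred response."""
    return math.exp(-pairwise_preference_loss(score_chosen, score_rejected))


if __name__ == "__main__":
    # Hypothetical preference-model scores for one human comparison.
    print(pairwise_preference_loss(1.2, 0.3))
    print(preference_probability(1.2, 0.3))
```

The loss shrinks as the preferred response is scored further above the rejected one, which is what lets the trained scorer serve as a reward signal in the RL stage.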

If this is right

  • Alignment training raises rather than lowers scores on most NLP tasks.
  • Specialized capability training such as coding or summarization can be performed alongside or after the alignment stage without interference.
  • Weekly online updates using new human rankings allow steady dataset and model improvement.
  • The observed, roughly linear relation between reward and the square root of the KL divergence gives a practical dial for controlling how far the policy moves from its initialization during RL.
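
One way to check the reward versus square-root-KL relation on a training run is an ordinary least-squares fit of reward against sqrt(KL). The sketch below assumes you have logged (KL, reward) pairs per checkpoint; the function name is hypothetical and the numbers are synthetic, not results from the paper.

```python
import math


def fit_reward_vs_sqrt_kl(kl_values, rewards):
    """Least-squares fit of reward = slope * sqrt(KL) + intercept."""
    xs = [math.sqrt(kl) for kl in kl_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(rewards) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, rewards))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Synthetic, made-up checkpoints that follow the relation exactly.
    kl = [0.0, 1.0, 4.0, 9.0, 16.0]
    reward = [0.1 + 0.5 * math.sqrt(k) for k in kl]
    print(fit_reward_vs_sqrt_kl(kl, reward))
```

On real logs the residuals of this fit, rather than the slope alone, would indicate how far "roughly linear" stretches.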

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the linear reward-KL relation generalizes, future runs could choose divergence targets in advance to keep models close enough to retain capabilities while still aligning.
  • The lack of conflict with domain-specific training suggests alignment can be applied as a reusable post-training step for many different model lineages.
  • Ongoing weekly feedback collection implies that alignment can be maintained as models continue to scale rather than treated as a one-time step.
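
The first extension above, choosing a divergence target in advance, amounts to inverting the fitted line. A minimal sketch, assuming the relation reward = slope * sqrt(KL) + intercept has already been fitted on an earlier run; the function name and all numbers are hypothetical.

```python
def kl_budget_for_reward(target_reward, slope, intercept):
    """Invert reward = slope * sqrt(KL) + intercept to get the implied KL."""
    if slope <= 0:
        raise ValueError("slope must be positive for the inversion to make sense")
    root = (target_reward - intercept) / slope
    if root < 0:
        return 0.0  # the target is already met at the initialization
    return root ** 2


if __name__ == "__main__":
    # Hypothetical fitted coefficients and reward target.
    print(kl_budget_for_reward(target_reward=1.1, slope=0.5, intercept=0.1))
```

This only extrapolates an empirical fit, so a budget far outside the KL range that was observed should be treated with caution.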

Load-bearing premise

Human preference rankings supply a stable training signal that does not contain systematic biases the RL stage will amplify into worse behavior.

What would settle it

Train a model with the described RLHF procedure and then find that its scores on standard NLP benchmarks have not risen, or that its rate of harmful outputs has not fallen, relative to the starting model.

Original abstract

We apply preference modeling and reinforcement learning from human feedback (RLHF) to finetune language models to act as helpful and harmless assistants. We find this alignment training improves performance on almost all NLP evaluations, and is fully compatible with training for specialized skills such as python coding and summarization. We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data, efficiently improving our datasets and models. Finally, we investigate the robustness of RLHF training, and identify a roughly linear relation between the RL reward and the square root of the KL divergence between the policy and its initialization. Alongside our main results, we perform peripheral analyses on calibration, competing objectives, and the use of OOD detection, compare our models with human writers, and provide samples from our models using prompts appearing in recent related work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript applies preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models as helpful and harmless assistants. It reports that this alignment training improves performance on almost all NLP evaluations while remaining compatible with specialized skill training such as Python coding and summarization. The work also introduces an iterated online training mode with weekly updates using fresh human feedback, identifies a roughly linear relation between RL reward and the square root of KL divergence from the initialization, and includes peripheral analyses on calibration, competing objectives, OOD detection, human writer comparisons, and model samples.

Significance. If the empirical results hold under rigorous controls, the paper demonstrates that RLHF can simultaneously boost general NLP capabilities and safety alignment in language models, with the iterated online approach offering a scalable data-collection strategy. The reported linear reward-KL relation provides an empirical regularity that could guide future theoretical analyses of RLHF optimization landscapes. These findings, supported by the paper's machine-checked or reproducible elements where present, would have direct implications for deploying aligned AI assistants.

major comments (1)
  1. [Abstract] Abstract and the section describing human preference data collection: the central claim that alignment training improves performance on almost all NLP evaluations rests on the assumption that human preference rankings supply a consistent, unbiased signal for both helpfulness and harmlessness. The manuscript does not appear to report inter-annotator agreement statistics or explicit tests for style biases (e.g., preference for longer or more confident responses) that could be amplified by the RL stage; without these, the observed gains risk being artifacts of the training signal rather than genuine capability or safety improvements.
minor comments (2)
  1. [Abstract] The abstract states a 'roughly linear relation' between RL reward and sqrt(KL) but does not specify the exact functional form, the range of KL values examined, or whether the relation holds after controlling for the reward scaling coefficient; adding this detail would strengthen the observation.
  2. [Abstract] The claim of compatibility with specialized skills (python coding, summarization) would benefit from explicit quantitative results or ablation tables showing performance before and after alignment training on those tasks.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address the single major comment below and indicate the revisions we will incorporate.

point-by-point responses
  1. Referee: [Abstract] Abstract and the section describing human preference data collection: the central claim that alignment training improves performance on almost all NLP evaluations rests on the assumption that human preference rankings supply a consistent, unbiased signal for both helpfulness and harmlessness. The manuscript does not appear to report inter-annotator agreement statistics or explicit tests for style biases (e.g., preference for longer or more confident responses) that could be amplified by the RL stage; without these, the observed gains risk being artifacts of the training signal rather than genuine capability or safety improvements.

    Authors: We agree that the absence of inter-annotator agreement (IAA) statistics and explicit bias analyses constitutes a gap in the current manuscript. In the revised version we will report IAA for both the helpfulness and harmlessness preference datasets (computed on the subset of examples that received multiple annotations). We will also add a short analysis of response-length and confidence-related biases: we will measure the correlation between preference scores and response length in the training data, and we will compare length distributions of model outputs before and after RLHF. While our annotation guidelines explicitly instructed raters to prioritize substantive helpfulness and harmlessness over stylistic features, we acknowledge that these controls are not sufficient without quantitative checks. The fact that gains appear on standard NLP benchmarks (many of which penalize overly verbose or hedging answers) provides indirect evidence against pure artifact explanations, but we will make this argument explicit and add the requested statistics and bias diagnostics.

    revision: yes
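
The two diagnostics discussed in this exchange can be sketched in a few lines. The example assumes inter-annotator agreement is measured with Cohen's kappa on doubly annotated comparisons and that length bias is measured with a Pearson correlation; neither choice is specified in the report or the rebuttal, and the data shown is invented.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def pearson(xs, ys):
    """Pearson correlation, e.g. between response length and preference score."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Invented annotations: which of two responses (A or B) each rater chose.
    rater_1 = ["A", "B", "A", "A", "B", "B"]
    rater_2 = ["A", "B", "B", "A", "B", "A"]
    print(cohens_kappa(rater_1, rater_2))
    # Invented response lengths (tokens) and preference-model scores.
    print(pearson([40, 80, 120, 200], [0.1, 0.4, 0.3, 0.9]))
```

A strongly positive length correlation would not by itself prove the bias the referee worries about, since longer answers can be genuinely more helpful, but it would flag where a controlled comparison is needed.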

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of fitted inputs

full rationale

The paper reports empirical measurements of RLHF alignment training on held-out NLP evaluations, human comparisons, and specialized tasks. The noted roughly linear relation between RL reward and sqrt(KL divergence) is described as an observed pattern from training runs, not a first-principles derivation or prediction that reduces to model inputs by construction. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The central claims rest on independent benchmark performance rather than any equation or theorem that collapses to the training data or preferences by definition.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the assumption that human preference rankings form a consistent and unbiased signal for both helpfulness and harmlessness. No new physical or mathematical entities are introduced; the work uses standard RL and preference modeling machinery.

free parameters (1)
  • RL reward scaling coefficient
    The strength of the preference model signal relative to the KL penalty is tuned to achieve the reported linear relation and performance gains.
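
To illustrate what this coefficient trades off, here is a minimal sketch of a KL-penalized reward. The form "preference-model score minus coefficient times KL" is a common RLHF convention and an assumption here; the page only names a scaling coefficient. The single-sample KL estimate and all names are hypothetical.

```python
def sequence_kl_estimate(logp_policy, logp_init):
    """Single-sample KL estimate: sum of per-token log-probability differences
    for a response sampled from the current policy."""
    return sum(p - q for p, q in zip(logp_policy, logp_init))


def penalized_reward(pm_score, logp_policy, logp_init, kl_coef):
    """Preference-model score minus a KL penalty toward the initial policy."""
    return pm_score - kl_coef * sequence_kl_estimate(logp_policy, logp_init)


if __name__ == "__main__":
    # Hypothetical per-token log-probabilities for one sampled response.
    policy_logps = [-1.0, -2.0]
    init_logps = [-1.5, -2.5]
    print(penalized_reward(1.0, policy_logps, init_logps, kl_coef=0.1))
```

A larger coefficient keeps the policy closer to its initialization at the cost of a lower achievable preference-model score.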
axioms (1)
  • domain assumption: Human preference rankings over model outputs are transitive and consistent enough to train a reliable reward model.
    Invoked when preference modeling is used as the training signal for RL.
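
The transitivity part of this axiom can be probed directly on pairwise data. A minimal sketch, assuming preferences are stored as (winner, loser) pairs over responses to the same prompt; the function name and the data are hypothetical.

```python
from itertools import combinations


def count_cyclic_triples(preferences):
    """Count triples of items whose pairwise preferences form a cycle
    (a beats b, b beats c, c beats a), which violates transitivity."""
    prefs = set(preferences)
    items = sorted({item for pair in prefs for item in pair})
    cycles = 0
    for a, b, c in combinations(items, 3):
        forward = (a, b) in prefs and (b, c) in prefs and (c, a) in prefs
        backward = (a, c) in prefs and (c, b) in prefs and (b, a) in prefs
        if forward or backward:
            cycles += 1
    return cycles


if __name__ == "__main__":
    # Invented comparisons among three responses to one prompt.
    print(count_cyclic_triples([("r1", "r2"), ("r2", "r3"), ("r3", "r1")]))
```

A nonzero count on real annotations would not invalidate preference modeling, but a high rate would suggest the rankings are noisier than the axiom presumes.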

pith-pipeline@v0.9.0 · 5561 in / 1222 out tokens · 25058 ms · 2026-05-10T13:30:30.292855+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.DimensionForcing dimension_forced unclear

    Relation between the paper passage and the cited Recognition theorem.

    We find this alignment training improves performance on almost all NLP evaluations

  • Foundation.PhiForcing phi_equation unclear

    Relation between the paper passage and the cited Recognition theorem.

    identify a roughly linear relation between the RL reward and the square root of the KL divergence

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  2. Contrastive Identification and Generation in the Limit

    cs.LG 2026-05 unverdicted novelty 8.0

    Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text iden...

  3. Weak-to-Strong Generalization is Nearly Inevitable (in Linear Models)

    cs.LG 2026-05 unverdicted novelty 8.0

    Weak-to-strong generalization is nearly inevitable in linear logistic regression for most student-teacher pairs without any model capacity mismatch.

  4. Efficient Preference Poisoning Attack on Offline RLHF

    cs.LG 2026-05 unverdicted novelty 8.0

    Label-flip attacks on log-linear DPO reduce to binary sparse approximation problems that can be solved efficiently by lattice-based and binary matching pursuit methods with recovery guarantees.

  5. Revisable by Design: A Theory of Streaming LLM Agent Execution

    cs.LG 2026-04 unverdicted novelty 8.0

    LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...

  6. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    cs.CR 2024-06 unverdicted novelty 8.0

    AgentDojo introduces an extensible evaluation framework populated with realistic agent tasks and security test cases to measure prompt injection robustness in tool-using LLM agents.

  7. Temper and Tilt Lead to SLOP: Reward Hacking Mitigation with Inference-Time Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    Temperature adjustment on the reference model generalizes inference-time alignment to SLOP ensembles of reward models, with a calibration algorithm that improves robustness to reward hacking while preserving alignment...

  8. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 7.0

    TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

  9. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  10. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  11. Structure from Strategic Interaction & Uncertainty: Risk Sensitive Games for Robust Preference Learning

    cs.GT 2026-05 unverdicted novelty 7.0

    Risk-sensitive preference games retain monotonicity via translation-invariant risk measures, enabling convergent self-play algorithms with stability bounds and empirical robustness across data strata.

  12. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  13. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  14. The Attacker in the Mirror: Breaking Self-Consistency in Safety via Anchored Bipolicy Self-Play

    cs.AI 2026-05 unverdicted novelty 7.0

    Anchored Bipolicy Self-Play trains role-specific LoRA adapters on a frozen base model to break self-consistency collapse in self-play red-teaming, yielding up to 100x parameter efficiency and stronger safety on Qwen2....

  15. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  16. Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective

    cs.LG 2026-05 unverdicted novelty 7.0

    The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...

  17. Topology-Enhanced Alignment for Large Language Models: Trajectory Topology Loss and Topological Preference Optimization

    cs.CL 2026-05 unverdicted novelty 7.0

    Topology-enhanced alignment via persistent homology on trajectories outperforms standard SFT and DPO baselines on preference metrics for LLMs.

  18. Theoretical Limits of Language Model Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    The maximum reward gain under KL-regularized LM alignment is a Jeffreys divergence term, estimable as covariance from base samples, with best-of-N approaching the theoretical limit.

  19. How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conform...

  20. PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

    cs.LG 2026-05 unverdicted novelty 7.0

    PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...

  21. Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

    cs.LG 2026-05 unverdicted novelty 7.0

    Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reaso...

  22. Moira: Language-driven Hierarchical Reinforcement Learning for Pair Trading

    cs.AI 2026-05 unverdicted novelty 7.0

    Moira parameterizes hierarchical RL policies for pair trading with LLMs and adapts them via prompt updates based on trajectory and episode feedback, outperforming baselines on real market data.

  23. Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

    cs.LG 2026-05 unverdicted novelty 7.0

    UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

  24. Attention Is Where You Attack

    cs.CR 2026-04 unverdicted novelty 7.0

    ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

  25. Mind the Gap: Structure-Aware Consistency in Preference Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...

  26. Three Models of RLHF Annotation: Extension, Evidence, and Authority

    cs.CY 2026-04 unverdicted novelty 7.0

    RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

  27. RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

    cs.CV 2026-04 unverdicted novelty 7.0

    RSRCC is a new 126k-question benchmark for fine-grained remote sensing change question-answering, constructed via a hierarchical semi-supervised pipeline with retrieval-augmented Best-of-N ranking.

  28. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  29. Policy Gradient Primal-Dual Method for Safe Reinforcement Learning from Human Feedback

    cs.LG 2026-04 unverdicted novelty 7.0

    Primal-dual policy gradient algorithms achieve global non-asymptotic convergence for safe RLHF cast as infinite-horizon discounted CMDPs without fitting reward models.

  30. Causal inference for social network formation

    econ.EM 2026-04 conditional novelty 7.0

    Random team assignments in a professional firm reveal that indirect ties strongly increase new direct tie formation, while effects of degree and local density are smaller and less robust.

  31. Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

    cs.CV 2026-04 conditional novelty 7.0

    Alignment of vision-language models with human V1-V3 early visual cortex negatively predicts resistance to sycophantic gaslighting attacks.

  32. Pareto-Optimal Offline Reinforcement Learning via Smooth Tchebysheff Scalarization

    cs.LG 2026-04 unverdicted novelty 7.0

    STOMP extends direct preference optimization to the multi-objective setting via smooth Tchebysheff scalarization and standardization of observed rewards, achieving highest hypervolume in eight of nine protein engineer...

  33. AI Integrity: A New Paradigm for Verifiable AI Governance

    cs.AI 2026-04 unverdicted novelty 7.0

    AI Integrity is defined as verifiable protection of an AI system's four-layer Authority Stack from corruption, with PRISM as the measurement framework.

  34. Too Nice to Tell the Truth: Quantifying Agreeableness-Driven Sycophancy in Role-Playing Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Agreeableness in AI personas reliably predicts sycophantic behavior in 9 of 13 tested language models.

  35. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  36. SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

    cs.CL 2026-04 accept novelty 7.0

    SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

  37. PRIME: Training Free Proactive Reasoning via Iterative Memory Evolution for User-Centric Agent

    cs.AI 2026-04 unverdicted novelty 7.0

    PRIME enables agents to proactively reason in user-centric tasks by iteratively evolving structured memories from interaction trajectories without gradient-based training.

  38. Personalizing Text-to-Image Generation to Individual Taste

    cs.CV 2026-04 unverdicted novelty 7.0

    PAMELA provides a multi-user rating dataset and personalized reward model that predicts individual image preferences more accurately than prior population-level aesthetic models.

  39. Many Preferences, Few Policies: Towards Scalable Language Model Personalization

    cs.CL 2026-04 unverdicted novelty 7.0

    PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.

  40. Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

    cs.LG 2026-03 unverdicted novelty 7.0

    Introduces robust estimators for linear Markov games in offline MARLHF that achieve O(ε^{1-o(1)}) or O(√ε) bounds on Nash or CCE gaps under uniform or unilateral coverage.

  41. Refusal in Language Models Is Mediated by a Single Direction

    cs.LG 2024-06 accept novelty 7.0

    Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

  42. TextGrad: Automatic "Differentiation" via Text

    cs.CL 2024-06 unverdicted novelty 7.0

    TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.

  43. KTO: Model Alignment as Prospect Theoretic Optimization

    cs.LG 2024-02 conditional novelty 7.0

    KTO aligns LLMs by directly maximizing prospect-theoretic utility on binary signals and matches or exceeds preference-based methods like DPO from 1B to 30B parameters.

  44. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  45. Measuring Faithfulness in Chain-of-Thought Reasoning

    cs.AI 2023-07 conditional novelty 7.0

    Chain-of-Thought reasoning in LLMs is often unfaithful, with models relying on it variably by task and less so as models scale larger.

  46. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  47. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  48. Fusion-fission forecasts when AI will shift to undesirable behavior

    cs.AI 2026-05 unverdicted novelty 6.0

    A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.

  49. Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.

  50. History Anchors: How Prior Behavior Steers LLM Decisions Toward Unsafe Actions

    cs.AI 2026-05 unverdicted novelty 6.0

    A single consistency instruction with harmful prior actions causes aligned frontier LLMs to select unsafe options at 91-98% rates in high-stakes domains, with escalation and inverse scaling by model size.

  51. Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

    cs.CR 2026-05 unverdicted novelty 6.0

    A hierarchical genetic algorithm induces overthinking in black-box LRMs, increasing output length by up to 26.1x on the MATH benchmark.

  52. Inducing Overthink: Hierarchical Genetic Algorithm-based DoS Attack on Black-Box Large Language Reasoning Models

    cs.CR 2026-05 unverdicted novelty 6.0

    A hierarchical genetic algorithm induces overthinking in black-box large reasoning models by perturbing logical structure, achieving up to 26.1x longer outputs on the MATH benchmark.

  53. Learning Perturbations to Extrapolate Your LLM

    stat.ML 2026-05 unverdicted novelty 6.0

    A learnable continuous perturbation framework for LLM token prefixes via latent vector transformations, optimized through unbiased estimating equations, yields gains in out-of-domain performance.

  54. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.

  55. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  56. TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

    cs.CL 2026-05 unverdicted novelty 6.0

    TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.

  57. On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

    cs.AI 2026-05 unverdicted novelty 6.0

    FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

  58. SafeSteer: A Decoding-level Defense Mechanism for Multimodal Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    SafeSteer improves safety in multimodal large language models by up to 33.4% via a decoding probe and modal alignment vector without any fine-tuning.

  59. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  60. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 182 Pith papers · 7 internal anchors

  1. [1]

    [Amodei et al., 2016] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., and Mané, D. (2016). Concrete problems in ai safety. [Askell et al., 2021] Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K.,

  2. [2]

    Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. (2021). A general language assistant as a laboratory for alignment. [Bender et al., 2021] Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? 🦜. In Proceedings of the 2021 AC...

  3. [3]

    On the Opportunities and Risks of Foundation Models

    Khattab, O., Koh, P. W., Krass, M. S., Krishna, R., Kuditipudi, R., et al. (2021). On the opportunities and risks of foundation models. CoRR, abs/2108.07258. [Borgeaud et al., 2021] Borgeaud, S., Mensch, A., Hoffmann, J., Cai, T., Rutherford, E., Millican, K., van den Driessche, G., Lespiau, J., Damoc, B., Clark, A., de Las Casas, D., Guy, A., Menick,...

  4. [4]

    Improving language models by retrieving from trillions of tokens. Preprint arXiv:2112.04426

    Vinyals, O., Osindero, S., Simonyan, K., Rae, J. W., Elsen, E., and Sifre, L. (2021). Improving language models by retrieving from trillions of tokens. CoRR, abs/2112.04426. [Brown et al., 2020] Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krue...

  5. [5]

    Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374. [Chowdhery et al., 2022] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A.,

  6. [6]

    Diaz, M., Firat, O., Catasta, M., Wei, J., Meier-Hellstern, K., Eck, D., Dean, J., Petrov, S., and Fiedel, N. (2022). Palm: Scaling language modeling with pathways. [Clark et al., 2018] Clark, P., Cowhey, I., Etzioni, O., Khot, T., Sabharwal, A., Schoenick, C., and Tafjord, O. (2018). Think you have solved question answering? try arc, the ai2 reasoning ch...

  7. [7]

    Amodei, D., Brown, T., Kaplan, J., McCandlish, S., Olah, C., and Clark, J. (2022). Predictability and surprise in large generative models. [Guo et al., 2017] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. [Guu et al., 2020] Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M. (2020). REALM: retr...

  8. [8]

    Pineau, J. (2017). Ethical challenges in data-driven dialogue systems. CoRR, abs/1711.09050. [Hendrycks et al., 2021a] Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. (2021a). Aligning ai with shared human values. [Hendrycks et al., 2021b] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Stei...

  9. [9]

    Schulman, J., Amodei, D., and McCandlish, S. (2020). Scaling laws for autoregressive generative modeling. [Hernandez et al., 2021] Hernandez, D., Kaplan, J., Henighan, T., and McCandlish, S. (2021). Scaling laws for transfer. CoRR, abs/2102.01293. [Hestness et al., 2019] Hestness, J., Ardalani, N., and Diamos, G. (2019). Beyond human-level accuracy: Com...

  10. [10]

    Etzioni, O., Sap, M., and Choi, Y. (2021). Delphi: Towards machine ethics and norms. [Joshi et al., 2017] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. [Kaplan et al., 2020] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R....

  11. [11]

    Radford, A., Wu, J., and Amodei, D. (2020). Scaling laws for neural language models. [Koch et al., 2021] Koch, J., Langosco, L., Pfau, J., Le, J., and Sharkey, L. (2021). Objective robustness in deep reinforcement learning. CoRR, abs/2105.14111. [Lakshminarayanan et al., 2016] Lakshminarayanan, B., Pritzel, A., and Blundell, C. (2016). Simple and scalable...

  12. [12]

    Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. CoRR, abs/2005.11401. [Liang et al., 2017] Liang, S., Li, Y ., and Srikant, R. (2017). Enhancing the reliability of out-of-distribution image detection in neural networks. [Lin et al., 2021] Lin, S., Hilton, J., and Evan...

  13. [13]

    Chess, B., and Schulman, J. (2021). Webgpt: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332. [Nalisnick et al., 2019] Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. (2019). Hybrid models with deep and invertible features. [Nguyen et al., 2014] Nguyen, A., Yosinski, J., and Clune, J. (2014). Dee...

  14. [14]

    Baroni, M., Boleda, G., and Fernández, R. (2016). The lambada dataset: Word prediction requiring a broad discourse context. [Parrish et al., 2021] Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. R. (2021). BBQ: A hand-built bias benchmark for question answering. CoRR, abs/2110.08193. [Paszke et al....

  15. [15]

    Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. (2019). Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc. [Perez et al.,...

  16. [16]

    Lockhart, E., Osindero, S., Rimell, L., Dyer, C., Vinyals, O., Ayoub, K., Stanway, J., Bennett, L., Hassabis, D., Kavukcuoglu, K., and Irving, G. (2021). Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446. [Ramasesh et al., 2022] Ramasesh, V. V., Lewkowycz, A., and Dyer, E. (2022). Effect of scale on catas- ...

  17. [17]

    Amodei, D., and Christiano, P. (2020). Learning to summarize from human feedback. [Thoppilan et al., 2022] Thoppilan, R., Freitas, D. D., Hall, J., Shazeer, N., Kulshreshtha, A., Cheng, H.,

  18. [18]

    Cohen, A., Bernstein, R., Kurzweil, R., Aguera-Arcas, B., Cui, C., Croak, M., Chi, E., and Le, Q. (2022). Lamda: Language models for dialog applications. CoRR, abs/2201.08239. [Thulasidasan et al., 2021] Thulasidasan, S., Thapa, S., Dhaubhadel, S., Chennupati, G., Bhattacharya, T., and Bilmes, J. (2021). A simple and effective baseline for out-of-distribu...

  19. [19]

    Ethical and social risks of harm from Language Models

    Birhane, A., Haas, J., Rimell, L., Hendricks, L. A., Isaac, W. S., Legassick, S., Irving, G., and Gabriel, I. (2021). Ethical and social risks of harm from language models. CoRR, abs/2112.04359. [Winkens et al., 2020] Winkens, J., Bunel, R., Roy, A. G., Stanforth, R., Natarajan, V., Ledsam, J. R.,

  20. [20]

    MacWilliams, P., Kohli, P., Karthikesalingam, A., Kohl, S., Cemgil, T., Eslami, S. M. A., and Ronneberger, O. (2020). Contrastive training for improved out-of-distribution detection. [Xu et al., 2020] Xu, J., Ju, D., Li, M., Boureau, Y.-L., Weston, J., and Dinan, E. (2020). Recipes for safety in open-domain chatbots. arXiv preprint arXiv:2010.07079. [Zel...