pith. sign in

hub Canonical reference

Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744

Canonical reference. 83% of citing Pith papers cite this work as background.

81 Pith papers citing it
Background 83% of classified citations

hub tools

citation-role summary

background 21 method 3

citation-polarity summary

claims ledger

  • background step limit, demonstrating a failure to incorporate short-term state and past actions into decision-making. Agents also typically disregard previously entered inputs or action history [162]. Modern pretraining and supervised fine-tuning paradigms on dialogue-style data, which trains the model to learn short-term instruction-response behavior (while deprioritizing long-term embodied sequential state tracking), are likely resulting in these shortcomings [165, 162]. Premature termination and achieva
  • background further advances the alignment of LLMs with human intent by applying supervised fine-tuning (SFT) on instruction-following datasets, followed by reinforcement learning from human feedback (RLHF). Since then, alignment techniques have been extensively studied to ensure that large AI models behave in accordance with safety considerations, human preferences, and values [52]. These technological advances have led to the development of highly capable commercial LLMs, such as GPT- 4 [3] and Claude, wh
  • method seamlessly integrates reasoning and action generation, allowing adaptive switching between direct trajectory generation and CoT reasoning. In supervised fine-tuning (SFT), we leverage both trajectory- only data and CoT reasoning data to equip the model with dual-process capabilities (fast and slow thinking). Furthermore, we propose reinforcement fine-tuning (RFT) [48], utilizing Group Relative Policy Optimization (GRPO) [49] with verifiable planning reward functions. This enables adaptive reason
  • background Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024. [2] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. [3] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katar
  • background Large language models (LLMs) are increasingly deployed in high-stakes settings, spanning scientific research [17], cybersecurity [28], and medical consultation [14], making misuse prevention a central safety challenge. Recent advances in model reasoning, safety alignment, and external guardrails have made frontier systems more effective at refusing explicit harmful requests [20, 1, 11, 42]. However, these improvements have also changed how attacks are carried out: rather than stating a harmful o
  • background that increasing instruction variation yields larger gains than scaling the number of training instances. Our work differs from these prior approaches by generating multiple response variations per question via heuristic-conditioned prompting and studying the effect of this diversity during mid-training on subsequent RL. Reinforcement Learning for LLMsReinforcement Learning from Human Feedback (RLHF) [ 38] has become a standard post-training step, aligning models with human preferences by trainin

co-cited works

representative citing papers

MemGym: a Long-Horizon Memory Environment for LLM Agents

cs.CL · 2026-05-20 · unverdicted · novelty 7.0

MemGym unifies agent gyms into a memory benchmark with isolated scoring across tool-use, research, coding, and computer-use regimes plus a lightweight reward model for tractable coding evaluation.

GRASP: Deterministic argument ranking in interaction graphs

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

GRASP aggregates stable local LLM interaction judgments into global argument rankings via a convergent attack-defense propagation operator on interaction graphs, yielding higher reproducibility than holistic judging and no correlation with human convincingness.

Learning from Language Feedback via Variational Policy Distillation

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.

Do Coding Agents Understand Least-Privilege Authorization?

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.

MASS-DPO: Multi-negative Active Sample Selection for Direct Policy Optimization

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

MASS-DPO derives a Plackett-Luce-specific log-determinant Fisher information objective to select non-redundant negative samples, matching or exceeding multi-negative DPO performance with substantially fewer negatives across four benchmarks and three model families.

Bounded Ratio Reinforcement Learning

cs.LG · 2026-04-20 · conditional · novelty 7.0

BRRL derives an analytic optimal policy for regularized constrained RL that guarantees monotonic improvement and yields the BPO algorithm that matches or exceeds PPO.

Beyond Syntax: Action Semantics Learning for App Agents

cs.AI · 2025-06-21 · unverdicted · novelty 7.0

Action Semantics Learning trains app agents to align with the semantic effects of actions via a Semantic Estimator module, improving robustness to out-of-distribution scenarios over syntax-matching fine-tuning.

CLORE: Content-Level Optimization for Reasoning Efficiency

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

CLORE augments correct on-policy rollouts by deleting repetitive and irrelevant segments then optimizes with auxiliary DPO to improve accuracy-efficiency trade-off on math benchmarks.

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.

citing papers explorer

Showing 50 of 81 citing papers.