super hub Canonical reference

Training language models to follow instructions with human feedback

Carroll L. Wainwright, Diogo Almeida, Jeff Wu, Long Ouyang, Pamela Mishkin, Xu Jiang · 2022 · cs.CL · arXiv 2203.02155

Canonical reference. 93% of citing Pith papers cite this work as background.

218 Pith papers citing it

Background 93% of classified citations

open full Pith review browse 218 citing papers more from Carroll L. Wainwright arXiv PDF

abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 54 method 1 other 1

citation-polarity summary

background 52 unclear 3 use method 1

claims ledger

abstract Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u

authors

Carroll L. Wainwright Diogo Almeida Jeff Wu Long Ouyang Pamela Mishkin Xu Jiang

co-cited works

representative citing papers

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

cs.CL · 2022-01-28 · accept · novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems

cs.MA · 2024-10-09 · unverdicted · novelty 8.0 · 2 refs

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Generative Agents: Interactive Simulacra of Human Behavior

cs.HC · 2023-04-07 · accept · novelty 8.0

Generative agents with memory streams, reflection, and planning using LLMs exhibit believable individual and emergent social behaviors in a simulated town.

Discovering Latent Knowledge in Language Models Without Supervision

cs.CL · 2022-12-07 · conditional · novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average across models and datasets.

Code as Policies: Language Model Programs for Embodied Control

cs.RO · 2022-09-16 · accept · novelty 8.0

Language models generate robot policy code from natural language commands via few-shot prompting, enabling spatial-geometric reasoning, generalization, and precise control on real robots.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

cs.CL · 2026-05-16 · unverdicted · novelty 7.0

PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

math.OC · 2026-05-09 · unverdicted · novelty 7.0

Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends on intrinsic manifold dimension.

Leveraging Pretrained Language Models as Energy Functions for Glauber Dynamics Text Diffusion

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

Pretrained language models are used as energy functions for Glauber dynamics in discrete text diffusion, improving generation quality over prior diffusion LMs and matching autoregressive models on benchmarks and reasoning tasks.

ContextualJailbreak: Evolutionary Red-Teaming via Simulated Conversational Priming

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

ContextualJailbreak uses evolutionary search over simulated primed dialogues with novel mutations to reach 90-100% attack success on open LLMs and transfers to some closed frontier models at 15-90% rates.

VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

cs.CV · 2026-05-02 · unverdicted · novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

Political Bias Audits of LLMs Capture Sycophancy to the Inferred Auditor

cs.AI · 2026-04-30 · conditional · novelty 7.0

Political bias audits of LLMs largely capture sycophantic accommodation to the inferred political identity of the asker rather than any fixed model ideology.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

citing papers explorer

Showing 50 of 218 citing papers.

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models cs.CV · 2023-03-08 · accept · none · ref 29 · internal anchor
Visual ChatGPT integrates visual foundation models with ChatGPT via prompts to enable multi-step image understanding, generation, and editing in conversational interactions.
A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 44 · internal anchor
Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 266 · internal anchor
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
InCoder: A Generative Model for Code Infilling and Synthesis cs.SE · 2022-04-12 · unverdicted · none · ref 22 · internal anchor
InCoder is the first generative model to directly perform zero-shot code infilling via bidirectional context from a masked-then-appended training scheme, matching left-to-right models on synthesis while improving on type inference, comment generation, and variable renaming.
RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning cs.LG · 2026-06-05 · unverdicted · none · ref 41 · internal anchor
RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.
What Do People Actually Want From AI? Mapping Preference Plurality cs.CL · 2026-06-04 · unverdicted · none · ref 66 · internal anchor
Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.
Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game cs.CL · 2026-06-03 · unverdicted · none · ref 7 · internal anchor
LLMs produce human-like finite bids in the St. Petersburg game but shift toward rational behavior under controlled prompt changes, indicating surface-level outcome resemblance without mechanism-level alignment.
Annealed Softmax Greedy in Many-Armed Bayesian Bandits cs.LG · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
Annealed softmax greedy achieves Õ(m + T/m) Bayes regret (Õ(√T) at m=Θ(√T)) in many-armed Bayesian Bernoulli bandits under linear upper-tail prior condition, matching empirical-mean greedy.
Understanding Goal Generalisation in Sequential Reinforcement Learning cs.LG · 2026-05-22 · unverdicted · none · ref 49 · internal anchor
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA cs.CL · 2026-05-21 · unverdicted · none · ref 20 · internal anchor
Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate differences.
When Are Teacher Tokens Reliable? Position-Weighted On-Policy Self-Distillation for Reasoning cs.LG · 2026-05-20 · unverdicted · none · ref 26 · internal anchor
Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.
TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health cs.LG · 2026-05-20 · unverdicted · none · ref 49 · internal anchor
TimeSRL uses semantic abstractions from time-series data optimized via reinforcement learning to achieve better cross-dataset generalization than standard ML or LLM baselines in mental health prediction.
Reinforcing Human Behavior Simulation via Verbal Feedback cs.LG · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR cs.AI · 2026-05-19 · unverdicted · none · ref 26 · internal anchor
POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining cs.LG · 2026-05-19 · unverdicted · none · ref 7 · internal anchor
DG-Hard uses Donoho-Gavish hard thresholding on the fine-tuning weight delta to separate task-aligned signal from noise-like residual, recovering damaged capabilities while preserving target-task gains.
EmbGen: Teaching with Reassembled Corpora cs.CL · 2026-05-19 · unverdicted · none · ref 18 · internal anchor
EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.
Towards Human-Level Book-Writing Capability cs.AI · 2026-05-16 · unverdicted · none · ref 4 · internal anchor
A prompt-to-book training framework that derives hierarchical summaries from public-domain novels and inverts them to supervise long-context models toward human literary prose instead of assistant-style output.
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy cs.LG · 2026-05-14 · conditional · none · ref 17 · internal anchor
ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.
Early Data Exposure Improves Robustness to Subsequent Fine-Tuning cs.LG · 2026-05-12 · conditional · none · ref 15 · internal anchor
Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language models.
Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training cs.LG · 2026-05-12 · unverdicted · none · ref 21 · internal anchor
A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.
Rotation-Preserving Supervised Fine-Tuning cs.LG · 2026-05-08 · unverdicted · none · ref 56 · internal anchor
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models cs.LG · 2026-05-08 · unverdicted · none · ref 65 · internal anchor
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.
Why Does Agentic Safety Fail to Generalize Across Tasks? cs.LG · 2026-05-07 · conditional · none · ref 84 · internal anchor
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code cs.SE · 2026-05-06 · accept · none · ref 95 · internal anchor
A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation cs.LG · 2026-05-06 · unverdicted · none · ref 154 · internal anchor
The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
RLearner-LLM: Balancing Logical Grounding and Fluency in Large Language Models via Hybrid Direct Preference Optimization cs.CL · 2026-05-06 · unverdicted · none · ref 1 · 3 links · internal anchor
RLearner-LLM achieves up to 6x gains in NLI entailment over standard fine-tuning by using an automated hybrid DPO pipeline that balances logic and fluency across multiple model sizes and domains.
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management cs.LG · 2026-05-04 · unverdicted · none · ref 37 · internal anchor
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
Iterative Finetuning is Mostly Idempotent cs.AI · 2026-05-01 · unverdicted · none · ref 7 · internal anchor
Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
Diversity in Large Language Models under Supervised Fine-Tuning cs.LG · 2026-04-30 · unverdicted · none · ref 33 · 2 links · internal anchor
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning cs.CR · 2026-04-30 · unverdicted · none · ref 18 · internal anchor
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
What Did They Mean? How LLMs Resolve Ambiguous Social Situations across Perspectives and Roles cs.HC · 2026-04-27 · unverdicted · none · ref 11 · internal anchor
LLMs produce interpretive closure in 87.5% of ambiguous social scenarios through narrative alignment, reversal, or normative advice, with first-person perspectives increasing alignment tendencies.
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models cs.CR · 2026-04-23 · unverdicted · none · ref 16 · internal anchor
Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.
Structural Quality Gaps in Practitioner AI Governance Prompts: An Empirical Study Using a Five-Principle Evaluation Framework cs.SE · 2026-04-22 · unverdicted · none · ref 12 · internal anchor
A new five-principle framework applied to 34 practitioner AI governance prompts finds 37% lack key structural elements such as data classification and rubrics.
Vibrotactile Preference Learning: Uncertainty-Aware Preference Learning for Personalized Vibration Feedback cs.HC · 2026-04-22 · unverdicted · none · ref 26 · internal anchor
VPL learns individualized vibrotactile preferences efficiently via uncertainty-aware Gaussian process models and active query selection in a 13-participant user study on an Xbox controller.
Different Paths to Harmful Compliance: Behavioral Side Effects and Mechanistic Divergence Across LLM Jailbreaks cs.CR · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
Different LLM jailbreak techniques achieve similar harmful compliance but lead to distinct behavioral side effects and mechanistic changes.
ClawGuard: A Runtime Security Framework for Tool-Augmented LLM Agents Against Indirect Prompt Injection cs.CR · 2026-04-13 · unverdicted · none · ref 28 · 2 links · internal anchor
ClawGuard enforces deterministic, user-derived access constraints at tool boundaries to block indirect prompt injection without changing the underlying LLM.
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL · 2026-04-13 · unverdicted · none · ref 15 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
Pioneer Agent: Continual Improvement of Small Language Models in Production cs.AI · 2026-04-10 · unverdicted · none · ref 69 · internal anchor
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
C$^2$T: Captioning-Structure and LLM-Aligned Common-Sense Reward Learning for Traffic--Vehicle Coordination cs.MA · 2026-04-10 · unverdicted · none · ref 9 · internal anchor
C2T learns an LLM-derived common-sense reward function to improve cooperative multi-intersection traffic control policies, outperforming standard MARL baselines on efficiency, safety, and energy proxies while allowing prompt-based policy tuning.
Dictionary-Aligned Concept Control for Safeguarding Multimodal LLMs cs.LG · 2026-04-10 · unverdicted · none · ref 73 · internal anchor
DACO curates a 15,000-concept dictionary from 400K image-caption pairs and uses it to initialize an SAE that enables granular, concept-specific steering of MLLM activations, raising safety scores on MM-SafetyBench and JailBreakV while preserving general capabilities.
TrajGuard: Streaming Hidden-state Trajectory Detection for Decoding-time Jailbreak Defense cs.CR · 2026-04-09 · unverdicted · none · ref 22 · internal anchor
TrajGuard detects jailbreaks by tracking how hidden-state trajectories move toward high-risk regions during decoding, achieving 95% defense rate with 5.2 ms/token latency across tested attacks.
SysTradeBench: An Iterative Build-Test-Patch Benchmark for Strategy-to-Code Trading Systems with Drift-Aware Diagnostics cs.SE · 2026-04-06 · unverdicted · none · ref 3 · internal anchor
SysTradeBench evaluates 17 LLMs on 12 trading strategies, finding over 91.7% code validity but rapid convergence in iterative fixes and a continued need for human oversight on critical strategies.
Unmasking Hallucinations: A Causal Graph-Attention Perspective on Factual Reliability in Large Language Models cs.CL · 2026-04-05 · unverdicted · none · ref 2 · internal anchor
GCAN cuts LLM hallucination rates by 27.8% and raises factual accuracy by 16.4% on TruthfulQA and HotpotQA by using causal token graphs and a new Causal Contribution Score.
MemFactory: Unified Inference & Training Framework for Agent Memory cs.CL · 2026-03-31 · unverdicted · none · ref 9 · internal anchor
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
Teaching an Agent to Sketch One Part at a Time cs.AI · 2026-03-19 · unverdicted · none · ref 22 · internal anchor
A multi-modal LM agent is trained to produce vector sketches part-by-part via supervised fine-tuning and process-reward RL on the new ControlSketch-Part dataset with automatic part annotations.
Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings cs.LG · 2026-03-11 · unverdicted · none · ref 12 · internal anchor
HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.
Do Fine-Tuned LLMs Understand Vulnerabilities? An Investigation into the Semantic Trap cs.CR · 2026-01-30 · unverdicted · none · ref 32 · internal anchor
Fine-tuned decoder-only LLMs fall into a Semantic Trap on vulnerability detection, achieving high scores on unpaired normal code but failing on paired vulnerable-patched code, semantic perturbations, and gap analysis, while reasoning supervision reduces symptoms at the cost of recall.
SWaRL: Safeguard Code Watermarking via Reinforcement Learning cs.CR · 2026-01-05 · unverdicted · none · ref 23 · internal anchor
SWaRL trains code LLMs with RL using compiler correctness signals and a confidential verifier reward to embed robust, functionality-preserving watermarks that resist refactoring attacks.
Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety cs.CL · 2025-12-08 · unverdicted · none · ref 34 · internal anchor
Distilling safe refusal behavior from OpenAI o1-mini into Llama-3, Gemma-2, and Qwen3 models via response-based LoRA on multilingual jailbreak data increases jailbreak success rates on MultiJail by up to 16.6 points.
SAM 3D: 3Dfy Anything in Images cs.CV · 2025-11-20 · unverdicted · none · ref 28 · internal anchor
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

Training language models to follow instructions with human feedback

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer