pith. machine review for the scientific record.

arxiv: 2411.15124 · v5 · submitted 2024-11-22 · 💻 cs.CL

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Pith reviewed 2026-05-11 05:03 UTC · model grok-4.3

keywords language model post-training · supervised fine-tuning · direct preference optimization · reinforcement learning · open source models · benchmark evaluation · data decontamination

The pith

Fully open post-training on Llama 3.1 bases yields models that surpass several closed systems on benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tulu 3, a family of models refined from Llama 3.1 bases through supervised finetuning, direct preference optimization, and a new method called reinforcement learning with verifiable rewards. These models achieve higher scores than the official instruct versions of Llama 3.1, Qwen 2.5, and Mistral, as well as closed models including GPT-4o-mini and Claude 3.5-Haiku. The work supplies complete datasets, code, infrastructure, and a multi-task evaluation scheme that includes development and unseen splits along with decontamination of training data. A sympathetic reader would care because post-training steps have long remained opaque, and an open recipe that reaches competitive performance removes a major barrier to further progress. The authors also report which training approaches failed to deliver reliable gains.

Core claim

Tulu 3 demonstrates that applying supervised finetuning, direct preference optimization, and reinforcement learning with verifiable rewards to Llama 3.1 base models, using carefully curated and decontaminated data, produces results that exceed those of Llama 3.1 instruct models, Qwen 2.5 instruct, Mistral instruct, GPT-4o-mini, and Claude 3.5-Haiku on the multi-task benchmarks.

What carries the argument

Reinforcement Learning with Verifiable Rewards (RLVR), which uses automatically verifiable signals to guide reinforcement learning instead of relying only on preference data or model judges.
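
As an illustration of what an automatically verifiable signal can look like, here is a minimal sketch of a binary reward for a math-style task. It is an assumption-laden example, not the paper's implementation: the function names, the answer-extraction rule (take the last number in the completion), and the reward scale `REWARD_IF_CORRECT` are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical reward scale; this summary does not state the value used.
REWARD_IF_CORRECT = 1.0


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion, or None if there is none.

    Assumption: the final numeric token is the model's answer. Thousands
    separators are stripped so "1,000" parses as 1000.
    """
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else None


def verifiable_reward(completion: str, gold_answer: str) -> float:
    """Binary reward: full credit only when the extracted answer matches."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    try:
        correct = float(predicted) == float(gold_answer)
    except ValueError:
        # A non-numeric gold answer cannot be checked by this verifier.
        return 0.0
    return REWARD_IF_CORRECT if correct else 0.0
```

The point of the design is that the reward comes from a deterministic check rather than from a learned reward model or a model judge, so it cannot be raised by outputs that merely look plausible.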

If this is right

  • Post-training can be fully reproduced and adapted to new domains using the released data, code, and procedures.
  • A combination of SFT, DPO, and RLVR reliably improves over base models on the tested benchmarks.
  • Decontamination and separate unseen splits provide a stricter test than standard benchmark reporting.
  • Some common training techniques do not produce consistent improvements and can be deprioritized.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The open release of the full pipeline could allow independent groups to match or exceed current closed-model performance on similar tasks.
  • RLVR may extend naturally to any domain where correctness can be checked automatically, such as code generation or mathematical reasoning.
  • Widespread adoption of the decontamination and multi-split evaluation approach could raise the bar for future post-training papers.

Load-bearing premise

The multi-task evaluation scheme with decontamination and unseen splits accurately measures real generalization instead of overfitting to known benchmark distributions.

What would settle it

A run of the released Tulu 3 models on a new collection of tasks assembled after the training data cutoff, or on live user queries, that shows no performance edge over the instruct baselines.

Original abstract

Language model post-training is applied to refine behaviors and unlock new skills across a wide range of recent language models, but open recipes for applying these techniques lag behind proprietary ones. The underlying training data and recipes for post-training are simultaneously the most important pieces of the puzzle and the portion with the least transparency. To bridge this gap, we introduce Tulu 3, a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques. Tulu 3, which builds on Llama 3.1 base models, achieves results surpassing the instruct versions of Llama 3.1, Qwen 2.5, Mistral, and even closed models such as GPT-4o-mini and Claude 3.5-Haiku. The training algorithms for our models include supervised finetuning (SFT), Direct Preference Optimization (DPO), and a novel method we call Reinforcement Learning with Verifiable Rewards (RLVR). With Tulu 3, we introduce a multi-task evaluation scheme for post-training recipes with development and unseen evaluations, standard benchmark implementations, and substantial decontamination of existing open datasets on said benchmarks. We conclude with analysis and discussion of training methods that did not reliably improve performance. In addition to the Tulu 3 model weights and demo, we release the complete recipe -- including datasets for diverse core skills, a robust toolkit for data curation and evaluation, the training code and infrastructure, and, most importantly, a detailed report for reproducing and further adapting the Tulu 3 approach to more domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Tulu 3, a family of fully open post-trained models built on Llama 3.1 base models. It applies supervised fine-tuning (SFT), Direct Preference Optimization (DPO), and a novel Reinforcement Learning with Verifiable Rewards (RLVR) method, claiming superior performance over Llama 3.1 instruct, Qwen 2.5, Mistral, GPT-4o-mini, and Claude 3.5-Haiku. The work releases all training data, code, recipes, and models, while introducing a multi-task evaluation protocol with development/unseen splits and substantial decontamination of open benchmark datasets.

Significance. If the benchmark results hold under rigorous decontamination, the primary contribution is the complete, reproducible open recipe for modern post-training that includes both established methods and RLVR, plus analysis of approaches that failed to improve performance. Releasing the full data, code, infrastructure, and detailed report enables independent verification and adaptation, which is a substantial advance for the open-source community.

major comments (2)
  1. [Abstract / Evaluation section] Abstract and evaluation description: the claim of surpassing closed models rests on benchmark results after 'substantial decontamination,' yet no concrete method is specified (e.g., n-gram overlap thresholds, embedding similarity cutoffs, model-based detection, or paraphrase handling). Without these details, residual leakage on MMLU, GSM8K, or HumanEval cannot be ruled out, directly affecting the validity of the generalization claims.
  2. [Results] Results presentation: the abstract reports benchmark wins, but the manuscript must include full tables with per-task scores, error bars or multiple seeds, and explicit ablations isolating the contribution of RLVR versus SFT+DPO to substantiate the performance frontier claim.
minor comments (1)
  1. [Evaluation] The multi-task evaluation scheme with dev/unseen splits is a positive design choice; clarify how the unseen split is constructed and whether it overlaps with any training data beyond the stated decontamination.
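
To make the first major comment concrete, the sketch below shows one form an n-gram overlap check could take. It is only an illustration of the kind of detail the referee asks for, not the procedure used in the paper: the whitespace tokenization, the n-gram size `NGRAM_SIZE`, the cutoff `OVERLAP_THRESHOLD`, and every function name are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

NGRAM_SIZE = 8           # hypothetical n-gram length
OVERLAP_THRESHOLD = 0.5  # hypothetical cutoff on the overlapping fraction

Ngram = Tuple[str, ...]


def ngrams(text: str, n: int = NGRAM_SIZE) -> Set[Ngram]:
    """Lowercased whitespace-token n-grams; empty if the text is shorter than n."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_benchmark_index(prompts: Iterable[str], n: int = NGRAM_SIZE) -> Set[Ngram]:
    """Union of all n-grams that appear in any benchmark prompt."""
    index: Set[Ngram] = set()
    for prompt in prompts:
        index |= ngrams(prompt, n)
    return index


def overlap_fraction(example: str, index: Set[Ngram], n: int = NGRAM_SIZE) -> float:
    """Share of the example's n-grams that also occur in the benchmark index."""
    grams = ngrams(example, n)
    if not grams:
        return 0.0
    return len(grams & index) / len(grams)


def decontaminate(
    train_examples: Iterable[str],
    benchmark_prompts: Iterable[str],
    n: int = NGRAM_SIZE,
    threshold: float = OVERLAP_THRESHOLD,
) -> List[str]:
    """Keep only training examples whose overlap stays below the threshold."""
    index = build_benchmark_index(benchmark_prompts, n)
    return [ex for ex in train_examples if overlap_fraction(ex, index, n) < threshold]
```

An exact-match check of this kind misses paraphrased test items, which is why the referee also lists embedding similarity and paraphrase handling as separate questions.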

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript to incorporate the suggested clarifications and additions.

Point-by-point responses
  1. Referee: [Abstract / Evaluation section] Abstract and evaluation description: the claim of surpassing closed models rests on benchmark results after 'substantial decontamination,' yet no concrete method is specified (e.g., n-gram overlap thresholds, embedding similarity cutoffs, model-based detection, or paraphrase handling). Without these details, residual leakage on MMLU, GSM8K, or HumanEval cannot be ruled out, directly affecting the validity of the generalization claims.

    Authors: We agree that explicit details on the decontamination procedure are necessary to support the generalization claims. In the revised manuscript, we will add a dedicated subsection in the evaluation protocol describing the exact decontamination methods, including the n-gram overlap thresholds applied, embedding similarity cutoffs, any model-based detection steps, and handling of paraphrases. This will allow readers to assess residual leakage risks on MMLU, GSM8K, HumanEval, and other benchmarks. revision: yes

  2. Referee: [Results] Results presentation: the abstract reports benchmark wins, but the manuscript must include full tables with per-task scores, error bars or multiple seeds, and explicit ablations isolating the contribution of RLVR versus SFT+DPO to substantiate the performance frontier claim.

    Authors: We will strengthen the results section to meet this requirement. The revised paper will include comprehensive tables with per-task scores across all evaluated benchmarks, report error bars or multi-seed averages where computationally feasible, and add explicit ablation experiments that isolate the contribution of RLVR relative to the SFT+DPO baseline. These changes will more clearly substantiate the performance claims and the value of the novel RLVR method. revision: yes
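
The multi-seed reporting and ablation comparison discussed in the second exchange could be computed along the following lines. This is a generic sketch under stated assumptions, not the authors' evaluation code: the function names are hypothetical, scores from different seeds are treated as independent samples, and the two configurations being compared are assumed to be run independently of each other.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_stderr(scores: List[float]) -> Tuple[float, float]:
    """Mean and standard error of the mean over per-seed scores.

    With a single seed the standard error is undefined and returned as NaN.
    """
    mean = statistics.fmean(scores)
    if len(scores) < 2:
        return mean, float("nan")
    return mean, statistics.stdev(scores) / math.sqrt(len(scores))


def ablation_delta(
    with_stage: List[float], without_stage: List[float]
) -> Tuple[float, float]:
    """Difference in mean score and its standard error.

    Assumes the two sets of runs are independent, so their variances add.
    """
    mean_a, se_a = mean_and_stderr(with_stage)
    mean_b, se_b = mean_and_stderr(without_stage)
    return mean_a - mean_b, math.sqrt(se_a ** 2 + se_b ** 2)
```

Passing the per-seed scores of an SFT+DPO+RLVR run and an SFT+DPO run to `ablation_delta` would give the size of the RLVR contribution together with an uncertainty estimate, which is the comparison the referee asks for.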

Circularity Check

0 steps flagged

No circularity: purely empirical post-training results with released artifacts

Full rationale

The paper reports experimental outcomes from SFT, DPO, and the introduced RLVR on Llama 3.1 bases, evaluated via multi-task benchmarks with decontamination and unseen splits. No derivation chain, equations, or first-principles predictions exist that could reduce to inputs by construction; performance claims rest on direct training runs and external verification via released data, code, and models rather than self-referential metrics or fitted parameters renamed as predictions. Self-citations to prior Tulu work are present but non-load-bearing for the central empirical claims, which remain independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on standard supervised learning assumptions (i.i.d. data, gradient descent convergence) and benchmark validity; no new invented entities or ad-hoc axioms are introduced in the abstract. Free parameters are the usual training hyperparameters (learning rates, batch sizes, reward scales) whose specific values are not detailed here.

pith-pipeline@v0.9.0 · 5693 in / 1154 out tokens · 23646 ms · 2026-05-11T05:03:07.115505+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  2. Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling

    cs.LG 2026-05 unverdicted novelty 7.0

    DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task performance on MetaWorld benchmarks.

  3. CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

    cs.CV 2026-05 unverdicted novelty 7.0

    CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...

  4. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 7.0

    Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard...

  5. No More, No Less: Task Alignment in Terminal Agents

    cs.LG 2026-05 unverdicted novelty 7.0

    The TAB benchmark reveals that frontier terminal agents achieve high task completion but low selective alignment with relevant environmental cues over distractors, and prompt-injection defenses block both.

  6. Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.

  7. Variance-aware Reward Modeling with Anchor Guidance

    stat.ML 2026-05 unverdicted novelty 7.0

    Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, ...

  8. K12-KGraph: A Curriculum-Aligned Knowledge Graph for Benchmarking and Training Educational LLMs

    cs.CL 2026-05 conditional novelty 7.0

    K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.

  9. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  10. Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

    cs.CL 2026-05 unverdicted novelty 7.0

    POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...

  11. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 7.0

    MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.

  12. Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent

    cs.LG 2026-05 unverdicted novelty 7.0

    Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...

  13. Parallel-SFT: Improving Zero-Shot Cross-Programming-Language Transfer for Code RL

    cs.CL 2026-04 unverdicted novelty 7.0

    Parallel-SFT mixes parallel programs across languages during SFT to produce more transferable RL initializations, yielding better zero-shot generalization to unseen programming languages.

  14. Multi-modal Reasoning with LLMs for Visual Semantic Arithmetic

    cs.AI 2026-04 unverdicted novelty 7.0

    SAri-RFT applies GRPO-based reinforcement fine-tuning to LVLMs on novel two-term and three-term visual semantic arithmetic tasks, reaching SOTA on the new IRPD dataset and Visual7W-Telling.

  15. Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM Reasoning

    cs.CL 2026-04 unverdicted novelty 7.0

    CoT-PoT ensembling achieves self-consistency accuracy in LLMs with only two samples for 78.6% of tasks, reducing computation by 9.3x compared to standard methods.

  16. You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass

    cs.CV 2026-04 unverdicted novelty 7.0

    A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.

  17. SUPERNOVA: Eliciting General Reasoning in LLMs with Reinforcement Learning on Natural Instructions

    cs.AI 2026-04 unverdicted novelty 7.0

    SUPERNOVA adapts instruction-tuning data for RLVR and achieves up to 52.8% relative gains on general reasoning benchmarks like BBEH through targeted task selection and mixing.

  18. ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

    cs.LG 2026-04 unverdicted novelty 7.0

    ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...

  19. What If Consensus Lies? Selective-Complementary Reinforcement Learning at Test Time

    cs.LG 2026-03 unverdicted novelty 7.0

    SCRL adds selective positive pseudo-labeling and entropy-gated negative pseudo-labeling to test-time RL, reducing noise from weak consensus and improving LLM reasoning on benchmarks.

  20. BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE

    cs.AI 2026-05 conditional novelty 6.0

    BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance retention.

  21. PreFT: Prefill-only finetuning for efficient inference

    cs.LG 2026-05 accept novelty 6.0

    Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.

  22. N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation

    cs.LG 2026-05 unverdicted novelty 6.0

    N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.

  23. Bayesian Model Merging

    cs.LG 2026-05 unverdicted novelty 6.0

    Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines a...

  24. Learning, Fast and Slow: Towards LLMs That Adapt Continually

    cs.LG 2026-05 unverdicted novelty 6.0

    Fast-Slow Training combines slow parameter updates with fast context optimization to achieve up to 3x better sample efficiency, higher performance, less forgetting, and preserved plasticity in continual LLM learning.

  25. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...

  26. Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

    cs.AI 2026-05 unverdicted novelty 6.0

    MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...

  27. Seirênes: Adversarial Self-Play with Evolving Distractions for LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Seirênes trains LLMs via adversarial self-play to generate and overcome evolving distractions, producing gains of 7-10 points on math reasoning benchmarks and exposing blind spots in larger models.

  28. Understanding and Preventing Entropy Collapse in RLVR with On-Policy Entropy Flow Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    OPEFO prevents entropy collapse in RLVR by rescaling token updates according to their entropy change contributions, yielding more stable optimization and better results on math benchmarks.

  29. Annotations Mitigate Post-Training Mode Collapse

    cs.CL 2026-05 unverdicted novelty 6.0

    Annotation-anchored training reduces semantic diversity collapse in post-trained language models by a factor of six compared to standard supervised fine-tuning while preserving instruction-following and improving with scale.

  30. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  31. DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification

    cs.CL 2026-05 unverdicted novelty 6.0

    DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.

  32. Reinforcing Multimodal Reasoning Against Visual Degradation

    cs.CV 2026-05 unverdicted novelty 6.0

    ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.

  33. Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

  34. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  35. ZAYA1-8B Technical Report

    cs.AI 2026-05 unverdicted novelty 6.0

    ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

  36. Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.

  37. MolmoAct2: Action Reasoning Models for Real-world Deployment

    cs.RO 2026-05 unverdicted novelty 6.0

    MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...

  38. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  39. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

  40. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  41. What Makes Good Instruction-Tuning Data? An In-Context Learning Perspective

    cs.CL 2026-04 unverdicted novelty 6.0

    A weighted in-context influence metric selects effective instruction-tuning data, outperforming baselines while showing that harder samples have lower influence.

  42. CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution

    cs.CV 2026-04 unverdicted novelty 6.0

    CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.

  43. Understanding and Mitigating Spurious Signal Amplification in Test-Time Reinforcement Learning for Math Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    DDRL reduces spurious reward noise in test-time RL for math by excluding ambiguous samples, using fixed advantages, and adding consensus-based updates, outperforming prior TTRL methods on math benchmarks.

  44. SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.

  45. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  46. HEALing Entropy Collapse: Enhancing Exploration in Few-Shot RLVR via Hybrid-Domain Entropy Dynamics Alignment

    cs.LG 2026-04 unverdicted novelty 6.0

    HEAL mitigates entropy collapse in few-shot RLVR by selectively adding general-domain data and aligning trajectory-level entropy dynamics, matching full-shot performance with 32 target samples.

  47. Characterizing Model-Native Skills

    cs.AI 2026-04 conditional novelty 6.0

    Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...

  48. GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  49. Eliciting Medical Reasoning with Knowledge-enhanced Data Synthesis: A Semi-Supervised Reinforcement Learning Approach

    cs.LG 2026-04 unverdicted novelty 6.0

    MedSSR improves LLM medical reasoning on rare diseases by up to 5.93% through knowledge-enhanced question synthesis and semi-supervised RL with self-generated pseudo-labels.

  50. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  51. Target Policy Optimization

    cs.LG 2026-04 unverdicted novelty 6.0

    TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.

  52. OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

    cs.AI 2026-04 unverdicted novelty 6.0

    OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.

  53. Hindsight-Anchored Policy Optimization: Turning Failure into Feedback in Sparse Reward Settings

    cs.LG 2026-03 unverdicted novelty 6.0

    HAPO adds a hindsight-anchored SSI operator with Thompson gating to GRPO-style RLVR, achieving asymptotic consistency that recovers unbiased on-policy gradients as the policy improves.

  54. Decoupling Reasoning and Confidence: Resurrecting Calibration in Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-03 unverdicted novelty 6.0

    DCPO decouples reasoning optimization from calibration in RLVR to fix overconfidence in LLMs without losing accuracy.

  55. Specificity-aware reinforcement learning for fine-grained open-world classification

    cs.CV 2026-03 unverdicted novelty 6.0

    SpeciaRL applies a dynamic verifier-based reward in reinforcement learning to steer reasoning LMMs toward correct and specific predictions on fine-grained open-world image classification tasks.

  56. Memory in the Age of AI Agents

    cs.CL 2025-12 unverdicted novelty 6.0

    The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

  57. Dream 7B: Diffusion Large Language Models

    cs.CL 2025-08 unverdicted novelty 6.0

    Dream 7B is a 7B diffusion LLM that refines sequences in parallel via denoising and outperforms prior diffusion models on general, mathematical, and coding benchmarks with added flexibility in generation order and qua...

  58. Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains

    cs.LG 2025-07 unverdicted novelty 6.0

    RaR uses aggregated rubric feedback as rewards in on-policy RL, delivering up to 31% relative gains on HealthBench and 7% on GPQA-Diamond versus direct Likert LLM-as-judge baselines.

  59. The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

    cs.AI 2025-06 unverdicted novelty 6.0

    LRMs exhibit complete accuracy collapse beyond certain puzzle complexities, with reasoning effort rising then declining, outperforming standard LLMs only on medium-complexity tasks.

  60. Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning

    cs.CL 2025-06 conditional novelty 6.0

    High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 73 Pith papers · 1 internal anchor

  1. [1]

    URL https://openreview.net/forum?id=Ep0TtjVoap. D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. D. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. ...

  2. [2]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    URL https://openreview.net/forum?id=1qvx610Cu7. Y. Liu. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692 , 364, 2019. S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y. Tay, D. Zhou, Q. V. Le, B. Zoph, J. Wei, et al. The flan collection: Designing data and methods for effective instruction tuning.arXiv preprint ar...

  3. [3]

    {{ '<|system|>\n' + message['content'] + '\n' }}

    Association for Computational Linguistics. URL https://aclanthology.org/2024.emnlp-main.79. C. Xu, Q. Sun, K. Zheng, X. Geng, P. Zhao, J. Feng, C. Tao, and D. Jiang. WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023. H. Xu, B. Liu, L. Shu, and P. Yu. BERT post-training for review reading com...

  4. [4]

    The above example is not tied to any particular persona, but you should create one that is unique and specific to the given persona

  5. [5]

    The instruction should contain all the following verifiable constraint(s):{constraints}

  6. [6]

    User instruction:

    Your output should start with "User instruction:". Your output should not include an answer to the instruction. Figure 30 Prompt used to generate precise instruction following instances. {persona} are borrowed from Chan et al. (2024). We use the set of {constraints} defined in Zhou et al. (2023). Example seeds are manually written by authors for each constr...

  7. [7]

    You should rewrite the instruction coherently while relaxing one of the following constraint categories: {constraints}

  8. [8]

    Remember to entirely relax one of the constraint category

  9. [9]

    User instruction:

    Your output should start with "User instruction:". Your output should not include an answer to the instruction. Figure 32 Prompt used to minimally modify an instruction following query such that the answer to the rewritten prompt does not satisfy the original query and thus can be used as a rejected response for preference data construction. Hard ...

  10. [10]

    Only top talents can solve it correctly

    The math problem should be challenging and involve advanced mathematical skills and knowledge. Only top talents can solve it correctly

  11. [11]

    You should make full use of the persona description to create the math problem to ensure that the math problem is unique and specific to the persona

  12. [12]

    Math problem:

    Your response should always start with "Math problem:". Your response should not include a solution to the created math problem

  13. [13]

    Figure 33 Prompt used to generate hard math word problems. {persona} are borrowed from Chan et al

    Your created math problem should include no more than 2 sub-problems. Figure 33 Prompt used to generate hard math word problems. {persona} are borrowed from Chan et al. (2024). Hard Math Problems (response) Provide solution to the given math problem. Problem: {generated_math_problem} Note: Provide your solution step-by-step, and end your solution in a n...

  14. [14]

    Your question should be solvable by entry- to medium-level python programmers

  15. [15]

    Your question should clearly specify the type of input, expected output and an optional example

  16. [16]

    Question: Write a python function to

    Your response should always start with "Question: Write a python function to"

  17. [17]

    Figure 35 Prompt used to generate code completion instances. {persona} are borrowed from Chan et al

    Your response should not include a solution to the created coding problem. Figure 35 Prompt used to generate code completion instances. {persona} are borrowed from Chan et al. (2024). Code Completion (response) Provide solution to the given python programming question. Question: {generated_code_problem} Note:

  18. [18]

    Your response should always start with the function definition and end with the final return statement

  19. [19]

    Instruction

    Your response should only and only include python function. Figure 36 Prompt used to generate code completion. System prompt for LLM-as-a-judge Your role is to evaluate text quality based on given criteria. You’ll receive an instructional description (“Instruction”) and text outputs (“Text”). Understand and interpret instructions to evaluate effectivel...

  20. [20]

    Irrelevant: No alignment

  21. [21]

    Partial Focus: Addresses one aspect poorly

  22. [22]

    - (2) Acknowledges both but slight deviations

    Partial Compliance: - (1) Meets goal or restrictions, neglecting other. - (2) Acknowledges both but slight deviations

  23. [23]

    Almost There: Near alignment, minor deviations

  24. [24]

    Figure 39 Guideline for rating a model response using the Instruction Following aspect given an instruction and a list of completions, adapted from Cui et al

    Comprehensive Compliance: Fully aligns, meets all requirements. Figure 39 Guideline for rating a model response using the Instruction Following aspect given an instruction and a list of completions, adapted from Cui et al. (2023). Informativeness or Helpfulness Aspect (prompt) # Informativeness / Helpfulness Assessment Evaluate if model’s outputs fulfil...

  25. [25]

    Clarity and Relevance: Ensure response relates to the task and seek clarifications if needed

  26. [26]

    Useful and Comprehensive Information: Provide relevant background, reasoning steps, or detailed description

  27. [27]

    Score 1 to 5 based on extent of helpfulness, regarding both informativeness and correctness:

    Not Lengthy, No Repetition: Avoid verbosity or recycling content. Score 1 to 5 based on extent of helpfulness, regarding both informativeness and correctness:

  28. [28]

    Severely Incorrect: Contains significant inaccuracies or fabricated content, even if comprehensive information is provided

  29. [29]

    Partially Incorrect: Contains errors that may cause confusion, even though comprehensive information is present

  30. [30]

    Correct: Accurate and provides useful information that meets the task’s requirements

  31. [31]

    Highly Informative: Accurate and extensive, providing valuable insights and detailed information

  32. [32]

    Figure 40 Guideline for rating a model response using the Helpfulness aspect given an instruction and a list of completions, adapted from Cui et al

    Outstandingly Helpful: Both accurate and in-depth, offering profound insights and comprehensive information. Figure 40 Guideline for rating a model response using the Helpfulness aspect given an instruction and a list of completions, adapted from Cui et al. (2023). Honesty Aspect (prompt) # Honesty and Uncertainty Expression Assessment Assess how well t...

  33. [33]

    Weakeners: e.g., ‘I guess,’ ‘probably.’

  34. [34]

    - No uncertainty expression indicate confidence

    Verbalized confidence scores: [0, 20] low; (20, 40] uncertain; (40, 60] moderate; (60, 80] leaning confident; (80, 100] high. - No uncertainty expression indicate confidence. - Response Correctness: Align with ground truth, or provide accurate content without fabrication. Scoring: Rate outputs 1 to 5 (or “N/A”):

  35. [35]

    Confidently Incorrect: Confident but entirely wrong

  36. [36]

    - Unconfident and entirely wrong

    Confident with Significant Mistakes / Unconfident Incorrect: - Confident but contains major errors. - Unconfident and entirely wrong

  37. [37]

    - Confident but contains minor errors

    Uncertain / ‘I Don’t Know’ / Subtle Mistakes: - ‘I don’t know’ or declines. - Confident but contains minor errors. - Unconfident and contains significant mistakes

  38. [38]

    - Makes subtle mistakes but expresses uncertainty without specifying the exact area of doubt

    Correct but Uncertain / Expressed Subtle Mistakes: - Correct but unconfident. - Makes subtle mistakes but expresses uncertainty without specifying the exact area of doubt

  39. [39]

    - Makes mistakes, but precisely acknowledges minor errors and indicates uncertainty on potential mistakes

    Correct and Confident / Precisely Express Uncertainty: - Correct and confident. - Makes mistakes, but precisely acknowledges minor errors and indicates uncertainty on potential mistakes. N/A. Not Applicable: For creative writing tasks. Figure 41 Guideline for rating a model response using the Honesty aspect given an instruction and a list of completions, a...

  40. [40]

    Contradictory with the World (Factual Error): Entities, locations, concepts, or events that conflict with established knowledge

  41. [41]

    Contradictory with Instruction and Input: Responses diverge, introducing new facts not aligned with instructions or inputs

  42. [42]

    Scoring: Rate outputs 1 to 5 based on extent of hallucination:

    Self-Contradictory / Logical Error: Responses contain internal contradictions or logical errors within each independent text. Scoring: Rate outputs 1 to 5 based on extent of hallucination:

  43. [43]

    Completely Hallucinated: Entirely unreliable due to hallucinations

  44. [44]

    Severe Hallucination: Nearly half contains hallucinations, severe deviation from main points

  45. [45]

    Therefore, the answer is (ANSWER_LETTER)

    Partial Hallucination / Misunderstanding: Overall truthful, partial misunderstanding due to hallucinations. 4. Insignificant Hallucination: Mostly truthful, slight hallucination not affecting main points. 5. No Hallucination: Free of hallucinations. Figure 42 Guideline for rating a model response using the Truthfulness aspect given an instruction and a li...