pith. sign in

arxiv: 2409.19256 · v2 · submitted 2024-09-28 · 💻 cs.LG · cs.DC

HybridFlow: A Flexible and Efficient RLHF Framework

Pith reviewed 2026-05-11 07:47 UTC · model grok-4.3

classification 💻 cs.LG cs.DC
keywords RLHFHybridFlowLLM alignmentdataflow orchestrationsingle-controllermulti-controllermodel reshardingthroughput optimization
0
0 comments X

The pith

HybridFlow combines single-controller and multi-controller paradigms to flexibly represent and efficiently execute RLHF dataflows for large language model alignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses inefficiencies in Reinforcement Learning from Human Feedback when applied to large language models. Single-controller systems incur high overhead dispatching commands for distributed computations inside nodes, while multi-controller systems struggle to flexibly handle the many-to-many data dependencies across nodes. HybridFlow introduces a hybrid control model together with hierarchical APIs that separate and encapsulate computation from communication, plus a specialized engine for moving model parameters between training and generation phases. This design allows arbitrary RLHF algorithms to be expressed and run on varied hardware. Experiments report throughput gains ranging from 1.53 times to 20.57 times over prior systems.

Core claim

HybridFlow combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. Hierarchical APIs decouple and encapsulate computation and data dependencies, supporting efficient operation orchestration and flexible device mapping. The 3D-HybridEngine performs actor model resharding between training and generation with zero memory redundancy and reduced communication overhead.

What carries the argument

HybridFlow's hybrid single/multi-controller execution model together with its hierarchical APIs for decoupling computation from data dependencies and the 3D-HybridEngine for zero-redundancy model resharding.

If this is right

  • RLHF algorithms gain both flexible representation of complex data dependencies and reduced control-dispatch overhead during distributed execution.
  • Actor models can transition between training and generation phases without memory duplication or high communication costs.
  • The same framework supports multiple RLHF variants by swapping only the orchestration logic expressed through the hierarchical APIs.
  • Device mapping of computation nodes becomes independent of the dataflow representation, enabling use across different cluster sizes and hardware types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hybrid control pattern may apply to other machine-learning pipelines that alternate between distributed training and inference phases.
  • Developers could reuse the hierarchical APIs to prototype new RLHF variants without rewriting low-level communication code.
  • If the resharding engine generalizes beyond the actor model, similar zero-redundancy techniques could reduce memory pressure in other multi-stage LLM workflows.

Load-bearing premise

The hierarchical APIs and 3D-HybridEngine can be implemented with negligible overhead while supporting arbitrary RLHF algorithms and hardware without introducing new bottlenecks or correctness issues in the dataflow orchestration.

What would settle it

Running the same set of RLHF algorithms on identical hardware setups with HybridFlow yields no measurable throughput improvement or produces incorrect model outputs or training divergence.

read the original abstract

Reinforcement Learning from Human Feedback (RLHF) is widely used in Large Language Model (LLM) alignment. Traditional RL can be modeled as a dataflow, where each node represents computation of a neural network (NN) and each edge denotes data dependencies between the NNs. RLHF complicates the dataflow by expanding each node into a distributed LLM training or generation program, and each edge into a many-to-many multicast. Traditional RL frameworks execute the dataflow using a single controller to instruct both intra-node computation and inter-node communication, which can be inefficient in RLHF due to large control dispatch overhead for distributed intra-node computation. Existing RLHF systems adopt a multi-controller paradigm, which can be inflexible due to nesting distributed computation and data communication. We propose HybridFlow, which combines single-controller and multi-controller paradigms in a hybrid manner to enable flexible representation and efficient execution of the RLHF dataflow. We carefully design a set of hierarchical APIs that decouple and encapsulate computation and data dependencies in the complex RLHF dataflow, allowing efficient operation orchestration to implement RLHF algorithms and flexible mapping of the computation onto various devices. We further design a 3D-HybridEngine for efficient actor model resharding between training and generation phases, with zero memory redundancy and significantly reduced communication overhead. Our experimental results demonstrate 1.53$\times$~20.57$\times$ throughput improvement when running various RLHF algorithms using HybridFlow, as compared with state-of-the-art baselines. HybridFlow source code will be available at https://github.com/volcengine/verl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes HybridFlow, a hybrid RLHF framework that merges single-controller and multi-controller paradigms via hierarchical APIs to flexibly represent and efficiently execute complex RLHF dataflows (with nodes as distributed NN computations and edges as many-to-many multicasts). It introduces a 3D-HybridEngine for zero-redundancy actor-model resharding between training and generation phases. Experiments report 1.53×–20.57× throughput gains over state-of-the-art baselines for various RLHF algorithms, with code to be open-sourced.

Significance. If the throughput claims and generality hold, HybridFlow would meaningfully advance practical RLHF systems by addressing control overhead and inflexibility in distributed settings, offering a reusable abstraction layer that could accelerate development of new alignment algorithms while improving hardware utilization. The planned code release supports reproducibility.

major comments (2)
  1. [Abstract] Abstract and experimental results: the central throughput claim (1.53×–20.57×) is presented without any description of baselines, hardware specifications, workload details, or ablation studies isolating the hybrid control and 3D-HybridEngine contributions from other optimizations; this makes it impossible to verify whether gains stem from the proposed paradigm or implementation specifics.
  2. [Section 3] Design of hierarchical APIs and 3D-HybridEngine (Section 3): the manuscript asserts negligible orchestration overhead and correctness for arbitrary RLHF dataflows (including non-standard algorithms, heterogeneous hardware, and complex multicast patterns), yet provides neither quantitative overhead measurements, machine-checked invariants, nor edge-case coverage beyond the reported cases; this leaves the weakest assumption untested and risks the gains being non-generalizable.
minor comments (1)
  1. [Abstract] The notation '1.53×~20.57×' in the abstract is ambiguous; replace with '1.53× to 20.57×' or specify the exact range and conditions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, providing clarifications from the full paper and indicating revisions where appropriate to improve verifiability and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract and experimental results: the central throughput claim (1.53×–20.57×) is presented without any description of baselines, hardware specifications, workload details, or ablation studies isolating the hybrid control and 3D-HybridEngine contributions from other optimizations; this makes it impossible to verify whether gains stem from the proposed paradigm or implementation specifics.

    Authors: We agree the abstract is concise and omits these details. The full manuscript's Section 4 (Experiments) specifies the baselines (DeepSpeed-Chat, vLLM+DeepSpeed, and Hugging Face TRL), hardware (NVIDIA A100 GPU clusters with 8-32 GPUs), workloads (PPO, DPO, GRPO on Llama-7B/13B models with standard datasets), and reports throughput under identical conditions. To isolate contributions, we have added new ablation studies in the revised Section 4.3 that separately measure the hybrid control paradigm and 3D-HybridEngine effects. We will also append a brief evaluation summary to the abstract. revision: yes

  2. Referee: [Section 3] Design of hierarchical APIs and 3D-HybridEngine (Section 3): the manuscript asserts negligible orchestration overhead and correctness for arbitrary RLHF dataflows (including non-standard algorithms, heterogeneous hardware, and complex multicast patterns), yet provides neither quantitative overhead measurements, machine-checked invariants, nor edge-case coverage beyond the reported cases; this leaves the weakest assumption untested and risks the gains being non-generalizable.

    Authors: The manuscript reports orchestration overhead measurements in Section 4.2 (under 5% of runtime for tested cases) and validates correctness empirically across multiple RLHF algorithms with multicast patterns. We will expand Section 3 with additional quantitative overhead data for heterogeneous hardware and more complex non-standard dataflows, plus a new subsection on edge-case coverage and limitations. Machine-checked invariants are outside the scope of this systems paper, which relies on empirical validation and open-source code for reproducibility rather than formal methods. revision: partial

standing simulated objections not resolved
  • Providing machine-checked invariants or formal proofs of correctness for arbitrary RLHF dataflows under the hierarchical APIs and 3D-HybridEngine.

Circularity Check

0 steps flagged

No circularity: claims rest on architecture design and independent runtime measurements

full rationale

The paper describes a hybrid RLHF framework with hierarchical APIs and 3D-HybridEngine for dataflow orchestration. No mathematical derivation chain, fitted parameters, or predictions exist; throughput gains (1.53×–20.57×) are reported from direct experiments on various algorithms rather than any self-referential equations or self-citation load-bearing uniqueness theorems. The central claims about flexibility and efficiency are supported by system implementation details and empirical benchmarks that do not reduce to the inputs by construction. This is a standard systems paper with no detectable circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The paper is an engineering systems contribution; the central claim does not rest on fitted parameters, unproven mathematical axioms, or new physical entities. It relies on standard assumptions about distributed GPU clusters and LLM training workloads.

axioms (1)
  • domain assumption Standard assumptions about distributed computing environments and neural network training dynamics hold for the target hardware and workloads.
    The framework design and performance claims presuppose typical properties of GPU clusters and LLM training without stating exceptions.
invented entities (1)
  • 3D-HybridEngine no independent evidence
    purpose: Efficient actor model resharding between training and generation phases with zero memory redundancy
    New software component introduced to solve the resharding problem; no independent evidence outside the paper's experiments is provided.

pith-pipeline@v0.9.0 · 5607 in / 1402 out tokens · 84856 ms · 2026-05-11T07:47:58.870354+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

    cs.CV 2026-04 unverdicted novelty 8.0

    EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.

  2. OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models

    cs.CV 2026-04 unverdicted novelty 8.0

    OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.

  3. SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

    cs.AI 2026-03 conditional novelty 8.0

    SARL rewards reasoning topology to improve label-free RL, outperforming baselines with gains up to 44.7% on math and 34.6% on open-ended tasks while maintaining more stable training.

  4. Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions for LLM Regression

    cs.LG 2026-05 unverdicted novelty 7.0

    Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.

  5. AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

    cs.LG 2026-05 unverdicted novelty 7.0

    AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and Agent...

  6. DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    DeltaPrompts generates 200k high-divergence reasoning prompts via staged synthesis to escape zero-delta traps in multimodal distillation, yielding up to 15% relative gains on chart, document, and perception benchmarks.

  7. FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

    cs.LG 2026-05 conditional novelty 7.0

    FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

  8. AIS: Adaptive Importance Sampling for Quantized RL

    stat.ML 2026-05 unverdicted novelty 7.0

    AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

  9. StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

    cs.CV 2026-05 unverdicted novelty 7.0

    StreamPro introduces a benchmark and training method using CB-Stream Loss and GRPO to enable proactive decision-making in streaming videos, achieving 41.5 on StreamPro-Bench compared to 10.4 previously.

  10. LEAD: Length-Efficient Adaptive and Dynamic Reasoning for Large Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    LEAD uses online adaptive mechanisms including Potential-Scaled Instability and symmetric efficiency rewards based on correct rollouts to achieve higher accuracy-efficiency scores with substantially shorter reasoning ...

  11. BoostAPR: Boosting Automated Program Repair via Execution-Grounded Reinforcement Learning with Dual Reward Models

    cs.AI 2026-05 unverdicted novelty 7.0

    BoostAPR improves automated program repair by using execution-grounded RL with a sequence-level assessor and line-level credit allocator, reaching 40.7% on SWE-bench Verified and strong cross-language results.

  12. Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization

    cs.AI 2026-05 unverdicted novelty 7.0

    An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.

  13. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  14. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  15. Teaching Language Models to Think in Code

    cs.CL 2026-05 unverdicted novelty 7.0

    ThinC trains small models to reason primarily in code rather than natural language, outperforming tool-integrated baselines and even larger models on competition math benchmarks.

  16. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  17. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  18. SCOUT: Active Information Foraging for Long-Text Understanding with Decoupled Epistemic States

    cs.CL 2026-05 unverdicted novelty 7.0

    SCOUT achieves state-of-the-art long-text understanding with up to 8x lower token use by actively foraging for sparse query-relevant information and updating a compact provenance-grounded epistemic state.

  19. Fine-Tuning Small Reasoning Models for Quantum Field Theory

    cs.LG 2026-04 unverdicted novelty 7.0

    Small 7B reasoning models were fine-tuned on synthetic and curated QFT problems using RL and SFT, yielding performance gains, error analysis, and public release of data and traces.

  20. Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...

  21. SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees

    cs.LG 2026-04 unverdicted novelty 7.0

    SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.

  22. MMR-AD: A Large-Scale Multimodal Dataset for Benchmarking General Anomaly Detection with Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    MMR-AD is a new benchmark dataset showing that current generalist MLLMs lag industrial needs for anomaly detection, with Anomaly-R1 delivering better results through reasoning and RL.

  23. Bringing Value Models Back: Generative Critics for Value Modeling in LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    GenAC introduces generative critics with chain-of-thought reasoning and in-context conditioning to improve value approximation and downstream RL performance in LLMs compared to value-based and value-free baselines.

  24. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  25. DeonticBench: A Benchmark for Reasoning over Rules

    cs.CL 2026-04 unverdicted novelty 7.0

    DEONTICBENCH is a new benchmark of 6,232 deontic reasoning tasks from U.S. legal domains where frontier LLMs reach only ~45% accuracy and symbolic Prolog assistance plus RL training still fail to solve tasks reliably.

  26. Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

    cs.LG 2026-04 unverdicted novelty 7.0

    TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.

  27. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  28. PR-CAD: Progressive Refinement for Unified Controllable and Faithful Text-to-CAD Generation with Large Language Models

    cs.CL 2026-03 unverdicted novelty 7.0

    PR-CAD unifies text-to-CAD generation and editing via progressive refinement with LLMs, a new interaction dataset, and RL-enhanced reasoning to achieve better controllability and faithfulness.

  29. Topo-R1: Detecting Topological Anomalies via Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.

  30. Alternating Reinforcement Learning with Contextual Rubric Rewards: Beyond the Scalarization Strategy

    cs.LG 2026-03 unverdicted novelty 7.0

    ARL-RR alternates optimization over rubric meta-classes with dynamic selection to avoid fixed scalarization, outperforming baselines on HealthBench.

  31. Beyond One-Size-Fits-All: Adaptive Subgraph Denoising for Zero-Shot Graph Learning with Large Language Models

    cs.LG 2026-03 unverdicted novelty 7.0

    GraphSSR introduces an adaptive SSR pipeline with SSR-SFT data synthesis and SSR-RL (Authenticity-Reinforced and Denoising-Reinforced stages) to overcome one-size-fits-all subgraph noise in zero-shot LLM graph reasoning.

  32. GraphScout: Empowering Large Language Models with Intrinsic Exploration Ability for Agentic Graph Reasoning

    cs.AI 2026-03 unverdicted novelty 7.0

    GraphScout trains LLMs to autonomously synthesize structured training data from knowledge graphs via flexible exploration tools, enabling a 4B model to outperform larger LLMs by 16.7% on average with fewer inference t...

  33. Beyond Variance: Prompt-Efficient RLVR via Rare-Event Amplification and Bidirectional Pairing

    cs.LG 2026-02 unverdicted novelty 7.0

    Positive-negative prompt pairing with weighted GRPO improves RLVR sample efficiency, raising AIME 2025 Pass@8 from 16.8 to 22.2 on Qwen2.5-Math-7B while matching large-scale training.

  34. Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

    cs.AI 2026-02 unverdicted novelty 7.0

    GPS trains a small model on optimization history to predict prompt difficulty and select intermediate-difficulty diverse batches, yielding better training efficiency, final performance, and test-time allocation than b...

  35. Learning from Self-Debate: Preparing Reasoning Models for Multi-Agent Debate

    cs.CL 2026-01 unverdicted novelty 7.0

    SDRL trains LLMs via self-generated multi-path debates and joint optimization of standalone plus debate-conditioned responses to boost both single-model reasoning and multi-agent debate performance.

  36. Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

    cs.AI 2026-01 unverdicted novelty 7.0

    Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.

  37. OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling

    cs.CL 2026-01 accept novelty 7.0

    OPT-Engine shows pure-text chain-of-thought reasoning in LLMs loses robustness as optimization complexity grows, external tools fix only local arithmetic, and solver-integrated methods are bottlenecked by automated co...

  38. ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

    cs.CL 2026-01 unverdicted novelty 7.0

    ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.

  39. CORE: Concept-Oriented Reinforcement for Bridging the Definition-Application Gap in Mathematical Reasoning

    cs.AI 2025-12 unverdicted novelty 7.0

    CORE is a concept-oriented RL method that synthesizes quizzes, injects concept snippets into rollouts, and reinforces conceptual trajectories to close the gap between restating definitions and applying them in math problems.

  40. CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

    cs.SE 2025-10 conditional novelty 7.0

    CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...

  41. High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

    cs.CV 2025-07 conditional novelty 7.0

    MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

  42. MMSearch-R1: Incentivizing LMMs to Search

    cs.CV 2025-06 unverdicted novelty 7.0

    MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...

  43. CAD-Coder: Text-to-CAD Generation with Chain-of-Thought and Geometric Reward

    cs.GR 2025-05 unverdicted novelty 7.0

    CAD-Coder generates valid CadQuery scripts from text via supervised fine-tuning followed by reinforcement learning with geometric Chamfer Distance rewards and chain-of-thought planning.

  44. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  45. Reinforcement Learning for Reasoning in Large Language Models with One Training Example

    cs.LG 2025-04 accept novelty 7.0

    One training example via RLVR boosts LLM math reasoning from 17.6% to 35.7% average across six benchmarks.

  46. DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 6.0

    DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains ...

  47. Training Language Agents to Learn from Experience

    cs.LG 2026-05 unverdicted novelty 6.0

    Introduces the ICT framework and an RL pipeline to train language agent reflectors that distill experience into reusable prompts, outperforming baselines on held-out tasks in ALFWorld and MiniHack.

  48. What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SERL selectively reweights learning using task success and environment feedback to reach 90.0% success on ALFWorld and 80.1% on WebShop, outperforming RL and distillation baselines.

  49. AutoVecCoder: Teaching LLMs to Generate Explicitly Vectorized Code

    cs.CL 2026-05 unverdicted novelty 6.0

    AutoVecCoder combines VecPrompt for automated intrinsic knowledge synthesis and VecRL for efficiency-aligned RL to train an 8B LLM that achieves SOTA on SimdBench SSE/AVX subsets and sometimes exceeds -O3 compiler results.

  50. GROW: Aligning GRPO with State-Action Modeling for Open-World VLM Agents

    cs.LG 2026-05 conditional novelty 6.0

    GROW decomposes trajectories into state-action samples for GRPO training of VLM agents and reports state-of-the-art results on over 800 Minecraft tasks.

  51. GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

    cs.AI 2026-05 unverdicted novelty 6.0

    GRID trains Qwen-based 4B models on a task-bank reward system of multi-select questions and regex targets to extract security KGs from CTI text, reporting 84.62% precision and 64.91% recall on 249 articles from five sources.

  52. VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

    cs.CV 2026-05 unverdicted novelty 6.0

    VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...

  53. From Failure to Feedback: Group Revision Unlocks Hard Cases in Object-Level Grounding

    cs.CV 2026-05 unverdicted novelty 6.0

    A group-revision paradigm for GRPO-based RL fine-tuning of VLMs converts failure responses into improvement signals that refine rewards and advantages, yielding gains on referring segmentation, REC, and counting benchmarks.

  54. DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    DeltaPrompts generates 200k synthetic high-divergence reasoning prompts to escape zero-delta saturation in multimodal distillation, yielding up to 15% relative gains on chart, document, and perception benchmarks acros...

  55. Learning from Failures: Correction-Oriented Policy Optimization with Verifiable Rewards

    cs.CL 2026-05 unverdicted novelty 6.0

    CIPO jointly optimizes standard RLVR rewards with correction samples derived from the model's own failed attempts, yielding better reasoning and self-correction on math and code benchmarks.

  56. Stateful Reasoning via Insight Replay

    cs.AI 2026-05 conditional novelty 6.0

    InsightReplay improves LLM accuracy on reasoning benchmarks by extracting and replaying critical insights to maintain their accessibility during extended chain-of-thought generation.

  57. Stateful Reasoning via Insight Replay

    cs.AI 2026-05 unverdicted novelty 6.0

    InsightReplay improves long CoT reasoning by extracting critical insights from the trace and replaying them near the active frontier, delivering +1.65 average accuracy gain across 24 model-benchmark settings.

  58. CrystalReasoner: Reasoning and RL for Property-Conditioned Crystal Structure Generation

    cs.AI 2026-05 unverdicted novelty 6.0

    CrystalReasoner combines LLM reasoning traces with physical priors and multi-objective RL to generate valid, stable, and property-conditioned crystal structures.

  59. STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning

    cs.LG 2026-05 unverdicted novelty 6.0

    STRIDE co-trains generator and verifier on outcome rewards alone to deliver learnable stepwise language feedback that redirects LLM reasoning trajectories and outperforms scalar-reward baselines.

  60. Beyond GRPO and On-Policy Distillation: An Empirical Sparse-to-Dense Reward Principle for Language-Model Post-Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Sparse RL on capable teachers followed by dense distillation to students beats direct GRPO on students for verifiable math reasoning.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 152 Pith papers · 15 internal anchors

  1. [1]

    Martín Abadi. 2016. TensorFlow: learning functions at scale. InProceed- ings of the 21st ACM SIGPLAN international conference on functional programming. 1–1

  2. [2]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al . 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  3. [3]

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhar- gav S Gulavani, and Ramachandran Ramjee. 2023. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. arXiv preprint arXiv:2308.16369 (2023)

  4. [4]

    Riad Akrour, Marc Schoenauer, and Michele Sebag. 2011. Preference- based policy learning. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2011, Athens, Greece, September 5-9, 2011. Proceedings, Part I 11 . Springer, 12–27

  5. [5]

    Gene M Amdahl. 1967. Validity of the single processor approach to achieving large scale computing capabilities. InProceedings of the April 18-20, 1967, spring joint computer conference . 483–485

  6. [6]

    George E Andrews and Kimmo Eriksson. 2004. Integer partitions . Cambridge University Press

  7. [7]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al . 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862 (2022)

  8. [8]

    Zhihao Bai, Zhen Zhang, Yibo Zhu, and Xin Jin. 2020. {PipeSwitch}: Fast pipelined context switching for deep learning applications. In14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 499–514

  9. [9]

    Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al . 2022. Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems 4 (2022), 430–449

  10. [10]

    Eric Temple Bell. 1934. Exponential polynomials. Annals of Mathe- matics (1934), 258–277

  11. [11]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  12. [12]

    I. Caspi. 2017. Reinforcement learning coach by Intel . https://github. com/NervanaSystems/coach

  13. [13]

    Ernie Chan, Marcel Heimlich, Avi Purkayastha, and Robert Van De Geijn. 2007. Collective communication: theory, practice, and expe- rience. Concurrency and Computation: Practice and Experience 19, 13 (2007), 1749–1783

  14. [14]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Hee- woo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 (2021)

  15. [15]

    Collosal-AI Corporation. 2023. Collosal-Chat. https://github.com/ binmakeswell/ColossalChat

  16. [16]

    NVIDIA Corporation. 2023. TensorRT-LLM: A TensorRT Toolbox for Op- timized Large Language Model Inference. https://github.com/NVIDIA/ TensorRT-LLM

  17. [17]

    NVIDIA Corporation. 2024. NeMo-Aligner: Scalable toolkit for efficient model alignment. https://github.com/NVIDIA/NeMo-Aligner

  18. [18]

    Weihao Cui, Han Zhao, Quan Chen, Hao Wei, Zirui Li, Deze Zeng, Chao Li, and Minyi Guo. 2022. {DVABatch}: Diversity-aware {Multi- Entry} {Multi-Exit} batching for efficient processing of {DNN} ser- vices on {GPUs }. In2022 USENIX Annual Technical Conference (USENIX ATC 22). 183–198

  19. [19]

    Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. 2024. Safe RLHF: Safe Reinforcement Learning from Human Feedback. In The Twelfth International Confer- ence on Learning Representations . https://openreview.net/forum?id= TyFrPOKYXw

  20. [20]

    Frederica Darema. 2001. The spmd model: Past, present and future. In Recent Advances in Parallel Virtual Machine and Message Passing Interface: 8th European PVM/MPI Users’ Group Meeting Santorini/Thera, Greece, September 23–26, 2001 Proceedings 8 . Springer, 1–1

  21. [21]

    Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107–113

  22. [22]

    Assaf Eisenman, Kiran Kumar Matam, Steven Ingram, Dheevatsa Mudigere, Raghuraman Krishnamoorthi, Krishnakumar Nair, Misha Smelyanskiy, and Murali Annavaram. 2022. {Check-N-Run}: A check- pointing system for training deep learning recommendation models. In 19th USENIX Symposium on Networked Systems Design and Imple- mentation (NSDI 22). 929–943

  23. [23]

    Shiqing Fan, Yi Rong, Chen Meng, Zongyan Cao, Siyu Wang, Zhen Zheng, Chuan Wu, Guoping Long, Jun Yang, Lixue Xia, et al . 2021. DAPPLE: A pipelined data parallel approach for training large models. In Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming . 431–445

  24. [24]

    X Yu Geoffrey, Yubo Gao, Pavel Golikov, and Gennady Pekhimenko

  25. [25]

    In2021 USENIX Annual Technical Conference (USENIX ATC 21)

    Habitat: A {Runtime-Based} computational performance predic- tor for deep neural network training. In2021 USENIX Annual Technical Conference (USENIX ATC 21). 503–521

  26. [26]

    Danijar Hafner, James Davidson, and Vincent Vanhoucke. 2017. Ten- sorflow agents: Efficient batched reinforcement learning in tensorflow. arXiv preprint arXiv:1709.02878 (2017)

  27. [27]

    Mingcong Han, Hanze Zhang, Rong Chen, and Haibo Chen. 2022. Microsecond-scale preemption for concurrent GPU-accelerated DNN inferences. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22) . 539–558

  28. [28]

    Alexander Havrilla, Maksym Zhuravinskyi, Duy Phung, Aman Tiwari, Jonathan Tow, Stella Biderman, Quentin Anthony, and Louis Castricato

  29. [29]

    In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

    trlX: A framework for large scale reinforcement learning from human feedback. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing . 8578–8595

  30. [30]

    Hesse, M

    C. Hesse, M. Plappert, A. Radford, J. Schulman, S. Sidor, and Y. Wu

  31. [31]

    https://github.com/openai/baselines

    OpenAI baselines. https://github.com/openai/baselines

  32. [32]

    Connor Holmes, Masahiro Tanaka, Michael Wyatt, Ammar Ahmad Awan, Jeff Rasley, Samyam Rajbhandari, Reza Yazdani Aminabadi, Heyang Qin, Arash Bakhtiari, Lev Kurilenko, et al. 2024. DeepSpeed- FastGen: High-throughput Text Generation for LLMs via MII and DeepSpeed-Inference. arXiv preprint arXiv:2401.08671 (2024)

  33. [33]

    Jian Hu, Xibin Wu, Xianyu, Chen Su, Leon Qiu, Daoning Jiang, Qing Wang, and Weixun Wang. 2023. OpenRLHF: A Ray-based High- performance RLHF framework. https://github.com/OpenLLMAI/ OpenRLHF

  34. [34]

    Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, and Lewis Tunstall. 2024. The N+ Implementation Details of RLHF with PPO: A Case Study on TL; DR Summarization. arXiv preprint arXiv:2403.17031 (2024)

  35. [35]

    Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. Advances in neural information processing systems 32 (2019). EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands G. Sheng, C. ...

  36. [36]

    Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequen- tial building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European conference on computer systems 2007 . 59–72

  37. [37]

    Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, and Mosharaf Chowd- hury. 2023. Oobleck: Resilient distributed training of large models using pipeline templates. In Proceedings of the 29th Symposium on Operating Systems Principles. 382–395

  38. [38]

    Sylvain Jeaugey. 2017. Nccl 2.0. In GPU Technology Conference (GTC), Vol. 2. 23

  39. [39]

    Ziheng Jiang, Haibin Lin, Yinmin Zhong, Qi Huang, Yangrui Chen, Zhi Zhang, Yanghua Peng, Xiang Li, Cong Xie, Shibiao Nong, et al

  40. [40]

    Megascale: Scaling large language model training to more than 10,000 gpus.arXiv preprint arXiv:2402.15627, 2024

    MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs. arXiv preprint arXiv:2402.15627 (2024)

  41. [41]

    Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier

  42. [42]

    A survey of reinforcement learning from human feedback

    A survey of reinforcement learning from human feedback. arXiv preprint arXiv:2312.14925 (2023)

  43. [43]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. 2017. Adam: A Method for Sto- chastic Optimization. arXiv:1412.6980 [cs.LG]

  44. [44]

    Kostrikov

    I. Kostrikov. 2017. PyTorch implementation of advantage actor critic (A2C), proximal policy optimization (PPO) and scalable trust-region method for deep reinforcement learning. https://github.com/ikostrikov/ pytorch-a2c-ppo-acktr

  45. [45]

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica

  46. [46]

    InProceedings of the 29th Symposium on Operating Systems Principles

    Efficient memory management for large language model serving with pagedattention. InProceedings of the 29th Symposium on Operating Systems Principles. 611–626

  47. [47]

    Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. 2023. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267 (2023)

  48. [48]

    Cheng Li. 2023. LLM-Analysis: Latency and Memory Analysis of Trans- former Models for Training and Inference. https://github.com/cli99/llm- analysis

  49. [49]

    Ziniu Li, Tian Xu, Yushun Zhang, Zhihang Lin, Yang Yu, Ruoyu Sun, and Zhi-Quan Luo. 2023. ReMax: A Simple, Effective, and Efficient Reinforcement Learning Method for Aligning Large Language Models. arXiv preprint arXiv: 2310.10505 (2023)

  50. [50]

    Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gon- zalez, et al. 2023. {AlpaServe}: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23) . 663–679

  51. [51]

    Eric Liang, Richard Liaw, Robert Nishihara, Philipp Moritz, Roy Fox, Ken Goldberg, Joseph Gonzalez, Michael Jordan, and Ion Stoica. 2018. RLlib: Abstractions for distributed reinforcement learning. In Interna- tional conference on machine learning . PMLR, 3053–3062

  52. [52]

    Eric Liang, Zhanghao Wu, Michael Luo, Sven Mika, Joseph E Gonzalez, and Ion Stoica. 2021. RLlib Flow: Distributed Reinforcement Learning is a Dataflow Problem. Advances in Neural Information Processing Systems 34 (2021), 5506–5517

  53. [53]

    Yun Liang, Huynh Phung Huynh, Kyle Rupnow, Rick Siow Mong Goh, and Deming Chen. 2014. Efficient GPU spatial-temporal multitasking. IEEE Transactions on Parallel and Distributed Systems 26, 3 (2014), 748– 760

  54. [54]

    Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, Yinhe Han, and Xi- aowei Li. 2017. Flexflow: A flexible dataflow accelerator architecture for convolutional neural networks. In 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) . IEEE, 553–564

  55. [55]

    Jayashree Mohan, Amar Phanishayee, and Vijay Chidambaram. 2021. {CheckFreq}: Frequent, {Fine-Grained} {DNN} Checkpointing. In 19th USENIX Conference on File and Storage Technologies (FAST 21) . 203–216

  56. [56]

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I Jordan, et al. 2018. Ray: A distributed framework for emerg- ing {AI} applications. In 13th USENIX symposium on operating systems design and implementation (OSDI 18) . 561–577

  57. [57]

    Derek G Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: a timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles. 439–455

  58. [58]

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al . 2021. Webgpt: Browser-assisted question- answering with human feedback. arXiv preprint arXiv:2112.09332 (2021)

  59. [59]

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM symposium on operating sys- tems principles. 1–15

  60. [60]

    Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGres- ley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on gpu clusters using megatron- lm. In Proceedings of the International Conference for High Performance Computing, Netw...

  61. [61]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wain- wright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al . 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35 (2022), 27730–27744

  62. [62]

    Jason Jong Kyu Park, Yongjun Park, and Scott Mahlke. 2017. Dy- namic resource management for efficient utilization of multitasking GPUs. In Proceedings of the twenty-second international conference on architectural support for programming languages and operating systems . 527–540

  63. [63]

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al . 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in neural informa- tion processing systems 32 (2019)

  64. [64]

    Yanghua Peng, Yibo Zhu, Yangrui Chen, Yixin Bao, Bairen Yi, Chang Lan, Chuan Wu, and Chuanxiong Guo. 2019. A Generic Commu- nication Scheduler for Distributed DNN Training Acceleration. In Proceedings of the 27th ACM Symposium on Operating Systems Prin- ciples. ACM, Huntsville Ontario Canada, 16–29. https://doi.org/10. 1145/3341301.3359642

  65. [65]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He

  66. [66]

    In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

    Zero: Memory optimizations toward training trillion param- eter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis . IEEE, 1–16

  67. [67]

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He

  68. [68]

    In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506

  69. [69]

    Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He

  70. [70]

    In 2021 USENIX Annual Technical Conference (USENIX ATC 21)

    {Zero-offload}: Democratizing {billion-scale} model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21) . 551–564

  71. [71]

    Gian-Carlo Rota. 1964. The number of partitions of a set.The American Mathematical Monthly 71, 5 (1964), 498–504

  72. [72]

    Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal HybridFlow: A Flexible and Efficient RLHF Framework EuroSys ’25, March 30-April 3, 2025, Rotterdam, Netherlands Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bit- ton, Manish Bhatt, Cristian Canton Ferre...

  73. [73]

    Code Llama: Open Foundation Models for Code

    Code Llama: Open Foundation Models for Code. arXiv preprint arXiv: 2308.12950 (2023)

  74. [74]

    Michael Santacroce, Yadong Lu, Han Yu, Yuanzhi Li, and Yelong Shen

  75. [75]

    Efficient rlhf: Reducing the memory usage of ppo

    Efficient RLHF: Reducing the Memory Usage of PPO. arXiv preprint arXiv: 2309.00754 (2023)

  76. [76]

    Hill Kohli Saxton, Grefenstette. 2019. Analysing Mathematical Rea- soning Abilities of Neural Models. arXiv:1904.01557 (2019)

  77. [77]

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. InInternational conference on machine learning . PMLR, 1889–1897

  78. [78]

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. 2018. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv:1506.02438 [cs.LG]

  79. [79]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017)

  80. [80]

    Alexander Sergeev and Mike Del Balso. 2018. Horovod: fast and easy distributed deep learning in TensorFlow. arXiv preprint arXiv:1802.05799 (2018)

Showing first 80 references.