hub Canonical reference

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang · 2024 · cs.AI · arXiv 2405.11143

Canonical reference. 75% of citing Pith papers cite this work as background.

45 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 45 citing papers arXiv PDF

abstract

Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) significantly improve the alignment of human-AI values, further raising the upper bound of AI capabilities, particularly in reasoning-intensive, long-context Chain-of-Thought (CoT) tasks. However, existing frameworks commonly face challenges such as inference bottlenecks and complexity barriers, which restrict their accessibility to newcomers. To bridge this gap, we introduce \textbf{OpenRLHF}, a user-friendly, scalable, and easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and HuggingFace Transformers, featuring a simplified design, clear code structure, and comprehensive documentation to facilitate entry for researchers and practitioners. Experimental results show that OpenRLHF achieves superior training efficiency, with speedups ranging from 1.22x to 1.68x across different model sizes, compared to state-of-the-art frameworks. Additionally, it requires significantly fewer lines of code for implementation. OpenRLHF is publicly available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted by leading institutions to accelerate RLHF research and learning.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 10 method 2

citation-polarity summary

background 9 use method 2 unclear 1

representative citing papers

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

cs.LG · 2026-05-14 · conditional · novelty 8.0

DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.

AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.

AIS: Adaptive Importance Sampling for Quantized RL

stat.ML · 2026-05-13 · unverdicted · novelty 7.0

AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

cs.SE · 2026-05-06 · unverdicted · novelty 7.0

Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.

Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

cs.CL · 2026-04-18 · unverdicted · novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.

RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

cs.AI · 2026-02-05 · unverdicted · novelty 7.0

RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning

cs.AI · 2026-01-23 · unverdicted · novelty 7.0

SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.

HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments

cs.DC · 2025-12-13 · unverdicted · novelty 7.0

HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.

Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training

cs.LG · 2025-07-21 · unverdicted · novelty 7.0

An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.

DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback

cs.OS · 2026-05-21 · unverdicted · novelty 6.0

DeltaBox achieves millisecond-level checkpoint (14ms) and rollback (5ms) for AI agent sandboxes by layering file states and using incremental process dumps to exploit similarity between consecutive checkpoints.

Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.

Diagnosing Training Inference Mismatch in LLM Reinforcement Learning

cs.LG · 2026-05-14 · unverdicted · novelty 6.0

Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.

D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

cs.AI · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

PriorZero: Bridging Language Priors and World Models for Decision Making

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.

Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion

cs.AI · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.

Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

cs.AI · 2026-05-08 · unverdicted · novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.

ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL

cs.DC · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.

JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training

cs.DC · 2026-04-10 · unverdicted · novelty 6.0

TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.

Mitigating LLM biases toward spurious social contexts using direct preference optimization

cs.AI · 2026-04-02 · unverdicted · novelty 6.0

Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

cs.LG · 2026-02-03 · unverdicted · novelty 6.0

PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.

citing papers explorer

Showing 45 of 45 citing papers.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts cs.LG · 2026-05-14 · conditional · none · ref 6 · internal anchor
DualKV is a new FlashAttention variant that shares prompt KV across multiple rollouts in RL training, delivering 1.63-3.82x speedups on 8B-30B models while remaining mathematically identical to standard attention.
AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs cs.LG · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
AIS: Adaptive Importance Sampling for Quantized RL stat.ML · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
Variance-aware Reward Modeling with Anchor Guidance stat.ML · 2026-05-12 · unverdicted · none · ref 38 · internal anchor
Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning cs.LG · 2026-05-09 · unverdicted · none · ref 8 · internal anchor
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preserving exact synchrony.
Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning cs.SE · 2026-05-06 · unverdicted · none · ref 23 · internal anchor
Introduces the first benchmark for fine-grained failures in reinforcement fine-tuning of LLMs and an automatic management framework that detects, diagnoses, and remediates them.
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning cs.CL · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-policy baselines on agentic tasks.
RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training cs.AI · 2026-02-05 · unverdicted · none · ref 8 · internal anchor
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
Scaling the Scaling Logic: Agentic Meta-Synthesis of Logic Reasoning cs.AI · 2026-01-23 · unverdicted · none · ref 3 · internal anchor
SSLogic uses LLM agents in a closed Generate-Validate-Refine loop to evolve 953 logic task families from 400 seeds, producing data that yields benchmark gains of +5.2 on SynLogic, +3.0 on AIME25, and +5.5 on BBH.
HetRL: Efficient Reinforcement Learning for LLMs in Heterogeneous Environments cs.DC · 2025-12-13 · unverdicted · none · ref 12 · internal anchor
HetRL delivers up to 9.17x higher throughput for LLM RL training on heterogeneous GPUs by using hybrid and ILP-based schedulers to solve a joint optimization problem over computation and data dependencies.
Data Mixing Agent: Learning to Re-weight Domains for Continual Pre-training cs.LG · 2025-07-21 · unverdicted · none · ref 12 · internal anchor
An RL agent learns domain re-weighting policies from evaluation feedback to improve balanced performance in continual pre-training of LLMs across source and target domains.
DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback cs.OS · 2026-05-21 · unverdicted · none · ref 47 · internal anchor
DeltaBox achieves millisecond-level checkpoint (14ms) and rollback (5ms) for AI agent sandboxes by layering file states and using incremental process dumps to exploit similarity between consecutive checkpoints.
Taming the Thinker: Conditional Entropy Shaping for Adaptive LLM Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 22 · internal anchor
CES applies conditional bidirectional entropy control on top of DAPO to improve accuracy and shorten responses on mathematical benchmarks for 7B and 1.5B LLMs.
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning cs.LG · 2026-05-14 · unverdicted · none · ref 44 · internal anchor
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models cs.AI · 2026-05-13 · unverdicted · none · ref 12 · 2 links · internal anchor
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
PriorZero: Bridging Language Priors and World Models for Decision Making cs.LG · 2026-05-12 · unverdicted · none · ref 51 · internal anchor
PriorZero uses root-only LLM prior injection in MCTS and alternating world-model training with LLM fine-tuning to raise exploration efficiency and final performance on Jericho text games and BabyAI gridworlds.
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion cs.AI · 2026-05-12 · unverdicted · none · ref 10 · 2 links · internal anchor
MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall improvement in simultaneous alignment.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable cs.AI · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and speed on reasoning benchmarks.
ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL cs.DC · 2026-05-07 · unverdicted · none · ref 28 · 2 links · internal anchor
ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training cs.LG · 2026-04-26 · unverdicted · none · ref 19 · internal anchor
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL · 2026-04-13 · unverdicted · none · ref 7 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training cs.DC · 2026-04-10 · unverdicted · none · ref 12 · internal anchor
TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
Mitigating LLM biases toward spurious social contexts using direct preference optimization cs.AI · 2026-04-02 · unverdicted · none · ref 12 · internal anchor
Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL cs.LG · 2026-02-03 · unverdicted · none · ref 12 · internal anchor
PULSE exploits BF16-invisible sparsity in weight updates to enable over 100x lower communication in distributed RL post-training via compute-visible sparsification.
Factored Causal Representation Learning for Robust Reward Modeling in RLHF cs.LG · 2026-01-29 · unverdicted · none · ref 13 · internal anchor
A factored causal representation learning method improves robustness of reward models in RLHF by isolating causal factors from biases like length and sycophancy using adversarial gradient reversal.
FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning cs.LG · 2026-01-26 · unverdicted · none · ref 15 · internal anchor
FP8-RL delivers up to 44% faster rollouts in LLM RL by using blockwise FP8 quantization, KV-cache recalibration, and importance-sampling corrections while keeping learning behavior close to BF16 baselines.
Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning cs.LG · 2025-11-24 · unverdicted · none · ref 8 · internal anchor
A periodically asynchronous on-policy RL system for LLM post-training achieves up to 3x throughput gains by separating inference and training with periodic policy synchronization and a tri-model architecture.
Seer: Online Context Learning for Fast Synchronous LLM Reinforcement Learning cs.DC · 2025-11-18 · unverdicted · none · ref 15 · internal anchor
Seer improves synchronous LLM RL rollout throughput by up to 2.04x and reduces long-tail latency by 72-94% via divided rollout, context-aware scheduling, and adaptive grouped speculative decoding based on prompt similarity observations.
RLBoost: Harvesting Preemptible Resources for Cost-Efficient Reinforcement Learning on LLMs cs.DC · 2025-10-22 · unverdicted · none · ref 17 · internal anchor
RLBoost harvests preemptible GPUs for RL rollout via a hybrid architecture with adaptive offload, pull-based transfer, and token-level migration, delivering 1.51x-1.97x throughput and 28-49% better cost efficiency than on-demand-only setups.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning cs.CL · 2025-06-02 · conditional · none · ref 9 · internal anchor
High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
VLA-RL: Towards Masterful and General Robotic Manipulation with Scalable Reinforcement Learning cs.RO · 2025-05-24 · conditional · none · ref 30 · internal anchor
VLA-RL applies online RL to pretrained VLAs, yielding a 4.5% gain over strong baselines on 40 LIBERO manipulation tasks and matching commercial models like π₀-FAST.
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning cs.AI · 2025-03-18 · conditional · none · ref 21 · internal anchor
Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL cs.CL · 2025-03-10 · unverdicted · none · ref 25 · internal anchor
A two-stage RL framework first boosts text reasoning in 3B LMMs then adapts it to multimodal inputs, producing modest benchmark gains of 4.5-4.8%.
Supervising the search process produces reliable and generalizable information-seeking agents cs.CL · 2025-02-19 · unverdicted · none · ref 24 · internal anchor
Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
PlexRL: Cluster-Level Orchestration of Serviceized LLM Execution for RLVR cs.DC · 2026-05-20 · unverdicted · none · ref 11 · internal anchor
PlexRL multiplexes unified LLM services across RLVR jobs at the cluster level to exploit anti-correlated idle times and reduce GPU-hour costs by up to 37.58% with minimal per-job overhead.
PipeMax: Enhancing Offline LLM Inference on Commodity GPU Servers cs.DC · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
PipeMax integrates pipeline parallelism with offloading to achieve up to 2.51x higher throughput than vLLM for offline LLM inference on commodity 8-GPU servers.
Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning cs.CL · 2026-04-06 · unverdicted · none · ref 21 · 2 links · internal anchor
ProxyCoT transfers CoT reasoning from proxy short contexts to full long contexts through RL/distillation followed by SFT, outperforming baselines with lower overhead and generalizing out-of-domain.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL · 2025-08-06 · unverdicted · none · ref 49 · internal anchor
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 271 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence cs.AI · 2026-05-07 · unverdicted · none · ref 63 · 2 links · internal anchor
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 15 · internal anchor
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Learning to Reason at the Frontier of Learnability cs.LG · 2025-02-17 · unverdicted · none · ref 22 · internal anchor
A curriculum sampling questions with high variance in success rate improves reinforcement learning performance for LLM reasoning tasks.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 201 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
Reinforcement Learning from Human Feedback cs.LG · 2025-04-16 · unverdicted · none · ref 156 · internal anchor
The book introduces the origins, mathematical setup, and optimization stages of RLHF including reward modeling, reinforcement learning, rejection sampling, and direct alignment algorithms.
MinT: Managed Infrastructure for Training and Serving Millions of LLMs cs.LG · 2026-05-13 · unreviewed · ref 9 · internal anchor

OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer