Canonical reference

Title resolution pending

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L · 2022

Canonical reference. 100% of citing Pith papers cite this work as background.

16 Pith papers citing it

Background 100% of classified citations

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

cs.AI · 2025-03-07 · conditional · novelty 7.0

RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.

We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning?

cs.AI · 2024-07-01 · accept · novelty 7.0

WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Distributional Reinforcement Learning via the Cram\'er Distance

cs.LG · 2026-04-26 · unverdicted · novelty 6.0

C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.

RAGEN-2: Reasoning Collapse in Agentic RL

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

cs.CL · 2025-09-25 · unverdicted · novelty 6.0

RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.

SkyReels-V2: Infinite-length Film Generative Model

cs.CV · 2025-04-17 · unverdicted · novelty 6.0

SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

cs.CL · 2024-06-25 · unverdicted · novelty 6.0

FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs

cs.LG · 2024-02-22 · conditional · novelty 6.0

REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

cs.CL · 2024-02-20 · conditional · novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature

cs.CL · 2025-09-20 · unverdicted · novelty 5.0

HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.

InvDesFlow-AL: active learning-based workflow for inverse design of functional materials

cond-mat.mtrl-sci · 2025-05-14 · unverdicted · novelty 5.0

InvDesFlow-AL combines active learning with diffusion generative models to improve crystal structure prediction accuracy by 33% and identifies Li2AuH6 as a candidate BCS superconductor with 140 K transition temperature.

Detecting Language Model Attacks with Perplexity

cs.CL · 2023-08-27 · unverdicted · novelty 5.0

Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation

cs.CV · 2024-02-27 · unverdicted · novelty 4.0

Optimizing the noise schedule, preparing a balanced bucketed dataset, and aligning outputs with human preferences enables Playground v2.5 to reach state-of-the-art aesthetic quality across aspect ratios.

Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security

cs.HC · 2024-01-10 · unverdicted · novelty 3.0

This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.

Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems

q-bio.NC · 2025-07-14 · unverdicted · novelty 2.0

A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.

citing papers explorer

Showing 16 of 16 citing papers.

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model cs.AI · 2025-03-07 · conditional · none · ref 9
RL on Qwen2-VL-2B with SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming base model by ~30% and SFT by ~2%.
We-Math: Does Your Large Multimodal Model Achieve Human-like Mathematical Reasoning? cs.AI · 2024-07-01 · accept · none · ref 4
WE-MATH benchmark reveals most LMMs rely on rote memorization for visual math while GPT-4o has shifted toward knowledge generalization.
GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 186
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
Distributional Reinforcement Learning via the Cram\'er Distance cs.LG · 2026-04-26 · unverdicted · none · ref 25
C-DSAC applies the Cramér distance to distributional value learning inside SAC and outperforms standard SAC on robotic benchmarks, with larger gains in complex environments due to confidence-driven conservative updates.
RAGEN-2: Reasoning Collapse in Agentic RL cs.LG · 2026-04-07 · unverdicted · none · ref 32
Template collapse is a distinct failure mode in agentic RL invisible to entropy; mutual information proxies diagnose it better and SNR-aware filtering using reward variance improves input-dependent reasoning and task performance across planning, math, navigation, and code tasks.
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards cs.CL · 2025-09-25 · unverdicted · none · ref 30
RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
SkyReels-V2: Infinite-length Film Generative Model cs.CV · 2025-04-17 · unverdicted · none · ref 39
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale cs.CL · 2024-06-25 · unverdicted · none · ref 16
FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.
Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs cs.LG · 2024-02-22 · conditional · none · ref 38
REINFORCE-style variants outperform PPO, DPO, and RAFT in RLHF for LLMs by removing unnecessary PPO components and adapting the simpler method to LLM alignment characteristics.
Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive cs.CL · 2024-02-20 · conditional · none · ref 41
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
Heterogeneous Adaptive Policy Optimization: Tailoring Optimization to Every Token's Nature cs.CL · 2025-09-20 · unverdicted · none · ref 18
HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
InvDesFlow-AL: active learning-based workflow for inverse design of functional materials cond-mat.mtrl-sci · 2025-05-14 · unverdicted · none · ref 62
InvDesFlow-AL combines active learning with diffusion generative models to improve crystal structure prediction accuracy by 33% and identifies Li2AuH6 as a candidate BCS superconductor with 140 K transition temperature.
Detecting Language Model Attacks with Perplexity cs.CL · 2023-08-27 · unverdicted · none · ref 23
Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.
Playground v2.5: Three Insights towards Enhancing Aesthetic Quality in Text-to-Image Generation cs.CV · 2024-02-27 · unverdicted · none · ref 25
Optimizing the noise schedule, preparing a balanced bucketed dataset, and aligning outputs with human preferences enables Playground v2.5 to reach state-of-the-art aesthetic quality across aspect ratios.
Personal LLM Agents: Insights and Survey about the Capability, Efficiency and Security cs.HC · 2024-01-10 · unverdicted · none · ref 45
This survey discusses key components and challenges for Personal LLM Agents and reviews solutions for their capability, efficiency, and security.
Bridging Brains and Machines: A Unified Frontier in Neuroscience, Artificial Intelligence, and Neuromorphic Systems q-bio.NC · 2025-07-14 · unverdicted · none · ref 142
A position and survey paper that identifies convergence between neuroscience, AGI, and neuromorphic computing and outlines four key integration challenges.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer