
hub Mixed citations

Advances in Neural Information Processing Systems

Mixed citation behavior. Most common role is background (43%).

97 Pith papers citing it
Background 43% of classified citations

hub tools

citation-role summary

background 5 · method 2

citation-polarity summary

claims ledger

  • background Appendix B), which normalizes rewards within the $G$ rollouts of each task into group-relative advantages. Utilization and query. The action tokens $a_{1:T}$ are conditioned on $(x_i, z_i)$ and optimized by the task outcome $R^{\mathrm{util}}_i = r(\tau_i)$. The query $q_i$ precedes the actions in the same sequence and receives gradients through the same objective: $J^{\mathrm{util}}(\theta) = J^{\mathrm{GRPO}}\big(\theta; \{\tau_1, \ldots, \tau_G\}, \{\hat{A}_1, \ldots, \hat{A}_G\}\big)$ (8). Re-ranking. The permutation $\sigma_i$ is generated conditioned on the task $x_i$ and retrieved candidates $B_i^K$,
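
The group-relative normalization mentioned in the excerpt above can be illustrated with a small sketch. The excerpt defers the exact formula to the citing paper's Appendix B, so the form used here (subtract the group mean, divide by the group standard deviation) is an assumption based on the commonly used GRPO convention, not a confirmed detail of that paper. The function name `group_relative_advantages` and the `eps` parameter are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Turn the rewards of one task's G rollouts into group-relative advantages.

    Assumed form (not confirmed by the excerpt): (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the G rollouts
    # eps keeps the division defined when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical outcome rewards r(tau_i) for G = 4 rollouts of a single task
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this assumed form, rollouts that beat the group average get positive advantages and the rest get negative ones, so the advantages of a group sum to roughly zero. A group in which all rollouts receive the same reward yields all-zero advantages and contributes no gradient signal.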

co-cited works

representative citing papers

Base Models Look Human To AI Detectors

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.

DISA: Offline Importance Sampling for Distribution-Matching LLM-RL

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

DISA decouples partition function estimation using offline importance sampling for distribution-matching LLM-RL, matching or exceeding online baselines like FlowRL on math and code benchmarks while retaining more strategy diversity.

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.

Variance-aware Reward Modeling with Anchor Guidance

stat.ML · 2026-05-12 · unverdicted · novelty 7.0

Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.

citing papers explorer

Showing 50 of 97 citing papers.