Pure Python

Long-form factuality in large language models · 2025 · arXiv 2505.15801

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

cs.LG · 2026-04-08 · unverdicted · novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models

cs.CL · 2025-11-14 · unverdicted · novelty 6.0

Tool use in LLMs improves final-answer accuracy but degrades reasoning quality through Tool-Induced Myopia, with the effect worsening as tool calls increase and shifting errors toward logic and assumption failures.

citing papers explorer

Showing 3 of 3 citing papers.

Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning · cs.LG · 2026-04-08 · unverdicted · none · ref 150
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

Delay, Plateau, or Collapse: Evaluating the Impact of Systematic Verification Error on RLVR · cs.LG · 2026-04-06 · unverdicted · none · ref 26
Systematic false positives in verifiers can cause RLVR training to reach suboptimal plateaus or collapse, with outcomes driven by error patterns rather than overall error rate.

From Proof to Program: Characterizing Tool-Induced Reasoning Hallucinations in Large Language Models · cs.CL · 2025-11-14 · unverdicted · none · ref 7
Tool use in LLMs improves final-answer accuracy but degrades reasoning quality through Tool-Induced Myopia, with the effect worsening as tool calls increase and shifting errors toward logic and assumption failures.
