Reasoning gym: Reasoning environments for reinforcement learning with verifiable rewards. arXiv preprint arXiv:2505.24760

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 2 · dataset 2

years: 2026 (14) · 2025 (2)

verdicts: unverdicted 16

representative citing papers

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

TRON supplies 520 rule-verifiable online visual reasoning environments across five ability buckets that generate unlimited training instances for RL post-training, yielding consistent gains on ten external multimodal benchmarks for three vision-language models.

AIPO: Learning to Reason from Active Interaction

cs.CL · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

AIPO adds active multi-agent consultation (Verify, Knowledge, Reasoning agents) plus custom importance sampling to RLVR training so LLMs expand their reasoning boundary and then operate without the agents.

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

RL training compute for logical reasoning follows a power law with horizon depth whose exponent rises with logical expressiveness, yielding better downstream transfer when models train on richer logics.

Robots Need More than VLA and World Models

cs.RO · 2026-06-04 · unverdicted · novelty 5.0

The paper identifies four missing interfaces (data autolabelling, embodiment retargeting, physics-grounded world models, and video-based reward inference) as the central bottleneck beyond VLA scaling for robot intelligence.

Mellum2 Technical Report

cs.CL · 2026-05-29 · unverdicted · novelty 3.0

Mellum 2 is a 12B-parameter MoE model with 2.5B active parameters that uses GQA, SWA, and MTP. It was trained on 10.6T tokens and then post-trained into Instruct and Thinking variants; the report claims it is competitive with 4B-14B models while activating only 2.5B parameters.

citing papers explorer

Showing 4 of 4 citing papers after filters.

  • Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces · cs.AI · 2026-06-03 · unverdicted · none · ref 31

  • TRON: Targeted Rule-Verifiable Online Environments for Visual Reasoning RL · cs.AI · 2026-06-01 · unverdicted · none · ref 33

  • Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key · cs.AI · 2026-05-07 · unverdicted · none · ref 93 · 3 links

  • SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning · cs.AI · 2026-01-08 · unverdicted · none · ref 34

    SCALER creates adaptive synthetic environments for RL-based training of LLM reasoning; training in these environments outperforms fixed-dataset baselines and shows more stable long-term progress.