Pith Number

pith:4HPHFHH5

pith:2026:4HPHFHH5UYY7QTGPI6WFBJOIBD

not attested · not anchored · not stored · refs resolved

GRLO: Towards Generalizable Reinforcement Learning in Open-Ended Environments from Zero

Shangjian Yin, Yue Dong, Yu Fu, Zhouxing Shi

Reinforcement learning from open-ended conversations transfers to improve math and code performance without domain-specific training.

arxiv:2605.15464 v1 · 2026-05-14 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{4HPHFHH5UYY7QTGPI6WFBJOIBD}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

On the Qwen3-4B-Base backbone, GRLO improves the average performance across all domains from 24.1 to 63.1 with only 5K prompts and 22.7 GPU hours, requiring about 46× less data and 68× less compute than a strong in-domain RLVR baseline. The resulting model is even competitive with Qwen's released post-trained models, which required a much larger training cost.

C2 · weakest assumption

The assumption that conversational abilities explicitly acquired through RLHF in open-ended environments will implicitly transfer to downstream tasks such as mathematical reasoning and code generation without any direct training on those domains.

C3 · one-line summary

GRLO shows that RLHF from scratch on 5K open-ended prompts raises average performance across domains from 24.1 to 63.1 on Qwen3-4B-Base, using 46× less data and 68× less compute than in-domain RLVR, while remaining competitive with heavily post-trained models.
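The ratios quoted in the claims can be turned into rough absolute figures for the in-domain RLVR baseline. The calculation below is only back-of-the-envelope arithmetic: it assumes the "about 46×" and "68×" factors are exact multipliers, so the implied baseline numbers are approximations derived here, not values reported in the record.

```python
# Figures quoted in the claims.
GRLO_PROMPTS = 5_000        # "5K prompts"
GRLO_GPU_HOURS = 22.7
DATA_RATIO = 46             # "about 46x less data" (treated as exact: an assumption)
COMPUTE_RATIO = 68          # "68x less compute" (treated as exact: an assumption)
SCORE_BEFORE = 24.1
SCORE_AFTER = 63.1

# Implied size of the baseline, derived from the ratios above.
implied_baseline_prompts = GRLO_PROMPTS * DATA_RATIO
implied_baseline_gpu_hours = round(GRLO_GPU_HOURS * COMPUTE_RATIO, 1)

# Size of the reported improvement in average performance.
absolute_gain = round(SCORE_AFTER - SCORE_BEFORE, 1)
relative_gain = round(SCORE_AFTER / SCORE_BEFORE, 2)

print(implied_baseline_prompts)    # 230000
print(implied_baseline_gpu_hours)  # 1543.6
print(absolute_gain)               # 39.0
print(relative_gain)               # 2.62
```

Under these assumptions the baseline would correspond to roughly 230K prompts and about 1,500 GPU hours, and the reported gain is 39.0 points, or about 2.6 times the starting average.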

References

45 extracted · 45 resolved · 3 Pith anchors

Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-20T00:00:59.902791Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

e1de729cfda631f84ccf47ac50a5c808d0039710ff1634fba3a864d54491274d
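For this record, the Pith Number coincides with the first 26 characters of the RFC 4648 base32 encoding of the canonical hash. The sketch below checks that relationship with the Python standard library. Whether this is the general derivation rule is an assumption based on this single record, and `pith_from_hash` is a hypothetical helper name, not part of any published API.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "e1de729cfda631f84ccf47ac50a5c808d0039710ff1634fba3a864d54491274d"
PITH_NUMBER = "4HPHFHH5UYY7QTGPI6WFBJOIBD"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: the identifier is a plain prefix of the RFC 4648 base32 text.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


derived = pith_from_hash(CANONICAL_HASH)
print(derived)                 # 4HPHFHH5UYY7QTGPI6WFBJOIBD
print(derived == PITH_NUMBER)  # True
```

Because the identifier is a prefix of the encoded hash, a reader who recomputes the canonical hash can also confirm that the Pith Number belongs to the same record.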

Aliases

arxiv: 2605.15464 · arxiv_version: 2605.15464v1 · doi: 10.48550/arxiv.2605.15464 · pith_short_12: 4HPHFHH5UYY7 · pith_short_16: 4HPHFHH5UYY7QTGP · pith_short_8: 4HPHFHH5
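The aliases listed above follow visible patterns: the short forms are prefixes of the full Pith Number, and the DOI and versioned ID are built from the arXiv identifier. The sketch below reproduces them. `build_aliases` is a hypothetical helper written for illustration, and the patterns are inferred from this one record rather than taken from a specification.

```python
# Values copied from this record.
PITH_NUMBER = "4HPHFHH5UYY7QTGPI6WFBJOIBD"
ARXIV_ID = "2605.15464"
ARXIV_VERSION = 1


def build_aliases(pith: str, arxiv_id: str, version: int) -> dict:
    """Rebuild the alias table from the full identifier and the arXiv ID.

    Assumption: short aliases are simple prefixes and the DOI uses the
    lowercase '10.48550/arxiv.' prefix, as displayed in this record.
    """
    return {
        "arxiv": arxiv_id,
        "arxiv_version": f"{arxiv_id}v{version}",
        "doi": f"10.48550/arxiv.{arxiv_id}",
        "pith_short_8": pith[:8],
        "pith_short_12": pith[:12],
        "pith_short_16": pith[:16],
    }


aliases = build_aliases(PITH_NUMBER, ARXIV_ID, ARXIV_VERSION)
for name, value in aliases.items():
    print(f"{name}: {value}")
```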

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/4HPHFHH5UYY7QTGPI6WFBJOIBD \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e1de729cfda631f84ccf47ac50a5c808d0039710ff1634fba3a864d54491274d

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "5024d189c952fe99f92d232998f50195460df206b26b16ebea862bea00ae5fb2",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T23:05:23Z",
    "title_canon_sha256": "05b8f09e958e40042d9785d856788815ef5a80527455c9032cbd99bfc43a9a7e"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.15464",
    "kind": "arxiv",
    "version": 1
  }
}
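The verification one-liner can also be run offline against the canonical record shown above. The sketch below applies the same canonicalization (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and compares the digest with the published canonical hash. It assumes the JSON displayed on this page is exactly what the resolver returns under `canonical_record`, so it prints the comparison rather than asserting a match. `canonical_hash` is a hypothetical helper name.

```python
import hashlib
import json

# Canonical record as displayed on this page.
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "5024d189c952fe99f92d232998f50195460df206b26b16ebea862bea00ae5fb2",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T23:05:23Z",
        "title_canon_sha256": "05b8f09e958e40042d9785d856788815ef5a80527455c9032cbd99bfc43a9a7e",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.15464", "kind": "arxiv", "version": 1},
}

# Hash published in this record.
EXPECTED_HASH = "e1de729cfda631f84ccf47ac50a5c808d0039710ff1634fba3a864d54491274d"


def canonical_hash(record: dict) -> str:
    """SHA-256 over the key-sorted, compact JSON serialization of the record."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode()
    return hashlib.sha256(payload).hexdigest()


digest = canonical_hash(CANONICAL_RECORD)
print(digest)
print("matches published hash:", digest == EXPECTED_HASH)
```

Sorting keys and fixing the separators makes the serialization independent of key order and whitespace, which is what allows any mirror to recompute the same digest from the same record.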