Pith Number

pith:RYVQEAUD

pith:2025:RYVQEAUDGW65ZGD55QAOJNHSIW
not attested · not anchored · not stored · refs resolved

Agentic Reinforced Policy Optimization

Fuzheng Zhang, Guanting Dong, Guorui Zhou, Hangyu Mao, Huiyang Wang, Jiazhen Du, Ji-Rong Wen, Kai Ma, Licheng Bao, Yifei Chen, Yutao Zhu, Zhicheng Dou, Zhongxia Chen, Zhongyuan Wang

ARPO improves LLM agent performance on long-horizon tasks by sampling more at high-entropy steps right after each tool call.

arxiv:2507.19849 v1 · 2025-07-26 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{RYVQEAUDGW65ZGD55QAOJNHSIW}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

ARPO achieves improved performance using only half of the tool-use budget required by existing methods, offering a scalable solution for aligning LLM-based agents with real-time dynamic environments.

C2 weakest assumption

The preliminary observation that LLMs behave with high uncertainty (increased entropy) immediately after tool interactions is assumed to be general enough to guide adaptive sampling across tasks, and the resulting mechanism is assumed to reliably improve long-horizon performance.

C3 one line summary

ARPO adds entropy-based adaptive rollouts and stepwise advantage attribution to RL for LLM agents, outperforming prior trajectory-level methods on 13 benchmarks with half the tool budget.
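
To make the idea of "sampling more at high-entropy steps" concrete, here is a minimal sketch of an entropy check around a tool call. It is an illustration only and not the paper's implementation: the function names, the `threshold` value, and the rule "branch when entropy rises by more than a threshold" are all assumptions introduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def should_branch(entropy_before, entropy_after, threshold=0.2):
    """Hypothetical rule: sample extra rollouts when entropy rises by more
    than `threshold` after a tool call. The threshold is an assumed value."""
    return (entropy_after - entropy_before) > threshold


# Toy distributions: a confident step before the tool call and an
# uncertain step right after it.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

h_before = token_entropy(confident)
h_after = token_entropy(uncertain)
branch = should_branch(h_before, h_after)
```

In this toy case the entropy jumps from a near-deterministic distribution to a uniform one, so the hypothetical rule requests extra samples at that step.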

References

11 extracted · 11 resolved · 2 Pith anchors

[1] REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization 2020 · doi:10.18653/v1/2020.coling-main.580
[3] Prabha, D., Aswini, J., Maheswari, B., Subramanian, R 2023 · doi:10.18653/v1/2023.findings-emnlp.378
[5] Scaling Relationship on Learning Mathematical Reasoning with Large Language Models 2025 · doi:10.48550/arxiv
[6] thinking while doing 2024
[7] Each interaction response length is capped at 4096 tokens

Formal links

2 machine-checked theorem links

Cited by

26 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:15.333245Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

8e2b02028335bddc987dec00e4b4f2459a73c968a690cb2f56fc3280e364f4d7

Aliases

arxiv: 2507.19849 · arxiv_version: 2507.19849v1 · doi: 10.48550/arxiv.2507.19849 · pith_short_12: RYVQEAUDGW65 · pith_short_16: RYVQEAUDGW65ZGD5 · pith_short_8: RYVQEAUD
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/RYVQEAUDGW65ZGD55QAOJNHSIW \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 8e2b02028335bddc987dec00e4b4f2459a73c968a690cb2f56fc3280e364f4d7
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "2d063dcb52d9088260070a91f280b9064b4539cd1d082dfcb0de4de283df80a3",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-07-26T07:53:11Z",
    "title_canon_sha256": "c6efe2ebcc3ed7ebb55512d4066b4de04d544066275cf02ab776cb1a95f4a0df"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2507.19849",
    "kind": "arxiv",
    "version": 1
  }
}
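
The same check can be run offline against the canonical record shown above. The sketch below mirrors the serialization in the curl one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) and hashes the result with SHA-256. The helper names `canonical_bytes` and `canonical_hash` are introduced here for illustration, and whether the digest matches depends on the record being copied exactly as served.

```python
import hashlib
import json


def canonical_bytes(record):
    """Serialize with sorted keys and compact separators, as in the one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Canonical record copied from the page.
record = {
    "metadata": {
        "abstract_canon_sha256": "2d063dcb52d9088260070a91f280b9064b4539cd1d082dfcb0de4de283df80a3",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-07-26T07:53:11Z",
        "title_canon_sha256": "c6efe2ebcc3ed7ebb55512d4066b4de04d544066275cf02ab776cb1a95f4a0df",
    },
    "schema_version": "1.0",
    "source": {"id": "2507.19849", "kind": "arxiv", "version": 1},
}

EXPECTED = "8e2b02028335bddc987dec00e4b4f2459a73c968a690cb2f56fc3280e364f4d7"

digest = canonical_hash(record)
print(digest)
print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted before hashing, the order in which the record's fields are written does not affect the digest; only the values and the list order inside `cross_cats_sorted` do.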