Pith Number

pith:LRMVZIQ4

pith:2026:LRMVZIQ4PYFXAYYZLNGLEZQUGT
not attested · not anchored · not stored · refs resolved

ROAD: Adaptive Data Mixing for Offline-to-Online Reinforcement Learning via Bi-Level Optimization

Jian Liu (2), Letian Yang (1), Shuai Li (1), Weiqiang Wang (2), Xu Liu (1), Yiqiang Lu (2) ((1) Shanghai Jiao Tong University, Shanghai, China, (2) Ant Group, China)

ROAD frames data mixing in offline-to-online reinforcement learning as a bi-level optimization problem solved by a multi-armed bandit to automate replay ratios.

arxiv:2605.14497 v1 · 2026-05-14 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{LRMVZIQ4PYFXAYYZLNGLEZQUGT}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
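This record does not describe the merge algorithm itself. As a rough illustration of how a merge can be made deterministic, the sketch below deduplicates events by ID, sorts them by a canonical key, and folds them into a state with last-writer-wins. The `Event` shape, the field names, and the status values are all hypothetical and are not taken from the Pith bundle format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle schema is not shown in this record.
    event_id: str
    timestamp: str  # assumed ISO 8601 UTC in one uniform format, so string order is time order
    field: str
    value: str


def merge_state(base: dict, events: list[Event]) -> dict:
    """Fold events into base in a canonical order so arrival order cannot matter."""
    unique = {e.event_id: e for e in events}  # drop duplicate deliveries
    ordered = sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id))
    state = dict(base)
    for e in ordered:
        state[e.field] = e.value  # last writer in canonical order wins
    return state


# Illustrative events only; none of these come from the actual bundle.
events = [
    Event("e2", "2026-05-18T00:00:00Z", "author_claim", "claimed"),
    Event("e1", "2026-05-17T23:39:06Z", "author_claim", "open"),
    Event("e3", "2026-05-18T00:00:00Z", "citations", "open"),
]
base = {"author_claim": "open"}

state_a = merge_state(base, events)
# A mirror that received the events reversed, with one duplicate, gets the same state.
state_b = merge_state(base, list(reversed(events)) + events[:1])
```

Sorting on `(timestamp, event_id)` breaks timestamp ties the same way on every mirror, which is what makes the result independent of delivery order in this sketch.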

Claims

C1 · strongest claim

Our empirical results demonstrate that this approach consistently outperforms existing data replay methods across various datasets, eliminating the need for manual, context-specific adjustments while achieving superior stability and asymptotic performance.

C2 · weakest assumption

The surrogate objective used inside the multi-armed bandit sufficiently approximates the true bi-level gradient so that the outer-level data-mixing decisions actually improve the final policy performance.

C3 · one line summary

ROAD formulates data mixing as a bi-level optimization problem, solved via a multi-armed bandit, to adaptively balance offline priors and online updates in RL.
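The record does not specify ROAD's bandit algorithm, arm set, or reward signal. The sketch below only illustrates the general idea of letting a bandit pick the replay ratio: it uses plain UCB1 over a hypothetical discrete set of offline ratios, with a stand-in reward in place of the outer-level objective. `OFFLINE_RATIOS`, `mix_batch`, and `stand_in_reward` are assumptions made for illustration, not names from the paper.

```python
import math
import random

# Hypothetical arm set: fraction of each batch drawn from the offline dataset.
OFFLINE_RATIOS = [0.0, 0.25, 0.5, 0.75, 1.0]


def mix_batch(offline, online, offline_ratio, batch_size, rng):
    """Sample a batch with the requested share of offline transitions."""
    n_off = round(offline_ratio * batch_size)
    return rng.choices(offline, k=n_off) + rng.choices(online, k=batch_size - n_off)


def ucb1_select(counts, means, t):
    """Return the arm with the highest UCB1 score, trying each arm once first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(2.0 * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def stand_in_reward(offline_ratio):
    """Placeholder for the outer objective (e.g. a change in evaluation return).
    It peaks at 0.25 purely for illustration."""
    return 1.0 - abs(offline_ratio - 0.25)


def run_bandit(rounds=2000):
    counts = [0] * len(OFFLINE_RATIOS)
    means = [0.0] * len(OFFLINE_RATIOS)
    for t in range(1, rounds + 1):
        arm = ucb1_select(counts, means, t)
        # A real loop would run the inner RL update on mix_batch(...) here and
        # then measure the outer reward; this sketch skips training entirely.
        reward = stand_in_reward(OFFLINE_RATIOS[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


counts, means = run_bandit()
best_arm = max(range(len(counts)), key=counts.__getitem__)
rng = random.Random(0)
batch = mix_batch(["off"] * 10, ["on"] * 10, OFFLINE_RATIOS[best_arm], 8, rng)
```

With the stand-in reward, the most-pulled arm is the 0.25 ratio, so the final batch of 8 contains 2 offline samples. In practice the reward would be noisy and non-stationary as the policy improves, which is one reason a real method might prefer a different bandit than UCB1.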

References

41 extracted · 41 resolved · 4 Pith anchors

[1] Efficient online reinforcement learning with offline data 2023
[2] MOORL: A framework for integrating offline-online reinforcement learning 2025
[3] D4RL: Datasets for Deep Data-Driven Reinforcement Learning 2020 · arXiv:2004.07219
[4] Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor 2018
[5] Modem: Accelerating visual model-based reinforcement learning with demonstrations 2023
Receipt and verification
First computed: 2026-05-17T23:39:06.361561Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

5c595ca21c7e0b7063195b4cb2661434fe63434295697c0a11e39585a72f9109

Aliases

arxiv: 2605.14497 · arxiv_version: 2605.14497v1 · doi: 10.48550/arxiv.2605.14497 · pith_short_12: LRMVZIQ4PYFX · pith_short_16: LRMVZIQ4PYFXAYYZ · pith_short_8: LRMVZIQ4
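For this record, the Pith Number matches the first 26 characters of the base32 encoding (RFC 4648 alphabet) of the canonical hash, and the `pith_short_*` aliases are prefixes of it. This is an observation about the values on this page, not a documented rule of the `pith-number/v1.0` schema. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as listed on this page.
canonical_hash = "5c595ca21c7e0b7063195b4cb2661434fe63434295697c0a11e39585a72f9109"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the leading characters.
    The 26-character length is inferred from this record, not from a spec."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


pith_number = pith_number_from_hash(canonical_hash)
short_ids = {n: pith_number[:n] for n in (8, 12, 16)}

print(pith_number)    # LRMVZIQ4PYFXAYYZLNGLEZQUGT
print(short_ids[8])   # LRMVZIQ4
```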
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/LRMVZIQ4PYFXAYYZLNGLEZQUGT \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 5c595ca21c7e0b7063195b4cb2661434fe63434295697c0a11e39585a72f9109
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "503a5d5ebe496c4c6f24c513eecce7c4434bffaebbd30fd8d393cbf674ec4041",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-14T07:35:58Z",
    "title_canon_sha256": "11b8a06a8ccb4408cfc2bc36fdd84d56c8f412eb4767a63ea1f8f4168572c3aa"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.14497",
    "kind": "arxiv",
    "version": 1
  }
}
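The shell one-liner under "Verify this Pith Number yourself" can also be written as a small offline Python script. The sketch below applies the same canonicalization (sorted keys, compact separators, UTF-8, no ASCII escaping) to the record shown above. Whether the digest equals the canonical hash depends on this page's JSON being the complete canonical record, which the sketch does not confirm. The function name `canonical_sha256` is hypothetical.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then SHA-256 the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "503a5d5ebe496c4c6f24c513eecce7c4434bffaebbd30fd8d393cbf674ec4041",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-14T07:35:58Z",
        "title_canon_sha256": "11b8a06a8ccb4408cfc2bc36fdd84d56c8f412eb4767a63ea1f8f4168572c3aa",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.14497", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
# Key order in the input does not matter because sort_keys=True fixes it.
reordered = dict(reversed(list(record.items())))
print(digest)
```

Compare the printed digest against the canonical hash listed on this page. A mismatch would suggest either a changed record or a difference in serialization.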