Pith Number

pith:RITIFKOX

pith:2026:RITIFKOXBVM3H2XGQT4N23QQAZ

not attested · not anchored · not stored · refs resolved

Provably avoiding over-optimization in Direct Preference Optimization without knowing the data distribution

Adam Barla, Emanuele Nevali, Luca Viano, Volkan Cevher

PEPO mitigates DPO over-optimization by achieving sample complexity bounds that depend only on single-policy concentrability.

arxiv:2602.06239 v2 · 2026-02-05 · cs.LG


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{RITIFKOXBVM3H2XGQT4N23QQAZ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
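
To illustrate why a deterministic merge lets any mirror reach the same state, here is a minimal sketch. It is not Pith's merge algorithm: the event shape (`id`, `ts`, `field`, `value`), the ordering key, and the last-writer-wins rule are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable

# Hypothetical event shape, assumed for illustration only:
# {"id": str, "ts": ISO-8601 str, "field": str, "value": Any}
Event = Dict[str, Any]


def merge_events(base: Dict[str, Any], events: Iterable[Event]) -> Dict[str, Any]:
    """Fold events into a copy of `base` in a fixed total order.

    Sorting by (ts, id) makes the result independent of the order in
    which a mirror happened to receive the events. An event id that was
    already applied is skipped, so replays do not change the state.
    """
    state = dict(base)
    seen = set()
    for ev in sorted(events, key=lambda e: (e["ts"], e["id"])):
        if ev["id"] in seen:
            continue
        seen.add(ev["id"])
        state[ev["field"]] = ev["value"]  # last writer (in sorted order) wins
    return state


events = [
    {"id": "e2", "ts": "2026-05-18T03:00:00Z", "field": "citations", "value": 1},
    {"id": "e1", "ts": "2026-05-18T02:50:00Z", "field": "citations", "value": 0},
]
state_a = merge_events({"schema": "pith-number/v1.0"}, events)
state_b = merge_events({"schema": "pith-number/v1.0"}, list(reversed(events)))
print(state_a == state_b)
```

Two mirrors that hold the same set of events compute the same state here regardless of arrival order, which is the property the bundle description relies on.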

Claims

C1 · strongest claim

In the tabular setting, PEPO achieves sample complexity guarantees that depend only on a single-policy concentrability coefficient, thus avoiding the dependence on all-policy concentrability that affects the guarantees of algorithms prone to over-optimization, such as DPO.

C2 · weakest assumption

That an ensemble of policies trained on disjoint subsets can be aggregated via worst-case construction to produce pessimism without access to the data-generating distribution or explicit reward model.

C3 · one line summary

PEPO uses pessimistic ensembling of DPO policies on data subsets to achieve single-policy concentrability sample bounds and avoid over-optimization in tabular settings.
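
The idea of aggregating an ensemble pessimistically can be sketched in a tabular toy setting. This is not the paper's PEPO construction, which this record does not spell out. The element-wise minimum rule, the renormalisation step, and the function name `pessimistic_aggregate` are assumptions chosen only to show what a worst-case combination of policies might look like.

```python
from typing import List

# policy[x][a] is the probability of action a in context x (tabular setting).
Policy = List[List[float]]


def pessimistic_aggregate(policies: List[Policy]) -> Policy:
    """Combine ensemble members by their worst-case probability per action.

    Assumed rule (illustrative, not from the paper): take the minimum
    probability any member assigns to each action, then renormalise
    within each context. Falls back to uniform if every minimum is zero.
    """
    n_contexts = len(policies[0])
    n_actions = len(policies[0][0])
    aggregated: Policy = []
    for x in range(n_contexts):
        worst = [min(p[x][a] for p in policies) for a in range(n_actions)]
        total = sum(worst)
        if total == 0.0:
            aggregated.append([1.0 / n_actions] * n_actions)
        else:
            aggregated.append([w / total for w in worst])
    return aggregated


# Two hypothetical members, each standing in for a policy trained on a
# disjoint data subset. One context, three actions.
ensemble = [
    [[0.7, 0.2, 0.1]],
    [[0.1, 0.2, 0.7]],
]
combined = pessimistic_aggregate(ensemble)
print(combined)
```

In this toy case the members disagree strongly on the first and last actions, so the combined policy discounts both and places the largest share on the middle action, the only one on which the members agree.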

References

49 extracted · 49 resolved · 5 Pith anchors

[1] Design considerations in offline preference-based RL

[2] XRPO: Pushing the limits of GRPO with targeted exploration and exploitation

[3] Value-incentivized preference optimization: A unified approach to online and offline RLHF. arXiv preprint arXiv:2405.19320

[4] On extending direct preference optimization to accommodate ties. arXiv preprint arXiv:2409.17431

[5] Avoiding O(e^{R_max}) scaling in RLHF through preference-based exploration. arXiv preprint arXiv:2502.00666

Formal links

2 machine-checked theorem links

Receipt and verification

First computed	2026-05-18T02:44:31.368347Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

8a2682a9d70d59b3eae684f8dd6e1006674a20be9c82ad8450a542a6432ce51d

Aliases

arxiv: 2602.06239 · arxiv_version: 2602.06239v2 · doi: 10.48550/arxiv.2602.06239 · pith_short_12: RITIFKOXBVM3 · pith_short_16: RITIFKOXBVM3H2XG · pith_short_8: RITIFKOX
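
The identifier and its short aliases appear to be related to the canonical hash in a simple way. The sketch below rests on an observation from the values on this page, not on a documented specification: the 26-character identifier matches the first 26 characters of the RFC 4648 base32 encoding of the SHA-256 digest, and each `pith_short_N` alias is its first N characters. The function name `pith_id_from_hash` is hypothetical.

```python
import base64

CANONICAL_HASH = "8a2682a9d70d59b3eae684f8dd6e1006674a20be9c82ad8450a542a6432ce51d"


def pith_id_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep a fixed-length prefix.

    Assumption: this mirrors how the identifier on this page relates to
    its canonical hash; the actual builder may work differently.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full_id = pith_id_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": full_id[:n] for n in (8, 12, 16)}

print(full_id)
print(aliases)
```

For this record the result agrees with the identifier `RITIFKOXBVM3H2XGQT4N23QQAZ` and with the three short aliases listed above.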

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/RITIFKOXBVM3H2XGQT4N23QQAZ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 8a2682a9d70d59b3eae684f8dd6e1006674a20be9c82ad8450a542a6432ce51d

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "138b14a500ed947f80d010919a83f4bdc3475151f49ad3e001a915a05d5c0d43",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-02-05T22:31:07Z",
    "title_canon_sha256": "db7ddacc3944814b6e89500111aa47ddc7ffbdfc4cd5b094ac1b2867786e411b"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2602.06239",
    "kind": "arxiv",
    "version": 2
  }
}
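
The verification pipeline can also be run offline against a local copy of the record. The sketch below applies the same canonicalisation as the `curl` one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) to the record as printed above. The function name `canonical_sha256` is illustrative, and whether the digest equals the expected hash depends on the copy matching what the resolver returns, which is not checked here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record after serialising it in a key-order-independent form."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order in the source dict does not matter
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Local copy of the canonical record shown above.
record = {
    "metadata": {
        "abstract_canon_sha256": "138b14a500ed947f80d010919a83f4bdc3475151f49ad3e001a915a05d5c0d43",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-02-05T22:31:07Z",
        "title_canon_sha256": "db7ddacc3944814b6e89500111aa47ddc7ffbdfc4cd5b094ac1b2867786e411b",
    },
    "schema_version": "1.0",
    "source": {"id": "2602.06239", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the value under "Canonical hash"
```

Because keys are sorted before hashing, reordering the fields of the local copy does not change the digest, while any change to a value does.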