Pith Number

pith:7DDKR4XU

pith:2026:7DDKR4XUTVPMLMVNBPZAA2FTN5

not attested · not anchored · not stored · refs resolved

Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

Byeongguk Jeon, JaeHyeok Doo, Kimin Lee, Minjoon Seo, Seonghyeon Ye

Q-Flow stabilizes training of expressive flow-based policies in reinforcement learning by propagating terminal values backward along deterministic flow paths.

arxiv:2605.13435 v1 · 2026-05-13 · cs.LG · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{7DDKR4XUTVPMLMVNBPZAA2FTN5}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
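The source does not describe the merge algorithm, so the following is only a sketch of what a deterministic merge can look like. It assumes a hypothetical event shape (`id`, `ts`, `field`, `value`) and made-up sample events; none of these names come from the Pith schema. The idea is that events are deduplicated and folded in a canonical order, so two mirrors holding the same events in different arrival orders compute the same state.

```python
from typing import Any


def merge_state(events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state dict in a canonical order (hypothetical schema)."""
    # Deduplicate by event id, assuming the same id always carries the same content.
    unique = {e["id"]: e for e in events}
    # Sort on (timestamp, id): a total order that ignores arrival order.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state: dict[str, Any] = {}
    for e in ordered:
        state[e["field"]] = e["value"]  # last writer in canonical order wins
    return state


# Made-up events for illustration only.
events = [
    {"id": "e1", "ts": "2026-01-01T00:00:00Z", "field": "x", "value": 1},
    {"id": "e2", "ts": "2026-01-03T00:00:00Z", "field": "x", "value": 2},
    {"id": "e3", "ts": "2026-01-02T00:00:00Z", "field": "y", "value": 3},
]

state_a = merge_state(events)
# A second mirror sees the events reversed and one of them twice.
state_b = merge_state(list(reversed(events)) + events[:1])
print(state_a == state_b)
```

A real bundle would also need signature checks on each event before folding; that step is omitted here.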

Claims

C1 · strongest claim

Q-Flow leverages the deterministic nature of flow dynamics to explicitly propagate terminal trajectory value to intermediate latent states along the policy-induced flow, enabling stable policy optimization using intermediate value gradients without unrolling the numerical solver.

C2 · weakest assumption

Propagating terminal trajectory value to intermediate latent states along the flow is assumed to provide reliable gradients for policy optimization, without bias or instability introduced by the flow matching process itself.

C3 · one line summary

Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.
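To make the claims concrete, here is a toy sketch of labelling intermediate flow states with a terminal value. It is not the paper's implementation: `velocity`, `terminal_q`, the Euler solver, and the one-dimensional state and action are all assumptions chosen for illustration. Because the flow path is deterministic, every intermediate latent on it leads to the same final action, so each one can be paired with the same terminal value as a regression target.

```python
import random


def velocity(z: float, t: float, state: float) -> float:
    """Toy velocity field standing in for a learned flow policy (hypothetical)."""
    return state - z  # pulls the latent toward a state-dependent target


def terminal_q(state: float, action: float) -> float:
    """Toy critic, largest when action == 0.5 * state (hypothetical)."""
    return -(action - 0.5 * state) ** 2


def rollout(z0: float, state: float, steps: int = 8) -> list[tuple[float, float]]:
    """Euler-integrate dz/dt = velocity from t=0 to t=1; return (t, z_t) pairs."""
    dt = 1.0 / steps
    z = z0
    path = [(0.0, z)]
    for k in range(steps):
        z = z + dt * velocity(z, k * dt, state)
        path.append(((k + 1) * dt, z))
    return path


random.seed(0)
state = 1.0
path = rollout(random.gauss(0.0, 1.0), state)
action = path[-1][1]                    # the final latent is the action
q_terminal = terminal_q(state, action)  # value of the finished trajectory

# Every intermediate (t, z_t) on this deterministic path gets the same target.
targets = [(t, z, q_terminal) for (t, z) in path]
for t, z, q in targets:
    print(f"t={t:.3f}  z={z:+.4f}  target={q:+.4f}")
```

An intermediate value function fitted to such targets could then be differentiated at a single sampled `(t, z_t)`, which is one way to read "without unrolling the numerical solver".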

References

26 extracted · 26 resolved · 8 Pith anchors

[1] Diffusion guidance is a controllable policy improvement operator · arXiv:2505.23458

[2] D4RL: Datasets for Deep Data-Driven Reinforcement Learning · arXiv:2004.07219

[3] IDQL: Implicit Q-Learning as an Actor-Critic Method with Diffusion Policies · arXiv:2304.10573

[4] AlignIQL: Policy alignment in implicit q-learning through constrained optimization

[5] Gaussian Error Linear Units (GELUs) · arXiv:1606.08415

Receipt and verification

First computed	2026-05-18T02:44:47.114249Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

f8c6a8f2f49d5ec5b2ad0bf20068b36f4e0483f4f460a5f7fd10d6ada7f80dcb

Aliases

arxiv: 2605.13435 · arxiv_version: 2605.13435v1 · doi: 10.48550/arxiv.2605.13435 · pith_short_12: 7DDKR4XUTVPM · pith_short_16: 7DDKR4XUTVPMLMVN · pith_short_8: 7DDKR4XU
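The short aliases above are plain prefixes of the full Pith Number, which a few lines of Python can confirm. The sketch also checks an observation that holds for this record but is not a rule stated on this page: the 26-character number matches the start of the Base32 encoding of the canonical SHA-256 hash. Treat that relationship as an assumption until the schema confirms it.

```python
import base64

canonical_hash = "f8c6a8f2f49d5ec5b2ad0bf20068b36f4e0483f4f460a5f7fd10d6ada7f80dcb"
pith_number = "7DDKR4XUTVPMLMVNBPZAA2FTN5"

# The short aliases are the first 8, 12, and 16 characters of the full number.
aliases = {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}

# Observation only (assumption, not a documented rule): Base32-encode the
# raw digest bytes and keep the first 26 characters.
derived = base64.b32encode(bytes.fromhex(canonical_hash)).decode("ascii")[:26]

print(aliases)
print(derived == pith_number)
```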

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/7DDKR4XUTVPMLMVNBPZAA2FTN5 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: f8c6a8f2f49d5ec5b2ad0bf20068b36f4e0483f4f460a5f7fd10d6ada7f80dcb

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "d3f3b4c1a13b3c5ce7999a8811fedb2756df7e952a1be1811ec7abc65d2dd2fc",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2026-05-13T12:31:02Z",
    "title_canon_sha256": "5d8746ad1b114d8b8f642d3fc3e2b0905a72d645b9d317e72b75b91c473bb228"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13435",
    "kind": "arxiv",
    "version": 1
  }
}
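The same check can be run offline against the record printed above. This sketch applies the serialization used in the shell one-liner (sorted keys, compact separators, non-ASCII left unescaped, UTF-8 bytes) and compares the result to the published hash. Whether the two agree depends on the record shown here being exactly what the resolver returns, so the comparison is printed rather than assumed.

```python
import hashlib
import json

# Canonical record copied from the JSON above.
record = {
    "metadata": {
        "abstract_canon_sha256": "d3f3b4c1a13b3c5ce7999a8811fedb2756df7e952a1be1811ec7abc65d2dd2fc",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2026-05-13T12:31:02Z",
        "title_canon_sha256": "5d8746ad1b114d8b8f642d3fc3e2b0905a72d645b9d317e72b75b91c473bb228",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13435", "kind": "arxiv", "version": 1},
}

# Same canonical form as the one-liner: sorted keys, no whitespace, UTF-8.
canonical = json.dumps(
    record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
).encode("utf-8")
digest = hashlib.sha256(canonical).hexdigest()

expected = "f8c6a8f2f49d5ec5b2ad0bf20068b36f4e0483f4f460a5f7fd10d6ada7f80dcb"
print(digest)
print("matches published hash:", digest == expected)
```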