Target-Aligned Bellman Backup for Cross-domain Offline Reinforcement Learning

Ting Long; Wei Liu

arxiv: 2605.22376 · v1 · pith:P5CZ7ZIEnew · submitted 2026-05-21 · 💻 cs.LG


Pith reviewed 2026-05-22 07:59 UTC · model grok-4.3

classification 💻 cs.LG

keywords: cross-domain offline reinforcement learning · Bellman targets · data transferability · selective backup · policy optimization · value estimation


The pith

Assessing source-domain transitions by their alignment with target-domain Bellman targets rather than surface similarity improves policy learning in cross-domain offline RL with scarce target data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that standard ways of transferring data from a source domain to a target domain in offline reinforcement learning rely on measuring how similar individual transitions look or behave, yet this can mislead because visually or dynamically similar transitions often produce different long-term returns once the policy is executed in the target environment. Because policy optimization depends on accurate Bellman targets to judge the quality of actions, the authors argue that transferability should instead be judged by how much each source transition helps produce reliable Bellman targets in the target domain. They introduce a selective backup procedure that keeps only those source transitions whose contribution improves target-domain value estimation. A reader would care because many practical RL problems have abundant source data but very little target data, and the mismatch between surface similarity and value consistency explains why prior transfer methods sometimes degrade performance.

Core claim

The central claim is that assessing the transferability of source-domain transitions based on their alignment with target-domain Bellman targets, rather than superficial transition similarity, enables more effective policy learning in cross-domain offline RL with limited target data.

What carries the argument

Target-Aligned Bellman Backup (TABB), which selectively leverages source-domain data by measuring their contribution to accurate Bellman target estimation in the target domain.

If this is right

Source data that produces inconsistent long-term returns in the target domain is down-weighted even if the transitions appear similar.
Policy optimization receives higher-quality value estimates when target data is highly limited.
The method produces consistent performance gains across a broad range of cross-domain offline RL settings.
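One way to picture the down-weighting described above is a weighted Bellman regression loss. The sketch below is illustrative only: the name `weighted_td_loss`, the squared TD error, and the convention that target-domain samples carry weight 1.0 are assumptions made here, not details confirmed by the paper.

```python
def weighted_td_loss(q_pred, bellman_targets, weights):
    """Weighted mean of squared TD errors; weights need not be normalized."""
    total_weight = sum(weights)
    if total_weight <= 0.0:
        raise ValueError("weights must sum to a positive value")
    weighted_sq_errors = (
        w * (q - y) ** 2 for q, y, w in zip(q_pred, bellman_targets, weights)
    )
    return sum(weighted_sq_errors) / total_weight


# Hypothetical mini-batch: the first two entries stand for target-domain samples
# (weight 1.0); the last two stand for source samples with assumed alignment weights.
q_pred = [1.0, 2.0, 3.0, 4.0]
bellman_targets = [1.5, 2.0, 2.0, 0.0]
weights = [1.0, 1.0, 0.5, 0.0]
loss = weighted_td_loss(q_pred, bellman_targets, weights)
```

A source transition with weight 0.0 contributes nothing to the loss, however large its TD error, which is the intended effect of down-weighting inconsistent source data.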

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same alignment principle could be tested in online cross-domain transfer where new target samples can be collected adaptively.
It raises the question of whether other transfer problems in sequential decision-making should replace state-action similarity with value-function alignment.
If the alignment score can be computed without extra target samples, the approach may lower the data requirements for safe deployment in new environments.

Load-bearing premise

Alignment with target-domain Bellman targets can be measured reliably from source data alone and used for selective backup without introducing new biases or requiring extra target-domain samples for calibration.

What would settle it

An experiment in which source transitions are selected by measured alignment with target Bellman targets yet the resulting policy performs no better than, or worse than, a policy trained with similarity-based selection on the same limited target data.

Figures

Figures reproduced from arXiv: 2605.22376 by Ting Long, Wei Liu.

**Figure 2.** Results under different levels. We evaluate the robustness of TABB from two perspectives: varying dynamics-shift intensities and heterogeneous source-target data quality. Varying dynamics-shift intensities. We construct cross-domain tasks with varying friction levels in the Hopper and Walker2D environments. The friction level is selected from {0.1, 0.5, 2.0, 5.0}, covering a broad range of dynamics vari…

**Figure 3.** Oracle Bellman Error of source transitions ranked by TBM and transition similarity. TABB uses TBM to estimate the target-domain Bellman target for each source transition, and measures its mismatch with the original source Bellman target. The resulting mismatch is then used to reweight source transitions according to their consistency with target-domain Bellman learning. To empirically examine whether the…
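The reweighting in the Figure 3 caption can be sketched from the formula quoted further down this page, TBM(ξ) = |(r + γV(z′s)) − (r̂ + γV(ẑ′s))| with softmax weights ω(ξ) over −TBM. This is a minimal sketch under assumptions: the function names, the discount γ = 0.99, and the toy numbers are hypothetical, and the estimated quantities r̂ and V(ẑ′s) are taken as given inputs rather than produced by any particular model.

```python
import math


def tbm(r, v_next, r_hat, v_next_hat, gamma=0.99):
    """Absolute gap between the source Bellman target and the estimated target-domain one."""
    return abs((r + gamma * v_next) - (r_hat + gamma * v_next_hat))


def tbm_weights(mismatches):
    """Softmax over negative mismatches: a smaller mismatch gets a larger weight."""
    shift = min(mismatches)  # shifting by the minimum leaves the ratios unchanged
    exps = [math.exp(-(m - shift)) for m in mismatches]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical source transitions as (r, V(z'_s), r_hat, V(z_hat'_s)).
batch = [
    (1.0, 10.0, 1.0, 10.0),  # estimated target-domain target matches the source target
    (1.0, 10.0, 0.5, 9.0),   # moderate mismatch
    (1.0, 10.0, 0.0, 5.0),   # large mismatch
]
mismatches = [tbm(*transition) for transition in batch]
weights = tbm_weights(mismatches)
```

The weights sum to one across the batch and decrease as the mismatch grows, so the third transition is nearly ignored relative to the first.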

Original abstract

Cross-domain offline reinforcement learning (CDRL) aims to improve policy learning in a target domain by leveraging data collected from a source domain. Existing works typically assess the transferability of source-domain data by measuring its similarity to target-domain transitions, and implicitly perform transition-level selection. Transitions that are considered similar are assigned higher weights or rewards, while dissimilar ones are down-weighted. However, transition-level similarity does not necessarily imply consistency in long-term returns. Even visually or dynamically similar transitions may lead to significantly different outcomes in the target domain, which can mislead policy learning and degrade performance. To address this issue, we revisit the fundamental objective of policy learning. Since policy optimization ultimately relies on Bellman targets to evaluate the quality of decisions, we propose to assess the transferability of source-domain transitions based on their alignment with target-domain Bellman targets, rather than superficial transition similarity. Based on this insight, we propose a method termed Target-Aligned Bellman Backup (TABB), which selectively leverages source-domain data by measuring their contribution to accurate Bellman target estimation in the target domain. We evaluate TABB across a broad range of cross-domain offline RL settings with highly limited target-domain data. Experimental results show that TABB consistently achieves strong performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Target-Aligned Bellman Backup (TABB) for cross-domain offline RL. It claims that evaluating source-domain transition transferability via alignment with target-domain Bellman targets (rather than transition-level similarity) enables more effective policy learning when target data is highly limited. The method selectively backs up source data according to its contribution to accurate target Bellman estimation, and experiments across a range of CDRL settings are reported to show consistent strong performance.

Significance. If the central claim holds, the work offers a conceptually clean shift from superficial similarity metrics to value-consistent selection in CDRL. This could reduce bias from mismatched long-term returns and improve data efficiency in low-target-data regimes. The emphasis on Bellman-target alignment is a strength relative to prior transition-similarity approaches.

major comments (2)

[Method / Target Bellman target estimation] The skeptic concern is load-bearing: with highly limited target data, any estimate of the target value function or Bellman residual necessarily has high variance and coverage gaps. The paper must show (e.g., via ablation or theoretical bound) that the alignment score does not systematically prefer source transitions that fit estimation error rather than true target dynamics; otherwise the claimed advantage over similarity-based selection collapses.
[Experiments] Experimental claims of 'strong performance' and 'consistent' gains require concrete support. The manuscript should report statistical significance, number of seeds, exact baselines (including recent similarity-based CDRL methods), and ablations isolating the Bellman-alignment component versus naive weighting.

minor comments (2)

[Method] Notation for the alignment score and the selective backup operator should be introduced with a clear equation early in the method section to avoid ambiguity when reading the algorithm.
[Abstract / Introduction] The abstract states results across 'a broad range of cross-domain offline RL settings' but does not enumerate the domains or difficulty levels; a short table or list in the introduction would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major comment below and describe the revisions we will make to strengthen the paper.


Referee: [Method / Target Bellman target estimation] The skeptic concern is load-bearing: with highly limited target data, any estimate of the target value function or Bellman residual necessarily has high variance and coverage gaps. The paper must show (e.g., via ablation or theoretical bound) that the alignment score does not systematically prefer source transitions that fit estimation error rather than true target dynamics; otherwise the claimed advantage over similarity-based selection collapses.

Authors: We appreciate this important concern. Our alignment score is explicitly defined to measure contribution to accurate target-domain Bellman target estimation rather than raw transition similarity. While we acknowledge that limited target data introduces variance, the selection criterion prioritizes source transitions whose inclusion reduces the estimated Bellman residual on the observed target samples. In the revision we add an ablation that perturbs the target value estimates with controlled noise and shows that TABB retains its advantage over similarity baselines, indicating that the method is not simply fitting estimation artifacts. A full theoretical guarantee remains difficult in the fully offline cross-domain setting, but the added empirical analysis directly tests the skeptic scenario raised.

revision: yes

Referee: [Experiments] Experimental claims of 'strong performance' and 'consistent' gains require concrete support. The manuscript should report statistical significance, number of seeds, exact baselines (including recent similarity-based CDRL methods), and ablations isolating the Bellman-alignment component versus naive weighting.

Authors: We agree that the experimental section would benefit from greater rigor. In the revised manuscript we report results over 10 random seeds with mean and standard deviation, include paired t-test p-values for all comparisons, and explicitly list all baselines (including the most recent similarity-based CDRL approaches). We also add an ablation that replaces the Bellman-alignment weighting with a naive transition-similarity weighting scheme while keeping all other components fixed; the performance drop confirms the contribution of the alignment component. These changes directly address the request for concrete support of the reported gains.

revision: yes
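For readers unfamiliar with the paired test mentioned in the response, the statistic can be computed per seed as sketched below. This is a standard-library illustration, not the authors' evaluation code: the function name is hypothetical and the scores are placeholders rather than results from the paper. Turning the statistic into a p-value additionally requires the Student t distribution with the returned degrees of freedom.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples, e.g. one pair per seed."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0.0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1


# Placeholder per-seed returns (NOT results from the paper), one pair per seed.
method_a = [2.0, 4.0, 6.0]
method_b = [1.0, 2.0, 3.0]
t_stat, dof = paired_t_statistic(method_a, method_b)
```

Pairing by seed removes seed-to-seed variance that both methods share, which is why it is usually preferred over an unpaired comparison when the same seeds are used for every method.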

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained


The paper defines TABB by measuring source transitions' contribution to target-domain Bellman target accuracy, using an external target value estimate rather than fitting parameters to the same quantity being predicted. No step reduces a claimed prediction to a fitted input by construction, nor relies on self-citation for a uniqueness theorem or ansatz. The central method is specified via explicit alignment scoring on Bellman residuals, which is independent of the final policy performance metric. This qualifies as a normal non-finding under the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The report is based solely on the abstract; no specific free parameters, axioms, or invented entities can be extracted or verified from the provided text.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem.

$\mathrm{TBM}(\xi) = \left|(r + \gamma V(z'_s)) - (\hat{r} + \gamma V(\hat{z}'_s))\right|$ … $\omega(\xi) = \exp(-\mathrm{TBM}(\xi)) / \sum_{\xi'} \exp(-\mathrm{TBM}(\xi'))$

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Relation between the paper passage and the cited Recognition theorem.

policy optimization ultimately relies on Bellman targets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1] Xiaoyu Wen, Chenjia Bai, Kang Xu, Xudong Yu, Yang Zhang, Xuelong Li, and Zhen Wang. Contrastive representation for data filtering in cross-domain offline reinforcement learning. arXiv preprint arXiv:2405.06192, 2024.
[2] Jinxin Liu, Ziqi Zhang, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, and Donglin Wang. Beyond ood state actions: Supported cross-domain offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13945–13953, 2024.
[3] Jiafei Lyu, Mengbei Yan, Zhongjian Qiao, Runze Liu, Xiaoteng Ma, Deheng Ye, Jing-Wen Yang, Zongqing Lu, and Xiu Li. Cross-domain offline policy adaptation with optimal transport and dataset constraint. In The Thirteenth International Conference on Learning Representations, 2025.
[4] Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
[5] Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets. arXiv preprint arXiv:2109.13396, 2021.
[6] Ruiyi Zhang, Tong Yu, Yilin Shen, and Hongxia Jin. Text-based interactive recommendation via offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11694–11702, 2022.
[7] Wanqi Xue, Qingpeng Cai, Zhenghai Xue, Shuo Sun, Shuchang Liu, Dong Zheng, Peng Jiang, and Bo An. Prefrec: Preference-based recommender systems for reinforcing long-term user engagement. arXiv preprint arXiv:2212.02779, 2022.
[8] Teng Xiao and Donglin Wang. A general offline reinforcement learning framework for interactive recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4512–4520, 2021.
[9] Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning. Advances in neural information processing systems, 33:21810–21823, 2020.
[10] Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022.
[11] B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey. IEEE transactions on intelligent transportation systems, 23(6):4909–4926, 2021.
[12] Benjamin Eysenbach, Swapnil Asawa, Shreyas Chaudhari, Sergey Levine, and Ruslan Salakhutdinov. Off-dynamics reinforcement learning: Training for transfer with domain classifiers. arXiv preprint arXiv:2006.13916, 2020.
[13] Jinxin Liu, Hongyin Zhang, and Donglin Wang. Dara: Dynamics-aware reward augmentation in offline reinforcement learning. arXiv preprint arXiv:2203.06662, 2022.
[14] Kang Xu, Chenjia Bai, Xiaoteng Ma, Dong Wang, Bin Zhao, Zhen Wang, Xuelong Li, and Wei Li. Cross-domain policy adaptation via value-guided data filtering. Advances in Neural Information Processing Systems, 36:73395–73421, 2023.
[15] Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. In International conference on machine learning, pages 2052–2062. PMLR, 2019.
[16] Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction. Advances in neural information processing systems, 32, 2019.
[17] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems, 33:1179–1191, 2020.
[18] Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, and Sunil Gupta. Dmc: Nearest neighbor guidance diffusion model for offline cross-domain reinforcement learning. arXiv preprint arXiv:2507.20499, 2025.
[19] Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, and Masashi Sugiyama. Offline reinforcement learning with domain-unlabeled data. arXiv preprint arXiv:2404.07465, 2024.
[20] Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic rl: Offline rl and online rl fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685, 2024.
[21] Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline rl enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022.
[22] Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, and Pan Xu. Return augmented decision transformer for off-dynamics reinforcement learning. arXiv preprint arXiv:2410.23450, 2024.
[23] Yihong Guo, Yu Yang, Pan Xu, and Anqi Liu. Mobody: Model based off-dynamics offline reinforcement learning. arXiv preprint arXiv:2506.08460, 2025.
[24] Zhenghai Xue, Qingpeng Cai, Shuchang Liu, Dong Zheng, Peng Jiang, Kun Gai, and Bo An. State regularized policy optimization on data with dynamics shift. Advances in neural information processing systems, 36:32926–32937, 2023.
[25] Richard Bellman. A markovian decision process. Journal of mathematics and mechanics, pages 679–684, 1957.
[26] Martin L Puterman. Markov decision processes. Handbooks in operations research and management science, 2:331–434, 1990.
[27] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of artificial intelligence research, 4:237–285, 1996.
[28] Jiafei Lyu, Kang Xu, Jiacheng Xu, Jing-Wen Yang, Zongzhang Zhang, Chenjia Bai, Zongqing Lu, Xiu Li, et al. Odrl: A benchmark for off-dynamics reinforcement learning. Advances in Neural Information Processing Systems, 37:59859–59911, 2024.
[29] Richard Bellman. Dynamic programming. science, 153(3731):34–37, 1966.
[30] Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
[31] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ international conference on intelligent robots and systems, pages 5026–5033. IEEE, 2012.
[32] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017.
[33] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
[34] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR, 2018.
[35] Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, and Shuang Qiu. Dual-robust cross-domain offline reinforcement learning against dynamics shifts. arXiv preprint arXiv:2512.02486, 2025.
[36] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021.

A Proof

In this section, we formally present the proof of Theorem 1. We decompose the target Bellman error as:

$$\delta_{\mathrm{tar}}(\xi) = (r_{\mathrm{tar}} - \hat{r}) + \gamma\left(V(z'_{s,\mathrm{tar}}) - V(\hat{z}'_s)\right) + \left(\hat{r} + \gamma V(\hat{z}'_s) - (r + \gamma V(z'_s))\right). \tag{13}$$

Applying the tri...
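As a sanity check on the decomposition in Eq. (13), the three terms can be summed directly. The grouping below is an assumption about parentheses lost in extraction: it reads the equation as a reward gap, a discounted value gap, and a remainder. Under that reading, the estimated quantities $\hat{r}$ and $V(\hat{z}'_s)$ cancel:

```latex
\begin{align*}
\delta_{\mathrm{tar}}(\xi)
  &= (r_{\mathrm{tar}} - \hat{r})
   + \gamma\bigl(V(z'_{s,\mathrm{tar}}) - V(\hat{z}'_s)\bigr)
   + \bigl(\hat{r} + \gamma V(\hat{z}'_s) - (r + \gamma V(z'_s))\bigr) \\
  &= \bigl(r_{\mathrm{tar}} + \gamma V(z'_{s,\mathrm{tar}})\bigr)
   - \bigl(r + \gamma V(z'_s)\bigr).
\end{align*}
```

The sum is the gap between the target-domain Bellman target and the source Bellman target, and the absolute value of the third term equals the TBM(ξ) expression quoted earlier on this page.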

[1] [1]

Con- trastive representation for data filtering in cross-domain offline reinforcement learning.arXiv preprint arXiv:2405.06192, 2024

Xiaoyu Wen, Chenjia Bai, Kang Xu, Xudong Yu, Yang Zhang, Xuelong Li, and Zhen Wang. Con- trastive representation for data filtering in cross-domain offline reinforcement learning.arXiv preprint arXiv:2405.06192, 2024

work page arXiv 2024

[2] [2]

Beyond ood state actions: Supported cross-domain offline reinforcement learning

Jinxin Liu, Ziqi Zhang, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, and Donglin Wang. Beyond ood state actions: Supported cross-domain offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13945–13953, 2024

work page 2024

[3] [3]

Cross-domain offline policy adaptation with optimal transport and dataset constraint

Jiafei Lyu, Mengbei Yan, Zhongjian Qiao, Runze Liu, Xiaoteng Ma, Deheng Ye, Jing-Wen Yang, Zongqing Lu, and Xiu Li. Cross-domain offline policy adaptation with optimal transport and dataset constraint. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[4] [4]

Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems.arXiv preprint arXiv:2005.01643, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2005

[5] [5]

Bridge Data: Boosting Generalization of Robotic Skills with Cross-Domain Datasets

Frederik Ebert, Yanlai Yang, Karl Schmeckpeper, Bernadette Bucher, Georgios Georgakis, Kostas Daniilidis, Chelsea Finn, and Sergey Levine. Bridge data: Boosting generalization of robotic skills with cross-domain datasets.arXiv preprint arXiv:2109.13396, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[6] [6]

Text-based interactive recommendation via offline reinforcement learning

Ruiyi Zhang, Tong Yu, Yilin Shen, and Hongxia Jin. Text-based interactive recommendation via offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 11694–11702, 2022

work page 2022

[7] [7]

Prefrec: Preference-based recommender systems for reinforcing long-term user engagement

Wanqi Xue, Qingpeng Cai, Zhenghai Xue, Shuo Sun, Shuchang Liu, Dong Zheng, Peng Jiang, and Bo An. Prefrec: Preference-based recommender systems for reinforcing long-term user engagement. arXiv preprint arXiv:2212.02779, 2022

work page arXiv 2022

[8] [8]

A general offline reinforcement learning framework for interactive recommendation

Teng Xiao and Donglin Wang. A general offline reinforcement learning framework for interactive recommendation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 4512–4520, 2021

work page 2021

[9] [9]

Morel: Model-based offline reinforcement learning.Advances in neural information processing systems, 33:21810–21823, 2020

Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. Morel: Model-based offline reinforcement learning.Advances in neural information processing systems, 33:21810–21823, 2020. 11

work page 2020

[10] [10]

Constraints penalized q-learning for safe offline reinforcement learning

Haoran Xu, Xianyuan Zhan, and Xiangyu Zhu. Constraints penalized q-learning for safe offline reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8753–8760, 2022

work page 2022

[11] [11]

Deep reinforcement learning for autonomous driving: A survey.IEEE transactions on intelligent transportation systems, 23(6):4909–4926, 2021

B Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A Al Sallab, Senthil Yogamani, and Patrick Pérez. Deep reinforcement learning for autonomous driving: A survey.IEEE transactions on intelligent transportation systems, 23(6):4909–4926, 2021

work page 2021

[12] [12]

Off-dynamics reinforcement learning: Training for transfer with domain classifiers.arXiv preprint arXiv:2006.13916, 2020

Benjamin Eysenbach, Swapnil Asawa, Shreyas Chaudhari, Sergey Levine, and Ruslan Salakhutdinov. Off-dynamics reinforcement learning: Training for transfer with domain classifiers.arXiv preprint arXiv:2006.13916, 2020

work page arXiv 2006

[13] [13]

Dara: Dynamics-aware reward augmentation in offline reinforcement learning.arXiv preprint arXiv:2203.06662, 2022

Jinxin Liu, Hongyin Zhang, and Donglin Wang. Dara: Dynamics-aware reward augmentation in offline reinforcement learning.arXiv preprint arXiv:2203.06662, 2022

work page arXiv 2022

[14] [14]

Cross-domain policy adaptation via value-guided data filtering.Advances in Neural Information Processing Systems, 36:73395–73421, 2023

Kang Xu, Chenjia Bai, Xiaoteng Ma, Dong Wang, Bin Zhao, Zhen Wang, Xuelong Li, and Wei Li. Cross-domain policy adaptation via value-guided data filtering.Advances in Neural Information Processing Systems, 36:73395–73421, 2023

work page 2023

[15] [15]

Off-policy deep reinforcement learning without exploration

Scott Fujimoto, David Meger, and Doina Precup. Off-policy deep reinforcement learning without exploration. InInternational conference on machine learning, pages 2052–2062. PMLR, 2019

work page 2052

[16] [16]

Stabilizing off-policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019

Aviral Kumar, Justin Fu, Matthew Soh, George Tucker, and Sergey Levine. Stabilizing off-policy q-learning via bootstrapping error reduction.Advances in neural information processing systems, 32, 2019

work page 2019

[17] [17]

Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179–1191, 2020

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning.Advances in neural information processing systems, 33:1179–1191, 2020

work page 2020

[18]

Linh Le Pham Van, Minh Hoang Nguyen, Duc Kieu, Hung Le, Hung The Tran, and Sunil Gupta. Dmc: Nearest neighbor guidance diffusion model for offline cross-domain reinforcement learning. arXiv preprint arXiv:2507.20499, 2025

[19]

Soichiro Nishimori, Xin-Qiang Cai, Johannes Ackermann, and Masashi Sugiyama. Offline reinforcement learning with domain-unlabeled data. arXiv preprint arXiv:2404.07465, 2024

[20]

Max Sobol Mark, Tian Gao, Georgia Gabriela Sampaio, Mohan Kumar Srirama, Archit Sharma, Chelsea Finn, and Aviral Kumar. Policy agnostic RL: Offline RL and online RL fine-tuning of any class and backbone. arXiv preprint arXiv:2412.06685, 2024

[21]

Aviral Kumar, Anikait Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-training for robots: Offline RL enables learning new tasks from a handful of trials. arXiv preprint arXiv:2210.05178, 2022

[22]

Ruhan Wang, Yu Yang, Zhishuai Liu, Dongruo Zhou, and Pan Xu. Return augmented decision transformer for off-dynamics reinforcement learning. arXiv preprint arXiv:2410.23450, 2024

[23]

Yihong Guo, Yu Yang, Pan Xu, and Anqi Liu. Mobody: Model based off-dynamics offline reinforcement learning. arXiv preprint arXiv:2506.08460, 2025

[24]

Zhenghai Xue, Qingpeng Cai, Shuchang Liu, Dong Zheng, Peng Jiang, Kun Gai, and Bo An. State regularized policy optimization on data with dynamics shift. Advances in Neural Information Processing Systems, 36:32926–32937, 2023

[25]

Richard Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, pages 679–684, 1957

[26]

Martin L Puterman. Markov decision processes. Handbooks in Operations Research and Management Science, 2:331–434, 1990

[27]

Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996

[28]

Jiafei Lyu, Kang Xu, Jiacheng Xu, Jing-Wen Yang, Zongzhang Zhang, Chenjia Bai, Zongqing Lu, Xiu Li, et al. ODRL: A benchmark for off-dynamics reinforcement learning. Advances in Neural Information Processing Systems, 37:59859–59911, 2024

[29]

Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966

[30]

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998

[31]

Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033. IEEE, 2012

[32]

Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv preprint arXiv:1709.10087, 2017

[33]

Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

[34]

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018

[35]

Zhongjian Qiao, Rui Yang, Jiafei Lyu, Xiu Li, Zhongxiang Dai, Zhuoran Yang, Siyang Gao, and Shuang Qiu. Dual-robust cross-domain offline reinforcement learning against dynamics shifts. arXiv preprint arXiv:2512.02486, 2025

[36]

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169, 2021

A Proof

In this section, we formally present the proof of Theorem 1. We decompose the target Bellman error as:

$$
\delta_{\mathrm{tar}}(\xi) = (r_{\mathrm{tar}} - \hat{r}) + \gamma\left[V(z'_{s,\mathrm{tar}}) - V(\hat{z}'_{s})\right] + \left[\hat{r} + \gamma V(\hat{z}'_{s}) - \bigl(r + \gamma V(z'_{s})\bigr)\right]. \tag{13}
$$

Applying the tri...