Quotient DAGs for Off-Policy Evaluation:Forward-Flow Importance Sampling and Exact Slate Propensities

Dianbo Liu; Hongyu He; Shaowen Xiang; Ziwen Xie

arxiv: 2605.29500 · v1 · pith:76YQK7AAnew · submitted 2026-05-28 · 💻 cs.LG · cs.AI

Quotient DAGs for Off-Policy Evaluation:Forward-Flow Importance Sampling and Exact Slate Propensities

Ziwen Xie , Shaowen Xiang , Hongyu He , Dianbo Liu This is my paper

Pith reviewed 2026-06-29 08:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords off-policy evaluationslate recommendationimportance samplingquotient DAGdynamic programmingautoregressive modelspropensity estimation

0 comments

The pith

Quotient DAGs merge equivalent histories to compute exact unordered slate propensities without factorial sums.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a quotient-DAG representation for off-policy evaluation when the behavior policy generates ordered sequences but the evaluation target cares only about the unordered set. It merges histories that are equivalent under the reward or estimator and assigns importance weights via target-to-behavior forward-flow ratios on the resulting graph. This construction produces Forward-DP, a dynamic program on the subset DAG that yields the exact propensity for any given unordered slate. The method removes nuisance variance that arises when order is treated as meaningful even though it is not for downstream use. It supplies a practical primitive for propensity-based evaluation and model selection in autoregressive slate recommenders.

Core claim

For slate recommendation under a set-sufficient next-item interface, the quotient-DAG view yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration by using target-to-behavior forward-flow ratios on the merged graph.

What carries the argument

Quotient-DAG that merges histories equivalent for evaluation and assigns weights using forward-flow importance sampling ratios on the merged graph.

If this is right

The resulting propensity primitive enables practical propensity-based evaluation for context-dependent autoregressive slate loggers.
Exact unordered slate propensities can be obtained without summing over all generation orders.
Forward-flow ratios on the merged graph supply the correct importance weights once equivalent histories are identified.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same merging step could reduce variance in importance-sampling estimates whenever the evaluation metric is invariant to order.
Quotient constructions of this form may extend to other sequential processes in which only the set of chosen actions affects the outcome.
The approach offers a template for handling symmetries between logging and evaluation policies in off-policy settings.

Load-bearing premise

The reward and downstream estimator depend only on the unordered slate and the next-item interface is set-sufficient.

What would settle it

Compute the Forward-DP propensity for a small slate and verify that it equals the direct sum of the behavior-policy probabilities over every possible ordering of that slate.

Figures

Figures reproduced from arXiv: 2605.29500 by Dianbo Liu, Hongyu He, Shaowen Xiang, Ziwen Xie.

**Figure 1.** Figure 1: From trajectory IS to quotient-DAG weighting. (a) Trajectory IS assigns a likelihood ratio to each realized path. (b) The rollout tree shares prefixes but still represents distinct historyprefix nodes, even with repeated state labels. (c) A quotient DAG merges evaluation-equivalent prefixes into multi-parent nodes such as z4. The quotient likelihood ratio is the forward-flow ratio Fπ(z)/Fβ(z), not any sin… view at source ↗

**Figure 2.** Figure 2: KuaiRec scaling. Left: joint slate-propensity wall-clock time for [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Bias versus standard-deviation decomposition of RMSE on the finite-horizon MDP bench [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗

**Figure 4.** Figure 4: K = 4 off-policy model-selection metrics by estimator (visualizing [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

read the original abstract

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The quotient-DAG construction plus Forward-DP gives exact unordered slate propensities without factorial sums, which is the concrete advance here.

read the letter

The paper's core move is to treat histories that are equivalent under the evaluation target as the same node in a quotient DAG, then compute target-to-behavior forward-flow ratios on that merged graph. For autoregressive slate policies where the reward depends only on the unordered set, this produces Forward-DP, a dynamic program over subsets that delivers the exact propensity without summing over all generation orders.

That construction is new relative to standard importance sampling and the slate-specific methods cited in the abstract. It directly attacks the nuisance variance that comes from treating order as meaningful when the downstream estimator does not. The set-sufficiency assumption on the next-item interface is stated up front, so the scope is clear.

The main soft spot is that everything rests on the modeling choice that reward and estimator ignore order; if that does not hold in a given application the reduction does not apply. The abstract gives no runtime numbers or variance comparisons, so it is still open how large the practical gain is once the DP is implemented. I also want to see whether the subset-DAG size stays manageable for realistic slate cardinalities.

This is for people doing off-policy evaluation on context-dependent slate loggers in recsys or similar domains. A reader who already works with importance sampling on ordered trajectories will see the value immediately.

The work is coherent on its own terms and the central claim is falsifiable, so it should go to peer review.

Referee Report

0 major / 2 minor

Summary. The paper introduces a quotient-DAG representation of the slate generation process that merges histories equivalent under the evaluation metric (unordered slate). It defines forward-flow importance sampling ratios on this merged graph and, for set-sufficient next-item interfaces, derives Forward-DP: a subset-DAG dynamic program that computes exact unordered slate propensities without enumerating permutations. The resulting propensity estimator is intended to support lower-variance off-policy evaluation and model selection when the reward and downstream estimator depend only on the unordered slate.

Significance. If the derivation holds, the method supplies an exact, computationally tractable propensity primitive for a practically relevant class of autoregressive slate loggers. This directly addresses the factorial blow-up that currently prevents exact importance sampling in ordered slate generation, and the graph-theoretic reduction may generalize to other sequential settings where only a quotient of the trajectory matters. The explicit statement of the set-sufficiency and unordered-reward assumptions makes the scope of the claim clear.

minor comments (2)

The abstract and introduction repeatedly use the term 'set-sufficient next-item interface' without a formal definition or reference to the precise interface condition used in the Forward-DP construction; a short paragraph or boxed definition early in §2 would improve readability.
Notation for the forward-flow ratio (target-to-behavior) is introduced in the abstract but not contrasted with the conventional backward importance weight; a one-line comparison in the related-work section would help readers situate the contribution.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading, positive summary, and significance assessment of the quotient-DAG approach. The recommendation of minor revision is noted. No specific major comments were raised in the report, so we provide no point-by-point responses below.

Circularity Check

0 steps flagged

No significant circularity; new graph reduction applied to explicit modeling assumptions

full rationale

The paper introduces quotient DAGs and forward-flow ratios as a new construction for off-policy evaluation on slates where rewards depend only on unordered sets. Forward-DP is presented as a dynamic program on the resulting subset-DAG that avoids factorial enumeration. No derivation step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation; the set-sufficiency condition and unordered-slate reward dependence are stated as explicit modeling choices rather than derived outputs. The central claim is a computational reduction whose correctness is independent of any internal fit or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on domain assumptions about set-sufficiency and the new quotient-DAG and Forward-DP structures introduced without independent evidence outside the paper.

axioms (2)

domain assumption The reward and downstream estimator depend only on the unordered slate
Explicitly stated in the abstract as the condition creating the nuisance variance problem.
domain assumption Set-sufficient next-item interface
Required for the Forward-DP to compute exact propensities without factorial enumeration.

invented entities (3)

Quotient DAG no independent evidence
purpose: Merges histories equivalent for evaluation
New graph view introduced to collapse orderings that do not affect the target.
Forward-DP no independent evidence
purpose: Computes exact unordered propensities via subset-DAG dynamic program
New algorithm presented as the practical primitive enabled by the quotient view.
forward-flow ratios no independent evidence
purpose: Assigns weights using target-to-behavior ratios on the merged graph
Core weighting mechanism of the new importance sampling approach.

pith-pipeline@v0.9.1-grok · 5714 in / 1371 out tokens · 28867 ms · 2026-06-29T08:58:14.981563+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

35 extracted references · 13 canonical work pages · 2 internal anchors

[1]

Miroslav Dudík, John Langford, and Lihong Li

arXiv:2205.08084. Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Lise Getoor and Tobias Scheffer, editors,Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1097–1104. Omnipress,

work page arXiv 2011
[2]

KuaiRec: A fully-observed dataset and insights for evaluating recom- mender systems

Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. KuaiRec: A fully-observed dataset and insights for evaluating recom- mender systems. In Mohammad Al Hasan and Li Xiong, editors,Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, O...

2022
[3]

Ge Gao, Jonathan D

doi: 10.1145/3511808.3557220. Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, and Thorsten Joachims. Policy-gradient training of language models for ranking.CoRR, abs/2310.04407,

work page doi:10.1145/3511808.3557220
[5]

2022 , isbn =

doi: 10.1145/3523227.3546767. Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, pages 652...

work page doi:10.1145/3523227.3546767 2016
[6]

Wang-Cheng Kang and Julian J. McAuley. Self-attentive sequential recommendation. InIEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, pages 197–206. IEEE Computer Society,

2018
[7]

doi: 10.1109/ICDM.2018.00035. SASRec. Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. Off-policy evaluation of ranking policies under diverse user behavior. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 1154...

work page doi:10.1109/icdm.2018.00035 2018
[8]

doi: 10.1145/3580305. 3599447. AIPS estimator. Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. Off-policy evaluation of slate bandit policies via optimizing abstraction. In Tat-Seng Chua, Chong-Wah Ngo, Ravi Kumar, Hady W. Lauw, and Roy Ka-Wei Lee, editors,Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 3150–3161. ACM,

work page doi:10.1145/3580305 2024
[9]

OPCB estimator

doi: 10.1145/3589334.3645343. OPCB estimator. Matthieu Komorowski, Leo A. Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720,

work page doi:10.1145/3589334.3645343
[10]

Wouter Kool, Herke van Hoof, and Max Welling

doi: 10.1038/s41591-018-0213-5. Wouter Kool, Herke van Hoof, and Max Welling. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. InProceedings of the 36th International Conference on Machine Learning (ICML),

work page doi:10.1038/s41591-018-0213-5
[11]

Understanding the curse of horizon in off- policy evaluation via conditional importance sampling

10 Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off- policy evaluation via conditional importance sampling. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 6184–6193. PMLR,

2020
[12]

Chi, and Qiaozhu Mei

Jiaqi Ma, Xinyang Yi, Weijing Tang, Zhe Zhao, Lichan Hong, Ed H. Chi, and Qiaozhu Mei. Learning- to-rank with partitioned preference: Fast estimation for the Plackett-Luce model. In Arindam Banerjee and Kenji Fukumizu, editors,The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, Proce...

2021
[13]

Ashique Rupam Mahmood, Hado van Hasselt, and Richard S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors,Advances in Neural Infor- mation Processing Systems 27: Annual Conference on Neural Information Proce...

2014
[14]

Concept-driven off policy evaluation.CoRR, abs/2411.19395,

Ritam Majumdar, Jack Teversham, and Sonali Parbhoo. Concept-driven off policy evaluation.CoRR, abs/2411.19395,

work page arXiv
[16]

Michael Oberst and David A. Sontag. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, pages 4881–4890. PMLR,

2019
[17]

Pavse and Josiah P

Brahma S. Pavse and Josiah P. Hanna. Scaling marginalized importance sampling to high-dimensional state-spaces via state abstraction. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 9417–9425. AAAI Press,

2023
[18]

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

doi: 10.1609/aaai.v37i8.26128. Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!CoRR, abs/2312.02724,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v37i8.26128
[19]

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

doi: 10.48550/ARXIV . 2312.02724. Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In Pat Langley, editor,Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 759–766. Morgan Kaufmann,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2000
[20]

Tran, Jonah Samost, Maciej Kula, Ed H

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Mahesh Sathiamoorthy. Recommender systems with generative retrieval. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in N...

2023
[21]

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, and Will Dabney

TIGER. Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, and Will Dabney. Conditional importance sampling for off-policy learning. In Silvia Chiappa and Roberto Calandra, editors,The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy]...

2020
[22]

Effective off-policy evaluation and learning in contextual combinatorial bandits

11 Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. Effective off-policy evaluation and learning in contextual combinatorial bandits. In Tommaso Di Noia, Pasquale Lops, Thorsten Joachims, Katrien Verbert, Pablo Castells, Zhenhua Dong, and Ben London, editors,Proceedings of the 18th ACM Conference on Recomm...

2024
[23]

Slate-OPE via abstraction

doi: 10.1145/3640457.3688099. Slate-OPE via abstraction. Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Wenwu Zhu, Dacheng Tao, Xueqi Cheng, Peng Cui, Elke A. Rundensteiner, David Carmel, Qi He, and Jeffrey Xu Yu, editors,Proce...

work page doi:10.1145/3640457.3688099 2019
[24]

InProceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)

doi: 10.1145/3357384.3357895. Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? investigating large language models as re-ranking agents. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proc...

work page doi:10.1145/3357384.3357895 2023
[25]

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni

doi: 10.18653/v1/2023.emnlp-main.923. Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors,Advances in Neural I...

work page doi:10.18653/v1/2023.emnlp-main.923 2023
[26]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, pages 2139–2148. JMLR.org,

2016
[27]

Minimax weight and q-function learning for off-policy evaluation

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 9659–9668. PMLR,

2020
[28]

Control variates for slate off-policy evaluation

Nikos Vlassis, Ashok Chandrashekar, Fernando Amat Gil, and Nathan Kallus. Control variates for slate off-policy evaluation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Ne...

2021
[29]

Towards optimal off-policy evaluation for reinforce- ment learning with marginalized importance sampling

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforce- ment learning with marginalized importance sampling. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché Buc, Emily B. Fox, and Roman Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Inform...

2019
[30]

GenDICE: Generalized offline estimation of stationary values

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

2020
[31]

The target policy is π= softmax(Q ∗) at temperature 0.1, and the behavior policy is β=ϵ -greedy(Q∗) at ϵ= 0.3

is a stochastic MDP with 720 ground states, 8 actions, horizon H= 5 , and terminal rewards {+1,0,−1} for discharge, non- terminal, and death respectively. The target policy is π= softmax(Q ∗) at temperature 0.1, and the behavior policy is β=ϵ -greedy(Q∗) at ϵ= 0.3 . The ground-truth value is V(π) =−0.2683 . The learned-abstraction baselines in Table 1 are...

2023
[32]

and trained for 20,000 steps (DualDICE) or 15,000 steps (GenDICE) with Adam, batch size 2048, and gradient clipping at norm

2048
[33]

converged

DualDICE uses the primal form (ζ(s, a) =∂f(bellman) for f(x) =x 2/2); GenDICE uses the saddle-point form with τ≥0 via softplus, learning rates (ητ , ηf) = (10−4,10 −3), normalization penalty λ= 10 , and chi-square penalty α= 1 . Each estimator is run for 50 trials with seeds {43, . . . ,92}. The reported Table 1 entries are bias, std, and RMSE relative to...

2018
[34]

The left panel compares exact enumeration over all K! generation orders, Forward-DP, and a Gumbel-top- K Monte Carlo estimator [Kool et al., 2019]

KuaiRec scaling details for Figure 2.Figure 2 evaluates how slate size K affects both joint- propensity computation and downstream OPE. The left panel compares exact enumeration over all K! generation orders, Forward-DP, and a Gumbel-top- K Monte Carlo estimator [Kool et al., 2019]. Direct K!-enumeration is measured up to K= 8 and extrapolated for K >8 , ...

2019
[35]

Lower than 1.00 indicates lower RMSE than the denominator

Table 5: Within-K RMSE ratios on KuaiRec (M= 15 , nusers = 300, ntrials = 200). Lower than 1.00 indicates lower RMSE than the denominator. Ratios should be compared within the same K, not across differentK. KFF-OIS / OIS DR / OIS FF-DR / OIS FF-DR / DR 40.378 0.494 0.184 0.372 60.191 0.682 0.094 0.138 80.142 0.164 0.077 0.470 100.497 0.941 0.401 0.426 120...

2019
[36]

Full proofs appear in Appendix A (§A.1, §A.3, §A.4, §A.6)

state their assumptions in the proposition body. Full proofs appear in Appendix A (§A.1, §A.3, §A.4, §A.6). The variance gap derivation specialized to slate OPE is in §A.5. Guidelines: • The answer [N/A] means that the paper does not include theoretical results. • All the theorems, formulas, and proofs in the paper should be numbered and cross- referenced...

2019
[37]

Guidelines: • The answer [N/A] means that the paper does not include experiments

The Qwen2.5-3B logit cache for the cross-architecture run is the only GPU-dependent step and is amortized once per user. Guidelines: • The answer [N/A] means that the paper does not include experiments. • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. • The pape...

2019

[1] [1]

Miroslav Dudík, John Langford, and Lihong Li

arXiv:2205.08084. Miroslav Dudík, John Langford, and Lihong Li. Doubly robust policy evaluation and learning. In Lise Getoor and Tobias Scheffer, editors,Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1097–1104. Omnipress,

work page arXiv 2011

[2] [2]

KuaiRec: A fully-observed dataset and insights for evaluating recom- mender systems

Chongming Gao, Shijun Li, Wenqiang Lei, Jiawei Chen, Biao Li, Peng Jiang, Xiangnan He, Jiaxin Mao, and Tat-Seng Chua. KuaiRec: A fully-observed dataset and insights for evaluating recom- mender systems. In Mohammad Al Hasan and Li Xiong, editors,Proceedings of the 31st ACM International Conference on Information & Knowledge Management, Atlanta, GA, USA, O...

2022

[3] [3]

Ge Gao, Jonathan D

doi: 10.1145/3511808.3557220. Ge Gao, Jonathan D. Chang, Claire Cardie, Kianté Brantley, and Thorsten Joachims. Policy-gradient training of language models for ranking.CoRR, abs/2310.04407,

work page doi:10.1145/3511808.3557220

[4] [5]

2022 , isbn =

doi: 10.1145/3523227.3546767. Nan Jiang and Lihong Li. Doubly robust off-policy value evaluation for reinforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, pages 652...

work page doi:10.1145/3523227.3546767 2016

[5] [6]

Wang-Cheng Kang and Julian J. McAuley. Self-attentive sequential recommendation. InIEEE International Conference on Data Mining, ICDM 2018, Singapore, November 17-20, 2018, pages 197–206. IEEE Computer Society,

2018

[6] [7]

doi: 10.1109/ICDM.2018.00035. SASRec. Haruka Kiyohara, Masatoshi Uehara, Yusuke Narita, Nobuyuki Shimizu, Yasuo Yamamoto, and Yuta Saito. Off-policy evaluation of ranking policies under diverse user behavior. InProceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023, Long Beach, CA, USA, August 6-10, 2023, pages 1154...

work page doi:10.1109/icdm.2018.00035 2018

[7] [8]

doi: 10.1145/3580305. 3599447. AIPS estimator. Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. Off-policy evaluation of slate bandit policies via optimizing abstraction. In Tat-Seng Chua, Chong-Wah Ngo, Ravi Kumar, Hady W. Lauw, and Roy Ka-Wei Lee, editors,Proceedings of the ACM on Web Conference 2024, WWW 2024, Singapore, May 13-17, 2024, pages 3150–3161. ACM,

work page doi:10.1145/3580305 2024

[8] [9]

OPCB estimator

doi: 10.1145/3589334.3645343. OPCB estimator. Matthieu Komorowski, Leo A. Celi, Omar Badawi, Anthony C. Gordon, and A. Aldo Faisal. The Artificial Intelligence Clinician learns optimal treatment strategies for sepsis in intensive care. Nature Medicine, 24(11):1716–1720,

work page doi:10.1145/3589334.3645343

[9] [10]

Wouter Kool, Herke van Hoof, and Max Welling

doi: 10.1038/s41591-018-0213-5. Wouter Kool, Herke van Hoof, and Max Welling. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. InProceedings of the 36th International Conference on Machine Learning (ICML),

work page doi:10.1038/s41591-018-0213-5

[10] [11]

Understanding the curse of horizon in off- policy evaluation via conditional importance sampling

10 Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off- policy evaluation via conditional importance sampling. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 6184–6193. PMLR,

2020

[11] [12]

Chi, and Qiaozhu Mei

Jiaqi Ma, Xinyang Yi, Weijing Tang, Zhe Zhao, Lichan Hong, Ed H. Chi, and Qiaozhu Mei. Learning- to-rank with partitioned preference: Fast estimation for the Plackett-Luce model. In Arindam Banerjee and Kenji Fukumizu, editors,The 24th International Conference on Artificial Intelligence and Statistics, AISTATS 2021, April 13-15, 2021, Virtual Event, Proce...

2021

[12] [13]

Ashique Rupam Mahmood, Hado van Hasselt, and Richard S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Zoubin Ghahramani, Max Welling, Corinna Cortes, Neil D. Lawrence, and Kilian Q. Weinberger, editors,Advances in Neural Infor- mation Processing Systems 27: Annual Conference on Neural Information Proce...

2014

[13] [14]

Concept-driven off policy evaluation.CoRR, abs/2411.19395,

Ritam Majumdar, Jack Teversham, and Sonali Parbhoo. Concept-driven off policy evaluation.CoRR, abs/2411.19395,

work page arXiv

[14] [16]

Michael Oberst and David A. Sontag. Counterfactual off-policy evaluation with Gumbel-max structural causal models. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors,Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, Proceedings of Machine Learning Research, pages 4881–4890. PMLR,

2019

[15] [17]

Pavse and Josiah P

Brahma S. Pavse and Josiah P. Hanna. Scaling marginalized importance sampling to high-dimensional state-spaces via state abstraction. In Brian Williams, Yiling Chen, and Jennifer Neville, editors, Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Washington, DC, USA, February 7-14, 2023, pages 9417–9425. AAAI Press,

2023

[16] [18]

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

doi: 10.1609/aaai.v37i8.26128. Ronak Pradeep, Sahel Sharifymoghaddam, and Jimmy Lin. RankZephyr: Effective and robust zero-shot listwise reranking is a breeze!CoRR, abs/2312.02724,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v37i8.26128

[17] [19]

RankZephyr: Effective and Robust Zero-Shot Listwise Reranking is a Breeze!

doi: 10.48550/ARXIV . 2312.02724. Doina Precup, Richard S. Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In Pat Langley, editor,Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pages 759–766. Morgan Kaufmann,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv 2000

[18] [20]

Tran, Jonah Samost, Maciej Kula, Ed H

Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan Hulikal Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Mahesh Sathiamoorthy. Recommender systems with generative retrieval. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors,Advances in N...

2023

[19] [21]

Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, and Will Dabney

TIGER. Mark Rowland, Anna Harutyunyan, Hado van Hasselt, Diana Borsa, Tom Schaul, Rémi Munos, and Will Dabney. Conditional importance sampling for off-policy learning. In Silvia Chiappa and Roberto Calandra, editors,The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020, 26-28 August 2020, Online [Palermo, Sicily, Italy]...

2020

[20] [22]

Effective off-policy evaluation and learning in contextual combinatorial bandits

11 Tatsuhiro Shimizu, Koichi Tanaka, Ren Kishimoto, Haruka Kiyohara, Masahiro Nomura, and Yuta Saito. Effective off-policy evaluation and learning in contextual combinatorial bandits. In Tommaso Di Noia, Pasquale Lops, Thorsten Joachims, Katrien Verbert, Pablo Castells, Zhenhua Dong, and Ben London, editors,Proceedings of the 18th ACM Conference on Recomm...

2024

[21] [23]

Slate-OPE via abstraction

doi: 10.1145/3640457.3688099. Slate-OPE via abstraction. Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Wenwu Zhu, Dacheng Tao, Xueqi Cheng, Peng Cui, Elke A. Rundensteiner, David Carmel, Qi He, and Jeffrey Xu Yu, editors,Proce...

work page doi:10.1145/3640457.3688099 2019

[22] [24]

InProceedings of the 28th ACM International Conference on Information and Knowledge Management (CIKM)

doi: 10.1145/3357384.3357895. Weiwei Sun, Lingyong Yan, Xinyu Ma, Shuaiqiang Wang, Pengjie Ren, Zhumin Chen, Dawei Yin, and Zhaochun Ren. Is ChatGPT good at search? investigating large language models as re-ranking agents. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Proc...

work page doi:10.1145/3357384.3357895 2023

[23] [25]

Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni

doi: 10.18653/v1/2023.emnlp-main.923. Adith Swaminathan, Akshay Krishnamurthy, Alekh Agarwal, Miroslav Dudík, John Langford, Damien Jose, and Imed Zitouni. Off-policy evaluation for slate recommendation. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V . N. Vishwanathan, and Roman Garnett, editors,Advances in Neural I...

work page doi:10.18653/v1/2023.emnlp-main.923 2023

[24] [26]

Thomas and Emma Brunskill

Philip S. Thomas and Emma Brunskill. Data-efficient off-policy policy evaluation for reinforcement learning. In Maria-Florina Balcan and Kilian Q. Weinberger, editors,Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, JMLR Workshop and Conference Proceedings, pages 2139–2148. JMLR.org,

2016

[25] [27]

Minimax weight and q-function learning for off-policy evaluation

Masatoshi Uehara, Jiawei Huang, and Nan Jiang. Minimax weight and q-function learning for off-policy evaluation. InProceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, pages 9659–9668. PMLR,

2020

[26] [28]

Control variates for slate off-policy evaluation

Nikos Vlassis, Ashok Chandrashekar, Fernando Amat Gil, and Nathan Kallus. Control variates for slate off-policy evaluation. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, Ne...

2021

[27] [29]

Towards optimal off-policy evaluation for reinforce- ment learning with marginalized importance sampling

Tengyang Xie, Yifei Ma, and Yu-Xiang Wang. Towards optimal off-policy evaluation for reinforce- ment learning with marginalized importance sampling. In Hanna M. Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché Buc, Emily B. Fox, and Roman Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Inform...

2019

[28] [30]

GenDICE: Generalized offline estimation of stationary values

Ruiyi Zhang, Bo Dai, Lihong Li, and Dale Schuurmans. GenDICE: Generalized offline estimation of stationary values. In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30,

2020

[29] [31]

The target policy is π= softmax(Q ∗) at temperature 0.1, and the behavior policy is β=ϵ -greedy(Q∗) at ϵ= 0.3

is a stochastic MDP with 720 ground states, 8 actions, horizon H= 5 , and terminal rewards {+1,0,−1} for discharge, non- terminal, and death respectively. The target policy is π= softmax(Q ∗) at temperature 0.1, and the behavior policy is β=ϵ -greedy(Q∗) at ϵ= 0.3 . The ground-truth value is V(π) =−0.2683 . The learned-abstraction baselines in Table 1 are...

2023

[30] [32]

and trained for 20,000 steps (DualDICE) or 15,000 steps (GenDICE) with Adam, batch size 2048, and gradient clipping at norm

2048

[31] [33]

converged

DualDICE uses the primal form (ζ(s, a) =∂f(bellman) for f(x) =x 2/2); GenDICE uses the saddle-point form with τ≥0 via softplus, learning rates (ητ , ηf) = (10−4,10 −3), normalization penalty λ= 10 , and chi-square penalty α= 1 . Each estimator is run for 50 trials with seeds {43, . . . ,92}. The reported Table 1 entries are bias, std, and RMSE relative to...

2018

[32] [34]

The left panel compares exact enumeration over all K! generation orders, Forward-DP, and a Gumbel-top- K Monte Carlo estimator [Kool et al., 2019]

KuaiRec scaling details for Figure 2.Figure 2 evaluates how slate size K affects both joint- propensity computation and downstream OPE. The left panel compares exact enumeration over all K! generation orders, Forward-DP, and a Gumbel-top- K Monte Carlo estimator [Kool et al., 2019]. Direct K!-enumeration is measured up to K= 8 and extrapolated for K >8 , ...

2019

[33] [35]

Lower than 1.00 indicates lower RMSE than the denominator

Table 5: Within-K RMSE ratios on KuaiRec (M= 15 , nusers = 300, ntrials = 200). Lower than 1.00 indicates lower RMSE than the denominator. Ratios should be compared within the same K, not across differentK. KFF-OIS / OIS DR / OIS FF-DR / OIS FF-DR / DR 40.378 0.494 0.184 0.372 60.191 0.682 0.094 0.138 80.142 0.164 0.077 0.470 100.497 0.941 0.401 0.426 120...

2019

[34] [36]

Full proofs appear in Appendix A (§A.1, §A.3, §A.4, §A.6)

state their assumptions in the proposition body. Full proofs appear in Appendix A (§A.1, §A.3, §A.4, §A.6). The variance gap derivation specialized to slate OPE is in §A.5. Guidelines: • The answer [N/A] means that the paper does not include theoretical results. • All the theorems, formulas, and proofs in the paper should be numbered and cross- referenced...

2019

[35] [37]

Guidelines: • The answer [N/A] means that the paper does not include experiments

The Qwen2.5-3B logit cache for the cross-architecture run is the only GPU-dependent step and is amortized once per user. Guidelines: • The answer [N/A] means that the paper does not include experiments. • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. • The pape...

2019