Learning in Position-Aware Multinomial Logit Bandits: From Multiplicative to General Position Effects
Pith reviewed 2026-05-20 14:32 UTC · model grok-4.3
The pith
The paper gives the first regret-optimal algorithms for jointly learning assortments and display positions under position-aware multinomial logit choice.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the multiplicative model, a cross-position pairwise maximum-likelihood estimator with clipping yields an algorithm whose regret is Õ(√(NT)), matching the lower bound; for the general model, a new algorithm attains the minimax rate characterized by a matching lower bound.
What carries the argument
Cross-position pairwise maximum-likelihood estimator (with clipping) for the multiplicative case; GP2-UCB for the general case; per-round optimization via Dinkelbach's method plus maximum-weight bipartite matching.
If this is right
- Real-time platforms can now update assortment and position decisions after every single purchase observation rather than waiting for epochs.
- The √K gap that previously separated upper and lower bounds for the multiplicative model is eliminated.
- The same algorithmic template yields matching upper and lower bounds once position effects become fully general.
- An efficient exact solver for the per-round combinatorial optimization problem is supplied.
Where Pith is reading between the lines
- The pairwise estimator may extend to other choice models that admit a similar cross-position likelihood structure.
- The results suggest that position bias can be learned at essentially the same cost as item-level learning when the model is correctly specified.
- The optimization subroutine could be reused inside other revenue-management settings that require fractional programming over assignments.
Load-bearing premise
Customer choices are generated exactly according to the multinomial logit model whose position-dependent attraction parameters are fixed over time.
What would settle it
A sequence of experiments in which the observed cumulative regret grows faster than Õ(√(NT)) under the multiplicative model, or faster than the minimax rate under the general model, would falsify the claimed bounds.
read the original abstract
We study the dynamic joint assortment selection and positioning problem, where the attraction of each product depends on both its intrinsic appeal and its display position under a Multinomial Logit (MNL) choice framework. Our study ranges from the multiplicative position effects model, in which each product's attraction is scaled by a position-specific factor, to a general position effects model assigning independent attraction parameters to every product--position pair to capture heterogeneous synergies. For both models, we design round-based learning algorithms that update decisions after every single feedback, and establish the first regret-optimal characterization. Besides, our round-based algorithms provide the prompt operations needed by modern platforms. For the multiplicative model, we develop a cross-position pairwise maximum likelihood estimator with a clipping mechanism, and prove that our algorithm P2MLE-UCB attains a regret of $\tilde{O}(\sqrt{NT})$, matching the lower bound and closing the $\sqrt{K}$ gap left by prior epoch-based analyses. For the general model, we establish a minimax lower bound and propose GP2-UCB with a matching upper bound. Moreover, we design an efficient subroutine for the per-round joint assortment and positioning optimization based on Dinkelbach's method and maximum-weight bipartite matching. Numerical experiments on synthetic data and the Expedia dataset show that our algorithms consistently outperform state-of-the-art benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript studies the dynamic joint assortment selection and positioning problem under the Multinomial Logit (MNL) choice model with position-dependent product attractions. It considers the multiplicative position effects model, where attractions are scaled by position-specific factors, and the general position effects model with independent attraction parameters for each product-position pair. For the multiplicative case, the authors develop the P2MLE-UCB algorithm using a cross-position pairwise maximum likelihood estimator with a clipping mechanism and prove a regret bound of Õ(√(NT)) that matches the lower bound, closing the √K gap from prior epoch-based work. For the general model, they establish a minimax lower bound and propose GP2-UCB with a matching upper bound. The paper also provides an efficient per-round optimization subroutine based on Dinkelbach's method and maximum-weight bipartite matching, along with numerical experiments on synthetic data and the Expedia dataset.
Significance. If the central regret bounds hold, this work makes a substantial contribution by delivering the first regret-optimal characterization for position-aware MNL bandits in both the multiplicative and general settings. The Õ(√(NT)) bound for the multiplicative model resolves a concrete gap left by earlier analyses, while the matching minimax results for the general model are notable. The round-based design with single-feedback updates and the efficient Dinkelbach+bipartite-matching subroutine are practical strengths for deployment on modern platforms. The empirical validation on real data further supports the claims.
minor comments (3)
- [Algorithm and estimator description] The description of the clipping mechanism in the cross-position pairwise MLE (mentioned in the abstract and likely detailed in the algorithm section) would benefit from an explicit statement of how the clipping threshold is chosen and its precise effect on the estimation error term in the regret analysis.
- [Numerical experiments] In the numerical experiments section, additional details on how position effects are instantiated or estimated from the Expedia dataset would help readers assess the realism of the simulation setup.
- [Lower bound discussion] The paper could add a short remark confirming that the lower-bound construction for the multiplicative model incorporates the joint assortment-positioning decision space, to make the matching claim fully transparent.
Simulated Author's Rebuttal
We thank the referee for the positive and accurate summary of our work, as well as for recommending minor revision. We appreciate the recognition that our round-based algorithms close the √K gap in the multiplicative setting and deliver matching minimax bounds in the general position-effects model. We will incorporate all suggested minor improvements in the revised manuscript.
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper introduces new estimators (cross-position pairwise MLE with clipping for the multiplicative model) and algorithms (P2MLE-UCB and GP2-UCB) under standard fixed-parameter MNL assumptions. Regret bounds are derived via standard stochastic bandit tools (UCB-style exploration, concentration inequalities, and per-round optimization via Dinkelbach + bipartite matching), with matching lower bounds established separately. These steps do not reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations; the analysis remains independent of the target regret quantities and is externally falsifiable through the stated model and concentration arguments. No circular patterns from the enumerated list are present.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Customer choices are generated according to the multinomial logit model with position-dependent attractions that are fixed but unknown.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
For the multiplicative model, we develop a cross-position pairwise maximum likelihood estimator with a clipping mechanism, and prove that our algorithm P2MLE-UCB attains a regret of Õ(√(NT))
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Abeliuk A, Berbeglia G, Cebrian M, Van Hentenryck P (2016) Assortment optimization under a multinomial logit model with position bias and social influence. 4OR 14(1):57--75
work page 2016
-
[2]
Conference on learning theory, 76--78 (PMLR)
Agrawal S, Avadhanula V, Goyal V, Zeevi A (2017) Thompson sampling for the mnl-bandit. Conference on learning theory, 76--78 (PMLR)
work page 2017
-
[3]
Operations Research 67(5):1453--1485
Agrawal S, Avadhanula V, Goyal V, Zeevi A (2019) Mnl-bandit: A dynamic learning approach to assortment selection. Operations Research 67(5):1453--1485
work page 2019
-
[4]
Management Science 67(6):3519--3550
Aouad A, Segev D (2021) Display optimization for vertically differentiated locations under multinomial logit preferences. Management Science 67(6):3519--3550
work page 2021
-
[5]
Communications of the ACM 14(12):802--804
Bourgeois F, Lassalle JC (1971) An extension of the munkres algorithm for the assignment problem to rectangular matrices. Communications of the ACM 14(12):802--804
work page 1971
-
[6]
Management science 53(2):276--292
Caro F, Gallien J (2007) Dynamic assortment with demand learning for seasonal consumer goods. Management science 53(2):276--292
work page 2007
-
[7]
Cesa-Bianchi N, Lugosi G (2006) Prediction, learning, and games (Cambridge university press)
work page 2006
-
[8]
ACM Transactions on Information Systems 41(3):1--39
Chen J, Dong H, Wang X, Feng F, Wang M, He X (2023 a ) Bias and debias in recommender system: A survey and future directions. ACM Transactions on Information Systems 41(3):1--39
work page 2023
-
[9]
Chen N, Gao P, Wang C, Wang Y (2023 b ) Assortment optimization for the multinomial logit model with repeated customer interactions. Available at SSRN 4526247
work page 2023
-
[10]
Production and Operations Management 30(1):85--102
Chen X, Shi C, Wang Y, Zhou Y (2021 a ) Dynamic assortment planning under nested logit models. Production and Operations Management 30(1):85--102
work page 2021
-
[11]
Operations Research Letters 46(5):534--537
Chen X, Wang Y (2018) A note on a tight lower bound for capacitated mnl-bandit assortment selection models. Operations Research Letters 46(5):534--537
work page 2018
-
[12]
Mathematics of Operations Research 46(4):1639--1657
Chen X, Wang Y, Zhou Y (2021 b ) Optimal policy for dynamic assortment planning under multinomial logit models. Mathematics of Operations Research 46(4):1639--1657
work page 2021
-
[13]
The 22nd International Conference on Artificial Intelligence and Statistics, 438--447 (PMLR)
Cheung WC, Tan V, Zhong Z (2019) A thompson sampling algorithm for cascading bandits. The 22nd International Conference on Artificial Intelligence and Statistics, 438--447 (PMLR)
work page 2019
-
[14]
Proceedings of the 2008 international conference on web search and data mining, 87--94
Craswell N, Zoeter O, Taylor M, Ramsey B (2008) An experimental comparison of click position-bias models. Proceedings of the 2008 international conference on web search and data mining, 87--94
work page 2008
-
[15]
Management science 13(7):492--498
Dinkelbach W (1967) On nonlinear fractional programming. Management science 13(7):492--498
work page 1967
-
[16]
arXiv preprint arXiv:2510.01693
Dong J, Mo W, Qi Z, Shi C, Fang EX, Tarokh V (2025) Pasta: A unified framework for offline assortment learning. arXiv preprint arXiv:2510.01693
-
[17]
the Annals of Probability 100--118
Freedman DA (1975) On tail probabilities for martingales. the Annals of Probability 100--118
work page 1975
-
[18]
Operations Research 68(1):134--160
Gallego G, Li A, Truong VA, Wang X (2020) Approximation algorithms for product framing and pricing. Operations Research 68(1):134--160
work page 2020
-
[19]
Operations research 69(5):1509--1532
Gao P, Ma Y, Chen N, Gallego G, Li A, Rusmevichientong P, Topaloglu H (2021) Assortment optimization and pricing under the multinomial logit model with impatient customers: Sequential recommendation and selection. Operations research 69(5):1509--1532
work page 2021
-
[20]
The Thirty-ninth Annual Conference on Neural Information Processing Systems
Han Y, Blanchet J, Zhou Z (2025 a ) Improved confidence regions and optimal algorithms for online and offline linear mnl bandits. The Thirty-ninth Annual Conference on Neural Information Processing Systems
work page 2025
-
[21]
arXiv preprint arXiv:2502.06777
Han Y, Zhong H, Lu M, Blanchet J, Zhou Z (2025 b ) Learning an optimal assortment policy under observational data. arXiv preprint arXiv:2502.06777
-
[22]
Manufacturing & Service Operations Management 26(1):215--232
Jasin S, Lyu C, Najafi S, Zhang H (2024) Assortment optimization with multi-item basket purchase under multivariate mnl model. Manufacturing & Service Operations Management 26(1):215--232
work page 2024
-
[23]
Production and Operations Management 35(3):1133--1150
Jiang B, Wang Z, Xue C, Zhang N (2026) Assortment optimization in the presence of focal effect: Operational insights and efficient algorithms. Production and Operations Management 35(3):1133--1150
work page 2026
-
[24]
Proceedings of the tenth ACM international conference on web search and data mining, 781--789
Joachims T, Swaminathan A, Schnabel T (2017) Unbiased learning-to-rank with biased feedback. Proceedings of the tenth ACM international conference on web search and data mining, 781--789
work page 2017
-
[25]
Jonker R, Volgenant A (1987) A shortest augmenting path algorithm for dense and sparse linear assignment problems. Computing 38(4):325--340
work page 1987
-
[26]
International conference on machine learning, 767--776 (PMLR)
Kveton B, Szepesvari C, Wen Z, Ashkan A (2015) Cascading bandits: Learning to rank in the cascade model. International conference on machine learning, 767--776 (PMLR)
work page 2015
-
[27]
Operations research 73(1):109--138
Li S, Luo Q, Huang Z, Shi C (2025) Online learning for constrained assortment optimization under markov chain choice model. Operations research 73(1):109--138
work page 2025
-
[28]
International conference on machine learning, 1245--1253 (PMLR)
Li S, Wang B, Zhang S, Chen W (2016) Contextual combinatorial cascading bandits. International conference on machine learning, 1245--1253 (PMLR)
work page 2016
-
[29]
Liang Y, Mao X, Wang S (2026) Online joint assortment-inventory optimization under mnl choices. Operations Research
work page 2026
-
[30]
Luo Y, Sun WW, Liu Y (2025) Rate-optimal online learning for dynamic assortment selection with positioning. Operations Research
work page 2025
-
[31]
Manufacturing & Service Operations Management 23(2):525--545
Miao S, Chao X (2021) Dynamic joint assortment and pricing optimization with demand learning. Manufacturing & Service Operations Management 23(2):525--545
work page 2021
-
[32]
Najafi S, Jasin S, Uichanco J, Zhao J (2026) Assortment and price optimization under a multiattribute (contextual) choice model. Operations Research
work page 2026
-
[33]
Operations research 58(6):1666--1680
Rusmevichientong P, Shen ZJM, Shmoys DB (2010) Dynamic assortment optimization with a multinomial logit choice model and capacity constraint. Operations research 58(6):1666--1680
work page 2010
-
[34]
arXiv preprint arXiv:2402.18917
Saha A, Gaillard P (2024) Stop relying on no-choice and do not repeat the moves: Optimal, efficient and practical algorithms for assortment optimization. arXiv preprint arXiv:2402.18917
-
[35]
Manufacturing & Service Operations Management 15(3):387--404
Saur \'e D, Zeevi A (2013) Optimal dynamic assortment planning with demand learning. Manufacturing & Service Operations Management 15(3):387--404
work page 2013
-
[36]
Schaible S (1976) Fractional programming. ii, on dinkelbach's algorithm. Management science 22(8):868--873
work page 1976
-
[37]
Proceedings of the eleventh ACM international conference on web search and data mining, 610--618
Wang X, Golbandi N, Bendersky M, Metzler D, Najork M (2018) Position bias estimation for unbiased learning to rank in personal search. Proceedings of the eleventh ACM international conference on web search and data mining, 610--618
work page 2018
-
[38]
Manufacturing & Service Operations Management 25(5):1748--1764
Xu Y, Wang Z (2023) Assortment optimization for a multistage choice model. Manufacturing & Service Operations Management 25(5):1748--1764
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.