Optimal Online and Offline Algorithms for Contextual MNL with Applications to Assortment and Pricing
Pith reviewed 2026-05-10 02:30 UTC · model grok-4.3
The pith
New algorithms for joint contextual MNL assortment and pricing deliver improved online regret bounds of order W sqrt(d T log N)/L0 and local suboptimality guarantees offline.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our SupCB-type algorithm improves the best previously known regret bound to O~(W sqrt(d T log N)/L0), and when specialized to assortment-only or pricing-only it recovers the near-minimax-optimal rates.
Load-bearing premise
The new confidence region for demand estimation under price-dependent features is valid and the model parameters satisfy the boundedness conditions (including L0) needed for the regret and suboptimality analyses to hold.
read the original abstract
Selecting which products to display and at what prices is a central decision in retail and e-commerce operations. In many applications, these two choices must be made jointly under limited display capacity and uncertain customer demand. In this paper, we study the joint assortment and pricing problem under a price-based contextual multinomial logit model, where customer preferences depend on both product features and selling prices. Our analysis begins with the construction of a new confidence region for demand estimation under price-dependent features. Building on this result, we develop a pessimistic offline algorithm and SupCB-type online algorithms for joint assortment and pricing optimization. In the offline setting, we establish a suboptimality guarantee governed by local information around the optimal assortment-price pair, rather than by exact coverage of the optimal action. In the online setting, our SupCB-type algorithm improves the best previously known regret bound to $\widetilde{O}(W\sqrt{dT\log N}/L_0)$, and we also provide a computationally simpler Thompson-sampling alternative. When specialized to the assortment-only or pricing-only setting, our bounds recover the near-minimax-optimal rates established in those respective domains, thereby bridging the gap between the mature study on assortment optimization or dynamic pricing and the limited literature on their joint optimization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper addresses the joint assortment and pricing problem under a price-based contextual multinomial logit model. It introduces a new confidence region for estimating demand parameters when features depend on prices. This enables a pessimistic offline algorithm with a suboptimality guarantee based on local information around the optimum, and SupCB-type online algorithms with regret bound ~O(W sqrt(d T log N)/L0), as well as a Thompson sampling alternative. Specializations to assortment-only and pricing-only settings recover near-minimax optimal rates.
Significance. Assuming the new confidence region is valid and the boundedness conditions hold, the results provide an improved regret bound for the joint problem compared to prior work and unify it with the more developed separate literatures on assortment optimization and dynamic pricing. The local suboptimality approach in the offline case is innovative, and recovering optimal rates in special cases strengthens the contribution. The dual provision of online and offline methods, including a computationally simpler Thompson sampling option, enhances the practical relevance. The work is significant for operations research and revenue management applications in retail.
major comments (3)
- [Abstract and confidence region construction] The validity of the novel confidence region for demand estimation under price-dependent features is central to all main results. The construction must rigorously handle the dependence between prices and features in the MNL probabilities to ensure correct coverage and the claimed tightness; without detailed proof steps, it is unclear whether the radius supports the regret bound ~O(W sqrt(d T log N)/L0) and the local suboptimality guarantee.
- [Online regret analysis] The SupCB-type algorithm improves the regret to ~O(W sqrt(d T log N)/L0). However, this bound relies on the parameter L0 (a lower bound presumably on the minimum eigenvalue or sensitivity); the paper needs to explicitly state the assumptions ensuring L0 is bounded away from zero and how it relates to the model parameters, as the bound becomes vacuous otherwise.
- [Offline algorithm] The pessimistic offline algorithm's suboptimality guarantee is based on local information around the optimal assortment-price pair rather than exact coverage. The derivation from the confidence region to this local guarantee should be detailed to confirm it is load-bearing and not relying on post-hoc assumptions.
minor comments (3)
- [Notation] The parameters W, d, L0, N should be clearly defined at the beginning, including their interpretations and any boundedness assumptions.
- [Abstract] The abstract mentions recovering near-minimax-optimal rates but does not cite the specific prior works establishing those rates for comparison.
- [Thompson sampling] Details on the Thompson sampling alternative, such as its regret bound or computational advantages, should be expanded beyond the abstract mention.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our contribution and for the constructive major comments. We address each point below with clarifications and commit to revisions that strengthen the exposition without altering the core results.
read point-by-point responses
-
Referee: The validity of the novel confidence region for demand estimation under price-dependent features is central to all main results. The construction must rigorously handle the dependence between prices and features in the MNL probabilities to ensure correct coverage and the claimed tightness; without detailed proof steps, it is unclear whether the radius supports the regret bound ~O(W sqrt(d T log N)/L0) and the local suboptimality guarantee.
Authors: We appreciate this observation. The confidence region is derived via a self-normalized martingale bound that explicitly accounts for the price-dependent feature vectors inside the MNL probabilities; the radius is obtained by solving a fixed-point equation that incorporates the worst-case Lipschitz constant of the logit link. The full proof appears in Appendix A. To address the concern, we will insert a concise proof sketch (including the key martingale step and the resulting radius expression) into Section 3.1 of the main text, confirming that the radius is tight enough to support both the online regret and the local offline guarantee. revision: yes
-
Referee: The SupCB-type algorithm improves the regret to ~O(W sqrt(d T log N)/L0). However, this bound relies on the parameter L0 (a lower bound presumably on the minimum eigenvalue or sensitivity); the paper needs to explicitly state the assumptions ensuring L0 is bounded away from zero and how it relates to the model parameters, as the bound becomes vacuous otherwise.
Authors: We agree that the dependence on L0 must be stated explicitly. L0 is defined as the smallest eigenvalue of the expected outer-product matrix of the (price-augmented) feature vectors under the optimal policy; it is bounded away from zero under the standard assumptions that feature vectors are bounded, the price grid is finite, and the optimal policy places positive mass on a set of actions whose features span R^d. We will add a new paragraph in Section 2.2 that lists these regularity conditions, relates L0 to the minimal eigenvalue of the design matrix, and notes that the bound is vacuous only when the problem is information-theoretically degenerate. A short discussion of how L0 can be estimated from data will also be included. revision: yes
-
Referee: The pessimistic offline algorithm's suboptimality guarantee is based on local information around the optimal assortment-price pair rather than exact coverage. The derivation from the confidence region to this local guarantee should be detailed to confirm it is load-bearing and not relying on post-hoc assumptions.
Authors: The local guarantee follows directly from the fact that the pessimistic objective is a uniform upper bound on the true revenue function inside a neighborhood whose radius is controlled by the confidence-region width; inside this neighborhood the revenue function satisfies a quadratic growth condition (by twice-continuous differentiability of the MNL revenue and positive definiteness of the Hessian at the optimum). The derivation therefore uses only the coverage property of the confidence region and the local curvature assumption already stated in Section 4. We will expand the proof of Theorem 4.2 with an explicit two-page sketch that isolates each step, making clear that no additional post-hoc assumptions are introduced. revision: yes
Circularity Check
No circularity: regret bound and suboptimality guarantees depend on exogenous parameters
full rationale
The paper's central derivation begins with an explicit construction of a new confidence region for demand parameters under price-dependent features, then applies this region to obtain a pessimistic offline suboptimality bound (governed by local information around the optimum) and an online SupCB regret of O~(W sqrt(d T log N)/L0). Both the regret expression and the recovery of near-minimax rates in assortment-only or pricing-only special cases are stated directly in terms of the exogenous model parameters W, d, L0 and the validity of the constructed region; none of these quantities are fitted inside the derivation or renamed as predictions. No self-citation is load-bearing for the uniqueness or validity of the confidence region, and no step reduces by construction to its own inputs. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- L0
- W
axioms (1)
- domain assumption Customer preferences follow a price-based contextual multinomial logit model.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2012.06961 , year=
Online stochastic optimization with wasserstein based non-stationarity , author=. arXiv preprint arXiv:2012.06961 , year=
-
[2]
Operations Research Letters , volume=
Capacitated assortment and price optimization under the multinomial logit model , author=. Operations Research Letters , volume=. 2012 , publisher=
work page 2012
-
[3]
Manufacturing & Service Operations Management , volume=
Dynamic joint assortment and pricing optimization with demand learning , author=. Manufacturing & Service Operations Management , volume=. 2021 , publisher=
work page 2021
-
[4]
Sequential batch learning in finite-action linear contextual bandits,
Sequential batch learning in finite-action linear contextual bandits , author=. arXiv preprint arXiv:2004.06321 , year=
-
[5]
Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs , author=. 2023 , eprint=
work page 2023
-
[6]
44th Annual IEEE Symposium on Foundations of Computer Science, 2003
The value of knowing a demand curve: Bounds on regret for online posted-price auctions , author=. 44th Annual IEEE Symposium on Foundations of Computer Science, 2003. Proceedings. , pages=. 2003 , organization=
work page 2003
-
[7]
Dynamic pricing under a general parametric choice model , author=. Operations Research , volume=. 2012 , publisher=
work page 2012
-
[8]
Advances in Neural Information Processing Systems , volume=
Logarithmic regret in feature-based dynamic pricing , author=. Advances in Neural Information Processing Systems , volume=
-
[9]
Journal of Machine Learning Research , volume=
Perishability of data: Dynamic pricing under varying-coefficient models , author=. Journal of Machine Learning Research , volume=
-
[10]
Journal of Machine Learning Research , volume=
Multi-scale online learning: Theory and applications to online auctions and pricing , author=. Journal of Machine Learning Research , volume=
-
[11]
Manufacturing & Service Operations Management , volume=
Product line selection and pricing with modularity in design , author=. Manufacturing & Service Operations Management , volume=. 2005 , publisher=
work page 2005
-
[12]
Manufacturing & Service Operations Management , volume =
Li, Hongmin and Huh, Woonghee Tim , title =. Manufacturing & Service Operations Management , volume =
-
[13]
Demand management and inventory control for substitutable products , author=. 2021 , publisher=
work page 2021
-
[14]
Production and Operations Management , volume =
Besbes, Omar and Sauré, Denis , title =. Production and Operations Management , volume =
-
[15]
arXiv preprint arXiv:2504.02324 , year=
Dynamic Assortment Selection and Pricing with Censored Preference Feedback , author=. arXiv preprint arXiv:2504.02324 , year=
-
[16]
Manufacturing & Service Operations Management , volume=
Pricing multiple products with the multinomial logit and nested logit models: Concavity and implications , author=. Manufacturing & Service Operations Management , volume=. 2011 , publisher=
work page 2011
-
[17]
Production and Operations Management , volume=
Product assortment and price competition under multinomial logit demand , author=. Production and Operations Management , volume=. 2016 , publisher=
work page 2016
-
[18]
A nonparametric joint assortment and price choice model , author=. Management Science , volume=. 2017 , publisher=
work page 2017
-
[19]
Proceedings of the 41st International Conference on Machine Learning , series =
Xu, Jianyu and Wang, Yu-Xiang , title =. Proceedings of the 41st International Conference on Machine Learning , series =. 2024 , publisher =
work page 2024
-
[20]
The impact of consumer search cost on assortment planning and pricing , author=. Management Science , volume=. 2018 , publisher=
work page 2018
-
[21]
Improved Confidence Regions and Optimal Algorithms for Online and Offline Linear
Han, Yuxuan and Blanchet, Jose and Zhou, Zhengyuan , booktitle=. Improved Confidence Regions and Optimal Algorithms for Online and Offline Linear
-
[22]
Assortment optimization and pricing under the multinomial logit model with impatient customers: Sequential recommendation and selection , author=. Operations Research , volume=. 2021 , publisher=
work page 2021
-
[23]
Dynamic pricing without knowing the demand function: Risk bounds and near-optimal algorithms , author=. Operations Research , volume=. 2009 , publisher=
work page 2009
-
[24]
2020 IEEE International Symposium on Information Theory (ISIT) , pages=
Multi-product dynamic pricing in high-dimensions with heterogeneous price sensitivity , author=. 2020 IEEE International Symposium on Information Theory (ISIT) , pages=. 2020 , organization=
work page 2020
-
[25]
A unified confidence sequence for generalized linear models, with applications to bandits
A unified confidence sequence for generalized linear models, with applications to bandits , author=. arXiv preprint arXiv:2407.13977 , year=
-
[26]
Capacitated assortment and price optimization under the multinomial logit model , journal =. 2012 , issn =. doi:https://doi.org/10.1016/j.orl.2012.08.003 , author =
-
[27]
Advances in Neural Information Processing Systems (NeurIPS) , year =
Thompson Sampling for Multinomial Logit Contextual Bandits , author =. Advances in Neural Information Processing Systems (NeurIPS) , year =
-
[28]
arXiv preprint arXiv:2503.11819 , year=
Online Assortment and Price Optimization Under Contextual Choice Models , author=. arXiv preprint arXiv:2503.11819 , year=
-
[29]
The 22nd International Conference on Artificial Intelligence and Statistics , pages=
Active ranking with subset-wise preferences , author=. The 22nd International Conference on Artificial Intelligence and Statistics , pages=. 2019 , organization=
work page 2019
-
[30]
Advances in Neural Information Processing Systems , volume=
Eluder dimension and the sample complexity of optimistic exploration , author=. Advances in Neural Information Processing Systems , volume=
-
[31]
International Conference on Algorithmic Learning Theory , pages=
Efficient local planning with linear function approximation , author=. International Conference on Algorithmic Learning Theory , pages=. 2022 , organization=
work page 2022
-
[32]
Revenue management under a general discrete choice model of consumer behavior , author=. Management Science , volume=. 2004 , publisher=
work page 2004
-
[33]
Improved Online Confidence Bounds for Multinomial Logistic Bandits , author=. arXiv preprint arXiv:2502.10020 , year=
-
[34]
Advances in Neural Information Processing Systems , volume=
Online (multinomial) logistic bandit: Improved regret and constant computation cost , author=. Advances in Neural Information Processing Systems , volume=
-
[35]
Advances in Neural Information Processing Systems , volume=
Double pessimism is provably efficient for distributionally robust offline reinforcement learning: Generic algorithm and robust partial coverage , author=. Advances in Neural Information Processing Systems , volume=
-
[36]
Advances in neural information processing systems , volume=
Improved algorithms for linear stochastic bandits , author=. Advances in neural information processing systems , volume=
-
[37]
arXiv preprint arXiv:2502.06777
Learning an Optimal Assortment Policy under Observational Data , author=. arXiv preprint arXiv:2502.06777 , year=
-
[38]
Modeling, Prediction, Assortment and Price Optimization Under Consumer Choice Behavior , author=. Foundations and Trends. 2025 , publisher=
work page 2025
-
[39]
Introduction to multi-armed bandits , author=. Foundations and Trends. 2019 , publisher=
work page 2019
- [40]
- [41]
- [42]
-
[43]
MNL-bandit: A dynamic learning approach to assortment selection , author=. Operations Research , volume=. 2019 , publisher=
work page 2019
-
[44]
Mathematics of Operations Research , volume=
Robust Markov decision processes , author=. Mathematics of Operations Research , volume=. 2013 , publisher=
work page 2013
-
[45]
A comparative empirical study of discrete choice models in retail operations , author=. Management Science , volume=. 2022 , publisher=
work page 2022
-
[46]
Manufacturing & Service Operations Management , volume=
Optimal dynamic assortment planning with demand learning , author=. Manufacturing & Service Operations Management , volume=. 2013 , publisher=
work page 2013
-
[47]
Conference on learning theory , pages=
Thompson sampling for the mnl-bandit , author=. Conference on learning theory , pages=. 2017 , organization=
work page 2017
-
[48]
Robust assortment optimization under the markov chain choice model , author=. Operations Research , year=
-
[49]
Near-optimal Regret Bounds for Reinforcement Learning , author=. J. Mach. Learn. Res. , year=
-
[50]
Randomized assortment optimization , author=. Operations Research , year=
-
[51]
Advances in Neural Information Processing Systems , volume=
Simple and fast algorithm for binary integer and online linear programming , author=. Advances in Neural Information Processing Systems , volume=
-
[52]
Dynamic assortment with demand learning for seasonal consumer goods , author=. Management science , volume=. 2007 , publisher=
work page 2007
-
[53]
Contextual bandits with linear payoff functions , author=. Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics , pages=. 2011 , organization=
work page 2011
-
[54]
International Conference on Machine Learning , pages=
Provably optimal algorithms for generalized linear contextual bandits , author=. International Conference on Machine Learning , pages=. 2017 , organization=
work page 2017
-
[55]
Decoupling Learning and Decision-Making: Breaking the O(
Gao, Wenzhi and Sun, Chunlin and Xue, Chenyu and Ge, Dongdong and Ye, Yinyu , journal=. Decoupling Learning and Decision-Making: Breaking the O(
-
[56]
Canadian Journal of Mathematics , volume=
The equivalence of two extremum problems , author=. Canadian Journal of Mathematics , volume=. 1960 , publisher=
work page 1960
-
[57]
Mathematics of Operations Research , volume=
Delay-adaptive learning in generalized linear contextual bandits , author=. Mathematics of Operations Research , volume=. 2024 , publisher=
work page 2024
-
[58]
Theoretical Computer Science , volume=
Parameter identification in Markov chain choice models , author=. Theoretical Computer Science , volume=. 2020 , publisher=
work page 2020
-
[59]
Advances in Neural Information Processing Systems , volume=
Dynamic pricing and assortment under a contextual MNL demand , author=. Advances in Neural Information Processing Systems , volume=
-
[60]
Operations Research Letters , volume=
A note on a tight lower bound for capacitated MNL-bandit assortment selection models , author=. Operations Research Letters , volume=. 2018 , publisher=
work page 2018
-
[61]
Advances in Neural Information Processing Systems , volume=
Thompson sampling for multinomial logit contextual bandits , author=. Advances in Neural Information Processing Systems , volume=
- [62]
- [63]
-
[64]
Multinomial probit: the theory and its application to demand forecasting , author=. 2014 , publisher=
work page 2014
-
[65]
The exponomial choice model: A new alternative for assortment and price optimization , author=. Operations Research , volume=. 2016 , publisher=
work page 2016
-
[66]
Journal of applied Econometrics , volume=
Mixed MNL models for discrete response , author=. Journal of applied Econometrics , volume=. 2000 , publisher=
work page 2000
-
[67]
Assortment optimization under variants of the nested logit model , author=. Operations Research , volume=. 2014 , publisher=
work page 2014
-
[68]
Discrete choice methods with simulation , author=. 2009 , publisher=
work page 2009
-
[69]
Proceedings of the AAAI conference on artificial intelligence , volume=
Multinomial logit contextual bandits: Provable optimality and practicality , author=. Proceedings of the AAAI conference on artificial intelligence , volume=
-
[70]
An expectation-maximization algorithm to estimate the parameters of the Markov chain choice model , author=. Operations Research , volume=. 2018 , publisher=
work page 2018
-
[71]
International Conference on Artificial Intelligence and Statistics , pages=
Instance-wise minimax-optimal algorithms for logistic bandits , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2021 , organization=
work page 2021
-
[72]
International Conference on Machine Learning , pages=
Improved optimistic algorithms for logistic bandits , author=. International Conference on Machine Learning , pages=. 2020 , organization=
work page 2020
-
[73]
Assortment planning under the multinomial logit model with totally unimodular constraint structures , author=. Work in Progress , year=
-
[74]
Advances in Neural Information Processing Systems , volume=
Bridging offline reinforcement learning and imitation learning: A tale of pessimism , author=. Advances in Neural Information Processing Systems , volume=
-
[75]
Advances in Neural Information Processing Systems , volume=
Pessimism for Offline Linear Contextual Bandits using ell\_p Confidence Sets , author=. Advances in Neural Information Processing Systems , volume=
-
[76]
Operations Research Letters , volume=
On the tightness of an LP relaxation for rational optimization and its applications , author=. Operations Research Letters , volume=. 2016 , publisher=
work page 2016
-
[77]
Dynamic assortment optimization with a multinomial logit choice model and capacity constraint , author=. Operations research , volume=. 2010 , publisher=
work page 2010
-
[78]
Mathematics of Operations Research , volume=
Optimal policy for dynamic assortment planning under multinomial logit models , author=. Mathematics of Operations Research , volume=. 2021 , publisher=
work page 2021
-
[79]
Production and Operations Management , volume=
Dynamic assortment planning under nested logit models , author=. Production and Operations Management , volume=. 2021 , publisher=
work page 2021
-
[80]
Journal of machine learning research , volume=
Dynamic assortment optimization with changing contextual information , author=. Journal of machine learning research , volume=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.