Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights
Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3
The pith
Maximum entropy reinforcement learning produces customer trajectories that match real retail paths more closely than TSP or nearest-neighbor heuristics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We cast customer trajectory prediction as a maximum entropy reinforcement learning problem that balances reward maximization with stochasticity to reflect customers' bounded rationality. Using real-world trajectory data from a convenience store, RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains.
What carries the argument
Maximum-entropy reinforcement learning formulation that models customers as agents maximizing a reward function while maintaining entropy to capture stochastic movement and bounded rationality.
If this is right
- RL trajectories supply more accurate impulse purchase rate estimates than TSP or PNN.
- Shelf traffic density predictions improve when using RL paths instead of the heuristics.
- Product repositioning decisions derived from RL paths match those from real data.
- Estimated profit gains from RL-guided layout changes are comparable to gains calculated from actual trajectories.
Where Pith is reading between the lines
- Retailers without trajectory sensors could run the RL model on a small pilot dataset to evaluate layout changes before committing to physical changes.
- The same agent-based approach could be adapted to movement prediction in warehouses, airports, or museums where full path data are also expensive to obtain.
- Adding time-of-day or demographic variables to the reward function would be a direct next step to test whether prediction accuracy rises further.
Load-bearing premise
The reward function and entropy coefficient in the maximum entropy RL model sufficiently capture the factors driving real customer movement and bounded rationality so the generated trajectories generalize beyond the training data.
What would settle it
A new set of real customer trajectories collected after implementing the RL-recommended product repositioning; if the observed profit change does not match the RL-predicted gain, the central claim is falsified.
Figures
read the original abstract
Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes casting customer trajectory prediction in retail as a maximum entropy reinforcement learning problem to model bounded rationality. Using real-world trajectory data from a convenience store, it claims that RL-generated paths align more closely with observed customer behavior than TSP or PNN heuristics, yielding better estimates of impulse purchase rates and shelf traffic densities. Only RL-based predictions produce repositioning decisions for impulse products that match those from actual data and deliver comparable estimated profit gains. Source code is released on GitHub.
Significance. If the central claims hold after addressing validation gaps, the work offers a practical, behaviorally grounded alternative to heuristics or fully data-intensive methods for retail layout optimization. The open-source code is a clear strength supporting reproducibility. The approach could make accurate trajectory modeling more accessible to retailers facing the accuracy-practicality tradeoff highlighted by the 28% deviation from shortest paths.
major comments (3)
- [§4] §4 (Experimental Setup): The manuscript does not report whether the RL model was evaluated on held-out trajectories or trained and tested on the same convenience-store data. Without explicit out-of-sample validation or cross-validation details, the reported superior alignment in trajectory match and impulse rates risks being an in-sample fit rather than evidence of generalization.
- [§3.2] §3.2 (Reward Design): The reward function weights and entropy coefficient are free parameters whose specific values and selection procedure are not detailed. The central claim that max-ent RL captures customer movement better than TSP/PNN depends on these choices; an ablation or sensitivity analysis on these components is needed to rule out overfitting to the evaluation metrics.
- [Results] Results section (quantitative comparisons): The improvements in trajectory alignment, impulse purchase rates, and layout decisions are presented without statistical significance tests or confidence intervals. This weakens the assertion that only RL yields repositioning decisions aligned with real data and comparable profit gains.
minor comments (2)
- [Abstract] Abstract: The 28% average deviation from shortest paths is stated without a pointer to the exact calculation or data subset used; this should be clarified with a reference to the relevant methods or results subsection.
- [§3] Notation: The description of the maximum-entropy RL formulation would benefit from an explicit equation for the objective (e.g., the soft Q-function or entropy-regularized reward) to aid readers outside the RL community.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will be made to improve transparency and rigor.
read point-by-point responses
-
Referee: [§4] §4 (Experimental Setup): The manuscript does not report whether the RL model was evaluated on held-out trajectories or trained and tested on the same convenience-store data. Without explicit out-of-sample validation or cross-validation details, the reported superior alignment in trajectory match and impulse rates risks being an in-sample fit rather than evidence of generalization.
Authors: We appreciate this point on validation rigor. The experimental setup in §4 uses the full real-world trajectory dataset from the convenience store, with the RL policy trained via maximum entropy RL and evaluated by generating trajectories that are compared to observed paths. To strengthen clarity, we will revise §4 to explicitly describe the train/test split (e.g., 5-fold cross-validation with held-out trajectories for final metrics) and confirm that all reported alignment, impulse rate, and layout results are computed on out-of-sample data. This addresses the generalization concern directly. revision: yes
-
Referee: [§3.2] §3.2 (Reward Design): The reward function weights and entropy coefficient are free parameters whose specific values and selection procedure are not detailed. The central claim that max-ent RL captures customer movement better than TSP/PNN depends on these choices; an ablation or sensitivity analysis on these components is needed to rule out overfitting to the evaluation metrics.
Authors: We agree that additional detail on hyperparameter choices would strengthen the manuscript. The reward weights (for distance, shelf visits, and impulse items) and entropy coefficient were tuned on a small validation subset to produce trajectories whose deviation from shortest paths matches the observed 28% average in the data, reflecting bounded rationality. In the revision we will report the exact values used, the selection procedure, and include a sensitivity analysis table showing how moderate changes to these parameters affect trajectory match and impulse purchase metrics. This will demonstrate that the superiority over TSP/PNN is robust rather than an artifact of specific tuning. revision: yes
-
Referee: Results section (quantitative comparisons): The improvements in trajectory alignment, impulse purchase rates, and layout decisions are presented without statistical significance tests or confidence intervals. This weakens the assertion that only RL yields repositioning decisions aligned with real data and comparable profit gains.
Authors: We acknowledge that the lack of statistical tests reduces the strength of the quantitative claims. The Results section currently reports mean improvements (e.g., lower trajectory deviation and better profit alignment for RL), but does not include p-values or intervals. We will add paired statistical tests (e.g., Wilcoxon signed-rank) across the cross-validation folds together with 95% confidence intervals for the key metrics. This revision will provide formal support for the claim that only RL-based predictions produce repositioning decisions aligned with real data. revision: yes
Circularity Check
No significant circularity; derivation remains self-contained against external benchmarks
full rationale
The paper frames customer trajectory generation as a maximum-entropy RL problem whose policy is learned from observed convenience-store paths, then compares the resulting synthetic trajectories against two non-learned heuristics (TSP, PNN) on downstream metrics such as impulse-purchase rates and shelf densities. Because the baselines are fixed, parameter-free constructions that do not incorporate any fitted reward or entropy term, the reported superiority of RL trajectories constitutes an independent empirical comparison rather than a reduction to the training data by construction. No self-citation chain, uniqueness theorem, or ansatz imported from prior author work is invoked to justify the modeling choice; the reward function is presented as an explicit modeling assumption whose adequacy is tested by out-of-sample alignment and layout-decision fidelity. The presence of publicly released code further allows external verification that the evaluation loop does not collapse into a tautology. Consequently the central claim retains independent content and receives a circularity score of zero.
Axiom & Free-Parameter Ledger
free parameters (2)
- reward weights
- entropy coefficient
axioms (2)
- domain assumption Customer movement can be represented as a Markov Decision Process
- domain assumption Customers exhibit bounded rationality that produces stochastic rather than optimal paths
Reference graph
Works this paper leans on
-
[1]
Fouad Ben Abdelaziz, Bacel Maddah, Tülay Flamand, and Jimmy Azar. 2024. Store-wide space planning balancing impulse and convenience.European Journal of Operational Research312, 1 (2024), 211–226
work page 2024
-
[2]
Salam Qaddoori Dawood Al-Zubaidi, Gualtiero Fantoni, and Franco Failli. 2021. Analysis of drivers for solving facility layout problems: A Literature review. Journal of industrial information integration21 (2021), 100187
work page 2021
-
[3]
Jimmy Azar and Hoda Daou. 2023. In-Store Traffic Density Estimation. InRetail Space Analytics. Springer, 35–50
work page 2023
-
[4]
Danny N Bellenger, Dan H Robertson, and Elizabeth C Hirschman. 1978. Impulse buying varies by product.Journal of advertising research18, 6 (1978), 15–18
work page 1978
-
[5]
2016.Store layout using location modelling to increase purchases
Joyendu Bhadury, Rajan Batta, Jessica Dorismond, Chien-Chih Peng, and Shrideep Sadhale. 2016.Store layout using location modelling to increase purchases. Techni- cal Report. University of Buffalo working paper. http://www. acsu. buffalo. edu/˜ batta
work page 2016
-
[6]
2007.Retail facility layout design
Ahmet Reha Botsali. 2007.Retail facility layout design. Ph.D. Dissertation. Texas A & M University
work page 2007
-
[7]
A Reha Botsalı, Georgia-Ann Klutke, and Brett A Peters. 2023. Effect of Customer Travel Behavior on Grid Layout and Shelf Space Allocation in Retail Facilities. InRetail Space Analytics. Springer, 1–20
work page 2023
-
[8]
Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry
-
[9]
InAdvances in Neural Information Processing Systems 36, New Orleans, LA, USA
Minigrid & Miniworld: Modular & Customizable Reinforcement Learn- ing Environments for Goal-Oriented Tasks. InAdvances in Neural Information Processing Systems 36, New Orleans, LA, USA
-
[10]
Marcel Corstjens and Peter Doyle. 1981. A model for optimizing retail space allocations.Management Science27, 7 (1981), 822–833
work page 1981
-
[11]
Marcel Corstjens and Peter Doyle. 1983. A dynamic model for strategically allocating retail space.Journal of the Operational Research Society34, 10 (1983), 943–951
work page 1983
-
[12]
Elif Danisman and Alice E Smith. 2023. Data-Driven Analytical Grocery Store Design. InRetail Space Analytics. Springer, 75–101
work page 2023
-
[13]
Jessica Dorismond. 2016. Supermarket optimization: Simulation modeling and analysis of a grocery store layout. In2016 Winter Simulation Conference (WSC). 3656–3657. https://doi.org/10.1109/WSC.2016.7822385
-
[14]
Jessica Dorismond, Jose L Walteros, and Rajan Batta. 2023. A Simulation Based Tool to Guide Periodic Changes in a Supermarket Layout. InRetail Space Analytics. Springer, 51–74
work page 2023
-
[15]
Amine Drira, Henri Pierreval, and Sonia Hajri-Gabouj. 2007. Facility layout problems: A survey.Annual reviews in control31, 2 (2007), 255–267
work page 2007
-
[16]
Gihan S Edirisinghe and Charles L Munson. 2023. Strategic rearrangement of retail shelf space allocations: Using data insights to encourage impulse buying. Expert Systems with Applications216 (2023), 119442
work page 2023
-
[17]
Tulay Flamand, Ahmed Ghoniem, and Bacel Maddah. 2016. Promoting impulse buying by allocating retail shelf space to grouped product categories.Journal of the Operational Research Society67, 7 (2016), 953–969
work page 2016
-
[18]
Tülay Flamand, Ahmed Ghoniem, and Bacel Maddah. 2023. Store-Wide Shelf-Space Allocation with Ripple Effects Driving Traffic.Operations Research71, 4 (2023), 1073–1092. https://doi.org/10.1287/opre.2023.2437 arXiv:https://doi.org/10.1287/opre.2023.2437
-
[19]
Ahmed Ghoniem, Tulay Flamand, and Mohamed Haouari. 2016. Optimization- based very large-scale neighborhood search for generalized assignment problems with location/allocation considerations.INFORMS Journal on Computing28, 3 (2016), 575–588
work page 2016
-
[20]
Donald H Granbois. 1968. Improving the study of customer in-store behavior. Journal of Marketing32, 4_part_1 (1968), 28–33
work page 1968
-
[21]
Evren Gul, Alvin Lim, and Jiefeng Xu. 2023. Retail store layout optimization for maximum product visibility.Journal of the Operational Research Society74, 4 (2023), 1079–1091
work page 2023
-
[22]
Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Rein- forcement learning with deep energy-based policies. InInternational conference on machine learning. PMLR, 1352–1361
work page 2017
-
[23]
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning. Pmlr, 1861–1870
work page 2018
-
[24]
Sagarkumar Hirpara and Pratik J Parikh. 2021. Retail facility layout considering shopper path.Computers & Industrial Engineering154 (2021), 106919
work page 2021
-
[25]
Kimberly Holmgren. 2021. Customer path generation simulation for selection from proposed grocery store layouts. In2021 Winter Simulation Conference (WSC). IEEE, 1–11
work page 2021
-
[26]
Jan Holmström. 1997. Product range management: a case study of supply chain operations in the European grocery industry.Supply Chain Management: An International Journal2, 3 (1997), 107–115
work page 1997
-
[27]
Sam K Hui, Peter S Fader, and Eric T Bradlow. 2009. Research note—the traveling salesman goes shopping: The systematic deviations of grocery paths from TSP optimality.Marketing science28, 3 (2009), 566–572
work page 2009
-
[28]
Sam K Hui, J Jeffrey Inman, Yanliu Huang, and Jacob Suher. 2013. The effect of in- store travel distance on unplanned spending: Applications to mobile promotion strategies.Journal of Marketing77, 2 (2013), 1–16
work page 2013
-
[29]
Easwar S Iyer. 1989. Unplanned Purchasing: Knowledge of shopping environment and.Journal of retailing65, 1 (1989), 40
work page 1989
-
[30]
Lene Granzau Juel-Jacobsen. 2015. Aisles of life: outline of a customer-centric ap- proach to retail space management.The International Review of Retail, Distribution and Consumer Research25, 2 (2015), 162–180
work page 2015
-
[31]
David T Kollat and Ronald P Willett. 1967. Customer impulse purchasing behavior. Journal of marketing research4, 1 (1967), 21–31
work page 1967
-
[32]
2011.A facility layout design methodology for retail environments
Chen Li. 2011.A facility layout design methodology for retail environments. Ph.D. Dissertation. University of Pittsburgh
work page 2011
-
[33]
2023.Dynamic Digital Twins for On-Shelf A vailability in the Retail Store
Xiangyu Li. 2023.Dynamic Digital Twins for On-Shelf A vailability in the Retail Store. McGill University (Canada)
work page 2023
-
[34]
Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nova, et al. 2021. A graph placement methodology for fast chip design.Nature 594, 7862 (2021), 207–212
work page 2021
-
[35]
Elif Ozgormus and Alice E Smith. 2020. A data-driven approach to grocery store block layout.Computers & Industrial Engineering139 (2020), 105562
work page 2020
-
[36]
POPAI. 2014. The 2014 POPAI Mass Merchant Shopper Engagement Study: Media Report
work page 2014
-
[37]
Remi. 2024. How Profitable is a Convenience Store? Revenue & Prof- its Analysis — sharpsheets.io. https://sharpsheets.io/blog/how-profitable-is- a-convenience-store/. [Accessed 14-05-2025]
work page 2024
-
[38]
Dennis W. Rook and Robert J. Fisher. 1995. Normative Influences on Impul- sive Buying Behavior.Journal of Consumer Research22, 3 (12 1995), 305–
work page 1995
-
[39]
https://doi.org/10.1086/209452 arXiv:https://academic.oup.com/jcr/article- pdf/22/3/305/5069267/22-3-305.pdf
-
[40]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov
-
[41]
Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347 (2017)
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[42]
2018.Reinforcement learning: An intro- duction
Richard S Sutton and Andrew G Barto. 2018.Reinforcement learning: An intro- duction. MIT press
work page 2018
-
[43]
Paco Underhill. 2009.Why we buy: The science of shopping–updated and revised for the Internet, the global consumer, and beyond. Simon and Schuster
work page 2009
-
[44]
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al . 2008. Maximum entropy inverse reinforcement learning.. InAaai, Vol. 8. Chicago, IL, USA, 1433–1438
work page 2008
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.