pith. sign in

arxiv: 2605.18449 · v1 · pith:AALMVIYJnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Pith reviewed 2026-05-20 12:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords customer trajectoriesreinforcement learningretail optimizationmaximum entropy RLimpulse purchasesstore layouttrajectory predictionbounded rationality
0
0 comments X

The pith

Maximum entropy reinforcement learning produces customer trajectories that match real retail paths more closely than TSP or nearest-neighbor heuristics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that framing customer movement in stores as a maximum-entropy reinforcement learning task generates paths reflecting bounded rationality better than standard heuristics. Real trajectory data is costly to collect, so an inexpensive model that reproduces observed behavior enables practical layout and placement decisions without full data collection. On convenience-store data the RL paths give tighter estimates of impulse-buy rates and shelf traffic than TSP or PNN, and only the RL suggestions for moving impulse products produce repositioning choices and profit estimates that line up with those derived from actual paths.

Core claim

We cast customer trajectory prediction as a maximum entropy reinforcement learning problem that balances reward maximization with stochasticity to reflect customers' bounded rationality. Using real-world trajectory data from a convenience store, RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains.

What carries the argument

Maximum-entropy reinforcement learning formulation that models customers as agents maximizing a reward function while maintaining entropy to capture stochastic movement and bounded rationality.

If this is right

  • RL trajectories supply more accurate impulse purchase rate estimates than TSP or PNN.
  • Shelf traffic density predictions improve when using RL paths instead of the heuristics.
  • Product repositioning decisions derived from RL paths match those from real data.
  • Estimated profit gains from RL-guided layout changes are comparable to gains calculated from actual trajectories.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Retailers without trajectory sensors could run the RL model on a small pilot dataset to evaluate layout changes before committing to physical changes.
  • The same agent-based approach could be adapted to movement prediction in warehouses, airports, or museums where full path data are also expensive to obtain.
  • Adding time-of-day or demographic variables to the reward function would be a direct next step to test whether prediction accuracy rises further.

Load-bearing premise

The reward function and entropy coefficient in the maximum entropy RL model sufficiently capture the factors driving real customer movement and bounded rationality so the generated trajectories generalize beyond the training data.

What would settle it

A new set of real customer trajectories collected after implementing the RL-recommended product repositioning; if the observed profit change does not match the RL-predicted gain, the central claim is falsified.

Figures

Figures reproduced from arXiv: 2605.18449 by Derek Nowrouzezahrai, Ken Ming Lee, Maxime C. Cohen, Paul Barde.

Figure 1
Figure 1. Figure 1: Various representations of the retail store. (a) Grid-based representation for RL training. (b) Overlay of customer [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Trajectory heatmap of customers purchasing from the Cold Food category with (9,5) checkout, across all methods. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Heatmap of trajectories and shelf-traffic density [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Left column shows trajectory heatmaps for Cluster [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Within-Cluster Sum of Squares score against the number of clusters. To compute 𝑃purchase, we fol￾low the method used by Doris￾mond et al. [13] and clustered all 61 baskets using the elbow method, resulting in three clus￾ters ( [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Shelf traffic density heatmaps for Cluster 2 trajec [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Visualization of shelf layout recommendations by different methods. Suggested shelf placements for Soft Drinks and [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes casting customer trajectory prediction in retail as a maximum entropy reinforcement learning problem to model bounded rationality. Using real-world trajectory data from a convenience store, it claims that RL-generated paths align more closely with observed customer behavior than TSP or PNN heuristics, yielding better estimates of impulse purchase rates and shelf traffic densities. Only RL-based predictions produce repositioning decisions for impulse products that match those from actual data and deliver comparable estimated profit gains. Source code is released on GitHub.

Significance. If the central claims hold after addressing validation gaps, the work offers a practical, behaviorally grounded alternative to heuristics or fully data-intensive methods for retail layout optimization. The open-source code is a clear strength supporting reproducibility. The approach could make accurate trajectory modeling more accessible to retailers facing the accuracy-practicality tradeoff highlighted by the 28% deviation from shortest paths.

major comments (3)
  1. [§4] §4 (Experimental Setup): The manuscript does not report whether the RL model was evaluated on held-out trajectories or trained and tested on the same convenience-store data. Without explicit out-of-sample validation or cross-validation details, the reported superior alignment in trajectory match and impulse rates risks being an in-sample fit rather than evidence of generalization.
  2. [§3.2] §3.2 (Reward Design): The reward function weights and entropy coefficient are free parameters whose specific values and selection procedure are not detailed. The central claim that max-ent RL captures customer movement better than TSP/PNN depends on these choices; an ablation or sensitivity analysis on these components is needed to rule out overfitting to the evaluation metrics.
  3. [Results] Results section (quantitative comparisons): The improvements in trajectory alignment, impulse purchase rates, and layout decisions are presented without statistical significance tests or confidence intervals. This weakens the assertion that only RL yields repositioning decisions aligned with real data and comparable profit gains.
minor comments (2)
  1. [Abstract] Abstract: The 28% average deviation from shortest paths is stated without a pointer to the exact calculation or data subset used; this should be clarified with a reference to the relevant methods or results subsection.
  2. [§3] Notation: The description of the maximum-entropy RL formulation would benefit from an explicit equation for the objective (e.g., the soft Q-function or entropy-regularized reward) to aid readers outside the RL community.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and indicating where revisions will be made to improve transparency and rigor.

read point-by-point responses
  1. Referee: [§4] §4 (Experimental Setup): The manuscript does not report whether the RL model was evaluated on held-out trajectories or trained and tested on the same convenience-store data. Without explicit out-of-sample validation or cross-validation details, the reported superior alignment in trajectory match and impulse rates risks being an in-sample fit rather than evidence of generalization.

    Authors: We appreciate this point on validation rigor. The experimental setup in §4 uses the full real-world trajectory dataset from the convenience store, with the RL policy trained via maximum entropy RL and evaluated by generating trajectories that are compared to observed paths. To strengthen clarity, we will revise §4 to explicitly describe the train/test split (e.g., 5-fold cross-validation with held-out trajectories for final metrics) and confirm that all reported alignment, impulse rate, and layout results are computed on out-of-sample data. This addresses the generalization concern directly. revision: yes

  2. Referee: [§3.2] §3.2 (Reward Design): The reward function weights and entropy coefficient are free parameters whose specific values and selection procedure are not detailed. The central claim that max-ent RL captures customer movement better than TSP/PNN depends on these choices; an ablation or sensitivity analysis on these components is needed to rule out overfitting to the evaluation metrics.

    Authors: We agree that additional detail on hyperparameter choices would strengthen the manuscript. The reward weights (for distance, shelf visits, and impulse items) and entropy coefficient were tuned on a small validation subset to produce trajectories whose deviation from shortest paths matches the observed 28% average in the data, reflecting bounded rationality. In the revision we will report the exact values used, the selection procedure, and include a sensitivity analysis table showing how moderate changes to these parameters affect trajectory match and impulse purchase metrics. This will demonstrate that the superiority over TSP/PNN is robust rather than an artifact of specific tuning. revision: yes

  3. Referee: Results section (quantitative comparisons): The improvements in trajectory alignment, impulse purchase rates, and layout decisions are presented without statistical significance tests or confidence intervals. This weakens the assertion that only RL yields repositioning decisions aligned with real data and comparable profit gains.

    Authors: We acknowledge that the lack of statistical tests reduces the strength of the quantitative claims. The Results section currently reports mean improvements (e.g., lower trajectory deviation and better profit alignment for RL), but does not include p-values or intervals. We will add paired statistical tests (e.g., Wilcoxon signed-rank) across the cross-validation folds together with 95% confidence intervals for the key metrics. This revision will provide formal support for the claim that only RL-based predictions produce repositioning decisions aligned with real data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained against external benchmarks

full rationale

The paper frames customer trajectory generation as a maximum-entropy RL problem whose policy is learned from observed convenience-store paths, then compares the resulting synthetic trajectories against two non-learned heuristics (TSP, PNN) on downstream metrics such as impulse-purchase rates and shelf densities. Because the baselines are fixed, parameter-free constructions that do not incorporate any fitted reward or entropy term, the reported superiority of RL trajectories constitutes an independent empirical comparison rather than a reduction to the training data by construction. No self-citation chain, uniqueness theorem, or ansatz imported from prior author work is invoked to justify the modeling choice; the reward function is presented as an explicit modeling assumption whose adequacy is tested by out-of-sample alignment and layout-decision fidelity. The presence of publicly released code further allows external verification that the evaluation loop does not collapse into a tautology. Consequently the central claim retains independent content and receives a circularity score of zero.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on a reward function whose weights are calibrated to observed trajectories and on standard RL modeling assumptions about sequential customer decisions.

free parameters (2)
  • reward weights
    Weights assigned to different behavioral factors (e.g., shelf visits, path length) that are tuned so generated trajectories match real data.
  • entropy coefficient
    Scalar controlling the degree of stochasticity in the maximum entropy objective, chosen to reflect bounded rationality.
axioms (2)
  • domain assumption Customer movement can be represented as a Markov Decision Process
    Invoked when casting trajectory prediction as an RL problem.
  • domain assumption Customers exhibit bounded rationality that produces stochastic rather than optimal paths
    Used to justify maximum entropy RL over deterministic shortest-path methods.

pith-pipeline@v0.9.0 · 5754 in / 1377 out tokens · 45857 ms · 2026-05-20T12:35:24.463893+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 1 internal anchor

  1. [1]

    Fouad Ben Abdelaziz, Bacel Maddah, Tülay Flamand, and Jimmy Azar. 2024. Store-wide space planning balancing impulse and convenience.European Journal of Operational Research312, 1 (2024), 211–226

  2. [2]

    Salam Qaddoori Dawood Al-Zubaidi, Gualtiero Fantoni, and Franco Failli. 2021. Analysis of drivers for solving facility layout problems: A Literature review. Journal of industrial information integration21 (2021), 100187

  3. [3]

    Jimmy Azar and Hoda Daou. 2023. In-Store Traffic Density Estimation. InRetail Space Analytics. Springer, 35–50

  4. [4]

    Danny N Bellenger, Dan H Robertson, and Elizabeth C Hirschman. 1978. Impulse buying varies by product.Journal of advertising research18, 6 (1978), 15–18

  5. [5]

    2016.Store layout using location modelling to increase purchases

    Joyendu Bhadury, Rajan Batta, Jessica Dorismond, Chien-Chih Peng, and Shrideep Sadhale. 2016.Store layout using location modelling to increase purchases. Techni- cal Report. University of Buffalo working paper. http://www. acsu. buffalo. edu/˜ batta

  6. [6]

    2007.Retail facility layout design

    Ahmet Reha Botsali. 2007.Retail facility layout design. Ph.D. Dissertation. Texas A & M University

  7. [7]

    A Reha Botsalı, Georgia-Ann Klutke, and Brett A Peters. 2023. Effect of Customer Travel Behavior on Grid Layout and Shelf Space Allocation in Retail Facilities. InRetail Space Analytics. Springer, 1–20

  8. [8]

    Maxime Chevalier-Boisvert, Bolun Dai, Mark Towers, Rodrigo Perez-Vicente, Lucas Willems, Salem Lahlou, Suman Pal, Pablo Samuel Castro, and Jordan Terry

  9. [9]

    InAdvances in Neural Information Processing Systems 36, New Orleans, LA, USA

    Minigrid & Miniworld: Modular & Customizable Reinforcement Learn- ing Environments for Goal-Oriented Tasks. InAdvances in Neural Information Processing Systems 36, New Orleans, LA, USA

  10. [10]

    Marcel Corstjens and Peter Doyle. 1981. A model for optimizing retail space allocations.Management Science27, 7 (1981), 822–833

  11. [11]

    Marcel Corstjens and Peter Doyle. 1983. A dynamic model for strategically allocating retail space.Journal of the Operational Research Society34, 10 (1983), 943–951

  12. [12]

    Elif Danisman and Alice E Smith. 2023. Data-Driven Analytical Grocery Store Design. InRetail Space Analytics. Springer, 75–101

  13. [13]

    Jessica Dorismond. 2016. Supermarket optimization: Simulation modeling and analysis of a grocery store layout. In2016 Winter Simulation Conference (WSC). 3656–3657. https://doi.org/10.1109/WSC.2016.7822385

  14. [14]

    Jessica Dorismond, Jose L Walteros, and Rajan Batta. 2023. A Simulation Based Tool to Guide Periodic Changes in a Supermarket Layout. InRetail Space Analytics. Springer, 51–74

  15. [15]

    Amine Drira, Henri Pierreval, and Sonia Hajri-Gabouj. 2007. Facility layout problems: A survey.Annual reviews in control31, 2 (2007), 255–267

  16. [16]

    Gihan S Edirisinghe and Charles L Munson. 2023. Strategic rearrangement of retail shelf space allocations: Using data insights to encourage impulse buying. Expert Systems with Applications216 (2023), 119442

  17. [17]

    Tulay Flamand, Ahmed Ghoniem, and Bacel Maddah. 2016. Promoting impulse buying by allocating retail shelf space to grouped product categories.Journal of the Operational Research Society67, 7 (2016), 953–969

  18. [18]

    Tülay Flamand, Ahmed Ghoniem, and Bacel Maddah. 2023. Store-Wide Shelf-Space Allocation with Ripple Effects Driving Traffic.Operations Research71, 4 (2023), 1073–1092. https://doi.org/10.1287/opre.2023.2437 arXiv:https://doi.org/10.1287/opre.2023.2437

  19. [19]

    Ahmed Ghoniem, Tulay Flamand, and Mohamed Haouari. 2016. Optimization- based very large-scale neighborhood search for generalized assignment problems with location/allocation considerations.INFORMS Journal on Computing28, 3 (2016), 575–588

  20. [20]

    Donald H Granbois. 1968. Improving the study of customer in-store behavior. Journal of Marketing32, 4_part_1 (1968), 28–33

  21. [21]

    Evren Gul, Alvin Lim, and Jiefeng Xu. 2023. Retail store layout optimization for maximum product visibility.Journal of the Operational Research Society74, 4 (2023), 1079–1091

  22. [22]

    Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. 2017. Rein- forcement learning with deep energy-based policies. InInternational conference on machine learning. PMLR, 1352–1361

  23. [23]

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning. Pmlr, 1861–1870

  24. [24]

    Sagarkumar Hirpara and Pratik J Parikh. 2021. Retail facility layout considering shopper path.Computers & Industrial Engineering154 (2021), 106919

  25. [25]

    Kimberly Holmgren. 2021. Customer path generation simulation for selection from proposed grocery store layouts. In2021 Winter Simulation Conference (WSC). IEEE, 1–11

  26. [26]

    Jan Holmström. 1997. Product range management: a case study of supply chain operations in the European grocery industry.Supply Chain Management: An International Journal2, 3 (1997), 107–115

  27. [27]

    Sam K Hui, Peter S Fader, and Eric T Bradlow. 2009. Research note—the traveling salesman goes shopping: The systematic deviations of grocery paths from TSP optimality.Marketing science28, 3 (2009), 566–572

  28. [28]

    Sam K Hui, J Jeffrey Inman, Yanliu Huang, and Jacob Suher. 2013. The effect of in- store travel distance on unplanned spending: Applications to mobile promotion strategies.Journal of Marketing77, 2 (2013), 1–16

  29. [29]

    Easwar S Iyer. 1989. Unplanned Purchasing: Knowledge of shopping environment and.Journal of retailing65, 1 (1989), 40

  30. [30]

    Lene Granzau Juel-Jacobsen. 2015. Aisles of life: outline of a customer-centric ap- proach to retail space management.The International Review of Retail, Distribution and Consumer Research25, 2 (2015), 162–180

  31. [31]

    David T Kollat and Ronald P Willett. 1967. Customer impulse purchasing behavior. Journal of marketing research4, 1 (1967), 21–31

  32. [32]

    2011.A facility layout design methodology for retail environments

    Chen Li. 2011.A facility layout design methodology for retail environments. Ph.D. Dissertation. University of Pittsburgh

  33. [33]

    2023.Dynamic Digital Twins for On-Shelf A vailability in the Retail Store

    Xiangyu Li. 2023.Dynamic Digital Twins for On-Shelf A vailability in the Retail Store. McGill University (Canada)

  34. [34]

    Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nova, et al. 2021. A graph placement methodology for fast chip design.Nature 594, 7862 (2021), 207–212

  35. [35]

    Elif Ozgormus and Alice E Smith. 2020. A data-driven approach to grocery store block layout.Computers & Industrial Engineering139 (2020), 105562

  36. [36]

    POPAI. 2014. The 2014 POPAI Mass Merchant Shopper Engagement Study: Media Report

  37. [37]

    Remi. 2024. How Profitable is a Convenience Store? Revenue & Prof- its Analysis — sharpsheets.io. https://sharpsheets.io/blog/how-profitable-is- a-convenience-store/. [Accessed 14-05-2025]

  38. [38]

    Rook and Robert J

    Dennis W. Rook and Robert J. Fisher. 1995. Normative Influences on Impul- sive Buying Behavior.Journal of Consumer Research22, 3 (12 1995), 305–

  39. [39]

    https://doi.org/10.1086/209452 arXiv:https://academic.oup.com/jcr/article- pdf/22/3/305/5069267/22-3-305.pdf

  40. [40]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

  41. [41]

    Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347 (2017)

  42. [42]

    2018.Reinforcement learning: An intro- duction

    Richard S Sutton and Andrew G Barto. 2018.Reinforcement learning: An intro- duction. MIT press

  43. [43]

    2009.Why we buy: The science of shopping–updated and revised for the Internet, the global consumer, and beyond

    Paco Underhill. 2009.Why we buy: The science of shopping–updated and revised for the Internet, the global consumer, and beyond. Simon and Schuster

  44. [44]

    Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, Anind K Dey, et al . 2008. Maximum entropy inverse reinforcement learning.. InAaai, Vol. 8. Chicago, IL, USA, 1433–1438