A Simulation Environment and Reinforcement Learning Method for Waste Reduction

Maarten de Rijke; Mozhdeh Ariannezhad; Paul Groth; Sami Jullien

arxiv: 2205.15455 · v2 · pith:R4BMIJDQnew · submitted 2022-05-30 · 💻 cs.LG · cs.AI

Sami Jullien, Mozhdeh Ariannezhad, Paul Groth, Maarten de Rijke

classification 💻 cs.LG cs.AI

keywords environment, learning, reinforcement, waste, distributional, grocery, inventory, uncertainty

In retail (e.g., grocery stores, apparel shops, online retailers), inventory managers have to balance short-term risk (no items to sell) with long-term risk (over-ordering leading to product waste). This balancing task is made especially hard by the lack of information about future customer purchases. In this paper, we study the problem of restocking a grocery store's inventory of perishable items over time, from a distributional point of view. The objective is to maximize sales while minimizing waste, under uncertainty about the actual consumption by customers. This problem is highly relevant today, given the growing demand for food and the impact of food waste on the environment, the economy, and purchasing power. We frame inventory restocking as a new reinforcement learning task that exhibits stochastic behavior conditioned on the agent's actions, making the environment partially observable. We make two main contributions. First, we introduce a new reinforcement learning environment, RetaiL, based on real grocery store data and expert knowledge. This environment is highly stochastic and presents a unique challenge for reinforcement learning practitioners. We show that uncertainty about the future behavior of the environment is not handled well by classical supply chain algorithms, and that distributional approaches are a good way to account for this uncertainty. Second, we introduce GTDQN, a distributional reinforcement learning algorithm that learns a generalized Tukey Lambda distribution over the reward space. GTDQN provides a strong baseline for our environment. It outperforms other distributional reinforcement learning approaches in this partially observable setting, in both overall reward and reduction of generated waste.
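
To make the "generalized Tukey Lambda distribution" mentioned in the abstract more concrete, here is a minimal sketch of a four-parameter generalized lambda distribution, defined through its quantile function. It assumes the common Ramberg-Schmeiser parameterization, Q(u) = lam1 + (u^lam3 - (1 - u)^lam4) / lam2; the abstract does not state which parameterization GTDQN uses, so this may differ from the paper. The function names and parameter values are illustrative only and are not taken from the paper or its code.

```python
import random


def gld_quantile(u, lam1, lam2, lam3, lam4):
    """Quantile function of a generalized lambda distribution.

    Assumed Ramberg-Schmeiser form (may differ from the paper's):
        Q(u) = lam1 + (u**lam3 - (1 - u)**lam4) / lam2
    lam1 shifts the distribution, lam2 scales it, and lam3 / lam4
    shape the left and right tails respectively.
    """
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    return lam1 + (u ** lam3 - (1.0 - u) ** lam4) / lam2


def gld_sample(n, lam1, lam2, lam3, lam4, rng):
    """Draw n samples by inverse transform sampling: push uniforms through Q."""
    samples = []
    while len(samples) < n:
        u = rng.random()  # uniform on [0, 1); skip the endpoint 0
        if u > 0.0:
            samples.append(gld_quantile(u, lam1, lam2, lam3, lam4))
    return samples


def gld_mean(n_points, lam1, lam2, lam3, lam4):
    """Approximate the mean by averaging Q over evenly spaced midpoints."""
    total = 0.0
    for i in range(n_points):
        total += gld_quantile((i + 0.5) / n_points, lam1, lam2, lam3, lam4)
    return total / n_points


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical parameters: symmetric tails centred on 2.0.
    symmetric = (2.0, 1.0, 0.3, 0.3)
    # Hypothetical parameters: unequal tail exponents give a skewed shape.
    skewed = (0.0, 1.0, 0.1, 0.5)
    print("median (symmetric):", gld_quantile(0.5, *symmetric))
    print("mean   (symmetric):", gld_mean(1000, *symmetric))
    print("median (skewed):   ", gld_quantile(0.5, *skewed))
    print("samples (skewed):  ", gld_sample(3, *skewed, rng))
```

Because the distribution is specified by its quantile function, sampling and quantile-based summaries are direct, and four numbers are enough to describe location, scale, and both tails. That is one plausible reason such a family suits a distributional method, though the abstract itself does not spell out the motivation.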

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Smart Transportation Without Neurons -- Fair Metro Network Expansion with Tabular Reinforcement Learning
cs.LG 2026-06 conditional novelty 5.0

Tabular RL on a Non-Markovian Rewards Decision Process formulation matches deep RL performance on real metro expansion in Xi'an and Amsterdam while cutting episodes by 18x and carbon emissions by 12x on average.