Reinforcement learning based recommender systems: A survey. ACM Comput
4 Pith papers cite this work.
citing papers explorer
- A Minimal-Assumption Analysis of Q-Learning with Time-Varying Policies
  Establishes last-iterate convergence rates for on-policy Q-learning under minimal irreducibility assumptions, with sample complexity O(1/ξ²) matching off-policy up to exploration factors.
- Vanishing L2 regularization for the softmax Multi Armed Bandit
  Vanishing L2 regularization yields provable convergence for softmax MAB policies and improves empirical performance.
- Flow-Corrected Thompson Sampling for Non-Stationary Contextual Bandits
  fcTS corrects and reweights historical observations via drift models for linear, periodic, and regime-switching non-stationarities in contextual bandits, outperforming forgetting baselines in structured cases.
- Breaking the Filter Bubble: A Semantic Pareto-DQN Framework for Multi-Objective Recommendation
  Introduces semantic Pareto-DQN for multi-objective recommendation that sustains trajectory variance to improve diversity and fairness on MovieLens with limited engagement loss.