Tensor-Efficient High-Dimensional Q-learning

Dan Li; Junyi Wu

arxiv: 2511.03595 · v2 · submitted 2025-11-05 · 💻 cs.LG · cs.SY· eess.SY

Tensor-Efficient High-Dimensional Q-learning

Junyi Wu , Dan Li This is my paper

Pith reviewed 2026-05-18 01:03 UTC · model grok-4.3

classification 💻 cs.LG cs.SYeess.SY

keywords Q-learningtensor decompositionlow-rank approximationreinforcement learningexplorationsample efficiencyCP tensorhigh-dimensional control

0 comments

The pith

Representing the Q-function as a low-rank CP tensor with error-uncertainty guided exploration improves sample efficiency over matrix low-rank and deep RL methods under matched parameter budgets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Tensor-Efficient Q-Learning to tackle the exponential growth of state-action pairs in high-dimensional reinforcement learning. It models the Q-function using a low-rank CP tensor decomposition over discretized spaces and directs exploration by combining the tensor's approximation error with how often each state has been visited, while adding frequency-aware regularization to keep updates stable. Experiments on classic control tasks show that this structured representation learns with fewer samples than either matrix-based low-rank alternatives or neural network approaches when the total number of parameters is held constant. A sympathetic reader would care because many real-world control problems make each environment interaction expensive, so any method that extracts more learning per sample has direct practical value.

Core claim

TEQL represents the Q-function as a low-rank CP tensor over discretized state-action spaces and exploits the tensor structure for uncertainty-aware exploration. It incorporates Error-Uncertainty Guided Exploration that combines tensor approximation error with visit counts to guide action selection, along with frequency-aware regularization to stabilize updates. Under matched parameter budgets, experiments on classic control tasks demonstrate that TEQL outperforms both matrix-based low-rank methods and deep RL baselines in sample efficiency.

What carries the argument

Low-rank CP tensor representation of the Q-function paired with Error-Uncertainty Guided Exploration that merges approximation error and visit counts for action selection.

Load-bearing premise

Many high-dimensional control tasks exhibit low-rank structure in their value functions that a CP tensor can capture efficiently.

What would settle it

Finding a high-dimensional control task whose value function has no useful low-rank structure and showing that TEQL then loses its sample-efficiency advantage over parameter-matched baselines would falsify the central claim.

Figures

Figures reproduced from arXiv: 2511.03595 by Dan Li, Junyi Wu.

**Figure 1.** Figure 1: , TEQL operates through an integrated sequence of computational stages that transform environmental observations into improved policy estimates. Environment 𝑠!, 𝑎!, 𝑟!, 𝑠!"# 𝑠! 𝑎! Select action 𝑎! with EUGE Low-Rank Tensor Q-Function Update ≈ 𝑄$! F# F$ F%! ⊙ F%!"# F%!"%" ⊙ ⊙… 𝑑!×𝑑"× ⋯×𝑑#!"# ⊙…⊙ ⊙ 𝑑!× 𝑟 𝑑"× 𝑟 𝑑#!× 𝑟 𝑑#!$!× 𝑟 𝑑#!$##× 𝑟 Iteratively update factor matrices 𝐹!, 𝐹", … , 𝐹#!$#"until convergence 𝑄$… view at source ↗

**Figure 2.** Figure 2: Comparison of TEQL (with frequency penalty and EUGE) against Original TLR [PITH_FULL_IMAGE:figures/full_fig_p025_2.png] view at source ↗

**Figure 3.** Figure 3: Sample efficiency analysis showing episodes required to reach performance [PITH_FULL_IMAGE:figures/full_fig_p026_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation study comparing TEQL with and without the frequency penalty in [PITH_FULL_IMAGE:figures/full_fig_p028_4.png] view at source ↗

**Figure 5.** Figure 5: Sample efficiency analysis for ablation study comparing TEQL with and with [PITH_FULL_IMAGE:figures/full_fig_p029_5.png] view at source ↗

**Figure 6.** Figure 6: TEQL convergence under varying discretization granularities. Pendulum (left) [PITH_FULL_IMAGE:figures/full_fig_p030_6.png] view at source ↗

**Figure 7.** Figure 7: Performance comparison between TLR and TEQL under very coarse discretization. [PITH_FULL_IMAGE:figures/full_fig_p031_7.png] view at source ↗

**Figure 8.** Figure 8: Performance comparison under coarse discretization conditions. Pendulum (left) [PITH_FULL_IMAGE:figures/full_fig_p032_8.png] view at source ↗

**Figure 9.** Figure 9: Performance comparison under very fine discretization conditions. Pendulum [PITH_FULL_IMAGE:figures/full_fig_p033_9.png] view at source ↗

read the original abstract

High-dimensional reinforcement learning(RL) faces challenges with complex calculations and low sample efficiency in large state-action spaces. Q-learning algorithms struggle particularly with the curse of dimensionality, where the number of state-action pairs grows exponentially with problem size. While neural network-based approaches like Deep Q-Networks have shown success, they do not explicitly exploit problem structure. Many high-dimensional control tasks exhibit low-rank structure in their value functions, and tensor-based methods using low-rank decomposition offer parameter-efficient representations. However, existing tensor-based Q-learning methods focus on representation fidelity without leveraging this structure for exploration. We propose Tensor-Efficient Q-Learning (TEQL), which represents the Q-function as a low-rank CP tensor over discretized state-action spaces and exploits the tensor structure for uncertainty-aware exploration. TEQL incorporates Error-Uncertainty Guided Exploration (EUGE), which combines tensor approximation error with visit counts to guide action selection, along with frequency-aware regularization to stabilize updates. Under matched parameter budgets, experiments on classic control tasks demonstrate that TEQL outperforms both matrix-based low-rank methods and deep RL baselines in sample efficiency, making it suitable for resource-constrained applications where sampling costs are high.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

TEQL adds an error-driven exploration rule on top of a CP tensor Q-function, but the tests stay on low-dimensional control tasks that do not exercise the high-dimensional regime the paper motivates.

read the letter

TEQL represents the Q-function as a low-rank CP tensor over a discretized state-action grid and feeds the tensor reconstruction error, along with visit counts, into an exploration rule called EUGE. That particular way of turning the low-rank fit into an uncertainty signal for action choice is not present in the matrix or tensor RL papers cited in the abstract. The authors also add frequency-aware regularization to keep updates from drifting when some state-action pairs are rarely seen. On classic control benchmarks, with parameter budgets matched to the baselines, the method reports better sample efficiency than both matrix low-rank Q-learning and standard deep RL agents. Those are the concrete contributions that stand out on a first read. The experiments, however, use only low-dimensional problems such as CartPole and Acrobot. Even after discretization these spaces remain modest, so the exponential growth in state-action pairs that the introduction highlights never appears. The results therefore leave open whether the tensor structure actually helps when the grid becomes large enough that parameter efficiency matters most. The low-rank assumption on the value function is invoked without sensitivity checks or measurements of how often it holds. The abstract also omits run counts, error bars, and the exact procedure for choosing the CP rank, which makes it hard to gauge how stable the reported gains are. This paper will mainly interest researchers who want explicit structure in RL instead of neural networks, especially in sample-constrained settings. A reader looking for new exploration heuristics built on tensor approximations could extract the EUGE construction and try it elsewhere. The work is internally coherent and introduces a distinct algorithm with some empirical support, so it should receive peer review rather than a desk reject. Referees can ask for tests on larger discretized domains or direct checks on the low-rank property.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Tensor-Efficient Q-Learning (TEQL) for high-dimensional reinforcement learning. It represents the Q-function as a low-rank CP tensor over discretized state-action spaces, introduces Error-Uncertainty Guided Exploration (EUGE) that combines tensor approximation error with visit counts, and adds frequency-aware regularization. Under matched parameter budgets, experiments on classic control tasks show TEQL outperforming matrix-based low-rank methods and deep RL baselines in sample efficiency.

Significance. If the central empirical claims hold, TEQL offers a parameter-efficient, structure-exploiting alternative to deep RL for Q-learning in settings where sampling is costly. The combination of CP decomposition with uncertainty-guided exploration via tensor error is a distinctive contribution that could improve sample efficiency when low-rank structure is present. The work also supplies concrete algorithmic components (EUGE and frequency-aware regularization) that are not reducible to prior fitted quantities.

major comments (2)

[Experiments] The experimental evaluation is confined to low-dimensional classic control tasks (CartPole, Acrobot, etc.) whose continuous states are 2–4 dimensional. Even after discretization these produce only modest state-action cardinalities; the exponential growth that motivates the CP tensor representation and the high-dimensional motivation in the introduction is never realized. Consequently the reported gains do not test whether the tensor structure or EUGE yields efficiency improvements precisely when the curse of dimensionality is active.
[Introduction and Method] The low-rank assumption on value functions is invoked to justify the CP representation, yet no quantification of the approximation error as a function of rank, no sensitivity analysis to discretization granularity, and no comparison against higher-rank or full-tensor baselines appear. The free parameters (CP rank and regularization coefficient) are therefore not shown to be robust or minimal in the regimes the method targets.

minor comments (2)

The abstract and experimental description omit the exact discretization procedure, number of independent runs, and whether error bars or statistical tests accompany the sample-efficiency curves.
Notation for the CP tensor factors and the precise definition of the EUGE bonus should be stated once in a single location rather than distributed across text and pseudocode.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We agree that expanding the experimental evaluation to better address high-dimensional regimes and providing additional analyses of the low-rank approximation would strengthen the paper. We will revise the manuscript to incorporate these suggestions.

read point-by-point responses

Referee: [Experiments] The experimental evaluation is confined to low-dimensional classic control tasks (CartPole, Acrobot, etc.) whose continuous states are 2–4 dimensional. Even after discretization these produce only modest state-action cardinalities; the exponential growth that motivates the CP tensor representation and the high-dimensional motivation in the introduction is never realized. Consequently the reported gains do not test whether the tensor structure or EUGE yields efficiency improvements precisely when the curse of dimensionality is active.

Authors: We acknowledge that the current experiments use classic control tasks with 2–4 dimensional states, which after discretization yield moderate state-action spaces rather than fully realizing exponential growth. These tasks were chosen to enable controlled evaluation of sample efficiency under matched parameter budgets while exhibiting exploitable low-rank structure. We agree this does not directly test performance under the most severe curse-of-dimensionality conditions. In the revised manuscript we will add results on a higher-dimensional task (e.g., a discretized multi-dimensional grid-world or synthetic high-dimensional MDP) to demonstrate the tensor representation and EUGE benefits when state-action cardinalities grow exponentially. revision: yes
Referee: [Introduction and Method] The low-rank assumption on value functions is invoked to justify the CP representation, yet no quantification of the approximation error as a function of rank, no sensitivity analysis to discretization granularity, and no comparison against higher-rank or full-tensor baselines appear. The free parameters (CP rank and regularization coefficient) are therefore not shown to be robust or minimal in the regimes the method targets.

Authors: We agree that the manuscript would benefit from explicit quantification of the approximation error and sensitivity studies. The current work selects CP rank empirically for performance but does not report error-versus-rank curves or discretization sensitivity. In the revision we will add: (i) plots of tensor approximation error as a function of CP rank on the evaluated tasks, (ii) sensitivity analysis across discretization granularities, and (iii) comparisons against higher-rank CP tensors and, where feasible, full-tensor baselines on smaller instances. We will also report performance across a range of regularization coefficients to illustrate robustness of the chosen hyperparameters. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper defines TEQL via an explicit CP tensor representation of the Q-function over discretized spaces, then introduces EUGE (combining approximation error with visit counts) and frequency-aware regularization as separate algorithmic components. These definitions and the exploration rule are constructed from standard tensor decomposition and count-based ideas without reducing any claimed prediction or uniqueness result back to quantities fitted inside the same equations. No self-citations are invoked as load-bearing premises for a uniqueness theorem, no ansatz is smuggled, and no fitted parameter is relabeled as an out-of-sample prediction. The derivation therefore remains self-contained against external benchmarks such as standard Q-learning and matrix low-rank baselines.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 1 invented entities

The approach rests on the domain assumption that value functions possess exploitable low-rank structure and introduces new algorithmic components whose effectiveness is demonstrated only empirically on the reported tasks.

free parameters (2)

CP tensor rank
The rank of the CP decomposition must be chosen or tuned; it directly controls the parameter budget and approximation quality.
regularization coefficient
The strength of the frequency-aware regularization term is a tunable hyperparameter that affects update stability.

axioms (1)

domain assumption High-dimensional control tasks exhibit low-rank structure in their value functions
This premise is stated explicitly in the abstract as the justification for using tensor decomposition instead of full tabular or neural representations.

invented entities (1)

Error-Uncertainty Guided Exploration (EUGE) no independent evidence
purpose: To guide action selection by combining tensor approximation error with visit counts
New exploration heuristic introduced by the paper; no independent evidence outside the reported experiments is provided.

pith-pipeline@v0.9.0 · 5728 in / 1477 out tokens · 32120 ms · 2026-05-18T01:03:17.181592+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We represent the Q-function as a tensor Q ∈ R^{d1×⋯×dN} ... CANDECOMP/PARAFAC (CP) decomposition ... ˆQ(i1,...,iN) = ∑_r ∏_n F_n(i_n,r)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

EU_t(st,a) = ˆQ_t(st,a) + c·(Qerror,t(st,a) + √(log Ntotal,t(st) / (Nt(st,a)+1)))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

[1]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256

work page 2002
[2]

G., Osband, I., and Munos, R

Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning , pages 263--272

work page 2017
[3]

Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal , 19(3):357--367

work page 1967
[4]

Bellman, R. (1957). Dynamic Programming . Princeton University Press

work page 1957
[5]

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming . Athena Scientific

work page 1996
[6]

Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning , 22(1-3):33--57

work page 1996
[7]

eckart–young

Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n‐way generalization of “eckart–young” decomposition. Psychometrika , 35(3):283--319

work page 1970
[8]

Dong, K., Wang, Y., Chen, X., and Wang, L. (2020). Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. In International Conference on Learning Representations (ICLR)

work page 2020
[9]

J., Li, J., Paduraru, C., Precup, D., and Le, Q

Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Precup, D., and Le, Q. V. (2021). Challenges of real-world reinforcement learning: A survey. arXiv preprint arXiv:2102.04584

work page arXiv 2021
[10]

M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2008). Regularized fitted q-iteration: A new algorithm for reinforcement learning. In Advances in Neural Information Processing Systems , pages 489--496

work page 2008
[11]

M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems , pages 568--576

work page 2010
[12]

Ghavamzadeh, M., Mannor, S., Pineau, J., and Tamar, A. (2015). Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning , 8(5-6):359--483

work page 2015
[13]

Gordon, G. J. (1995). Stable function approximation in dynamic programming. In International Conference on Machine Learning , pages 261--268

work page 1995
[14]

Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine , 25(1):16--18

work page 2019
[15]

Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics , 6(1-4):164--189

work page 1927
[16]

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association , 58(301):13--30

work page 1963
[17]

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). VIME: Variational Information Maximizing Exploration . In Advances in Neural Information Processing Systems (NeurIPS) , volume 29, pages 1109--1117

work page 2016
[18]

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research , 11:1563--1600

work page 2010
[19]

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems , pages 4864--4874

work page 2018
[20]

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory , pages 2137--2143

work page 2020
[21]

A., and Peters, J

Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. International Journal of Robotics Research , 32(11):1238--1274

work page 2013
[22]

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review , 51(3):455--500

work page 2009
[23]

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22

work page 1985
[24]

Li, Y., Ngom, A., and Ruan, Y. (2010). Scalable tensor decompositions for multi-aspect data mining. In Proceedings of the International World Wide Web Conference , pages 651--660

work page 2010
[25]

Continuous control with deep reinforcement learning

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971

work page internal anchor Pith review Pith/arXiv arXiv 2015
[26]

Littman, M. L. and Szepesvari, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning , pages 310--318

work page 1996
[27]

Lu, C., Shi, L., Chen, Z., Wu, C., and Wierman, A. (2024). Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization . CoRR , abs/2411.07591

work page arXiv 2024
[28]

A., Veness, J., Bellemare, M

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature , 518(7540):529--533

work page 2015
[29]

and Szepesvari, C

Munos, R. and Szepesvari, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research , 9:815--857

work page 2008
[30]

O'Donoghue, B., Osband, I., Munos, R., and Mnih, V. (2018). The Uncertainty Bellman Equation and Exploration . In Proceedings of the 35th International Conference on Machine Learning (ICML) , pages 3836--3845

work page 2018
[31]

Osband, I., Aslanides, J., and Cassirer, A. (2019). Deep exploration via randomized value functions. Journal of Machine Learning Research , 20(124):1--62

work page 2019
[32]

Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In International Conference on Machine Learning , pages 752--759

work page 2008
[33]

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality . Wiley

work page 2007
[34]

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming . Wiley

work page 1994
[35]

Rozada, S., Paternain, S., and Marques, A. G. (2024). Tensor and matrix low-rank value-function approximation in reinforcement learning. IEEE Transactions on Signal Processing , 72:1634--1649

work page 2024
[36]

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning , pages 1889--1897

work page 2015
[37]

T., Batselier, K., and Kooij, J

Schuurmans, J. T., Batselier, K., and Kooij, J. F. P. (2023). How informative is the approximation error from tensor decomposition for neural network compression? In International Conference on Learning Representations (ICLR) 2023

work page 2023
[38]

Shah, D., Song, D., Xu, Z., and Yang, Y. (2020). Sample efficient reinforcement learning via low-rank matrix estimation. arXiv preprint arXiv:2006.01527

work page arXiv 2020
[39]

D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E

Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing , 65(13):3551--3582

work page 2017
[40]

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning , 3(1):9--44

work page 1988
[41]

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction . MIT Press, 2 edition

work page 2018
[42]

Szepesvari, C. (2010). Algorithms for Reinforcement Learning , volume 4. Morgan & Claypool Publishers

work page 2010
[43]

Tsai, K.-C., Zhuang, Z., Lent, R., Wang, J., Qi, Q., Wang, L.-C., and Han, Z. (2021). Tensor-based reinforcement learning for network routing. IEEE Journal of Selected Topics in Signal Processing , 15(3):640--653

work page 2021
[44]

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and q-learning. Machine Learning , 16(3):185--202

work page 1994
[45]

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control , 42(5):674--690

work page 1997
[46]

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning , 8(3-4):279--292

work page 1992
[47]

Yang, Y., Wang, Z., and Li, J. (2019). Sample-efficient deep reinforcement learning via episodic backward update. In Advances in Neural Information Processing Systems , pages 2110--2119

work page 2019

[1] [1]

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256

work page 2002

[2] [2]

G., Osband, I., and Munos, R

Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning , pages 263--272

work page 2017

[3] [3]

Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal , 19(3):357--367

work page 1967

[4] [4]

Bellman, R. (1957). Dynamic Programming . Princeton University Press

work page 1957

[5] [5]

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming . Athena Scientific

work page 1996

[6] [6]

Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning , 22(1-3):33--57

work page 1996

[7] [7]

eckart–young

Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n‐way generalization of “eckart–young” decomposition. Psychometrika , 35(3):283--319

work page 1970

[8] [8]

Dong, K., Wang, Y., Chen, X., and Wang, L. (2020). Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. In International Conference on Learning Representations (ICLR)

work page 2020

[9] [9]

J., Li, J., Paduraru, C., Precup, D., and Le, Q

Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Precup, D., and Le, Q. V. (2021). Challenges of real-world reinforcement learning: A survey. arXiv preprint arXiv:2102.04584

work page arXiv 2021

[10] [10]

M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2008). Regularized fitted q-iteration: A new algorithm for reinforcement learning. In Advances in Neural Information Processing Systems , pages 489--496

work page 2008

[11] [11]

M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S

Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems , pages 568--576

work page 2010

[12] [12]

Ghavamzadeh, M., Mannor, S., Pineau, J., and Tamar, A. (2015). Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning , 8(5-6):359--483

work page 2015

[13] [13]

Gordon, G. J. (1995). Stable function approximation in dynamic programming. In International Conference on Machine Learning , pages 261--268

work page 1995

[14] [14]

Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine , 25(1):16--18

work page 2019

[15] [15]

Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics , 6(1-4):164--189

work page 1927

[16] [16]

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association , 58(301):13--30

work page 1963

[17] [17]

Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). VIME: Variational Information Maximizing Exploration . In Advances in Neural Information Processing Systems (NeurIPS) , volume 29, pages 1109--1117

work page 2016

[18] [18]

Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research , 11:1563--1600

work page 2010

[19] [19]

Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems , pages 4864--4874

work page 2018

[20] [20]

Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory , pages 2137--2143

work page 2020

[21] [21]

A., and Peters, J

Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. International Journal of Robotics Research , 32(11):1238--1274

work page 2013

[22] [22]

Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review , 51(3):455--500

work page 2009

[23] [23]

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22

work page 1985

[24] [24]

Li, Y., Ngom, A., and Ruan, Y. (2010). Scalable tensor decompositions for multi-aspect data mining. In Proceedings of the International World Wide Web Conference , pages 651--660

work page 2010

[25] [25]

Continuous control with deep reinforcement learning

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971

work page internal anchor Pith review Pith/arXiv arXiv 2015

[26] [26]

Littman, M. L. and Szepesvari, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning , pages 310--318

work page 1996

[27] [27]

Lu, C., Shi, L., Chen, Z., Wu, C., and Wierman, A. (2024). Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization . CoRR , abs/2411.07591

work page arXiv 2024

[28] [28]

A., Veness, J., Bellemare, M

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature , 518(7540):529--533

work page 2015

[29] [29]

and Szepesvari, C

Munos, R. and Szepesvari, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research , 9:815--857

work page 2008

[30] [30]

O'Donoghue, B., Osband, I., Munos, R., and Mnih, V. (2018). The Uncertainty Bellman Equation and Exploration . In Proceedings of the 35th International Conference on Machine Learning (ICML) , pages 3836--3845

work page 2018

[31] [31]

Osband, I., Aslanides, J., and Cassirer, A. (2019). Deep exploration via randomized value functions. Journal of Machine Learning Research , 20(124):1--62

work page 2019

[32] [32]

Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In International Conference on Machine Learning , pages 752--759

work page 2008

[33] [33]

Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality . Wiley

work page 2007

[34] [34]

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming . Wiley

work page 1994

[35] [35]

Rozada, S., Paternain, S., and Marques, A. G. (2024). Tensor and matrix low-rank value-function approximation in reinforcement learning. IEEE Transactions on Signal Processing , 72:1634--1649

work page 2024

[36] [36]

Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning , pages 1889--1897

work page 2015

[37] [37]

T., Batselier, K., and Kooij, J

Schuurmans, J. T., Batselier, K., and Kooij, J. F. P. (2023). How informative is the approximation error from tensor decomposition for neural network compression? In International Conference on Learning Representations (ICLR) 2023

work page 2023

[38] [38]

Shah, D., Song, D., Xu, Z., and Yang, Y. (2020). Sample efficient reinforcement learning via low-rank matrix estimation. arXiv preprint arXiv:2006.01527

work page arXiv 2020

[39] [39]

D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E

Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing , 65(13):3551--3582

work page 2017

[40] [40]

Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning , 3(1):9--44

work page 1988

[41] [41]

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction . MIT Press, 2 edition

work page 2018

[42] [42]

Szepesvari, C. (2010). Algorithms for Reinforcement Learning , volume 4. Morgan & Claypool Publishers

work page 2010

[43] [43]

Tsai, K.-C., Zhuang, Z., Lent, R., Wang, J., Qi, Q., Wang, L.-C., and Han, Z. (2021). Tensor-based reinforcement learning for network routing. IEEE Journal of Selected Topics in Signal Processing , 15(3):640--653

work page 2021

[44] [44]

Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and q-learning. Machine Learning , 16(3):185--202

work page 1994

[45] [45]

Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control , 42(5):674--690

work page 1997

[46] [46]

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning , 8(3-4):279--292

work page 1992

[47] [47]

Yang, Y., Wang, Z., and Li, J. (2019). Sample-efficient deep reinforcement learning via episodic backward update. In Advances in Neural Information Processing Systems , pages 2110--2119

work page 2019