Tensor-Efficient High-Dimensional Q-learning
Pith reviewed 2026-05-18 01:03 UTC · model grok-4.3
The pith
Representing the Q-function as a low-rank CP tensor with error-uncertainty guided exploration improves sample efficiency over matrix low-rank and deep RL methods under matched parameter budgets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TEQL represents the Q-function as a low-rank CP tensor over discretized state-action spaces and exploits the tensor structure for uncertainty-aware exploration. It incorporates Error-Uncertainty Guided Exploration that combines tensor approximation error with visit counts to guide action selection, along with frequency-aware regularization to stabilize updates. Under matched parameter budgets, experiments on classic control tasks demonstrate that TEQL outperforms both matrix-based low-rank methods and deep RL baselines in sample efficiency.
What carries the argument
Low-rank CP tensor representation of the Q-function paired with Error-Uncertainty Guided Exploration that merges approximation error and visit counts for action selection.
Load-bearing premise
Many high-dimensional control tasks exhibit low-rank structure in their value functions that a CP tensor can capture efficiently.
What would settle it
Finding a high-dimensional control task whose value function has no useful low-rank structure and showing that TEQL then loses its sample-efficiency advantage over parameter-matched baselines would falsify the central claim.
Figures
read the original abstract
High-dimensional reinforcement learning(RL) faces challenges with complex calculations and low sample efficiency in large state-action spaces. Q-learning algorithms struggle particularly with the curse of dimensionality, where the number of state-action pairs grows exponentially with problem size. While neural network-based approaches like Deep Q-Networks have shown success, they do not explicitly exploit problem structure. Many high-dimensional control tasks exhibit low-rank structure in their value functions, and tensor-based methods using low-rank decomposition offer parameter-efficient representations. However, existing tensor-based Q-learning methods focus on representation fidelity without leveraging this structure for exploration. We propose Tensor-Efficient Q-Learning (TEQL), which represents the Q-function as a low-rank CP tensor over discretized state-action spaces and exploits the tensor structure for uncertainty-aware exploration. TEQL incorporates Error-Uncertainty Guided Exploration (EUGE), which combines tensor approximation error with visit counts to guide action selection, along with frequency-aware regularization to stabilize updates. Under matched parameter budgets, experiments on classic control tasks demonstrate that TEQL outperforms both matrix-based low-rank methods and deep RL baselines in sample efficiency, making it suitable for resource-constrained applications where sampling costs are high.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Tensor-Efficient Q-Learning (TEQL) for high-dimensional reinforcement learning. It represents the Q-function as a low-rank CP tensor over discretized state-action spaces, introduces Error-Uncertainty Guided Exploration (EUGE) that combines tensor approximation error with visit counts, and adds frequency-aware regularization. Under matched parameter budgets, experiments on classic control tasks show TEQL outperforming matrix-based low-rank methods and deep RL baselines in sample efficiency.
Significance. If the central empirical claims hold, TEQL offers a parameter-efficient, structure-exploiting alternative to deep RL for Q-learning in settings where sampling is costly. The combination of CP decomposition with uncertainty-guided exploration via tensor error is a distinctive contribution that could improve sample efficiency when low-rank structure is present. The work also supplies concrete algorithmic components (EUGE and frequency-aware regularization) that are not reducible to prior fitted quantities.
major comments (2)
- [Experiments] The experimental evaluation is confined to low-dimensional classic control tasks (CartPole, Acrobot, etc.) whose continuous states are 2–4 dimensional. Even after discretization these produce only modest state-action cardinalities; the exponential growth that motivates the CP tensor representation and the high-dimensional motivation in the introduction is never realized. Consequently the reported gains do not test whether the tensor structure or EUGE yields efficiency improvements precisely when the curse of dimensionality is active.
- [Introduction and Method] The low-rank assumption on value functions is invoked to justify the CP representation, yet no quantification of the approximation error as a function of rank, no sensitivity analysis to discretization granularity, and no comparison against higher-rank or full-tensor baselines appear. The free parameters (CP rank and regularization coefficient) are therefore not shown to be robust or minimal in the regimes the method targets.
minor comments (2)
- The abstract and experimental description omit the exact discretization procedure, number of independent runs, and whether error bars or statistical tests accompany the sample-efficiency curves.
- Notation for the CP tensor factors and the precise definition of the EUGE bonus should be stated once in a single location rather than distributed across text and pseudocode.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We agree that expanding the experimental evaluation to better address high-dimensional regimes and providing additional analyses of the low-rank approximation would strengthen the paper. We will revise the manuscript to incorporate these suggestions.
read point-by-point responses
-
Referee: [Experiments] The experimental evaluation is confined to low-dimensional classic control tasks (CartPole, Acrobot, etc.) whose continuous states are 2–4 dimensional. Even after discretization these produce only modest state-action cardinalities; the exponential growth that motivates the CP tensor representation and the high-dimensional motivation in the introduction is never realized. Consequently the reported gains do not test whether the tensor structure or EUGE yields efficiency improvements precisely when the curse of dimensionality is active.
Authors: We acknowledge that the current experiments use classic control tasks with 2–4 dimensional states, which after discretization yield moderate state-action spaces rather than fully realizing exponential growth. These tasks were chosen to enable controlled evaluation of sample efficiency under matched parameter budgets while exhibiting exploitable low-rank structure. We agree this does not directly test performance under the most severe curse-of-dimensionality conditions. In the revised manuscript we will add results on a higher-dimensional task (e.g., a discretized multi-dimensional grid-world or synthetic high-dimensional MDP) to demonstrate the tensor representation and EUGE benefits when state-action cardinalities grow exponentially. revision: yes
-
Referee: [Introduction and Method] The low-rank assumption on value functions is invoked to justify the CP representation, yet no quantification of the approximation error as a function of rank, no sensitivity analysis to discretization granularity, and no comparison against higher-rank or full-tensor baselines appear. The free parameters (CP rank and regularization coefficient) are therefore not shown to be robust or minimal in the regimes the method targets.
Authors: We agree that the manuscript would benefit from explicit quantification of the approximation error and sensitivity studies. The current work selects CP rank empirically for performance but does not report error-versus-rank curves or discretization sensitivity. In the revision we will add: (i) plots of tensor approximation error as a function of CP rank on the evaluated tasks, (ii) sensitivity analysis across discretization granularities, and (iii) comparisons against higher-rank CP tensors and, where feasible, full-tensor baselines on smaller instances. We will also report performance across a range of regularization coefficients to illustrate robustness of the chosen hyperparameters. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper defines TEQL via an explicit CP tensor representation of the Q-function over discretized spaces, then introduces EUGE (combining approximation error with visit counts) and frequency-aware regularization as separate algorithmic components. These definitions and the exploration rule are constructed from standard tensor decomposition and count-based ideas without reducing any claimed prediction or uniqueness result back to quantities fitted inside the same equations. No self-citations are invoked as load-bearing premises for a uniqueness theorem, no ansatz is smuggled, and no fitted parameter is relabeled as an out-of-sample prediction. The derivation therefore remains self-contained against external benchmarks such as standard Q-learning and matrix low-rank baselines.
Axiom & Free-Parameter Ledger
free parameters (2)
- CP tensor rank
- regularization coefficient
axioms (1)
- domain assumption High-dimensional control tasks exhibit low-rank structure in their value functions
invented entities (1)
-
Error-Uncertainty Guided Exploration (EUGE)
no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We represent the Q-function as a tensor Q ∈ R^{d1×⋯×dN} ... CANDECOMP/PARAFAC (CP) decomposition ... ˆQ(i1,...,iN) = ∑_r ∏_n F_n(i_n,r)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
EU_t(st,a) = ˆQ_t(st,a) + c·(Qerror,t(st,a) + √(log Ntotal,t(st) / (Nt(st,a)+1)))
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning , 47(2-3):235--256
work page 2002
-
[2]
Azar, M. G., Osband, I., and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning , pages 263--272
work page 2017
-
[3]
Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal , 19(3):357--367
work page 1967
-
[4]
Bellman, R. (1957). Dynamic Programming . Princeton University Press
work page 1957
-
[5]
Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming . Athena Scientific
work page 1996
-
[6]
Bradtke, S. J. and Barto, A. G. (1996). Linear least-squares algorithms for temporal difference learning. Machine Learning , 22(1-3):33--57
work page 1996
-
[7]
Carroll, J. D. and Chang, J. J. (1970). Analysis of individual differences in multidimensional scaling via an n‐way generalization of “eckart–young” decomposition. Psychometrika , 35(3):283--319
work page 1970
-
[8]
Dong, K., Wang, Y., Chen, X., and Wang, L. (2020). Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. In International Conference on Learning Representations (ICLR)
work page 2020
-
[9]
J., Li, J., Paduraru, C., Precup, D., and Le, Q
Dulac-Arnold, G., Levine, N., Mankowitz, D. J., Li, J., Paduraru, C., Precup, D., and Le, Q. V. (2021). Challenges of real-world reinforcement learning: A survey. arXiv preprint arXiv:2102.04584
-
[10]
M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S
Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2008). Regularized fitted q-iteration: A new algorithm for reinforcement learning. In Advances in Neural Information Processing Systems , pages 489--496
work page 2008
-
[11]
M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S
Farahmand, A. M., Ghavamzadeh, M., Szepesvári, C., and Mannor, S. (2010). Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems , pages 568--576
work page 2010
-
[12]
Ghavamzadeh, M., Mannor, S., Pineau, J., and Tamar, A. (2015). Bayesian reinforcement learning: A survey. Foundations and Trends in Machine Learning , 8(5-6):359--483
work page 2015
-
[13]
Gordon, G. J. (1995). Stable function approximation in dynamic programming. In International Conference on Machine Learning , pages 261--268
work page 1995
-
[14]
Gottesman, O., Johansson, F., Komorowski, M., Faisal, A., Sontag, D., Doshi-Velez, F., and Celi, L. A. (2019). Guidelines for reinforcement learning in healthcare. Nature Medicine , 25(1):16--18
work page 2019
-
[15]
Hitchcock, F. L. (1927). The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics , 6(1-4):164--189
work page 1927
-
[16]
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association , 58(301):13--30
work page 1963
-
[17]
Houthooft, R., Chen, X., Duan, Y., Schulman, J., De Turck, F., and Abbeel, P. (2016). VIME: Variational Information Maximizing Exploration . In Advances in Neural Information Processing Systems (NeurIPS) , volume 29, pages 1109--1117
work page 2016
-
[18]
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research , 11:1563--1600
work page 2010
-
[19]
Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems , pages 4864--4874
work page 2018
-
[20]
Jin, C., Yang, Z., Wang, Z., and Jordan, M. I. (2020). Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory , pages 2137--2143
work page 2020
-
[21]
Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. International Journal of Robotics Research , 32(11):1238--1274
work page 2013
-
[22]
Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review , 51(3):455--500
work page 2009
-
[23]
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics , 6(1):4--22
work page 1985
-
[24]
Li, Y., Ngom, A., and Ruan, Y. (2010). Scalable tensor decompositions for multi-aspect data mining. In Proceedings of the International World Wide Web Conference , pages 651--660
work page 2010
-
[25]
Continuous control with deep reinforcement learning
Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[26]
Littman, M. L. and Szepesvari, C. (1996). A generalized reinforcement-learning model: Convergence and applications. In International Conference on Machine Learning , pages 310--318
work page 1996
- [27]
-
[28]
Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. (2015). Human-level control through deep reinforcement learning. Nature , 518(7540):529--533
work page 2015
-
[29]
Munos, R. and Szepesvari, C. (2008). Finite-time bounds for fitted value iteration. Journal of Machine Learning Research , 9:815--857
work page 2008
-
[30]
O'Donoghue, B., Osband, I., Munos, R., and Mnih, V. (2018). The Uncertainty Bellman Equation and Exploration . In Proceedings of the 35th International Conference on Machine Learning (ICML) , pages 3836--3845
work page 2018
-
[31]
Osband, I., Aslanides, J., and Cassirer, A. (2019). Deep exploration via randomized value functions. Journal of Machine Learning Research , 20(124):1--62
work page 2019
-
[32]
Parr, R., Painter-Wakefield, C., Li, L., and Littman, M. L. (2008). An analysis of linear models, linear value-function approximation, and feature selection for reinforcement learning. In International Conference on Machine Learning , pages 752--759
work page 2008
-
[33]
Powell, W. B. (2007). Approximate Dynamic Programming: Solving the Curses of Dimensionality . Wiley
work page 2007
-
[34]
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming . Wiley
work page 1994
-
[35]
Rozada, S., Paternain, S., and Marques, A. G. (2024). Tensor and matrix low-rank value-function approximation in reinforcement learning. IEEE Transactions on Signal Processing , 72:1634--1649
work page 2024
-
[36]
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning , pages 1889--1897
work page 2015
-
[37]
T., Batselier, K., and Kooij, J
Schuurmans, J. T., Batselier, K., and Kooij, J. F. P. (2023). How informative is the approximation error from tensor decomposition for neural network compression? In International Conference on Learning Representations (ICLR) 2023
work page 2023
- [38]
-
[39]
D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E
Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. (2017). Tensor decomposition for signal processing and machine learning. IEEE Transactions on Signal Processing , 65(13):3551--3582
work page 2017
-
[40]
Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning , 3(1):9--44
work page 1988
-
[41]
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction . MIT Press, 2 edition
work page 2018
-
[42]
Szepesvari, C. (2010). Algorithms for Reinforcement Learning , volume 4. Morgan & Claypool Publishers
work page 2010
-
[43]
Tsai, K.-C., Zhuang, Z., Lent, R., Wang, J., Qi, Q., Wang, L.-C., and Han, Z. (2021). Tensor-based reinforcement learning for network routing. IEEE Journal of Selected Topics in Signal Processing , 15(3):640--653
work page 2021
-
[44]
Tsitsiklis, J. N. (1994). Asynchronous stochastic approximation and q-learning. Machine Learning , 16(3):185--202
work page 1994
-
[45]
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control , 42(5):674--690
work page 1997
-
[46]
Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning , 8(3-4):279--292
work page 1992
-
[47]
Yang, Y., Wang, Z., and Li, J. (2019). Sample-efficient deep reinforcement learning via episodic backward update. In Advances in Neural Information Processing Systems , pages 2110--2119
work page 2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.