arxiv: 2501.10598 · v3 · pith:NHGF75XPnew · submitted 2025-01-17 · 💻 cs.LG

Addressing Finite-Horizon MDPs via Low-Rank Tensor Value Approximation

Pith reviewed 2026-05-23 04:46 UTC · model grok-4.3

classification 💻 cs.LG
keywords finite-horizon MDPs · low-rank tensor approximation · policy iteration · value function approximation · block coordinate descent · reinforcement learning · model-free RL

The pith

Finite-horizon MDPs become tractable by approximating their value functions as low-rank tensors inside policy iteration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the time-varying value functions of finite-horizon MDPs as low-rank tensors so that the Bellman equations remain solvable even when state spaces are large. Low-rank policy evaluation is paired with greedy policy improvement inside an iterative loop, and the resulting constrained optimization problems are solved by block-coordinate descent or block-coordinate gradient descent. Both algorithms carry convergence guarantees, and the paper proves that any bounded error introduced by the low-rank constraint during evaluation produces only bounded degradation in the final policy. The same framework is adapted to the model-free setting by replacing exact expectations with averages over sampled trajectories. Experiments on synthetic and resource-allocation tasks show that the approach cuts computation while returning policies whose attained returns stay competitive with exact methods.
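
The evaluate-then-improve loop described above can be sketched in its exact tabular form. This is a minimal illustration under stated assumptions, not the paper's low-rank solver: the function names, the zero terminal value, the undiscounted reward, and the random toy MDP are all hypothetical choices made here. In the paper's setting the evaluation step would be replaced by a low-rank BCD or BCGD solve.

```python
import numpy as np

def evaluate_policy(P, R, policy):
    """Backward recursion for Q^pi_h(s, a), assuming zero value after the last step.

    P: (S, A, S) transition probabilities, R: (S, A) rewards,
    policy: (H, S) integer actions. Returns Q with shape (H, S, A).
    """
    H, S = policy.shape
    Q = np.zeros((H, S, R.shape[1]))
    next_v = np.zeros(S)                      # value beyond the horizon
    for h in range(H - 1, -1, -1):
        Q[h] = R + P @ next_v                 # Bellman equation at step h
        next_v = Q[h, np.arange(S), policy[h]]
    return Q

def policy_iteration(P, R, H, max_iters=50):
    """Alternate exact evaluation and greedy improvement until the policy is stable."""
    S, _ = R.shape
    policy = np.zeros((H, S), dtype=int)
    for _ in range(max_iters):
        Q = evaluate_policy(P, R, policy)
        new_policy = Q.argmax(axis=2)         # greedy action for every (h, s)
        if np.array_equal(new_policy, policy):
            break
        policy = new_policy
    return policy, Q

# Hypothetical toy problem: 6 states, 3 actions, horizon 5.
rng = np.random.default_rng(0)
S, A, H = 6, 3, 5
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)             # normalize rows into distributions
R = rng.random((S, A))
policy, Q = policy_iteration(P, R, H)
print(policy.shape, Q.shape)
```

Note that the policy is indexed by the time step as well as the state, which is the non-stationarity the paper's tensor model has to absorb.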

Core claim

Value functions of finite-horizon MDPs can be represented as low-rank tensors; solving the Bellman equations under this low-rank constraint via block-coordinate methods produces near-optimal policies, and bounded low-rank policy-evaluation error implies bounded policy improvement.
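
For orientation, the two ingredients of the claim can be written out in generic notation. The symbols below are assumptions made for illustration and may not match the paper's: an undiscounted reward r, horizon H, a zero terminal value, D tensor modes (state dimensions, action dimensions, and time), rank K, and factor matrices Q_d. The first line is the standard finite-horizon Bellman equation for a policy; the second is the standard PARAFAC (CP) model.

```latex
% Finite-horizon Bellman equation for a (time-varying) policy \pi
Q_h^{\pi}(s,a) = r(s,a) + \sum_{s'} P(s' \mid s,a)\, Q_{h+1}^{\pi}\bigl(s', \pi_{h+1}(s')\bigr),
\qquad h = 1,\dots,H, \quad Q_{H+1}^{\pi} \equiv 0

% Rank-K PARAFAC model of the tensor that stacks all Q_h^{\pi}
[\mathbf{Q}]_{i_1,\dots,i_D} \approx \sum_{k=1}^{K} \prod_{d=1}^{D} [\mathbf{Q}_d]_{i_d,k}
```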

What carries the argument

Low-rank tensor constraint on non-stationary value functions, which reduces representation size and converts the Bellman optimality equations into a tractable constrained optimization problem solved by BCD or BCGD.
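
A small sketch of what the low-rank constraint buys in representation size, assuming a PARAFAC (CP) model in which time is treated as one more tensor mode. The dimensions, the rank, and the helper name `parafac_to_tensor` are hypothetical and chosen only for illustration.

```python
import numpy as np

def parafac_to_tensor(factors):
    """Rebuild a full tensor from CP factor matrices, each of shape (dim_d, rank)."""
    rank = factors[0].shape[1]
    tensor = np.zeros(tuple(f.shape[0] for f in factors))
    for k in range(rank):
        component = factors[0][:, k]
        for f in factors[1:]:
            # the outer product appends one mode per factor
            component = np.multiply.outer(component, f[:, k])
        tensor += component
    return tensor

# Hypothetical sizes: two state dimensions, one action dimension, horizon.
dims = (10, 10, 5, 8)                     # (S1, S2, A, H)
rank = 3
rng = np.random.default_rng(0)
factors = [rng.standard_normal((d, rank)) for d in dims]
Q = parafac_to_tensor(factors)

full_params = int(np.prod(dims))          # entries of the full tensor: 4000
low_rank_params = rank * sum(dims)        # entries of the factors: 3 * 33 = 99
print(full_params, low_rank_params)
```

The factor count grows with the sum of the mode sizes, while the full tensor grows with their product.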

If this is right

  • Bounded low-rank policy-evaluation error produces only bounded degradation in the improved policy.
  • The same low-rank formulation works when transition probabilities are replaced by empirical averages from sampled trajectories.
  • Both block-coordinate descent and block-coordinate gradient descent converge to stationary points of the low-rank constrained problem.
  • Computational cost and memory scale with the low-rank factors rather than the full state-space size.
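
To make the block-coordinate idea concrete, here is a minimal sketch of block-coordinate descent on a plain PARAFAC fitting problem (alternating least squares) for a three-way tensor. This is a swapped-in, simpler objective: the paper applies block-coordinate updates to a Bellman-based loss, not to a direct tensor fit. The function names and sizes are hypothetical.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: (J, R) and (K, R) -> (J*K, R)."""
    return np.einsum('jr,kr->jkr', B, C).reshape(-1, B.shape[1])

def unfold(T, mode):
    """Mode unfolding with the remaining modes flattened in C order."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def cp_als(T, rank, n_iters=100, seed=0):
    """Fit a rank-`rank` CP model to a 3-way tensor, one factor block at a time."""
    rng = np.random.default_rng(seed)
    factors = [rng.standard_normal((d, rank)) for d in T.shape]
    for _ in range(n_iters):
        for mode in range(3):
            others = [factors[m] for m in range(3) if m != mode]
            design = khatri_rao(others[0], others[1])
            # exact least-squares update of one block, the other two held fixed
            solution = np.linalg.lstsq(design, unfold(T, mode).T, rcond=None)[0]
            factors[mode] = solution.T
    return factors

def nfe(T, factors):
    """Normalized Frobenius error between T and its CP reconstruction."""
    T_hat = np.einsum('ir,jr,kr->ijk', *factors)
    return np.linalg.norm(T - T_hat) / np.linalg.norm(T)

# Hypothetical target: a random tensor that is generally not low rank.
rng = np.random.default_rng(1)
T = rng.standard_normal((4, 5, 6))
print(nfe(T, cp_als(T, rank=2, n_iters=1)), nfe(T, cp_als(T, rank=2, n_iters=50)))
```

Because each block update solves its subproblem exactly, the fitting error cannot increase from one update to the next, which is the basic mechanism behind convergence to a stationary point.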

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The error-propagation result could be turned into explicit sample-complexity bounds once the low-rank rank and the number of iterations are fixed.
  • If other sequential decision problems with time-varying costs also admit low-rank value structure, the same modeling step would apply directly.
  • The block-coordinate solvers developed here might serve as building blocks for other tensor-constrained dynamic programming tasks.

Load-bearing premise

Value functions arising in the finite-horizon MDPs of interest admit accurate low-rank tensor approximations.

What would settle it

An MDP whose value functions have high tensor rank, for which the low-rank method returns policies whose returns fall substantially below those obtained by exact dynamic programming on the same problem.
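
One cheap way to screen a candidate counterexample, sketched here under stated assumptions: a tensor of PARAFAC rank K has every matrix unfolding of rank at most K, so the best rank-K error of any single unfolding (computed by a truncated SVD) is a lower bound on the best rank-K PARAFAC error. The helper name `nfe_lower_bound` and the test tensors are hypothetical, and the value tensor would in practice come from exact dynamic programming on the MDP in question.

```python
import numpy as np

def nfe_lower_bound(T, rank, mode=0):
    """Normalized Frobenius error of the best rank-`rank` fit to one unfolding of T.

    This lower-bounds the error of any rank-`rank` PARAFAC approximation of T.
    """
    M = np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2)) / np.linalg.norm(M))

rng = np.random.default_rng(0)

# A rank-1 tensor: the bound is numerically zero at rank 1.
a, b, c = rng.standard_normal(6), rng.standard_normal(6), rng.standard_normal(6)
rank_one = np.einsum('i,j,k->ijk', a, b, c)
print(nfe_lower_bound(rank_one, rank=1))

# An unstructured random tensor: a large bound means no rank-1 model can fit well.
unstructured = rng.standard_normal((6, 6, 6))
print(nfe_lower_bound(unstructured, rank=1))
```

A large lower bound at the ranks of interest would flag the MDP as a candidate for the failure case described above; a small bound is necessary but not sufficient for a good PARAFAC fit.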

Figures

Figures reproduced from arXiv: 2501.10598 by Antonio G. Marques, Jose Luis Orejuela, Sergio Rozada.

Figure 1: Picture (a) shows the rewards of the grid-world environment. The remaining pictures show in time-step …
Figure 2: NFE between the tensor of optimal VFs obtained via policy iteration low-rank PARAFAC decomposition in the grid-like setup.
Figure 3: The figure shows results for BCD-PE and BCD-PI in the first row, and BCGD-PE and BCGD-PI in the second. The columns display: (i) PE convergence in terms of (a) NFE and (b) L(Q); and (ii) PI convergence in terms of (c) NFE and (d) empirical return.
Figure 4: The figure shows the average return for S-BCGD-PI and BCTD-PI against different baselines for (a) the wireless communications setup and (b) the battery charging setup. S-BCGD-PI and BCTD-PI converge faster than the baselines and require significantly fewer parameters.
Original abstract

We study the problem of learning optimal policies in finite-horizon Markov Decision Processes (MDPs) using low-rank reinforcement learning (RL) methods. In finite-horizon MDPs, the policies, and therefore the value functions (VFs) are not stationary. This aggravates the challenges of high-dimensional MDPs, as they suffer from the curse of dimensionality and high sample complexity. To address these issues, we propose modeling the VFs of finite-horizon MDPs as low-rank tensors, enabling a scalable representation that renders the problem of learning optimal policies tractable. Our approach focuses on VF approximation within a policy iteration framework, where low-rank policy evaluation is combined with greedy policy improvement to compute near-optimal policies. We introduce an optimization-based framework for solving the Bellman equations with low-rank constraints, along with block-coordinate descent (BCD) and block-coordinate gradient descent (BCGD) algorithms, both with theoretical convergence guarantees. We further establish that bounded low-rank policy evaluation error translates into bounded policy improvement in the finite-horizon setting. For scenarios where the system dynamics are unknown, we adapt the proposed BCGD method to estimate the VFs using sampled trajectories. Numerical experiments further demonstrate that the proposed framework reduces computational demands in controlled synthetic scenarios and more realistic resource allocation problems, while achieving competitive policy performance in terms of attained returns.
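
The model-free adaptation mentioned in the abstract rests on replacing an expectation over next states with an average over samples. The sketch below shows only that substitution for a single Bellman backup. It is not the paper's sampled BCGD algorithm, and the function names and toy sizes are hypothetical. The transition tensor `P` is used here solely to simulate an environment; a model-free learner would see only the sampled next states.

```python
import numpy as np

def exact_backup(P, R, next_v, s, a):
    """Bellman backup with the true expectation over next states."""
    return R[s, a] + P[s, a] @ next_v

def sampled_backup(P, R, next_v, s, a, n_samples, rng):
    """Same backup, with the expectation replaced by a sample average."""
    next_states = rng.choice(next_v.size, size=n_samples, p=P[s, a])
    return R[s, a] + next_v[next_states].mean()

# Hypothetical toy problem: 5 states, 2 actions.
rng = np.random.default_rng(0)
S, A = 5, 2
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)       # normalize rows into distributions
R = rng.random((S, A))
next_v = rng.random(S)                  # stand-in for the next-step value function

exact = exact_backup(P, R, next_v, s=0, a=1)
estimate = sampled_backup(P, R, next_v, s=0, a=1, n_samples=20000, rng=rng)
print(abs(exact - estimate))
```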

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes modeling value functions of finite-horizon MDPs as low-rank tensors to mitigate the curse of dimensionality. Within a policy-iteration framework it develops an optimization-based approach to the constrained Bellman equations, introduces BCD and BCGD solvers with convergence guarantees, proves that bounded low-rank policy-evaluation error implies bounded policy improvement, adapts the method to the model-free setting via sampled trajectories, and reports competitive empirical performance on synthetic and resource-allocation instances.

Significance. If the low-rank tensor model is faithful for the MDPs under consideration, the framework supplies a scalable representation together with provably convergent algorithms and an error-propagation guarantee that links policy-evaluation accuracy to policy improvement. The model-free extension and the reported reduction in computational cost would be practically relevant for high-dimensional finite-horizon problems.

major comments (2)
  1. [Abstract] Abstract and introduction: the central error-propagation claim (bounded low-rank policy-evaluation error implies bounded policy improvement) and the tractability argument both rest on the premise that value functions admit accurate low-rank tensor approximations, yet no sufficient conditions on the transition kernel or reward function are supplied that would guarantee this structure or quantify the incurred approximation error. Without such justification the translation result does not necessarily apply to the original MDP.
  2. [Theoretical results (convergence section)] The convergence guarantees for BCD and BCGD are stated for the low-rank constrained problem; because the paper provides neither a priori bounds on the distance between the low-rank solution and the true value tensor nor conditions under which this distance is small, it is unclear whether the guarantees remain meaningful for the underlying finite-horizon MDP.
minor comments (2)
  1. Notation for the tensor ranks and the precise definition of the low-rank constraint set should be introduced earlier and used consistently throughout the algorithmic and theoretical sections.
  2. The experimental section would benefit from an explicit statement of the tensor ranks chosen for each domain and a sensitivity plot showing how performance degrades when the rank is misspecified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the scope of our contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central error-propagation claim (bounded low-rank policy-evaluation error implies bounded policy improvement) and the tractability argument both rest on the premise that value functions admit accurate low-rank tensor approximations, yet no sufficient conditions on the transition kernel or reward function are supplied that would guarantee this structure or quantify the incurred approximation error. Without such justification the translation result does not necessarily apply to the original MDP.

    Authors: Our error-propagation theorem shows that if low-rank policy-evaluation error is bounded then policy improvement is bounded. This implication is independent of the conditions that make the low-rank structure accurate; it applies to any approximation achieving bounded error. The paper treats low-rank tensor modeling as an explicit modeling choice that yields tractability (analogous to other function-approximation schemes in RL) and supplies convergent algorithms together with the linking theorem under that choice. No claim is made that low-rank structure holds for every MDP; empirical results on the tested instances support practical utility. The translation result therefore applies precisely when the bounded-error premise holds. revision: no

  2. Referee: [Theoretical results (convergence section)] The convergence guarantees for BCD and BCGD are stated for the low-rank constrained problem; because the paper provides neither a priori bounds on the distance between the low-rank solution and the true value tensor nor conditions under which this distance is small, it is unclear whether the guarantees remain meaningful for the underlying finite-horizon MDP.

    Authors: BCD and BCGD are shown to converge for the low-rank constrained optimization problem itself. Their relevance to the original MDP is supplied by the separate error-propagation theorem that converts any bound on the distance between the obtained low-rank solution and the true value tensor into a bound on policy sub-optimality. A priori bounds on that distance would require additional structural assumptions on the transition kernel or reward; such assumptions lie outside the paper's scope of developing the low-rank framework and the general linking result. The guarantees are therefore meaningful whenever the modeling assumption yields acceptably small error, which can be assessed empirically or via domain knowledge. revision: no

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper explicitly adopts low-rank tensor modeling of finite-horizon value functions as an upfront modeling premise that renders the high-dimensional problem tractable. From this assumption it derives BCD/BCGD algorithms for the constrained Bellman equations, supplies separate convergence guarantees for those algorithms, and proves an error-propagation theorem that bounded low-rank policy-evaluation error implies bounded policy improvement. None of these steps reduce by definition or by self-citation to the input assumption; the low-rank structure is not fitted from the target quantities nor renamed from prior results, and no load-bearing uniqueness theorem or ansatz is imported from the authors' own prior work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the low-rank tensor structure being a faithful and useful model for the value functions together with the convergence properties of the block-coordinate methods; no free parameters are explicitly introduced in the abstract.

axioms (2)
  • [standard math] Block-coordinate descent and block-coordinate gradient descent converge to a stationary point under the low-rank constraints
    Invoked to guarantee that the policy-evaluation step can be solved reliably.
  • [domain assumption] Standard finite-horizon MDP assumptions (finite state-action spaces, existence of optimal policies, proper discounting or terminal conditions)
    Required for the policy-iteration framework and the error-propagation argument to hold.
invented entities (1)
  • Low-rank tensor representation of value functions (no independent evidence)
    purpose: To obtain a compact, scalable surrogate for the non-stationary value functions that mitigates the curse of dimensionality
    This is the core modeling invention that makes the subsequent optimization and error analysis possible.

pith-pipeline@v0.9.0 · 5771 in / 1454 out tokens · 41407 ms · 2026-05-23T04:46:21.296042+00:00 · methodology

