Addressing Finite-Horizon MDPs via Low-Rank Tensor Value Approximation

Antonio G. Marques; Jose Luis Orejuela; Sergio Rozada

arxiv: 2501.10598 · v3 · pith:NHGF75XPnew · submitted 2025-01-17 · 💻 cs.LG

Pith reviewed 2026-05-23 04:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords finite-horizon MDPs · low-rank tensor approximation · policy iteration · value function approximation · block coordinate descent · reinforcement learning · model-free RL

The pith

Finite-horizon MDPs become tractable by approximating their value functions as low-rank tensors inside policy iteration.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models the time-varying value functions of finite-horizon MDPs as low-rank tensors so that the Bellman equations remain solvable even when state spaces are large. Low-rank policy evaluation is paired with greedy policy improvement inside an iterative loop, and the resulting constrained optimization problems are solved by block-coordinate descent or block-coordinate gradient descent. Both algorithms carry convergence guarantees, and the paper proves that any bounded error introduced by the low-rank constraint during evaluation produces only bounded degradation in the final policy. The same framework is adapted to the model-free setting by replacing exact expectations with averages over sampled trajectories. Experiments on synthetic and resource-allocation tasks show that the approach cuts computation while returning policies whose attained returns stay competitive with exact methods.

Core claim

Value functions of finite-horizon MDPs can be represented as low-rank tensors; solving the Bellman equations under this low-rank constraint via block-coordinate methods produces near-optimal policies, and bounded low-rank policy-evaluation error implies bounded policy improvement.

What carries the argument

Low-rank tensor constraint on non-stationary value functions, which reduces representation size and converts the Bellman optimality equations into a tractable constrained optimization problem solved by BCD or BCGD.
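
The block-coordinate idea can be illustrated on a simpler problem. The sketch below is ordinary alternating least squares for a three-way PARAFAC (CP) fit to a known tensor, not the paper's Bellman-constrained solver; the function name `cp_als`, the tensor sizes, and the rank are hypothetical choices made for illustration.

```python
import numpy as np

def cp_als(X, rank, n_iter=50, seed=0):
    """Fit a rank-`rank` CP model to a 3-way tensor X by alternating least squares.

    Each update solves an exact least-squares problem in one factor matrix
    while the other two are held fixed (one "block" per factor).
    Returns the factors and the normalized Frobenius error after each sweep.
    """
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((d, rank)) for d in X.shape)
    history = []
    for _ in range(n_iter):
        # Block 1: update A with B and C fixed (normal equations, Hadamard Gram matrix).
        A = np.einsum("ijk,jr,kr->ir", X, B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        # Block 2: update B with A and C fixed.
        B = np.einsum("ijk,ir,kr->jr", X, A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        # Block 3: update C with A and B fixed.
        C = np.einsum("ijk,ir,jr->kr", X, A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
        X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
        history.append(np.linalg.norm(X - X_hat) / np.linalg.norm(X))
    return (A, B, C), history

# Hypothetical example: a tensor that is exactly rank 3 by construction.
rng = np.random.default_rng(1)
true_factors = [rng.standard_normal((d, 3)) for d in (8, 6, 4)]
X = np.einsum("ir,jr,kr->ijk", *true_factors)
factors, history = cp_als(X, rank=3)
print(f"error after first sweep: {history[0]:.3e}, after last sweep: {history[-1]:.3e}")
```

Because every block update is an exact minimization over one factor, the fitting error cannot increase from one sweep to the next, which is the property that makes block-coordinate schemes attractive for this kind of multilinear problem.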

If this is right

Bounded low-rank policy-evaluation error produces only bounded degradation in the improved policy.
The same low-rank formulation works when transition probabilities are replaced by empirical averages from sampled trajectories.
Both block-coordinate descent and block-coordinate gradient descent converge to stationary points of the low-rank constrained problem.
Computational cost and memory scale with the low-rank factors rather than the full state-space size.
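
The memory claim in the last point can be made concrete with a parameter count. This is a minimal sketch assuming a PARAFAC (CP) model with one factor matrix per tensor mode; the dimensions (horizon 10, two state dimensions of size 20, 5 actions) and the rank of 4 are hypothetical and do not come from the paper.

```python
import numpy as np

def cp_reconstruct(factors):
    """Rebuild a full tensor from CP factor matrices, each of shape (dim_d, rank)."""
    rank = factors[0].shape[1]
    tensor = np.zeros([f.shape[0] for f in factors])
    for r in range(rank):
        component = factors[0][:, r]
        for f in factors[1:]:
            # Outer product adds one tensor mode per factor.
            component = np.multiply.outer(component, f[:, r])
        tensor += component
    return tensor

def param_counts(dims, rank):
    """Number of stored values: full tensor versus CP factors."""
    full = int(np.prod(dims))
    low_rank = rank * int(sum(dims))
    return full, low_rank

# Hypothetical sizes: (horizon, state dim 1, state dim 2, actions).
dims = (10, 20, 20, 5)
rank = 4
rng = np.random.default_rng(0)
factors = [rng.standard_normal((d, rank)) for d in dims]
Q = cp_reconstruct(factors)
full, low_rank = param_counts(dims, rank)
print(Q.shape, full, low_rank)
```

The full tensor grows with the product of the mode sizes, while the factors grow with their sum times the rank, so the gap widens quickly as state dimensions are added.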

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The error-propagation result could be turned into explicit sample-complexity bounds once the tensor rank and the number of iterations are fixed.
If other sequential decision problems with time-varying costs also admit low-rank value structure, the same modeling step would apply directly.
The block-coordinate solvers developed here might serve as building blocks for other tensor-constrained dynamic programming tasks.

Load-bearing premise

Value functions arising in the finite-horizon MDPs of interest admit accurate low-rank tensor approximations.

What would settle it

An MDP whose value functions have high tensor rank, for which the low-rank method returns policies whose returns fall substantially below those obtained by exact dynamic programming on the same problem.
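
A test of this kind needs an exact baseline. The sketch below implements standard tabular backward induction and exact policy evaluation, then compares the optimal return with that of a policy that is greedy with respect to a low-rank surrogate of the value tensor. The surrogate here is a truncated SVD of the time-by-(state, action) unfolding, used only as a stand-in for the paper's BCD/BCGD solvers; the random MDP, its sizes, and all function names are hypothetical.

```python
import numpy as np

def backward_induction(P, R, H):
    """Exact finite-horizon DP. P: (S, A, S) transitions, R: (S, A) rewards.
    Returns the optimal Q tensor of shape (H, S, A); value after the last step is zero."""
    S, A, _ = P.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        Q[h] = R + P @ V_next          # expected reward-to-go for every (s, a)
        V_next = Q[h].max(axis=1)      # greedy value at step h
    return Q

def policy_return(P, R, policy, H):
    """Exact expected return from each start state of a deterministic
    non-stationary policy given as an integer array of shape (H, S)."""
    idx = np.arange(R.shape[0])
    V = np.zeros(R.shape[0])
    for h in reversed(range(H)):
        a = policy[h]
        V = R[idx, a] + P[idx, a] @ V
    return V

def truncated_unfolding(Q, k):
    """Rank-k approximation of the (H, S*A) unfolding of Q (stand-in surrogate)."""
    H = Q.shape[0]
    U, s, Vt = np.linalg.svd(Q.reshape(H, -1), full_matrices=False)
    return ((U[:, :k] * s[:k]) @ Vt[:k]).reshape(Q.shape)

# Hypothetical random MDP with time-invariant dynamics and rewards.
rng = np.random.default_rng(0)
S, A, H = 30, 4, 8
P = rng.dirichlet(np.ones(S), size=(S, A))   # shape (S, A, S), rows sum to 1
R = rng.random((S, A))

Q = backward_induction(P, R, H)
ret_opt = policy_return(P, R, Q.argmax(axis=2), H)
Q_lr = truncated_unfolding(Q, k=1)
ret_lr = policy_return(P, R, Q_lr.argmax(axis=2), H)
print("mean return gap:", (ret_opt - ret_lr).mean())
```

A large gap on an instance whose value tensor is far from low-rank would be the kind of evidence described above; a gap near zero on structured instances says little about the modeling premise in general.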

Figures

Figures reproduced from arXiv: 2501.10598 by Antonio G. Marques, Jose Luis Orejuela, Sergio Rozada.

**Figure 2.** NFE between the tensor of optimal VFs obtained via policy iteration and its low-rank PARAFAC decomposition in the grid-like setup.

**Figure 3.** The figure shows results for BCD-PE and BCD-PI in the first row, and BCGD-PE and BCGD-PI in the second. The columns display: (i) PE convergence in terms of (a) NFE and (b) L(Q); and (ii) PI convergence in terms of (c) NFE and (d) empirical return.

**Figure 4.** The figure shows the average return for S-BCGD-PI and BCTD-PI against different baselines for (a) the wireless communications setup and (b) the battery charging setup. S-BCGD-PI and BCTD-PI converge faster than the baselines and require significantly fewer parameters.

Original abstract

We study the problem of learning optimal policies in finite-horizon Markov Decision Processes (MDPs) using low-rank reinforcement learning (RL) methods. In finite-horizon MDPs, the policies, and therefore the value functions (VFs) are not stationary. This aggravates the challenges of high-dimensional MDPs, as they suffer from the curse of dimensionality and high sample complexity. To address these issues, we propose modeling the VFs of finite-horizon MDPs as low-rank tensors, enabling a scalable representation that renders the problem of learning optimal policies tractable. Our approach focuses on VF approximation within a policy iteration framework, where low-rank policy evaluation is combined with greedy policy improvement to compute near-optimal policies. We introduce an optimization-based framework for solving the Bellman equations with low-rank constraints, along with block-coordinate descent (BCD) and block-coordinate gradient descent (BCGD) algorithms, both with theoretical convergence guarantees. We further establish that bounded low-rank policy evaluation error translates into bounded policy improvement in the finite-horizon setting. For scenarios where the system dynamics are unknown, we adapt the proposed BCGD method to estimate the VFs using sampled trajectories. Numerical experiments further demonstrate that the proposed framework reduces computational demands in controlled synthetic scenarios and more realistic resource allocation problems, while achieving competitive policy performance in terms of attained returns.
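
For orientation, the two objects the abstract combines can be written in standard form. The notation below is assumed for illustration and may differ from the paper's: states s = (s_1, ..., s_D), action a, time step h, horizon H, reward r_h, transition kernel P_h, rank K, and factor matrices Q_0, ..., Q_{D+1}. The sketch also assumes an undiscounted problem and that time is treated as one mode of the value tensor.

```latex
% Finite-horizon Bellman optimality equations (standard form, assumed notation)
Q_h^\star(s,a) \;=\; r_h(s,a) \;+\; \sum_{s'} P_h(s' \mid s,a)\, \max_{a'} Q_{h+1}^\star(s',a'),
\qquad h = 1,\dots,H, \qquad Q_{H+1}^\star \equiv 0 .

% Rank-K PARAFAC (CP) model of the stacked value tensor (assumed parameterization)
[\mathcal{Q}]_{h,\,s_1,\dots,s_D,\,a} \;\approx\; \sum_{k=1}^{K}
[\mathbf{Q}_0]_{h,k} \Bigl( \prod_{d=1}^{D} [\mathbf{Q}_d]_{s_d,k} \Bigr) [\mathbf{Q}_{D+1}]_{a,k} .
```

Because the value at step h depends on the value at step h+1, the functions Q_1, ..., Q_H generally differ from one another, which is why the time index appears as its own tensor mode rather than being dropped as in the stationary infinite-horizon case.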

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a concrete low-rank tensor route to finite-horizon policy iteration with BCD/BCGD solvers and an error-propagation result, but everything rests on an unbacked modeling assumption.

The one thing to know is that this work models non-stationary value functions in finite-horizon MDPs as low-rank tensors, then solves the constrained Bellman equations with block-coordinate descent and its gradient variant. That setup plus the claim that bounded low-rank evaluation error implies bounded policy sub-optimality is the actual new piece beyond standard low-rank RL extensions. They also adapt the method to sampled trajectories when dynamics are unknown and run experiments on synthetic MDPs and a resource-allocation task that show competitive returns with lower compute. Those elements are cleanly stated and the convergence guarantees for the solvers are presented as independent results. The error-translation theorem is the part that would matter most if it holds up.

The central soft spot is exactly what the stress-test note flags: the paper supplies no conditions on the transition or reward structure that would make the value tensor close to low-rank, nor any quantitative bound on the approximation error the constraint introduces. Without that, the policy-improvement guarantee only applies inside the low-rank model and does not automatically transfer to the original MDP. The experiments are run on controlled cases where the assumption is likely satisfied by construction, so they do not test the modeling premise.

This is for RL researchers who already work with tensor or low-rank methods on high-dimensional finite-horizon problems and are willing to accept the structural assumption as a starting point. It has enough distinct technical content and a clear algorithmic contribution that it deserves a serious referee to check the proofs and the experimental design. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper proposes modeling value functions of finite-horizon MDPs as low-rank tensors to mitigate the curse of dimensionality. Within a policy-iteration framework it develops an optimization-based approach to the constrained Bellman equations, introduces BCD and BCGD solvers with convergence guarantees, proves that bounded low-rank policy-evaluation error implies bounded policy improvement, adapts the method to the model-free setting via sampled trajectories, and reports competitive empirical performance on synthetic and resource-allocation instances.

Significance. If the low-rank tensor model is faithful for the MDPs under consideration, the framework supplies a scalable representation together with provably convergent algorithms and an error-propagation guarantee that links policy-evaluation accuracy to policy improvement. The model-free extension and the reported reduction in computational cost would be practically relevant for high-dimensional finite-horizon problems.

major comments (2)

[Abstract] Abstract and introduction: the central error-propagation claim (bounded low-rank policy-evaluation error implies bounded policy improvement) and the tractability argument both rest on the premise that value functions admit accurate low-rank tensor approximations, yet no sufficient conditions on the transition kernel or reward function are supplied that would guarantee this structure or quantify the incurred approximation error. Without such justification the translation result does not necessarily apply to the original MDP.
[Theoretical results (convergence section)] The convergence guarantees for BCD and BCGD are stated for the low-rank constrained problem; because the paper provides neither a priori bounds on the distance between the low-rank solution and the true value tensor nor conditions under which this distance is small, it is unclear whether the guarantees remain meaningful for the underlying finite-horizon MDP.

minor comments (2)

Notation for the tensor ranks and the precise definition of the low-rank constraint set should be introduced earlier and used consistently throughout the algorithmic and theoretical sections.
The experimental section would benefit from an explicit statement of the tensor ranks chosen for each domain and a sensitivity plot showing how performance degrades when the rank is misspecified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the scope of our contributions.

Referee: [Abstract] Abstract and introduction: the central error-propagation claim (bounded low-rank policy-evaluation error implies bounded policy improvement) and the tractability argument both rest on the premise that value functions admit accurate low-rank tensor approximations, yet no sufficient conditions on the transition kernel or reward function are supplied that would guarantee this structure or quantify the incurred approximation error. Without such justification the translation result does not necessarily apply to the original MDP.

Authors: Our error-propagation theorem shows that if low-rank policy-evaluation error is bounded then policy improvement is bounded. This implication is independent of the conditions that make the low-rank structure accurate; it applies to any approximation achieving bounded error. The paper treats low-rank tensor modeling as an explicit modeling choice that yields tractability (analogous to other function-approximation schemes in RL) and supplies convergent algorithms together with the linking theorem under that choice. No claim is made that low-rank structure holds for every MDP; empirical results on the tested instances support practical utility. The translation result therefore applies precisely when the bounded-error premise holds.
revision: no
Referee: [Theoretical results (convergence section)] The convergence guarantees for BCD and BCGD are stated for the low-rank constrained problem; because the paper provides neither a priori bounds on the distance between the low-rank solution and the true value tensor nor conditions under which this distance is small, it is unclear whether the guarantees remain meaningful for the underlying finite-horizon MDP.

Authors: BCD and BCGD are shown to converge for the low-rank constrained optimization problem itself. Their relevance to the original MDP is supplied by the separate error-propagation theorem that converts any bound on the distance between the obtained low-rank solution and the true value tensor into a bound on policy sub-optimality. A priori bounds on that distance would require additional structural assumptions on the transition kernel or reward; such assumptions lie outside the paper's scope of developing the low-rank framework and the general linking result. The guarantees are therefore meaningful whenever the modeling assumption yields acceptably small error, which can be assessed empirically or via domain knowledge.
revision: no

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

The paper explicitly adopts low-rank tensor modeling of finite-horizon value functions as an upfront modeling premise that renders the high-dimensional problem tractable. From this assumption it derives BCD/BCGD algorithms for the constrained Bellman equations, supplies separate convergence guarantees for those algorithms, and proves an error-propagation theorem that bounded low-rank policy-evaluation error implies bounded policy improvement. None of these steps reduce by definition or by self-citation to the input assumption; the low-rank structure is not fitted from the target quantities nor renamed from prior results, and no load-bearing uniqueness theorem or ansatz is imported from the authors' own prior work. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the low-rank tensor structure being a faithful and useful model for the value functions together with the convergence properties of the block-coordinate methods; no free parameters are explicitly introduced in the abstract.

axioms (2)

[standard math] Block-coordinate descent and block-coordinate gradient descent converge to a stationary point under the low-rank constraints.
Invoked to guarantee that the policy-evaluation step can be solved reliably.
[domain assumption] Standard finite-horizon MDP assumptions (finite state-action spaces, existence of optimal policies, proper discounting or terminal conditions).
Required for the policy-iteration framework and the error-propagation argument to hold.

invented entities (1)

Low-rank tensor representation of value functions [no independent evidence]
purpose: To obtain a compact, scalable surrogate for the non-stationary value functions that mitigates the curse of dimensionality
This is the core modeling invention that makes the subsequent optimization and error analysis possible.


Q. Zhao, H. Xu, and S. Jagannathan, “Neural network-based finite-horizon optimal control of uncertain affine nonlinear discrete-time systems,” IEEE Trans. Neural Netw. Learning Syst. , vol. 26, no. 3, pp. 486–499, 2014

work page 2014

[21] [21]

Neural network-based finite horizon optimal adaptive consensus control of mobile robot formations,

H. Guzey, H. Xu, and J. Sarangapani, “Neural network-based finite horizon optimal adaptive consensus control of mobile robot formations,” Optimal Control Applications and Methods , vol. 37, no. 5, pp. 1014–1034, 2016

work page 2016

[22] [22]

Deep neural networks algorithms for stochastic control problems on finite horizon: Convergence analysis,

C. Hur ´e, H. Pham, A. Bachouch, and N. Langren ´e, “Deep neural networks algorithms for stochastic control problems on finite horizon: Convergence analysis,” SIAM J. Numerical Analysis , vol. 59, no. 1, pp. 525–557, 2021

work page 2021

[23] [23]

Sample complexity of episodic fixed-horizon reinforcement learning,

C. Dann and E. Brunskill, “Sample complexity of episodic fixed-horizon reinforcement learning,” in Advances Neural Info. Process. Syst. , vol. 28, 2015

work page 2015

[24] [24]

Fixed-horizon temporal difference methods for stable reinforcement learning,

K. De Asis, A. Chan, S. Pitis, R. Sutton, and D. Graves, “Fixed-horizon temporal difference methods for stable reinforcement learning,” in AAAI Conf. Artif. Intell. , vol. 34, pp. 3741–3748, 2020

work page 2020

[25] [25]

Tensor decompositions and applications,

T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM Review, vol. 51, no. 3, pp. 455–500, 2009

work page 2009

[26] [26]

Tensor completion and low-n-rank tensor recovery via convex optimization,

S. Gandy, B. Recht, and I. Yamada, “Tensor completion and low-n-rank tensor recovery via convex optimization,” Inverse Problems, vol. 27, no. 2, p. 025010, 2011

work page 2011

[27] [27]

Tensor decomposition for signal processing and machine learning,

N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalex- akis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Trans. Signal Process., vol. 65, no. 13, pp. 3551– 3582, 2017

work page 2017

[28] [28]

Low-rank tensor methods for communicating Markov processes,

D. Kressner and F. Macedo, “Low-rank tensor methods for communicating Markov processes,” in Intl. Conf. Quantitative Evaluation of Syst. , pp. 25– 40, Springer, 2014

work page 2014

[29] [29]

Low-rank tensor methods for Markov chains with applications to tumor progression models,

P. Georg, L. Grasedyck, M. Klever, R. Schill, R. Spang, and T. Wettig, “Low-rank tensor methods for Markov chains with applications to tumor progression models,” J. Math. Biology , vol. 86, no. 1, p. 7, 2023

work page 2023

[30] [30]

Low-rank tensors for multi-dimensional Markov models,

M. Navarro, S. Rozada, A. G. Marques, and S. Segarra, “Low-rank tensors for multi-dimensional Markov models,” arXiv preprint arXiv:2411.02098, 2024

work page arXiv 2024

[31] [31]

Reinforcement Learning in Rich-Observation MDPs using Spectral Methods

K. Azizzadenesheli, A. Lazaric, and A. Anandkumar, “Reinforcement learning in rich-observation MDPs using spectral methods,” arXiv preprint arXiv:1611.03907, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[32] [32]

Maximum likelihood tensor decomposition of Markov decision process,

C. Ni and M. Wang, “Maximum likelihood tensor decomposition of Markov decision process,” in IEEE Intl. Symposium Info. Theory (ISIT) , pp. 3062–3066, IEEE, 2019

work page 2019

[33] [33]

Learning good state and action representations via tensor decomposition,

C. Ni, A. R. Zhang, Y . Duan, and M. Wang, “Learning good state and action representations via tensor decomposition,” in IEEE Intl. Symposium Info. Theory (ISIT) , pp. 1682–1687, IEEE, 2021

work page 2021

[34] [34]

Learning good state and action representations for Markov decision process via tensor decomposition,

C. Ni, Y . Duan, M. Dahleh, M. Wang, and A. R. Zhang, “Learning good state and action representations for Markov decision process via tensor decomposition,” J. Mach. Learn. Res. (JMLR) , vol. 24, no. 115, pp. 1–53, 2023

work page 2023

[35] [35]

Efficient high- dimensional stochastic optimal motion control using tensor-train decom- position.,

A. A. Gorodetsky, S. Karaman, and Y . M. Marzouk, “Efficient high- dimensional stochastic optimal motion control using tensor-train decom- position.,” in Robotics: Science and Syst. , Citeseer, 2015

work page 2015

[36] [36]

High-dimensional stochas- tic optimal control using continuous tensor decompositions,

A. Gorodetsky, S. Karaman, and Y . Marzouk, “High-dimensional stochas- tic optimal control using continuous tensor decompositions,” The Intl. J. Robotics Research, vol. 37, no. 2-3, pp. 340–377, 2018

work page 2018

[37] [37]

Tensor decomposition meth- ods for high-dimensional Hamilton–Jacobi–Bellman equations,

S. Dolgov, D. Kalise, and K. K. Kunisch, “Tensor decomposition meth- ods for high-dimensional Hamilton–Jacobi–Bellman equations,” SIAM J. Scientific Computing, vol. 43, no. 3, pp. A1625–A1650, 2021

work page 2021

[38] [38]

Approximating optimal feedback controllers of finite horizon control problems using hierarchical tensor formats,

M. Oster, L. Sallandt, and R. Schneider, “Approximating optimal feedback controllers of finite horizon control problems using hierarchical tensor formats,” SIAM J. Scientific Computing , vol. 44, no. 3, pp. B746–B770, 2022

work page 2022

[39] [39]

Harnessing structures for value-based planning and reinforcement learning,

Y . Yang, G. Zhang, Z. Xu, and D. Katabi, “Harnessing structures for value-based planning and reinforcement learning,” in Intl. Conf. Learning Representations (ICLR), 2020

work page 2020

[40] [40]

Sample efficient reinforcement learning via low-rank matrix estimation,

D. Shah, D. Song, Z. Xu, and Y . Yang, “Sample efficient reinforcement learning via low-rank matrix estimation,” in Advances Neural Info. Pro- cess. Syst., (Red Hook, NY , USA), Curran Associates Inc., 2020

work page 2020

[41] [41]

Low-rank state-action value- function approximation,

S. Rozada, V . Tenorio, and A. G. Marques, “Low-rank state-action value- function approximation,” in European Signal Process. Conf. (EUSIPCO) , pp. 1471–1475, IEEE, 2021

work page 2021

[42] [42]

Tensor-based reinforcement learning for network routing,

K.-C. Tsai et al. , “Tensor-based reinforcement learning for network routing,” IEEE J. Sel. Topics Signal Process. , vol. 15, no. 3, pp. 617– 629, 2021

work page 2021

[43] [43]

Tensor and matrix low- rank value-function approximation in reinforcement learning,

S. Rozada, S. Paternain, and A. G. Marques, “Tensor and matrix low- rank value-function approximation in reinforcement learning,”IEEE Trans. Signal Process., vol. 72, pp. 1634–1649, 2024

work page 2024

[44] [44]

PARAFAC. tutorial and applications,

R. Bro, “PARAFAC. tutorial and applications,” Chemometrics Intell. Laboratory Syst., vol. 38, no. 2, pp. 149–171, 1997

work page 1997

[45] [45]

Bertsekas, Non-linear programming

D. Bertsekas, Non-linear programming. Athena Scientific, 1999

work page 1999

[46] [46]

A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factor- ization and completion,

Y . Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factor- ization and completion,” SIAM J. Imaging Science, vol. 6, no. 3, pp. 1758– 1789, 2013

work page 2013

[47] [47]

Revisiting fundamentals of experience replay,

W. Fedus, P. Ramachandran, R. Agarwal, Y . Bengio, H. Larochelle, M. Rowland, and W. Dabney, “Revisiting fundamentals of experience replay,” in Intl. Conf. Machine Learning (ICML) , pp. 3061–3071, PMLR, 2020

work page 2020

[48] [48]

A finite time analysis of temporal difference learning with linear function approximation,

J. Bhandari, D. Russo, and R. Singal, “A finite time analysis of temporal difference learning with linear function approximation,” in Conf. on Learning Theory (COT) , pp. 1691–1692, PMLR, 2018

work page 2018

[49] [49]

TD conver- gence: An optimization perspective,

K. Asadi, S. Sabach, Y . Liu, O. Gottesman, and R. Fakoor, “TD conver- gence: An optimization perspective,” in Advances Neural Info. Process. Syst., vol. 36, 2024

work page 2024

[50] [50]

Solving finite-horizon MDPs via tensor low-rank methods

S. Rozada, “Solving finite-horizon MDPs via tensor low-rank methods.” https://github.com/sergiorozada12/fhtlr-opt-learning, 2024

work page 2024

[51] [51]

A tutorial on linear function approximators for dynamic programming and reinforcement learning,

A. Geramifard et al. , “A tutorial on linear function approximators for dynamic programming and reinforcement learning,” Foundations and Trends® in Machine Learning , vol. 6, no. 4, pp. 375–451, 2013

work page 2013

[52] [52]

Almost-sure iden- tifiability of multidimensional harmonic retrieval,

T. Jiang, N. D. Sidiropoulos, and J. M. Ten Berge, “Almost-sure iden- tifiability of multidimensional harmonic retrieval,” IEEE Trans. Signal Process., vol. 49, no. 9, pp. 1849–1859, 2001

work page 2001

[53] [53]

Block stochastic gradient iteration for convex and nonconvex optimization,

Y . Xu and W. Yin, “Block stochastic gradient iteration for convex and nonconvex optimization,” SIAM J. Optimization, vol. 25, no. 3, pp. 1686– 1716, 2015

work page 2015