arxiv: 2605.16777 · v1 · pith:N7WH75Z4new · submitted 2026-05-16 · 📡 eess.SY · cs.SY

AoI-MDP: An AoI Optimized Markov Decision Process (Student Abstract)

Pith reviewed 2026-05-19 21:28 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords Age of Information · Markov Decision Process · Reinforcement Learning · Underwater Vehicles · Observation Delay · Autonomous Systems · Information Freshness

The pith

Incorporating age of information into the state space and adding a wait action lets reinforcement learning produce better policies for underwater vehicles facing observation delays.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AoI-MDP to address observation delays in autonomous underwater vehicle tasks by treating those delays as signal age and embedding them directly in the MDP state. It further expands the action set with an explicit wait option and folds age of information into the reward signal so that reinforcement learning jointly optimizes freshness and task performance. A reader would care because many real ocean missions suffer from significant sensor or communication lag, and any method that improves decisions under such lag has direct practical value. Simulations indicate the augmented process beats the ordinary MDP on the same tasks while remaining stable and generalizable.

Core claim

AoI-MDP models observation delay as signal delay and places it inside the state space. It introduces a wait action and integrates age of information into the reward functions, enabling reinforcement learning to optimize both information freshness and decision quality. The resulting policies outperform those of the standard MDP in simulated underwater scenarios.
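The three modifications named in the core claim can be illustrated with a toy environment. This is a minimal sketch under stated assumptions, not the authors' implementation: the class `ToyAoIEnv`, the constants `WAIT`, `LEFT`, `RIGHT`, the `aoi_weight` parameter, and the rule that waiting delivers a fresh observation are all hypothetical choices made for illustration.

```python
import random
from dataclasses import dataclass

WAIT, LEFT, RIGHT = 0, 1, 2  # hypothetical action indices; WAIT is the added action


@dataclass
class AoIState:
    observed_pos: int  # last position the agent observed (possibly stale)
    aoi: int           # steps elapsed since that observation was taken


class ToyAoIEnv:
    """1-D corridor whose position observation only refreshes when the agent waits."""

    def __init__(self, length=10, goal=9, aoi_weight=0.1, seed=0):
        self.length, self.goal, self.aoi_weight = length, goal, aoi_weight
        self.rng = random.Random(seed)
        self.reset()

    def reset(self):
        self.pos = 0
        self.state = AoIState(observed_pos=0, aoi=0)
        return self.state

    def step(self, action):
        if action == WAIT:
            # Assumption: waiting costs one step but delivers a fresh observation.
            self.state = AoIState(observed_pos=self.pos, aoi=0)
        else:
            # Motion is perturbed by random drift, so a stale observation
            # becomes a progressively worse estimate of the true position.
            move = 1 if action == RIGHT else -1
            drift = self.rng.choice([-1, 0, 0, 1])
            self.pos = min(max(self.pos + move + drift, 0), self.length - 1)
            self.state = AoIState(self.state.observed_pos, self.state.aoi + 1)
        done = self.pos == self.goal
        task_reward = 1.0 if done else -0.01
        # AoI enters the reward as a linear penalty (an assumed functional form).
        reward = task_reward - self.aoi_weight * self.state.aoi
        return self.state, reward, done
```

A policy trained on `AoIState` sees how stale its position estimate is and can choose `WAIT` when the AoI penalty outweighs one step of lost progress, which is the trade-off the paper describes.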

What carries the argument

Age-of-information-augmented MDP that adds delay to the state vector and a wait action to the action set so reinforcement learning can trade off freshness against task progress.

Load-bearing premise

Adding age of information to the state and a wait action will produce better policies without introducing instability or requiring extensive new hyper-parameter tuning in the reinforcement learning process.

What would settle it

A set of underwater-task simulations in which AoI-MDP achieves no higher cumulative reward or success rate than standard MDP across varied delay distributions would falsify the central claim.

Figures

Figures reproduced from arXiv: 2605.16777 by Guanwen Xie, Jingzehua Xu, Shuai Zhang, Xinqi Wang, Yimian Ding, Yiyuan Yang.

Figure 1: Illustration of the AoI model.
• We utilize statistical delay modeling (SDM) for delay-oriented modeling of observation delay via sensor-based model, yielding realistic results.
• Comprehensive experiments in the underwater data collection task show AoI-MDP's superior feasibility and performance in balancing multi-objective optimization.
Methodology: AoI Optimized Markov Decision Process. As illustrated in…
Figure 2: Diagram of the heading and time delay estimation.
Figure 3: Comparison of experimental results of RL training
Original abstract

Ocean exploration places high demands on autonomous underwater vehicles, especially when there's observation delay. We propose age of information optimized Markov decision process (AoI-MDP) to enhance underwater tasks by modeling observation delay as signal delay and including it in the state space. AoI-MDP also introduces wait time in the action space and integrates AoI with reward functions, optimizing information freshness and decision-making using reinforcement learning. Simulations show AoI-MDP outperforms the standard MDP, demonstrating superior performance, feasibility, and generalization in underwater tasks. To accelerate relevant research, we have made the codes available as open-source at https://github.com/Xiboxtg/AoI-MDP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes AoI-MDP, an extension of the standard Markov decision process for autonomous underwater vehicles that incorporates age of information (AoI) to model observation delays. AoI is added to the state space, a wait action is introduced in the action space, and AoI is integrated into the reward function. Reinforcement learning is used to derive policies that optimize information freshness alongside task performance. Simulations are reported to show that AoI-MDP outperforms the baseline MDP, with claims of superior performance, feasibility, and generalization; the implementation code is released as open source.

Significance. If the performance advantage can be shown under matched training conditions, the approach offers a practical way to handle communication delays in underwater robotic control, a setting where observation staleness directly affects decision quality. The open-source code release is a clear strength that supports reproducibility and extension by other researchers.

major comments (2)
  1. [Abstract / Simulation results] The central performance claim (abstract: 'Simulations show AoI-MDP outperforms the standard MDP') is unsupported by any quantitative metrics, tables, or figures. No values for cumulative reward, average AoI, task success rate, or other measures are supplied, nor are error bars or statistical tests across seeds reported. This absence prevents verification that the observed gap is attributable to the AoI augmentation rather than implementation artifacts.
  2. [Method and simulation setup] No information is given on training parity between AoI-MDP and the baseline MDP. The manuscript does not state whether the same number of episodes, wall-clock budget, or hyper-parameter tuning effort was used, despite AoI-MDP expanding the state space (by the AoI variable) and action space (by the wait action). Without explicit confirmation of comparable training resources, the superiority claim cannot be isolated to the modeling choice.
minor comments (2)
  1. [Overall] The manuscript is brief, consistent with a student abstract format; a short paragraph describing the specific underwater task (e.g., navigation, target tracking) and the simulation environment parameters would improve readability.
  2. [Code availability] The GitHub link is provided; confirming that the repository contains the exact environment, reward definitions, and training scripts used for the reported simulations would strengthen the reproducibility claim.
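The per-seed reporting requested in the first major comment could be produced along the following lines. This is a sketch using only the Python standard library; the function names `summarize` and `welch_t` are hypothetical, and the inputs are assumed to be lists of final returns, one value per training seed.

```python
import math
import statistics


def summarize(returns):
    """Mean and sample standard deviation of per-seed returns."""
    return statistics.mean(returns), statistics.stdev(returns)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, dof
```

Welch's test is used here because the two methods need not have equal variance across seeds; converting `t` and `dof` to a p-value requires a t-distribution table or a statistics library.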

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments on our student abstract. We address the major concerns below and will incorporate revisions to provide more quantitative evidence and clarify the experimental setup.

Point-by-point responses
  1. Referee: [Abstract / Simulation results] The central performance claim (abstract: 'Simulations show AoI-MDP outperforms the standard MDP') is unsupported by any quantitative metrics, tables, or figures. No values for cumulative reward, average AoI, task success rate, or other measures are supplied, nor are error bars or statistical tests across seeds reported. This absence prevents verification that the observed gap is attributable to the AoI augmentation rather than implementation artifacts.

    Authors: We agree with the referee that the abstract lacks specific quantitative metrics to support the performance claims. Within the constraints of the student abstract format, we will revise the manuscript to include key simulation results in a table, reporting average cumulative reward, mean AoI, and task success rate for AoI-MDP versus the baseline MDP, with standard deviations across multiple runs. revision: yes

  2. Referee: [Method and simulation setup] No information is given on training parity between AoI-MDP and the baseline MDP. The manuscript does not state whether the same number of episodes, wall-clock budget, or hyper-parameter tuning effort was used, despite AoI-MDP expanding the state space (by the AoI variable) and action space (by the wait action). Without explicit confirmation of comparable training resources, the superiority claim cannot be isolated to the modeling choice.

    Authors: We acknowledge the need for explicit details on the experimental setup. The training was performed with equivalent resources: the same number of episodes and hyper-parameter settings were used for both models. We will update the manuscript to describe the simulation parameters in detail, confirming the matched training conditions and noting any differences due to the increased state-action space. revision: yes
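One way to make the matched-training claim checkable is to record both runs in a shared configuration and flag any difference that is not a direct consequence of the augmented state and action spaces. The sketch below is illustrative only: `RunConfig`, its fields, and `parity_violations` are hypothetical names, and no values from the paper are assumed.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    episodes: int
    learning_rate: float
    seeds: tuple
    state_dim: int    # expected to grow by one for the AoI variable
    num_actions: int  # expected to grow by one for the wait action


def parity_violations(a, b, allowed=("state_dim", "num_actions")):
    """Names of fields that differ between two runs but are not expected to."""
    da, db = asdict(a), asdict(b)
    return [name for name in da if da[name] != db[name] and name not in allowed]
```

An empty list from `parity_violations(baseline, aoi_mdp)` would document that the two runs differ only in the fields the modeling change is expected to alter.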

Circularity Check

0 steps flagged

No circularity in AoI-MDP proposal or simulation claims

Full rationale

The paper proposes AoI-MDP by augmenting a standard MDP state space with an age-of-information variable, adding a wait action, and folding AoI into the reward function, then reports simulation outperformance versus baseline MDP. No equations, parameter-fitting steps, self-citations, or uniqueness theorems are present that would reduce any claimed result to an input by construction. The approach relies on conventional RL components applied to the modified formulation, rendering the work self-contained against external benchmarks with no load-bearing circular reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. The approach relies on standard MDP transition and reward structures plus the addition of AoI terms whose functional form is not specified.
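For orientation, the standard definition of age of information and one plausible way the AoI terms could enter the formulation are written out here. Only the first line is the standard definition; the augmentation on the following lines is an assumed form, and the symbols $\tilde{s}_t$, $\tilde{\mathcal{A}}$, $a_{\text{wait}}$, and $\lambda$ are illustrative rather than the paper's notation.

```latex
% Standard AoI: U(t) is the generation time of the freshest observation available at time t.
\Delta(t) = t - U(t)

% Assumed augmentation of state, action set, and reward (illustrative, not from the paper).
\tilde{s}_t = (s_t, \Delta_t), \qquad
\tilde{\mathcal{A}} = \mathcal{A} \cup \{a_{\text{wait}}\}, \qquad
\tilde{r}_t = r_t - \lambda \Delta_t, \quad \lambda \ge 0
```

Under this assumed form, $\lambda$ would be a free parameter controlling how strongly freshness is weighed against task reward.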

pith-pipeline@v0.9.0 · 5656 in / 1123 out tokens · 36127 ms · 2026-05-19T21:28:14.765743+00:00 · methodology
