{"total":18,"items":[{"citing_arxiv_id":"2606.27112","ref_index":30,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Heavy-Ball Q-Learning with Residual Weighting Correction","primary_cat":"cs.LG","submitted_at":"2026-06-25T14:48:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Corrected heavy-ball Q-learning with convergence and acceleration guarantees is derived via switched linear system and joint spectral radius analysis, extended to linear function approximation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17811","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"UMB: A Unified Markov Binary Format for Probabilistic Model Checking (extended version)","primary_cat":"cs.LO","submitted_at":"2026-06-16T11:39:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UMB is a new binary file format for probabilistic systems that provides a unified, efficient alternative to tool-specific textual representations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04872","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Consistent Distributed Cooperative Localization for Ultra Large-Scale Multi-agent Systems","primary_cat":"eess.SY","submitted_at":"2026-06-03T13:36:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new cooperative localization algorithm based on overlapping covariance intersection is fully distributed, provably recursively consistent, and scalable to ultra large-scale multi-agent systems without performance loss from ignored 
cross-correlations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03048","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Value Function Semi-Algebraic Set in Partially Observable Markov Decision Processes","primary_cat":"math.OC","submitted_at":"2026-06-02T02:30:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Feasible value functions in POMDPs under memoryless policies form a semi-algebraic set defined by polynomial inequalities from the model parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01979","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Simple Hierarchical Causality Primer","primary_cat":"cs.MA","submitted_at":"2026-06-01T09:41:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Presents a simple discrete primer on hierarchical causality that requires causation classes, aggregation operators, and discrete event-time maps to connect actor and agent levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31388","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Constrained Multi-Objective Reinforcement Learning with Max-Min Criterion","primary_cat":"cs.LG","submitted_at":"2026-05-29T14:52:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Introduces a constrained max-min MORL algorithm with convergence analysis, validated in tabular settings and three simulated control 
domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28653","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Adaptive clinical trials based on design-optimal e-values with automatic curtailment: An application to single-arm trials with binary data","primary_cat":"stat.ME","submitted_at":"2026-05-27T15:54:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Finite-horizon optimal e-value designs for adaptive single-arm binary trials are constructed via dynamic programming and shown to have competitive operating characteristics with automatic futility indication.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22364","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Scaling Observation-aware Planning in Uncertain Domains","primary_cat":"cs.AI","submitted_at":"2026-05-21T11:58:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A POMDP decomposition method scales solving of the Sensor Selection Problem and Positional Observability Problem by 3 and 5 orders of magnitude in instance size and runtime.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12288","ref_index":181,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:44:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Motivated by these gaps, we study 
token-level preference modeling. By defining preferences via a token-level Bradley-Terry model and adopting a ratio-matching perspective, we learn preference-optimal decisions at each prefix using only sequence-level comparison data. 3. Background 3.1. Preliminary and Notions When viewing text generation as a Markov decision process (Puterman, 1994), we define the state at step t as the prompt together with the response prefix produced so far, i.e., s_t = [x, y_{<t}]. The action is the next token to generate, a_t = y_t, and the per-token reward is given by R_t := R(s_t, a_t) = R([x, y_{<t}], y_t). Using these definitions, for a policy π we define the state-action value function Q^π, the state value function V^π and the advantage function A^π as:"},{"citing_arxiv_id":"2605.11897","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Fast Computation of Conditional Probabilities in MDPs and Markov Chain Families","primary_cat":"cs.LO","submitted_at":"2026-05-12T10:11:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new efficient algorithm computes optimal conditional reachability probabilities in MDPs without creating hard cyclic reductions, achieving linear time on acyclic cases and substantial speedups on benchmarks from Bayesian networks, probabilistic programs, and runtime monitoring.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07537","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Multi-Environment POMDPs with Finite-Horizon Objectives","primary_cat":"cs.AI","submitted_at":"2026-05-08T10:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The optimal value and policy computation problem for finite-horizon objectives in multi-environment POMDPs is PSPACE-complete, and a 
new algorithm solves it more efficiently than previous methods on classical benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05812","ref_index":35,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Long-Horizon Q-Learning: Accurate Value Learning via n-Step Inequalities","primary_cat":"cs.AI","submitted_at":"2026-05-07T07:47:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LQL turns n-step action-sequence lower bounds into a practical hinge-loss stabilizer for off-policy Q-learning without extra networks or forward passes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[0,2] 0.0 [0,0] 0.0 [0,0] 31.3 [2,58] task20.0 [0,0] 63.3 [50,76] 36.0 [24,56] 6.7 [4,10] 4.7 [2,6] 97.3 [96,100] task30.0 [0,0] 7.3 [4,10] 4.7 [0,12] 0.0 [0,0] 0.0 [0,0] 70.7 [66,76] task40.0 [0,0] 4.7 [4,6] 0.0 [0,0] 0.0 [0,0] 0.0 [0,0] 80.7 [64,92] task50.0 [0,0] 98.7 [96,100] 90.0 [82,96] 61.3 [54,66] 26.0 [20,36] 98.7 [96,100] Total0.0 [0,0] 38.4 [35,41] 26.4 [23,30] 13.6 [12,15] 6.1 [5,8] 75.7 [70,81] Table 6:RoboMimic results.Each cell is the success rate (%) at the end of online training (mean across seeds). The Total row is the equal-weight mean of Square and Can.Boldmarks methods within 95% of the row maximum; an overbar marks methods within 95% of the per-actor maximum. 
Best-of-N FQL Gaussian Action Chunking"},{"citing_arxiv_id":"2604.23068","ref_index":80,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Probabilistic Hazard Analysis Framework with Stochastic Optimal Control for Deteriorating Civil Infrastructure Systems","primary_cat":"eess.SY","submitted_at":"2026-04-24T23:38:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A life-cycle optimization framework for deteriorating infrastructure under hazards is formulated as an MDP with a Kronecker-factored tensor method that reduces computational complexity from exponential to linear while preserving exact dynamic programming solutions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"As in the original extended fragility framework, these multi-state transition probabilities are modeled using softmax regression [48]. 3 Life-Cycle Optimization as a Markov Decision Process This section formally casts the adaptive maintenance problem as a finite-horizon MDP, a powerful mathematical framework for sequential decision-making under uncertainty [80, 81]. We first define the MDP components, then highlight the computational challenges posed by system-level optimization, and finally introduce a novel tensor-based algorithm that makes finding the optimal policy tractable. The key notation for the MDP formulation is summarized in Table 2. 
3.1 Problem Formulation as a Finite-Horizon MDP A finite-horizon MDP is defined by the tuple (T, S, A, P, C, γ)."},{"citing_arxiv_id":"2604.22991","ref_index":6,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Optimal strategies in the all-heads coin game","primary_cat":"math.PR","submitted_at":"2026-04-24T20:16:05+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11507","ref_index":104,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Deep Learning for Sequential Decision Making under Uncertainty: Foundations, Frameworks, and Frontiers","primary_cat":"math.OC","submitted_at":"2026-04-13T14:11:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A tutorial framing deep learning as a complement to optimization for sequential decision-making under uncertainty, with applications in supply chains, healthcare, and energy.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"275(3):795-821, URL http://dx.doi.org/10.1016/j.ejor.2018.07.014. [103] Powell WB (2022) Reinforcement Learning and Stochastic Optimization: A Unified Framework for Sequential Decisions (Hoboken, NJ: Wiley), ISBN 9781119815068, URL https://www.wiley.com/en-us/Reinforcement+Learning+and+Stochastic+Optimization%3A+A+Unified+Framework+for+Sequential+Decisions-p-9781119815037. [104] Puterman ML (1994) Markov Decision Processes: Discrete Stochastic Dynamic Programming (New York: John Wiley & Sons), URL http://dx.doi.org/10.1002/9780470316887. [105] Ribeiro MT, Singh S, Guestrin C (2016) \"why should I trust you?\": Explaining the predictions of any classifier. 
42 Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,"},{"citing_arxiv_id":"2602.06603","ref_index":28,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The hidden risks of temporal resampling in clinical reinforcement learning","primary_cat":"cs.LG","submitted_at":"2026-02-06T11:02:06+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Resampling clinical time series into uniform bins for offline RL reduces performance by up to 60% and causes retrospective evaluations to overestimate returns by 1.5-3x versus unprocessed data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.24298","ref_index":39,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning","primary_cat":"cs.LG","submitted_at":"2025-05-30T07:18:25+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.04244","ref_index":119,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Benchmark Data Contamination of Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-06-06T16:41:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment 
approaches.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Machine Learning Research 21, 140 (2020), 1-67. http://jmlr.org/papers/v21/20-074.html [118] Federico Ranaldi, Elena Sofia Ruzzetti, Dario Onorati, Leonardo Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli, and Fabio Massimo Zanzotto. 2024. Investigating the Impact of Data Contamination of Large Language Models in Text-to-SQL Translation. arXiv:2402.08100 [cs.CL] [119] Martin Riddell, Ansong Ni, and Arman Cohan. 2024. Quantifying Contamination in Evaluating Code Generation Capabilities of Language Models. arXiv:2403.04811 [cs.SE] [120] Jesse Roberts. 2024. How Powerful are Decoder-Only Transformer Neural Models? arXiv:2305.17026 [cs.CL] [121] Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley."}],"limit":50,"offset":0}