{"total":13,"items":[{"citing_arxiv_id":"2606.03070","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ASymPO: Asymmetric-Scale Policy Optimization for Asynchronous LLM Post-Training Without Behavior Information","primary_cat":"cs.LG","submitted_at":"2026-06-02T03:00:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ASymPO normalizes token losses by average current-policy negative log-probability to restore zero-sum balance in asynchronous LLM RL without behavior information.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20999","ref_index":133,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Concentration of General Stochastic Approximation Under Heavy-Tailed Markovian Noise","primary_cat":"math.PR","submitted_at":"2026-05-20T10:38:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Establishes maximal concentration bounds for stochastic approximation under heavy-tailed Markovian noise, with tails ranging from sub-Gaussian to heavier than Weibull depending on step sizes and contractivity properties, plus a truncation argument for unbounded noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14350","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distributionally Robust Multi-Task Reinforcement Learning via Adaptive Task Sampling","primary_cat":"cs.LG","submitted_at":"2026-05-14T04:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DRATS derives a minimax objective from a feasibility formulation of MTRL to adaptively sample tasks with the largest return gaps, leading to better worst-task 
performance on MetaWorld benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that the θ-player is C-low-regret (Definition 4.1). For any target accuracy ϵ > 0, set η = 2 log k / ϵ and α = √(2 log k) / (G√T), where G := M + (1 + max(ηM, log k)) / η. Then for any T ≥ 4(G√(2 log k) + C)² / ϵ², DRATS satisfies: max_{i∈[k]} (1/T) Σ_{t=1}^{T} g_i(θ_t) ≤ min_{θ∈Θ} max_{i∈[k]} g_i(θ) + ϵ. (30) Proof. Step 1: the q-player. At round t, the q-player minimizes the KL-regularized loss ℓ_t(q) = −⟨q, g(θ_t)⟩ + (1/η) KL(q∥p_0) (31) via mirror descent with the negative entropy mirror map ψ(q) = Σ_i q_i log q_i with Bregman divergence D_ψ(u, v) = KL(u∥v), step size α ∈ (0, η], and uniform initialization q_1 = p_0. Since ℓ_t is the sum of the linear term ⟨q, g(θ_t)⟩ and (1/η) KL(q∥p_0), which is 1-strongly convex with respect to ∥·∥_1 [Duchi, 2023], ℓ_t is convex. Its gradient at interior points of the simplex is"},{"citing_arxiv_id":"2605.06032","ref_index":280,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Does Synthetic Data Help? 
Empirical Evidence from Deep Learning Time Series Forecasters","primary_cat":"cs.LG","submitted_at":"2026-05-07T11:22:45+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Synthetic data augmentation helps channel-mixing time series models but degrades channel-independent ones, with reliable gains only from seasonal-trend generators and gradual schedules in low-resource settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14895","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Importance Sampling: Rejection-Gated Policy Optimization","primary_cat":"cs.LG","submitted_at":"2026-04-16T11:39:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RGPO replaces importance sampling with a smooth [0,1] acceptance gate in policy gradients, unifying TRPO/PPO/REINFORCE, bounding variance for heavy-tailed ratios, and showing gains in online RLHF experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25424","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Polychromic Objectives for Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2025-09-29T19:32:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces polychromic objectives adapted into PPO via vine sampling and modified advantages, showing higher success rates and better coverage under perturbations on BabyAI, Minigrid, and algorithmic 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09838","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives","primary_cat":"cs.LG","submitted_at":"2025-09-11T20:34:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2005.01643","ref_index":215,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems","primary_cat":"cs.LG","submitted_at":"2020-05-04T17:00:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1910.00177","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-10-01T02:23:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AWR learns policies via advantage-weighted supervised regression on actions, achieving competitive off-policy performance on Gym tasks and strong results from static data 
alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.11770","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"To Learn or Not to Learn: Analyzing the Role of Learning for Navigation in Virtual Environments","primary_cat":"cs.CV","submitted_at":"2019-07-26T19:45:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Classical agents outperform learning-based ones on MINOS and Stanford 3D Indoor Spaces, with learned agents weaker at collision avoidance and memory but stronger at handling ambiguity and noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.06396","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Dual Memory Structure for Efficient Use of Replay Memory in Deep Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2019-07-15T09:45:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Dual memory (main plus cache) for replay memory in DRL yields higher scores than single memory across three Gym environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.11046","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multi-Agent Deep Reinforcement Learning for Liquidation Strategy Analysis","primary_cat":"q-fin.TR","submitted_at":"2019-06-24T20:22:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The authors extend the Almgren-Chriss model to a multi-agent setting and apply deep reinforcement learning to simulate and optimize liquidation strategies under practical 
constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1906.09734","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimal Use of Experience in First Person Shooter Environments","primary_cat":"cs.LG","submitted_at":"2019-06-24T05:37:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Empirical tests in VizDoom show multiple DQN updates per step do not improve performance after learning rate adjustment, with a 4:1 update-to-step ratio optimal before significant degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}