{"total":66,"items":[{"citing_arxiv_id":"2605.22721","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Evolving Multi-Agent Systems via Decentralized Memory","primary_cat":"cs.MA","submitted_at":"2026-05-21T16:55:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DecentMem is a decentralized dual-pool memory framework for self-evolving multi-agent systems that provides O(log T) regret guarantees and yields up to 23.8% accuracy gains over centralized baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22166","ref_index":48,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Adapting the Interface, Not the Model: Runtime Harness Adaptation for Deterministic LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-21T08:36:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Life-Harness evolves reusable runtime interventions from training failures to improve frozen LLM agents by 88.5% on average across 126 settings in seven deterministic environments while transferring across 18 model backbones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21984","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Echo: Learning from Experience Data via User-Driven Refinement","primary_cat":"cs.AI","submitted_at":"2026-05-21T04:34:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Echo is a framework that harvests user-driven refinements of agent proposals as training signals to align models with real-world needs, demonstrated by raising code completion acceptance from 25.7% to 35.7% in 
production.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21463","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mem-$\\pi$: Adaptive Memory through Learning When and What to Generate","primary_cat":"cs.CL","submitted_at":"2026-05-20T17:51:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20477","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Training Language Agents to Learn from Experience","primary_cat":"cs.LG","submitted_at":"2026-05-19T20:41:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces the ICT framework and an RL pipeline to train language agent reflectors that distill experience into reusable prompts, outperforming baselines on held-out tasks in ALFWorld and MiniHack.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20061","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Rewarding Beliefs, Not Actions: Consistency-Guided Credit Assignment for Long-Horizon Agents","primary_cat":"cs.CL","submitted_at":"2026-05-19T16:19:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ReBel uses belief-consistency supervision and belief-aware grouping to improve credit assignment in long-horizon RL for LLM agents, achieving up to 20.4 percentage points 
higher success and 2.1x better sample efficiency than GRPO on ALFWorld and WebShop.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19099","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-18T20:37:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18141","ref_index":41,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","primary_cat":"cs.HC","submitted_at":"2026-05-18T09:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18109","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaskGround introduces a Ground-Infer-Execute framework for full-scene household 
reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17829","ref_index":51,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Interactive Evaluation Requires a Design Science","primary_cat":"cs.AI","submitted_at":"2026-05-18T04:03:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axis taxonomy and design principles.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16604","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"R2V Agent: Teaching SLMs When to Ask for Help","primary_cat":"cs.LG","submitted_at":"2026-05-15T20:10:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"R2V-Agent combines an SLM policy trained via BC and DPO with a step-level risk-calibrated router using Brier scores and CVaR to escalate to LLM only on high residual failure risk, improving success-cost tradeoffs on HumanEval+, TextWorld, and TerminalBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14504","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task 
Execution","primary_cat":"cs.AI","submitted_at":"2026-05-14T07:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LongAct benchmark evaluates long-horizon household task execution from free-form instructions; HoloMind agent raises performance but top VLMs still reach only 59% goal completion and 16% full-task success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13716","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SkillOps: Managing LLM Agent Skill Libraries as Self-Maintaining Software Ecosystems","primary_cat":"cs.SE","submitted_at":"2026-05-13T16:02:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SkillOps maintains LLM skill libraries via Skill Contracts and ecosystem graphs, raising ALFWorld task success to 79.5% as a standalone agent and improving retrieval baselines by up to 2.9 points with near-zero library-time LLM cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13037","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MAP: A Map-then-Act Paradigm for Long-Horizon Interactive Agent Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-13T05:46:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAP improves LLM agent reasoning by constructing a structured cognitive map of the environment before task execution, yielding performance gains on benchmarks like ARC-AGI-3 and superior training data via the new MAP-2K 
dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12058","ref_index":13,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Holder Policy Optimisation","primary_cat":"cs.LG","submitted_at":"2026-05-12T12:45:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HölderPO unifies token-level aggregation in GRPO via the Hölder mean with a tunable p parameter and annealing schedule, delivering 54.9% average accuracy on math benchmarks and 93.8% success on ALFWorld.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Here $\\hat{A}_i$ is the advantage estimator and $\\epsilon$ is the clipping threshold. The reason we choose sequence-level clipping is to control gradient variance (see Appendix D and I.2). Specifically, $p = 1$ recovers GRPO (Appendix G.2), while $p = 0$ recovers GSPO (Appendix G.3). To analyse how $p$ shapes the optimisation, we study $\\nabla_\\theta \\rho_{i,p}(\\theta)$, which governs the direction of the policy gradients (see Eq. (9), (13), (16)). A direct calculation (Appendix G.1) yields $$\\nabla_\\theta \\rho_{i,p}(\\theta) = \\rho_{i,p}(\\theta) \\sum_{t=1}^{|y_i|} W_{i,t}(p) \\cdot \\nabla_\\theta \\log \\pi_\\theta(y_{i,t} \\mid x, y_{i,<t}), \\qquad W_{i,t}(p) := \\frac{r_{i,t}(\\theta)^p}{\\sum_{k=1}^{|y_i|} r_{i,k}(\\theta)^p}, \\quad (3)$$ where the per-token gradient weights $W_{i,t}(p)$ form a probability distribution denoted by $W_i^p$. 
Crucially, varying p does not alter the per-token log-gradient directions; instead, it solely reweights the"},{"citing_arxiv_id":"2605.11814","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare","primary_cat":"cs.AI","submitted_at":"2026-05-12T09:06:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for personalized healthcare.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AMA-Bench [42], evaluate conversational memory, memory operations, or agent trajectories under general-domain settings. Related benchmarks including PersonaMem [15], MemBench [30], MemoryArena [12], Memora [31], and RealTalk [16] further study dynamic profiling, continual memory, and personalized agents. By contrast, long-context or interactive environments such as RULER [13], LongBench [2], WebArena [44], and ALFWorld [28] focus on static context processing or non-medical task environments. Overall, existing benchmarks do not target personalized medical dialogue, where clinical specificity, heterogeneous memory priority, and streaming evaluation are central. 
3 Preliminary Study In our real personalized healthcare deployments, patient interactions are often complex and diverse, giving rise to"},{"citing_arxiv_id":"2605.10663","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:43:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"On the utilization side, the solver is trained under a realistic distribution of skill quality: it encounters both informative and noisy skills during the same rollout, which mirrors the conditions it faces at test time and builds robustness accordingly. 4 Experiments 4.1 Experimental Setup We instantiate all methods with Qwen2.5-7B-Instruct [32] as the base model and evaluate them on two benchmarks with explicit task-level splits: ALFWorld [23] and Mind2Web [7]. We primarily compare our approach against two categories of baselines: prompt-based experience-driven self-evolution methods ExpeL [38], Memento [40] and ReasoningBank [16] and the RL-based method, GRPO [19] and SkillRL [30]. 
However, since SkillRL relies on a certain amount of cold-start data, we are unable to evaluate it on Mind2Web."},{"citing_arxiv_id":"2605.10325","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Verifiable Process Rewards for Agentic Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-11T10:30:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Verifiable Process Rewards (VPR) converts symbolic oracles into dense turn-level supervision for reinforcement learning in agentic reasoning, outperforming outcome-only rewards and transferring to general benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"86 62.16 3.3 Out-of-Domain Generalization We evaluate whether the reasoning skills learned from verifiable game tasks are generalizable to tasks outside the training distribution. We consider 7 general reasoning benchmarks including GSM8K, MATH-500, AIME24/25, GPQA-Diamond, BBH, and MMLU-Pro (Table 2) and 2 agentic reasoning tasks including ALFWorld [24] and WebShop [34] (Table 3) and report the standard pass@1 measured over multiple evaluation runs; no further fine-tuning is performed. General Reasoning Benchmarks. Every VPR-trained model improves the average score over the base across all 7 benchmarks, with Minesweeper-trained VPR yielding the highest average. 
The improvements are most visible on harder benchmarks (AIME24/25, GPQA-Diamond) and small or"},{"citing_arxiv_id":"2605.09487","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Kintsugi: Learning Policies by Repairing Executable Knowledge Bases","primary_cat":"cs.LG","submitted_at":"2026-05-10T11:51:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Kintsugi learns policies by repairing composable executable knowledge bases through agentic diagnosis, localized typed edits, and deterministic verification gates that admit only improvements.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"This design tests artifact-centered policy improvement rather than claiming that symbolic policies replace neural robotics. We evaluate whether capability is acquired through KB evolution, whether it resides in the executable artifact, whether behavior can be locally edited, and which KB components are load-bearing. On long-horizon text-agent tasks, Kintsugi evaluates endpoint capability on ALFWorld [24], WebShop [32], and TextCraft [21]. On ALFWorld, we further isolate the artifact claim through a cold-start study, fixed-KB access controls, local editability tests, and component ablations. 
For object-centric manipulation, MetaWorld [34] and Predicators [26, 25] evaluate KB-driven symbolic policies under compatible state and skill interfaces, while RoboSuite is used as a"},{"citing_arxiv_id":"2605.09423","ref_index":72,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning","primary_cat":"cs.AI","submitted_at":"2026-05-10T08:51:50+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"fixed-environment training only in that the scene distribution drifts between rounds; the agent's update rule is unchanged. SIMWORLDSTUDIO reuses the Gym interface of §2.1 without modification, so an RL policy (e.g., PPO [64]) updates via standard policy gradients on the reward returned by step(), while an LLM-based policy updates through in-context mechanisms such as incremental rule accumulation or reflection-style memory [72, 66]. 
Coding Agent Evolving. SIMCODER's update is in-context: between rounds the embodied agent's performance is fed back as context for the next generation episode, and SIMCODER reweights its skill retrievals and tool invocations to raise difficulty where success rates plateau, lower it where the agent stalls, and oversample structural features the agent has not yet mastered."},{"citing_arxiv_id":"2605.09330","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory","primary_cat":"cs.LG","submitted_at":"2026-05-10T05:04:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"14: else if no recorded mediating path connects X and Y in ˆG then 15: S spur ← S spur ∪ {(X, Y, T 2)} ▷ unmeasured confounding 16: else if ∃ C ⋆ ∈ V with X, X 0 → C ⋆ and X 0 → Y in ˆG then 17: S spur ← S spur ∪ {(X, Y, T 3)} ▷ collider bias 18: end if 19: end for 20: // Validation: confirm each candidate via conditional independence tests (App. A) 21: return S spur Example: Explicit Confounding in ALFWorld [38] (More examples in App. B.1) An agent is solving household tasks. The task category C (e.g., \"heat-and-place\") confounds memory and action: it causes past trajectories mentioning \"microwave\" to be retrieved into context (X), and it independently dictates the correct action \"go to microwave\" (Y). 
The agent then learns a shortcut whenever \"microwave\" appears in retrieved memory, output \"go to microwave,\" even"},{"citing_arxiv_id":"2605.09278","ref_index":57,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EquiMem: Calibrating Shared Memory in Multi-Agent Debate via Game-Theoretic Equilibrium","primary_cat":"cs.AI","submitted_at":"2026-05-10T03:04:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EquiMem calibrates shared memory in multi-agent debate by computing a game-theoretic equilibrium from agent queries and paths, outperforming heuristics and LLM validators across benchmarks while remaining robust to adversarial agents.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"and across diverse benchmarks, MAD frameworks, and memory architectures, it consistently outperforms existing safeguards, remains robust under adversarial agents, and incurs negligible inference overhead. 1 Introduction Multi-agent debate (MAD) systems built on large language models (LLMs) have shown strong performance on complex reasoning [17, 35, 42, 70], embodied action [57, 71], and planning [24, 33, 38] tasks, where agents iteratively discuss, critique, and refine each other's outputs [9, 36, 43]. To support interactions beyond a single round, recent MAD systems add a shared memory that persists intermediate reasoning, past actions, and episodic trajectories across rounds [3, 69, 89, 94]. 
While shared memory boosts long-horizon reasoning, it also opens a critical vulnerability: a corrupted"},{"citing_arxiv_id":"2605.08904","ref_index":140,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07594","ref_index":46,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MemCompiler: Compile, Don't Inject -- State-Conditioned Memory for Embodied Agents","primary_cat":"cs.RO","submitted_at":"2026-05-08T11:07:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MemCompiler reframes memory use as state-conditioned compilation, delivering relevant guidance via text and latent channels to improve embodied agent performance up to 129% and cut latency 60% versus static injection.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"std(𝑟 1:𝐺 ) is the group-normalized advantage. This stage enables the Memory Compiler to explore compilation strategies beyond the SFT distribution and exploit those that maximize task success. 4. Experiment 4.1. Experimental Setup Benchmarks. We evaluate on three benchmarks spanning household manipulation, multimodal embodied planning, and scientific reasoning. 
AlfWorld [46] provides 134 unseen household tasks across six categories (pick-and-place, examine-in-light, clean, heat, cool, pick-two-and-place) in 120 rooms, evaluated by task success rate. EmbodiedBench [47] provides 1,128 tasks across two environments (EB-ALFRED, EB-Habitat), reporting success rate along six fine-grained capability dimensions: base ability"},{"citing_arxiv_id":"2605.06595","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cross-Modal Navigation with Multi-Agent Reinforcement Learning","primary_cat":"cs.RO","submitted_at":"2026-05-07T17:20:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CRONA is a MARL framework that uses modality-specialized agents with auxiliary beliefs and a centralized multi-modal critic to achieve better performance and efficiency than single-agent baselines on visual-acoustic navigation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06534","ref_index":63,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ROSE: Rollout On Serving GPUs via Cooperative Elasticity for Agentic RL","primary_cat":"cs.DC","submitted_at":"2026-05-07T16:33:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROSE is a system for cooperative elasticity that co-locates serving and rollout models on shared GPUs, delivering 1.3-3.3x higher end-to-end throughput than fixed-resource baselines while preserving serving SLOs.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"fer engine that exploits both shard-awareness and sparsity-awareness for efficient RL weight propagation across datacenters. 
• We design an elastic rollout scheduler with turn-wise routing and cache-affinity that adapts to time-varying serving load without global synchronization. 2 ROSE: Rollouts on Serving GPUs Conference'17, July 2017, Washington, DC, USA • We implement ROSE atop ROLL [64] and evaluate it with Qwen3-8B/32B [75] on agentic tasks [9, 58], using 16-48 training GPUs and 16-64 serving GPUs (H800). Compared with resource-fixed baselines ROLL [64] and AReaL [13], ROSE improves average throughput by 1.20-3.31× and 1.44-2.69×, respectively. Compared with resource-elastic baselines RLBoost [70] and 𝜆RL [42, 62], ROSE outperforms by 1."},{"citing_arxiv_id":"2605.05583","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Belief Memory: Agent Memory Under Partial Observability","primary_cat":"cs.AI","submitted_at":"2026-05-07T02:03:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BeliefMem is a probabilistic memory architecture for LLM agents that retains multiple candidate conclusions with probabilities updated by Noisy-OR, achieving superior average performance over deterministic baselines on LoCoMo and ALFWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05413","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From History to State: Constant-Context Skill Learning for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-05-06T20:13:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Constant-context skill learning trains reusable task-family modules for LLM agents using a deterministic state block for progress tracking and subgoal rewards, achieving 89.6% unseen success on ALFWorld, 76.8% on WebShop, and 66.4% on SciWorld with Qwen3-8B while reducing prompt tokens 
2-7x.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02178","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-04T03:15:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00380","ref_index":43,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-01T03:57:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"For each positive token h∈ P , compute x= LN(h)−µ + as in (41). Stack these centered positive vectors as rows to form X + ∈R |P|×d , X + m: =x ⊤ m,(42) wherex m ∈R d is them-th centered positive representation. 
Step 3: PCA objective and equivalence to truncated SVD.Define the (uncentered) empirical covariance of the centered positives C := 1 |P| (X +)⊤X + ∈R d×d.(43) A standard characterization of PCA is that the top-kprincipal subspace solves max V∈R d×k tr(V ⊤CV)s.t.V ⊤V=I k,(44) i.e., it maximizes the variance captured by projecting onto span(V) . The optimizer Vk is given by the top-k eigenvectors of C. To connect this to the truncated SVD used in Definition 1, take an SVD ofX +: X + =UΣV ⊤,(45) where U∈R |P|×r , V∈R d×r have orthonormal columns, Σ∈R r×r is diagonal with singular values, and r= rank(X +)."},{"citing_arxiv_id":"2605.00347","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-01T02:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Odysseus adapts PPO with a turn-level critic and leverages pretrained VLM action priors to train agents achieving at least 3x average game progress over frontier models in long-horizon Super Mario Land.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24005","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"TCOD: Exploring Temporal Curriculum in On-Policy Distillation for Multi-turn Autonomous Agents","primary_cat":"cs.LG","submitted_at":"2026-04-27T03:38:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TCOD stabilizes on-policy distillation for multi-turn agents via temporal curriculum on trajectory depth, improving performance up to 18 points over vanilla OPD and sometimes surpassing the 
teacher.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23194","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-04-25T07:54:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21232","ref_index":19,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ReCAPA: Hierarchical Predictive Correction to Mitigate Cascading Failures","primary_cat":"cs.AI","submitted_at":"2026-04-23T02:57:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReCAPA adds predictive correction and multi-level semantic alignment to VLA models, plus two new metrics for tracking error spread and recovery, yielding competitive benchmark results over LLM baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19485","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training","primary_cat":"cs.LG","submitted_at":"2026-04-21T14:07:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EVPO adaptively switches between critic-based and batch-mean advantage estimation using batch-level explained variance to provably achieve no greater variance than the better of PPO or 
GRPO at every step.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Deepseekmath: Pushing the limits of mathematical reasoning in open language models, April 2024. URL http://arxiv.org/abs/2402.03300. [26] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning, March 2021. URL http://arxiv.org/abs/2010.03768. [27] Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, A. J. Ostrow, Akhila Ananthram, Akshay Nathan, Alan Luo, Alec Helyar, Aleksander Madry, Aleksandr Efremov, Aleksandra Spyra, Alex Baker-Whitcomb, Alex Beutel, Alex Karpenko, Alex Makelov, Alex Neitz, Alex Wei, Alexandra Barr, Alexandre Kirchmeyer, Alexey Ivanov, Alexi Christakis, Alistair Gillespie, Allison Tam, Ally"},{"citing_arxiv_id":"2604.19839","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Environmental Understanding Vision-Language Model for Embodied Agent","primary_cat":"cs.CV","submitted_at":"2026-04-21T09:11:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"EUEA fine-tunes VLMs on object perception, task planning, action understanding and goal recognition, with recovery and GRPO, to raise ALFRED success rates by 11.89% over behavior cloning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18401","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2026-04-20T15:22:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"StepPO argues that LLM agents should 
optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"General agents built on Large Language Models (LLMs) [1, 6] have already given rise to phenomenal applications [8] such as OpenClaw and Claude Code [5, 19]. As these agent systems strive for bolder goals, they place increasing demands on the underlying models to support agentic capabilities such as decision making, planning, tool use [23, 37], environment interaction [29, 36], and long-horizon task execution [8, 16, 31, 38]. 1 arXiv:2604.18401v1 [cs.CL] 20 Apr 2026 Table 1: Representative methods differ in how they place the MDP formulation and credit-assignment units, revealing a granularity mismatch in Agentic RL. StepPO reduces this mismatch by aligning both with the interaction step. Method MDP formulation granularity Credit assignment granularity"},{"citing_arxiv_id":"2604.17244","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DORA Explorer: Improving the Exploration Ability of LLMs Without Training","primary_cat":"cs.CL","submitted_at":"2026-04-19T04:07:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DORA Explorer boosts LLM agent exploration without training by ranking diverse actions using log-probabilities and a tunable parameter, yielding UCB-competitive results on multi-armed bandits and gains on text adventure environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16682","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference 
Serving","primary_cat":"cs.DC","submitted_at":"2026-04-17T20:39:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"3716025 [55] Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. In Proceedings of the 2025 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE Computer Society, 1348-1362. doi:10.1109/HPCA61900.2025.00102 [56] Noppanat Wadlom, Junyi Shen, and Yao Lu. 2026. Efficient LLM Serving for Agentic Workflows: A Data Systems Perspective. arXiv preprint arXiv:2603.16104 (2026). 
doi:10.48550/arXiv.2603.16104 [57] Zibo Wang, Yijia Zhang, Fuchun Wei, Bingqiang Wang, Yanlin Liu, Zhiheng Hu, Jingyi Zhang, Xiaoxin Xu, Jian He, Xiaoliang Wang, Wanchun Dou, Guihai Chen, and Chen Tian."},{"citing_arxiv_id":"2604.10517","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning","primary_cat":"cs.AI","submitted_at":"2026-04-12T08:14:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08509","ref_index":89,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Visually-grounded Humanoid Agents","primary_cat":"cs.CV","submitted_at":"2026-04-09T17:50:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A coupled world-agent framework uses 3D Gaussian reconstruction and first-person RGB-D perception with iterative planning to enable goal-directed, collision-avoiding humanoid behavior in novel reconstructed scenes.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"With large language models (LLMs) [31] and vision-language models (VLMs) [75], it has become straightforward to utilize these methods to simulate aspects of digital agents such as high-level reasoning [109] and dialogue [63] in complex scenes. 
However, such systems largely remain disembodied: they are typically constrained to symbolic reasoning [89] or scripted scenarios [67] and often lack visual grounding, real-world perception-action coupling, and context-aware adaptability. Some efforts integrate VLMs with visual inputs for the motion planning of agents [11], but relying solely on VLM makes it challenging to operate effectively in complex environments. Due to these limitations, no prior work has been able to span the spectrum from semantic"},{"citing_arxiv_id":"2604.08232","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HiRO-Nav: Hybrid ReasOning Enables Efficient Embodied Navigation","primary_cat":"cs.AI","submitted_at":"2026-04-09T13:22:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiRO-Nav adaptively triggers reasoning only on high-entropy actions via a hybrid training pipeline and shows better success-token trade-offs than always-reason or never-reason baselines on the CHORES-S benchmark.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"and use a finetuned Mask R-CNN [16] model for instance segmentation2. As shown in Table 2, our agent experiences only a slight performance drop due to the reduced quality of the estimated ASMs, while still maintaining strong navigation capabilities. Moreover, our agent continues to outperform the no-thinking and dense-thinking base- 2 We follow the Mask R-CNN fine-tuning scripts in ALFWord [28] official repo. Figure 8. Pass@k curves of HiRO-Nav with hybrid reasoning. GT and DM refer to the ground truth ASMs and the ASMs estimated by deep models as in Tab. 2. We evaluation for 16 times with temperature=0.2. 
The navigation ability upper bound of HiRO-Nav outperforms task-specific sota method Poliformer [43], even"},{"citing_arxiv_id":"2604.07752","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"MIMIC-Py: An Extensible Tool for Personality-Driven Automated Game Testing with Large Language Models","primary_cat":"cs.SE","submitted_at":"2026-04-09T03:16:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MIMIC-Py provides a modular Python framework that turns personality-driven LLM agents into an extensible system for automated game testing via configurable traits, decoupled components, and multiple interaction methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.05044","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents","primary_cat":"cs.AI","submitted_at":"2026-03-05T10:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03784","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Specification-Driven Generation and Evaluation of Discrete-Event World Models via the DEVS Formalism","primary_cat":"cs.AI","submitted_at":"2026-03-04T06:50:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A staged LLM pipeline synthesizes verifiable 
discrete-event world models from natural language specifications using the DEVS formalism for long-horizon consistency in LLM agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00977","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HiMAC: Hierarchical Macro-Micro Learning for Long-Horizon LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-03-01T08:09:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiMAC decomposes LLM agent tasks into macro planning and micro execution using critic-free hierarchical RL and iterative co-evolution, outperforming baselines on ALFWorld, WebShop, and Sokoban.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.18600","ref_index":62,"ref_count":4,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MapTab: Are MLLMs Ready for Multi-Criteria Route Planning in Heterogeneous Graphs?","primary_cat":"cs.LG","submitted_at":"2026-02-20T20:22:18+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MapTab is a new multimodal benchmark with 328 images and nearly 200k queries that shows current MLLMs have substantial difficulty with multi-criteria route planning when visual and tabular information must be combined.","context_count":2,"top_context_role":"background","top_context_polarity":"background","context_text":"foundation models, and large language models to intelligent transportation systems. In 2023 International Conference on Computer and Applications (ICCA), pages 1-7. IEEE, 2023. [61] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning. arXiv preprint arXiv:2010.03768, 2020. 
[62] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In European conference on computer vision, pages 256-274. Springer, 2024. [63] Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang,"},{"citing_arxiv_id":"2602.09514","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies","primary_cat":"cs.CL","submitted_at":"2026-02-10T08:12:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EcoGym is a new open benchmark with three economic environments that reveals no leading LLM dominates at sustained plan-and-execute decision making across scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08392","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"manding rigorous spatio-temporal synchronization and the precise management of dual concurrent action streams. 
It requires the MLLM not only to recognize objects but also to understand the intricate inter-dependencies between two arms, such as preventing self-collisions and navigating overlapping kinematic workspaces. However, while existing embodied benchmarks (e.g., ALFWorld [97]) have laid a solid foundation for assessing sequential reasoning within single-arm paradigms, a significant gap persists when addressing the evolving requirements of bimanual coordination. These dual-arm specific challenges, such as simultaneous multi-stream reasoning, dynamic role assignment, and mutual kinematic constraints, naturally fall outside the intended scope of tra-"},{"citing_arxiv_id":"2602.05353","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction","primary_cat":"cs.AI","submitted_at":"2026-02-05T06:24:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}