{"total":75,"items":[{"citing_arxiv_id":"2606.00241","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate","primary_cat":"cs.LG","submitted_at":"2026-05-29T18:16:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"InfoAtlas is a pretrained neural model for zero-shot mutual information estimation that matches state-of-the-art accuracy with 100x speedup and handles varying dimensions via a single model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30719","ref_index":11,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When are LLMs Sufficient Policy Optimizers for Sequential RL Tasks?","primary_cat":"cs.LG","submitted_at":"2026-05-29T01:24:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PromptPO shows LLMs can act as black-box policy optimizers for sequential RL when leveraging prior knowledge, matching baselines in exploration and robotics but underperforming in MuJoCo.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30569","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Any-ttach: Quick End-effector Swapping Enables Manipulation Dexterity with Simplicity","primary_cat":"cs.RO","submitted_at":"2026-05-28T21:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Any-ttach shows that rapid end-effector swapping combined with demonstration collection and task planning enables reliable multi-tool skills in long-horizon tasks such as sandwich 
making.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07568","ref_index":293,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Systematic Study of Behavioral Cloning for Scientific Data Annotation","primary_cat":"cs.HC","submitted_at":"2026-05-26T02:19:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces 9 synthetic annotation tasks and benchmarks for behavioral cloning, finding hierarchical skill learning, scaling benefits, effective multi-task pretraining, and shared internal representations of task phases and mistakes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19319","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SWEET: Sparse World Modeling with Image Editing for Embodied Task Execution","primary_cat":"cs.CV","submitted_at":"2026-05-19T03:54:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SWEET is a one-shot sparse visual planning framework that progressively generates manipulation keyframes via image editing conditioned on language and spatial guidance, then converts them to actions with a diffusion predictor, showing better fidelity and lower cost than video models on DROID and Rob","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18636","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SPIKE: An Adaptive Dual Controller Framework for Cost-Efficient Long-Horizon Game Agents","primary_cat":"cs.CV","submitted_at":"2026-05-18T16:43:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPIKE dual-controller 
framework raises success rates 5-9 points and cuts tokens 55% in StarDojo agents by reusing strategic plans across stable segments and escalating only at detected events.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18109","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17077","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning","primary_cat":"cs.RO","submitted_at":"2026-05-16T16:52:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeMiAn re-annotates robot and egocentric videos with VLM-generated dense labels across motion, scene, pose, and reasoning aspects, then uses a learned instructor to boost policy success by 5 points on RoboCasa over task-only baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16054","ref_index":188,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ada-Diffuser: Latent-Aware Adaptive Diffusion for 
Decision-Making","primary_cat":"cs.LG","submitted_at":"2026-05-15T15:21:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ada-Diffuser is a causal diffusion model that jointly learns observed interaction structure and underlying latent dynamics from minimal observations for adaptive planning and policy learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13119","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-13T07:40:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12755","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"State-Centric Decision Process","primary_cat":"cs.AI","submitted_at":"2026-05-12T21:09:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"action space, so the same predicate chain can be realized by different actions without replanning. 
World models and state abstraction. Several recent systems construct internal representations of the environment rather than raw observations, including temporal knowledge graphs [10], free-form state descriptions [47, 38], and feedback-triggered replanning [17]. These systems share the intuition that an agent benefits from an explicit model of what is true in the world. The gap is in what follows from that model. In each case the internal representation is consumed by the action-producing module, with no per-step certification. SDP's four-operator decomposition enforces this separation architecturally rather than relying on the language model to maintain it implicitly."},{"citing_arxiv_id":"2605.11951","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Reaction to Anticipation: Proactive Failure Recovery through Agentic Task Graph for Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-05-12T11:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AgentChord models manipulation tasks as directed graphs enriched with anticipatory recovery branches, using specialized agents to enable immediate, low-latency failure responses and improve success on long-horizon bimanual tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Each task consists of 20 trials, in which the manipulated objects are varied across different instances within the same category and their initial poses are randomly sampled. Fig. 3 shows the related objects in each task. Baseline Methods. We compare AgentChord against four strong baselines that are capable of detecting and recovering from execution failures. 1) Inner Monologue (IM) [25] employs VLMs to detect failures at the completion of each sub-goal using a VQA formulation. 
[Fig. 3 (panel labels: Single/Dual arm pour water; Setup coffee tray; Fold towel; Rearrange table; Handover block): We collected a variety of object instances for each task to verify the generalizability of different methods.] 2) DoReMi (DRM) [23] detects intermediate failures during execution by issuing frequent"},{"citing_arxiv_id":"2605.10834","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"10112, apr 2025. [11] Wenlong Huang, F. Xia, Ted Xiao, Harris Chan, Jacky Liang, Peter R. Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, P. Sermanet, Noah Brown, Tomas Jackson, Linda Luu, S. Levine, Karol Hausman, and Brian Ichter. Inner Monologue: Embodied Reasoning through Planning with Language Models. ArXiv, abs/2207.05608, jul 2022. [12] Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. Understanding the planning of LLM agents: A survey. ArXiv, abs/2402.02716, feb 2024. [13] Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI Agents That Matter. Trans. Mach. Learn. 
Res., 2025, jul 2024."},{"citing_arxiv_id":"2605.09410","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RePO-VLA: Recovery-Driven Policy Optimization for Vision-Language-Action Models","primary_cat":"cs.RO","submitted_at":"2026-05-10T08:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RePO-VLA raises average adversarial success rates in VLA manipulation from 20% to 75% by using recovery-aware initialization, a progress-aware semantic value function, and value-conditioned refinement on success and corrective trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09009","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Models for Sequential Decision-Making: Improving In-Context Learning via Supervised Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-09T15:49:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Supervised fine-tuning of pretrained LLMs on offline trajectories yields better few-shot sequential decision-making than in-context-only baselines, with a theoretical suboptimality bound derived for linear MDPs by interpreting attention as Q-function estimation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"theoretical guarantees for learning linear regression in-context. LLMs have also been applied to time series analysis (see, e.g., [15] for a detailed survey). LLMs as decision-making agents. Some studies have explored using LLMs directly for decision-making through prompting. [16] generates both reasoning traces and task-specific actions in an interleaved manner, and [17] uses language feedback for embodied planning. 
[ 18] trains a single generalist transformer across vision, language, and control tasks. These approaches either use pretrained models directly or train models from scratch. In contrast, we fine-tune pretrained LLMs for few-shot sequential decision-making. 2 In-context decision-making.Decision Transformer was among the first to cast RL as sequence"},{"citing_arxiv_id":"2605.08904","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07066","ref_index":9,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"2.5-D Decomposition for LLM-Based Spatial Construction","primary_cat":"cs.AI","submitted_at":"2026-05-08T00:17:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06078","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Milestone-Guided Policy Learning for Long-Horizon Language Agents","primary_cat":"cs.CL","submitted_at":"2026-05-07T12:00:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BEACON uses milestone partitioning, temporal reward shaping, and dual-scale advantage estimation to nearly double success rates on long-horizon 
ALFWorld tasks while raising effective sample use from 23.7% to 82%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05846","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LoopTrap: Termination Poisoning Attacks on LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-05-07T08:21:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"principal sub-mechanisms, yielding 10 strategies in total (Table 1). Category descriptions. Progress Manipulation (P1-P3) distorts the agent's assessment of task completion through moving targets (P1), never-ending sub-goals (P2), or fabricated near-completion signals (P3), exploiting the near-miss bias. Cognitive Bias Exploitation (P4-P6) leverages well-documented cognitive biases that LLMs exhibit [16], including authority deference (P4), sunk cost reasoning (P5), and social conformity (P6), to discourage termination. Task Structure Manipulation (P7-P8) alters the perceived task structure through infinite verification regression (P7) or circular dependency chains (P8), creating irresolvable execution paths. Reward Shaping (P9-P10) manipulates implicit reward signals through step-wise"},{"citing_arxiv_id":"2605.05340","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"How Far Are VLMs from Privacy Awareness in the Physical World? 
An Empirical Study","primary_cat":"cs.CR","submitted_at":"2026-05-06T18:10:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conflicting commands.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"efforts either operate in the textual domain or evaluate multimodal models on static, single-image inputs, and therefore cannot expose the perception-dependent, context-sensitive failures that arise in interactive physical deployment. LLMs as embodied agents.A separate research thread positions LLMs as the reasoning core of physically grounded agents. SayCan [7] and Inner Monologue [13] chain language-model planning with robotic affordances and multi-turn feedback; Code as Policies [18] compiles natural-language commands into executable robot control code; and ReAct [ 43] interleaves reasoning traces with environment actions. 
These approaches are evaluated in simulation platforms such as VirtualHome [32], Habitat [ 33], AI2-THOR [ 15], and task benchmarks like ALFRED [ 36] and TEACh [ 31]."},{"citing_arxiv_id":"2605.03855","ref_index":14,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Evaluating Generative Models as Interactive Emergent Representations of Human-Like Collaborative Behavior","primary_cat":"cs.RO","submitted_at":"2026-05-05T15:20:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Embodied LLM agents exhibit emergent collaborative behaviors indicating mental models of partners in a color-matching game, detected via LLM judges and supported by positive user feedback.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Index [11] - Eira tried to interact with an object: Target Martha is an entity, not an object. Try speaking to them instead. Ensure your target is a valid object. Index [12] - Eira spoke: Martha, I'm holding the iron axe but I don't know how to give it to you directly. Can you take it from me? I need to get back to finding more colored objects for the other boxes. Index [13] - Martha spoke: Can you come here to give me the axe? Index [14] - Eira spoke: Coming! I have the yellow flower now, and I'll bring you the axe. Let me switch to it and come over. Index [15] - Martha tried to pick up an object: Successfully picked up flower red 00 4 <Transcript excerpt truncated for brevity; indices 16--60 omitted> Index [61] - Eira spoke: Excellent! I got the first box to match... 
We now have 3 out of 4"},{"citing_arxiv_id":"2605.03034","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Stable Agentic Control: Tool-Mediated LLM Architecture for Autonomous Cyber Defense","primary_cat":"cs.AI","submitted_at":"2026-05-04T18:02:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"partial","one_line_summary":"Tool-mediated LLM agents with deterministic tools and a machine-checked Lyapunov certificate achieve stable control in cyber defense, reducing attacker game value by 59% on real attack graphs.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5), each swept over four temperatures (80 runs total). The cross-LLM contrast tests within-family scaling of the corollaries. None of the controllers have trained or fine-tuned components. 5.1 Datasets and shared setup Experiment 1. We construct 282 adversarial graphs from real enterprise penetration tests conducted by Horizon3.ai's NodeZero, a commercial autonomous pentesting platform [39]. From 7,372 operations in the data warehouse, 300 are selected via stratified sampling across three complexity bins (Appendix B); 18 are excluded during validation (14 where S < 0.01 before any deployment and 4 where no block-mode policy covers any edge). 
The remaining 282 graphs span 161 customer organizations across 25 industries (healthcare, manufacturing, finance, government)."},{"citing_arxiv_id":"2605.02815","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FlexSQL: Flexible Exploration and Execution Make Better Text-to-SQL Agents","primary_cat":"cs.CL","submitted_at":"2026-05-04T16:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FlexSQL reaches 65.4% on Spider2-Snow by allowing agents to flexibly explore schemas, generate diverse plans, choose SQL or Python execution, and apply two-tiered repair.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01477","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion","primary_cat":"cs.RO","submitted_at":"2026-05-02T14:52:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Action Agent pairs LLM-driven video generation with a flow-constrained diffusion transformer to produce velocity commands, raising video success to 86% and delivering 64.7% real-world navigation on a Unitree G1 humanoid.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23336","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA","primary_cat":"cs.IR","submitted_at":"2026-04-25T14:45:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with 
linear complexity on rationale-based retrieval.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21924","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Long-Horizon Manipulation via Trace-Conditioned VLA Planning","primary_cat":"cs.RO","submitted_at":"2026-04-23T17:59:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LoHo-Manip enables robust long-horizon robot manipulation by using a receding-horizon VLM manager to output progress-aware subtask sequences and 2D visual traces that condition a VLA executor for automatic replanning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21138","ref_index":90,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems","primary_cat":"cs.RO","submitted_at":"2026-04-22T22:58:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18791","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"HELM: Harness-Enhanced Long-horizon Memory for Vision-Language-Action Manipulation","primary_cat":"cs.LG","submitted_at":"2026-04-20T19:57:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HELM raises long-horizon VLA success from 58.4% to 81.5% on LIBERO-LONG by combining episodic memory retrieval, learned failure prediction, and replanning, 
outperforming context extension or adaptation alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18463","ref_index":40,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Using large language models for embodied planning introduces systematic safety risks","primary_cat":"cs.AI","submitted_at":"2026-04-20T16:18:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ural language [3], while ProgPrompt employed programmatic prompt structures with precondition checking to constrain outputs to valid actions [37]. More recent work has explored multimodal embodied language models that incorporate sensor data directly [38], vision-language-action models that output robot actions as text tokens [39], and closed-loop reasoning systems that incorporate environment feedback [40]. Additional approaches include hierarchical policies bridging high-level language to low-level motor execution [41], efficient action tokenization [42], foundation models for humanoid robots [43], large-scale multi-robot datasets [44], and on-device distillation of language models for robot planning with minimal human supervision [45]. 
Our benchmark evaluates raw LLM planning capabilities rather than hybrid systems integrating"},{"citing_arxiv_id":"2604.14902","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ADAPT: Benchmarking Commonsense Planning under Unspecified Affordance Constraints","primary_cat":"cs.AI","submitted_at":"2026-04-16T11:46:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ADAPT augments planners with affordance reasoning to raise task success in environments with unspecified and time-varying object affordances, and a LoRA-finetuned VLM backend beats GPT-4o on the new DynAfford benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10517","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"From Perception to Planning: Evolving Ego-Centric Task-Oriented Spatiotemporal Reasoning via Curriculum Learning","primary_cat":"cs.AI","submitted_at":"2026-04-12T08:14:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EgoTSR applies a three-stage curriculum on a 46-million-sample dataset to build egocentric spatiotemporal reasoning, reaching 92.4% accuracy on long-horizon tasks and reducing chronological biases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10096","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents","primary_cat":"cs.CV","submitted_at":"2026-04-11T08:33:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, 
and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-language goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08059","ref_index":13,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study","primary_cat":"cs.RO","submitted_at":"2026-04-09T10:18:51+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"We instantiate the capability framework using versioned ECMs, each corresponding to a reusable functional skill family: ECM-Grasp, ECM-Align, and ECM-Place. For each capability family𝑖, the system begins with an active baseline version𝑐(0) 𝑖 . Through simulated refinement rounds, we generate a sequence of candidate upgraded versions: 𝑐(0) 𝑖 →̂ 𝑐(1) 𝑖 →̂ 𝑐(2) 𝑖 →⋯(13) Candidate versions are created through parameter refinement, control-logic modification, or configuration-level adjustment. The exact update mechanism is not the primary variable; what matters is that each new version can X. 
Qin et al.: Preprint submitted to Elsevier Page 18 of 39 Governed Capability Evolution potentially change task performance as well as interface assumptions, runtime behavior, permission requirements, and"},{"citing_arxiv_id":"2604.08044","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators","primary_cat":"cs.AR","submitted_at":"2026-04-09T09:48:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07833","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution","primary_cat":"cs.RO","submitted_at":"2026-04-09T05:35:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05427","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Pre-Execution Safety Gate & Task Safety Contracts for LLM-Controlled Robot Systems","primary_cat":"cs.RO","submitted_at":"2026-04-07T04:46:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeGate adds a deterministic pre-execution gate and runtime contracts with Z3 SMT solving to block unsafe LLM commands for robots while passing safe 
ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16408","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Edge-Host-Cloud Architecture for Robot-Agnostic, Caregiver-in-the-Loop Personalized Cognitive Exercise: Multi-Site Deployment in Dementia Care","primary_cat":"cs.RO","submitted_at":"2026-04-01T00:41:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Speaking Memories is a robot-agnostic edge-host-cloud architecture for caregiver-in-the-loop personalized cognitive exercise in dementia care, achieving sub-6-second latency and positive stakeholder feedback in multi-site deployments.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the system level to prevent inappropriate or cognitively overwhelming responses. These constraints collectively motivate a distributed host-edge-cloud architecture. B. Domain Constraints for Safe Reminiscence Interaction The design of the Speaking Memories framework was guided by domain-specific constraints inherent to dementia-oriented human-robot interaction. Prior work [31], [32] in dementia care emphasizes the importance of emotionally supportive communication, reduced cognitive load, and structured conversational scaffolding to facilitate autobiographical recall [8], [33]. 
Accordingly, the system prioritizes (i) short, clear utterances; (ii) affirmation and emotional validation; and (iii) gradual narrative prompting grounded in personally"},{"citing_arxiv_id":"2604.19775","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Actions to Understanding: Conformal Interpretability of Temporal Concepts in LLM Agents","primary_cat":"cs.AI","submitted_at":"2026-03-27T22:29:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A conformal interpretability method labels LLM agent states step-by-step and extracts linearly separable temporal concept directions aligned with task success on ScienceWorld and AlfWorld.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.16947","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LightZeroNav: Zero-Shot Vision Language Navigation in Continuous Environments Based on Lightweight VLMs","primary_cat":"cs.CV","submitted_at":"2026-03-16T14:07:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LightZeroNav decomposes zero-shot VLN-CE into modules that reduce input redundancy, improve progress tracking from noisy memory, and separate action execution from stage transitions, allowing an 8B VLM to match GPT-4o performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.16331","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BrainMem: Brain-Inspired Evolving Memory for Embodied Agent Task Planning","primary_cat":"cs.RO","submitted_at":"2026-03-12T17:54:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BrainMem equips LLM-based 
embodied planners with working, episodic, and semantic memory that evolves interaction histories into retrievable knowledge graphs and guidelines, raising success rates on long-horizon 3D benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.08388","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Hierarchical Error-Corrective Graph Framework for Autonomous Agents with LLM-Based Action Generation","primary_cat":"cs.AI","submitted_at":"2026-03-09T13:46:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HECG combines multi-dimensional metrics for strategy choice, ten-type error classification with recoverability details, and causal-context graphs to improve LLM agent reliability in complex tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.00991","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tracking Capabilities for Safer Agents","primary_cat":"cs.AI","submitted_at":"2026-03-01T08:39:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AI agents can generate code in a capability-safe Scala dialect that statically prevents information leakage and malicious side effects while preserving task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.20867","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SoK: Agentic Skills -- Beyond Tool Use in LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-02-24T13:11:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper systematizes agentic skills 
beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.08392","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ST-BiBench: Benchmarking Multi-Stream Multimodal Coordination in Bimanual Embodied Tasks for MLLMs","primary_cat":"cs.RO","submitted_at":"2026-02-09T08:47:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ST-BiBench reveals a coordination paradox in which MLLMs show strong high-level strategic reasoning yet fail at fine-grained 16-dimensional bimanual action synthesis and multi-stream fusion.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Photodoodle: Learning artistic image editing from few-shot pairwise data. arXiv preprint arXiv:2502.14397, 2025. S1 [51] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents. In International conference on machine learning, pages 9118-9147. PMLR, 2022. 1, 3 [52] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022. [53] Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. 
VoxPoser: Composable 3d value"},{"citing_arxiv_id":"2601.21841","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Embodied Task Planning via Graph-Informed Action Generation with Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-01-29T15:18:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GiG uses a Graph-in-Graph architecture with GNN-encoded states, experience memory retrieval, and bounded symbolic lookahead to improve LLM planning on embodied benchmarks with gains up to 37%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.12538","ref_index":137,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Reasoning for Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-01-18T18:58:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The survey structures agentic reasoning for LLMs into foundational, self-evolving, and collective multi-agent layers while distinguishing in-context orchestration from post-training optimization and reviewing applications across domains.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"plans recursively break problems into compilable/editable units, while structured pipelines embed hierarchical RL or MCTS within the tree to choose promising edits and verification paths [76, 22, 130, 131, 132]. In robotics, behavior trees and high-level goal decomposition translate language instructions into subgoal sequences executed by low-level controllers and skills [133, 134, 135, 136, 137]. 
Taken together, hierarchical tree-search couples plan synthesis (node expansion, heuristic/evidence-based selection) with plan realization (leaf grounding and feedback), yielding interpretable, long-horizon agents that can backtrack, refine, and verify before committing to irreversible actions, while remaining flexible"},{"citing_arxiv_id":"2601.07060","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PALM: Progress-Aware Policy Learning via Affordance Reasoning for Long-Horizon Robotic Manipulation","primary_cat":"cs.RO","submitted_at":"2026-01-11T21:00:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALM improves long-horizon robotic manipulation success by distilling affordance representations for object interaction and predicting within-subtask progress in a VLA model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"spatial constraints of parts with foundation models. In International Conference on Intelligent Robots and Systems (IROS), pages 9488-9495. IEEE, 2024. 2 [40] Siyuan Huang, Haonan Chang, Yuhan Liu, Yimeng Zhu, Hao Dong, Peng Gao, Abdeslam Boularias, and Hongsheng Li. A3vlm: Actionable articulation-aware vision language model. arXiv preprint arXiv:2406.07549, 2024. 3 [41] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. arXiv preprint arXiv:2207.05608, 2022. 2 [42] Wenlong Huang, Chen Wang, Yunzhu Li, Ruohan Zhang, and Li Fei-Fei. 
Rekep: Spatio-temporal reasoning of rela-"},{"citing_arxiv_id":"2506.04565","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems","primary_cat":"cs.MA","submitted_at":"2025-06-05T02:34:43+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The Generator in RAG systems is essentially an LLM. It can be an original pre-trained language model, such as T5 [136], FLAN [185] and LLaMA [174], or a black-box pre-trained language model, such as GPT-3 [14], GPT-4 [2], Gemini [169], Claude [24]. Alternatively, the generator can also be a fine-tuned language model specifically tailored for a particular task. For instance, BART [79] and T5 [63] are fine-tuned alongside the retriever, a process commonly referred to as co-training or dual fine-tuning, to enhance the quality and consistency of retrieval [93]. In other scenarios, the generator is fine-tuned to effectively filter retrieved results, retaining only relevant documents and discarding irrelevant ones [106, 206, 217]. 
Furthermore, the Generator can be trained and fine-tuned within a reinforcement learning framework"},{"citing_arxiv_id":"2506.03610","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games","primary_cat":"cs.AI","submitted_at":"2025-06-04T06:40:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.19645","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success","primary_cat":"cs.RO","submitted_at":"2025-02-27T00:30:29+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OpenVLA-OFT fine-tuning boosts LIBERO success rate from 76.5% to 97.1%, speeds action generation 26x, and outperforms baselines on real bimanual dexterous tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Zhu, Baoxiong Jia, and Siyuan Huang. An embodied generalist agent in 3d world. In Proceedings of the International Conference on Machine Learning (ICML), 2024. [16] Wenlong Huang, Pieter Abbeel, Deepak Pathak, and Igor Mordatch. Language models as zero-shot planners: Extracting actionable knowledge for embodied agents, 2022. URL https://arxiv.org/abs/2201.07207. [17] Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, Pierre Sermanet, Noah Brown, Tomas Jackson, Linda Luu, Sergey Levine, Karol Hausman, and Brian Ichter. 
Inner monologue: Embodied reasoning through planning with language models, 2022. URL https://arxiv.org/abs/2207."}],"limit":50,"offset":0}