{"total":14,"items":[{"citing_arxiv_id":"2606.29654","ref_index":82,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Budgeted Act-or-Defer Multi-Agent LLM Deliberation with Local Reliability Bounds","primary_cat":"cs.AI","submitted_at":"2026-06-28T23:46:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A kNN lower-confidence-bound approach for act-or-defer decisions in multi-agent LLM debates respects user-declared wrong-action budgets while achieving high automation rates on benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.21399","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Calibration Is Not Control: Why LLM-Agent Oversight Needs Intervention","primary_cat":"cs.AI","submitted_at":"2026-06-19T13:08:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Action-conditioned estimation of intervention advantage via prefix branching reduces control regret over calibrated scalar risk scores in LLM agent oversight across benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18189","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Failure Recovery: An Engagement-Aware Human-in-the-loop Framework for Robotic Systems","primary_cat":"cs.RO","submitted_at":"2026-06-16T17:21:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"E-MPC is a model predictive control framework that uses a user interaction dynamics model to balance autonomy and engagement under workload constraints in robotic caregiving, evaluated via simulation and a user study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12587","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Strategic Decision Support for AI Agents","primary_cat":"cs.AI","submitted_at":"2026-06-10T18:34:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces an optimization framework for AI agents to strategically seek support, proving a threshold policy on support value and providing an online algorithm to control missed-support error without distributional assumptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20662","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Confidence Laundering in Agent Systems: Why Uncertainty Needs a Latent Carrier","primary_cat":"cs.AI","submitted_at":"2026-06-09T23:13:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Agent systems lose uncertainty at decision handoffs, causing downstream over-trust; the paper proposes latent uncertainty as a carrier to preserve pre-commitment fragility across interfaces.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18747","ref_index":111,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Code as Agent Harness","primary_cat":"cs.CL","submitted_at":"2026-05-18T17:59:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Method Mechanism Action Paradigm Key Innovation AutoHarness [109] Harness Gen. Action validation Synthesizes code harnesses that mediate model actions and filter invalid environment interactions SayCan [9] Skill Selec. Affordance-based Links LLM plans to physical feasibility KnowNo [110] Skill Selec. Conformal prediction Calibrates planner uncertainty for ambiguous instructions SkillVLA [111] Skill Selec. Bimanual grounding Extends grounding to combinatorial skill reuse BOSS [112] Skill Selec. Skill bootstrapping Synthesizes new executable skill chains via guided practice LLM-Guided Traj. [113] Skill Selec. Trajectory generation Generates diverse manipulation trajectories and executable success conditions LRLL [114] Skill Selec. Lifelong grounding Evolving skill interface via memory and self-exploration"},{"citing_arxiv_id":"2605.18109","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-18T09:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TaskGround introduces a Ground-Infer-Execute framework for full-scene household reasoning that improves success rates on the FullHome benchmark and enables compact models to match larger ones at up to 18x lower token cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14358","ref_index":61,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Uncovering the Representation Geometry of Minimal Cores in Overcomplete Reasoning Traces","primary_cat":"cs.AI","submitted_at":"2026-05-14T04:35:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Language models produce overcomplete reasoning traces where on average 46% of steps can be removed while preserving the answer in 86% of cases, with necessity concentrated in the top three steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06116","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:26:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27914","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Geometry-Calibrated Conformal Abstention for Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-30T14:20:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Geometry-calibrated conformal abstention lets language models abstain from uncertain queries with finite-sample guarantees on both participation rate and conditional correctness of answers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.04129","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KGLAMP: Knowledge Graph-guided Language model for Adaptive Multi-robot Planning and Replanning","primary_cat":"cs.RO","submitted_at":"2026-02-04T01:46:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KGLAMP uses a dynamically updated knowledge graph to guide LLMs in creating and replanning PDDL specifications for heterogeneous multi-robot teams, reporting at least 25.3% better performance than LLM-only or classical PDDL baselines on the MAT-THOR benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.05991","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"3D Instruction Ambiguity Detection","primary_cat":"cs.AI","submitted_at":"2026-01-09T18:17:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Defines 3D Instruction Ambiguity Detection as a new task, releases the Ambi3D benchmark, shows state-of-the-art 3D LLMs struggle with it, and proposes the AmbiVer framework that gathers multi-view visual evidence to guide VLMs in judging ambiguity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.11954","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VizCopilot: Fostering Appropriate Reliance on Enterprise Chatbots with Context Visualization","primary_cat":"cs.HC","submitted_at":"2025-10-13T21:28:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VizCopilot integrates topic modeling with document visualization to support user oversight of retrieved context in enterprise chatbots, enabling detection of misalignments and adaptation of prompting strategies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2304.11477","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM+P: Empowering Large Language Models with Optimal Planning Proficiency","primary_cat":"cs.AI","submitted_at":"2023-04-22T20:34:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM+P lets LLMs solve planning problems optimally by converting them to PDDL for classical planners and back to natural language.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"veloped in recent years, such as Bert [27], CodeX [28], Opt [29], GPT-3 [10], ChatGPT [30], GPT-4 [2], Llama [31], Llama2 [32], and PaLM [33]. As LLMs are pretrained with a tremendous amount of offline text data, they can emerge with surprising zero-shot generalization ability, which can be leveraged for robot planning tasks [34], [35], [36], [37], [38], [39], [40], [41], [42], [43], [44], [45]. Several recent methods had successes in extracting task knowledge from LLMs to decompose commands or instructions for robots in natural language. For instance, the work of Huang et al. showed that LLMs can be used for task planning in household domains by iteratively augmenting prompts [38]. SayCan is another approach that enabled robot planning with"}],"limit":50,"offset":0}