{"total":17,"items":[{"citing_arxiv_id":"2605.21408","ref_index":49,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering","primary_cat":"stat.ME","submitted_at":"2026-05-20T17:06:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes nearly balanced TCARDs that minimize the first two generalized word-length pattern components, defines Φ_BCD criterion linked to classical optimality, and constructs designs via coordinate exchange with simulation-calibrated weights for LLM prompt engineering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19316","ref_index":4,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"A Multi-Agent Framework for Feature-Constrained Difficulty Control in Reading Comprehension Item Generation","primary_cat":"cs.CL","submitted_at":"2026-05-19T03:52:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAFIG is a multi-agent framework that uses LLM agents and evaluators to generate reading comprehension items with significantly higher adherence to specified feature constraints than single-agent baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17539","ref_index":20,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis","primary_cat":"cs.AI","submitted_at":"2026-05-17T16:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MEMOIR adds branch-local and global memory with a reflection step to tree 
search for LLM solver synthesis, reaching 96.7% solution validity and 7.3-point score gains over baselines on seven CO problems with lower run-to-run variance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14133","ref_index":97,"ref_count":2,"confidence":0.35,"is_internal_anchor":false,"paper_title":"ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents","primary_cat":"cs.AI","submitted_at":"2026-05-13T21:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClawForge is a generator framework that creates reproducible executable benchmarks for command-line agents under state conflict, with ClawForge-Bench showing frontier models reach at most 45.3% strict accuracy and that state inspection drives most performance gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10171","ref_index":32,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:20:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"function configured to retrieve pairs exhibiting potential semantic divergence. This high-recall filtering ensures that the system retains subtle contradictions while discarding clear agreements, providing a unified candidate set for intensity assessment. 
Across review pairs, extracted evidence is accumulated into an Aspect-Specific Evidence Pool E = {E_{a_1}, ..., E_{a_M}}, where E_{a_m} = ∪_{(i,j)} E_{a_m}^{(i,j)}. (2) Deliberative Intensity Agent (DIA): A Deliberative Intensity Agent (DIA) serves as the core reasoning unit for assigning graded contradiction intensity scores. Given an aspect-aligned evidence pair (e_1^{(j)}, e_2^{(j)}) ∈ E_{a_j}, the agent functions as a probabilistic mapping that predicts a discrete intensity label α_j ∈ {0, 1, 2, 3} (following the rubric of contradiction intensity; see footnote 5) and generates a supporting explanation/reason for the assigned label ρ_j: (α_j, ρ_j) = g_DIA(e_1^{(j)}, e_2^{(j)}, r_i, r_j), (3) where r_i and r_j denote the full review contexts. Conditioning on the full context enables the agent to interpret localized evidence spans within the broader evaluative discourse of each reviewer, distinguishing genuine conflict from rhetorical differences. IMPACT employs two DIAs (DIA-A and DIA-B) which share a functional specification but may be instantiated using diverse underlying LLMs to encourage reasoning variance. Intensity Agreement Checker: The Intensity Agreement Checker functions as a deterministic control gate. It compares the agents' initial independent predictions, α_j^A and α_j^B, to determine whether they agree (i.e., α_j^A = α_j^B). If agreement holds, the shared intensity label is accepted directly and propagated to downstream components without further interaction. Conversely, in the event of disagreement, the deliberation protocol is triggered and managed by the Disagreement Orchestrator. 
Disagreement Orchestrator:The Disagreement Orchestrator (DO) manages structured interaction 5Here, label 0 denotes \"no valid contradiction\" (i"},{"citing_arxiv_id":"2605.09519","ref_index":35,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Weighted Rules under the Stable Model Semantics","primary_cat":"cs.AI","submitted_at":"2026-05-10T13:05:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Weighted rules extend stable model semantics to support probabilistic reasoning, model ranking, and statistical inference in answer set programs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09492","ref_index":27,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"APCD: Adaptive Path-Contrastive Decoding for Reliable Large Language Model Generation","primary_cat":"cs.CL","submitted_at":"2026-05-10T11:57:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APCD adaptively branches LLM decoding paths based on token entropy and contrasts divergent paths to improve factual accuracy while preserving efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03936","ref_index":12,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"The Counterexample Game: Iterated Conceptual Analysis and Repair in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-05T16:26:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models engage in counterexample-repair loops for conceptual definitions but produce increasingly verbose outputs without accuracy gains and hit diminishing returns 
quickly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21794","ref_index":41,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Learning to Communicate: Toward End-to-End Optimization of Multi-Agent Language Systems","primary_cat":"cs.AI","submitted_at":"2026-04-23T15:53:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DiffMAS jointly optimizes latent communication and reasoning in multi-agent LLM systems via parameter-efficient supervised training on trajectories, yielding consistent gains over baselines on math, science, and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21268","ref_index":21,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding","primary_cat":"cs.LG","submitted_at":"2026-04-23T04:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17433","ref_index":30,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Self-Consistency from Only Two Samples: CoT-PoT Ensembling for Efficient LLM 
Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-19T13:26:04+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17351","ref_index":55,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"SOCIA-EVO: Automated Simulator Construction via Dual-Anchored Bi-Level Optimization","primary_cat":"cs.AI","submitted_at":"2026-04-19T09:57:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SOCIA-EVO generates statistically consistent simulators by separating structural refinement from parameter calibration via bi-level optimization and falsifying strategies through execution feedback in a Bayesian-weighted playbook.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04078","ref_index":42,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Validity-Calibrated Reasoning Distillation","primary_cat":"cs.LG","submitted_at":"2026-04-14T12:32:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Validity-calibrated reasoning distillation improves transfer of reasoning skills by modulating updates based on relative local validity of next steps instead of enforcing full trajectory imitation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04066","ref_index":172,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Adapt to Thrive! 
Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:34:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04065","ref_index":187,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2408.08435","ref_index":12,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Automated Design of Agentic Systems","primary_cat":"cs.AI","submitted_at":"2024-08-15T21:59:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.19118","ref_index":20,"ref_count":1,"confidence":0.35,"is_internal_anchor":false,"paper_title":"Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate","primary_cat":"cs.CL","submitted_at":"2023-05-30T15:25:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}