{"total":58,"items":[{"citing_arxiv_id":"2606.25396","ref_index":42,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions","primary_cat":"cs.AI","submitted_at":"2026-06-24T04:46:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.24828","ref_index":87,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Less is More: Quality-Aware Training Data Selection for Scientific Summarization","primary_cat":"cs.CL","submitted_at":"2026-06-23T17:12:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.23164","ref_index":23,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Same question, different history: language, national identity, and credit in large language models","primary_cat":"cs.CL","submitted_at":"2026-06-22T11:05:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Analysis of 11 LLMs on 21 disputed inventions across 12 languages and 75,896 responses finds query language systematically shifts credit toward lower-status claimants in their associated language 
while Anglophone figures remain stable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20487","ref_index":40,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-06-18T17:04:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"H-RePlan provides hierarchical recovery for cross-device agent systems by distinguishing device-local fixes from global replanning and demonstrates gains on the new fault-injected HeraBench benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19640","ref_index":69,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language","primary_cat":"cs.CL","submitted_at":"2026-06-17T22:36:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Modifying nationality and language parameters in English-centric personas for mental health dialogues introduces clinical inconsistencies across languages and causes LLM judges to perform inaccurately on non-English depression severity assessments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19544","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and 
Bias","primary_cat":"cs.CL","submitted_at":"2026-06-17T19:37:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18829","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents","primary_cat":"cs.LG","submitted_at":"2026-06-17T09:06:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18797","ref_index":2,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports","primary_cat":"cs.CL","submitted_at":"2026-06-17T08:10:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Lightweight metrics trained on Qwen3-8B and MedGemma-4B using synthetic pairs outperform larger medical LLMs at distinguishing clinical significance in radiology reports while balancing discrimination and robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.20676","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Jury Duty: Calibration and 
Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity","primary_cat":"cs.CV","submitted_at":"2026-06-12T15:53:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.13115","ref_index":21,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents","primary_cat":"cs.CL","submitted_at":"2026-06-11T09:42:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"G-Long uses graph-enhanced triplet memory and attention-aware scoring from a T5 summarizer to achieve up to 9.8% better response quality on MSC and 40.8% better retrieval recall on LME with lower overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12984","ref_index":60,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants","primary_cat":"cs.CL","submitted_at":"2026-06-11T07:21:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SkillChain automates skill lifecycle for e-commerce image AI assistants via creator, optimizer, and refiner stages, leading to improved response quality and user engagement in production A/B 
tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10307","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate","primary_cat":"cs.CL","submitted_at":"2026-06-09T01:52:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10296","ref_index":25,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge","primary_cat":"cs.CL","submitted_at":"2026-06-09T01:33:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07951","ref_index":77,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From `May' to `Is': Certainty Distortion in Language Model Rewriting","primary_cat":"cs.CL","submitted_at":"2026-06-06T02:53:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect 
compounds over iterations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05384","ref_index":53,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges","primary_cat":"cs.AI","submitted_at":"2026-06-03T19:37:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04596","ref_index":29,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs","primary_cat":"cs.CL","submitted_at":"2026-06-03T08:34:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Constructs multi-video summarization benchmark and evaluates nine MLLMs showing positional bias is domain- and model-dependent with middle positions often weaker and budgets not uniformly fixing it.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03980","ref_index":52,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill","primary_cat":"cs.LG","submitted_at":"2026-06-02T17:56:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill 
within an agent framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03410","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams","primary_cat":"cs.CV","submitted_at":"2026-06-02T09:54:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Enginuity is the first open benchmark dataset for VLMs on engineering diagrams, with evaluations showing models identify parts but produce low-fidelity descriptions and struggle with factual reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01252","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization","primary_cat":"cs.CL","submitted_at":"2026-05-31T14:12:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00467","ref_index":24,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance","primary_cat":"cs.CL","submitted_at":"2026-05-30T01:21:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs correct only 34.8% of zero-shot annotation errors via prompting, and Definition-Specific Familiarity 
correlates positively with performance (partial r = +0.41) while memorization metrics do not.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31167","ref_index":22,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability","primary_cat":"cs.AI","submitted_at":"2026-05-29T11:20:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31042","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors","primary_cat":"cs.CR","submitted_at":"2026-05-29T09:19:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21086","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic 
Control","primary_cat":"cs.CL","submitted_at":"2026-05-20T12:21:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20364","ref_index":13,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:16:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19766","ref_index":16,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Synthesis and Evaluation of Long-term History-aware Medical Dialogue","primary_cat":"cs.CL","submitted_at":"2026-05-19T12:38:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Creates MediLongChat synthetic longitudinal medical dialogues and benchmarks showing state-of-the-art LLMs struggle with in-dialogue, cross-dialogue, and synthesis reasoning tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18032","ref_index":1,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PROTEA: Offline Evaluation and Iterative Refinement 
for Multi-Agent LLM Workflows","primary_cat":"cs.CL","submitted_at":"2026-05-18T08:22:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15474","ref_index":34,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Jobs' AI Exposure Should Be Measured from Evidence, Not Model Priors","primary_cat":"cs.IR","submitted_at":"2026-05-14T23:29:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The authors propose a retrieval-augmented framework that grounds AI exposure labels for 18,796 O*NET occupation-task pairs in retrieved news and academic abstracts, outperforming zero-shot prompting in 72% of disagreements and aligning better with observed real-world usage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18859","ref_index":15,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing","primary_cat":"cs.LG","submitted_at":"2026-05-14T08:58:59+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM 
judges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14257","ref_index":14,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?","primary_cat":"cs.CL","submitted_at":"2026-05-14T01:57:35+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fine-tuned LLM and explainable models predict vocabulary difficulty with correlations r > 0.91 and r > 0.77, showing spelling difficulty and test item construction as key influences in addition to word production difficulty.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13596","ref_index":71,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations","primary_cat":"cs.CL","submitted_at":"2026-05-13T14:30:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11753","ref_index":219,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention","primary_cat":"cs.AI","submitted_at":"2026-05-12T08:28:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SPeCTrA-Sum uses hierarchical cross-modal fusion via DVP and DPP-distilled image selection via VRP to generate more accurate and visually grounded 
multimodal summaries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11693","ref_index":220,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity","primary_cat":"cs.AI","submitted_at":"2026-05-12T07:50:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10171","ref_index":40,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews","primary_cat":"cs.CL","submitted_at":"2026-05-11T08:20:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"function configured to retrieve pairs exhibiting potential semantic divergence. This high-recall filtering ensures that the system retains subtle contradictions while discarding clear agreements, providing a unified candidate set for intensity assessment. Across review pairs, extracted evidence is accumulated into an Aspect-Specific Evidence Pool E = {E_{a_1}, ..., E_{a_M}}, where E_{a_m} = ⋃_{(i,j)} E^{(i,j)}_{a_m}. (2) 
Deliberative Intensity Agent (DIA): A Deliberative Intensity Agent (DIA) serves as the core reasoning unit for assigning graded contradiction intensity scores. Given an aspect-aligned evidence pair (e_1^{(j)}, e_2^{(j)}) ∈ E_{a_j}, the agent functions as a probabilistic mapping that predicts a discrete intensity label α_j ∈ {0, 1, 2, 3} (following the rubric of contradiction intensity, footnote 5) and generates a supporting explanation/reason for the assigned label ρ_j: (α_j, ρ_j) = g_DIA(e_1^{(j)}, e_2^{(j)}, r_i, r_j), (3) where r_i and r_j denote the full review contexts. Conditioning on the full context enables the agent to interpret localized evidence spans within the broader evaluative discourse of each reviewer, distinguishing genuine conflict from rhetorical differences. IMPACT employs two DIAs (DIA-A and DIA-B) which share a functional specification but may be instantiated using diverse underlying LLMs to encourage reasoning variance. Intensity Agreement Checker: The Intensity Agreement Checker functions as a deterministic control gate. It compares the agents' initial independent predictions, α_j^A and α_j^B, to determine whether they agree (i.e., α_j^A = α_j^B). If agreement holds, the shared intensity label is accepted directly and propagated to downstream components without further interaction. Conversely, in the event of disagreement, the deliberation protocol is triggered and managed by the Disagreement Orchestrator. 
Disagreement Orchestrator: The Disagreement Orchestrator (DO) manages structured interaction [Footnote 5] Here, label 0 denotes \"no valid contradiction\" (i"},{"citing_arxiv_id":"2605.09542","ref_index":7,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs","primary_cat":"cs.AI","submitted_at":"2026-05-10T13:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09533","ref_index":43,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications","primary_cat":"cs.CL","submitted_at":"2026-05-10T13:35:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08590","ref_index":44,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations","primary_cat":"cs.HC","submitted_at":"2026-05-09T01:10:40+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic 
overreach.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Other benchmarks focus on temporal reasoning, testing whether models preserve event order, timing, and time-sensitive facts [7, 12]. A related line of work has developed more fine-grained methods for evaluating grounding, factuality, and model reliability. Faithfulness and factuality research in summarization evaluates whether generated text is supported by source documents [44], with QA-based and entailment-based approaches such as QAFactEval providing automated measures of factual consistency [20]. More recent LLM factuality benchmarks decompose generated outputs into smaller units for evaluation: FActScore checks atomic factual claims against evidence [48], FELM annotates factuality at the segment level with error types and supporting or contradicting references [76], and"},{"citing_arxiv_id":"2605.08503","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"NARRA-Gym for Evaluating Interactive Narrative Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T21:36:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06939","ref_index":10,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Bias and Uncertainty in LLM-as-a-Judge Estimation","primary_cat":"cs.LG","submitted_at":"2026-05-07T20:55:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bias-corrected LLM-as-a-Judge estimators can 
reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06652","ref_index":14,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:56:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06327","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05902","ref_index":128,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluating Non-English Developer Support in Machine Learning for Software 
Engineering","primary_cat":"cs.SE","submitted_at":"2026-05-07T09:14:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00754","ref_index":11,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring","primary_cat":"cs.SE","submitted_at":"2026-05-01T16:07:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing scaling trends and cross-lingual transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26269","ref_index":3,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Calibrated Surprise: An Information-Theoretic Account of Creative Quality","primary_cat":"cs.CL","submitted_at":"2026-04-29T03:53:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21304","ref_index":51,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal 
LLMs","primary_cat":"cs.IR","submitted_at":"2026-04-23T05:42:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19185","ref_index":18,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization","primary_cat":"cs.CL","submitted_at":"2026-04-21T07:51:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17658","ref_index":64,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards Self-Improving Error Diagnosis in Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-04-19T23:13:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04083","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as 
Semantic Evals","primary_cat":"cs.LG","submitted_at":"2026-04-15T17:35:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12049","ref_index":41,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs","primary_cat":"cs.CL","submitted_at":"2026-04-13T20:41:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"wSSAS is a two-phase deterministic framework that uses hierarchical text organization and SNR-based feature prioritization to improve clustering integrity, categorization accuracy, and reproducibility when applying LLMs to large review datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"QAG Mechanics and Embedding Engine: QAG acts as a reference-free \"polygraph test\" for factual consistency [40]. The system generates up to five factual, close-ended questions from the source text and verifies the summary's ability to provide accurate answers. To calculate semantic similarity between true responses and extracted responses, we utilized the sentence-transformers/all-MiniLM-L6-v2 embedding model [41] (a) Triage and Encoding: QAG scores were encoded into a 0 (as good as), 1 (better than), or -1 (worse than) scale, comparing weighted vs. unweighted outputs. A critical triage process was applied to prioritize semantic similarity over verbatim alignment. 
This prevents the penalization of the LLM for utilizing sophisticated paraphrasing while ensuring that factual hallucinations-which an exact-match algorithm"},{"citing_arxiv_id":"2604.07883","ref_index":17,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks","primary_cat":"cs.AI","submitted_at":"2026-04-09T06:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2 Attribution Errors and Safety Inflation in LLM Evaluators Attribution errors, where models conflate cited evidence with authorial claims, are a documented limitation of LLM evaluators [9]. Existing mitigations via retrieval augmentation [14] or post-hoc verification [32] target generation settings and do not transfer to evaluative tasks assessing existing material. Standard frameworks such as G-Eval [17] and MT-Bench [33] lack explicit mechanisms to distinguish endorsed narrative from quoted historical sources, resulting in systematic over-penalization of factually accurate content [26]. 
We address both failure modes through a Source Attribution Protocol that enforces this distinction as a constrained intermediate representation prior to evaluation."},{"citing_arxiv_id":"2604.05912","ref_index":20,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks","primary_cat":"cs.CL","submitted_at":"2026-04-07T14:15:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}