{"total":13,"items":[{"citing_arxiv_id":"2606.30556","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Poller: Are LLMs Suitable for Evaluating the Poetry Understanding Task?","primary_cat":"cs.CL","submitted_at":"2026-06-29T16:51:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Poller reduces LLM-human disagreement in evaluating Chinese poetry understanding by having LLMs role-play as authors, with reported error reductions of 94.55% and 89.53% on rhetorical techniques and defamiliarization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27921","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Show, Don't TELL: Explainable AI-Generated Text Detection","primary_cat":"cs.AI","submitted_at":"2026-05-27T03:47:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TELL is a new architecture for AI text detection that natively supplies explanatory annotations, reaching AUROC 0.927 and a 72.3% human win-rate on explanation quality metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27914","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? 
A Self-Evolving, Trust-by-Construction Evaluation Paradigm","primary_cat":"cs.CL","submitted_at":"2026-05-27T03:41:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17554","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps","primary_cat":"cs.AI","submitted_at":"2026-05-17T17:32:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A new benchmark with cognitive traps shows frontier deep research agents achieve only 13-16% acceptance on expert consulting tasks under combined verifier and rubric criteria.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02765","ref_index":55,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning","primary_cat":"cs.AI","submitted_at":"2026-05-04T16:05:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"These scenarios raise a broader design question: how should 
user-AI interactions with black-box AI models be designed so that users can reliably express and balance their different types of constraints? Traditional approaches to planning have largely emerged from the automated planning domain, where the focus is on constructing action sequences that achieve a goal state while satisfying a set of constraints [55]. Such methods are effective in expert settings such as robotics, logistics, or manufacturing [31, 80, 108], but they are often inaccessible to end users. They rely on specialized planning languages, whereas everyday users need simple ways to express their intentions; they require detailed domain specifications, while users often operate with only partial or high-level"},{"citing_arxiv_id":"2604.23178","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines","primary_cat":"cs.AI","submitted_at":"2026-04-25T07:18:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. Judging LLM-as-a-judge with MT-Bench and chatbot arena. Advances in Neural Information Processing Systems, 36, 2024. A Custom Controlled Dataset Details Our controlled dataset consists of 225 pairs. The first 200 use 50 diverse questions spanning five domains: mathematics (10), coding (10), creative writing (10), factual QA (10), and instruction following (10). For each question, we generate four pair types (expected verdict: tie): • LENGTH (expansion): A base response (avg. 64 words) and an expanded version (avg. 178 words, 2.8× ratio). 
Response A is expanded in 34/50 cases; in 16/50 cases the generation model's expansion was shorter than the base."},{"citing_arxiv_id":"2604.15302","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations","primary_cat":"cs.AI","submitted_at":"2026-04-16T17:58:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM judges display per-document transitivity violations in 33-67% of cases despite low aggregate rates, while conformal prediction set widths serve as reliable indicators of document-level difficulty with cross-judge agreement.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06996","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Preference Bias in Rubric-Based Evaluation of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-08T12:13:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Rubric-based LLM judges show self-preference bias, incorrectly marking their own failed outputs as satisfied up to 50% more often on verifiable benchmarks and skewing scores by 10 points on subjective ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05579","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","primary_cat":"cs.CL","submitted_at":"2024-12-07T08:07:24+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and 
limitations.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"7) MovieLens [80], Zhang et al. [284], Yelp [4] Search (§6.1.8) TREC Deep Learning Track [118], MS MARCO v2 collection [11], LeCaRDv2 [129] Comprehensive Data (§6.1.9) HelpSteer [238], HelpSteer2 [237], UltraFeedback [44], UltraChat [49], ShareGPT [37], TruthfulQA [140], AlpacaEval [56], Chatbot Arena [292], MT-Bench [292], WildBench [138], FLASK [269], RewardBench [116], RM-Bench [148], JudgeBench [213], MLLM-as-a-Judge [24], MM-Eval [202] Metric (§6.2) Accuracy, Pearson [41], Spearman [190], Kendall's Tau [191], Cohen's Kappa [240], ICC [13] Fig. 1. Taxonomy of LLMs-as-judges in functionality, methodology, application, meta-evaluation. tasks in NLP and machine learning. We formalize the input-output structure of the LLMs-as-"},{"citing_arxiv_id":"2411.15594","ref_index":69,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"verifiable judgements. 5.2.4 Others. LLMs have also been employed as evaluators to enhance efficiency and consistency across various fields. In software engineering, a method was proposed for using LLMs to evaluate bug report summarizations, demonstrating high accuracy in assessing correctness and completeness, even surpassing human evaluators who experienced fatigue [69]. This approach offers a scalable solution for evaluation. 
In education, automated essay scoring and revising have been explored using open-source LLMs, achieving performance comparable to traditional deep-learning models. Techniques such as few-shot learning and prompt tuning improved scoring accuracy, while"},{"citing_arxiv_id":"2410.20791","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap","primary_cat":"cs.SE","submitted_at":"2024-10-28T07:16:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.19737","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Better & Faster Large Language Models via Multi-token Prediction","primary_cat":"cs.CL","submitted_at":"2024-04-30T17:33:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-token prediction training yields higher sample efficiency, better benchmark scores on code generation, and up to 3x faster inference than standard next-token prediction for LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2404.13076","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLM Evaluators Recognize and Favor Their Own Generations","primary_cat":"cs.CL","submitted_at":"2024-04-15T16:49:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show measurable 
self-recognition that linearly correlates with self-preference bias in evaluations, supported by fine-tuning experiments and controls for confounders.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}