{"total":16,"items":[{"citing_arxiv_id":"2605.18805","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RecoAtlas: From Semantic Plausibility to Set-Level Utility in LLM Recommendation Agents","primary_cat":"cs.IR","submitted_at":"2026-05-11T18:55:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RecoAtlas is a benchmark that evaluates LLM recommendation agents on behavior-grounded metrics for relevance, complementarity, and diversity in addition to semantic coherence.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.12422","ref_index":236,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering","primary_cat":"cs.CY","submitted_at":"2026-05-08T16:32:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"LLM graders achieve substantial human agreement on math and science MCAS items but vary on ELA, performing best as sources of formative narrative feedback rather than summative numerical scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27132","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRUST: A Framework for Decentralized AI Service v.0.1","primary_cat":"cs.AI","submitted_at":"2026-04-29T19:32:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while proving a Safety-Profitability Theorem that rewards honest 
auditors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23178","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines","primary_cat":"cs.AI","submitted_at":"2026-04-25T07:18:30+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"puts, yet LLM judges exhibit systematic biases that compromise evaluation reliability. We present a comprehensive empirical study comparing nine debiasing strategies across five judge models from four provider families (Google, Anthropic, OpenAI, Meta), three bench- marks (MT-Benchn=400, LLMBarn=200, customn=225), and four bias types. Our key findings: (1) Style bias is the dominant bias (0.76-0.92 across all models), far exceeding position bias (≤0.04), yet has received minimal research attention. (2) All models show a conciseness preference on expansion pairs, but truncation controls confirm they correctly distinguish quality from length (0.92-1.00 accuracy), suggesting quality-sensitive evaluation"},{"citing_arxiv_id":"2604.21769","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Who Defines \"Best\"? 
Towards Interactive, User-Defined Evaluation of LLM Leaderboards","primary_cat":"cs.AI","submitted_at":"2026-04-23T15:28:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Analysis of the LMArena dataset reveals heavy topic skew and varying model rankings, leading to an interactive visualization tool for users to define custom evaluation priorities on LLM leaderboards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18835","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring","primary_cat":"cs.CL","submitted_at":"2026-04-20T20:59:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs exhibit positional bias and context-dependent scoring patterns when judging document similarity, with each model showing a stable scoring fingerprint but a shared hierarchy of sensitivity to different semantic perturbations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09791","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Pioneer Agent: Continual Improvement of Small Language Models in Production","primary_cat":"cs.AI","submitted_at":"2026-04-10T18:13:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style 
tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"verify that identified weaknesses are systematic; and (4)parent model awareness, which inspects model lineage to determine whether corrective data should complement an existing training set or be built from scratch. We describe each stage in turn below. (1)Trace ingestion:The agent operates over the production inference database described above. Each record is a tuple ti = (xi,ˆyi, y ∗ i , v i, r i, m i)(12) where the first five fields are as defined in Eq. 10 and mi encodes judge metadata (the judge model, prompt template, and evaluation criteria). The judge itself may be a deterministic scorer-such as a token-level F1 function that computes overlap against a gold reference-or an LLM judge such as DeepSeek; in the deterministic case, ri contains the score breakdown and y∗"},{"citing_arxiv_id":"2604.02359","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis","primary_cat":"cs.CL","submitted_at":"2026-03-20T04:31:03+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.03332","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Fragile Thoughts: How Large Language Models Handle Chain-of-Thought 
Perturbations","primary_cat":"cs.CL","submitted_at":"2026-02-11T03:11:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs show heterogeneous robustness to five types of chain-of-thought perturbations, with MathError causing 50-60% accuracy loss in small models but scaling benefits, UnitConversion remaining hard across sizes, and ExtraSteps causing minimal degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20284","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Can LLMs Make (Personalized) Access Control Decisions?","primary_cat":"cs.CR","submitted_at":"2025-11-25T13:11:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs reflect users' privacy preferences in access control decisions with up to 86% agreement and can promote safer behavior, but personalization trades off higher individual match for potentially less secure results when users over-permission.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.22359","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models","primary_cat":"cs.AI","submitted_at":"2025-07-30T03:50:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"League of LLMs organizes LLMs into a self-governed mutual evaluation league using dynamic, transparent, objective, and professional criteria to distinguish model capabilities with 70.7% top-k ranking 
stability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.05579","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods","primary_cat":"cs.CL","submitted_at":"2024-12-07T08:07:24+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Publication date: December 2024. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods 5 LLMs-as-judges LIMITATION (§7) Biases (§7.1) Presentation-Related(§7.1.1) Position bias [16, 25, 87, 107, 111, 127, 130, 132, 134, 151, 182, 184, 197, 208, 230, 234, 288, 291, 292, 296], Verbosity bias [163, 267, 270] Social-Related (§7.1.2)Authority bias [25, 267, 289], Bandwagon-effect bias [112, 267], Compassion-fade bias [112, 267], Diversity bias [25, 267] Content-Related(§7.1.3) Sentiment bias [267], Token Bias [98, 127, 178, 184], Contextual Bias [62, 78, 179, 290, 296, 300] Cognitive-Related(§7.1.4) Overconfidence bias [103, 107], Self-enhancement bias [8, 19, 132, 145, 145, 267, 292], Refinement-aware bias [259, 267]Distraction bias [112, 195, 267], Fallacy-oversight bias [25, 267]"},{"citing_arxiv_id":"2411.15594","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on LLM-as-a-Judge","primary_cat":"cs.CL","submitted_at":"2024-11-23T16:03:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such 
systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To increase transparency in multi-modal evaluations, Xiong et al. [174] explored the use of LLM- as-a-judge to not only score model outputs but also generate natural language rationales explaining each assessment. This dual approach improves the interpretability of model judgments and helps developers identify failure modes. In a specialized application, Chen et al . [20] constructed the first benchmark for evaluating large vision-language models (LVLMs) on self-driving corner cases. Their results demonstrate that LLM-based judges correlate more closely with human evaluations than judgments provided by LVLMs themselves, underscoring the generalizability of text-based judges even in vision-dominated tasks. Surveying broader trends, Jiang et al."},{"citing_arxiv_id":"2410.20791","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap","primary_cat":"cs.SE","submitted_at":"2024-10-28T07:16:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A semi-structured thematic synthesis identifies core challenges in FM selection, alignment, prompting, orchestration, testing, deployment, and cross-cutting concerns like observability for production-ready FMware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.21772","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ShieldGemma: Generative AI Content Moderation Based on Gemma","primary_cat":"cs.CL","submitted_at":"2024-07-31T17:48:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"ShieldGemma delivers a family of Gemma2-based classifiers that 
outperform Llama Guard and WildGuard on public safety benchmarks while introducing a synthetic-data curation pipeline for safety tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14782","ref_index":255,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","primary_cat":"cs.CL","submitted_at":"2024-05-23T16:50:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}