super hub Mixed citations

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Siyuan Zhuang, Wei-Lin Chiang, Ying Sheng, Yonghao Zhuang, Zhanghao Wu · 2023 · cs.CL · arXiv 2306.05685

Mixed citation behavior. Most common role is background (47%).

269 Pith papers citing it

Background 47% of classified citations

open full Pith review browse 269 citing papers more from Lianmin Zheng arXiv PDF

abstract

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 15 method 10 dataset 4 baseline 1

citation-polarity summary

background 14 use method 9 use dataset 4 unclear 2 baseline 1

claims ledger

abstract Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-be

authors

Lianmin Zheng Siyuan Zhuang Wei-Lin Chiang Ying Sheng Yonghao Zhuang Zhanghao Wu

co-cited works

representative citing papers

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

cs.CL · 2023-08-28 · unverdicted · novelty 8.0

LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).

Universal and Transferable Adversarial Attacks on Aligned Language Models

cs.CL · 2023-07-27 · accept · novelty 8.0

Gradient and greedy search over token suffixes produces universal, transferable adversarial prompts that elicit objectionable outputs from aligned models including black-box commercial systems.

FARS: A Fully Automated Research System Deployed at Scale

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.

EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.

CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents

cs.AI · 2026-06-29 · unverdicted · novelty 7.0

CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.

AI Fiction in the Wild

cs.CL · 2026-06-22 · unverdicted · novelty 7.0

Analysis of 500k ChatGPT logs shows over one-third of conversations generate fiction, dominated by power users with repetitive and niche patterns.

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

cs.AI · 2026-06-08 · unverdicted · novelty 7.0

RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.

Cherry-pick Override: Unsafe Directional Commitment in LLM Judges under Mixed Evidence

cs.SE · 2026-06-05 · unverdicted · novelty 7.0

The paper defines Cherry-pick Override (CCO) as unauthorized directional commitment by LLM judges under mixed evidence and quantifies its prevalence (>84% on AVeriTeC conflicting subset) while testing intervention ladders and a two-channel reference probe.

Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems

cs.AI · 2026-06-05 · unverdicted · novelty 7.0

MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

CoEval: Ranking Language Models for Custom Tasks Without Labeled Data or Trustworthy Benchmarks

cs.CL · 2026-06-02 · conditional · novelty 7.0

CoEval generates task-specific benchmarks by rotating models through teacher, student, and judge roles, then weights questions by discriminative power and judges by panel consensus to recover accurate model rankings without labels.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

RWGBench: Evaluating Scholarly Positioning in Related Work Generation

cs.DL · 2026-05-30 · unverdicted · novelty 7.0

RWGBench is a citation-centric benchmark for related work generation built from 40k CS papers and a 100-paper test set, with multi-dimensional metrics that better match human expert judgment than standard similarity scores.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

A Behavioral Specification interpretive layer improves representational accuracy for AI personalization by compressing user data into patterns, outperforming raw corpora and commercial memory systems on held-out behavioral predictions across 14 autobiographical corpora while reducing context cost.

OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

cs.SE · 2026-05-26 · conditional · novelty 7.0

LogDx-CI benchmark shows hybrid grep+tail reducers achieve top diagnosis quality at low cost, agent loops shrink quality variance across reducers, and cross-family LLM summarizers outperform same-family pairs.

MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning

cs.AI · 2026-05-25 · unverdicted · novelty 7.0

MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

SomaliBench finds large English-to-Somali refusal gaps (0.38 to 0.90) across Llama-3.1-8B, Gemma-2-9B, Qwen-2.5-7B, and Aya-23-8B, with many Somali responses being unclear rather than compliant.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

LLMs show severe staleness after training cutoffs and recency bias on historical German statutes; RAG with version filtering mitigates both better than web search.

citing papers explorer

Showing 50 of 84 citing papers after filters.

Why Do Multi-Agent LLM Systems Fail? cs.AI · 2025-03-17 · unverdicted · none · ref 22 · internal anchor
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
FARS: A Fully Automated Research System Deployed at Scale cs.AI · 2026-06-30 · unverdicted · none · ref 25 · internal anchor
FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.
EMPATH: A Multilingual Auditor-Judge Benchmark for Safety Evaluation of Emotional-Support Chatbots cs.AI · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
The paper presents EMPATH, a new multilingual multi-turn benchmark for safety evaluation of emotional-support chatbots that uses separate auditor and judge models and releases its pipeline and rubrics.
CLQT: A Closed-Loop, Cost-Aware, Strategy-Consistent Benchmark for Diagnostic Evaluation of LLM Portfolio-Management Agents cs.AI · 2026-06-29 · unverdicted · none · ref 13 · internal anchor
CLQT is a new closed-loop, cost-aware benchmark that diagnoses LLM trading agent capabilities through strategy-consistent metrics and hash-verifiable trails rather than outcome rankings.
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning cs.AI · 2026-06-08 · unverdicted · none · ref 1 · internal anchor
RealMath-Eval benchmark shows LLM judges have an evaluation gap, performing worse on diverse real human math reasoning than on synthetic solutions due to greater error diversity and higher surprisal.
Beyond Goodhart's Law: A Dynamic Benchmark for Evaluating Compliance in Multi-Agent Systems cs.AI · 2026-06-05 · unverdicted · none · ref 78 · internal anchor
MAC-Bench is a new adversarial benchmark that converts legal texts into executable scenarios via the SERV pipeline to measure procedural compliance in multi-agent LLM systems using CSR and MG metrics.
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges cs.AI · 2026-06-03 · unverdicted · none · ref 79 · internal anchor
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
OR-Space: A Full-Lifecycle Workspace Benchmark for Industrial Optimization Agents cs.AI · 2026-05-27 · unverdicted · none · ref 46 · internal anchor
OR-Space is a benchmark for LLM agents performing full-lifecycle optimization tasks across Build, Revise, and Explain modes in executable multi-artifact workspaces.
MuCRASP: Multimodal Chain-of-thought Reasoning aware Structured Pruning cs.AI · 2026-05-25 · unverdicted · none · ref 55 · internal anchor
MuCRASP prunes VLMs in a CoT-aware manner, outperforming baselines by preserving reasoning quality at 30-50% compression rates on models like Qwen2.5-VL-7B.
TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization cs.AI · 2026-05-20 · unverdicted · none · ref 49 · internal anchor
A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand tasks.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 35 · internal anchor
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 57 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science cs.AI · 2026-05-18 · unverdicted · none · ref 83 · internal anchor
SCICONVBENCH is a new benchmark evaluating LLMs on multi-turn disambiguation and inconsistency resolution for task formulation in computational science, with frontier models reaching only 52.7% success on fluid mechanics disambiguation cases.
Recall Isn't Enough: Bounding Commitments in Personalized Language Systems cs.AI · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
CBEA with LCV bounds evidence sets and validates commitments before response generation, achieving zero failures in scoped tests at 0.49-0.60 availability versus near-zero for baselines.
Belief Engine: Configurable and Inspectable Stance Dynamics in Multi-Agent LLM Deliberation cs.AI · 2026-05-14 · unverdicted · none · ref 34 · internal anchor
Belief Engine is a configurable belief-update mechanism for multi-agent LLM systems that uses structured argument extraction and log-odds stance updates to make evidence-grounded deliberation inspectable and controllable.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs cs.AI · 2026-04-22 · unverdicted · none · ref 19 · internal anchor
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models cs.AI · 2026-04-21 · unverdicted · none · ref 10 · internal anchor
Position bias scales positively with reasoning trajectory length in CoT models, shown by partial correlations and truncation interventions across multiple benchmarks and model scales.
Don't Start What You Can't Finish: A Counterfactual Audit of Support-State Triage in LLM Agents cs.AI · 2026-04-17 · unverdicted · none · ref 18 · internal anchor
LLM agents overcommit on non-complete tasks at 41.7% unless given explicit support-state categories, which raise typed deferral accuracy to 91.7%.
Evaluating Tool-Using Language Agents: Judge Reliability, Propagation Cascades, and Runtime Mitigation in AgentProp-Bench cs.AI · 2026-04-17 · conditional · none · ref 32 · internal anchor
AgentProp-Bench shows substring judging agrees with humans at kappa=0.049, LLM ensemble at 0.432, bad-parameter injection propagates with ~0.62 probability, rejection and recovery are independent, and a runtime fix cuts hallucinations 23pp on GPT-4o-mini but not Gemini-2.0-Flash.
The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents cs.AI · 2026-04-15 · unverdicted · none · ref 2 · internal anchor
A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.
IoT-Brain: Grounding LLMs for Semantic-Spatial Sensor Scheduling cs.AI · 2026-04-09 · unverdicted · none · ref 89 · internal anchor
IoT-Brain uses a neuro-symbolic Spatial Trajectory Graph to ground LLMs for verifiable semantic-spatial sensor scheduling, achieving 37.6% higher task success with lower resource use on a campus-scale benchmark.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks cs.AI · 2026-04-09 · unverdicted · none · ref 33 · internal anchor
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
LatentAudit: Real-Time White-Box Faithfulness Monitoring for Retrieval-Augmented Generation with Verifiable Deployment cs.AI · 2026-04-07 · unverdicted · none · ref 9 · internal anchor
LatentAudit monitors RAG faithfulness in real time via Mahalanobis distance on residual-stream activations, reaching 0.942 AUROC on PubMedQA with 0.77 ms overhead and supporting Groth16 verification.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning cs.AI · 2026-01-14 · unverdicted · none · ref 23 · internal anchor
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
Theoria: Rewrite-Acceptability Verification over Informal Reasoning States cs.AI · 2026-07-01 · unverdicted · none · ref 8 · internal anchor
Theoria rewrites solutions into auditable typed state transitions with justifications, certifying 105 of 185 HLE problems at 91.4% precision and outperforming holistic judges on adversarial poisoned proofs by catching hidden premises.
Graph-Native Reinforcement Learning Enables Traceable Scientific Hypothesis Generation through Conceptual Recombination cs.AI · 2026-07-01 · unverdicted · none · ref 9 · internal anchor
Graph-PRefLexOR fine-tunes graph-native models with GRPO to organize reasoning into phases, yielding 40-65% gains in traceable hypothesis generation and 2-3x semantic diversity on 100 materials science questions.
Open Problems in Constitutional Preference Reconstruction cs.AI · 2026-06-29 · unverdicted · none · ref 10 · internal anchor
Empirical analysis across three datasets identifies three open problems in constitutional preference reconstruction and shows that principle refinement raises inter-executor agreement from 73% to 78%.
IPO Finance Agent: Benchmark of LLM Financial Analysts Beyond Finance Agent v2, with Automated Rubric Generation, on the SpaceX (SPCX) IPO cs.AI · 2026-06-22 · unverdicted · none · ref 40 · internal anchor
IPO Finance Agent benchmarks LLMs on SpaceX S-1 questions with contextual retrieval and auto-generated rubrics, reporting up to 79.8% accuracy and better cost-efficiency than prior Finance Agent v2 entries.
Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement cs.AI · 2026-06-11 · unverdicted · none · ref 8 · internal anchor
AgentBuild builds scientific agents from version-controlled contracts for Rietveld refinement, progressing through a signal-to-noise ladder to handle 4-hour scans while exposing workflow limits.
Skill Coverage: A Test Adequacy Metric for Agent Skills cs.AI · 2026-06-09 · unverdicted · none · ref 18 · internal anchor
Skill coverage is a binary test adequacy metric that extracts observable behavior constraints from skill documents and judges whether trajectories provide sufficient evidence to cover each constraint, revealing 39.90-43.98% coverage on SkillsBench.
LLM Self-Recognition: Steering and Retrieving Activation Signatures cs.AI · 2026-06-04 · unverdicted · none · ref 57 · internal anchor
Steering LLM residual streams with random sparse vectors creates detectable self-recognition fingerprints that enable over 98% accurate attribution of generated text to specific models without degrading output quality.
DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation cs.AI · 2026-05-28 · unverdicted · none · ref 24 · internal anchor
DeepSurvey introduces an agentic system for automated survey generation that improves depth through full-text keynotes, cross-paper clustering, and code analysis, while boosting citation reliability via graph expansion, hybrid filtering, and evidence-constrained assignment, with reported gains over
Verifiable Benchmarking of Long-Horizon Spatial Biology cs.AI · 2026-05-27 · unverdicted · none · ref 21 · internal anchor
Introduces SpatialBench-Long benchmark with 24 evaluations on spatial biology datasets from PDAC, glioblastoma, lung adenocarcinoma and optic nerve systems, reporting top model performance at 8/72 runs (11.1%).
Know When To Fold 'Em: Token-Efficient LLM Synthetic Data Generation via Multi-Stage In-Flight Rejection cs.AI · 2026-05-13 · unverdicted · none · ref 19 · internal anchor
MSIFR stops faulty LLM generations early via staged rule-based checks, reducing token consumption 11-78% with no accuracy loss.
Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation cs.AI · 2026-05-11 · unverdicted · none · ref 30 · internal anchor
Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces cs.AI · 2026-05-09 · unverdicted · none · ref 131 · internal anchor
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
Evaluating Agentic AI in the Wild: Failure Modes, Drift Patterns, and a Production Evaluation Framework cs.AI · 2026-05-02 · unverdicted · none · ref 14 · internal anchor
The paper presents a taxonomy of seven production-specific failure modes for agentic AI, demonstrates that existing metrics fail to detect four of them entirely, and proposes the PAEF five-dimension framework for continuous production evaluation with an open-source implementation.
Iterative Finetuning is Mostly Idempotent cs.AI · 2026-05-01 · unverdicted · none · ref 17 · internal anchor
Iterative self-finetuning of LLMs mostly fails to amplify seeded behavioral traits, with amplification limited to specific DPO setups and often harming coherence.
Where and What: Reasoning Dynamic and Implicit Preferences in Situated Conversational Recommendation cs.AI · 2026-04-22 · unverdicted · none · ref 98 · internal anchor
SiPeR improves recommendation accuracy and response quality in situated conversations by estimating scene transitions and performing Bayesian inverse inference with multimodal LLMs.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 27 · internal anchor
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
CreativeGame:Toward Mechanic-Aware Creative Game Generation cs.AI · 2026-04-21 · unverdicted · none · ref 10 · internal anchor
CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.
Beyond RAG for Cyber Threat Intelligence: A Systematic Evaluation of Graph-Based and Agentic Retrieval cs.AI · 2026-04-13 · unverdicted · none · ref 44 · internal anchor
A hybrid graph-text retrieval system for cyber threat intelligence improves multi-hop question answering by up to 35% over vector-based RAG on a 3,300-question benchmark.
TrajOnco: a multi-agent framework for temporal reasoning over longitudinal EHR for multi-cancer early detection cs.AI · 2026-04-12 · unverdicted · none · ref 20 · internal anchor
TrajOnco uses a chain-of-agents LLM architecture with memory to perform temporal reasoning on longitudinal EHR, achieving 0.64-0.80 AUROC for 1-year multi-cancer risk prediction in zero-shot mode on matched cohorts while matching supervised ML on lung cancer and outperforming single-agent baselines.
In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach cs.AI · 2026-04-10 · unverdicted · none · ref 113 · internal anchor
A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
Can Large Language Models Reinvent Foundational Algorithms? cs.AI · 2026-04-07 · unverdicted · none · ref 3 · internal anchor
LLMs reinvention success reaches 50% with no hints and 90% with moderate hints on 10 algorithms, but complex cases like Strassen's require test-time RL and still fail without sufficient guidance.
Label Effects: Shared Heuristic Reliance in Trust Assessment by Humans and LLM-as-a-Judge cs.AI · 2026-04-07 · unverdicted · none · ref 53 · internal anchor
Both humans and LLMs trust content more when labeled human-authored than AI-generated, with LLMs showing denser attention to labels and higher uncertainty under AI labels, mirroring human heuristic patterns.
Weakly Supervised Distillation of Hallucination Signals into Transformer Representations cs.AI · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
Weak supervision signals can be distilled into LLM hidden states so that simple probes on internal activations detect hallucinations at inference without external tools.
Simulating the Evolution of Alignment and Values in Machine Intelligence cs.AI · 2026-04-07 · unverdicted · none · ref 19 · internal anchor
Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
Evaluating Artificial Intelligence Through a Christian Understanding of Human Flourishing cs.AI · 2026-04-03 · unverdicted · none · ref 20 · internal anchor
Frontier AI models default to procedural secularism and score 17 points lower on Christian human-flourishing criteria than on pluralistic ones, with a 31-point gap in faith and spirituality.
Agentic Business Process Management: A Research Manifesto cs.AI · 2026-03-19 · unverdicted · none · ref 68 · internal anchor
Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organizational objectives.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer