hub Canonical reference

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu · 2023 · Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing · DOI 10.18653/v1/2023.emnlp-main.153

Canonical reference. 100% of citing Pith papers cite this work as background.

58 Pith papers citing it

502 external citations · Crossref

Background 100% of classified citations

open at publisher browse 58 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Enginuity is the first open benchmark dataset for VLMs on engineering diagrams, with evaluations showing models identify parts but produce low-fidelity descriptions and struggle with factual reasoning.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

cs.CL · 2026-05-19 · conditional · novelty 7.0

A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

cs.CL · 2026-05-18 · conditional · novelty 7.0

PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

cs.LG · 2026-05-14 · accept · novelty 7.0 · 2 refs

TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.

LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

cs.HC · 2026-05-09 · accept · novelty 7.0

LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

NARRA-Gym for Evaluating Interactive Narrative Agents

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Evaluating Non-English Developer Support in Machine Learning for Software Engineering

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

cs.SE · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing scaling trends and cross-lingual transfer.

PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

cs.IR · 2026-04-23 · unverdicted · novelty 7.0

PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.

An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

Unlocking Prompt Infilling Capability for Diffusion Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

cs.CL · 2025-12-16 · conditional · novelty 7.0

VLegal-Bench supplies 10,450 expert-validated samples for evaluating LLMs on Vietnamese legal questions, retrieval, multi-step reasoning, and scenario solving.

Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

cs.CL · 2025-11-13 · conditional · novelty 7.0

ATR4CH is a replicable five-step methodology for LLM-based knowledge extraction from cultural heritage documents that combines annotation models and ontological frameworks, achieving F1 scores of 0.96-0.99 for metadata, 0.7-0.8 for entities, and 0.62 G-EVAL for discourse on Wikipedia articles about

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

cs.AI · 2026-06-24 · unverdicted · novelty 6.0

TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.

citing papers explorer

Showing 50 of 58 citing papers.

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents cs.LG · 2026-06-17 · unverdicted · none · ref 16
GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.
Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity cs.CV · 2026-06-12 · unverdicted · none · ref 10
VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.
Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges cs.AI · 2026-06-03 · unverdicted · none · ref 53
LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).
Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams cs.CV · 2026-06-02 · unverdicted · none · ref 12
Enginuity is the first open benchmark dataset for VLMs on engineering diagrams, with evaluations showing models identify parts but produce low-fidelity descriptions and struggle with factual reasoning.
When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation cs.CL · 2026-05-19 · conditional · none · ref 13
A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows cs.CL · 2026-05-18 · conditional · none · ref 1
PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.
TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing cs.LG · 2026-05-14 · accept · none · ref 15 · 2 links
TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations cs.CL · 2026-05-13 · unverdicted · none · ref 71
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity cs.AI · 2026-05-12 · unverdicted · none · ref 220
MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs cs.AI · 2026-05-10 · unverdicted · none · ref 7
TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.
Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations cs.HC · 2026-05-09 · accept · none · ref 44
LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.
NARRA-Gym for Evaluating Interactive Narrative Agents cs.CL · 2026-05-08 · unverdicted · none · ref 11
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity cs.CL · 2026-05-07 · unverdicted · none · ref 3
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering cs.SE · 2026-05-07 · unverdicted · none · ref 128
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring cs.SE · 2026-05-01 · unverdicted · none · ref 11 · 2 links
Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing scaling trends and cross-lingual transfer.
PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs cs.IR · 2026-04-23 · unverdicted · none · ref 51
PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.
AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals cs.LG · 2026-04-15 · unverdicted · none · ref 11
AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks cs.AI · 2026-04-09 · unverdicted · none · ref 17
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks cs.CL · 2026-04-07 · unverdicted · none · ref 20
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 15
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models cs.CL · 2025-12-16 · conditional · none · ref 13
VLegal-Bench supplies 10,450 expert-validated samples for evaluating LLMs on Vietnamese legal questions, retrieval, multi-step reasoning, and scenario solving.
Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates cs.CL · 2025-11-13 · conditional · none · ref 3
ATR4CH is a replicable five-step methodology for LLM-based knowledge extraction from cultural heritage documents that combines annotation models and ontological frameworks, achieving F1 scores of 0.96-0.99 for metadata, 0.7-0.8 for entities, and 0.62 G-EVAL for discourse on Wikipedia articles about
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory cs.CL · 2024-10-14 · unverdicted · none · ref 79
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions cs.AI · 2026-06-24 · unverdicted · none · ref 42
TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.
Less is More: Quality-Aware Training Data Selection for Scientific Summarization cs.CL · 2026-06-23 · unverdicted · none · ref 87
A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.
Same question, different history: language, national identity, and credit in large language models cs.CL · 2026-06-22 · unverdicted · none · ref 23
Analysis of 11 LLMs on 21 disputed inventions across 12 languages and 75,896 responses finds query language systematically shifts credit toward lower-status claimants in their associated language while Anglophone figures remain stable.
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems cs.CL · 2026-06-18 · unverdicted · none · ref 40
H-RePlan provides hierarchical recovery for cross-device agent systems by distinguishing device-local fixes from global replanning and demonstrates gains on the new fault-injected HeraBench benchmark.
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias cs.CL · 2026-06-17 · unverdicted · none · ref 3
Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.
From `May' to `Is': Certainty Distortion in Language Model Rewriting cs.CL · 2026-06-06 · unverdicted · none · ref 77
LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect compounds over iterations.
Skill-RM: Unifying Heterogeneous Evaluation Criteria via Agent Skill cs.LG · 2026-06-02 · unverdicted · none · ref 52
Skill-RM unifies heterogeneous reward criteria by modeling reward computation as dynamic execution of a reusable Reward-Evaluation Skill within an agent framework.
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization cs.CL · 2026-05-31 · unverdicted · none · ref 5
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance cs.CL · 2026-05-30 · unverdicted · none · ref 24
LLMs correct only 34.8% of zero-shot annotation errors via prompting, and Definition-Specific Familiarity correlates positively with performance (partial r = +0.41) while memorization metrics do not.
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors cs.CR · 2026-05-29 · unverdicted · none · ref 14
Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.
Synthesis and Evaluation of Long-term History-aware Medical Dialogue cs.CL · 2026-05-19 · unverdicted · none · ref 16
Creates MediLongChat synthetic longitudinal medical dialogues and benchmarks showing state-of-the-art LLMs struggle with in-dialogue, cross-dialogue, and synthesis reasoning tasks.
Jobs' AI Exposure Should Be Measured from Evidence, Not Model Priors cs.IR · 2026-05-14 · conditional · none · ref 34
The authors propose a retrieval-augmented framework that grounds AI exposure labels for 18,796 O*NET occupation-task pairs in retrieved news and academic abstracts, outperforming zero-shot prompting in 72% of disagreements and aligning better with observed real-world usage.
Towards Visually Grounded Multimodal Summarization via Cross-Modal Transformer and Gated Attention cs.AI · 2026-05-12 · unverdicted · none · ref 219
SPeCTrA-Sum uses hierarchical cross-modal fusion via DVP and DPP-distilled image selection via VRP to generate more accurate and visually grounded multimodal summaries.
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews cs.CL · 2026-05-11 · unverdicted · none · ref 40
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
Bias and Uncertainty in LLM-as-a-Judge Estimation cs.LG · 2026-05-07 · unverdicted · none · ref 10
Bias-corrected LLM-as-a-Judge estimators can reverse true model orderings under shared calibration, and the paper supplies judge quality J and cross-model instability ΔJ as practical diagnostics for when such estimates are unreliable.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization cs.CL · 2026-04-21 · unverdicted · none · ref 18
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Stress Testing Factual Consistency Metrics for Long-Document Summarization cs.CL · 2025-11-10 · unverdicted · none · ref 25
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports cs.CL · 2026-06-17 · unverdicted · none · ref 2
Lightweight metrics trained on Qwen3-8B and MedGemma-4B using synthetic pairs outperform larger medical LLMs at distinguishing clinical significance in radiology reports while balancing discrimination and robustness.
G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents cs.CL · 2026-06-11 · unverdicted · none · ref 21
G-Long uses graph-enhanced triplet memory and attention-aware scoring from a T5 summarizer to achieve up to 9.8% better response quality on MSC and 40.8% better retrieval recall on LME with lower overhead.
Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate cs.CL · 2026-06-09 · unverdicted · none · ref 25
Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.
The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge cs.CL · 2026-06-09 · unverdicted · none · ref 25
In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).
A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs cs.CL · 2026-06-03 · unverdicted · none · ref 29
Constructs multi-video summarization benchmark and evaluates nine MLLMs showing positional bias is domain- and model-dependent with middle positions often weaker and budgets not uniformly fixing it.
LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability cs.AI · 2026-05-29 · unverdicted · none · ref 22
Introduces LLM-FACETS, a privacy-preserving open-source framework for LLM evaluation using deterministic metrics locally, LLM-judge metrics with user-controlled APIs, and mechanisms for uncertainty visualization and hallucination detection.
LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control cs.CL · 2026-05-20 · unverdicted · none · ref 18
LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.
Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult? cs.CL · 2026-05-14 · accept · none · ref 14 · 2 links
Fine-tuned LLM and explainable models predict vocabulary difficulty with correlations r > 0.91 and r > 0.77, showing spelling difficulty and test item construction as key influences in addition to word production difficulty.
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels cs.LG · 2026-05-07 · unverdicted · none · ref 14
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
Towards Self-Improving Error Diagnosis in Multi-Agent Systems cs.MA · 2026-04-19 · unverdicted · none · ref 64
ErrorProbe introduces a self-improving pipeline for attributing semantic failures in LLM multi-agent systems to specific agents and steps via anomaly detection, backward tracing, and tool-grounded validation with verified episodic memory.

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer