Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, Chenguang Zhu · 2023 · Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing · DOI 10.18653/v1/2023.emnlp-main.153

Canonical reference. 100% of citing Pith papers cite this work as background.

58 Pith papers citing it

502 external citations · Crossref

Background 100% of classified citations

citation-role summary

background 5

citation-polarity summary

background 5

representative citing papers

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

GateMem benchmark shows no existing memory method for LLM agents achieves strong utility, access control, and reliable forgetting simultaneously in multi-principal shared settings.

Jury Duty: Calibration and Orientation Failures in MLLM-as-a-Judge Under Cultural Ambiguity

cs.CV · 2026-06-12 · unverdicted · novelty 7.0

VOIR DIRE benchmark shows MLLM-as-a-Judge systems decompose into positivity-floor calibration failure and orientation failure on culturally contested items, with persona prompting recovering only the former.

Stability vs. Manipulability: Evaluating Robustness Under Post-Decision Interaction in LLM Judges

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

LLM judges exhibit high stability under neutral re-evaluation but substantial reversibility under targeted post-decision challenges, quantified via a new Evaluation Robustness Score (ERS).

Enginuity: A Dataset and Benchmark for Vision-Language Understanding of Engineering Diagrams

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

Enginuity is the first open benchmark dataset for VLMs on engineering diagrams, with evaluations showing models identify parts but produce low-fidelity descriptions and struggle with factual reasoning.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation

cs.CL · 2026-05-19 · conditional · novelty 7.0

A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

cs.CL · 2026-05-18 · conditional · novelty 7.0

PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

cs.LG · 2026-05-14 · accept · novelty 7.0 · 2 refs

TwinRouterBench supplies 970 execution-verified router prefixes across five datasets plus a live harness for 100 held-out SWE-bench cases, scoring routers on tier accuracy, trajectory success, and realized token cost without LLM judges.

Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.

Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

cs.AI · 2026-05-12 · unverdicted · novelty 7.0

MM-Eval unifies evaluation of multimodal summaries by integrating factual text quality, cross-modal relevance via MLLM judge, and visual diversity via truncated CLIP entropy, then calibrates their combination on human preferences.

LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

cs.AI · 2026-05-10 · unverdicted · novelty 7.0

TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.

Causal Stories from Sensor Traces: Auditing Epistemic Overreach in LLM-Generated Personal Sensing Explanations

cs.HC · 2026-05-09 · accept · novelty 7.0

LLMs routinely produce unsupported causal stories for personal sensing anomalies, and richer evidence or constrained prompts do not reliably eliminate this epistemic overreach.

NARRA-Gym for Evaluating Interactive Narrative Agents

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Evaluating Non-English Developer Support in Machine Learning for Software Engineering

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

cs.SE · 2026-05-01 · unverdicted · novelty 7.0 · 2 refs

Themis introduces the largest open code preference dataset with over 350k pairs and trains multilingual reward models from 600M to 32B parameters that support flexible multi-criteria scoring, with experiments showing scaling trends and cross-lingual transfer.

PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

cs.IR · 2026-04-23 · unverdicted · novelty 7.0

PaperMind is a new benchmark that evaluates integrated multimodal reasoning and critique over scientific papers through four complementary task families across seven domains.

AsymmetryZero: A Framework for Operationalizing Human Expert Preferences as Semantic Evals

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

AsymmetryZero operationalizes expert preferences as stable evaluation contracts for semantic evals, with a study showing 75.9-89.6% criterion agreement between frontier and compact model juries at 4-5% of the cost.

An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks

cs.AI · 2026-04-09 · unverdicted · novelty 7.0

An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83.3% acceptable excerpts and human preference in 64.8% of blind comparisons.

FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

Unlocking Prompt Infilling Capability for Diffusion Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models

cs.CL · 2025-12-16 · conditional · novelty 7.0

VLegal-Bench supplies 10,450 expert-validated samples for evaluating LLMs on Vietnamese legal questions, retrieval, multi-step reasoning, and scenario solving.

Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates

cs.CL · 2025-11-13 · conditional · novelty 7.0

ATR4CH is a replicable five-step methodology for LLM-based knowledge extraction from cultural heritage documents that combines annotation models and ontological frameworks, achieving F1 scores of 0.96-0.99 for metadata, 0.7-0.8 for entities, and 0.62 G-EVAL for discourse on Wikipedia articles about

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

Long-Term Simulation Exposes Cognitive-Developmental Risks in AI Companions

cs.AI · 2026-06-24 · unverdicted · novelty 6.0

TSJ longitudinal simulation framework finds that short-term AI safety tests underestimate developmental risks, with early childhood and emerging adulthood as most vulnerable stages across cognitive trust and emotional dependency.

citing papers explorer

Showing 35 of 35 citing papers after filters.

When Reasoning Supervision Hurts: TTCW-Based Long-Form Literary Review Generation cs.CL · 2026-05-19 · conditional · none · ref 13
A new 263k TTCW-annotated story dataset shows non-reasoning fine-tuning of Qwen3 models outperforms reasoning-supervised fine-tuning for fixed-format long-form literary review generation.
PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows cs.CL · 2026-05-18 · conditional · none · ref 1
PROTEA supplies an offline interface for scoring intermediate outputs in multi-agent LLM workflows, performing backward evaluation from final answers, and iterating on targeted prompt revisions with visible score changes.
Creativity Bias: How Machine Evaluation Struggles with Creativity in Literary Translations cs.CL · 2026-05-13 · unverdicted · none · ref 71
Automatic evaluation tools for literary translations correlate poorly with expert human judgments on creativity and exhibit bias favoring machine-translated texts.
NARRA-Gym for Evaluating Interactive Narrative Agents cs.CL · 2026-05-08 · unverdicted · none · ref 11
NARRA-Gym is an executable benchmark that generates complete interactive narrative episodes from emotional seeds and logs full model trajectories to expose gaps in coherence, adaptation, and personalization that static story tests miss.
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity cs.CL · 2026-05-07 · unverdicted · none · ref 3
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks cs.CL · 2026-04-07 · unverdicted · none · ref 20
FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 15
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models cs.CL · 2025-12-16 · conditional · none · ref 13
VLegal-Bench supplies 10,450 expert-validated samples for evaluating LLMs on Vietnamese legal questions, retrieval, multi-step reasoning, and scenario solving.
Knowledge Graphs Generation from Cultural Heritage Texts: Combining LLMs and Ontological Engineering for Scholarly Debates cs.CL · 2025-11-13 · conditional · none · ref 3
ATR4CH is a replicable five-step methodology for LLM-based knowledge extraction from cultural heritage documents that combines annotation models and ontological frameworks, achieving F1 scores of 0.96-0.99 for metadata, 0.7-0.8 for entities, and 0.62 G-EVAL for discourse on Wikipedia articles about
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory cs.CL · 2024-10-14 · unverdicted · none · ref 79
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
Less is More: Quality-Aware Training Data Selection for Scientific Summarization cs.CL · 2026-06-23 · unverdicted · none · ref 87
A 1.88-million-article biomedical summarization dataset is released and quality-aware selection of training data based on abstract alignment outperforms random sampling on factuality metrics.
Same question, different history: language, national identity, and credit in large language models cs.CL · 2026-06-22 · unverdicted · none · ref 23
Analysis of 11 LLMs on 21 disputed inventions across 12 languages and 75,896 responses finds query language systematically shifts credit toward lower-status claimants in their associated language while Anglophone figures remain stable.
Beyond Global Replanning: Hierarchical Recovery for Cross-Device Agent Systems cs.CL · 2026-06-18 · unverdicted · none · ref 40
H-RePlan provides hierarchical recovery for cross-device agent systems by distinguishing device-local fixes from global replanning and demonstrates gains on the new fault-injected HeraBench benchmark.
Reliability without Validity: A Systematic, Large-Scale Evaluation of LLM-as-a-Judge Models Across Agreement, Consistency, and Bias cs.CL · 2026-06-17 · unverdicted · none · ref 3
Large-scale study of 21 LLM-as-a-Judge models shows exact-match agreement overstates reliability, rankings shift across benchmarks, and high consistency can mask position bias.
From 'May' to 'Is': Certainty Distortion in Language Model Rewriting cs.CL · 2026-06-06 · unverdicted · none · ref 77
LMs systematically inflate expressed certainty during rewriting, affecting up to 75% of outputs with a 1.5-2x bias toward increasing rather than decreasing certainty, and the effect compounds over iterations.
Understanding LLM Behavior in Multi-Target Cross-Lingual Summarization cs.CL · 2026-05-31 · unverdicted · none · ref 5
Introduces the MEA benchmark for multi-target cross-lingual summarization across 24 languages and demonstrates that activation steering from English summarization representations improves performance.
On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance cs.CL · 2026-05-30 · unverdicted · none · ref 24
LLMs correct only 34.8% of zero-shot annotation errors via prompting, and Definition-Specific Familiarity correlates positively with performance (partial r = +0.41) while memorization metrics do not.
Synthesis and Evaluation of Long-term History-aware Medical Dialogue cs.CL · 2026-05-19 · unverdicted · none · ref 16
Creates MediLongChat synthetic longitudinal medical dialogues and benchmarks showing state-of-the-art LLMs struggle with in-dialogue, cross-dialogue, and synthesis reasoning tasks.
When Reviews Disagree: Fine-Grained Contradiction Analysis in Scientific Peer Reviews cs.CL · 2026-05-11 · unverdicted · none · ref 40
Introduces RevCI benchmark and IMPACT multi-agent framework for evidence-level contradiction detection and graded intensity scoring in peer reviews, distilled into efficient TIDE model.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization cs.CL · 2026-04-21 · unverdicted · none · ref 18
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Stress Testing Factual Consistency Metrics for Long-Document Summarization cs.CL · 2025-11-10 · unverdicted · none · ref 25
Short-form factual consistency metrics produce inconsistent scores on semantically equivalent long-document summaries and lose reliability on information-dense claims.
Beyond Scalar Scores: Exploring LLM-based Metrics for Clinical Significance Evaluation in Radiology Reports cs.CL · 2026-06-17 · unverdicted · none · ref 2
Lightweight metrics trained on Qwen3-8B and MedGemma-4B using synthetic pairs outperform larger medical LLMs at distinguishing clinical significance in radiology reports while balancing discrimination and robustness.
G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents cs.CL · 2026-06-11 · unverdicted · none · ref 21
G-Long uses graph-enhanced triplet memory and attention-aware scoring from a T5 summarizer to achieve up to 9.8% better response quality on MSC and 40.8% better retrieval recall on LME with lower overhead.
Early-Token Confidence Predicts Reasoning Quality in Multi-Agent LLM Debate cs.CL · 2026-06-09 · unverdicted · none · ref 25
Early-token log-probabilities from LLM decoding are stronger predictors of reasoning quality than full-sequence statistics in multi-agent debate on essay scoring tasks.
The Confident Liar: Diagnosing Multi-Agent Debate with Log-Probabilities and LLM-as-Judge cs.CL · 2026-06-09 · unverdicted · none · ref 25
In two-agent debate, log-probability confidence aligns with LLM-judged reasoning quality roughly twice as strongly for the Constructor (AUROC 0.804 for critical failure detection) as for the Auditor (0.634).
A Systematic Evaluation of Positional Bias in Multi-Video Summarization with MLLMs cs.CL · 2026-06-03 · unverdicted · none · ref 29
Constructs multi-video summarization benchmark and evaluates nine MLLMs showing positional bias is domain- and model-dependent with middle positions often weaker and budgets not uniformly fixing it.
LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control cs.CL · 2026-05-20 · unverdicted · none · ref 18
LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.
Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult? cs.CL · 2026-05-14 · accept · none · ref 14 · 2 links
Fine-tuned LLM and explainable models predict vocabulary difficulty with correlations r > 0.91 and r > 0.77, showing spelling difficulty and test item construction as key influences in addition to word production difficulty.
Creating Multilingual Mental Health Dialogue Datasets: Limits of Persona-Based Localization via Nationality and Language cs.CL · 2026-06-17 · unverdicted · none · ref 69
Modifying nationality and language parameters in English-centric personas for mental health dialogues introduces clinical inconsistencies across languages and causes LLM judges to perform inaccurately on non-English depression severity assessments.
SkillChain: Closing the Loop on Skill Evolution for Image-Based E-Commerce AI Assistants cs.CL · 2026-06-11 · unverdicted · none · ref 60
SkillChain automates skill lifecycle for e-commerce image AI assistants via creator, optimizer, and refiner stages, leading to improved response quality and user engagement in production A/B tests.
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications cs.CL · 2026-05-10 · unverdicted · none · ref 43
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
Leveraging Weighted Syntactic and Semantic Context Assessment Summary (wSSAS) Towards Text Categorization Using LLMs cs.CL · 2026-04-13 · unverdicted · none · ref 41
wSSAS is a two-phase deterministic framework that uses hierarchical text organization and SNR-based feature prioritization to improve clustering integrity, categorization accuracy, and reproducibility when applying LLMs to large review datasets.
A Survey on LLM-as-a-Judge cs.CL · 2024-11-23 · unverdicted · none · ref 96
A survey on LLM-as-a-Judge that reviews reliability strategies, proposes evaluation methods, and introduces a novel benchmark for assessing such systems.
Calibrated Surprise: An Information-Theoretic Account of Creative Quality cs.CL · 2026-04-29 · unreviewed · ref 3
Lessons from the Trenches on Reproducible Evaluation of Language Models cs.CL · 2024-05-23 · unreviewed · ref 147
