Title resolution pending

Ribeiro, M · 2020 · DOI 10.18653/v1/2020.acl-main.442

19 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

An Empirical Analysis of Factual Errors in Human-Written Text and its Application

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

An empirical study distills a taxonomy of human factual errors from newspaper corrections and shows LLMs achieve only 52% F1 on detection.

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

cs.CL · 2026-06-10 · accept · novelty 7.0

Layer-isolated evaluation decomposes LLM agents into per-layer deterministic no-LLM test slices whose locked baselines localize regressions that aggregate pass rates mask.

Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa

cs.CL · 2026-05-05 · conditional · novelty 7.0

LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.

TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation

cs.CV · 2026-04-29 · accept · novelty 7.0

TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.

Sell More, Play Less: Benchmarking LLM Realistic Selling Skill

cs.CL · 2026-04-08 · conditional · novelty 7.0

SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Life After Benchmark Saturation: A Case Study of CORE-Bench

cs.AI · 2026-06-23 · unverdicted · novelty 6.0

Using CORE-Bench as a case study, the paper shows that saturated benchmarks can still deliver insights on efficiency, reliability, model-scaffold differences, and human collaboration even after accuracy plateaus, and introduces improved benchmark versions plus a small randomized experiment demonstra

Measuring Behavior Portability in Large Language Models

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

A new framework measures behavioral portability of LLMs across payoff-equivalent environments and reports substantial systematic transfer losses in seven economic decision problems.

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

cs.CL · 2026-06-03 · conditional · novelty 6.0

Lexical anonymization via Caliper causes consistent accuracy drops of 7-30 percentage points across LLMs on causal benchmarks, indicating reliance on lexical anchors rather than structural causal reasoning.

Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

cs.CL · 2026-05-22 · unverdicted · novelty 6.0 · 2 refs

Metadata predictability alone does not prove evidence dependence; a combined audit using MPDS, evidence-intervention sensitivity ΔEvi, and reader calibration is needed for weak-label benchmarks.

LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling

cs.LG · 2026-05-14 · conditional · novelty 6.0

LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.

Dual Alignment Between Language Model Layers and Human Sentence Processing

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.

Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

Social identity markers in medical questions degrade LLM accuracy and uncertainty calibration, producing a calibration crisis that is non-additive for intersectional cases.

Rigorous Interpretation Is a Form of Evaluation

cs.CY · 2026-05-06 · unverdicted · novelty 5.0

Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.

Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification

math.NT · 2025-04-28 · conditional · novelty 5.0

LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; LightGBM classifier empirically supports Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.

Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification

cs.CL · 2026-06-05 · unverdicted · novelty 4.0

On a controlled Turkish dataset of 147 examples, few-shot prompting lets some LLMs match or beat a supervised BERT baseline for LVC detection, though results are highly sensitive to prompt design.

Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems

cs.SE · 2026-06-01 · unverdicted · novelty 4.0

The paper introduces a red-train-green lifecycle and governance metric stack that adapts acceptance testing to LLM systems for business use.

NLG Evaluation: Past, Present, Future

cs.CL · 2026-05-22 · unverdicted · novelty 1.0

A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.

citing papers explorer

Showing 19 of 19 citing papers.

Measuring Massive Multitask Language Understanding · cs.CY · 2020-09-07 · accept · none · ref 233
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
An Empirical Analysis of Factual Errors in Human-Written Text and its Application · cs.CL · 2026-06-26 · unverdicted · none · ref 22
An empirical study distills a taxonomy of human factual errors from newspaper corrections and shows LLMs achieve only 52% F1 on detection.
Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness · cs.CL · 2026-06-10 · accept · none · ref 8
Layer-isolated evaluation decomposes LLM agents into per-layer deterministic no-LLM test slices whose locked baselines localize regressions that aggregate pass rates mask.
Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa · cs.CL · 2026-05-05 · conditional · none · ref 66
LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
TRIP-Evaluate: An Open Multimodal Benchmark for Evaluating Large Models in Transportation · cs.CV · 2026-04-29 · accept · none · ref 4
TRIP-Evaluate is a new open multimodal benchmark with 837 text, image, and point-cloud items organized by a role-task-knowledge taxonomy to evaluate large models on transportation workflows.
Sell More, Play Less: Benchmarking LLM Realistic Selling Skill · cs.CL · 2026-04-08 · conditional · none · ref 36
SalesLLM provides an automatic evaluation framework for LLM sales dialogues that correlates 0.98 with human experts and shows top models approaching human performance while weaker ones lag.
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models · cs.AI · 2024-06-14 · conditional · none · ref 265
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
Life After Benchmark Saturation: A Case Study of CORE-Bench · cs.AI · 2026-06-23 · unverdicted · none · ref 50
Using CORE-Bench as a case study, the paper shows that saturated benchmarks can still deliver insights on efficiency, reliability, model-scaffold differences, and human collaboration even after accuracy plateaus, and introduces improved benchmark versions plus a small randomized experiment demonstra
Measuring Behavior Portability in Large Language Models · cs.AI · 2026-06-22 · unverdicted · none · ref 21
A new framework measures behavioral portability of LLMs across payoff-equivalent environments and reports substantial systematic transfer losses in seven economic decision problems.
Caliper: Probing Lexical Anchors versus Causal Structure in LLMs · cs.CL · 2026-06-03 · conditional · none · ref 26
Lexical anonymization via Caliper causes consistent accuracy drops of 7-30 percentage points across LLMs on causal benchmarks, indicating reliance on lexical anchors rather than structural causal reasoning.
Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks · cs.CL · 2026-05-22 · unverdicted · none · ref 16 · 2 links
Metadata predictability alone does not prove evidence dependence; a combined audit using MPDS, evidence-intervention sensitivity ΔEvi, and reader calibration is needed for weak-label benchmarks.
LPDS: Evaluating LLM Robustness Through Logic-Preserving Difficulty Scaling · cs.LG · 2026-05-14 · conditional · none · ref 42
LPDS quantifies difficulty of logic-preserving problem variations and searches for the hardest ones, producing up to 5x larger performance drops than random sampling and better robustness gains from fine-tuning on difficult examples.
Dual Alignment Between Language Model Layers and Human Sentence Processing · cs.CL · 2026-04-20 · unverdicted · none · ref 38
Later LLM layers align better with human cognitive effort in syntactic ambiguity than early layers do, indicating dual processing modes and complementary benefits from multi-layer probability updates.
Calibrated? Not for Everyone: How Sexual Orientation and Religious Markers Distort LLM Accuracy and Confidence in Medical QA · cs.CL · 2026-04-19 · unverdicted · none · ref 4
Social identity markers in medical questions degrade LLM accuracy and uncertainty calibration, producing a calibration crisis that is non-additive for intersectional cases.
Rigorous Interpretation Is a Form of Evaluation · cs.CY · 2026-05-06 · unverdicted · none · ref 111
Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.
Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification · math.NT · 2025-04-28 · conditional · none · ref 45
LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; LightGBM classifier empirically supports Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.
Supervision versus Demonstration-Based In-Context Learning for Multiword Expression Classification · cs.CL · 2026-06-05 · unverdicted · none · ref 204
On a controlled Turkish dataset of 147 examples, few-shot prompting lets some LLMs match or beat a supervised BERT baseline for LVC detection, though results are highly sensitive to prompt design.
Acceptance-Test-Driven Evaluation Protocols for Business-Centric LLM Systems · cs.SE · 2026-06-01 · unverdicted · none · ref 4
The paper introduces a red-train-green lifecycle and governance metric stack that adapts acceptance testing to LLM systems for business use.
NLG Evaluation: Past, Present, Future · cs.CL · 2026-05-22 · unverdicted · none · ref 42
A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.
