Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, Yulia Tsvetkov, Noah A. Smith, Yejin Choi, and Hannaneh Hajishirzi · 2024 · DOI 10.52202/079017-1573

2 Pith papers cite this work. Polarity classification is still indexing.



representative citing papers

PhantomBench: Benchmarking the Non-existential Threat of Language Models

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

PhantomBench is a new benchmark of 60K+ non-existent terms showing language models hallucinate at rates up to 86.7 percent even when inputs assume the concepts exist.

Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots

cs.SE · 2026-06-03 · unverdicted · novelty 5.0

COPAL reveals a 33.1% average error rate on composed-policy queries across nine LLM chatbots, showing that existing single-policy benchmarks miss common failures.

citing papers explorer

Showing 2 of 2 citing papers.

PhantomBench: Benchmarking the Non-existential Threat of Language Models
cs.CL · 2026-06-09 · unverdicted · none · ref 38
Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots
cs.SE · 2026-06-03 · unverdicted · none · ref 32
