
Title resolution pending

19 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1


representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Life After Benchmark Saturation: A Case Study of CORE-Bench

cs.AI · 2026-06-23 · unverdicted · novelty 6.0

Using CORE-Bench as a case study, the paper shows that saturated benchmarks can still deliver insights on efficiency, reliability, model-scaffold differences, and human collaboration even after accuracy plateaus, and introduces improved benchmark versions plus a small randomized experiment demonstra

Measuring Behavior Portability in Large Language Models

cs.AI · 2026-06-22 · unverdicted · novelty 6.0

A new framework measures behavioral portability of LLMs across payoff-equivalent environments and reports substantial systematic transfer losses in seven economic decision problems.

Caliper: Probing Lexical Anchors versus Causal Structure in LLMs

cs.CL · 2026-06-03 · conditional · novelty 6.0

Lexical anonymization via Caliper causes consistent accuracy drops of 7-30 percentage points across LLMs on causal benchmarks, indicating reliance on lexical anchors rather than structural causal reasoning.

Rigorous Interpretation Is a Form of Evaluation

cs.CY · 2026-05-06 · unverdicted · novelty 5.0

Rigorous interpretability can function as a principled form of model evaluation if its claims are falsifiable, reproducible, and predictive.

NLG Evaluation: Past, Present, Future

cs.CL · 2026-05-22 · unverdicted · novelty 1.0

A historical review of NLG evaluation practices from 1990 to 2026, noting the rise of experimental methods and predicting increased focus on impact, qualitative, and safety evaluation.
