The B i GG en bench: A principled benchmark for fine-grained evaluation of language models with language models

Seungone Kim, Juyoung Suk, Ji Yong Cho, Shayne Longpre, Chaeeun Kim, Dongkeun Yoon, Guijin Son, Yejin Cho, Sheikh Shafayat, Jinheon Baek, Sue Hyun Park, Hyeonbin Hwang, Jinkyung Jo, Hyowon Cho, Haebin Shin, Seongyun Lee, Hanseok Oh, Noah Le · 2025 · DOI 10.18653/v1/2025.naacl-long.303

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open at publisher browse 2 citing papers

representative citing papers

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

cs.CL · 2026-05-02 · conditional · novelty 7.0

Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.

Unlocking Prompt Infilling Capability for Diffusion Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

citing papers explorer

Showing 2 of 2 citing papers.

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese cs.CL · 2026-05-02 · conditional · none · ref 7
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
Unlocking Prompt Infilling Capability for Diffusion Language Models cs.CL · 2026-04-04 · unverdicted · none · ref 12
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

The B i GG en bench: A principled benchmark for fine-grained evaluation of language models with language models

fields

years

verdicts

representative citing papers

citing papers explorer