Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
The B i GG en bench: A principled benchmark for fine-grained evaluation of language models with language models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2years
2026 2representative citing papers
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
citing papers explorer
-
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.