Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
InFoBench: Evaluating Instruction Following Ability in Large Language Models
7 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (7)
roles: background (1)
polarities: background (1)
representative citing papers
SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.
RECAP benchmark finds that six prompt optimization methods show no significant performance gains under proactive continual adaptation to evolving constraints across four LLMs.
CRGC models instructions as constraint graphs, identifies bridge constraints, and cuts violations by 39% on three datasets while preserving reasoning performance.
KD outperforms SFT for LLM post-training in low-data regimes but the advantage fades with abundant data unless the teacher is stronger; a two-stage strategy aids domain-specific low-resource cases.
COPAL reveals a 33.1% average error rate on composed-policy queries across nine LLM chatbots, showing that existing single-policy benchmarks miss common failures.
citing papers explorer
- Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots