InFoBench: Evaluating Instruction Following Ability in Large Language Models

Qin, Yiwei, Song, Kaiqiang, Hu, Yebowen, Yao, Wenlin, Cho, Sangwoo, Wang, Xiaoyang · 2024 · DOI 10.18653/v1/2024.findings-acl.772

7 Pith papers cite this work. Polarity classification is still indexing.



citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese

cs.CL · 2026-05-02 · conditional · novelty 7.0

Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models

cs.CL · 2026-04-28 · accept · novelty 7.0

SOB benchmark shows LLMs achieve near-perfect schema compliance but value accuracy of only 83% on text, 67% on images, and 24% on audio.

RECAP: Regression Evaluation for Continual Adaptation of Prompts

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

RECAP benchmark finds that six prompt optimization methods show no significant performance gains under proactive continual adaptation to evolving constraints across four LLMs.

Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models

cs.AI · 2026-06-02 · unverdicted · novelty 6.0

CRGC models instructions as constraint graphs, identifies bridge constraints, and cuts violations by 39% on three datasets while preserving reasoning performance.

Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails

cs.CL · 2026-06-22 · unverdicted · novelty 5.0

KD outperforms SFT for LLM post-training in low-data regimes but the advantage fades with abundant data unless the teacher is stronger; a two-stage strategy aids domain-specific low-resource cases.

Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots

cs.SE · 2026-06-03 · unverdicted · novelty 5.0

COPAL reveals a 33.1% average error rate on composed-policy queries across nine LLM chatbots, showing that existing single-policy benchmarks miss common failures.

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

cs.LG · 2026-05-18

citing papers explorer

Showing 7 of 7 citing papers.

Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese · cs.CL · 2026-05-02 · conditional · none · ref 15
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models · cs.CL · 2026-04-28 · accept · none · ref 15
RECAP: Regression Evaluation for Continual Adaptation of Prompts · cs.LG · 2026-06-04 · unverdicted · none · ref 30
Bridging Auxiliary Constraints to Resolve Instruction Following in Large Reasoning Models · cs.AI · 2026-06-02 · unverdicted · none · ref 2
Understanding Knowledge Distillation in Post-Training: When It Helps and When It Fails · cs.CL · 2026-06-22 · unverdicted · none · ref 13
Beyond Single-Policy: Evaluating Composed Organization-Specific Policy Alignment in LLM Chatbots · cs.SE · 2026-06-03 · unverdicted · none · ref 47
AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning · cs.LG · 2026-05-18 · unreviewed · ref 16
