ComplexConstraints and Beyond: Expert Rubrics for RLVR
Pith reviewed 2026-06-27 16:35 UTC · model grok-4.3
The pith
Expert rubrics improve both evaluation and RL training for complex LLM instruction following and agentic tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Expert-authored rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon).
What carries the argument
Expert-curated rubrics of 10-40 atomic criteria per prompt, built according to five design principles that include Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration.
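To make "atomic criteria" concrete, here is a minimal sketch of how one rubric could be represented and turned into a single score. The summary above does not give the paper's reward formulation, so the `Criterion` schema, the example criteria, and the unweighted pass-fraction score are all illustrative assumptions, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One atomic, independently checkable requirement (hypothetical schema)."""
    criterion_id: str
    text: str


def rubric_score(criteria: list[Criterion], verdicts: dict[str, bool]) -> float:
    """Fraction of criteria judged satisfied; a missing verdict counts as a failure."""
    if not criteria:
        raise ValueError("rubric must contain at least one criterion")
    passed = sum(1 for c in criteria if verdicts.get(c.criterion_id, False))
    return passed / len(criteria)


# Hypothetical rubric for a single prompt. The paper reports 10-40 criteria
# per prompt; four are shown here to keep the example short.
rubric = [
    Criterion("c1", "Response is written as a numbered list."),
    Criterion("c2", "Response contains exactly five items."),
    Criterion("c3", "No item exceeds 20 words."),
    Criterion("c4", "Response avoids first-person pronouns."),
]

# Verdicts would come from an expert or an LLM judge; these are made up.
verdicts = {"c1": True, "c2": True, "c3": False, "c4": True}
score = rubric_score(rubric, verdicts)
print(score)  # 0.75
```

A score of this kind could serve as an evaluation metric or as a scalar RL reward. Whether the paper weights criteria equally or aggregates them differently is not stated in this summary.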
If this is right
- Training on the ComplexConstraints rubric data raises instruction-following scores for models ranging from 4B to 235B parameters.
- Rubric-based RL in one enterprise setting transfers measurable gains to unrelated benchmarks such as BFCL, Tau2-Bench, and Tool-Decathlon.
- Atomic rubrics can evaluate behaviors that resist simple scripted verification.
- The five design principles provide a repeatable method for turning expert judgment into usable training and evaluation signals.
Where Pith is reading between the lines
- If the rubric approach generalizes, it could support iterative RL loops on tasks where ground-truth answers are hard to script.
- The same rubric signals might be combined with existing preference datasets to create hybrid training regimes.
- Partial automation of rubric creation could lower the human cost of scaling this method to additional domains.
Load-bearing premise
The expert-authored rubrics accurately and consistently capture the nuanced, context-dependent behaviors that matter for complex instruction following and enterprise agentic tasks, without systematic bias or high inter-rater disagreement.
What would settle it
An independent human evaluation in which models trained on the rubric data show no improvement or negative transfer on held-out instruction-following or agentic tasks compared with models trained on conventional data.
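One way such a comparison could be quantified is a paired bootstrap over per-task scores. This is a sketch only: the function name, the score lists, and the choice of a paired bootstrap are assumptions made for illustration, and the summary does not say which statistical test the paper uses.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean of a - b, fraction of resamples in which a does not beat b).

    scores_a and scores_b are per-task scores for two models on the same tasks.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample tasks with replacement, keeping each pair together.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) / n <= 0:
            not_better += 1
    return observed, not_better / n_resamples


# Hypothetical held-out scores from human evaluators (not data from the paper).
rubric_trained = [0.75, 0.5, 1.0, 0.75]
conventional = [0.5, 0.25, 0.75, 0.5]
mean_gain, frac_not_better = paired_bootstrap(rubric_trained, conventional)
print(mean_gain, frac_not_better)
```

A mean gain near zero, or a large fraction of resamples in which the rubric-trained model fails to beat the baseline, would count against the core claim.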
Original abstract
As LLM capabilities advance rapidly, the evaluation methods used to assess them increasingly lag behind. Traditional benchmarks relied on programmatic verification of narrow, surface-level constraints, but real-world instruction following and agentic tasks demand assessment of nuanced, context-dependent behaviors that resist simple scripted checks. We present a systematic analysis of expert-curated rubric-based evaluation as an alternative paradigm, drawing on empirical evidence from two domains: complex instruction following and enterprise agentic tasks. We first articulate five design principles for constructing high-quality rubrics, including Maximum Viable Atomicity, intent-aware criterion design, and iterative LLM-judge calibration. To validate these principles, we introduce ComplexConstraints, a new expert-curated instruction-following dataset in which each prompt is paired with 10-40 atomic rubric criteria. We demonstrate that these expert rubrics are not only better evaluation instruments but also highly effective training signals: training on approximately 1,000 ComplexConstraints examples yields +15.5% improvement for a 4B-parameter model and +12.2% for a 235B-parameter model on instruction following, while single-epoch RL training on a rubric-graded enterprise environment produces gains that transfer to out-of-distribution benchmarks the model was never trained on (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon). Our findings establish that expert-authored rubrics improve both the measurement and the development of frontier LLM capabilities, serving as effective evaluation and RL training signals.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces expert-curated rubrics as an alternative to programmatic benchmarks for evaluating and training LLMs on nuanced, context-dependent instruction following and enterprise agentic tasks. It articulates five design principles (including Maximum Viable Atomicity and iterative LLM-judge calibration), presents the ComplexConstraints dataset (each prompt paired with 10-40 atomic criteria), and reports that training on ~1,000 examples yields +15.5% improvement for a 4B model and +12.2% for a 235B model on instruction following, with single-epoch RL on a rubric-graded enterprise environment transferring to out-of-distribution benchmarks (+4.5% BFCL, +7.4% Tau2-Bench, +6.8% Tool-Decathlon).
Significance. If the reported gains hold under the controls, baselines, and statistical tests detailed in the full manuscript, the work provides concrete evidence that expert rubrics can function as both superior evaluation instruments and effective RL training signals. The paper supplies the missing methodological details on rubric construction, consistency validation, and reward formulation that were absent from the abstract, directly addressing the primary soundness concern. This strengthens the case for rubric-based approaches in frontier LLM development.
minor comments (2)
- [§3.2] The iterative LLM-judge calibration procedure is described at a high level; adding a short pseudocode listing or worked example of one calibration round would improve reproducibility without altering the central claims.
- [Table 2] The caption should explicitly state the number of runs and whether error bars represent standard deviation across seeds or across rubric annotators.
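As an illustration of the kind of listing the §3.2 comment asks for, one calibration round might look like the sketch below. It is not the authors' procedure, which this report describes only at a high level. The function name, the data layout, and the 0.8 agreement threshold are assumptions.

```python
from collections import defaultdict


def calibration_round(expert, judge, threshold=0.8):
    """Compare LLM-judge verdicts with expert labels, criterion by criterion.

    expert, judge: dicts mapping (response_id, criterion_id) -> bool.
    Returns (per-criterion agreement rate, criteria flagged for revision).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for key, expert_label in expert.items():
        if key not in judge:
            continue  # skip items the judge did not grade
        _, criterion_id = key
        totals[criterion_id] += 1
        hits[criterion_id] += int(judge[key] == expert_label)
    agreement = {c: hits[c] / totals[c] for c in totals}
    # Low-agreement criteria are candidates for rewording or for a revised
    # judge prompt before the next round.
    flagged = sorted(c for c, rate in agreement.items() if rate < threshold)
    return agreement, flagged


# Hypothetical labels for two responses and two criteria.
expert = {("r1", "c1"): True, ("r2", "c1"): False,
          ("r1", "c2"): True, ("r2", "c2"): True}
judge = {("r1", "c1"): True, ("r2", "c1"): False,
         ("r1", "c2"): False, ("r2", "c2"): True}
agreement, flagged = calibration_round(expert, judge)
print(agreement, flagged)  # {'c1': 1.0, 'c2': 0.5} ['c2']
```

Under this reading, "iterative" would mean repeating the round after revising the flagged criteria until none fall below the threshold.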
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work on expert-curated rubrics and the recommendation for minor revision. The report correctly notes that the manuscript provides methodological details on rubric construction and validation. No major comments were raised.
Circularity Check
No significant circularity
Full rationale
The paper reports empirical training results on the ComplexConstraints dataset and rubric-graded environments, with measured gains on instruction-following and out-of-distribution benchmarks. No equations, derivations, fitted parameters presented as predictions, or self-referential definitions appear in the abstract or claims. The reported improvements (+15.5%, +12.2%, and the transfer gains) are framed as experimental outcomes rather than quantities that reduce to inputs by construction. The provided text invokes no load-bearing self-citations and no uniqueness theorems.
Axiom & Free-Parameter Ledger