{"paper":{"title":"Revisiting the Reliability of Language Models in Instruction-Following","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Language models lose up to 61.8 percent instruction-following accuracy on prompts with subtle phrasing changes that preserve intent.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.SE","authors_text":"Chao Zhang, Han Qiu, Jianshuo Dong, Tao Wei, Yan Liu, Yutong Zhang, Zhenyu Zhong","submitted_at":"2025-12-15T02:57:55Z","abstract_excerpt":"Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The automated data-augmentation pipeline produces cousin prompts that preserve the original user intent without changing task difficulty or introducing new ambiguities.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"LLMs exhibit up to 61.8% performance drops on nuanced rephrasings of instruction-following tasks, revealing insufficient nuance-oriented reliability across 46 tested models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Language models lose up to 61.8 percent instruction-following accuracy on prompts with subtle phrasing changes that preserve intent.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4abc4ab846f600c229f7e45277e4e0cbd75a50b3ecb1246b0f4b18984ab2efa9"},"source":{"id":"2512.14754","kind":"arxiv","version":3},"verdict":{"id":"45fde6dd-8b81-4a6e-a272-618791a01207","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T22:49:18.604668Z","strongest_claim":"Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications.","one_line_summary":"LLMs exhibit up to 61.8% performance drops on nuanced rephrasings of instruction-following tasks, revealing insufficient nuance-oriented reliability across 46 tested models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The automated data-augmentation pipeline produces cousin prompts that preserve the original user intent without changing task difficulty or introducing new ambiguities.","pith_extraction_headline":"Language models lose up to 61.8 percent instruction-following accuracy on prompts with subtle phrasing changes that preserve intent."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2512.14754/integrity.json","findings":[],"available":true,"detectors_run":[],"snapshot_sha256":"c28c3603d3b5d939e8dc4c7e95fa8dfce3d595e45f758748cecf8e644a296938"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}