{"paper":{"title":"Efficiently Aligning Language Models with Online Natural Language Feedback","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Natural language feedback builds proxy rewards that align language models with up to 50 times fewer expert samples.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Christine Ye, Joe Benton","submitted_at":"2026-05-05T23:25:00Z","abstract_excerpt":"Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. But, broadly beneficial deployments of AI may require us to train models with strong capabilities in \"fuzzy\", hard-to-supervise domains. In this paper, we develop methods to align language models in fuzzy domains where human experts are still able to provide high-quality supervision signal, but only for a small number of model outputs, using online natural language feedback. Specifically, we train models by iteratively optimizing against proxy reward signals, stop"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"For Qwen3-8B, ICL methods recover up to 35% of performance with 50x fewer expert samples, while fine-tuning methods recover 80% with up to 20x fewer samples and 100% with 3x fewer samples. For Haiku 4.5, ICL methods recover up to 35% of performance with 30x fewer samples, and fine-tuning methods recover 100% with 10x fewer samples.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That proxy reward models constructed via ICL or fine-tuning on limited natural language feedback will continue to provide useful training signals without introducing systematic biases or being gamed in ways that degrade actual alignment quality.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Online natural language feedback enables recovery of 35-100% of alignment performance in fuzzy domains using 3-50x fewer expert samples via iterative proxy reward updates with ICL and fine-tuning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Natural language feedback builds proxy rewards that align language models with up to 50 times fewer expert samples.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a0becb54b0fa720d098de5e6d711110db540d74b36e7247f0ba5f5aca8bdbb34"},"source":{"id":"2605.04356","kind":"arxiv","version":2},"verdict":{"id":"9fa2033b-4364-497b-80c3-269af13e0c77","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-08T16:51:38.070310Z","strongest_claim":"For Qwen3-8B, ICL methods recover up to 35% of performance with 50x fewer expert samples, while fine-tuning methods recover 80% with up to 20x fewer samples and 100% with 3x fewer samples. For Haiku 4.5, ICL methods recover up to 35% of performance with 30x fewer samples, and fine-tuning methods recover 100% with 10x fewer samples.","one_line_summary":"Online natural language feedback enables recovery of 35-100% of alignment performance in fuzzy domains using 3-50x fewer expert samples via iterative proxy reward updates with ICL and fine-tuning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That proxy reward models constructed via ICL or fine-tuning on limited natural language feedback will continue to provide useful training signals without introducing systematic biases or being gamed in ways that degrade actual alignment quality.","pith_extraction_headline":"Natural language feedback builds proxy rewards that align language models with up to 50 times fewer expert samples."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.04356/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"ai_meta_artifact","ran_at":"2026-05-20T12:33:52.722957Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-19T23:31:20.423845Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_compliance","ran_at":"2026-05-19T14:31:02.504531Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"3ae340085101b9d80a79edc2301b4c3a859ab6e878c8de8a81f424d334c34085"},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}