{"work":{"id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","openalex_id":null,"doi":null,"arxiv_id":"2203.02155","raw_key":null,"title":"Training language models to follow instructions with human feedback","authors":null,"authors_text":"Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin","year":2022,"venue":"cs.CL","abstract":"Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. 
Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.","external_url":"https://arxiv.org/abs/2203.02155","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T08:53:15.407807+00:00","pith_arxiv_id":"2203.02155","created_at":"2026-05-08T18:44:01.783269+00:00","updated_at":"2026-06-29T08:53:15.407807+00:00","title_quality_ok":true,"display_title":"Training language models to follow instructions with human feedback","render_title":"Training language models to follow instructions with human feedback"},"hub":{"state":{"work_id":"52aff42f-4fa9-4fcf-bdb3-1459b9bebf65","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":218,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2022-01-28T02:33:07+00:00","last_pith_cited_at":"2026-06-24T21:26:43+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T13:08:47.266283+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":54},{"context_role":"method","n":1},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":52},{"context_polarity":"unclear","n":3},{"context_polarity":"use_method","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Training language models to follow instructions with human feedback","claims":[{"claim_text":"Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. 
In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"this challenge and prioritize flexible safeguarding to rapidly incorporate community feedback. These semantic safeguards may need to be augmented by traditional software safety measures, including trusted testers, gradual feature rollouts, access controls, request logging, and flagging uncertain outputs for manual review. Ensuring the safety of these systems, in line with existing AI safety guidelines [137, 138], necessitates a multi-pronged approach. This includes: • Comprehensive threat modeli","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"it for efficient deployment. In International Conference on Learning Representations, 2020. URL https://arxiv.org/pdf/1908.09791.pdf. [90] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. [91] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language model","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. 
According to the experiments in [67], instruction-tuned LaMDA-PT [68] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Analogously, figure 3 (e) and (f) detail the estimated relationship between the model's grounding loss during training and its performance on the RefCOCO evaluation benchmark. Performance prediction remains an active research area, and prior works have used a sigmoid function to model the relationship between LLM performance and loss [37, 151] or compute [101]. 4 Post-training The post-training stage equips Seed1.5-VL with robust instruction-following and reasoning abilities through a combinatio","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"28 30 100 [3,35,1.5],[6,35,1.3],[9,5,0.4], [12,50,1],[15,15,0.7],[18,5,0.3], [21,45,0.6],[24,120,0.9],[27,170,1.1], [30,160,1] − 29 16 12 [16,34.25,26] − 30 16 21.63 [8,18.50,18],[16,34.25,26] − 31 16 38.99 [4,13.60,14],[8,18.50,18], [12,25.18,22],[16,34.25,26] − 32 16 70.27 [2,11,66,12],[4,13.60,14], [6,15.87,16],[8,18.50,18], [10,21.59,20],[12,25.18,22], [14,29.37,24],[16,34.25,26] − 33 16 126.67 [1,10.8,11],[2,11,66,12], [3,12.60,13],[4,13.60,14], [5,14.69,15],[6,15.87,16], [7,17.14,17],[8,18","claim_type":"background","confidence":0.3,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Training language models to follow instructions with human feedback because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (4 contexts).","role_counts":[{"n":4,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-14T20:56:22.563153+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"97a6b9d9-c85b-47c2-bc08-32e023139007","orcid":null,"display_name":"Long Ouyang"},{"id":"d6cd59b3-1e65-4692-bb28-8dbb6541d3a9","orcid":null,"display_name":"Jeff Wu"},{"id":"65f4da03-2e2f-4bdd-b039-8810c25df5eb","orcid":null,"display_name":"Xu Jiang"},{"id":"91782e17-f1bb-4137-82d8-e8f02c489978","orcid":null,"display_name":"Diogo Almeida"},{"id":"1c80bd3a-8621-44a5-ba46-5df42b8996e2","orcid":null,"display_name":"Carroll L. Wainwright"},{"id":"a5bcc5c0-97f6-43ff-893e-48d475dbdcf8","orcid":null,"display_name":"Pamela Mishkin"}]},"error":null,"updated_at":"2026-05-14T20:56:23.656913+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T05:56:48.913902+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":20},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":19},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":19},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":19},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":18},{"title":"PaLM: Scaling Language Modeling with 
Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":16},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":15},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":15},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":14},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":13},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":13},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":13},{"title":"Direct Preference Optimization: Your Language Model is Secretly a Reward Model","work_id":"62105b61-411f-46e5-970a-bd322c6adbbc","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":12},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":12},{"title":"WebGPT: Browser-assisted question-answering with human feedback","work_id":"e25ef3e1-4848-4cb9-bf28-67a420591165","shared_citers":12},{"title":"LaMDA: Language Models for Dialog Applications","work_id":"1b66d0a5-f6ae-4332-8025-c662dc64b238","shared_citers":11},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":11},{"title":"Scaling Language Models: 
Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":11},{"title":"Large Language Models are Zero-Shot Reasoners","work_id":"d9b7eb1a-7165-46ff-9f06-d2f0b9d6f95d","shared_citers":10},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":10},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":10}],"time_series":[{"n":10,"year":2022},{"n":12,"year":2023},{"n":3,"year":2024},{"n":6,"year":2025},{"n":66,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:06:48.556817+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T05:56:51.439033+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Training language models to follow instructions with human feedback","claims":[{"claim_text":"Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. 
Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"this challenge and prioritize flexible safeguarding to rapidly incorporate community feedback. These semantic safeguards may need to be augmented by traditional software safety measures, including trusted testers, gradual feature rollouts, access controls, request logging, and flagging uncertain outputs for manual review. Ensuring the safety of these systems, in line with existing AI safety guidelines [137, 138], necessitates a multi-pronged approach. This includes: • Comprehensive threat modeli","claim_type":"background","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"it for efficient deployment. In International Conference on Learning Representations, 2020. URL https://arxiv.org/pdf/1908.09791.pdf. [90] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, et al. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022. [91] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language model","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tions (called instruction tuning), LLMs are shown to perform well on unseen tasks that are also described in the form of instructions [28, 66, 67]. With instruction tuning, LLMs are enabled to follow the task instructions for new tasks without using explicit examples, thus having an improved generalization ability. 
According to the experiments in [67], instruction-tuned LaMDA-PT [68] started to significantly outperform the untuned one on unseen tasks when the model size reached 68B, but not for 8","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Analogously, figure 3 (e) and (f) detail the estimated relationship between the model's grounding loss during training and its performance on the RefCOCO evaluation benchmark. Performance prediction remains an active research area, and prior works have used a sigmoid function to model the relationship between LLM performance and loss [37, 151] or compute [101]. 4 Post-training The post-training stage equips Seed1.5-VL with robust instruction-following and reasoning abilities through a combinatio","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"28 30 100 [3,35,1.5],[6,35,1.3],[9,5,0.4], [12,50,1],[15,15,0.7],[18,5,0.3], [21,45,0.6],[24,120,0.9],[27,170,1.1], [30,160,1] − 29 16 12 [16,34.25,26] − 30 16 21.63 [8,18.50,18],[16,34.25,26] − 31 16 38.99 [4,13.60,14],[8,18.50,18], [12,25.18,22],[16,34.25,26] − 32 16 70.27 [2,11,66,12],[4,13.60,14], [6,15.87,16],[8,18.50,18], [10,21.59,20],[12,25.18,22], [14,29.37,24],[16,34.25,26] − 33 16 126.67 [1,10.8,11],[2,11,66,12], [3,12.60,13],[4,13.60,14], [5,14.69,15],[6,15.87,16], [7,17.14,17],[8,18","claim_type":"background","confidence":0.3,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Training language models to follow instructions with human feedback because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (4 contexts).","role_counts":[{"n":4,"context_role":"background"},{"n":1,"context_role":"method"}]},"error":null,"updated_at":"2026-05-14T20:56:23.661503+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Training language models to follow instructions with human feedback","claims":[{"claim_text":"Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Training language models to follow instructions with human feedback because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:06:52.347059+00:00"}},"summary":{"title":"Training language models to follow instructions with human feedback","claims":[{"claim_text":"Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. 
Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we u","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Training language models to follow instructions with human feedback because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":20},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":19},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":19},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":19},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":18},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":16},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":15},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":15},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":15},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":14},{"title":"GPT-4 Technical 
Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":13},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":13},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":13},{"title":"Direct Preference Optimization: Your Language Model is Secretly a Reward Model","work_id":"62105b61-411f-46e5-970a-bd322c6adbbc","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":12},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":12},{"title":"WebGPT: Browser-assisted question-answering with human feedback","work_id":"e25ef3e1-4848-4cb9-bf28-67a420591165","shared_citers":12},{"title":"LaMDA: Language Models for Dialog Applications","work_id":"1b66d0a5-f6ae-4332-8025-c662dc64b238","shared_citers":11},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":11},{"title":"Scaling Language Models: Methods, Analysis & Insights from Training Gopher","work_id":"47ce8be9-e500-407d-af41-ac2d132215eb","shared_citers":11},{"title":"Large Language Models are Zero-Shot Reasoners","work_id":"d9b7eb1a-7165-46ff-9f06-d2f0b9d6f95d","shared_citers":10},{"title":"ReAct: Synergizing Reasoning and Acting in Language Models","work_id":"407a2351-25f1-497d-b611-f77d0292a8e6","shared_citers":10},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":10}],"time_series":[{"n":10,"year":2022},{"n":12,"year":2023},{"n":3,"year":2024},{"n":6,"year":2025},{"n":66,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"1c80bd3a-8621-44a5-ba46-5df42b8996e2","orcid":null,"display_name":"Carroll L. 
Wainwright","source":"manual","import_confidence":0.72},{"id":"91782e17-f1bb-4137-82d8-e8f02c489978","orcid":null,"display_name":"Diogo Almeida","source":"manual","import_confidence":0.72},{"id":"d6cd59b3-1e65-4692-bb28-8dbb6541d3a9","orcid":null,"display_name":"Jeff Wu","source":"manual","import_confidence":0.72},{"id":"97a6b9d9-c85b-47c2-bc08-32e023139007","orcid":null,"display_name":"Long Ouyang","source":"manual","import_confidence":0.72},{"id":"a5bcc5c0-97f6-43ff-893e-48d475dbdcf8","orcid":null,"display_name":"Pamela Mishkin","source":"manual","import_confidence":0.72},{"id":"65f4da03-2e2f-4bdd-b039-8810c25df5eb","orcid":null,"display_name":"Xu Jiang","source":"manual","import_confidence":0.72}]}}