{"work":{"id":"8ca58a10-da41-4f70-baae-7e449512e345","openalex_id":null,"doi":null,"arxiv_id":"2207.05221","raw_key":null,"title":"Language Models (Mostly) Know What They Know","authors":null,"authors_text":"Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez","year":2022,"venue":"cs.CL","abstract":"We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability \"P(True)\" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at self-evaluation further improves when we allow models to consider many of their own samples before predicting the validity of one specific possibility. Next, we investigate whether models can be trained to predict \"P(IK)\", the probability that \"I know\" the answer to a question, without reference to any particular proposed answer. Models perform well at predicting P(IK) and partially generalize across tasks, though they struggle with calibration of P(IK) on new tasks. The predicted P(IK) probabilities also increase appropriately in the presence of relevant source materials in the context, and in the presence of hints towards the solution of mathematical word problems. We hope these observations lay the groundwork for training more honest models, and for investigating how honesty generalizes to cases where models are trained on objectives other than the imitation of human writing.","external_url":"https://arxiv.org/abs/2207.05221","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:35:27.898428+00:00","pith_arxiv_id":"2207.05221","created_at":"2026-05-08T20:09:07.252280+00:00","updated_at":"2026-05-25T07:35:27.898428+00:00","title_quality_ok":true,"display_title":"Language Models (Mostly) Know What They Know","render_title":"Language Models (Mostly) Know What They Know"},"hub":{"state":{"work_id":"8ca58a10-da41-4f70-baae-7e449512e345","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":197,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2022-06-15T17:32:01+00:00","last_pith_cited_at":"2026-05-22T06:34:17+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-29T20:00:17.937673+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":33},{"context_role":"method","n":3},{"context_role":"baseline","n":2}],"polarity_counts":[{"context_polarity":"background","n":28},{"context_polarity":"support","n":3},{"context_polarity":"use_method","n":3},{"context_polarity":"baseline","n":2},{"context_polarity":"unclear","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Language Models (Mostly) Know What They Know","claims":[{"claim_text":"We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability \"P(True)\" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at sel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models (Mostly) Know What They Know because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:54:02.471593+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"155c969c-3372-4b52-b881-15f051250399","orcid":null,"display_name":"Saurav Kadavath"},{"id":"03d96420-f585-49c3-91b7-03bd250284f5","orcid":null,"display_name":"Tom Conerly"},{"id":"fa166f46-a486-47b9-b4c7-b3c0d3548ac0","orcid":null,"display_name":"Amanda Askell"},{"id":"5a8540f5-b157-4b1c-a575-3af16bd1b586","orcid":null,"display_name":"Tom Henighan"},{"id":"357b28c4-8c4b-4321-a7a5-e7b3ee5d01cd","orcid":null,"display_name":"Dawn Drain"},{"id":"19b22360-f7f4-4cb9-b4fd-ab815d17f54a","orcid":null,"display_name":"Ethan Perez"}]},"error":null,"updated_at":"2026-05-14T01:54:02.466699+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T01:44:13.768307+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":27},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":18},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":17},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":14},{"title":"Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs","work_id":"7c5c5f6d-fd68-4f65-ac05-5d90308e8bc2","shared_citers":13},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":13},{"title":"Measuring Faithfulness in Chain-of-Thought Reasoning","work_id":"86ca07b8-4628-4f51-8938-a82683386ae4","shared_citers":13},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation","work_id":"d66d411b-c2c1-4cd6-8a6b-9fac872fa257","shared_citers":12},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":11},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":9},{"title":"Teaching models to express their uncertainty in words.arXiv preprint arXiv:2205.14334","work_id":"5dfa654a-4d1b-41c5-a6e0-769ffe4b7741","shared_citers":9},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Towards Understanding Sycophancy in Language Models","work_id":"aeefec9a-6ad5-4743-92b9-de6983895e21","shared_citers":8},{"title":"Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, D’Autume Cyprien De Masson, Tim Scholtes, Manzil Zaheer, Susannah Young, et al","work_id":"332e0154-d69a-4bcc-9946-371ded6047ce","shared_citers":7},{"title":"Discovering latent knowledge in language models without supervision","work_id":"a12b68bd-76a4-4837-ac7c-3ed5a60010d3","shared_citers":7},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":7},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":7},{"title":"Representation Engineering: A Top-Down Approach to AI Transparency","work_id":"45b326e2-e962-41a5-a542-2559e103a19b","shared_citers":7},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":7},{"title":"Show Your Work: Scratchpads for Intermediate Computation with Language Models","work_id":"a05b1e60-8e76-4f26-9bea-28927a5f8620","shared_citers":7},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":6}],"time_series":[{"n":2,"year":2022},{"n":3,"year":2023},{"n":2,"year":2025},{"n":106,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T01:44:11.079140+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T01:44:06.110430+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Language Models (Mostly) Know What They Know","claims":[{"claim_text":"We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability \"P(True)\" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at sel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models (Mostly) Know What They Know because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:44:16.391933+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Language Models (Mostly) Know What They Know","claims":[{"claim_text":"We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability \"P(True)\" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at sel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models (Mostly) Know What They Know because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:44:16.394875+00:00"}},"summary":{"title":"Language Models (Mostly) Know What They Know","claims":[{"claim_text":"We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions when they are provided in the right format. Thus we can approach self-evaluation on open-ended sampling tasks by asking models to first propose answers, and then to evaluate the probability \"P(True)\" that their answers are correct. We find encouraging performance, calibration, and scaling for P(True) on a diverse array of tasks. Performance at sel","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Language Models (Mostly) Know What They Know because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":27},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":18},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":17},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":14},{"title":"Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs","work_id":"7c5c5f6d-fd68-4f65-ac05-5d90308e8bc2","shared_citers":13},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":13},{"title":"Measuring Faithfulness in Chain-of-Thought Reasoning","work_id":"86ca07b8-4628-4f51-8938-a82683386ae4","shared_citers":13},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation","work_id":"d66d411b-c2c1-4cd6-8a6b-9fac872fa257","shared_citers":12},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":11},{"title":"Gemma 3 Technical Report","work_id":"f93e08bf-9e96-409b-8ac6-b8385fd17fd7","shared_citers":9},{"title":"Teaching models to express their uncertainty in words.arXiv preprint arXiv:2205.14334","work_id":"5dfa654a-4d1b-41c5-a6e0-769ffe4b7741","shared_citers":9},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":8},{"title":"Towards Understanding Sycophancy in Language Models","work_id":"aeefec9a-6ad5-4743-92b9-de6983895e21","shared_citers":8},{"title":"Adam Liska, Tomas Kocisky, Elena Gribovskaya, Tayfun Terzi, Eren Sezener, Devang Agrawal, D’Autume Cyprien De Masson, Tim Scholtes, Manzil Zaheer, Susannah Young, et al","work_id":"332e0154-d69a-4bcc-9946-371ded6047ce","shared_citers":7},{"title":"Discovering latent knowledge in language models without supervision","work_id":"a12b68bd-76a4-4837-ac7c-3ed5a60010d3","shared_citers":7},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":7},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":7},{"title":"Representation Engineering: A Top-Down Approach to AI Transparency","work_id":"45b326e2-e962-41a5-a542-2559e103a19b","shared_citers":7},{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","shared_citers":7},{"title":"Show Your Work: Scratchpads for Intermediate Computation with Language Models","work_id":"a05b1e60-8e76-4f26-9bea-28927a5f8620","shared_citers":7},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":6}],"time_series":[{"n":2,"year":2022},{"n":3,"year":2023},{"n":2,"year":2025},{"n":106,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"fa166f46-a486-47b9-b4c7-b3c0d3548ac0","orcid":null,"display_name":"Amanda Askell","source":"manual","import_confidence":0.72},{"id":"357b28c4-8c4b-4321-a7a5-e7b3ee5d01cd","orcid":null,"display_name":"Dawn Drain","source":"manual","import_confidence":0.72},{"id":"19b22360-f7f4-4cb9-b4fd-ab815d17f54a","orcid":null,"display_name":"Ethan Perez","source":"manual","import_confidence":0.72},{"id":"155c969c-3372-4b52-b881-15f051250399","orcid":null,"display_name":"Saurav Kadavath","source":"manual","import_confidence":0.72},{"id":"03d96420-f585-49c3-91b7-03bd250284f5","orcid":null,"display_name":"Tom Conerly","source":"manual","import_confidence":0.72},{"id":"5a8540f5-b157-4b1c-a575-3af16bd1b586","orcid":null,"display_name":"Tom Henighan","source":"manual","import_confidence":0.72}]}}