{"work":{"id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","openalex_id":null,"doi":null,"arxiv_id":"2210.09261","raw_key":null,"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","authors":null,"authors_text":"Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung","year":2022,"venue":"cs.CL","abstract":"BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models?\n  In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we call BIG-Bench Hard (BBH). These are the tasks for which prior language model evaluations did not outperform the average human-rater. We find that applying chain-of-thought (CoT) prompting to BBH tasks enables PaLM to surpass the average human-rater performance on 10 of the 23 tasks, and Codex (code-davinci-002) to surpass the average human-rater performance on 17 of the 23 tasks. Since many tasks in BBH require multi-step reasoning, few-shot prompting without CoT, as done in the BIG-Bench evaluations (Srivastava et al., 2022), substantially underestimates the best performance and capabilities of language models, which is better captured via CoT prompting. 
As further analysis, we explore the interaction between CoT and model scale on BBH, finding that CoT enables emergent task performance on several BBH tasks with otherwise flat scaling curves.","external_url":"https://arxiv.org/abs/2210.09261","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T00:52:55.772841+00:00","pith_arxiv_id":"2210.09261","created_at":"2026-05-09T05:55:30.073565+00:00","updated_at":"2026-06-29T00:52:55.772841+00:00","title_quality_ok":true,"display_title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","render_title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them"},"hub":{"state":{"work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":98,"external_cited_by_count":null,"distinct_field_count":7,"first_pith_cited_at":"2022-06-15T17:32:01+00:00","last_pith_cited_at":"2026-06-26T01:12:02+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T18:39:07.555074+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"dataset","n":12},{"context_role":"background","n":9},{"context_role":"method","n":1}],"polarity_counts":[{"context_polarity":"use_dataset","n":12},{"context_polarity":"background","n":10}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T13:31:04.175249+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":36},{"title":"Evaluating Large Language Models Trained on 
Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":25},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":17},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":17},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":16},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":14},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":12},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":12},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":12},{"title":"CMMLU: Measuring Massive Multitask Language Understanding in Chinese","work_id":"30c9ec62-1af0-4f30-94b4-d1ef163eff71","shared_citers":11},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":9},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":9},{"title":"Beyond the 
Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":8},{"title":"C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models","work_id":"61dc95c3-d071-4ec6-9d31-ea0610192fde","shared_citers":7},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":7},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":7},{"title":"MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark","work_id":"3c028052-035a-4c22-b80e-3046edb44adc","shared_citers":7},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":7}],"time_series":[{"n":3,"year":2022},{"n":3,"year":2023},{"n":9,"year":2024},{"n":8,"year":2025},{"n":28,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T13:31:04.211165+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T13:31:08.420049+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","claims":[{"claim_text":"BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. 
Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models?\n  In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T13:31:08.423472+00:00"}},"summary":{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","claims":[{"claim_text":"BIG-Bench (Srivastava et al., 2022) is a diverse evaluation suite that focuses on tasks believed to be beyond the capabilities of current language models. Language models have already made good progress on this benchmark, with the best model in the BIG-Bench paper outperforming average reported human-rater results on 65% of the BIG-Bench tasks via few-shot prompting. 
But on what tasks do language models fall short of average human-rater performance, and are those tasks actually unsolvable by current language models?\n  In this work, we focus on a suite of 23 challenging BIG-Bench tasks which we","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":36},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":25},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":17},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":17},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":17},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":16},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":15},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":14},{"title":"Instruction-Following Evaluation for Large Language Models","work_id":"3aa06177-125a-4f5a-8f4a-8070c5986c26","shared_citers":12},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":12},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":12},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":12},{"title":"CMMLU: Measuring Massive Multitask Language Understanding in Chinese","work_id":"30c9ec62-1af0-4f30-94b4-d1ef163eff71","shared_citers":11},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":9},{"title":"PaLM: Scaling Language Modeling with Pathways","work_id":"a94f3ef7-2c49-4445-93fe-6ec16aafd966","shared_citers":9},{"title":"Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models","work_id":"bb63abb3-0d50-4362-b97c-b5e725b03b39","shared_citers":8},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":8},{"title":"C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models","work_id":"61dc95c3-d071-4ec6-9d31-ea0610192fde","shared_citers":7},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":7},{"title":"Mistral 
7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":7},{"title":"MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark","work_id":"3c028052-035a-4c22-b80e-3046edb44adc","shared_citers":7},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":7}],"time_series":[{"n":3,"year":2022},{"n":3,"year":2023},{"n":9,"year":2024},{"n":8,"year":2025},{"n":28,"year":2026}],"dependency_candidates":[]},"authors":[]}}