{"paper":{"title":"From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A progressive fine-tuning method lets language models internalize chain-of-thought steps so they can solve harder reasoning tasks without producing explicit intermediate outputs.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Stuart Shieber, Yejin Choi, Yuntian Deng","submitted_at":"2024-05-23T17:54:14Z","abstract_excerpt":"When leveraging language models for reasoning tasks, generating explicit chain-of-thought (CoT) steps often proves essential for achieving high accuracy in final outputs. In this paper, we investigate if models can be taught to internalize these CoT steps. To this end, we propose a simple yet effective method for internalizing CoT steps: starting with a model trained for explicit CoT reasoning, we gradually remove the intermediate steps and finetune the model. This process allows the model to internalize the intermediate reasoning steps, thus simplifying the reasoning process while maintaining"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our approach enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training cannot solve beyond 4-by-4 multiplication.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That performance gains arise specifically from internalizing the removed reasoning steps rather than from increased task exposure, regularization, or other side effects of the progressive fine-tuning schedule.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Gradual fine-tuning that removes explicit CoT steps lets GPT-2 Small reach 99% accuracy on 9x9 multiplication and Mistral 7B exceed 50% on GSM8K 
with no intermediate outputs.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A progressive fine-tuning method lets language models internalize chain-of-thought steps so they can solve harder reasoning tasks without producing explicit intermediate outputs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"b7d5f4a5ada675bf73c55cb6f4ce0cd7d11f397250614a63507247c366ca2028"},"source":{"id":"2405.14838","kind":"arxiv","version":1},"verdict":{"id":"0d67b4be-0518-440e-80ad-15843605694f","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T11:40:00.849486Z","strongest_claim":"Our approach enables a GPT-2 Small model to solve 9-by-9 multiplication with up to 99% accuracy, whereas standard training cannot solve beyond 4-by-4 multiplication.","one_line_summary":"Gradual fine-tuning that removes explicit CoT steps lets GPT-2 Small reach 99% accuracy on 9x9 multiplication and Mistral 7B exceed 50% on GSM8K with no intermediate outputs.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That performance gains arise specifically from internalizing the removed reasoning steps rather than from increased task exposure, regularization, or other side effects of the progressive fine-tuning schedule.","pith_extraction_headline":"A progressive fine-tuning method lets language models internalize chain-of-thought steps so they can solve harder reasoning tasks without producing explicit intermediate outputs."},"references":{"count":20,"sample":[{"doi":"","year":2024,"title":"Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R","work_id":"50a8b109-5fb3-401e-8579-5d3b737fe859","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2018,"title":"On internal language representations in deep 
learning: An analysis of machine translation and speech recognition","work_id":"82043d1a-fbbe-4ba0-9d3f-7a4d2dc45c5f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Beyond the imitation game: Quantifying and extrapolating the capabilities of language models","work_id":"7393fee7-41f4-49fd-8957-ecd752f4252b","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretch","work_id":"5aea4c3d-a3a0-44c3-9914-69a2e9f05e7e","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Training verifiers to solve math word problems","work_id":"1f7ff91e-3d16-4cba-ad18-3c6ae7ec674b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":20,"snapshot_sha256":"58414645b799b721be73ccbdf2f5441e24fffd1c58cd4c4a11bbec2061668e1b","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"af91505ee37bdbe7c73c8a09730edfd984e0b121465aede3129f24673493eebc"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}