{"work":{"id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","openalex_id":null,"doi":null,"arxiv_id":"2408.03314","raw_key":null,"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","authors":null,"authors_text":"Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar","year":2024,"venue":"cs.LG","abstract":"Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should tradeoff inference-time and pre-training compute. Despite its importance, little research attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a \"compute-optimal\" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.","external_url":"https://arxiv.org/abs/2408.03314","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-19T05:02:58.787522+00:00","pith_arxiv_id":"2408.03314","created_at":"2026-05-08T21:54:18.004682+00:00","updated_at":"2026-05-19T05:02:58.787522+00:00","title_quality_ok":true,"display_title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","render_title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters"},"hub":{"state":{"work_id":"a8d50b24-bdf5-46ed-bc4f-2927dfd81f1d","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":181,"external_cited_by_count":null,"distinct_field_count":17,"first_pith_cited_at":"2024-09-19T17:16:21+00:00","last_pith_cited_at":"2026-05-14T17:57:40+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-19T07:11:15.499747+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":36},{"context_role":"method","n":3},{"context_role":"dataset","n":2}],"polarity_counts":[{"context_polarity":"background","n":35},{"context_polarity":"use_method","n":3},{"context_polarity":"use_dataset","n":2},{"context_polarity":"support","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","claims":[{"claim_text":"Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:24:13.810516+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"cdf8282b-f2ce-4f8a-b5ca-51538f97bcd9","orcid":null,"display_name":"Charlie Snell"},{"id":"9cc966c7-3abe-4768-acea-781e806be1c2","orcid":null,"display_name":"Jaehoon Lee"},{"id":"f8ec4f29-391a-4e31-9974-811f9e278acb","orcid":null,"display_name":"Kelvin Xu"},{"id":"fa809172-8fb7-4879-a7e8-58002b30b409","orcid":null,"display_name":"Aviral Kumar"}]},"error":null,"updated_at":"2026-05-14T01:24:14.426555+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T01:14:11.626130+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":43},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":36},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":35},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":34},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":30},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":23},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":20},{"title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","work_id":"b124064d-5a56-42ad-86f5-3cc349b86a3a","shared_citers":19},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":19},{"title":"s1: Simple test-time scaling","work_id":"806265b1-8f22-48dd-b8ad-a99823b18fa4","shared_citers":19},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":19},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":18},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":18},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":18},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":16},{"title":"arXiv preprint arXiv:2408.00724 , year=","work_id":"cef2c407-5d51-46a9-8eb6-f382b419502e","shared_citers":11},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":11},{"title":"Solving math word problems with process- and outcome-based feedback","work_id":"94492239-b1d5-435e-bea5-7f51992d0614","shared_citers":11},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":10},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":10},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":9}],"time_series":[{"n":3,"year":2024},{"n":9,"year":2025},{"n":100,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T01:22:27.979452+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T01:14:17.985005+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","claims":[{"claim_text":"Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:24:13.814894+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","claims":[{"claim_text":"Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T01:14:02.526360+00:00"}},"summary":{"title":"Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters","claims":[{"claim_text":"Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":43},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":36},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":35},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":34},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":30},{"title":"Proximal Policy Optimization Algorithms","work_id":"240c67fe-d14d-4520-91c1-38a4e272ca19","shared_citers":23},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":20},{"title":"Large Language Monkeys: Scaling Inference Compute with Repeated Sampling","work_id":"b124064d-5a56-42ad-86f5-3cc349b86a3a","shared_citers":19},{"title":"OpenAI o1 System Card","work_id":"68d3c334-0fc9-49e3-b7b0-a69afae933e2","shared_citers":19},{"title":"s1: Simple test-time scaling","work_id":"806265b1-8f22-48dd-b8ad-a99823b18fa4","shared_citers":19},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":19},{"title":"DAPO: An Open-Source LLM Reinforcement Learning System at Scale","work_id":"64019d00-0b11-4bbd-b173-b46c8fad0157","shared_citers":18},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":18},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":18},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":18},{"title":"Kimi k1.5: Scaling Reinforcement Learning with LLMs","work_id":"bff96ab1-bd6a-4585-be23-74fdb51969c7","shared_citers":16},{"title":"arXiv preprint arXiv:2408.00724 , year=","work_id":"cef2c407-5d51-46a9-8eb6-f382b419502e","shared_citers":11},{"title":"GPQA: A Graduate-Level Google-Proof Q&A Benchmark","work_id":"9e2a976b-f5ad-4aee-af5c-243fe0fe75d2","shared_citers":11},{"title":"Solving math word problems with process- and outcome-based feedback","work_id":"94492239-b1d5-435e-bea5-7f51992d0614","shared_citers":11},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":10},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":10},{"title":"Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities","work_id":"008df105-2fdd-45d8-857a-8e35868aecb6","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"Understanding R1-Zero-Like Training: A Critical Perspective","work_id":"ec354f3b-9484-4a0c-94c8-92d4d0260835","shared_citers":9}],"time_series":[{"n":3,"year":2024},{"n":9,"year":2025},{"n":100,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"fa809172-8fb7-4879-a7e8-58002b30b409","orcid":null,"display_name":"Aviral Kumar","source":"manual","import_confidence":0.72},{"id":"cdf8282b-f2ce-4f8a-b5ca-51538f97bcd9","orcid":null,"display_name":"Charlie Snell","source":"manual","import_confidence":0.72},{"id":"9cc966c7-3abe-4768-acea-781e806be1c2","orcid":null,"display_name":"Jaehoon Lee","source":"manual","import_confidence":0.72},{"id":"f8ec4f29-391a-4e31-9974-811f9e278acb","orcid":null,"display_name":"Kelvin Xu","source":"manual","import_confidence":0.72}]}}