{"work":{"id":"d3fdf68e-3a5e-48b5-8a18-7a9137479c55","openalex_id":null,"doi":null,"arxiv_id":"2305.14314","raw_key":null,"title":"QLoRA: Efficient Finetuning of Quantized LLMs","authors":null,"authors_text":"Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer","year":2023,"venue":"cs.LG","abstract":"We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. 
Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.","external_url":"https://arxiv.org/abs/2305.14314","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T13:13:27.137507+00:00","pith_arxiv_id":"2305.14314","created_at":"2026-05-09T06:20:42.368717+00:00","updated_at":"2026-06-29T13:13:27.137507+00:00","title_quality_ok":true,"display_title":"QLoRA: Efficient Finetuning of Quantized LLMs","render_title":"QLoRA: Efficient Finetuning of Quantized LLMs"},"hub":{"state":{"work_id":"d3fdf68e-3a5e-48b5-8a18-7a9137479c55","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":87,"external_cited_by_count":null,"distinct_field_count":12,"first_pith_cited_at":"2023-06-09T05:55:52+00:00","last_pith_cited_at":"2026-06-16T18:00:00+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T16:29:02.956241+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":19},{"context_role":"method","n":6},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":18},{"context_polarity":"use_method","n":6},{"context_polarity":"unclear","n":2}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T17:49:17.437706+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":22},{"title":"Evaluating 
Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":8},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":8},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":8},{"title":"AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning","work_id":"6fa49657-348b-42dd-b870-8758c71af878","shared_citers":6},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"F., Cheng, K.-T., and Chen, M.-H","work_id":"6726c65c-0da8-4d37-9dae-84dbdd936ae4","shared_citers":6},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":5},{"title":"Finetuned Language Models Are Zero-Shot Learners","work_id":"7ed6cdaa-ed67-4db4-aceb-b7e1b0e6e7c4","shared_citers":5},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":5},{"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","shared_citers":5},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human 
Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":5},{"title":"TruthfulQA: Measuring How Models Mimic Human Falsehoods","work_id":"22e3b047-a6e8-4c4c-b62e-173b545a1a45","shared_citers":5},{"title":"Alpacafarm: A simulation framework for methods that learn from human feedback","work_id":"a875adf2-8826-466d-bf52-896ee15632ea","shared_citers":4},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":4},{"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale","work_id":"98201f98-f4e5-4d1c-9ed7-b795e3c8f76c","shared_citers":4},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":4}],"time_series":[{"n":4,"year":2023},{"n":4,"year":2024},{"n":35,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T17:48:32.377749+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T17:49:04.922048+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"QLoRA: Efficient Finetuning of Quantized LLMs","claims":[{"claim_text":"We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning 
task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save m","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks QLoRA: Efficient Finetuning of Quantized LLMs because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T17:49:11.714793+00:00"}},"summary":{"title":"QLoRA: Efficient Finetuning of Quantized LLMs","claims":[{"claim_text":"We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. 
QLoRA introduces a number of innovations to save m","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks QLoRA: Efficient Finetuning of Quantized LLMs because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"LoRA: Low-Rank Adaptation of Large Language Models","work_id":"0426219a-789e-4964-adc8-a04538510818","shared_citers":22},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":11},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":8},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":8},{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":8},{"title":"AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning","work_id":"6fa49657-348b-42dd-b870-8758c71af878","shared_citers":6},{"title":"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models","work_id":"d1cf6693-a082-403c-ada9-dac7b96341f9","shared_citers":6},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":6},{"title":"F., Cheng, K.-T., and Chen, M.-H","work_id":"6726c65c-0da8-4d37-9dae-84dbdd936ae4","shared_citers":6},{"title":"Attention Is All You Need","work_id":"baafb5a2-5272-43bc-932f-09fa9ffe5316","shared_citers":5},{"title":"Finetuned Language Models Are Zero-Shot Learners","work_id":"7ed6cdaa-ed67-4db4-aceb-b7e1b0e6e7c4","shared_citers":5},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":5},{"title":"OPT: Open Pre-trained Transformer Language Models","work_id":"d7ff3b21-1fff-4cf4-952a-4714e3ef2307","shared_citers":5},{"title":"Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback","work_id":"a1f2574b-a899-4713-be60-c87ba332656c","shared_citers":5},{"title":"TruthfulQA: Measuring How Models Mimic Human Falsehoods","work_id":"22e3b047-a6e8-4c4c-b62e-173b545a1a45","shared_citers":5},{"title":"Alpacafarm: A simulation framework for methods that learn from human feedback","work_id":"a875adf2-8826-466d-bf52-896ee15632ea","shared_citers":4},{"title":"Constitutional AI: Harmlessness from AI Feedback","work_id":"faaaa4e0-2676-4fac-a0b4-99aef10d2095","shared_citers":4},{"title":"Fine-Tuning Language Models from Human Preferences","work_id":"4f54aad1-f3b6-404f-b9c7-e21ba0a33b99","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"LLM.int8(): 8-bit Matrix Multiplication for 
Transformers at Scale","work_id":"98201f98-f4e5-4d1c-9ed7-b795e3c8f76c","shared_citers":4},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":4}],"time_series":[{"n":4,"year":2023},{"n":4,"year":2024},{"n":35,"year":2026}],"dependency_candidates":[]},"authors":[]}}