{"work":{"id":"b73ad5b2-e553-4c71-b0c9-67e67ba7b158","openalex_id":null,"doi":null,"arxiv_id":"2305.13245","raw_key":null,"title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","authors":null,"authors_text":"Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebr\\'on, Sumit Sanghai","year":2023,"venue":"cs.CL","abstract":"Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show that uptrained GQA achieves quality close to multi-head attention with comparable speed to MQA.","external_url":"https://arxiv.org/abs/2305.13245","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:26:42.158615+00:00","pith_arxiv_id":"2305.13245","created_at":"2026-05-09T02:54:46.373069+00:00","updated_at":"2026-05-25T07:26:42.158615+00:00","title_quality_ok":true,"display_title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","render_title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints"},"hub":{"state":{"work_id":"b73ad5b2-e553-4c71-b0c9-67e67ba7b158","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external 
citations","pith_inbound_count":102,"external_cited_by_count":null,"distinct_field_count":13,"first_pith_cited_at":"2023-07-17T17:50:36+00:00","last_pith_cited_at":"2026-05-21T07:40:57+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-03T10:15:55.346338+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":20},{"context_role":"method","n":6},{"context_role":"other","n":1}],"polarity_counts":[{"context_polarity":"background","n":19},{"context_polarity":"use_method","n":6},{"context_polarity":"unclear","n":2}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","claims":[{"claim_text":"Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show t","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"such as Qwen3 [97], while incorporating several design modifications to balance scalability and multimodal adapt- ability. The model consists of 64 transformer layers, each with a hidden size of 5,120 and an intermediate size of 25,600. The attention mechanism employs 64 heads with 8 dedicated key-value heads, adopting Grouped Query Attention (GQA) [2] to improve efficiency. 
RMSNorm [123] with pre-normalization is used to stabilize training. We introduce QK-Norm [23] to the query and key project","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"pretraining requires data scaling, one would like to make sure the data used are of high quality, rather than training the model on large raw data, i.e., we prefer 3T tokens over sophasticated engineering over 10T tokens without extensive filtering. Regarding the model architecture, we use standard implementation of the Transformer architecture with Grouped-Query Attention (GQA) [1], SwiGLU [68] activation, and RoPE with an adjusted base frequency (RoPE ABF) [82]. This design choice is the stand","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Moreover, AVA-BENCH represents a critical step towards developing next- generation VFMs by providing a systematic, diagnostic, and comprehensive evaluation framework. This benchmark enables VFM developers to accurately pinpoint specific deficiencies and implement targeted improvements, fostering the creation of more robust, versatile, and well-rounded VFMs in the future. 10 References [1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa:","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tokens through a separate stream. Specifically, image queries attend to concatenated image and text key-value pairs, where text keys and values are projected from the Qwen3-VL [16] encoder output. This design reduces computational overhead compared to bidirectional attention schemes while maintaining strong text-image alignment. We adopt Grouped Query Attention (GQA) [17] with a 4:1 ratio, using 16 query heads and 4 key-value heads. 
This reduces the key-value cache by 4× during inference with ne","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"board-level watt ceiling, and the driver throttles the GPU to stay within it- trading some throughput for guaranteed power savings. This model is sound for compute-bound workloads that push the GPU near its thermal design power (TDP). We show it fails for the phase that dominates production LLM serving: autoregressive decode. The failure is structural, not incidental. Across four attention paradigms- GQA [2], MLA [6], Gated DeltaNet [ 21], and Mamba2 [ 5]-decode power draw arXiv:2605.11999v1 [cs","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"•For supporting structures: Describe legs, frames, bases, and their stability characteristics • There should be anothershort_caption, which is a condensed version of the caption in approximately 5 words, capturing the essential object type and key feature (e.g., 'wooden dining chair', 'round glass table', 'metal desk lamp') Format your response as a JSON object: {\"captions\": [{\"part_idx\": 0, \"caption\": \"concise assembly-aware description\", \"short_caption\": \"5-word summary\"}, {\"part_idx\": 1, \"cap","claim_type":"other","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (20 contexts).","role_counts":[{"n":20,"context_role":"background"},{"n":6,"context_role":"method"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-23T04:04:16.041715+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"def23d27-3f7c-47c9-b553-45af77bcbb6a","orcid":null,"display_name":"Joshua Ainslie"},{"id":"f3feb0e8-3e1c-4108-93fd-cd8dc20f0611","orcid":null,"display_name":"James Lee-Thorp"},{"id":"b6547429-84fa-49d4-b49a-46a74584b8f6","orcid":null,"display_name":"Michiel de Jong"},{"id":"930a9a8c-4af5-4761-ad70-196afca094d6","orcid":null,"display_name":"Yury Zemlyanskiy"},{"id":"8c7a28db-5d82-4107-9753-bf8fee61096f","orcid":null,"display_name":"Federico Lebr\\'on"},{"id":"ed57c749-f179-4eb2-87d4-0c011f1c09c9","orcid":null,"display_name":"Sumit Sanghai"}]},"error":null,"updated_at":"2026-05-23T04:04:17.257745+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T13:41:16.855499+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":19},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":12},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":11},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":10},{"title":"The Llama 3 
Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":8},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":8},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","work_id":"fff3953b-5efb-4753-bee4-002f59995810","shared_citers":7},{"title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers","work_id":"19ed8c44-202a-48f6-8169-637d5a5f2408","shared_citers":7},{"title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","work_id":"735737c3-24e5-41c3-ab4f-04edcb36731c","shared_citers":7},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":6},{"title":"RoFormer: Enhanced Transformer with Rotary Position 
Embedding","work_id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","shared_citers":5}],"time_series":[{"n":2,"year":2023},{"n":9,"year":2024},{"n":5,"year":2025},{"n":35,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T13:51:24.022401+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T13:41:21.508341+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","claims":[{"claim_text":"Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. 
We show t","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"such as Qwen3 [97], while incorporating several design modifications to balance scalability and multimodal adapt- ability. The model consists of 64 transformer layers, each with a hidden size of 5,120 and an intermediate size of 25,600. The attention mechanism employs 64 heads with 8 dedicated key-value heads, adopting Grouped Query Attention (GQA) [2] to improve efficiency. RMSNorm [123] with pre-normalization is used to stabilize training. We introduce QK-Norm [23] to the query and key project","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"pretraining requires data scaling, one would like to make sure the data used are of high quality, rather than training the model on large raw data, i.e., we prefer 3T tokens over sophasticated engineering over 10T tokens without extensive filtering. Regarding the model architecture, we use standard implementation of the Transformer architecture with Grouped-Query Attention (GQA) [1], SwiGLU [68] activation, and RoPE with an adjusted base frequency (RoPE ABF) [82]. This design choice is the stand","claim_type":"method","confidence":0.95,"evidence_strength":"citation_context"},{"claim_text":"Moreover, AVA-BENCH represents a critical step towards developing next- generation VFMs by providing a systematic, diagnostic, and comprehensive evaluation framework. This benchmark enables VFM developers to accurately pinpoint specific deficiencies and implement targeted improvements, fostering the creation of more robust, versatile, and well-rounded VFMs in the future. 10 References [1] Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa:","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"tokens through a separate stream. 
Specifically, image queries attend to concatenated image and text key-value pairs, where text keys and values are projected from the Qwen3-VL [16] encoder output. This design reduces computational overhead compared to bidirectional attention schemes while maintaining strong text-image alignment. We adopt Grouped Query Attention (GQA) [17] with a 4:1 ratio, using 16 query heads and 4 key-value heads. This reduces the key-value cache by 4× during inference with ne","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"board-level watt ceiling, and the driver throttles the GPU to stay within it- trading some throughput for guaranteed power savings. This model is sound for compute-bound workloads that push the GPU near its thermal design power (TDP). We show it fails for the phase that dominates production LLM serving: autoregressive decode. The failure is structural, not incidental. Across four attention paradigms- GQA [2], MLA [6], Gated DeltaNet [ 21], and Mamba2 [ 5]-decode power draw arXiv:2605.11999v1 [cs","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"•For supporting structures: Describe legs, frames, bases, and their stability characteristics • There should be anothershort_caption, which is a condensed version of the caption in approximately 5 words, capturing the essential object type and key feature (e.g., 'wooden dining chair', 'round glass table', 'metal desk lamp') Format your response as a JSON object: {\"captions\": [{\"part_idx\": 0, \"caption\": \"concise assembly-aware description\", \"short_caption\": \"5-word summary\"}, {\"part_idx\": 1, \"cap","claim_type":"other","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (20 contexts).","role_counts":[{"n":20,"context_role":"background"},{"n":6,"context_role":"method"},{"n":1,"context_role":"other"}]},"error":null,"updated_at":"2026-05-23T04:04:17.263113+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","claims":[{"claim_text":"Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show t","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T13:51:21.510928+00:00"}},"summary":{"title":"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints","claims":[{"claim_text":"Multi-query attention (MQA), which only uses a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. 
We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using 5% of original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate (more than one, less than number of query heads) number of key-value heads. We show t","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":19},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":12},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":11},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":10},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":10},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":10},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":9},{"title":"Training Compute-Optimal Large Language Models","work_id":"b2faf28d-86b7-429c-bc42-469458efc246","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":8},{"title":"GPT-4 
Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":8},{"title":"Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism","work_id":"c888e6d1-0b1d-43d6-9ef5-f0912a0efa1b","shared_citers":8},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":8},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":8},{"title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","work_id":"fff3953b-5efb-4753-bee4-002f59995810","shared_citers":7},{"title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers","work_id":"19ed8c44-202a-48f6-8169-637d5a5f2408","shared_citers":7},{"title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","work_id":"735737c3-24e5-41c3-ab4f-04edcb36731c","shared_citers":7},{"title":"Program Synthesis with Large Language Models","work_id":"fd241a05-03b9-4de2-9588-9d77ce176125","shared_citers":7},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"Qwen Technical Report","work_id":"bb1fd52f-6b2f-437c-9516-37bdf6eb9be8","shared_citers":6},{"title":"RoFormer: Enhanced Transformer with Rotary Position Embedding","work_id":"4e5eee26-cd04-4c7a-988f-3e6d1a1f0eb9","shared_citers":6},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":5},{"title":"Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them","work_id":"513eb205-04ca-4722-9a43-a74e8cbe7e85","shared_citers":5}],"time_series":[{"n":2,"year":2023},{"n":9,"year":2024},{"n":5,"year":2025},{"n":35,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"8c7a28db-5d82-4107-9753-bf8fee61096f","orcid":null,"display_name":"Federico 
Lebrón","source":"manual","import_confidence":0.72},{"id":"f3feb0e8-3e1c-4108-93fd-cd8dc20f0611","orcid":null,"display_name":"James Lee-Thorp","source":"manual","import_confidence":0.72},{"id":"def23d27-3f7c-47c9-b553-45af77bcbb6a","orcid":null,"display_name":"Joshua Ainslie","source":"manual","import_confidence":0.72},{"id":"b6547429-84fa-49d4-b49a-46a74584b8f6","orcid":null,"display_name":"Michiel de Jong","source":"manual","import_confidence":0.72},{"id":"ed57c749-f179-4eb2-87d4-0c011f1c09c9","orcid":null,"display_name":"Sumit Sanghai","source":"manual","import_confidence":0.72},{"id":"930a9a8c-4af5-4761-ad70-196afca094d6","orcid":null,"display_name":"Yury Zemlyanskiy","source":"manual","import_confidence":0.72}]}}