{"work":{"id":"1e1df141-cac8-47fd-b068-c4c96e51e331","openalex_id":null,"doi":null,"arxiv_id":"2405.04434","raw_key":null,"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","authors":null,"authors_text":"DeepSeek-AI, Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu","year":2024,"venue":"cs.CL","abstract":"We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.","external_url":"https://arxiv.org/abs/2405.04434","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:45:29.215810+00:00","pith_arxiv_id":"2405.04434","created_at":"2026-05-09T06:15:37.602726+00:00","updated_at":"2026-05-25T07:45:29.215810+00:00","title_quality_ok":true,"display_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","render_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model"},"hub":{"state":{"work_id":"1e1df141-cac8-47fd-b068-c4c96e51e331","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":118,"external_cited_by_count":null,"distinct_field_count":18,"first_pith_cited_at":"2023-07-12T20:01:52+00:00","last_pith_cited_at":"2026-05-22T17:31:16+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-30T12:21:06.086972+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":24},{"context_role":"method","n":5},{"context_role":"dataset","n":3},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":23},{"context_polarity":"use_method","n":5},{"context_polarity":"use_dataset","n":3},{"context_polarity":"baseline","n":1},{"context_polarity":"support","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","claims":[{"claim_text":"We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSe","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Iterative algorithm for solving triple-hierarchical constrained optimization problem. J. Optim. Theory Appl., 148:580-592, 2011. [99] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024. [100] B. Liu, X. Li, J. Zhang, J. Wang, T. He, S. Hong, H. Liu, S. Zhang, K. Song, K. Zhu, Y . Cheng, S. Wang, X. Wang, Y . Luo, H. Jin, P. Zhang, O. ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"* Chen Zhang is the corresponding author. †Qijun Zhang participated in this project during his internship at Shanghai Jiao Tong University. Fig. 1: Comparison of computation-communication overlap strategies in MoE systems. fraction of the computational cost [12], [13], [22], [55]. This approach has been validated by state-of-the-art models such as Mixtral-8x7B [22], DeepSeek [9]-[11], GPT [49], [50], Llama 4 [36], DBRX [54] and others [1], [4], [45], [60], firmly establishing MoE as a crucial co","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"board-level watt ceiling, and the driver throttles the GPU to stay within it- trading some throughput for guaranteed power savings. This model is sound for compute-bound workloads that push the GPU near its thermal design power (TDP). We show it fails for the phase that dominates production LLM serving: autoregressive decode. The failure is structural, not incidental. Across four attention paradigms- GQA [2], MLA [6], Gated DeltaNet [ 21], and Mamba2 [ 5]-decode power draw arXiv:2605.11999v1 [cs","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"for production LLM training. Date:April 22, 2026 1 Introduction The landscape of deep learning has been fundamentally altered by the emergence of mixture-of-experts (MoE) architectures, which increases model size at scalable training and inference cost. State-of-the-art large language models (LLMs), including GPT-5 [35], Gemini3 Pro [9], DeepSeek-V3 [14], and Qwen3 [42] have universally adopted MoE designs to scale parameter counts into hundreds of billions while maintaining manageable activatio","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"HyLo-Llama-14MLA14M2 8B 4.7% 44.1 71.2 67.3 39.6 75.4 40.0 64.3 57.4 71.7 65.4 57.8 46.6 40.9 HyLo-Llama-14MLA14GDN 8B 4.7% 45.1 72.0 68.2 39.4 76.1 40.9 63.8 57.9 73.2 69.7 62.9 52.0 58.9 Table 3:Comparison of different techniques across backbone models Llama-3.2-3B. which includes ARC-Challenge (ARC) [9], ARC-Easy (ARE) [9], HellaSwag (HS) [53], OpenBookQA (OB) [ 31], PIQA [4], RACE (RA) [23], and WinoGrande (WG) [ 41]. For long context evaluations we use all 13 tasks from RULER [20] benchmark","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The signed power transform Φβ(x) = sign(x)⊙ |x|β with β= 0.1 provides four layers of protection. 16 Layer 1: Hölder continuity ofΦ β.Lemma 3 establishes that for anyx,y∈R d, ∥Φβ(x)−Φ β(y)∥1+β ≤C β∥x−y∥ β 1+β,(32) with Cβ = 2 1−βd(1−β)/(1+β) . Setting x= ˜mt and y=m t and using the norm equivalence ∥δt∥1+β ≤d 1 1+β ∥δt∥∞, we obtain ∥Φβ( ˜mt)−Φ β(mt)∥1+β ≤C βd β 1+β ∥δt∥β ∞.(33) Since β= 0.1 , the exponent β on the quantization error ∥δt∥∞ significantly attenuates its impact. For a block with dyna","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (24 contexts).","role_counts":[{"n":24,"context_role":"background"},{"n":5,"context_role":"method"},{"n":3,"context_role":"dataset"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-20T13:11:57.745808+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"b3d3bc38-c7e6-4554-ab1a-a4b32a8299c8","orcid":null,"display_name":"DeepSeek-AI"},{"id":"5ca1d3e9-abe6-434e-91cd-b96f50c305c4","orcid":null,"display_name":"Aixin Liu"},{"id":"16057617-fcdc-45e5-a009-48651c24b426","orcid":null,"display_name":"Bei Feng"},{"id":"21461dbd-f942-4232-8842-f8348bf4678a","orcid":null,"display_name":"Bin Wang"},{"id":"1319a98c-3288-417b-86b2-19490b8cdcb2","orcid":null,"display_name":"Bingxuan Wang"},{"id":"d23ba457-7f27-40ff-b81f-5e71cbdfd3cc","orcid":null,"display_name":"Bo Liu"}]},"error":null,"updated_at":"2026-05-20T13:11:57.739658+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:29:59.439100+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":24},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":20},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":9},{"title":"gpt-oss-120b & gpt-oss-20b Model Card","work_id":"178c1f7e-4f19-4392-a45d-45a6dfa88ead","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":9},{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","work_id":"2c6b3f6d-54e4-4df7-baa7-475a490799af","shared_citers":9},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":8},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models","work_id":"07c85cc5-4086-4abc-823b-6d0f4ff784d0","shared_citers":6},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":6},{"title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding","work_id":"52b3c9a6-2a27-45a7-ba2b-ebe4b5bb5a5f","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":6},{"title":"arXiv preprint arXiv:2408.15664 , year=","work_id":"267500ca-1512-478f-8a1b-6ecbdb09771d","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","work_id":"a9888d6d-bf47-4324-9834-7cc12ac3a78c","shared_citers":5},{"title":"Efficient Memory Management for Large Language Model Serving with PagedAttention","work_id":"0eb5eca2-2c11-4a77-a25a-18c331a50ed2","shared_citers":5},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":5}],"time_series":[{"n":3,"year":2024},{"n":3,"year":2025},{"n":54,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T11:29:46.527651+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:29:50.778645+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","claims":[{"claim_text":"We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSe","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"Iterative algorithm for solving triple-hierarchical constrained optimization problem. J. Optim. Theory Appl., 148:580-592, 2011. [99] A. Liu, B. Feng, B. Wang, B. Wang, B. Liu, C. Zhao, C. Dengr, C. Ruan, D. Dai, D. Guo, et al. DeepSeek-V2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024. [100] B. Liu, X. Li, J. Zhang, J. Wang, T. He, S. Hong, H. Liu, S. Zhang, K. Song, K. Zhu, Y . Cheng, S. Wang, X. Wang, Y . Luo, H. Jin, P. Zhang, O. ","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"* Chen Zhang is the corresponding author. †Qijun Zhang participated in this project during his internship at Shanghai Jiao Tong University. Fig. 1: Comparison of computation-communication overlap strategies in MoE systems. fraction of the computational cost [12], [13], [22], [55]. This approach has been validated by state-of-the-art models such as Mixtral-8x7B [22], DeepSeek [9]-[11], GPT [49], [50], Llama 4 [36], DBRX [54] and others [1], [4], [45], [60], firmly establishing MoE as a crucial co","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"board-level watt ceiling, and the driver throttles the GPU to stay within it- trading some throughput for guaranteed power savings. This model is sound for compute-bound workloads that push the GPU near its thermal design power (TDP). We show it fails for the phase that dominates production LLM serving: autoregressive decode. The failure is structural, not incidental. Across four attention paradigms- GQA [2], MLA [6], Gated DeltaNet [ 21], and Mamba2 [ 5]-decode power draw arXiv:2605.11999v1 [cs","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"for production LLM training. Date:April 22, 2026 1 Introduction The landscape of deep learning has been fundamentally altered by the emergence of mixture-of-experts (MoE) architectures, which increases model size at scalable training and inference cost. State-of-the-art large language models (LLMs), including GPT-5 [35], Gemini3 Pro [9], DeepSeek-V3 [14], and Qwen3 [42] have universally adopted MoE designs to scale parameter counts into hundreds of billions while maintaining manageable activatio","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"HyLo-Llama-14MLA14M2 8B 4.7% 44.1 71.2 67.3 39.6 75.4 40.0 64.3 57.4 71.7 65.4 57.8 46.6 40.9 HyLo-Llama-14MLA14GDN 8B 4.7% 45.1 72.0 68.2 39.4 76.1 40.9 63.8 57.9 73.2 69.7 62.9 52.0 58.9 Table 3:Comparison of different techniques across backbone models Llama-3.2-3B. which includes ARC-Challenge (ARC) [9], ARC-Easy (ARE) [9], HellaSwag (HS) [53], OpenBookQA (OB) [ 31], PIQA [4], RACE (RA) [23], and WinoGrande (WG) [ 41]. For long context evaluations we use all 13 tasks from RULER [20] benchmark","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"The signed power transform Φβ(x) = sign(x)⊙ |x|β with β= 0.1 provides four layers of protection. 16 Layer 1: Hölder continuity ofΦ β.Lemma 3 establishes that for anyx,y∈R d, ∥Φβ(x)−Φ β(y)∥1+β ≤C β∥x−y∥ β 1+β,(32) with Cβ = 2 1−βd(1−β)/(1+β) . Setting x= ˜mt and y=m t and using the norm equivalence ∥δt∥1+β ≤d 1 1+β ∥δt∥∞, we obtain ∥Φβ( ˜mt)−Φ β(mt)∥1+β ≤C βd β 1+β ∥δt∥β ∞.(33) Since β= 0.1 , the exponent β on the quantization error ∥δt∥∞ significantly attenuates its impact. For a block with dyna","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (24 contexts).","role_counts":[{"n":24,"context_role":"background"},{"n":5,"context_role":"method"},{"n":3,"context_role":"dataset"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-20T13:11:57.743327+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","claims":[{"claim_text":"We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSe","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T11:29:59.441178+00:00"}},"summary":{"title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","claims":[{"claim_text":"We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSe","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":24},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":20},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":12},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":10},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":10},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":9},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":9},{"title":"gpt-oss-120b & gpt-oss-20b Model Card","work_id":"178c1f7e-4f19-4392-a45d-45a6dfa88ead","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":9},{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","work_id":"2c6b3f6d-54e4-4df7-baa7-475a490799af","shared_citers":9},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":8},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":7},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":7},{"title":"DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models","work_id":"07c85cc5-4086-4abc-823b-6d0f4ff784d0","shared_citers":6},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":6},{"title":"GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding","work_id":"52b3c9a6-2a27-45a7-ba2b-ebe4b5bb5a5f","shared_citers":6},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":6},{"title":"Qwen2.5 Technical Report","work_id":"d8432992-4980-4a81-85c7-9fa2c2b87f85","shared_citers":6},{"title":"arXiv preprint arXiv:2408.15664 , year=","work_id":"267500ca-1512-478f-8a1b-6ecbdb09771d","shared_citers":5},{"title":"DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models","work_id":"c5006563-f3ec-438a-9e35-b7b484f34828","shared_citers":5},{"title":"DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models","work_id":"a9888d6d-bf47-4324-9834-7cc12ac3a78c","shared_citers":5},{"title":"Efficient Memory Management for Large Language Model Serving with PagedAttention","work_id":"0eb5eca2-2c11-4a77-a25a-18c331a50ed2","shared_citers":5},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":5}],"time_series":[{"n":3,"year":2024},{"n":3,"year":2025},{"n":54,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"5ca1d3e9-abe6-434e-91cd-b96f50c305c4","orcid":null,"display_name":"Aixin Liu","source":"manual","import_confidence":0.72},{"id":"16057617-fcdc-45e5-a009-48651c24b426","orcid":null,"display_name":"Bei Feng","source":"manual","import_confidence":0.72},{"id":"1319a98c-3288-417b-86b2-19490b8cdcb2","orcid":null,"display_name":"Bingxuan Wang","source":"manual","import_confidence":0.72},{"id":"21461dbd-f942-4232-8842-f8348bf4678a","orcid":null,"display_name":"Bin Wang","source":"manual","import_confidence":0.72},{"id":"d23ba457-7f27-40ff-b81f-5e71cbdfd3cc","orcid":null,"display_name":"Bo Liu","source":"manual","import_confidence":0.72},{"id":"b3d3bc38-c7e6-4554-ab1a-a4b32a8299c8","orcid":null,"display_name":"DeepSeek-AI","source":"manual","import_confidence":0.72}]}}