MIDUS: Memory-Infused Depth Up-Scaling
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 22:00 UTC · model grok-4.3
The pith
MIDUS replaces duplicated FFN branches with head-wise memory layers to turn added depth into lightweight retrieval residuals.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Depth up-scaling works by duplicating blocks, but the added capacity need not come from duplicated dense FFN branches. MIDUS shows that head-wise memory layers, built on multi-head product-key memory and HIVE, can supply that capacity as lightweight, retrieval-based, head-specific residuals that preserve or improve performance while remaining structurally distinct from FFN-based expansion.
What carries the argument
Head-wise Memory Layer (HML), which assigns each head a distinct key space via multi-head product-key memory and realizes head-specific values from a shared latent bank through compact projections enabled by Head-wise Implicit Value Expansion (HIVE).
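The head-wise mechanism can be pictured with a toy sketch. Everything below is hypothetical (the codebook sizes, dimensions, function names, and softmax-over-top-k mixing are illustrative assumptions): it shows the general shape of a per-head product-key lookup feeding a shared latent value bank with a compact per-head projection, not the paper's actual HML/HIVE equations.

```python
# Illustrative sketch of a head-wise product-key memory lookup.
# All shapes and names are hypothetical, not the paper's configuration.
import math
import random

random.seed(0)

H = 4          # attention heads, each with its OWN key space
C = 8          # sub-keys per codebook -> C*C = 64 memory slots per head
D = 16         # query dimension (split into two halves for product keys)
LATENT = 6     # shared latent value bank dimension (HIVE-style)
D_MODEL = 16   # per-head output dimension after the compact projection

def rand_vec(n):
    return [random.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]

# Per-head product keys: two small codebooks instead of C*C full keys.
sub_keys = [([rand_vec(D // 2) for _ in range(C)],
             [rand_vec(D // 2) for _ in range(C)]) for _ in range(H)]

# Shared latent bank (one row per slot), plus a compact per-head projection
# that realizes head-specific values from the shared latents.
latent_bank = [rand_vec(LATENT) for _ in range(C * C)]
head_proj = [[rand_vec(LATENT) for _ in range(D_MODEL)] for _ in range(H)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hml_lookup(head, query, topk=2):
    """Retrieve a residual for one head from its product-key memory."""
    ka, kb = sub_keys[head]
    q1, q2 = query[:D // 2], query[D // 2:]
    # Score each query half against its codebook; a slot's score is the sum.
    s1 = [dot(q1, k) for k in ka]
    s2 = [dot(q2, k) for k in kb]
    slots = sorted(((s1[i] + s2[j], i * C + j)
                    for i in range(C) for j in range(C)), reverse=True)[:topk]
    # Softmax over the selected slots only.
    m = max(s for s, _ in slots)
    w = [math.exp(s - m) for s, _ in slots]
    z = sum(w)
    # Mix shared latents, then project to a head-specific value.
    mixed = [sum(wi / z * latent_bank[idx][d] for wi, (_, idx) in zip(w, slots))
             for d in range(LATENT)]
    return [dot(row, mixed) for row in head_proj[head]]

residual = hml_lookup(0, rand_vec(D))
print(len(residual))  # per-head residual of width D_MODEL
```

The point of the sketch is structural: each head scores only 2C sub-keys yet addresses C² slots, and head-specific values come from a shared bank plus a small projection rather than a dense FFN branch.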
If this is right
- Added depth contributes retrieval-based rather than dense FFN-based residual capacity.
- Head-level conditioning accommodates heterogeneous head roles more directly than block-level duplication.
- Parameter and compute overhead of depth up-scaling drops while performance holds or rises.
- Structural analyses confirm HML with HIVE forms a distinct alternative to FFN residual branches.
Where Pith is reading between the lines
- Memory-infused scaling may generalize to other expansion strategies that currently duplicate full blocks.
- Head-conditioned retrieval could support more modular growth paths than uniform block duplication.
- Fixed-retrieval analyses suggest future work could tune key spaces per head without retraining the entire model.
Load-bearing premise
Head-wise memory layers with multi-head product-key memory and HIVE can substitute for duplicated FFN branches while preserving model performance and supplying equivalent residual capacity.
What would settle it
A direct comparison on standard language-model benchmarks: if MIDUS variants show lower accuracy or higher effective compute cost than ordinary depth-up-scaled baselines with the same number of added blocks, the claim fails.
Original abstract
Expanding pre-trained language models offers a practical way to increase capacity without training larger models from scratch. Depth Up-Scaling (DUS) does so by duplicating Transformer blocks and inserting them into a pre-trained backbone. This process also duplicates FFN-heavy blocks, increasing parameter and compute cost while adding capacity through a block-level dense residual branch. Yet prior work suggests that added capacity need not remain tied to dense FFN branches, while attention heads often play heterogeneous roles, motivating more efficient head-level residual corrections. We propose Memory-Infused Depth Up-Scaling (MIDUS), which replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity. We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit Value Expansion (HIVE). HML assigns each head a distinct key space, while HIVE realizes head-specific values from a shared latent bank through compact projections. Alongside empirical improvements in performance and efficiency, our head-importance and fixed-retrieval structural analyses characterize HML with HIVE as a structurally distinct, head-conditioned alternative to FFN-based residual expansion.
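The abstract's reliance on product-key memory has a simple capacity arithmetic behind it: two sub-key codebooks of size c address c² value slots while storing only 2c key vectors. The codebook sizes below are illustrative assumptions, since the abstract does not state the configuration used.

```python
# Product-key memories index c * c slots while storing only 2 * c sub-keys,
# so retrieval capacity grows quadratically in stored-key count.
# Codebook sizes here are illustrative, not the paper's configuration.
def pkm_capacity(c):
    """Slots addressable by two sub-key codebooks of size c each."""
    return c * c

for c in (256, 512, 1024):
    print(f"{2 * c:5d} stored sub-keys -> {pkm_capacity(c):9,d} addressable slots")
```

This quadratic growth is what lets a memory layer add capacity far more cheaply than a dense FFN branch of comparable reach.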
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Memory-Infused Depth Up-Scaling (MIDUS) as a modification to Depth Up-Scaling (DUS) for expanding pre-trained language models. Instead of duplicating FFN-heavy blocks, MIDUS replaces the duplicated FFN branches with Head-wise Memory Layers (HML) that combine multi-head product-key memory and Head-wise Implicit Value Expansion (HIVE). Each head receives a distinct key space while HIVE derives head-specific values from a shared latent bank via compact projections. The paper supplies the architectural definitions, insertion points into the DUS backbone, and accompanying head-importance and fixed-retrieval analyses, claiming that the resulting residual capacity is lightweight, retrieval-based, and structurally distinct from dense FFN branches while delivering empirical gains in performance and efficiency.
Significance. If the reported empirical gains and structural distinctions hold under rigorous baselines and ablations, MIDUS would demonstrate a practical route to parameter-efficient depth scaling that exploits attention-head heterogeneity. The approach converts added depth into retrieval-based capacity rather than additional dense computation, which could reduce both parameter count and inference cost relative to standard DUS while preserving or improving model quality. The head-wise memory formulation and HIVE mechanism also supply a concrete architectural alternative that future work on memory-augmented transformers could build upon.
minor comments (3)
- [Abstract] The statement that MIDUS yields 'empirical improvements in performance and efficiency' is presented without any quantitative anchors (e.g., perplexity deltas, throughput numbers, or baseline comparisons). Even in an abstract, a single concrete result or a reference to the main table would strengthen the summary.
- [§3 (Architectural Definition)] The definitions of the memory key space size per head and the HIVE projection dimensions are introduced as free parameters; a short paragraph or table explicitly listing their chosen values and sensitivity would improve reproducibility.
- [§5 (Structural Analyses)] Figure captions for the head-importance and fixed-retrieval analyses should state the exact evaluation metric and the number of heads or layers examined so that readers can immediately interpret the structural claims.
Simulated Author's Rebuttal
We thank the referee for the positive summary of MIDUS, the recognition of its potential as a parameter-efficient alternative to standard depth up-scaling, and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper presents MIDUS as an architectural proposal: it motivates replacing duplicated FFN branches in DUS with head-wise memory layers (HML + HIVE) based on external observations about attention heterogeneity, then supplies explicit definitions for the key spaces, projections, insertion points, and empirical protocol. No equation or claim reduces a 'prediction' to a fitted parameter by construction, no uniqueness theorem is imported from self-citation, and no ansatz is smuggled via prior work by the same authors. The residual-capacity characterization follows directly from the structural definitions and reported analyses rather than from re-labeling inputs. The chain is therefore self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- memory key space size per head
- HIVE projection dimensions
axioms (1)
- domain assumption: Attention heads play heterogeneous roles
invented entities (2)
- Head-wise Memory Layer (HML): no independent evidence
- Head-wise Implicit Value Expansion (HIVE): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"We introduce a Head-wise Memory Layer (HML), which combines multi-head product-key memory with Head-wise Implicit Value Expansion (HIVE)."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear.
"MIDUS replaces the duplicated FFN branches with memory layers and turns added depth into lightweight retrieval-based residual capacity."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
MathQA: Towards Interpretable Math Word Problem Solving with Operation-Based Formalisms
Aida Amini, Saadia Gabriel, Shanchuan Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. MathQA: Towards interpretable math word problem solving with operation-based formalisms. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long an...
-
[2]
Progressive Depth Up-Scaling via Optimal Transport
Mingzi Cao, Xi Wang, and Nikolaos Aletras. Progressive depth up-scaling via optimal transport. arXiv preprint arXiv:2508.08011.
-
[3]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044.
-
[4]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457.
-
[5]
Training Verifiers to Solve Math Word Problems
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
-
[6]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
-
[7]
Gradient Localization Improves Lifelong Pre-Training of Language Models
Jared Fernandez, Yonatan Bisk, and Emma Strubell. Gradient localization improves lifelong pre-training of language models. arXiv preprint arXiv:2411.04448.
-
[8]
The Llama 3 Herd of Models
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
-
[9]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
-
[10]
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874.
-
[11]
Ultra-Sparse Memory Network
Zihao Huang, Qiyang Min, Hongzhi Huang, Defa Zhu, Yutao Zeng, Ran Guo, and Xun Zhou. Ultra-sparse memory network. arXiv preprint arXiv:2411.12364.
-
[12]
UltraMemV2: Memory Networks Scaling to 120B Parameters with Superior Long-Context Learning
Zihao Huang, Yu Bao, Qiyang Min, Siyan Chen, Ran Guo, Hongzhi Huang, Defa Zhu, Yutao Zeng, Banggu Wu, Xun Zhou, et al. UltraMemV2: Memory networks scaling to 120B parameters with superior long-context learning. arXiv preprint arXiv:2508.18756.
-
[13]
Mixtral of Experts
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088.
-
[14]
SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling
Dahyun Kim, Chanjun Park, Sanghoon Kim, Wonsung Lee, Wonho Song, Yunsu Kim, Hyeonwoo Kim, Yungi Kim, Hyeonju Lee, Jihoo Kim, et al. SOLAR 10.7B: Scaling large language models with simple yet effective depth up-scaling. arXiv preprint arXiv:2312.15166.
-
[15]
Large Product Key Memory for Pretrained Language Models
Gyuwan Kim and Tae-Hwan Jung. Large product key memory for pretrained language models. arXiv preprint arXiv:2010.03881.
-
[16]
Why Deep Neural Networks for Function Approximation?
Shiyu Liang and Rayadurgam Srikant. Why deep neural networks for function approximation? arXiv preprint arXiv:1610.04161.
-
[17]
LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning
Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. LogiQA: A challenge dataset for machine reading comprehension with logical reasoning. arXiv preprint arXiv:2007.08124.
-
[18]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
-
[19]
Pointer Sentinel Mixture Models
Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.
-
[20]
Instruction Tuning with GPT-4
Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with GPT-4. arXiv preprint arXiv:2304.03277.
-
[21]
Mixture-of-Depths: Dynamically Allocating Compute in Transformer-Based Language Models
David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258.
-
[22]
The Power of Deeper Networks for Expressing Natural Functions
David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. arXiv preprint arXiv:1705.05502.
-
[23]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
-
[24]
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
-
[25]
DLO: Dynamic Layer Operation for Efficient Vertical Scaling of LLMs
Zhen Tan, Daize Dong, Xinyu Zhao, Jie Peng, Yu Cheng, and Tianlong Chen. DLO: Dynamic layer operation for efficient vertical scaling of LLMs. arXiv preprint arXiv:2407.11030.
-
[26]
LLaMA Pro: Progressive LLaMA with Block Expansion
Chengyue Wu, Yukang Gan, Yixiao Ge, Zeyu Lu, Jiahao Wang, Ye Feng, Ying Shan, and Ping Luo. LLaMA Pro: Progressive LLaMA with block expansion. arXiv preprint arXiv:2401.02415.
-
[27]
Progressively Stacking 2.0: A Multi-Stage Layerwise Training Method for BERT Training Speedup
Cheng Yang, Shengnan Wang, Chao Yang, Yuechuan Li, Ru He, and Jingqiao Zhang. Progressively stacking 2.0: A multi-stage layerwise training method for BERT training speedup. arXiv preprint arXiv:2011.13635.
-
[28]
LESA: Learnable LLM Layer Scaling-Up
Yifei Yang, Zouying Cao, Xinbei Ma, Yao Yao, Libo Qin, Zhi Chen, and Hai Zhao. LESA: Learnable LLM layer scaling-up. arXiv preprint arXiv:2502.13794.
-
[29]
Efficient Construction of Model Family through Progressive Training Using Model Expansion
Kazuki Yano, Sho Takase, Sosuke Kobayashi, Shun Kiyono, and Jun Suzuki. Efficient construction of model family through progressive training using model expansion. arXiv preprint arXiv:2504.00623.
-
[30]
Which Attention Heads Matter for In-Context Learning?
Kayo Yin and Jacob Steinhardt. Which attention heads matter for in-context learning? arXiv preprint arXiv:2502.14010.
-
[31]
Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts
Youxiang Zhu, Ruochen Li, Danqing Wang, Daniel Haehn, and Xiaohui Liang. Focus directions make your language models pay more attention to relevant contexts. arXiv preprint arXiv:2503.23306.
-
[32]
While limiting additional hyperparameter tuning, we fix the learning rates of key and value parameters to the maximum learning rate without scheduling and set their weight decay to zero. For Llama-3.2-1B on the FineWeb-Edu subset, we follow the hyperparameter configuration of prior work and apply the same search protocol to all baselines, including ours. Du...
-
[33]
and Databricks-Dolly-15k (15k examples) (Conover et al., 2023). To construct the MathPile subset, we extract math-related text from the six components of the original MATHPILE (Wang et al.,
-
[34]
Table excerpt: perplexity (↓) and zero-shot accuracy (↑) with Llama-3.2-1B across DUS baselines and MIDUS.
Methods | Wiki-PPL | ARC | LogiQA | Wino | CSQA | BoolQ | PIQA | MMLU | Average
CPT-1B Base | 13.07 | 68.64 | 21.35 | 60.30 | 27.52 | 63.27 | 75.24 | 30.49 | 49.55
SOLAR | 13.19 | 69.40 | 23.04 | 59.04 | 27.52 | 60.03 | 75.14 | 30.77 | 49.28
Llama Pro | 12.47 | 68.10 | 23.81 | 60.54 | 41.93 | 63.70 | 75.24 | 34.03 | 52.48
LESA | 11.77 | 66.46 | 23.66 | 60.77 | 49...
-
[36]
Average zero-shot accuracy rises steadily with capacity, while perplexity stays essentially flat
11.64 66.16 23.20 61.56 46.27 65.29 75.08 36.91 53.50. In Table 7, we vary the total number of product-key memories from tens of thousands to one million, i.e., n=16,32,64 in the PKM factorization. Average zero-shot accuracy rises steadily with capacity, while perplexity stays essentially flat. We hypothesize that larger tables mainly help tasks that benefit f...
-
[37]
F.5 Ablation Study on DUS Placement Policy
F.5 Ablation study on DUS placement policy. Table 11: MIDUS-HML under different DUS policies. GPU Memory and Time are training requirements. Top-heavy places expanded blocks toward the top, Distributed interleaves, and Bottom-heavy places them near the input. GB↓ s/iter↓ Perplexity↓ Zero-shot Accuracy↑ DUS Policy GPU Memory Time Wiki-PPL ARC LogiQA W...
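The three placement policies named in the ablation excerpt (Top-heavy, Distributed, Bottom-heavy) can be read as different choices of insertion indices for the expanded blocks. The helper below is a hypothetical sketch of that reading; the function name and index convention are assumptions, not the paper's implementation.

```python
# Hypothetical sketch: where k expanded blocks land among L backbone layers
# under the three DUS placement policies named in the ablation.
def placement(L, k, policy):
    """Return backbone layer indices after which a new block is inserted."""
    if policy == "top-heavy":        # cluster near the output
        return list(range(L - k, L))
    if policy == "bottom-heavy":     # cluster near the input
        return list(range(k))
    if policy == "distributed":      # interleave evenly across the stack
        return [round((i + 1) * L / (k + 1)) for i in range(k)]
    raise ValueError(policy)

for p in ("top-heavy", "distributed", "bottom-heavy"):
    print(p, placement(16, 4, p))
```

For a 16-layer backbone with 4 added blocks, top-heavy yields [12, 13, 14, 15], bottom-heavy yields [0, 1, 2, 3], and distributed spreads the inserts roughly evenly.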