Sensitivity-Positional Co-Localization in GQA Transformers
Recognition: 2 theorem links · Lean theorem
Pith reviewed 2026-05-10 18:10 UTC · model grok-4.3
The pith
In GQA transformers, task-sensitive layers and RoPE-influential layers are anti-localized, yet adapting both at the sensitivity-identified layers produces the largest gains across benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Contrary to the co-localization hypothesis, task-sensitive layers concentrate in the late network while RoPE-influential layers dominate the early network, yielding Spearman rs = -0.735. Despite this anti-localization, a 4-way cross-layer ablation shows that applying both LoRA and GQA-aware RoPE adaptations to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks, approaching top closed-model results on code generation at modest total compute cost.
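As a concrete reading of the anti-localization statistic, the sketch below shows how a Spearman rank correlation between per-layer sensitivity and RoPE-influence profiles would be computed. The profile values are placeholders shaped like the reported trend, not the paper's measurements.

```python
# Minimal sketch: Spearman correlation between per-layer task-sensitivity
# scores and per-layer RoPE-influence scores for a 32-layer model.
# The arrays below are placeholders, not the paper's data.
import numpy as np
from scipy.stats import spearmanr

n_layers = 32
rng = np.random.default_rng(0)

# Placeholder profiles shaped like the paper's finding: sensitivity rises
# with depth, RoPE influence falls with depth.
sensitivity = np.linspace(0.1, 1.0, n_layers) + 0.05 * rng.standard_normal(n_layers)
rope_influence = np.linspace(1.0, 0.1, n_layers) + 0.05 * rng.standard_normal(n_layers)

r_s, p_value = spearmanr(sensitivity, rope_influence)
print(f"Spearman r_s = {r_s:.3f}, p = {p_value:.2e}")  # strongly negative => anti-localization

# Disjoint top-10 layer sets are another view of the same anti-localization.
top_sensitive = set(np.argsort(sensitivity)[-10:])
top_rope = set(np.argsort(rope_influence)[-10:])
print("overlap of top-10 sets:", sorted(top_sensitive & top_rope))
```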
What carries the argument
The correctness-differential hidden-state metric that selects layers by the magnitude of hidden-state change between correct and incorrect predictions, together with the per-KV-head scalar multipliers in GARFA.
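A minimal sketch of that selection metric, assuming the cosine-distance form quoted further down this page (one minus cosine similarity between mean-pooled hidden states of paired correct and incorrect inputs). The pairing scheme, pooling, and exclusion rules here are assumptions, not the authors' exact pipeline.

```python
# Sketch of the correctness-differential layer-sensitivity score: one minus the
# cosine similarity between mean-pooled hidden states of a correct and an
# incorrect example, averaged over pairs. How pairs are formed and which runs
# are excluded is an assumption here, not the paper's exact recipe.
import torch
import torch.nn.functional as F

def layer_sensitivity(hidden_correct: list[torch.Tensor],
                      hidden_incorrect: list[torch.Tensor]) -> torch.Tensor:
    """hidden_*: per-layer tensors of shape (num_pairs, seq_len, d_model)."""
    scores = []
    for h_pos, h_neg in zip(hidden_correct, hidden_incorrect):
        # Mean-pool over token positions, then take cosine distance per pair.
        pooled_pos = h_pos.mean(dim=1)                               # (num_pairs, d_model)
        pooled_neg = h_neg.mean(dim=1)
        cos = F.cosine_similarity(pooled_pos, pooled_neg, dim=-1)    # (num_pairs,)
        scores.append((1.0 - cos).mean())
    return torch.stack(scores)                                       # (num_layers,)

# Layers would then be ranked by this score and the top-k kept as the
# adaptation target set, e.g.:
# scores = layer_sensitivity(hs_correct, hs_incorrect)
# target_layers = scores.argsort(descending=True)[:9]
```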
If this is right
- Restricting adaptation to sensitivity-identified layers outperforms random, early-only, late-only, and full-network choices.
- The combination of restricted LoRA and GARFA at those layers produces consistent gains on knowledge, math, and code tasks.
- The observed anti-localization pattern implies that depth-dependent specialization separates correctness sensitivity from positional leverage.
- Targeted adaptation at a small subset of layers can reach near state-of-the-art results on selected benchmarks with low compute.
Where Pith is reading between the lines
- Early layers may primarily manage structural and positional information while later layers refine task-specific decisions.
- The same sensitivity metric could be reused across tasks to select a reusable adaptation target set.
- If anti-localization appears in other GQA or standard transformer families, adaptation recipes could be simplified to a one-time sensitivity scan.
Load-bearing premise
The hidden-state difference between correct and incorrect outputs reliably identifies the layers whose adaptation will improve task performance the most.
What would settle it
Running the identical four-way ablation on the same model and benchmarks and finding no performance advantage for the sensitivity-layer configuration would falsify the central practical claim.
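A sketch of what that replication check could look like, assuming per-benchmark accuracies for each of the four configurations. The configuration names and numbers below are placeholders, and a paired test across benchmarks stands in for whatever statistics a replication would actually report.

```python
# Sketch of the falsification check: compare the sensitivity-layer configuration
# against the alternatives on matched benchmarks. Accuracies are placeholders;
# a real replication would fill them in from the rerun 4-way ablation.
from scipy.stats import ttest_rel

benchmarks = ["MMLU", "GPQA", "HumanEval+", "MATH", "MGSM", "ARC"]
results = {
    "sensitivity_layers": [0.68, 0.35, 0.67, 0.42, 0.55, 0.82],  # placeholder
    "early_layers":       [0.62, 0.30, 0.55, 0.34, 0.48, 0.76],  # placeholder
    "random_layers":      [0.60, 0.29, 0.53, 0.33, 0.47, 0.74],  # placeholder
    "uniform_layers":     [0.63, 0.31, 0.57, 0.36, 0.49, 0.77],  # placeholder
}

target = results["sensitivity_layers"]
for name, scores in results.items():
    if name == "sensitivity_layers":
        continue
    deltas = [t - s for t, s in zip(target, scores)]
    stat, p = ttest_rel(target, scores)  # paired across the six benchmarks
    print(f"{name}: mean delta = {sum(deltas) / len(deltas):+.3f}, p = {p:.3f}")
# The central practical claim fails if these deltas are near zero or negative.
```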
Original abstract
We investigate a fundamental structural question in Grouped Query Attention (GQA) transformers: do the layers most sensitive to task correctness coincide with the layers where positional encoding adaptation has the greatest leverage? We term this the co-localization hypothesis and test it on Llama 3.1 8B, a 32-layer GQA model with a 4:1 query-to-key-value head ratio. We introduce LSLORA, which restricts LoRA adaptation to layers identified via a novel correctness-differential hidden-state metric, and GARFA (GQA-Aware RoPE Frequency Adaptation), which attaches 8 learnable per-KV-head scalar multipliers to each targeted layer. Contrary to the co-localization hypothesis, we discover strong anti-localization: task-sensitive layers concentrate in the late network ($\ell\in\{23\text{-}31\}$) while RoPE-influential layers dominate the early network ($\ell\in\{0\text{-}9\}$), yielding Spearman $r_s = -0.735$ ($p = 1.66\times10^{-6}$). Despite this anti-localization, a 4-way cross-layer ablation shows that applying both interventions to the sensitivity-identified layers outperforms all alternative configurations by 4-16 percentage points across six diverse benchmarks (MMLU, GPQA, HumanEval+, MATH, MGSM, ARC), approaching Claude 3.5 Haiku on HumanEval+ (67.1% vs. 68.3%) at $100 total compute cost.
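A minimal sketch of how the GARFA multipliers described above could enter the RoPE angle computation, following the per-layer frequency formula quoted later on this page. The module name, initialization, and default sizes (8 KV heads, head dimension 128, base 500,000 as in Llama 3.1) are assumptions rather than the authors' implementation.

```python
# Minimal sketch of GARFA as described in the abstract: 8 learnable scalar
# multipliers (one per KV head) that rescale the RoPE base frequency at a
# targeted layer. Names, initialization, and how the angles are consumed
# downstream are assumptions, not the authors' code.
import torch
import torch.nn as nn

class GARFALayer(nn.Module):
    def __init__(self, num_kv_heads: int = 8, head_dim: int = 128,
                 theta_base: float = 500_000.0):
        super().__init__()
        self.head_dim = head_dim
        self.theta_base = theta_base
        # One learnable multiplier alpha_k per KV head, initialized at 1 (identity RoPE).
        self.alpha = nn.Parameter(torch.ones(num_kv_heads))

    def angles(self, positions: torch.Tensor) -> torch.Tensor:
        """Rotation angles theta_{m,i}^{(k)} = m * (theta_base * alpha_k)^(-2i / d_h).

        positions: (seq_len,) integer positions m.
        returns:   (num_kv_heads, seq_len, head_dim // 2)
        """
        i = torch.arange(self.head_dim // 2, dtype=torch.float32)        # (d_h/2,)
        base = self.theta_base * self.alpha                              # (num_kv_heads,)
        inv_freq = base[:, None] ** (-2.0 * i[None, :] / self.head_dim)  # (heads, d_h/2)
        return positions[None, :, None].float() * inv_freq[:, None, :]   # (heads, seq, d_h/2)

# Usage: these angles would feed the usual cos/sin rotation of keys (and, per
# head group, queries) at the targeted layers only; all other layers keep the
# frozen base RoPE.
```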
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper tests the co-localization hypothesis in GQA transformers on Llama 3.1 8B. It reports strong anti-localization (Spearman rs = -0.735, p = 1.66e-6) between task-sensitive layers (late, ℓ=23-31, via a correctness-differential hidden-state metric) and RoPE-influential layers (early, ℓ=0-9). It introduces LSLORA (LoRA restricted to sensitivity-identified layers) and GARFA (GQA-aware RoPE frequency adaptation with 8 learnable per-KV-head scalar multipliers), and claims via 4-way ablation that applying both interventions to sensitivity layers outperforms alternatives by 4-16 pp across MMLU, GPQA, HumanEval+, MATH, MGSM, and ARC, approaching Claude 3.5 Haiku on HumanEval+ at low cost.
Significance. If the central empirical results hold, the anti-localization finding and targeted adaptation approach offer a concrete contribution to understanding layer specialization in GQA models and to low-cost fine-tuning methods. The multi-benchmark ablation design and the minimal parameter count in GARFA are clear strengths. The work is an empirical study with no circular derivations.
major comments (3)
- [§3] §3 (method for layer identification): The correctness-differential hidden-state metric is load-bearing for both the anti-localization result and the ablation layer selection, yet the manuscript provides no explicit formula, reference baseline, hidden-state aggregation method, or data exclusion rules, leaving the metric's validity and reproducibility unclear.
- [§5] §5 (ablation experiments): The 4-way cross-layer ablation claims 4-16 pp gains but reports neither error bars, the precise definitions of the three alternative configurations, nor statistical tests on the performance deltas; this directly affects confidence in the claim that sensitivity-targeted layers are superior.
- [§6] §6 (discussion and scope): All results, including the anti-localization correlation and ablation superiority, are obtained exclusively on Llama 3.1 8B and the six listed benchmarks; no replication on other GQA architectures or tasks is presented, so the load-bearing assumption that the sensitivity metric and anti-localization are general properties of GQA transformers remains untested.
minor comments (2)
- [Abstract] Abstract: The acronyms LSLORA and GARFA appear without parenthetical expansion or brief definition, reducing immediate readability.
- [Throughout] Throughout: The $100 total compute cost claim would be strengthened by an explicit breakdown of training steps, batch size, and hardware used for the reported runs.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on methodological clarity, experimental reporting, and scope. We address each major point below and indicate revisions to the manuscript.
Point-by-point responses
-
Referee: [§3] §3 (method for layer identification): The correctness-differential hidden-state metric is load-bearing for both the anti-localization result and the ablation layer selection, yet the manuscript provides no explicit formula, reference baseline, hidden-state aggregation method, or data exclusion rules, leaving the metric's validity and reproducibility unclear.
Authors: We agree that the description of the correctness-differential hidden-state metric was insufficiently detailed. We will revise Section 3 to include the explicit formula for the metric (difference in hidden-state statistics between correct and incorrect predictions), the reference baseline (base model hidden states), the aggregation method (mean over token positions within each layer), and data exclusion rules (filtering low-confidence predictions). These additions will support reproducibility. revision: yes
-
Referee: [§5] §5 (ablation experiments): The 4-way cross-layer ablation claims 4-16 pp gains but reports neither error bars, the precise definitions of the three alternative configurations, nor statistical tests on the performance deltas; this directly affects confidence in the claim that sensitivity-targeted layers are superior.
Authors: We accept that the ablation results require stronger statistical support. In the revised manuscript, we will report error bars from multiple random seeds, provide exact definitions of the alternative configurations (early layers ℓ=0-9, random selection, and uniform sampling), and include statistical tests (e.g., paired t-tests) on the performance deltas to quantify the superiority of sensitivity-targeted layers. revision: yes
-
Referee: [§6] §6 (discussion and scope): All results, including the anti-localization correlation and ablation superiority, are obtained exclusively on Llama 3.1 8B and the six listed benchmarks; no replication on other GQA architectures or tasks is presented, so the load-bearing assumption that the sensitivity metric and anti-localization are general properties of GQA transformers remains untested.
Authors: We acknowledge the scope limitation. All reported results are from Llama 3.1 8B. We will expand Section 6 to explicitly note this restriction, clarify that the findings are demonstrated for this representative GQA model, and recommend future replication on other GQA architectures as an important direction. revision: partial
- Deferred to future work: replication of the sensitivity metric, anti-localization correlation, and ablation results on GQA architectures other than Llama 3.1 8B.
Circularity Check
No circularity: empirical ablation study without derivations or self-referential reductions
full rationale
The paper is a purely empirical study that introduces a correctness-differential hidden-state metric to select layers, applies LSLORA and GARFA interventions, measures anti-localization via Spearman correlation on Llama 3.1 8B, and validates via 4-way cross-layer ablations across six benchmarks. No equations, predictions, or derivations exist that reduce by construction to fitted parameters, self-citations, or ansatzes. All load-bearing claims rest on experimental outcomes that are independently falsifiable and do not exhibit any of the six enumerated circularity patterns.
Axiom & Free-Parameter Ledger
free parameters (1)
- 8 learnable per-KV-head scalar multipliers per targeted layer (GARFA)
invented entities (2)
- LSLORA: no independent evidence
- GARFA: no independent evidence
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
"We quantify each layer's role in task correctness through the cosine distance between mean-pooled hidden states for paired correct vs. incorrect inputs... $\delta_\ell(x^+, x^-) = 1 - \frac{h^+_\ell \cdot h^-_\ell}{\|h^+_\ell\| \, \|h^-_\ell\|}$"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
"GARFA... attaches 8 learnable per-KV-head scalar multipliers... $\theta^{(k,\ell)}_{m,i} = m \cdot (\theta_{\text{base}} \cdot \alpha^{(\ell)}_k)^{-2i/d_h}$"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Anthropic. Claude 3.5 Haiku: Anthropic's fastest model. https://www.anthropic.com/claude/haiku, 2024.
- [2] OpenAI. GPT-4o system card. https://openai.com/index/gpt-4o-system-card, 2024.
- [3] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024. https://arxiv.org/abs/2407.21783.
- [4] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations (ICLR), 2022. https://arxiv.org/abs/2106.09685.
- [5] Qingru Zhang, Minshuo Chen, Alexander Bukharin, Nikos Karampatziakis, Pengcheng He, Yu Cheng, Weizhu Chen, and Tuo Zhao. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. In International Conference on Learning Representations (ICLR), 2023. https://arxiv.org/abs/2303.10512.
- [6] Yihua Gu et al. IGU-LoRA: Integrated gradients utilization for parameter-efficient fine-tuning. arXiv preprint, 2024.
- [7] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- [8] Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023.
- [9] Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. LongRoPE: Extending LLM context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753, 2024.
- [10] Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023.
- [11] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR), 2019.
- [12] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS), 2023. https://arxiv.org/abs/2305.14314.
- [13] Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. Magicoder: Empowering code generation with OSS-Instruct. arXiv preprint arXiv:2312.02120, 2023.
- [14] Sahil Chaudhary. Code Alpaca: An instruction-following LLaMA model trained on code generation instructions. https://github.com/sahil280114/codealpaca, 2023.
- [15] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2024.
- [16] Teknium. OpenHermes 2.5: An open dataset of primarily GPT-4 generated instruction tuning data. https://huggingface.co/datasets/teknium/OpenHermes-2.5, 2023.
- [17] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2021.
- [18] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022, 2023.
- [19] Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by ChatGPT really correct? Rigorous evaluation of large language models with EvalPlus. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [20] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
- [21] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners. arXiv preprint arXiv:2210.03057, 2022.
- [22] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- [23] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. A framework for few-shot language model evaluation. https://github.com/EleutherAI/lm-evaluation-harness, 2021.
- [24] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Proceedings of NAACL-HLT, 2019.
- [25] Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. In International Conference on Machine Learning (ICML), 2024. https://arxiv.org/abs/2402.09353.
- [26] Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models. arXiv preprint arXiv:2402.12354, 2024.
- [27] Ofir Press, Noah A. Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations (ICLR), 2022.
- [28] Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
- [29] Anna Rogers, Olga Kovaleva, and Anna Rumshisky. A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8:842-866, 2020.
- [30] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022.