Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

Abhay Yadav

arXiv: 2605.23171 · v1 · pith:2HY3YOD7new · submitted 2026-05-22 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-25 04:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords noisy embeddings · instruction finetuning · symmetric noise · local curvature · language models · NEFTune · SymNoise · AlpacaEval

The pith

Symmetric noise in embeddings improves instruction finetuning by more stringently regulating local curvature than uniform noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why adding noise to embeddings during instruction fine-tuning helps language models, focusing on the uniform noise used in NEFTune. It finds that uniform and Gaussian noise give comparable results, then proposes symmetric noise as a new approach called SymNoise. This method is shown to produce better instruction-following models, lifting AlpacaEval scores from 64.69 percent with NEFTune to 69.04 percent on LLaMA-2-7B fine-tuned with Alpaca. The improvement is attributed to tighter control over the local curvature of the learned function. The gains hold across other models and datasets such as Evol-Instruct and ShareGPT.

Core claim

When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79 percent score on AlpacaEval. SymNoise raises this score to 69.04 percent using symmetric noisy embeddings. This is a 6.7 percent relative improvement (4.35 percentage points) over the state-of-the-art method, NEFTune (64.69 percent). The paper argues that symmetric noise regulates the model's local curvature more stringently than uniform noise, and that this curvature control drives the performance gains. Theoretical and empirical analysis indicates comparable performance between uniform and Gaussian noise.
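The gap between the NEFTune and SymNoise scores can be read either as percentage points or as a relative change, and the two readings give different numbers. The sketch below is only an arithmetic check on the scores quoted above; the helper names are hypothetical and do not come from the paper.

```python
def absolute_gain(new_score: float, old_score: float) -> float:
    """Gain in percentage points (simple difference of two scores)."""
    return new_score - old_score


def relative_gain(new_score: float, old_score: float) -> float:
    """Gain as a percentage of the old score."""
    return 100.0 * (new_score - old_score) / old_score


# AlpacaEval scores quoted in the abstract (LLaMA-2-7B fine-tuned on Alpaca).
neftune, symnoise = 64.69, 69.04

print(f"absolute: {absolute_gain(symnoise, neftune):.2f} points")
print(f"relative: {relative_gain(symnoise, neftune):.1f} percent")
```

Run as written, this prints 4.35 points and 6.7 percent, so the "6.7" figure corresponds to the relative reading.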

What carries the argument

Symmetric noise added to embeddings, which imposes stricter regulation on the model's local curvature during fine-tuning.
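A minimal sketch of what a symmetric perturbation of an embedding matrix might look like, under stated assumptions. The only detail taken from the paper text quoted on this page is that each noise component is −1 or +1 with probability 1/2. The `alpha / sqrt(seq_len * dim)` scaling is borrowed from NEFTune, and the pairing of `x + eps` with `x - eps` is one reading of the paper's stated goal `f(x+ε)=f(x−ε)`; neither is confirmed here as the author's exact implementation. All function names are hypothetical.

```python
import math
import random


def symmetric_noise(seq_len, dim, alpha=5.0, seed=0):
    """Draw a (seq_len x dim) noise matrix with entries +/- scale.

    Assumption: scale = alpha / sqrt(seq_len * dim), as in NEFTune.
    """
    rng = random.Random(seed)
    scale = alpha / math.sqrt(seq_len * dim)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(dim)]
            for _ in range(seq_len)]


def perturb_pair(embeddings, noise):
    """Return the two mirrored inputs x + eps and x - eps (assumed pairing)."""
    plus = [[e + n for e, n in zip(e_row, n_row)]
            for e_row, n_row in zip(embeddings, noise)]
    minus = [[e - n for e, n in zip(e_row, n_row)]
             for e_row, n_row in zip(embeddings, noise)]
    return plus, minus


# Toy embedding matrix: 4 tokens, 8 dimensions.
toy_embeddings = [[0.1 * (i + j) for j in range(8)] for i in range(4)]
toy_noise = symmetric_noise(seq_len=4, dim=8, alpha=5.0, seed=1)
toy_plus, toy_minus = perturb_pair(toy_embeddings, toy_noise)
```

Because the two perturbed inputs are mirror images, their average recovers the clean embedding exactly, which is the sense in which the perturbation is symmetric around the input.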

If this is right

SymNoise raises AlpacaEval from 64.69 percent to 69.04 percent on LLaMA-2-7B with the Alpaca dataset.
The method outperforms NEFTune on multiple models and on stronger instruction datasets including Evol-Instruct, ShareGPT, and OpenPlatypus.
Uniform and Gaussian noise produce comparable results, reducing the prior emphasis on uniform noise as uniquely effective.
Symmetric noise provides a concrete way to strengthen curvature control in embedding perturbations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Curvature control through noise symmetry could be tested as a design principle for other regularization methods beyond embeddings.
If the curvature mechanism holds, SymNoise might combine with other fine-tuning tricks such as data augmentation or longer training schedules.
The approach invites experiments on whether symmetric perturbations improve performance in domains outside language modeling, such as vision or multimodal models.

Load-bearing premise

The performance gains result from symmetric noise regulating local curvature more stringently than uniform noise does.

What would settle it

A direct measurement of local curvature on models trained with SymNoise versus NEFTune that finds no difference in curvature despite the performance gap.
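One way such a measurement could be set up is with a second-order central difference, which uses the same mirrored inputs `x + eps` and `x - eps`. The sketch below is a toy illustration on scalar functions, not the paper's protocol: `f` stands in for any scalar model output (for example a loss), and the function names, radius, and sample count are assumptions.

```python
import random


def curvature_proxy(f, x, eps):
    """Second-order central difference of f at x along eps, per unit ||eps||^2.

    For a linear f this is 0; for f(x) = sum(x_i^2) it is 2.
    """
    x_plus = [xi + ei for xi, ei in zip(x, eps)]
    x_minus = [xi - ei for xi, ei in zip(x, eps)]
    second_diff = f(x_plus) + f(x_minus) - 2.0 * f(x)
    eps_sq = sum(e * e for e in eps)
    return second_diff / eps_sq


def mean_curvature_proxy(f, x, radius=1e-2, n_samples=32, seed=0):
    """Average |curvature_proxy| over random +/- radius directions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = [radius * rng.choice((-1.0, 1.0)) for _ in x]
        total += abs(curvature_proxy(f, x, eps))
    return total / n_samples


def linear(v):
    return 3.0 * v[0] - 2.0 * v[1]


def quadratic(v):
    return sum(vi * vi for vi in v)


point = [0.5, -1.0]
flat = mean_curvature_proxy(linear, point)
curved = mean_curvature_proxy(quadratic, point)
```

Comparing this kind of proxy between a SymNoise-trained and a NEFTune-trained model at matched inputs would be one test of the curvature claim; a real experiment would need to choose the output `f`, the radius, and the evaluation set with care.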

Figures

Figures reproduced from arXiv: 2605.23171 by Abhay Yadav.

**Figure 1.** Comparison of average L2 norm ratios for Gaussian and Bernoulli noise relative to Uniform noise as a function of dimensionality. Drawing from Lemma 1 and Lemma 2, it is apparent that the expected noise from the Gaussian distribution is √3 times that of the Uniform distribution. Consequently, to equate the noise scales for comparison, the noise scaling factor for the Gaussian distribution should be adjuste…

**Figure 2.** Gaussian/Uniform Average L2 Norm Ratio as a Function of Dimensionality. The plot illustrates the ratio of the average L2 norm of points drawn from a Gaussian distribution to that of a Uniform distribution, with the number of points fixed at 256 and the dimensionality varying from 1 to 4096.

**Figure 3.** Bernoulli/Uniform Average L2 Norm Ratio as a Function of Dimensionality. The plot depicts the ratio of the average L2 norm of points drawn from a Bernoulli distribution to that of a Uniform distribution, with the number of points fixed at 256 and the dimensionality varying from 1 to 4096.

**Figure 4.** Gaussian/Uniform Average L2 Norm Ratio for Varying Number of Points. The plot illustrates the ratio of the average L2 norm of points drawn from a Gaussian distribution to that of a Uniform distribution, with the dimensionality fixed at 4096 and the number of points varying from 64 to 256.

**Figure 5.** Bernoulli/Uniform Average L2 Norm Ratio for Varying Number of Points. The plot depicts the ratio of the average L2 norm of points drawn from a Bernoulli distribution to that of a Uniform distribution, with the dimensionality fixed at 4096 and the number of points varying from 64 to 256.
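The norm ratios in these figures can be approximated with a small Monte Carlo simulation. The sketch below assumes Uniform noise on [−1, 1], standard Gaussian noise, and Bernoulli noise on {−1, +1}; these ranges are assumptions chosen to be consistent with the √3 ratio in the Figure 1 caption, and the smaller default sizes are picked only to keep a pure-Python run fast. The function names are hypothetical.

```python
import math
import random

# Per-coordinate samplers for the three noise types (assumed parameterisation).
SAMPLERS = {
    "uniform": lambda rng: rng.uniform(-1.0, 1.0),
    "gaussian": lambda rng: rng.gauss(0.0, 1.0),
    "bernoulli": lambda rng: rng.choice((-1.0, 1.0)),
}


def mean_l2_norm(sampler, dim, n_points, rng):
    """Average L2 norm of n_points vectors with i.i.d. coordinates."""
    total = 0.0
    for _ in range(n_points):
        total += math.sqrt(sum(sampler(rng) ** 2 for _ in range(dim)))
    return total / n_points


def norm_ratios(dim=1024, n_points=64, seed=0):
    """Ratios of average L2 norms relative to Uniform noise."""
    rng = random.Random(seed)
    norms = {name: mean_l2_norm(s, dim, n_points, rng)
             for name, s in SAMPLERS.items()}
    return {
        "gaussian/uniform": norms["gaussian"] / norms["uniform"],
        "bernoulli/uniform": norms["bernoulli"] / norms["uniform"],
    }


ratios = norm_ratios()
print(ratios, "sqrt(3) =", math.sqrt(3))
```

At large dimension both ratios should sit close to √3 ≈ 1.732, since each Uniform coordinate has variance 1/3 while Gaussian and Bernoulli coordinates have variance 1. Dividing the Gaussian or Bernoulli scale by √3 would therefore match the expected noise magnitude across distributions.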

read the original abstract

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Symmetric noise gives reported gains over NEFTune but the curvature control claim needs the math and measurements to back it up.

The short version: this work proposes symmetric noise for embeddings in instruction fine-tuning as an improvement over NEFTune's uniform noise, and it reports a 6.7 percent relative increase on AlpacaEval for LLaMA-2-7B (64.69 to 69.04 percent, about 4.35 points), along with gains on other datasets.

The paper does a decent job extending the noise injection approach by comparing noise types and introducing symmetry as a way to control local curvature more tightly. It tests the method on multiple models and stronger baselines like Evol-Instruct and ShareGPT, which is good practice for showing robustness. What stands out as new is the specific use of symmetric noise and the performance numbers attached to it. The attempt to explain the original NEFTune findings with both theoretical and empirical analysis is also a positive step, even if the details are not in the abstract.

The soft spots are in the support for the central mechanism. The claim that symmetric noise regulates curvature more stringently than uniform noise is stated, but the abstract gives no equations or derivation to show this. Nor does it mention direct measurements of curvature, or ablations that would confirm this is what drives the gains rather than other differences in how the noise is applied. The empirical results are given as final scores without enough detail on whether the NEFTune baseline was reproduced under the exact same conditions. If the full paper includes the promised theoretical analysis with clear math, and the experiments have proper controls and measurements, then the work is more convincing. From what's here, the performance improvement is the main thing worth checking.

This paper is for people doing practical work on fine-tuning language models and looking for small changes that might improve results. A reader focused on regularization techniques during training might also find the curvature angle useful if it is developed.

It deserves a serious referee because the idea is accessible and the reported improvements are large enough that confirming or refuting them would be valuable to the community. I recommend sending it for peer review so that the analysis and experiments can be properly evaluated.

Referee Report

3 major / 0 minor

Summary. The paper analyzes noise injection into embeddings during instruction fine-tuning of LLMs. It reports that uniform and Gaussian noise yield comparable performance (contrary to prior claims favoring uniform), provides theoretical analysis supporting this equivalence, and introduces SymNoise using symmetric noise, which is claimed to more stringently regulate local curvature and thereby improve model function. On LLaMA-2-7B fine-tuned with Alpaca, SymNoise reaches 69.04% on AlpacaEval (vs. 29.79% for standard fine-tuning and 64.69% for NEFTune), with consistent gains reported across other models and datasets such as Evol-Instruct and ShareGPT.

Significance. If the curvature-regulation mechanism is rigorously derived and the performance gains are shown to be reproducible with controls that isolate symmetry from other implementation factors, the work would advance understanding of noise distributions as regularizers in LLM fine-tuning and supply a practical improvement over NEFTune.

major comments (3)

[Abstract] the central claim that symmetric noise 'more stringently regulating its local curvature' produces the observed gains is unsupported; the text states that the theoretical analysis shows only that uniform and Gaussian noise are comparable, with no derivation, equation, or bound provided that demonstrates why symmetry tightens any curvature control relative to uniform noise.
[Abstract] the reported AlpacaEval scores (69.04% for SymNoise, 64.69% for NEFTune) are presented without any accompanying measurements of local curvature, Lipschitz constants, or Hessian-based quantities that would verify the proposed mechanism.
[Abstract] no ablation is described that isolates the symmetry of the noise distribution from other potential differences in implementation, hyper-parameters, or random seeds when comparing SymNoise to the NEFTune baseline.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the three major comments point by point below, proposing revisions to improve clarity and rigor where the concerns are valid.

Referee: [Abstract] the central claim that symmetric noise 'more stringently regulating its local curvature' produces the observed gains is unsupported; the text states that the theoretical analysis shows only that uniform and Gaussian noise are comparable, with no derivation, equation, or bound provided that demonstrates why symmetry tightens any curvature control relative to uniform noise.

Authors: The manuscript's Section 3 provides a theoretical derivation showing equivalence of uniform and Gaussian noise, followed by an extension demonstrating that symmetry enables stricter curvature regularization via zero-mean cancellation effects on the perturbation. The abstract condenses this result. We will revise the abstract to include a concise reference to the key bound derived in the theory section. revision: yes
Referee: [Abstract] the reported AlpacaEval scores (69.04% for SymNoise, 64.69% for NEFTune) are presented without any accompanying measurements of local curvature, Lipschitz constants, or Hessian-based quantities that would verify the proposed mechanism.

Authors: We agree that empirical verification of the curvature mechanism would strengthen the claims. The revised version will add a new subsection with measurements of local Lipschitz constants (or equivalent curvature proxies) across the compared methods to directly link performance gains to the proposed regularization effect. revision: yes
Referee: [Abstract] no ablation is described that isolates the symmetry of the noise distribution from other potential differences in implementation, hyper-parameters, or random seeds when comparing SymNoise to the NEFTune baseline.

Authors: The primary experiments control for model architecture, dataset, optimizer, and all hyperparameters, differing only in the noise distribution. To further isolate symmetry, we will add a dedicated ablation experiment that applies symmetric versus non-symmetric variants under identical random seeds and implementation details. revision: yes
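The local Lipschitz measurement mentioned in the second response could, in its simplest form, be a sampled estimate like the one below. This is a hedged sketch on a toy scalar function, not the authors' planned procedure: `f`, the radius, the sample count, and the function name are all assumptions, and a sampled maximum gives only a lower bound on the true local Lipschitz constant.

```python
import math
import random


def local_lipschitz_estimate(f, x, radius=1e-2, n_samples=64, seed=0):
    """Sampled lower bound on the local Lipschitz constant of f near x.

    Draws perturbations with coordinates in [-radius, radius] and returns
    the largest observed |f(x + delta) - f(x)| / ||delta||.
    """
    rng = random.Random(seed)
    fx = f(x)
    best = 0.0
    for _ in range(n_samples):
        delta = [rng.uniform(-radius, radius) for _ in x]
        norm = math.sqrt(sum(d * d for d in delta))
        if norm == 0.0:
            continue  # skip the degenerate zero perturbation
        x_shift = [xi + di for xi, di in zip(x, delta)]
        best = max(best, abs(f(x_shift) - fx) / norm)
    return best


def plane(v):
    # Linear toy function with gradient (3, 4), so the true constant is 5.
    return 3.0 * v[0] + 4.0 * v[1]


estimate = local_lipschitz_estimate(plane, [0.2, -0.7])
```

Fixing `seed` across the compared models keeps the perturbation directions identical, which is in the spirit of the identical-seed ablation promised in the third response.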

Circularity Check

0 steps flagged

No significant circularity; central claims are empirical performance results without load-bearing derivations

full rationale

The paper reports empirical AlpacaEval gains for SymNoise (69.04%) over NEFTune (64.69%) on LLaMA-2-7B with Alpaca and other datasets, attributing superiority to stricter local-curvature regulation by symmetric noise. The abstract references a 'theoretical analysis' only for comparability of uniform vs. Gaussian noise; no equations, curvature bounds, or derivations specific to symmetry are provided in the text. No self-citations, fitted parameters renamed as predictions, or self-definitional steps appear. Results are presented as direct experimental outcomes against external baselines, with no reduction of the claimed mechanism to inputs by construction. This is a standard empirical contribution without circular derivation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

pith-pipeline@v0.9.0 · 5799 in / 1047 out tokens · 18683 ms · 2026-05-25T04:49:07.354909+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we introduce a new fine-tuning method... utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature... Each noise component is generated with an equal probability of 1/2 for the values −1 and 1.

Foundation.AlphaCoordinateFixation J_uniquely_calibrated_via_higher_derivative · echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

our goal is to have the gradient approach zero in the immediate vicinity of an input altered by a minimal amount... f(x+ε)=f(x−ε)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 11 internal anchors

Source paper running header: Published as a conference paper at COLM 2025.

[1] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

[2] Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. Alpagasus: Training a better alpaca with fewer data. arXiv preprint arXiv:2307.08701.

[3] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90 https://lmsys.org/blog/2023-03-30-vicuna/, Mar

[4] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.

[5] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457.

[6] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. URL https://aclanthology.org/N19-1423/.

[7] Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387.

[8] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.

[9] Ariel N Lee, Cole J Hunter, and Nataniel Ruiz. Platypus: Quick, cheap, and powerful refinement of llms. arXiv preprint arXiv:2308.07317.

[10] Yonghyeon Lee and Frank C Park. On explicit curvature regularization in deep generative models. In Topological, Algebraic and Geometric Learning Workshops 2023, pp. 505–518. PMLR.

[11] David Nukrai, Ron Mokady, and Amir Globerson. Text-only training for image captioning using noise-injected clip. arXiv preprint arXiv:2211.00575.

[12] Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207.

[13] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée ...

[14] Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560.

[15] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.

[16] Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.

[17] Hanwei Xu, Yujun Chen, Yulun Du, Nan Shao, Yanggang Wang, Haiyu Li, and Zhilin Yang. Zeroprompt: scaling prompt-based pretraining to 1,000 tasks improves zero-shot generalization. arXiv preprint arXiv:2201.06910.

[18] Xiao Zang, Yi Xie, Siyu Liao, Jie Chen, and Bo Yuan. Noise injection-based regularization for point cloud processing. arXiv preprint arXiv:2103.15027.

[19] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830.

[20] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

[21] Wayne Xin Zhao, Yujian Shao, Jingyuan Li, Yuan Wang, Xinyu Li, Zihan Yu, Yujia Ji, Jing Chen, Fei Wang, and Ji-Rong Li. A survey of large language models. arXiv preprint arXiv:2303.18223.

[22] Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. Freelb: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764.
[23] A Appendix, A.1 Datasets. Alpaca (Taori et al., 2023): Developed using the Self-Instruct method by Wang et al. (2022) and the Text-Davinci-003 model (Ouyang et al., 2022), Alpaca leverages a small set of seed tasks to generate new instruction tuning tasks and filter out ineffective ones. This dataset has been instrumental in advancing ...

[24] Figure 3: Bernoulli/Uniform Average L2 Norm Ratio as a Function of Dimensionality (plot title: Bernoulli/Uniform Average L2 Norm Ratio (n=256); x-axis: Dimension d; y-axis: Average L2 Norm Ratio). The plot depicts the ratio of the average L2 norm of points drawn from a Bernoulli distribution to that of a ...

[25] A.2.2 Average L2 Norm Ratio with Varying Number of Points. Figure 4: Gaussian/Uniform Average L2 Norm Ratio for Varying Number of Poin... (plot title: Gaussian/Uniform Average L2 Norm Ratio (d=4096); x-axis: Number of Points; y-axis: Average L2 Norm Ratio).

[26] B Deferred proofs. In this section, we show the proofs omitted from Sec. 3 and Sec. 4.1. B.0.1 Proof of Lemma 1. We state again Lemma 1 from Sec. 3 and present the proof. Lemma
