TeRA: Vector-based Random Tensor Network for High-Rank Adaptation of Large Language Models

Danilo Mandic; Giorgos Iacovides; Wuyang Zhou; Yuxuan Gu

arxiv: 2509.03234 · v2 · submitted 2025-09-03 · 💻 cs.LG

Pith reviewed 2026-05-18 19:14 UTC · model grok-4.3

classification 💻 cs.LG

keywords Parameter-Efficient Fine-Tuning · Tensor Networks · High-Rank Adaptation · Large Language Models · LoRA · PEFT · Random Initialization · Tucker Decomposition

The pith

TeRA enables high-rank weight updates in LLMs while training only as many parameters as vector-based adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TeRA to resolve the usual trade-off in fine-tuning large language models, where high-rank updates require many more trainable parameters than simpler vector methods. It represents each weight update through a Tucker-like tensor network whose large factors are randomly initialized once, frozen, and shared across all layers. Only a small set of layer-specific scaling vectors is trained, keeping the total trainable count as low as basic vector adapters. Experiments show this construction matches or beats existing high-rank methods on standard adaptation tasks while theoretical checks and ablations confirm the random factors supply the needed expressivity.

Core claim

TeRA parametrizes the tensorized weight update matrix as a Tucker-like tensor network, whereby large randomly initialized factors are frozen and shared across layers, while only small layer-specific scaling vectors, corresponding to diagonal entries of factor matrices, are trained. This achieves high-rank weight updates while retaining the parameter efficiency of vector-based PEFT adapters, matching or even outperforming existing high-rank adapters.

What carries the argument

Tucker-like tensor network that decomposes the weight update, keeping large random factors frozen and shared while training only per-layer scaling vectors.
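The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the order-4 tensorization, the sizes in `dims` and `ranks`, the Gaussian initialization, and the helper name `tucker_like_update` are all hypothetical, and the placement of the diagonal scalings on the factor columns is one plausible reading of "scaling vectors corresponding to diagonal entries of factor matrices".

```python
import numpy as np

def tucker_like_update(core, factors, scales):
    """Contract core x_n (factor_n @ diag(scale_n)) over every mode n."""
    t = core
    for mode, (a, s) in enumerate(zip(factors, scales)):
        scaled = a * s  # scales column j of a by s[j], same as a @ np.diag(s)
        # Mode-n product: contract axis `mode` of t with the columns of `scaled`,
        # then move the new axis back to position `mode`.
        t = np.moveaxis(np.tensordot(scaled, t, axes=(1, mode)), 0, mode)
    return t

rng = np.random.default_rng(0)
dims = (4, 4, 4, 4)    # hypothetical tensorization of a 16 x 16 weight update
ranks = (4, 4, 4, 4)   # hypothetical core sizes R_1..R_4

# Frozen and shared across layers (Gaussian initialization is an assumption).
core = rng.standard_normal(ranks)
factors = [rng.standard_normal((i, r)) for i, r in zip(dims, ranks)]

# Trainable and layer-specific: one scaling vector per mode.
scales = [rng.uniform(0.5, 1.5, size=r) for r in ranks]

# Group the first two modes into rows and the last two into columns.
delta_w = tucker_like_update(core, factors, scales).reshape(16, 16)
n_trainable = sum(s.size for s in scales)
print(delta_w.shape, n_trainable)  # (16, 16) 16
```

In this toy setting only the 16 entries of `scales` would receive gradients, while `core` and `factors` would be drawn once and reused by every layer.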

If this is right

High-rank updates become feasible without increasing the trainable parameter budget beyond vector-based methods.
Adapter performance equals or exceeds prior high-rank techniques on language-model fine-tuning benchmarks.
The separation of shared random structure from per-layer scalings reduces redundancy across model layers.
Theoretical guarantees and ablation results support that the random tensor factors encode sufficient high-rank directions.
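To make the parameter-budget point concrete, the per-layer trainable counts can be compared with simple arithmetic. The LoRA and VeRA formulas below follow the standard definitions of those methods; the count for the scaling-vector construction assumes one trainable vector per tensor mode, and the layer size and ranks are hypothetical values chosen only for illustration, not figures from the paper.

```python
def lora_trainable(d_out, d_in, r):
    """LoRA trains B (d_out x r) and A (r x d_in)."""
    return r * (d_out + d_in)

def vera_trainable(d_out, r):
    """VeRA trains one vector of length d_out and one of length r."""
    return d_out + r

def scaling_vector_trainable(ranks):
    """Assumed count: one trainable scaling vector of length R_n per mode."""
    return sum(ranks)

# Hypothetical 4096 x 4096 projection layer.
print(lora_trainable(4096, 4096, r=8))             # 65536
print(vera_trainable(4096, r=256))                 # 4352
print(scaling_vector_trainable((64, 64, 64, 64)))  # 256
```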

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The shared random factors may implicitly align adaptation directions across layers without explicit coordination.
The same random-tensor pattern could be tested on other parameter-efficient methods such as prompt tuning.
Scaling the approach to models with thousands of layers would test whether the fixed factors remain effective without retraining.

Load-bearing premise

Randomly initialized and frozen large factors in the tensor network, when paired with only layer-specific scaling vectors, suffice to capture the high-rank information required for effective adaptation.

What would settle it

An ablation or benchmark run in which replacing the frozen random factors with learned ones yields no gain, or where TeRA falls measurably behind a comparable high-rank adapter on a task known to need high-rank capacity.

Figures

Figures reproduced from arXiv: 2509.03234 by Danilo Mandic, Giorgos Iacovides, Wuyang Zhou, Yuxuan Gu.

**Figure 2.** A comparison between LoRA (Hu et al. 2022) and our proposed TeRA method. LoRA represents the weight update [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** Rank analysis of ∆Wq (max allowed rank of 4096) and ∆Wv (max allowed rank of 1024) across Llama3-8B layers. TeRA consistently maintains a high (near-full) rank. In contrast, methods like LoRA and VeRA have lower-rank weight updates, limiting their expressivity. … a superior trade-off between model performance, high rank, and parameter efficiency. As shown in … [PITH_FULL_IMAGE:figures/full_fig_p002_3.png]

**Figure 5.** Rank of ∆Wq and ∆Wv (max possible rank = 4096) across different layers in Llama-2-7B under different tensorization schemes in the commonsense reasoning task. Initialization of Frozen Factor Matrices. We explore different initialization choices for the frozen factor matrices. Specifically, we compare TeRA with a variant, TeRAiden, where its frozen factor matrices are all identity matrices. Note that TeRAide…

**Figure 6.** Comparison between TeRA and TeRAiden on the commonsense reasoning dataset with Llama-2-7B. Conclusion: We have introduced TeRA, a high-rank PEFT adapter which utilizes a tensor network to parameterize the tensorized weight updates. In this way, TeRA offers a more effective alternative to existing vector-based adapters, achieving much better performances and high-rank updates but with a similar amount of tr…

**Figure 4.** Average accuracy across eight commonsense rea [PITH_FULL_IMAGE:figures/full_fig_p007_4.png]
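The rank analysis in Figure 3 rests on measuring the numerical rank of a weight update. As a minimal sketch of that measurement, the snippet below builds a LoRA-style update `B @ A` and confirms that its rank is capped at the adapter rank `r`. The matrix sizes and the helper name `numerical_rank` are hypothetical, and the paper's own rank-estimation procedure (for example, its tolerance) is not specified here.

```python
import numpy as np

def numerical_rank(m):
    """Count singular values above NumPy's default tolerance."""
    return int(np.linalg.matrix_rank(m))

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 4  # hypothetical layer size and LoRA rank

b = rng.standard_normal((d_out, r))
a = rng.standard_normal((r, d_in))
lora_update = b @ a  # a 64 x 64 matrix whose rank cannot exceed r

print(numerical_rank(lora_update))  # 4
```

However large the layer is, a product of a `d_out x r` and an `r x d_in` matrix has rank at most `r`, which is the limitation the figure contrasts with near-full-rank updates.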

read the original abstract

Parameter-Efficient Fine-Tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), have significantly reduced the number of trainable parameters needed in fine-tuning large language models (LLMs). The developments of LoRA-style adapters have considered two main directions: (1) enhancing model expressivity with high-rank adapters, and (2) aiming for further parameter reduction, as exemplified by vector-based methods. However, these approaches come with a trade-off, as achieving the expressivity of high-rank weight updates typically comes at the cost of sacrificing the extreme parameter efficiency offered by vector-based techniques. To address this issue, we propose a vector-based random Tensor network for high-Rank Adaptation (TeRA), a novel PEFT method that achieves high-rank weight updates while retaining the parameter efficiency of vector-based PEFT adapters. This is achieved by parametrizing the tensorized weight update matrix as a Tucker-like tensor network (TN), whereby large randomly initialized factors are frozen and shared across layers, while only small layer-specific scaling vectors, corresponding to diagonal entries of factor matrices, are trained. Comprehensive experiments demonstrate that TeRA matches or even outperforms existing high-rank adapters, while requiring as few trainable parameters as vector-based methods. Theoretical analysis and ablation studies validate the effectiveness of the proposed TeRA method. The code is available at https://github.com/guyuxuan9/TeRA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TeRA freezes large random tensor factors shared across layers and trains only small per-layer scaling vectors to claim high-rank updates at vector-level parameter cost. The construction uses a Tucker-like network where the big factors stay fixed and random, so the trainable part stays as cheap as pure vector adapters while the effective update rank is supposed to be higher. That specific combination of shared random factors plus diagonal scaling is the main new piece relative to prior LoRA variants and tensor adapters mentioned in the abstract.

The paper reports experiments where TeRA matches or beats other high-rank methods on standard benchmarks, plus some theoretical analysis and ablations, and the code is public. Those are the concrete positives worth noting.

The central assumption is that a single draw of random frozen factors already contains the directions needed for effective adaptation once you scale them per layer. If the random subspace is misaligned with the task gradients, scaling alone cannot add the missing components. The stress-test note flags exactly this point, and the paper would need clear ablations showing stable performance across different random seeds and alternative bases to make the claim convincing. Without those checks the efficiency gain is real but the expressivity guarantee looks fragile.

This work is aimed at people doing practical PEFT for LLMs who care about the expressivity-efficiency trade-off. Readers already following tensor or random-projection adapters will see the most direct value. The idea is clean enough and the claims are testable enough that it deserves a serious referee to examine the experiments and the robustness of the random factors.

Referee Report

2 major / 2 minor

Summary. The paper proposes TeRA, a PEFT method for LLMs that parametrizes weight updates via a Tucker-like tensor network. Large randomly initialized factors are frozen and shared across layers, while only small per-layer scaling vectors (corresponding to diagonal entries) are trained. This is claimed to deliver high-rank adaptation updates at the parameter cost of vector-based methods. Comprehensive experiments, theoretical analysis, and ablations are reported to show that TeRA matches or outperforms existing high-rank adapters while using as few trainable parameters as vector-based PEFT.

Significance. If the central claims are substantiated, TeRA would usefully bridge the expressivity-efficiency trade-off in PEFT by showing that a shared random tensor basis plus per-layer scalings can suffice for effective high-rank updates. The availability of code and the inclusion of both theoretical analysis and ablations are positive features that aid reproducibility and verification.

major comments (2)

[§3] §3 (Method), Tucker-like TN parametrization: The central claim that frozen, shared random factors plus per-layer scaling vectors produce effective high-rank updates rests on the untested assumption that a single random subspace already contains the principal adaptation directions across layers. If the random basis is misaligned with layer-wise gradient structure, scaling alone cannot recover the missing expressivity; the paper must demonstrate stability under different random seeds for the frozen factors.
[§4.3] §4.3 (Ablations) and experimental tables: The reported performance gains over high-rank baselines are load-bearing for the claim, yet the manuscript provides insufficient detail on whether ablations include replacement of the shared random factors by an independent draw or by a learned basis; without such controls the results cannot rule out that success depends on a fortunate random initialization rather than the architecture itself.

minor comments (2)

[§3] Notation for the scaling vectors and the precise definition of the Tucker contraction should be clarified with an explicit equation showing which modes are contracted and which remain diagonal.
[§4] Figure captions and table headers should explicitly state the number of trainable parameters for each compared method to make the efficiency claim immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the opportunity to address these points. We agree that additional empirical verification of stability and clearer ablation controls will strengthen the manuscript. We outline our responses below and will incorporate the suggested revisions.

read point-by-point responses

Referee: [§3] §3 (Method), Tucker-like TN parametrization: The central claim that frozen, shared random factors plus per-layer scaling vectors produce effective high-rank updates rests on the untested assumption that a single random subspace already contains the principal adaptation directions across layers. If the random basis is misaligned with layer-wise gradient structure, scaling alone cannot recover the missing expressivity; the paper must demonstrate stability under different random seeds for the frozen factors.

Authors: We acknowledge that demonstrating robustness to the choice of random seed for the shared frozen factors is valuable for substantiating the central claim. Section 3.2 provides a theoretical argument that a random Tucker-like basis can span the necessary high-rank space with high probability, but we agree this should be complemented by empirical checks. In the revised version we will add a new table (or subsection in §4) reporting mean and standard deviation of performance across at least five independent random seeds for the frozen factors on the main benchmarks, thereby directly addressing the concern about potential misalignment.
revision: yes

Referee: [§4.3] §4.3 (Ablations) and experimental tables: The reported performance gains over high-rank baselines are load-bearing for the claim, yet the manuscript provides insufficient detail on whether ablations include replacement of the shared random factors by an independent draw or by a learned basis; without such controls the results cannot rule out that success depends on a fortunate random initialization rather than the architecture itself.

Authors: We appreciate the referee highlighting the need for explicit controls that isolate the benefit of sharing a single random tensor basis. The existing ablations in §4.3 vary the scaling-vector dimension and core rank but do not yet include the requested variants. We will expand §4.3 with two new controls: (i) per-layer independent random draws of the factor matrices (instead of a shared draw), and (ii) a learned (non-frozen) basis version. These additions will be presented alongside the original results so readers can assess whether performance depends on a fortunate initialization or on the shared-random architecture itself.
revision: yes
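The seed-stability check discussed in this exchange can be prototyped cheaply before running any benchmark. The sketch below redraws the frozen factors under five seeds and reports the mean and standard deviation of the resulting update rank. It is only a proxy: the referee asks about task performance, which this does not measure. The order-4 Tucker-like contraction, the sizes, the Gaussian initialization, and the function name `update_rank` are all hypothetical choices made for illustration.

```python
import numpy as np

def update_rank(seed, dims=(4, 4, 4, 4), ranks=(4, 4, 4, 4)):
    """Rank of a toy Tucker-like update whose frozen factors depend on `seed`."""
    rng = np.random.default_rng(seed)
    core = rng.standard_normal(ranks)  # frozen random core (assumed Gaussian)
    # Frozen random factor times a stand-in for trained per-mode scalings.
    scaled = [rng.standard_normal((i, r)) * rng.uniform(0.5, 1.5, size=r)
              for i, r in zip(dims, ranks)]
    t = np.einsum('abcd,ia,jb,kc,ld->ijkl', core, *scaled)
    m = t.reshape(dims[0] * dims[1], dims[2] * dims[3])
    return int(np.linalg.matrix_rank(m))

ranks_by_seed = [update_rank(seed) for seed in range(5)]
print(ranks_by_seed, np.mean(ranks_by_seed), np.std(ranks_by_seed))
```

A spread near zero would suggest that the achievable rank does not hinge on a lucky draw in this toy setting; whether accuracy is equally stable is a separate question that only the promised benchmark table can answer.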

Circularity Check

0 steps flagged

Explicit architectural parametrization with no load-bearing self-definition or fitted-input prediction

full rationale

The paper defines TeRA directly as a Tucker-like tensor network in which large random factors are frozen and shared while only per-layer scaling vectors are trained. This is presented as an engineering choice that trades off expressivity and parameter count, not as a quantity derived from or equivalent to the fitted scaling vectors themselves. No equations reduce the claimed high-rank adaptation performance to the trainable parameters by construction, and no self-citation chain is invoked to justify uniqueness or to rename an existing result. The central claim therefore remains an independent architectural proposal whose validity is tested empirically rather than presupposed by the method's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The approach rests on the modeling assumption that a random Tucker-like decomposition with frozen shared factors can stand in for full high-rank updates when only scaling vectors are learned.

free parameters (1)

layer-specific scaling vectors
Trainable parameters fitted during fine-tuning; their values are determined by the adaptation objective.

axioms (1)

domain assumption: Randomly initialized frozen factors in the tensor network suffice to represent the necessary high-rank structure when scaled per layer.
Invoked to justify freezing the large components while training only the vectors.

invented entities (1)

TeRA tensor network parametrization (no independent evidence)
purpose: To achieve high-rank updates with vector-level trainable parameter count
New architectural construction introduced by the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation to the paper passage: unclear
Paper passage: "large randomly initialized factors are frozen and shared across layers, while only small layer-specific scaling vectors... are trained"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation to the paper passage: unclear
Paper passage: "Theorem 1... rank(ΔW[N;k]) ≤ min(∏_{i=1}^k R_i, ∏_{i=k+1}^N R_i)"
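The bound quoted in the second passage has the form of the standard unfolding-rank bound for a Tucker-format tensor. A short sketch of why such a bound holds follows, assuming that R_i denotes the i-th core dimension, that ΔW[N;k] is the unfolding grouping the first k modes into rows, and that any diagonal scalings are absorbed into the factor matrices A_i. These are readings of the quoted notation rather than definitions confirmed by this page, and the order of the Kronecker factors depends on the unfolding convention.

```latex
\Delta\mathcal{W} = \mathcal{G} \times_1 \mathbf{A}_1 \times_2 \cdots \times_N \mathbf{A}_N,
\qquad \mathcal{G} \in \mathbb{R}^{R_1 \times \cdots \times R_N}

\Delta\mathbf{W}_{[N;k]}
  = \left(\mathbf{A}_1 \otimes \cdots \otimes \mathbf{A}_k\right)
    \mathbf{G}_{[N;k]}
    \left(\mathbf{A}_{k+1} \otimes \cdots \otimes \mathbf{A}_N\right)^{\top},
\qquad \mathbf{G}_{[N;k]} \in \mathbb{R}^{\prod_{i=1}^{k} R_i \,\times\, \prod_{i=k+1}^{N} R_i}

\operatorname{rank}\!\left(\Delta\mathbf{W}_{[N;k]}\right)
  \le \operatorname{rank}\!\left(\mathbf{G}_{[N;k]}\right)
  \le \min\!\left(\prod_{i=1}^{k} R_i,\; \prod_{i=k+1}^{N} R_i\right)
```

The first inequality holds because the rank of a matrix product never exceeds the rank of any of its factors; the second holds because a matrix's rank is at most the smaller of its two dimensions.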

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 5 internal anchors

[3] Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774.

[4] Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, 65–72.

[5] Bershatsky, D.; Cherniuk, D.; Daulbaev, T.; Mikhalev, A.; and Oseledets, I. 2024. LoTR: Low Tensor Rank Weight Adaptation. arXiv:2402.01376.

[6] Bisk, Y.; Zellers, R.; Gao, J.; Choi, Y.; et al. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 7432–7439.

[7] Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

[8] Cichocki, A.; Mandic, D.; De Lathauwer, L.; Zhou, G.; Zhao, Q.; Caiafa, C.; and Phan, H. A. 2015. Tensor Decompositions for Signal Processing Applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2): 145–163.

[9] Clark, C.; Lee, K.; Chang, M.-W.; Kwiatkowski, T.; Collins, M.; and Toutanova, K. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2924–2936.

[10] Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; and Tafjord, O. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.

[11] Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168.

[12] De Lathauwer, L.; De Moor, B.; and Vandewalle, J. 2000a. A multilinear singular value decomposition. SIAM Journal on Matrix Analysis and Applications, 21(4): 1253–1278.

[13] De Lathauwer, L.; De Moor, B.; and Vandewalle, J. 2000b. On the best rank-1 and rank-(r1, r2,..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4): 1324–1342.

[14] Dinan, E.; Logacheva, V.; Malykh, V.; Miller, A.; Shuster, K.; Urbanek, J.; Kiela, D.; Szlam, A.; Serban, I.; Lowe, R.; et al. 2019. The Second Conversational Intelligence Challenge (ConvAI2). In The NeurIPS'18 Competition: From Machine Learning to Intelligent Conversations, 187–208. Springer.

[15] Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783.

[16] Gu, Y.; Zhou, W.; Iacovides, G.; and Mandic, D. 2025. TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs. arXiv preprint arXiv:2501.15674.

[17] Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.

[18] Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.-P.; Bing, L.; Xu, X.; Poria, S.; and Lee, R. 2023. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 5254–5276.

[19] Huang, Q.; Ko, T.; Zhuang, Z.; Tang, L.; and Zhang, Y. 2025. HiRA: Parameter-Efficient Hadamard High-Rank Adaptation for Large Language Models. In The Thirteenth International Conference on Learning Representations.

[20] Iacovides, G.; Zhou, W.; Li, C.; Zhao, Q.; and Mandic, D. 2025. Domain-Aware Tensor Network Structure Search. arXiv preprint arXiv:2505.23537.

[21] Iacovides, G.; Zhou, W.; and Mandic, D. 2024. Towards LLM-guided Efficient and Interpretable Multi-linear Tensor Network Rank Selection. arXiv preprint arXiv:2410.10728.

[22] Jiang, T.; Huang, S.; Luo, S.; Zhang, Z.; Huang, H.; Wei, F.; Deng, W.; Sun, F.; Zhang, Q.; Wang, D.; and Zhuang, F. 2024. MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning. arXiv:2405.12130.

[23] Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; and Hajishirzi, H. 2016. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1152–1157. San Diego, California: Association for Computational Linguistics.

[24] Kopiczko, D. J.; Blankevoort, T.; and Asano, Y. M. 2024. VeRA: Vector-based Random Matrix Adaptation. In The Twelfth International Conference on Learning Representations.

[25] Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 3045–3059. Association for Computational Linguistics.

[26] Li, C.; Zeng, J.; Li, C.; Caiafa, C.; and Zhao, Q. 2023. Alternating local enumeration (TnALE): solving tensor network structure search with fewer evaluations. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.

[27] Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics.

[28] Ling, W.; Yogatama, D.; Dyer, C.; and Blunsom, P. 2017. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 158–167.

[29] Liu, X.; Ji, K.; Fu, Y.; Tam, W.; Du, Z.; Yang, Z.; and Tang, J. 2022. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Muresan, S.; Nakov, P.; and Villavicencio, A., eds., Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 61–68. Dublin, Ireland: Associat...

[30] Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations.

[31] Mihaylov, T.; Clark, P.; Khot, T.; and Sabharwal, A. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2381–2391.

[32] Oseledets, I. V. 2011. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5): 2295–2317.

[33] Patel, A.; Bhattamishra, S.; and Goyal, N. 2021. Are NLP Models really able to Solve Simple Math Word Problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2080–2094. Association for Computational Linguistics.

[34] Sakaguchi, K.; Le Bras, R.; Bhagavatula, C.; and Choi, Y. 2020. WinoGrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 8732–8740.

[35] Sap, M.; Rashkin, H.; Chen, D.; LeBras, R.; and Choi, Y. 2019. SocialIQA: Commonsense Reasoning about Social Interactions. In Conference on Empirical Methods in Natural Language Processing.

[36] Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[37] Tucker, L. R.; et al. 1964. The extension of factor analysis to three-dimensional matrices. Contributions to mathematical psychology, 110119: 110–182.

[38] Wang, M.; Duc, K. D.; Fischer, J.; and Song, Y. S. 2017. Operator norm inequalities between tensor unfoldings on the partition lattice. Linear Algebra and its Applications, 520: 44–66.

[39] Yang, Y.; Zhou, J.; Wong, N.; and Zhang, Z. 2024. LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3161–3176.

[40] Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4791–4800.

[41] Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.

[1] [1]

, " * write output.state after.block = add.period write newline

ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

work page

[2] [2]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[3] [3]

GPT-4 Technical Report

Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F. L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. 2023. GPT-4 Technical Report . arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Banerjee, S.; and Lavie, A. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments . In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65--72

work page 2005

[5] [5]

Bershatsky, D.; Cherniuk, D.; Daulbaev, T.; Mikhalev, A.; and Oseledets, I. 2024. LoTR : Low Tensor Rank Weight Adaptation. arXiv:2402.01376

work page arXiv 2024

[6] [6]

Bisk, Y.; Zellers, R.; Gao, J.; Choi, Y.; et al. 2020. PIQA: Reasoning about Physical Commonsense in Natural Language . In Proceedings of the AAAI conference on artificial intelligence, volume 34, 7432--7439

work page 2020

[7] [7]

D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al

Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877--1901

work page 2020

[8] [8]

Cichocki, A.; Mandic, D.; De Lathauwer, L.; Zhou, G.; Zhao, Q.; Caiafa, C.; and PHAN, H. A. 2015. Tensor Decompositions for Signal Processing Applications: From two-way to multiway component analysis. IEEE Signal Processing Magazine, 32(2): 145--163

work page 2015

[9] [9]

Clark, C.; Lee, K.; Chang, M.-W.; Kwiatkowski, T.; Collins, M.; and Toutanova, K. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2924--2936

work page 2019

[10] [10]

Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; and Tafjord, O. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge . arXiv preprint arXiv:1803.05457

work page internal anchor Pith review Pith/arXiv arXiv 2018

[11] [11]

Cobbe, K.; Kosaraju, V.; Bavarian, M.; Chen, M.; Jun, H.; Kaiser, L.; Plappert, M.; Tworek, J.; Hilton, J.; Nakano, R.; et al. 2021. Training Verifiers to Solve Math Word Problems . arXiv preprint arXiv:2110.14168

work page internal anchor Pith review Pith/arXiv arXiv 2021

[12] [12]

De Lathauwer, L.; De Moor, B.; and Vandewalle, J. 2000 a . A multilinear singular value decomposition. SIAM journal on Matrix Analysis and Applications, 21(4): 1253--1278

[13]

De Lathauwer, L.; De Moor, B.; and Vandewalle, J. 2000b. On the best rank-1 and rank-(r1, r2,..., rn) approximation of higher-order tensors. SIAM Journal on Matrix Analysis and Applications, 21(4): 1324--1342

[14]

Dinan, E.; Logacheva, V.; Malykh, V.; Miller, A.; Shuster, K.; Urbanek, J.; Kiela, D.; Szlam, A.; Serban, I.; Lowe, R.; et al. 2019. The Second Conversational Intelligence Challenge (ConvAI2). In The NeurIPS'18 Competition: From Machine Learning to Intelligent Conversations, 187--208. Springer

[15]

Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. 2024. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783

[16]

Gu, Y.; Zhou, W.; Iacovides, G.; and Mandic, D. 2025. TensorLLM: Tensorising Multi-Head Attention for Enhanced Reasoning and Compression in LLMs. arXiv preprint arXiv:2501.15674

[17]

Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations

[18]

Hu, Z.; Wang, L.; Lan, Y.; Xu, W.; Lim, E.-P.; Bing, L.; Xu, X.; Poria, S.; and Lee, R. 2023. LLM-Adapters: An Adapter Family for Parameter-Efficient Fine-Tuning of Large Language Models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 5254--5276

[19]

Huang, Q.; Ko, T.; Zhuang, Z.; Tang, L.; and Zhang, Y. 2025. HiRA: Parameter-Efficient Hadamard High-Rank Adaptation for Large Language Models. In The Thirteenth International Conference on Learning Representations

[20]

Iacovides, G.; Zhou, W.; Li, C.; Zhao, Q.; and Mandic, D. 2025. Domain-Aware Tensor Network Structure Search. arXiv preprint arXiv:2505.23537

[21]

Iacovides, G.; Zhou, W.; and Mandic, D. 2024. Towards LLM-guided Efficient and Interpretable Multi-linear Tensor Network Rank Selection. arXiv preprint arXiv:2410.10728

[22]

Jiang, T.; Huang, S.; Luo, S.; Zhang, Z.; Huang, H.; Wei, F.; Deng, W.; Sun, F.; Zhang, Q.; Wang, D.; and Zhuang, F. 2024. MoRA: High-Rank Updating for Parameter-Efficient Fine-Tuning. arXiv preprint arXiv:2405.12130

[23]

Koncel-Kedziorski, R.; Roy, S.; Amini, A.; Kushman, N.; and Hajishirzi, H. 2016. MAWPS: A Math Word Problem Repository. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1152--1157. San Diego, California: Association for Computational Linguistics

[24]

Kopiczko, D. J.; Blankevoort, T.; and Asano, Y. M. 2024. VeRA: Vector-based Random Matrix Adaptation. In The Twelfth International Conference on Learning Representations

[25]

Lester, B.; Al-Rfou, R.; and Constant, N. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 3045--3059. Association for Computational Linguistics

[26]

Li, C.; Zeng, J.; Li, C.; Caiafa, C.; and Zhao, Q. 2023. Alternating local enumeration (TnALE): solving tensor network structure search with fewer evaluations. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org

[27]

Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74--81. Barcelona, Spain: Association for Computational Linguistics

[28]

Ling, W.; Yogatama, D.; Dyer, C.; and Blunsom, P. 2017. Program Induction by Rationale Generation: Learning to Solve and Explain Algebraic Word Problems. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 158--167

[29]

Liu, X.; Ji, K.; Fu, Y.; Tam, W.; Du, Z.; Yang, Z.; and Tang, J. 2022. P-Tuning: Prompt Tuning Can Be Comparable to Fine-tuning Across Scales and Tasks. In Muresan, S.; Nakov, P.; and Villavicencio, A., eds., Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 61--68. Dublin, Ireland: Associat...

[30]

Loshchilov, I.; and Hutter, F. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations

[31]

Mihaylov, T.; Clark, P.; Khot, T.; and Sabharwal, A. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 2381--2391

[32]

Oseledets, I. V. 2011. Tensor-train decomposition. SIAM Journal on Scientific Computing, 33(5): 2295--2317

[33]

Patel, A.; Bhattamishra, S.; and Goyal, N. 2021. Are NLP Models really able to Solve Simple Math Word Problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2080--2094. Association for Computational Linguistics

[34]

Sakaguchi, K.; Le Bras, R.; Bhagavatula, C.; and Choi, Y. 2020. WinoGrande: An adversarial winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 8732--8740

[35]

Sap, M.; Rashkin, H.; Chen, D.; LeBras, R.; and Choi, Y. 2019. SocialIQA: Commonsense Reasoning about Social Interactions. In Conference on Empirical Methods in Natural Language Processing

[36]

Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.; Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale, S.; et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

[37]

Tucker, L. R.; et al. 1964. The extension of factor analysis to three-dimensional matrices. Contributions to mathematical psychology, 110119: 110--182

[38]

Wang, M.; Duc, K. D.; Fischer, J.; and Song, Y. S. 2017. Operator norm inequalities between tensor unfoldings on the partition lattice. Linear Algebra and its Applications, 520: 44--66

[39]

Yang, Y.; Zhou, J.; Wong, N.; and Zhang, Z. 2024. LoRETTA: Low-Rank Economic Tensor-Train Adaptation for Ultra-Low-Parameter Fine-Tuning of Large Language Models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 3161--3176

[40]

Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; and Choi, Y. 2019. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 4791--4800

[41]

Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations
