arxiv: 2606.08565 · v1 · pith:3Z4DCTGCnew · submitted 2026-06-07 · 💻 cs.LG · cs.AI

EinSort: Sorting is All We Need for Tensorizing LLM

Pith reviewed 2026-06-27 18:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords tensorization · LLM compression · low-rank structure · index ordering · sorting · KV-cache compression · weight compression · tensor networks

The pith

Sorting the indices of a tensor uncovers its hidden low-rank structure for better LLM compression.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes that reordering the indices of a target tensor through sorting can expose implicit low-rank structure that standard fixed-order tensorization misses. This adaptive method is tested on compressing the weights and KV-caches of large language models, where it yields lower reconstruction error than baseline approaches. A sympathetic reader would care because it offers a lightweight way to reduce the memory footprint of foundation models by leveraging structure already present in their parameters. The approach avoids the need for hand-crafted tensor shapes or topologies by letting the data ordering itself guide the decomposition.

Core claim

The central claim is that an adaptive tensorization procedure based on index ordering by sorting reliably discovers inherent low-rank structure inside the otherwise unstructured weight tensors of large language models, leading to higher-fidelity reconstructions when the resulting tensor networks are used for weight and KV-cache compression.

What carries the argument

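EinSort (Einstein sorted sum): a tensorization framework that augments Einstein summation with reversible permutation operators. Its sorting-based reordering adaptively permutes tensor indices to expose low-rank structure before tensor-network decomposition.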
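The reordering idea can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: it assumes plain row-wise sorting of a random Gaussian matrix (a stand-in for a weight matrix) and a truncated SVD in place of a tensor network. The helper names `truncated_svd` and `rel_error` are hypothetical. The point is only that the permutation is reversible, so an approximation computed in sorted order can be mapped back to the original index order.

```python
import numpy as np

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in the Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def rel_error(M, M_hat):
    """Relative Frobenius reconstruction error."""
    return np.linalg.norm(M - M_hat) / np.linalg.norm(M)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 128))   # assumption: i.i.d. stand-in for an unstructured weight matrix
rank = 4

# Baseline: low-rank approximation in the original index order.
err_plain = rel_error(W, truncated_svd(W, rank))

# Row-wise sorting: perm[i] lists the column indices of row i in ascending value order.
perm = np.argsort(W, axis=1)
W_sorted = np.take_along_axis(W, perm, axis=1)
W_sorted_hat = truncated_svd(W_sorted, rank)

# Undo the permutation so the approximation lines up with the original W.
W_hat = np.empty_like(W)
np.put_along_axis(W_hat, perm, W_sorted_hat, axis=1)
err_sorted = rel_error(W, W_hat)

print(f"rank-{rank} error without sorting: {err_plain:.3f}")
print(f"rank-{rank} error with sorting:    {err_sorted:.3f}")
```

In this sketch `perm` has to be kept alongside the factors, since without it the sorted approximation cannot be mapped back. Whether the same gap appears on real LLM weights is exactly what the paper's experiments are meant to show; the Gaussian example only demonstrates the mechanism.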

If this is right

  • Tensor networks can represent LLM weights at lower error for a given compression ratio once indices are sorted.
  • KV-cache tensors in transformer inference also admit better compression under the same reordering.
  • The method works without requiring changes to model architecture or additional training.
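"Lower error for a given compression ratio" depends on how the ratio is counted. The sketch below is a back-of-the-envelope accounting under stated assumptions, not the paper's cost model: it uses a rank-r matrix factorization instead of a tensor network, 16-bit values, and a naive encoding that stores one column index per entry for a row-wise permutation. All function names are hypothetical. Any grouped or sliced sorting scheme that shares permutations would cost less than this naive bound.

```python
import math

def dense_bits(m, n, bits=16):
    """Storage of the uncompressed m x n matrix."""
    return m * n * bits

def low_rank_bits(m, n, rank, bits=16):
    """Storage of two factors of shapes (m x rank) and (rank x n)."""
    return rank * (m + n) * bits

def row_permutation_bits(m, n):
    """Naive, unshared encoding: ceil(log2 n) bits per entry (assumption)."""
    return m * n * math.ceil(math.log2(n))

def compression_ratio(m, n, rank, bits=16, store_permutation=False):
    """Compressed size divided by dense size (smaller is better)."""
    compressed = low_rank_bits(m, n, rank, bits)
    if store_permutation:
        compressed += row_permutation_bits(m, n)
    return compressed / dense_bits(m, n, bits)

m, n, rank = 1024, 1024, 64
ratio_plain = compression_ratio(m, n, rank)
ratio_sorted = compression_ratio(m, n, rank, store_permutation=True)
print(ratio_plain, ratio_sorted)
```

Under these assumptions the index storage dominates the factors, which is why any comparison at a fixed ratio should state whether permutation memory is included.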

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The success of sorting suggests that index order in weight tensors is not arbitrary and may reflect training dynamics or architectural symmetries.
  • Similar permutation-based preprocessing could be tested on other tensor formats such as attention matrices or embedding tables.
  • If the reordering effect holds across scales, it may reduce the search space needed when designing tensor topologies for new models.

Load-bearing premise

Reordering a tensor's indices by sorting will consistently make its low-rank structure visible even when the original weight values appear unstructured.

What would settle it

Running the same tensor decomposition on LLM weight tensors with and without the sorting step and finding that reconstruction error does not decrease, or increases, for multiple models and decomposition ranks.
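A test of this kind could be organized as the small harness below. It is a sketch under assumptions: truncated SVD stands in for the tensor decomposition, row-wise sorting stands in for the EinSort reordering, and random Gaussian matrices are placeholders for real LLM weight tensors. The names `svd_error` and `sorting_ablation` are hypothetical.

```python
import numpy as np

def svd_error(M, rank):
    """Relative Frobenius error of the best rank-`rank` approximation of M."""
    s = np.linalg.svd(M, compute_uv=False)
    return float(np.sqrt(np.sum(s[rank:] ** 2) / np.sum(s ** 2)))

def sorting_ablation(matrices, ranks):
    """Return rows of (matrix index, rank, error without sort, error with sort)."""
    rows = []
    for i, W in enumerate(matrices):
        # Sorting only permutes entries within each row, so the Frobenius norm is
        # unchanged and the error in sorted order equals the error after unsorting.
        W_sorted = np.sort(W, axis=1)
        for r in ranks:
            rows.append((i, r, svd_error(W, r), svd_error(W_sorted, r)))
    return rows

rng = np.random.default_rng(0)
# Assumption: placeholders only; a real test would load weights from several models.
matrices = [rng.standard_normal((32, 64)) for _ in range(3)]
results = sorting_ablation(matrices, ranks=[1, 2, 4, 8])

wins = sum(e_sorted < e_plain for _, _, e_plain, e_sorted in results)
print(f"sorting lowered the error in {wins}/{len(results)} cases")
```

I.i.d. Gaussian placeholders are a favorable case for sorting, so a result here says nothing about LLM weights. The harness only fixes the protocol: same decomposition, same ranks, with and without the sorting step.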

Figures

Figures reproduced from arXiv: 2606.08565 by Jing Liu, Toshiaki Koike-Akino, Ye Wang.

Figure 1: Various tensor networks in tensor diagrams.
Figure 2: Tensorization error vs. rank for v_proj weights at the first layer of Qwen3-0.6B model. Left: linear; right: log-log plots.
Figure 3: Sliced sorting with nonlinear mapping and reduction. Non-negative tensor decomposition keeps one-bit sign information.
Figure 4: PPL for Qwen3-0.6B and Gemma3-4B models.
Figure 5: PPL vs. exponent for Qwen3-1.7B at 50% compression. Gauge fixing and non-negative tensorization improve the accuracy.
Figure 6: Tensorization error vs. rank for v_proj weights at the first layer of Qwen3-0.6B model. Sorting can significantly reduce the required rank. Left: linear; right: log-log plots.
Figure 7: Different nonlinear mapping for Qwen3-1.7B at 50% compression: additive logarithm; multiplicative logarithm; and exponentiation.
Figure 8: Different reduction and nonlinear mapping for Phi3-mini at 50% and 80% compression.
Figure 9: Gauge freedom: Tensor cores can be arbitrary up to any full-rank identity rotation injection at each cut.
Figure 10: Gauge fixing with SVD and LU canonicalization. SVD: Tensor cores can be left- or right-unitary around the orthogonality center. Orthogonal tensors can be represented by lower number of parameters with Stiefel manifold (Koike-Akino et al., 2025b). LU: Tensor cores can be block identity. Block identity tensors can be represented with lower number of parameters (Koike-Akino et al., 2026a).
Figure 11: Test-time tensor train adaptation based on online activation preconditioner C.
Figure 12: CNOT chain to realize the row-wise sorting $\pi$ for $W_0$ in (89). The sorted matrix $W_\pi$ can then be decomposed in low-rank tensors. The input X is contracted with the decomposed $W_\pi$ through reverse permutation $\pi^{-1}$ with conjugate CNOT chain.
Figure 13: EinSort pipeline with generalized folding/unfolding, ordering, slicing, sorting, and mapping.
Figure 14: PPL over compression rate for Qwen3-1.7B and Phi3-mini (3.8B) models.
Figure 15: GSM8K accuracy for Phi-4-mini as a function of KV cache compression ratio. Shaded zone is Wilson's 99% confidence interval.
Figure 16: Accuracy for compressed Qwen3-VL models on TextVQA benchmark. Panels: (a) Qwen3-VL-4B-Instruct; (b) Qwen3-VL-8B-Instruct.
Figure 17: Robot manipulation video snapshots for the compressed VLA-JEPA model on LIBERO benchmarks.
Original abstract

Tensor networks provide efficient representations for compressing large neural networks. By carefully designing shapes and topologies, they can significantly reduce memory and computational costs. However, identifying implicit low-rank structures in large foundation models remains challenging due to their enormous scale and unstructured weight distributions. We propose an adaptive tensorization method that discovers inherent low-rank structure in a target tensor by index ordering. Experiments on weight and KV-cache compression demonstrate improved reconstruction quality compared to baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper proposes EinSort, an adaptive tensorization method for LLMs that discovers inherent low-rank structure in target tensors via index ordering (sorting). Experiments on weight and KV-cache compression are claimed to show improved reconstruction quality relative to baselines.

Significance. If the central claim holds under detailed scrutiny, the approach would supply a notably simple mechanism (index sorting) for exposing usable low-rank structure in the unstructured weights typical of foundation models, thereby easing the design burden for tensor-network compression and potentially lowering memory and compute costs.

minor comments (1)
  1. The provided text consists solely of the abstract; without access to the full manuscript, algorithms, experimental protocols, quantitative results, or error analysis, no technical evaluation of the method or its empirical support is possible.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their review and for acknowledging the potential significance of a simple index-sorting mechanism for exposing low-rank structure in foundation-model weights. The recommendation is listed as uncertain with no specific major comments provided in the report. We therefore stand by the manuscript as submitted while remaining available to address any additional questions or clarifications the referee may have.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The abstract and available context present a methodological proposal for adaptive tensorization via index ordering to discover low-rank structure, without any derivations, equations, fitted parameters, or self-citations that reduce claims to inputs by construction. No load-bearing steps are described that would qualify under the enumerated circularity patterns, making the derivation self-contained as an empirical suggestion rather than a tautological re-expression of prior results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only; ledger therefore records the minimal domain assumptions required for any tensor-network compression claim. No free parameters or invented entities are visible in the abstract.

axioms (1)
  • domain assumption: Weight tensors of large foundation models contain exploitable low-rank structure once indices are suitably reordered.
    This premise is required for the adaptive tensorization claim to be meaningful and is stated directly in the abstract.

pith-pipeline@v0.9.1-grok · 5592 in / 1189 out tokens · 22345 ms · 2026-06-27T18:29:13.173666+00:00 · methodology


Reference graph

Works this paper leans on

103 extracted references · 50 canonical work pages · 25 internal anchors

  1. [1]

    Phi-4 Technical Report

    Abdin, M., Aneja, J., Behl, H., Bubeck, S., Eldan, R., Gunasekar, S., Harrison, M., Hewett, R. J., Javaheripi, M., Kauffmann, P., et al. Phi-4 technical report. arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    Abouelenin, A., Ashfaq, A., Atkinson, A., Awadalla, H., Bach, N., Bao, J., Benhaim, A., Cai, M., Chaudhary, V., Chen, C., et al. Phi-4-mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs. arXiv preprint arXiv:2503.01743, 2025

  3. [3]

    TQCompressor: improving tensor decomposition methods in neural networks via permutations

    Abronin, V., Naumov, A., Mazur, D., Bystrov, D., Tsarova, K., Melnikov, A., Dolgov, S., Brasher, R., and Perelshein, M. TQCompressor: improving tensor decomposition methods in neural networks via permutations. In 2024 IEEE 7th International Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 503-506. IEEE, 2024

  4. [4]

    Phi-4-reasoning-vision-15B technical report

    Aneja, J., Harrison, M., Joshi, N., LaBonte, T., Langford, J., and Salinas, E. Phi-4-reasoning-vision-15B technical report. arXiv preprint arXiv:2603.03975, 2026

  5. [5]

    Quantization error propagation: Revisiting layer-wise post-training quantization

    Arai, Y. and Ichikawa, Y. Quantization error propagation: Revisiting layer-wise post-training quantization. arXiv preprint arXiv:2504.09629, 2025

  6. [6]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

  7. [7]

    Beyond efficiency: A systematic survey of resource-efficient large language models

    Bai, G., Chai, Z., Ling, C., Wang, S., Lu, J., Zhang, N., Shi, T., Yu, Z., Zhu, M., Zhang, Y., et al. Beyond efficiency: A systematic survey of resource-efficient large language models. arXiv preprint arXiv:2401.00625, 2024a

  8. [8]

    SparseLLM: Towards global pruning for pre-trained language models

    Bai, G., Li, Y., Ling, C., Kim, K., and Zhao, L. SparseLLM: Towards global pruning for pre-trained language models. arXiv preprint arXiv:2402.17946, 2024b

  9. [9]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025

  10. [10]

    LoTR: Low tensor rank weight adaptation

    Bershatsky, D., Cherniuk, D., Daulbaev, T., Mikhalev, A., and Oseledets, I. LoTR: Low tensor rank weight adaptation. arXiv preprint arXiv:2402.01376, 2024

  11. [11]

    PaliGemma: A versatile 3B VLM for transfer

    Beyer, L., Steiner, A., Pinto, A. S., Kolesnikov, A., Wang, X., Salz, D., Neumann, M., Alabdulmohsin, I., Tschannen, M., Bugliarello, E., et al. PaliGemma: A versatile 3B VLM for transfer. arXiv preprint arXiv:2407.07726, 2024

  12. [12]

    Logic minimization algorithms for VLSI synthesis

    Brayton, R. K., Hachtel, G. D., McMullen, C., and Sangiovanni-Vincentelli, A. Logic minimization algorithms for VLSI synthesis, volume 2. Springer Science & Business Media, 1984

  13. [13]

    PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

    Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., and Xiao, W. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. In Second Conference on Language Modeling, 2025. URL https://openreview.net/forum?id=ayi7qezU87

  14. [14]

    Palu: Compressing KV-cache with low-rank projection

    Chang, C.-C., Lin, W.-C., Lin, C.-Y., Chen, C.-Y., Hu, Y.-F., Wang, P.-S., Huang, N.-C., Ceze, L., Abdelfattah, M. S., and Wu, K.-C. Palu: Compressing KV-cache with low-rank projection. arXiv preprint arXiv:2407.21118, 2024

  15. [15]

    One-for-all: Generalized LoRA for parameter-efficient fine-tuning

    Chavan, A., Liu, Z., Gupta, D., Xing, E., and Shen, Z. One-for-all: Generalized LoRA for parameter-efficient fine-tuning. arXiv preprint arXiv:2306.07967, 2023

  16. [16]

    SuperLoRA: Parameter-efficient unified adaptation for large vision models

    Chen, X., Liu, J., Wang, Y., Wang, P., Brand, M., Wang, G., and Koike-Akino, T. SuperLoRA: Parameter-efficient unified adaptation for large vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8050-8055, 2024a

  17. [17]

    QuanTA: Efficient high-rank fine-tuning of LLMs with quantum-informed tensor adaptation

    Chen, Z., Dangovski, R., Loh, C., Dugan, O., Luo, D., and Soljačić, M. QuanTA: Efficient high-rank fine-tuning of LLMs with quantum-informed tensor adaptation. Advances in Neural Information Processing Systems, 37: 92210-92245, 2024b

  18. [18]

    Training Verifiers to Solve Math Word Problems

    Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. URL https://arxiv.org/abs/2110.14168, 2021

  19. [19]

    David, H. A. and Nagaraja, H. N. Order statistics. John Wiley & Sons, 2004

  20. [20]

    Predicting parameters in deep learning

    Denil, M., Shakibi, B., Dinh, L., Ranzato, M., and De Freitas, N. Predicting parameters in deep learning. Advances in neural information processing systems, 26, 2013

  21. [21]

    Exploiting linear structure within convolutional networks for efficient evaluation

    Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., and Fergus, R. Exploiting linear structure within convolutional networks for efficient evaluation. Advances in neural information processing systems, 27, 2014

  22. [22]

    DaViT: Dual attention vision transformers

    Ding, M., Xiao, B., Codella, N., Luo, P., Wang, J., and Yuan, L. DaViT: Dual attention vision transformers. In European conference on computer vision, pp. 74-92. Springer, 2022

  23. [23]

    Learning to prune deep neural networks via layer-wise optimal brain surgeon

    Dong, X., Chen, S., and Pan, S. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in neural information processing systems, 30, 2017

  24. [24]

    HAWQ: Hessian aware quantization of neural networks with mixed-precision

    Dong, Z., Yao, Z., Gholami, A., Mahoney, M. W., and Keutzer, K. HAWQ: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 293-302, 2019

  25. [25]

    KronA: Parameter efficient tuning with Kronecker adapter

    Edalati, A., Tahaei, M., Kobyzev, I., Nia, V. P., Clark, J. J., and Rezagholizadeh, M. KronA: Parameter efficient tuning with Kronecker adapter. arXiv preprint arXiv:2212.10650, 2022

  26. [26]

    Gauge fixing, canonical forms, and optimal truncations in tensor networks with closed loops

    Evenbly, G. Gauge fixing, canonical forms, and optimal truncations in tensor networks with closed loops. Physical Review B, 98(8): 085155, 2018

  27. [27]

    SparseGPT: Massive language models can be accurately pruned in one-shot

    Frantar, E. and Alistarh, D. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp. 10323-10337. PMLR, 2023

  28. [28]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  29. [29]

    Monarch mixer: A simple sub-quadratic GEMM-based architecture

    Fu, D., Arora, S., Grogan, J., Johnson, I., Eyuboglu, E. S., Thomas, A., Spector, B., Poli, M., Rudra, A., and Ré, C. Monarch mixer: A simple sub-quadratic GEMM-based architecture. Advances in Neural Information Processing Systems, 36: 77546-77603, 2023

  30. [30]

    Gemma 3 Technical Report

    Gemma Team, Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

  31. [31]

    Stochastic Optimization of Sorting Networks via Continuous Relaxations

    Grover, A., Wang, E., Zweig, A., and Ermon, S. Stochastic optimization of sorting networks via continuous relaxations. arXiv preprint arXiv:1903.08850, 2019

  32. [32]

    Phi-3 safety post-training: Aligning language models with a "break-fix" cycle

    Haider, E., Perez-Becker, D., Portet, T., Madan, P., Garg, A., Ashfaq, A., Majercak, D., Wen, W., Kim, D., Yang, Z., et al. Phi-3 safety post-training: Aligning language models with a "break-fix" cycle. arXiv preprint arXiv:2407.13833, 2024

  33. [33]

    Optimal brain surgeon: Extensions and performance comparisons

    Hassibi, B., Stork, D., and Wolff, G. Optimal brain surgeon: Extensions and performance comparisons. Advances in neural information processing systems, 6, 1993

  34. [34]

    LoRA+: Efficient low rank adaptation of large models

    Hayou, S., Ghosh, N., and Yu, B. LoRA+: Efficient low rank adaptation of large models. In International Conference on Machine Learning, 2024

  35. [35]

    Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

    Hsieh, C.-Y., Li, C.-L., Yeh, C.-K., Nakhost, H., Fujii, Y., Ratner, A., Krishna, R., Lee, C.-Y., and Pfister, T. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023

  36. [36]

    LoRA: Low-rank adaptation of large language models

    Hu, E. J., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2021

  37. [37]

    Towards quantum machine learning with tensor networks

    Huggins, W., Patil, P., Mitchell, B., Whaley, K. B., and Stoudenmire, E. M. Towards quantum machine learning with tensor networks. Quantum Science and Technology, 4(2): 024001, 2019

  38. [38]

    PC-LoRA: Low-rank adaptation for progressive model compression with knowledge distillation

    Hwang, I., Park, H., Lee, Y., Yang, J., and Maeng, S. PC-LoRA: Low-rank adaptation for progressive model compression with knowledge distillation. arXiv preprint arXiv:2406.09117, 2024

  39. [39]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Intelligence, P., Black, K., Brown, N., Darpinian, J., Dhabalia, K., Driess, D., Esmail, A., Equi, M., Finn, C., Fusai, N., et al. $\pi_{0.5}$: a vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025

  40. [40]

    Speeding up Convolutional Neural Networks with Low Rank Expansions

    Jaderberg, M., Vedaldi, A., and Zisserman, A. Speeding up convolutional neural networks with low rank expansions. arXiv preprint arXiv:1405.3866, 2014

  41. [41]

    GPT-4 passes the bar exam

    Katz, D. M., Bommarito, M. J., Gao, S., and Arredondo, P. GPT-4 passes the bar exam. Philosophical Transactions of the Royal Society A, 382(2270): 20230254, 2024

  42. [42]

    $\mu$-MoE: Test-time pruning as micro-grained mixture-of-experts

    Koike-Akino, T., Liu, J., and Wang, Y. $\mu$-MoE: Test-time pruning as micro-grained mixture-of-experts. arXiv preprint arXiv:2505.18451, 2025a

  43. [43]

    Quantum-PEFT: Ultra parameter-efficient fine-tuning

    Koike-Akino, T., Tonin, F., Wu, Y., Wu, F. Z., Candogan, L. N., and Cevher, V. Quantum-PEFT: Ultra parameter-efficient fine-tuning. arXiv preprint arXiv:2503.05431, 2025b

  44. [44]

    LatentLLM: Activation-aware transform to multi-head latent attention

    Koike-Akino, T., Chen, X., Liu, J., Wang, Y., Wang, P. P., and Brand, M. LatentLLM: Activation-aware transform to multi-head latent attention. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pp. 22644-22652, 2026a

  45. [45]

    TTQ: Activation-aware test-time quantization to accelerate LLM inference on the fly

    Koike-Akino, T., Liu, J., and Wang, Y. TTQ: Activation-aware test-time quantization to accelerate LLM inference on the fly. arXiv preprint arXiv:2603.19296, 2026b

  46. [46]

    Speeding-up Convolutional Neural Networks Using Fine-tuned CP-Decomposition

    Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., and Lempitsky, V. Speeding-up convolutional neural networks using fine-tuned CP-decomposition. arXiv preprint arXiv:1412.6553, 2014

  47. [47]

    Optimal brain damage

    LeCun, Y., Denker, J., and Solla, S. Optimal brain damage. Advances in neural information processing systems, 2, 1989

  48. [48]

    Lehmer, D. H. Teaching combinatorial tricks to a computer. In Proceedings of Symposia in Applied Mathematics, pp. 179-193. American Mathematical Society, 1960

  49. [49]

    Reversible simulation of irreversible computation

    Li, M., Tromp, J., and Vitányi, P. Reversible simulation of irreversible computation. Physica D: Nonlinear Phenomena, 120(1-2): 168-176, 1998

  50. [50]

    Optimal brain decomposition for accurate LLM low-rank approximation

    Li, Y., Lee, D., Yin, R., and Panda, P. Optimal brain decomposition for accurate LLM low-rank approximation. arXiv preprint arXiv:2604.00821, 2026

  51. [51]

    MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

    Lin, B., Tang, Z., Ye, Y., Cui, J., Zhu, B., Jin, P., Zhang, J., Ning, M., and Yuan, L. MoE-LLaVA: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024a

  52. [52]

    AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

    Lin, J., Tang, J., Tang, H., Yang, S., Dang, X., and Han, S. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023

  53. [53]

    AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

    Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. Proceedings of Machine Learning and Systems, 6: 87-100, 2024b

  54. [54]

    DeepSeek-V3 Technical Report

    Liu, A., Feng, B., Xue, B., Wang, B., Wu, B., Lu, C., Zhao, C., Deng, C., Zhang, C., Ruan, C., et al. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024a

  55. [55]

    LIBERO: Benchmarking knowledge transfer for lifelong robot learning

    Liu, B., Zhu, Y., Gao, C., Feng, Y., Liu, Q., Zhu, Y., and Stone, P. LIBERO: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36: 44776-44791, 2023a

  56. [56]

    LoDA: Low-dimensional adaptation of large language models

    Liu, J., Koike-Akino, T., Wang, P., Brand, M., Wang, Y., and Parsons, K. LoDA: Low-dimensional adaptation of large language models. In NeurIPS'23 Workshop on Efficient Natural Language and Speech Processing, 2023b

  57. [57]

    AWP: Activation-aware weight pruning and quantization with projected gradient descent

    Liu, J., Koike-Akino, T., Wang, Y., Mansour, H., and Brand, M. AWP: Activation-aware weight pruning and quantization with projected gradient descent. arXiv preprint arXiv:2506.10205, 2025

  58. [58]

    KIVI: A tuning-free asymmetric 2bit quantization for KV cache

    Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A tuning-free asymmetric 2bit quantization for KV cache. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of ...

  59. [59]

    An adaptive tensor-train decomposition approach for efficient deep neural network compression

    Luo, S., Liu, M., Yu, Y., Ren, S., and Bai, Y. An adaptive tensor-train decomposition approach for efficient deep neural network compression. arXiv preprint arXiv:2408.01534, 2024

  60. [60]

    McCluskey, E. J. Minimization of Boolean functions. The Bell System Technical Journal, 35(6): 1417-1444, 1956

  61. [61]

    Learning Latent Permutations with Gumbel-Sinkhorn Networks

    Mena, G., Belanger, D., Linderman, S., and Snoek, J. Learning latent permutations with Gumbel-Sinkhorn networks. arXiv preprint arXiv:1802.08665, 2018

  62. [62]

    Pointer Sentinel Mixture Models

    Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016

  63. [63]

    Novikov, A., Podoprikhin, D., Osokin, A., and Vetrov, D. P. Tensorizing neural networks. Advances in neural information processing systems, 28, 2015

  64. [64]

    A practical introduction to tensor networks: Matrix product states and projected entangled pair states

    Orús, R. A practical introduction to tensor networks: Matrix product states and projected entangled pair states. Annals of Physics, 349: 117-158, 2014

  65. [65]

    PyTorch: An imperative style, high-performance deep learning library

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019

  66. [66]

    Scalable diffusion models with transformers

    Peebles, W. and Xie, S. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4195-4205, 2023

  67. [67]

    Stable low-rank tensor decomposition for compression of convolutional neural network

    Phan, A.-H., Sobolev, K., Sozykin, K., Ermilov, D., Gusak, J., Tichavský, P., Glukhov, V., Oseledets, I., and Cichocki, A. Stable low-rank tensor decomposition for compression of convolutional neural network. In European Conference on Computer Vision, pp. 522-539. Springer, 2020

  68. [68]

    SoftSort: A continuous relaxation for the argsort operator

    Prillo, S. and Eisenschlos, J. SoftSort: A continuous relaxation for the argsort operator. In International Conference on Machine Learning, pp. 7793-7802. PMLR, 2020

  69. [69]

    TensorNetwork: A Library for Physics and Machine Learning

    Roberts, C., Milsted, A., Ganahl, M., Zalcman, A., Fontaine, B., Zou, Y., Hidary, J., Vidal, G., and Leichenauer, S. TensorNetwork: A library for physics and machine learning. arXiv preprint arXiv:1905.01330, 2019

  70. [70]

    Compressing large language models using low rank and low precision decomposition

    Saha, R., Sagan, N., Srivastava, V., Goldsmith, A., and Pilanci, M. Compressing large language models using low rank and low precision decomposition. Advances in Neural Information Processing Systems, 37: 88981-89018, 2024

  71. [71]

    Low-rank matrix factorization for deep neural network training with high-dimensional output targets

    Sainath, T. N., Kingsbury, B., Sindhwani, V., Arisoy, E., and Ramabhadran, B. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In 2013 IEEE international conference on acoustics, speech and signal processing, pp. 6655-6659. IEEE, 2013

  72. [72]

    Eigen attention: Attention in low-rank space for KV cache compression

    Saxena, U., Saha, G., Choudhary, S., and Roy, K. Eigen attention: Attention in low-rank space for KV cache compression. arXiv preprint arXiv:2408.05646, 2024

  73. [73]

    The density-matrix renormalization group in the age of matrix product states

    Schollwöck, U. The density-matrix renormalization group in the age of matrix product states. Annals of Physics, 326(1): 96-192, 2011

  74. [74]

    Green AI

    Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green AI. Communications of the ACM, 63(12): 54-63, 2020

  75. [75]

    Tensor decomposition for signal processing and machine learning

    Sidiropoulos, N. D., De Lathauwer, L., Fu, X., Huang, K., Papalexakis, E. E., and Faloutsos, C. Tensor decomposition for signal processing and machine learning. IEEE Transactions on signal processing, 65(13): 3551-3582, 2017

  76. [76]

    Towards VQA models that can read

    Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Parikh, D., and Rohrbach, M. Towards VQA models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317-8326, 2019

  77. [77]

    Sinha, A. K. and Fleuret, F. AA-SVD: Anchored and adaptive SVD for large language model compression. arXiv preprint arXiv:2604.02119, 2026

  78. [78]

    VLA-JEPA: Enhancing vision-language-action model with latent world model

    Sun, J., Zhang, W., Qi, Z., Ren, S., Liu, Z., Zhu, H., Sun, G., Jin, X., and Chen, Z. VLA-JEPA: Enhancing vision-language-action model with latent world model. arXiv preprint arXiv:2602.10098, 2026

  79. [79]

    Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

  80. [80]

    Takeshita, O. Y. Permutation polynomial interleavers: An algebraic-geometric perspective. IEEE Transactions on Information Theory, 53(6): 2116-2132, 2007

Showing first 80 references.