pith. machine review for the scientific record.

arxiv: 2604.18124 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI

Recognition: unknown

TLoRA: Task-aware Low Rank Adaptation of Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:02 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Low-Rank Adaptation · Parameter-Efficient Fine-Tuning · Large Language Models · Singular Value Decomposition · Rank Allocation · Task-Aware Adaptation

The pith

TLoRA cuts trainable parameters while matching performance on language tasks

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TLoRA as a unified framework that jointly sets the initial values and distributes limited training resources across layers before fine-tuning starts. It initializes the LoRA A matrix by performing singular value decomposition on the product of the pre-trained weights and the input activation covariance, which aligns the starting point with directions relevant to the task at hand. After this step the A matrix is held fixed and only the B matrix is updated during training. A sensitivity measure then decides how many ranks and what scaling factors each layer receives, all while respecting a fixed total budget of trainable parameters. Experiments across natural language understanding, commonsense and math reasoning, code generation, and chat tasks show that this produces strong results with far fewer parameters to update than standard approaches. A reader would care because fine-tuning large models is costly in compute and data, and a method that automates good starting points and allocations could make such customization cheaper and more reliable.
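To make the mechanics concrete, the sketch below reconstructs the initialization step as the abstract describes it. It is Pith's minimal reading, not the authors' code: the covariance estimate from a small calibration batch, the use of torch.linalg.svd, the choice of right singular vectors, and the zero-initialization of B are all assumptions.

```python
import torch

def tlora_init(W0: torch.Tensor, calib_acts: torch.Tensor, r: int):
    """Minimal sketch of a TLoRA-style initialization (assumed details).

    W0         : frozen pre-trained weight of one layer, shape (m, n)
    calib_acts : task activations feeding this layer, shape (N, n)
    r          : rank allocated to this layer
    """
    # Input activation covariance C (n x n), estimated on a small task sample.
    C = calib_acts.T @ calib_acts / calib_acts.shape[0]

    # SVD of the product of pre-trained weights and input covariance (W0 C).
    # The top-r right singular vectors are taken as the task-relevant subspace.
    _, _, Vh = torch.linalg.svd(W0 @ C, full_matrices=False)
    A = Vh[:r, :].clone()                            # (r, n), frozen after init
    B = torch.zeros(W0.shape[0], r, dtype=W0.dtype)  # (m, r), trained

    A.requires_grad_(False)  # A stays fixed; only B receives gradient updates
    B.requires_grad_(True)   # zero init keeps the adapter a no-op at step 0
    return A, B
```

Under this layout the adapted layer computes W0·x + B(Ax), so freezing A removes the A-matrix parameters from training and roughly halves the adapter's trainable count relative to standard LoRA at the same rank when the layer is approximately square.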

Core claim

TLoRA is a framework that performs singular value decomposition on the product of pre-trained weights and input activation covariance to initialize the LoRA A matrix with task-relevant subspaces, freezes A while training only B, and employs a sensitivity-based importance metric to adaptively allocate ranks and scaling factors across layers under a fixed parameter budget, leading to excellent performance across natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation tasks with reduced trainable parameters.

What carries the argument

The SVD-based task-aware initialization of the LoRA A matrix combined with the sensitivity-based metric for adaptive rank and scaling allocation across layers.

If this is right

  • TLoRA reduces the number of trainable parameters compared with standard LoRA while delivering strong results on multiple task types.
  • The method works across natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation.
  • Freezing the initialized A matrix and training only the B matrix keeps training simpler than some other variants.
  • The joint optimization of initialization and allocation avoids the extra complexity introduced by some existing LoRA extensions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The sensitivity metric could be tested as a general tool for deciding update priorities in other parameter-efficient methods.
  • If the activation covariance step proves robust, small task-specific datasets might suffice to guide efficient adaptation in new domains.
  • Similar decomposition-based initialization might apply to deciding which modules to adapt in full-model fine-tuning.

Load-bearing premise

The assumption that decomposing the product of the model's weights and input patterns will select directions useful for the task, and that the sensitivity measure will rank layer importance accurately without bias or extra searches.
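One way to probe the first half of that premise, in the spirit of the subspace-similarity analysis the figures gesture at, is to compare the frozen SVD-derived directions with the directions an unconstrained LoRA learns. The snippet below is an illustrative check using principal angles, not the paper's protocol; scipy.linalg.subspace_angles and the inputs are assumptions.

```python
import numpy as np
from scipy.linalg import subspace_angles

def row_space_similarity(A_init: np.ndarray, A_learned: np.ndarray) -> float:
    """Mean cosine of the principal angles between the row spaces of two (r, n) matrices.

    A value near 1 would support the premise that the frozen SVD directions
    already span what a freely trained adapter discovers; a value near 0 would
    undercut it.
    """
    # subspace_angles expects column-space bases, so pass the transposes.
    angles = subspace_angles(A_init.T, A_learned.T)
    return float(np.mean(np.cos(angles)))
```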

What would settle it

Experiments that replace the SVD initialization with standard random initialization or switch to uniform rank allocation instead of the sensitivity-based choice, then check whether the reported performance advantage over baseline LoRA disappears on the tested tasks.
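A minimal sketch of that two-factor ablation grid, assuming a hypothetical train_and_eval helper; this is scaffolding to show the comparison being asked for, not the paper's experimental code.

```python
from itertools import product

# The two components under test, each swapped out independently.
INIT_MODES = ["svd_task_aware", "random_gaussian"]   # LoRA A initialization
ALLOC_MODES = ["sensitivity_based", "uniform_rank"]  # per-layer rank budget

def run_ablation(train_and_eval, tasks, param_budget):
    """train_and_eval(init, alloc, task, budget) -> metric is assumed to exist."""
    results = {}
    for init, alloc, task in product(INIT_MODES, ALLOC_MODES, tasks):
        results[(init, alloc, task)] = train_and_eval(init, alloc, task, param_budget)
    # The claim survives only if ("svd_task_aware", "sensitivity_based") beats the
    # other three settings at the same trainable-parameter budget on each task.
    return results
```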

Figures

Figures reproduced from arXiv: 2604.18124 by Jiawei Dang, Liang-Jie Zhang, Weicheng Lin, Yi Zhang.

Figure 1
Figure 1: (Left) Standard LoRA employs random Gaussian initialization for matrix A and initializes B to zero, using a fixed rank r across all modules. (Right) TLoRA utilizes a task-aware initialization strategy where matrix A is initialized using the top-r* singular vectors of the product of pre-trained weights and input covariance (W0C) and is subsequently frozen (indicated by the snowflake). Furthermore, TLoRA ad… view at source ↗
Figure 2
Figure 2: (Left) Training loss curves of TLoRA and baseline with 128 ranks on the MetaMathQA dataset. (Right) Training loss curves from the ablation study with different settings on the MetaMathQA dataset. Notably, despite utilizing significantly fewer trainable parameters, TLoRA consistently achieves superior or highly competitive performance across various inference and generative benc… view at source ↗
Figure 3
Figure 3: Subspace Similarity (ϕ). Finally, we map the optimization result back to the original parameter space via an inverse transformation to recover the LoRA projection matrix A. Recalling the transformation definition Ã = A C^{1/2}, by right-multiplying both sides of the equation by C^{-1/2}, we derive the theoretical optimal solution for A as: A = V_r^T C^{-1/2} ∈ R^{r×n} (Eq. 16), where V_r is the matrix composed of the… view at source ↗
Figure 5
Figure 5: Gradient norm and training loss curves during the initial phase. The results provide compelling evidence: firstly, the theoretical solution exhibits a propensity for gradient explosion, with gradient norms exceeding those of TLoRA by several orders of magnitude. Secondly, this instability hinders effective model learning, resulting in a failure to converge. In contrast, TLoRA maintains stab… view at source ↗
Figure 6
Figure 6: Subspace similarity analysis between the… view at source ↗
Figure 7
Figure 7: The data on the graph shows the difference… view at source ↗
read the original abstract

Low-Rank Adaptation (LoRA) has become a widely adopted parameter-efficient fine-tuning method for large language models, with its effectiveness largely influenced by the allocation of ranks and scaling factors, as well as initialization. Existing LoRA variants typically address only one of these factors, often at the cost of increased training complexity or reduced practical efficiency. In this work, we present Task-aware Low-Rank Adaptation (TLoRA), a unified framework that jointly optimizes initialization and resource allocation at the outset of training. TLoRA introduces a data-driven initialization strategy that aligns the LoRA $A$ matrix with task-relevant subspaces by performing singular value decomposition on the product of pre-trained weights and input activation covariance. After this, the $A$ matrix is frozen, and only the $B$ matrix is trained. Furthermore, TLoRA employs a sensitivity-based importance metric to adaptively allocate ranks and scaling factors across layers under a fixed parameter budget. We conduct extensive experiments that demonstrate TLoRA consistently performs excellently across various tasks, including natural language understanding, commonsense reasoning, math reasoning, code generation, and chat generation, while significantly reducing the number of trainable parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces TLoRA, a unified LoRA variant that (1) initializes the low-rank A matrix via SVD on the product of pretrained weights and input activation covariance, then freezes A while training only B, and (2) uses a sensitivity-based metric to adaptively allocate per-layer ranks and scaling factors under a fixed parameter budget. It claims this yields consistently strong results across NLU, commonsense reasoning, math reasoning, code generation, and chat tasks while using fewer trainable parameters than standard LoRA.

Significance. If the subspace-alignment and bias-free selection assumptions hold and the reported gains are reproducible, TLoRA would provide a practical, low-overhead way to improve both initialization and resource allocation in parameter-efficient fine-tuning. The joint treatment of init and allocation under a single framework is a conceptual strength, but the absence of quantitative results, ablations, or statistical details in the provided description makes it impossible to gauge whether the gains exceed those of existing adaptive LoRA methods.

major comments (3)
  1. [§3] §3 (method): The claim that SVD on W·(activation covariance) produces an A matrix whose frozen directions are 'task-relevant' is presented as a direct algorithmic step with no accompanying analysis, visualization, or ablation showing that these directions capture task-specific features better than random or standard LoRA initialization; without such evidence the freezing of A after SVD remains an unverified modeling assumption.
  2. [§4] §4 (experiments): The abstract and experimental claims assert 'consistent excellence' and 'significant' parameter reduction across five task categories, yet supply no numerical results, baseline comparisons (e.g., LoRA, DoRA, AdaLoRA), number of runs, or statistical tests; this absence is load-bearing because the central contribution is empirical superiority under a fixed budget.
  3. [§3.3] §3.3 (sensitivity metric): The sensitivity-based importance score used for adaptive rank/scaling allocation is described only at a high level; if the metric is gradient- or activation-derived, its correlation with true task importance versus layer depth/width is not analyzed, leaving open the possibility that the reported gains are an artifact of the allocation heuristic rather than the unified framework.
minor comments (2)
  1. [§3.1] Notation for the activation covariance matrix and the exact SVD procedure should be formalized with equations rather than prose to allow reproduction.
  2. [§3] The paper should clarify the computational overhead of the one-time SVD and sensitivity computation relative to standard LoRA training.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify important areas where additional evidence and clarity will strengthen the manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (method): The claim that SVD on W·(activation covariance) produces an A matrix whose frozen directions are 'task-relevant' is presented as a direct algorithmic step with no accompanying analysis, visualization, or ablation showing that these directions capture task-specific features better than random or standard LoRA initialization; without such evidence the freezing of A after SVD remains an unverified modeling assumption.

    Authors: We agree that the manuscript would be strengthened by explicit evidence supporting the task-relevance of the SVD-derived directions. In the revised version we will add to §3 an analysis subsection containing (i) visualizations of the leading singular vectors from the task-data SVD versus random initialization, (ii) quantitative ablations comparing downstream performance when A is frozen after SVD versus when it is randomly initialized or jointly trained, and (iii) a brief discussion of how the captured subspaces align with task-specific activation patterns. These additions will provide direct empirical grounding for the modeling choice. revision: yes

  2. Referee: [§4] §4 (experiments): The abstract and experimental claims assert 'consistent excellence' and 'significant' parameter reduction across five task categories, yet supply no numerical results, baseline comparisons (e.g., LoRA, DoRA, AdaLoRA), number of runs, or statistical tests; this absence is load-bearing because the central contribution is empirical superiority under a fixed budget.

    Authors: We acknowledge that the current presentation of results does not sufficiently detail the numerical evidence or statistical support. We will revise §4 to include full performance tables with exact metrics and trainable-parameter counts for TLoRA and the requested baselines (LoRA, DoRA, AdaLoRA) across all five task categories, report the number of runs (with standard deviations), and add statistical significance tests (paired t-tests) comparing TLoRA against each baseline under the fixed parameter budget. This will make the empirical claims fully verifiable. revision: yes

  3. Referee: [§3.3] §3.3 (sensitivity metric): The sensitivity-based importance score used for adaptive rank/scaling allocation is described only at a high level; if the metric is gradient- or activation-derived, its correlation with true task importance versus layer depth/width is not analyzed, leaving open the possibility that the reported gains are an artifact of the allocation heuristic rather than the unified framework.

    Authors: The sensitivity score is computed from the average gradient norm of the task loss with respect to each layer’s weights on a small calibration set of task data. We will expand §3.3 with the exact mathematical definition, a correlation analysis between the scores and both task performance and layer depth/width, and ablation experiments that compare sensitivity-based allocation against uniform and alternative heuristics while keeping the SVD initialization fixed. These additions will clarify that the observed gains arise from the joint framework rather than the allocation rule in isolation. revision: yes
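Taking that description at face value, a rough sketch of the allocation step would look like the following; the gradient-norm averaging and the proportional rounding rule are Pith's assumptions about how the rebuttal's definition could be realized, not the authors' algorithm.

```python
import torch

def layer_sensitivities(model, loss_fn, calibration_loader, layer_params):
    """Average gradient norm of the task loss per layer on a small calibration set."""
    scores = {name: 0.0 for name in layer_params}
    steps = 0
    for batch in calibration_loader:
        model.zero_grad()
        loss = loss_fn(model, batch)   # task loss on this calibration batch
        loss.backward()
        for name, p in layer_params.items():
            if p.grad is not None:
                scores[name] += p.grad.norm().item()
        steps += 1
    return {name: s / max(steps, 1) for name, s in scores.items()}

def allocate_ranks(scores, total_rank_budget, min_rank=1):
    """Split a fixed total rank budget across layers in proportion to sensitivity."""
    total = sum(scores.values()) or 1.0
    # Proportional rounding; an exact-budget variant would redistribute the remainder.
    return {name: max(min_rank, round(total_rank_budget * s / total))
            for name, s in scores.items()}
```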

Circularity Check

0 steps flagged

No significant circularity: TLoRA is a direct algorithmic recipe

full rationale

The paper presents TLoRA as an explicit two-part procedure—SVD initialization of the frozen A matrix from the product of pretrained weights and input activation covariance, followed by sensitivity-based adaptive allocation of ranks and scaling factors under a fixed budget. No equations, predictions, or uniqueness claims are shown to reduce to fitted constants, self-referential definitions, or load-bearing self-citations. The method is self-contained as a heuristic construction whose validity is assessed via downstream experiments rather than internal derivation loops.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on the unstated premise that the top singular vectors of W * Cov(x) capture task-relevant directions and that loss sensitivity to rank is a reliable proxy for importance; both are domain assumptions without independent verification in the abstract.

axioms (2)
  • domain assumption SVD on pre-trained weights times input activation covariance yields task-aligned subspaces for A initialization
    Invoked in the initialization strategy paragraph of the abstract
  • domain assumption Sensitivity-based metric correctly ranks layer importance for rank allocation
    Invoked in the adaptive allocation paragraph of the abstract

pith-pipeline@v0.9.0 · 5509 in / 1413 out tokens · 53715 ms · 2026-05-10T05:02:20.029487+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

83 extracted references · 52 canonical work pages · 24 internal anchors
