pith. machine review for the scientific record.

arxiv: 2305.14314 · v1 · submitted 2023-05-23 · 💻 cs.LG


QLoRA: Efficient Finetuning of Quantized LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 13:24 UTC · model grok-4.3

classification 💻 cs.LG
keywords QLoRA · quantized fine-tuning · LoRA adapters · 4-bit NormalFloat · large language models · efficient training · instruction tuning · chatbot evaluation

The pith

QLoRA enables full-performance fine-tuning of 65B language models on a single 48GB GPU by freezing a 4-bit quantized base and training only low-rank adapters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QLoRA to cut memory requirements for adapting large pretrained language models so that even 65-billion-parameter models fit on a single 48 GB GPU. It freezes the base weights after converting them to a 4-bit NormalFloat format, then routes all gradient updates through small low-rank adapter matrices instead of touching the original parameters. This combination, plus double quantization of the scaling constants and paged memory management for the optimizer states, matches the task performance of standard 16-bit fine-tuning. The authors demonstrate the approach by training over one thousand models and show that their best Guanaco family reaches 99.3 percent of ChatGPT's score on the Vicuna benchmark after only 24 hours on a single GPU. They also release code and models so others can replicate the results across different base architectures and instruction datasets.

Core claim

QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). With 4-bit NormalFloat quantization, double quantization of the quantization constants, and paged optimizers, the method preserves full 16-bit finetuning task performance while using far less memory.

What carries the argument

The QLoRA pipeline: a 4-bit NormalFloat-quantized frozen base model whose gradients are routed exclusively into trainable low-rank adapter matrices, supported by double quantization and paged optimizers to control memory spikes.
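A minimal sketch of that pipeline as it is commonly reproduced with the Hugging Face stack (transformers, peft, bitsandbytes); the model id, LoRA rank/alpha, learning rate, and target modules below are illustrative assumptions, not the paper's exact configuration, and library behavior can vary across versions.

```python
# Minimal sketch of the QLoRA recipe with the Hugging Face stack
# (transformers + peft + bitsandbytes). Model id, rank/alpha, learning rate,
# and target modules are illustrative placeholders, not the paper's exact setup.
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder base model

# Frozen base in 4-bit NormalFloat, with double quantization of the constants.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)  # freeze base, cast norms, enable input grads

# Low-rank adapters are the only trainable parameters; rank and alpha are free parameters.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of total weights

# Paged optimizer: Adam states live in paged (unified) memory to absorb spikes.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = bnb.optim.PagedAdamW8bit(trainable, lr=2e-4)
```

In this setup gradients flow through the dequantized 4-bit base weights during the backward pass, but only the adapter matrices receive updates.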

If this is right

  • A 65B-parameter model becomes fine-tunable on a single 48 GB GPU without loss of task accuracy.
  • Small high-quality instruction datasets produce state-of-the-art chatbot performance even when the base model is smaller than prior leaders.
  • Open models can reach 99.3% of closed-model benchmark scores after 24 hours of single-GPU training.
  • GPT-4-based automatic evaluations serve as a practical and inexpensive substitute for human chatbot judgments.
  • Current public chatbot benchmarks contain systematic gaps that make them unreliable for ranking model quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same quantization-plus-adapter pattern could be applied to even larger base models if the gradient flow through the adapters remains stable.
  • Task-specific adaptation may be largely separable from the general knowledge stored in the base weights, allowing repeated low-cost updates without retraining the entire model.
  • Combining QLoRA with other memory-saving techniques could open multi-task or continual-learning regimes on hardware that previously could hold only one model copy.

Load-bearing premise

The 4-bit NormalFloat representation of the frozen base weights must retain enough information and gradient signal that low-rank adapters can recover the full task performance of 16-bit fine-tuning.

What would settle it

A controlled experiment that fine-tunes the identical base model and dataset once with standard 16-bit precision and once with QLoRA, then measures whether the QLoRA version falls short by more than a few percent on the same evaluation suite.
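One hedged way to score such a head-to-head run once both arms have been evaluated: a paired bootstrap confidence interval on the per-example score gap. The score arrays below are synthetic placeholders standing in for real evaluation-suite outputs, and the helper is an editorial illustration, not something from the paper.

```python
# Paired bootstrap on the gap between 16-bit and QLoRA fine-tuning scores.
# The score arrays are synthetic placeholders for real per-example eval results.
import numpy as np

rng = np.random.default_rng(0)
scores_fp16 = rng.uniform(0.5, 1.0, size=500)                 # placeholder: 16-bit arm
scores_qlora = scores_fp16 - rng.normal(0.0, 0.02, size=500)  # placeholder: QLoRA arm

def bootstrap_gap(a, b, n_boot=10_000, seed=1):
    """Mean of (a - b) with a 95% paired-bootstrap confidence interval."""
    rng = np.random.default_rng(seed)
    diffs = a - b
    resampled = np.array([
        diffs[rng.integers(0, len(diffs), len(diffs))].mean() for _ in range(n_boot)
    ])
    return diffs.mean(), np.percentile(resampled, [2.5, 97.5])

gap, (low, high) = bootstrap_gap(scores_fp16, scores_qlora)
print(f"mean gap (16-bit minus QLoRA): {gap:.4f}, 95% CI [{low:.4f}, {high:.4f}]")
# If the whole interval sits well below a few percent of the score scale,
# the QLoRA arm has not measurably fallen short on this suite.
```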

read the original abstract

We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters~(LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimziers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces QLoRA, a memory-efficient fine-tuning method for large language models. It freezes a pretrained LLM quantized to 4 bits using a new NormalFloat (NF4) datatype, injects LoRA adapters for task-specific updates, and adds double quantization of the quantization constants plus paged optimizers to handle memory spikes. The authors report that this enables fine-tuning of a 65B-parameter model on a single 48 GB GPU while preserving full 16-bit task performance. Their best models (Guanaco) reach 99.3% of ChatGPT performance on the Vicuna benchmark after 24 hours of single-GPU training on high-quality instruction data, and they release code, CUDA kernels, and over 1,000 fine-tuned models with extensive ablations across model scales, datasets, and evaluation protocols.

Significance. If the core performance claims hold under rigorous controls, the work has high practical significance: it substantially lowers the hardware barrier for adapting LLMs at the 30B–65B scale, enabling broader experimentation by researchers without multi-GPU clusters. The release of all models, training code, and 4-bit kernels is a clear strength. The scale of the empirical study (>1,000 models) and the dual human/GPT-4 evaluation analysis also contribute useful data on instruction-following and chatbot assessment.

major comments (2)
  1. [§3.2 and §4.1] The central claim that 4-bit NF4 quantization of the frozen base model preserves full 16-bit task performance for LoRA adaptation rests on the unverified assumption that quantization error does not systematically distort gradient directions or norms for the adapters. The manuscript provides no direct diagnostic (e.g., gradient cosine similarity, norm ratios, or loss-landscape curvature comparisons) between QLoRA and 16-bit back-propagation on identical forward passes; the performance equivalence is inferred only from downstream benchmark scores. (An illustrative sketch of such a diagnostic follows these major comments.)
  2. [Table 2 and §5.1] While QLoRA is shown to match full 16-bit fine-tuning on smaller models (7B–13B), the paper does not isolate the contribution of NF4 versus the choice of instruction dataset or LoRA hyperparameters. A controlled ablation that holds data and rank fixed while varying only the base-model precision would be required to substantiate the “parameter-free” recovery claim.
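An editorial illustration of the diagnostic requested in major comment 1, not an experiment from the paper: push one batch through a bf16 copy and an NF4-quantized copy of the same base model with identically seeded LoRA adapters, then compare adapter gradients by cosine similarity and norm ratio. The model id and LoRA settings are placeholder assumptions, a CUDA device is assumed because bitsandbytes 4-bit weights require one, and peft/bitsandbytes behavior can vary across versions.

```python
# Compare LoRA adapter gradients under a bf16 vs. an NF4-quantized base model.
# Placeholder small model; assumes a CUDA device for bitsandbytes 4-bit weights.
import torch
from torch.nn.functional import cosine_similarity
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "facebook/opt-125m"  # placeholder; any causal LM supported by bitsandbytes
lora_cfg = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"],
                      task_type="CAUSAL_LM")

def build(quantized: bool):
    kwargs = dict(torch_dtype=torch.bfloat16, device_map="auto")
    if quantized:
        kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True, bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16)
    base = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    torch.manual_seed(0)  # identical lora_A initialization in both copies
    return get_peft_model(base, lora_cfg)

tok = AutoTokenizer.from_pretrained(model_id)
batch = tok("QLoRA routes gradients through a frozen 4-bit base.", return_tensors="pt")

def adapter_grads(model):
    device = next(model.parameters()).device
    inputs = {k: v.to(device) for k, v in batch.items()}
    model.zero_grad()
    model(**inputs, labels=inputs["input_ids"]).loss.backward()
    # lora_A gradients are exactly zero at init (lora_B starts at zero),
    # so the informative comparison is over lora_B gradients.
    return {n: p.grad.detach().float().flatten()
            for n, p in model.named_parameters()
            if "lora_B" in n and p.grad is not None}

g16 = adapter_grads(build(quantized=False))
g4 = adapter_grads(build(quantized=True))
for name in g16:
    cos = cosine_similarity(g16[name], g4[name], dim=0).item()
    ratio = (g4[name].norm() / g16[name].norm()).item()
    print(f"{name}: cosine={cos:.3f} norm_ratio={ratio:.3f}")
```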
minor comments (3)
  1. [Abstract] “optimziers” is a typographical error.
  2. [§4.3] The description of paged optimizers would benefit from a short pseudocode or memory-timeline figure to clarify how page swapping interacts with the Adam optimizer states.
  3. [Figure 3] Axis labels and legend text are too small for print; consider increasing font size or splitting into two panels.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and constructive comments. We address each major point below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [§3.2 and §4.1] The central claim that 4-bit NF4 quantization of the frozen base model preserves full 16-bit task performance for LoRA adaptation rests on the unverified assumption that quantization error does not systematically distort gradient directions or norms for the adapters. The manuscript provides no direct diagnostic (e.g., gradient cosine similarity, norm ratios, or loss-landscape curvature comparisons) between QLoRA and 16-bit back-propagation on identical forward passes; the performance equivalence is inferred only from downstream benchmark scores.

    Authors: We agree that the manuscript relies on downstream task performance rather than direct gradient diagnostics to support equivalence. While any systematic distortion in gradients would be expected to degrade final task metrics (which we do not observe across MMLU, Vicuna, and other benchmarks for models up to 65B), we acknowledge that explicit comparisons would strengthen the argument. In the revision we will add a short discussion in §4.1 referencing the observed gradient norm stability from our internal checks on smaller models and note the absence of full side-by-side diagnostics as a limitation. revision: partial

  2. Referee: [Table 2 and §5.1] While QLoRA is shown to match full 16-bit fine-tuning on smaller models (7B–13B), the paper does not isolate the contribution of NF4 versus the choice of instruction dataset or LoRA hyperparameters. A controlled ablation that holds data and rank fixed while varying only the base-model precision would be required to substantiate the “parameter-free” recovery claim.

    Authors: Table 2 already reports 4-bit versus 16-bit results for the 7B and 13B models under identical LoRA rank, dataset, and hyperparameter settings, showing near-identical performance. To make the isolation of quantization more explicit, we will add a dedicated controlled ablation in the revised §5.1 (and update Table 2) that fixes the instruction dataset, LoRA rank, and all other hyperparameters while varying only base-model precision (NF4 4-bit vs. 16-bit). revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method validated on external benchmarks

full rationale

The paper presents QLoRA as an engineering combination of 4-bit NF4 quantization, double quantization, paged optimizers, and LoRA adapters. All performance claims (Guanaco reaching 99.3% of ChatGPT on Vicuna) are measured against external benchmarks and prior models rather than derived from internal fitted parameters or self-referential equations. NF4 is motivated by information-theoretic optimality for normal distributions but its task performance is demonstrated empirically across >1000 models on multiple datasets and scales; no derivation chain reduces the central preservation-of-performance claim to a tautology or self-citation. The work is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that a frozen 4-bit quantized base model supplies adequate gradients for LoRA adapters to match full-precision fine-tuning; the new NF4 datatype is introduced without external validation beyond the paper's own experiments.

free parameters (1)
  • LoRA rank and alpha
    Hyperparameters controlling adapter capacity; their specific values are chosen per experiment but not enumerated in the abstract.
axioms (1)
  • domain assumption: 4-bit quantization of pretrained weights preserves enough representational capacity for downstream adaptation via adapters
    Invoked when claiming that frozen 4-bit models plus LoRA recover full 16-bit performance.
invented entities (1)
  • NormalFloat (NF4) 4-bit datatype (no independent evidence)
    purpose: Information-theoretically optimal representation for normally distributed weights
    New datatype proposed in the paper to improve 4-bit quantization fidelity; a simplified sketch of the quantile idea behind it follows this ledger.
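An editorial sketch of that quantile idea, not the paper's exact construction: place 16 code points at equal-probability-mass quantiles of a standard normal and rescale them to [-1, 1]. The real NF4 uses an asymmetric split so that zero is exactly representable; this simplified version only shows why the levels cluster where normally distributed weights concentrate.

```python
# Simplified quantile-based 4-bit levels in the spirit of NF4 (not the exact recipe).
import numpy as np
from scipy.stats import norm

def normalfloat_levels(bits: int = 4) -> np.ndarray:
    n = 2 ** bits
    offset = 0.5 / n  # keep probabilities away from 0 and 1 (infinite quantiles)
    probs = np.linspace(offset, 1 - offset, n)
    levels = norm.ppf(probs)
    return levels / np.abs(levels).max()  # rescale into [-1, 1]

levels = normalfloat_levels(4)
print(np.round(levels, 4))  # 16 levels, denser near 0 than a uniform int4 grid

def quantize_block(w: np.ndarray, levels: np.ndarray):
    """Blockwise absmax scaling, then snap each weight to the nearest level."""
    absmax = np.abs(w).max()
    idx = np.abs(w[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

w = np.random.default_rng(0).normal(size=64).astype(np.float32)
idx, absmax = quantize_block(w, levels)
reconstruction = levels[idx] * absmax
print(float(np.mean((w - reconstruction) ** 2)))  # per-block quantization error
```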

pith-pipeline@v0.9.0 · 5676 in / 1474 out tokens · 48729 ms · 2026-05-11T13:24:33.099632+00:00 · methodology


Forward citations

Cited by 50 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Standardized Re-evaluation of Conversational Recommender Systems on the ReDial Dataset

    cs.IR 2026-05 accept novelty 7.0

    Standardized re-evaluation of CRS methods on ReDial finds that nearly half of reported accuracy stems from repetition shortcuts absent in novelty-focused tests, performance tracks LLM capacity more than architecture, ...

  2. DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules

    cs.AI 2026-05 unverdicted novelty 7.0

    DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...

  3. A Multi-View Media Profiling Suite: Resources, Evaluation, and Analysis

    cs.CL 2026-05 unverdicted novelty 7.0

    Presents MBFC-2025 dataset and multi-view embeddings with fusion methods for media bias and factuality, reporting SOTA results on ACL-2020 and new benchmarks on MBFC-2025.

  4. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  5. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  6. CrashSight: A Phase-Aware, Infrastructure-Centric Video Benchmark for Traffic Crash Scene Understanding and Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    CrashSight is a new infrastructure-focused benchmark showing that state-of-the-art vision-language models can describe crash scenes but fail at temporal and causal reasoning.

  7. AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    AtlasOCR delivers the first open-source Darija OCR by fine-tuning Qwen2.5-VL 3B, achieving state-of-the-art results on custom and existing benchmarks for both Darija and Arabic.

  8. KITE: Keyframe-Indexed Tokenized Evidence for VLM-Based Robot Failure Analysis

    cs.RO 2026-04 unverdicted novelty 7.0

    KITE is a training-free method that uses keyframe-indexed tokenized evidence including BEV schematics to enhance VLM performance on robot failure detection, identification, localization, explanation, and correction.

  9. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  10. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  11. MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

  12. Output Composability of QLoRA PEFT Modules for Plug-and-Play Attribute-Controlled Text Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Summing outputs from separately trained QLoRA PEFT modules provides strong performance for attribute-controlled text generation, often matching or exceeding single-task modules even on single-attribute tests.

  13. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  14. State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

  15. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  16. Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study

    cs.SE 2026-04 unverdicted novelty 6.0

    Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.

  17. Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

    cs.AI 2026-04 unverdicted novelty 6.0

    A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

  18. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  19. Sensitivity-Positional Co-Localization in GQA Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    In Llama 3.1 8B, task-sensitive layers cluster late while RoPE adaptation is strongest early, yet applying both adaptations only to sensitivity-identified layers outperforms other layer choices by 4-16 points on MMLU,...

  20. The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training

    cs.CR 2026-04 unverdicted novelty 6.0

    ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.

  21. ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache

    cs.DC 2026-04 unverdicted novelty 6.0

    ForkKV uses copy-on-write disaggregated KV cache with DualRadixTree and ResidualAttention kernels to deliver up to 3x throughput over prior multi-LoRA serving systems with negligible quality loss.

  22. Constraint-Driven Warm-Freeze for Efficient Transfer Learning in Photovoltaic Systems

    cs.NE 2026-04 unverdicted novelty 6.0

    CDWF achieves 90-99% of full fine-tuning performance with up to 120x fewer trainable parameters by dynamically allocating full trainability to gradient-important blocks and LoRA to others for PV cyberattack transfer learning.

  23. Aletheia: Gradient-Guided Layer Selection for Efficient LoRA Fine-Tuning Across Architectures

    cs.LG 2026-04 conditional novelty 6.0

    Gradient-guided layer selection for LoRA yields 15-28% training speedup with matched downstream results on MMLU, GSM8K, and HumanEval across 14 models from 0.5B to 72B parameters.

  24. An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis

    cs.CV 2026-04 unverdicted novelty 6.0

    A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.

  25. LiFT: Does Instruction Fine-Tuning Improve In-Context Learning for Longitudinal Modelling by Large Language Models?

    cs.CL 2026-03 unverdicted novelty 6.0

    LiFT instruction fine-tunes LLMs with a temporal curriculum to improve in-context learning on longitudinal NLP tasks, yielding gains on out-of-distribution data and rare change events across multiple model sizes.

  26. AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models

    cs.CL 2023-10 conditional novelty 6.0

    AutoDAN automatically generates semantically meaningful jailbreak prompts for aligned LLMs via a hierarchical genetic algorithm, achieving higher attack success, cross-model transferability, and universality than base...

  27. MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    cs.CL 2023-09 conditional novelty 6.0

    Bootstrapping math questions via rewriting creates MetaMathQA; fine-tuning LLaMA-2 on it yields 66.4% on GSM8K for 7B and 82.3% for 70B, beating prior same-size models by large margins.

  28. Baseline Defenses for Adversarial Attacks Against Aligned Language Models

    cs.LG 2023-09 conditional novelty 6.0

    Baseline defenses including perplexity-based detection, input preprocessing, and adversarial training offer partial robustness to text adversarial attacks on LLMs, with challenges arising from weak discrete optimizers.

  29. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    cs.CL 2023-06 accept novelty 6.0

    GPT-4 as an LLM judge achieves over 80% agreement with human preferences on MT-Bench and Chatbot Arena, matching human agreement levels and providing a scalable evaluation method.

  30. Fine-Tuning Models for Automated Code Review Feedback

    cs.SE 2026-05 conditional novelty 5.0

    PEFT fine-tuning of Code Llama yields feedback on student Java bugs that students judge equal to ChatGPT and better than prompt engineering, using BLEU/ROUGE/BERTScore plus human ratings.

  31. Towards a Large Language-Vision Question Answering Model for MSTAR Automatic Target Recognition

    cs.CV 2026-05 unverdicted novelty 5.0

    A fine-tuned large language-vision model achieves 98% accuracy on visual question answering for military vehicle identification in SAR imagery from an extended MSTAR benchmark.

  32. AdaMeZO: Adam-style Zeroth-Order Optimizer for LLM Fine-tuning Without Maintaining the Moments

    cs.LG 2026-05 unverdicted novelty 5.0

    AdaMeZO adapts Adam moment estimates to zeroth-order LLM fine-tuning without extra memory storage, outperforming MeZO with up to 70% fewer forward passes.

  33. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 5.0

    Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.

  34. ChipLingo: A Systematic Training Framework for Large Language Models in EDA

    cs.LG 2026-04 unverdicted novelty 5.0

    ChipLingo trains LLMs on EDA data via corpus construction, domain-adaptive pretraining, and RAG scenario alignment, reaching 59.7% accuracy with an 8B model and 70.02% with a 32B model on a new internal EDA benchmark.

  35. A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

    cs.LG 2026-04 unverdicted novelty 5.0

    KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

  36. Please Make it Sound like Human: Encoder-Decoder vs. Decoder-Only Transformers for AI-to-Human Text Style Transfer

    cs.CL 2026-04 unverdicted novelty 5.0

    BART-large outperforms Mistral-7B in AI-to-human style transfer with higher reference similarity scores and far fewer parameters, while showing that marker shift can reflect overshoot rather than accurate transfer.

  37. NyayaMind- A Framework for Transparent Legal Reasoning and Judgment Prediction in the Indian Legal System

    cs.CL 2026-04 unverdicted novelty 5.0

    NyayaMind combines RAG retrieval with domain-specific LLMs to generate transparent, structured legal reasoning and judgment predictions for Indian court cases.

  38. PassiveQA: A Three-Action Framework for Epistemically Calibrated Question Answering via Supervised Finetuning

    cs.CL 2026-04 unverdicted novelty 5.0

    PassiveQA trains models via supervised finetuning to decide Answer, Ask, or Abstain using structured information-state representations and knowledge-graph context, yielding better abstention and lower hallucination on...

  39. VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

    cs.CL 2026-05 unverdicted novelty 4.0

    VectraYX-Nano is a 42M-parameter Spanish cybersecurity LLM trained with curriculum learning and native MCP tool use, achieving 0.78 conversational gate and improved tool selection with denser data.

  40. LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language

    cs.CL 2026-05 conditional novelty 4.0

    Qwen2.5-3B was continued-pretrained and then fine-tuned with rsLoRA r256 on Sardinian data to reach 28.5 BLEU into the language, outperforming full fine-tuning and other LoRA variants.

  41. OpenSOC-AI: Democratizing Security Operations with Parameter Efficient LLM Log Analysis

    cs.CR 2026-04 unverdicted novelty 4.0

    LoRA fine-tuning of TinyLlama-1.1B on 450 SOC examples produces 68% threat classification accuracy and 58% severity accuracy on 50 held-out logs, with full code, weights, and data released.

  42. Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support

    cs.AI 2026-04 unverdicted novelty 4.0

    A cross-platform mobile application deploys an ensemble of quantized open-source LLMs for fully local, DSM-5-aligned psychiatric decision support with claimed accuracy comparable to prior cloud versions.

  43. FLeX: Fourier-based Low-rank EXpansion for multilingual transfer

    cs.LG 2026-04 unverdicted novelty 4.0

    LoRA fine-tuning of Code Llama with Fourier regularization raises Java pass@1 from 34.2% to 42.1% while using a small high-quality dataset.

  44. Information Extraction from Electricity Invoices with General-Purpose Large Language Models

    cs.CL 2026-04 unverdicted novelty 4.0

    Few-shot prompting lifts F1 scores above 96 percent on electricity-invoice extraction for Gemini 1.5 Pro and Mistral-small, while hyperparameter changes produce only marginal gains.

  45. Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey

    cs.LG 2024-03 accept novelty 4.0

    A comprehensive survey of PEFT algorithms for large models, covering their performance, overhead, applications, and real-world system implementations.

  46. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

  47. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

  48. FUTURAL: A Metasearch Platform for Empowering Rural Areas with Smart Solutions

    cs.IR 2026-04 unverdicted novelty 2.0

    FUTURAL's metasearch MVP uses LLMs to enable natural language queries over smart solutions data to support rural development.

  49. The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge

    cs.LG 2026-04 unverdicted novelty 2.0

    A competition entry achieved efficient fine-tuning of LLaMa2 70B on one GPU in 24 hours with competitive QA benchmark performance.

  50. QU-NLP at ArchEHR-QA 2026: Two-Stage QLoRA Fine-Tuning of Qwen3-4B for Patient-Oriented Clinical Question Answering and Evidence Sentence Alignment

    cs.CL 2026-03 unverdicted novelty 2.0

    Two-stage QLoRA fine-tuning of Qwen3-4B plus retrieval ensemble achieves 32.87 overall score on clinical QA and 67.16 micro-F1 on evidence alignment, highlighting that 20 training cases are insufficient.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 49 Pith papers · 23 internal anchors

  1. [1]

    S. An, Y . Li, Z. Lin, Q. Liu, B. Chen, Q. Fu, W. Chen, N. Zheng, and J.-G. Lou. Input-tuning: Adapting unfamiliar inputs to frozen pretrained models. arXiv preprint arXiv:2203.03131, 2022

  2. [2]

    A General Language Assistant as a Laboratory for Alignment

    A. Askell, Y . Bai, A. Chen, D. Drain, D. Ganguli, T. Henighan, A. Jones, N. Joseph, B. Mann, N. DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021

  3. [3]

    S. H. Bach, V . Sanh, Z.-X. Yong, A. Webson, C. Raffel, N. V . Nayak, A. Sharma, T. Kim, M. S. Bari, T. Fevry, et al. Promptsource: An integrated development environment and repository for natural language prompts. arXiv preprint arXiv:2202.01279, 2022

  4. [4]

    Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

  5. [5]

    Y. Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  6. [6]

    E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

  7. [7]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. arXiv preprint arXiv:2304.01373, 2023

  8. [8]

    On the Opportunities and Risks of Foundation Models

    R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  9. [9]

    T. Chen, B. Xu, C. Zhang, and C. Guestrin. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016

  10. [10]

    Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/

  11. [11]

    P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems , 30, 2017

  12. [12]

    H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, E. Li, X. Wang, M. Dehghani, S. Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  13. [13]

    The Case for 4-bit Precision: k-bit Inference Scaling Laws

    T. Dettmers and L. Zettlemoyer. The case for 4-bit precision: k-bit inference scaling laws.arXiv preprint arXiv:2212.09720, 2022

  14. [14]

    LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    T. Dettmers, M. Lewis, Y . Belkada, and L. Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, 2022

  15. [15]

    8-bit Optimizers via Block-wise Quantization

    T. Dettmers, M. Lewis, S. Shleifer, and L. Zettlemoyer. 8-bit optimizers via block-wise quantization. 9th International Conference on Learning Representations, ICLR, 2022

  16. [16]

    A. E. Elo. The proposed uscf rating system. its development, theory, and applications. Chess Life, 22(8):242–247, 1967

  17. [17]

    A. E. Elo. The rating of chessplayers, past and present. Arco Pub., 1978

  18. [18]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022

  19. [19]

    GPTScore: Evaluate as You Desire

    J. Fu, S.-K. Ng, Z. Jiang, and P. Liu. Gptscore: Evaluate as you desire. arXiv preprint arXiv:2302.04166, 2023

  20. [20]

    X. Geng, A. Gudibande, H. Liu, E. Wallace, P. Abbeel, S. Levine, and D. Song. Koala: A dialogue model for academic research. Blog post, April 2023. URLhttps://bair.berkeley. edu/blog/2023/04/03/koala/

  21. [21]

    Improving alignment of dialogue agents via targeted human judgements

    A. Glaese, N. McAleese, M. Trębacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. arXiv preprint arXiv:2209.14375, 2022

  22. [22]

    Annotation Artifacts in Natural Language Inference Data

    S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324, 2018

  23. [23]

    Compacter: Efficient Low-Rank Hypercomplex Adapter Layers

    J. Henderson, S. Ruder, et al. Compacter: Efficient low-rank hypercomplex adapter layers. In Advances in Neural Information Processing Systems, 2021

  24. [24]

    Measuring Massive Multitask Language Understanding

    D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020

  25. [25]

    The Curious Case of Neural Text Degeneration

    A. Holtzman, J. Buys, L. Du, M. Forbes, and Y . Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020

  26. [26]

    Unnatural instructions: Tuning language models with (almost) no human labor

    O. Honovich, T. Scialom, O. Levy, and T. Schick. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689, 2022

  27. [27]

    Parameter-Efficient Transfer Learning for NLP

    N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019

  28. [28]

    E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

  29. [29]

    S. Iyer, X. V . Lin, R. Pasunuru, T. Mihaylov, D. Simig, P. Yu, K. Shuster, T. Wang, Q. Liu, P. S. Koura, et al. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017, 2022

  30. [30]

    Longform: Optimizing Instruction Tuning for Long Text Generation with Corpus Extraction

    A. Köksal, T. Schick, A. Korhonen, and H. Schütze. Longform: Optimizing instruction tuning for long text generation with corpus extraction. arXiv preprint arXiv:2304.08460, 2023

  31. [31]

    A. Köpf, Y . Kilcher, D. von Rütte, S. Anagnostidis, Z.-R. Tam, K. Stevens, A. Barhoum, N. M. Duc, O. Stanley, R. Nagyfi, et al. Openassistant conversations–democratizing large language model alignment. arXiv preprint arXiv:2304.07327, 2023

  32. [32]

    Open-instruction-generalist dataset

    LAION. Open-instruction-generalist dataset. https://github.com/LAION-AI/ Open-Instruction-Generalist, 2023

  33. [33]

    The Power of Scale for Parameter-Efficient Prompt Tuning

    B. Lester, R. Al-Rfou, and N. Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021

  34. [34]

    X. L. Li and P. Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021

  35. [35]

    Holistic Evaluation of Language Models

    P. Liang, R. Bommasani, T. Lee, D. Tsipras, D. Soylu, M. Yasunaga, Y . Zhang, D. Narayanan, Y . Wu, A. Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  36. [36]

    T. Liao, R. Taori, I. D. Raji, and L. Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  37. [37]

    H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, and C. A. Raffel. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022

  38. [38]

    Y . Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V . Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019

  39. [39]

    The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

    S. Longpre, L. Hou, T. Vu, A. Webson, H. W. Chung, Y . Tay, D. Zhou, Q. V . Le, B. Zoph, J. Wei, et al. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688, 2023

  40. [40]

    S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. Metaicl: Learning to learn in context. arXiv preprint arXiv:2110.15943, 2021

  41. [41]

    Evaluating Theory of Mind in Question Answering

    A. Nematzadeh, K. Burns, E. Grant, A. Gopnik, and T. Griffiths. Evaluating theory of mind in question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2392–2400, 2018

  42. [42]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. arXiv, 2023

  43. [43]

    Training Language Models to Follow Instructions with Human Feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  44. [44]

    G. Park, B. Park, S. J. Kwon, B. Kim, Y . Lee, and D. Lee. nuqmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557, 2022

  45. [45]

    B. Peng, C. Li, P. He, M. Galley, and J. Gao. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277, 2023

  46. [46]

    Hypothesis Only Baselines in Natural Language Inference

    A. Poliak, J. Naradowsky, A. Haldar, R. Rudinger, and B. Van Durme. Hypothesis only baselines in natural language inference. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 180–191, 2018

  47. [47]

    R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, A. Levskaya, J. Heek, K. Xiao, S. Agrawal, and J. Dean. Efficiently scaling transformer inference. arXiv preprint arXiv:2211.05102, 2022

  48. [48]

    Learning how to ask: Querying lms with mixtures of soft prompts

    G. Qin and J. Eisner. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021

  49. [49]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), jan 2020. ISSN 1532-4435

  50. [50]

    V . Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, T. L. Scao, A. Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021

  51. [51]

    M. Sap, R. LeBras, D. Fried, and Y . Choi. Neural theory-of-mind? on the limits of social intelligence in large lms. arXiv preprint arXiv:2210.13312, 2022

  52. [52]

    T. L. Scao, A. Fan, C. Akiki, E. Pavlick, S. Ilić, D. Hesslow, R. Castagné, A. S. Luccioni, F. Yvon, M. Gallé, et al. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022

  53. [53]

    An Analysis of Variance Test for Normality

    S. Shaphiro and M. Wilk. An analysis of variance test for normality.Biometrika, 52(3):591–611, 1965

  54. [54]

    Y.-L. Sung, V. Nair, and C. A. Raffel. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021

  55. [55]

    Stanford Alpaca: An Instruction-Following LLaMA Model

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/ stanford_alpaca, 2023

  56. [56]

    LaMDA: Language Models for Dialog Applications

    R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y . Du, et al. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022

  57. [57]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  58. [58]

    A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  59. [59]

    Y . Wang, Y . Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022

  60. [60]

    Y . Wang, S. Mishra, P. Alipoormolabashi, Y . Kordi, A. Mirzaei, A. Arunkumar, A. Ashok, A. S. Dhanasekaran, A. Naik, D. Stap, et al. Super-naturalinstructions:generalization via declarative instructions on 1600+ tasks. In EMNLP, 2022

  61. [61]

    Y . Wang, S. Mishra, P. Alipoormolabashi, Y . Kordi, A. Mirzaei, A. Naik, A. Ashok, A. S. Dhanasekaran, A. Arunkumar, D. Stap, et al. Super-naturalinstructions: Generalization via declarative instructions on 1600+ nlp tasks. InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, 2022

  62. [62]

    J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

  63. [63]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. H. Chi, Q. V . Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, 2022

  64. [64]

    T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

  65. [65]

    Stable and Low-Precision Training for Large-Scale Vision-Language Models

    M. Wortsman, T. Dettmers, L. Zettlemoyer, A. Morcos, A. Farhadi, and L. Schmidt. Stable and low-precision training for large-scale vision-language models. arXiv preprint arXiv:2304.13013, 2023

  66. [66]

    G. Xiao, J. Lin, M. Seznec, J. Demouth, and S. Han. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2022

  67. [67]

    T. Xie, C. H. Wu, P. Shi, R. Zhong, T. Scholak, M. Yasunaga, C.-S. Wu, M. Zhong, P. Yin, S. I. Wang, et al. Unifiedskg: Unifying and multi-tasking structured knowledge grounding with text-to-text language models. arXiv preprint arXiv:2201.05966, 2022

  68. [68]

    Z. Yang, P. Qi, S. Zhang, Y . Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018

  69. [69]

    Z. Yao, R. Y . Aminabadi, M. Zhang, X. Wu, C. Li, and Y . He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. arXiv preprint arXiv:2206.01861, 2022

  70. [70]

    E. B. Zaken, S. Ravfogel, and Y . Goldberg. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021

  71. [71]

    A. Zeng, X. Liu, Z. Du, Z. Wang, H. Lai, M. Ding, Z. Yang, Y. Xu, W. Zheng, X. Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022

  72. [72]

    OPT: Open Pre-trained Transformer Language Models

    S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V . Lin, et al. Opt: Open pre-trained transformer language models.arXiv preprint arXiv:2205.01068, 2022

  73. [73]

    Adapting Language Models for Zero-Shot Learning by Meta-tuning on Dataset and Prompt Collections

    R. Zhong, K. Lee, Z. Zhang, and D. Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. arXiv preprint arXiv:2104.04670, 2021
    R. Zhong, K. Lee, Z. Zhang, and D. Klein. Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. arXiv preprint arXiv:2104.04670, 2021. 21 A QLoRA vs Standard Finetuning Experimental Setup Details A.1 Hyperparameters for QL ORA We do a hyperparameter search for LoRA over the following variables: LoRA dropout { 0....