Recognition: no theorem link
The nextAI Solution to the NeurIPS 2023 LLM Efficiency Challenge
Pith reviewed 2026-05-10 17:49 UTC · model grok-4.3
The pith
A 70-billion-parameter language model can be fine-tuned on a single GPU within 24 hours while achieving high accuracy on QA benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that, through iterative selection of dataset compositions and LoRA configurations, combined with quantization and efficient attention mechanisms, the LLaMa2 70B model can be fine-tuned on a single A100 40GB GPU within 24 hours to deliver high accuracy across a range of QA benchmarks.
What carries the argument
Quantized Low-Rank Adaptation (QLoRA) with Flash Attention 2 applied to LLaMa2 70B, supported by custom dataset iteration.
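A minimal sketch, assuming the standard Hugging Face stack (transformers, peft, bitsandbytes), of how this combination is typically wired together: 4-bit NF4 quantization of LLaMa2-70B, the Flash Attention 2 kernel, and a LoRA adapter over the attention projections. The hyperparameter values below are illustrative placeholders, not the paper's reported settings.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-70b-hf"

# 4-bit NF4 quantization with bfloat16 compute, following the QLoRA recipe.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    attn_implementation="flash_attention_2",  # Flash Attention 2 kernel
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Freeze the quantized base weights and train only a small LoRA adapter.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,                 # placeholder rank; the paper swept several configurations
    lora_alpha=32,        # placeholder scaling
    lora_dropout=0.05,    # placeholder dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```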
If this is right
- Teams without access to large GPU clusters can still customize large language models effectively.
- Resource usage for LLM fine-tuning drops dramatically while preserving task performance.
- Open-source foundation models become more practical for specialized applications.
- The combination of quantization and low-rank adaptation proves sufficient for 70B-scale models under time limits.
Where Pith is reading between the lines
- Similar techniques could extend to other large models or different hardware setups beyond the challenge constraints.
- Success here suggests that careful data curation is as important as the fine-tuning algorithm for efficiency.
- Further tests on diverse tasks might reveal if the efficiency gains hold across domains.
- Adoption could accelerate development of accessible AI tools for smaller organizations.
Load-bearing premise
The custom dataset and configuration choices yield performance that remains robust on the challenge's unseen test data rather than overfitting to the training and public benchmarks.
What would settle it
An evaluation on the NeurIPS challenge's hidden test set showing substantially lower accuracy than reported on the public benchmarks would falsify the claimed robustness of the solution.
Original abstract
The rapid evolution of Large Language Models (LLMs) has significantly impacted the field of natural language processing, but their growing complexity raises concerns about resource usage and transparency. Addressing these challenges, we participated in the NeurIPS LLM Efficiency Challenge, aiming to fine-tune a foundation model within stringent constraints. Our focus was the LLaMa2 70 billion model, optimized on a single A100 40GB GPU within a 24-hour limit. Our methodology hinged on a custom dataset, carefully assembled from diverse open-source resources and benchmark tests, aligned with the challenge's open-source ethos. Our approach leveraged Quantized-Low Rank Adaptation (QLoRA) Fine tuning, integrated with advanced attention mechanisms like Flash Attention 2. We experimented with various configurations of the LoRA technique, optimizing the balance between computational efficiency and model accuracy. Our fine-tuning strategy was underpinned by the creation and iterative testing of multiple dataset compositions, leading to the selection of a version that demonstrated robust performance across diverse tasks and benchmarks. The culmination of our efforts was an efficiently fine-tuned LLaMa2 70B model that operated within the constraints of a single GPU, showcasing not only a significant reduction in resource utilization but also high accuracy across a range of QA benchmarks. Our study serves as a testament to the feasibility of optimizing large-scale models in resource-constrained environments, emphasizing the potential of LLMs in real-world applications.
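For concreteness, here is a minimal sketch of the dataset-composition iteration the abstract describes: mix instruction sources in chosen proportions, fine-tune under the time budget, and keep the mixture that scores best on a held-out suite. The file names, proportions, and the fine_tune_and_score() helper are hypothetical, and the sketch assumes each source has already been converted to a shared instruction/response schema; the paper does not report its exact sources or ratios.

```python
from datasets import load_dataset, concatenate_datasets

def build_mixture(proportions: dict[str, float], total_examples: int, seed: int = 0):
    """Sample each source to its share of the total and concatenate (assumes a shared schema)."""
    parts = []
    for path, share in proportions.items():
        ds = load_dataset("json", data_files=path, split="train").shuffle(seed=seed)
        parts.append(ds.select(range(min(int(share * total_examples), len(ds)))))
    return concatenate_datasets(parts).shuffle(seed=seed)

def fine_tune_and_score(mixture) -> float:
    """Placeholder: run the QLoRA fine-tuning on this mixture within the 24-hour budget
    and return mean accuracy on a held-out QA validation suite."""
    raise NotImplementedError

# Hypothetical candidate compositions (file names and shares are illustrative).
candidates = [
    {"dolly.jsonl": 0.5, "orca_subset.jsonl": 0.5},
    {"dolly.jsonl": 0.3, "orca_subset.jsonl": 0.7},
]
results = [(fine_tune_and_score(build_mixture(c, 20_000)), c) for c in candidates]
best_score, best_composition = max(results, key=lambda t: t[0])
print("selected composition:", best_composition, "score:", best_score)
```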
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports the nextAI team's entry in the NeurIPS 2023 LLM Efficiency Challenge. The authors describe fine-tuning LLaMa2-70B via QLoRA (with Flash Attention 2) on a single A100 40GB GPU under a 24-hour budget, using a custom dataset assembled from open-source sources and iteratively refined after testing multiple compositions. They claim the final model satisfies the challenge constraints while delivering high accuracy on QA benchmarks.
Significance. If the performance claims were substantiated with concrete metrics and hidden-test results, the work would provide a practical demonstration of resource-constrained fine-tuning of 70B-scale models, reinforcing the utility of QLoRA-style methods for accessibility. The contribution is primarily empirical and engineering-oriented rather than theoretical.
major comments (4)
- [Abstract] The claim that the final model achieves 'high accuracy across a range of QA benchmarks' is unsupported by any numerical results, baseline comparisons, error bars, or hidden-test performance; without these data the central empirical claim cannot be evaluated.
- [Dataset and Iterative Selection] Multiple dataset compositions are said to have been tested and one chosen for 'robust performance,' yet no accuracy numbers, ablation tables, or selection criteria from those iterations are reported, leaving open the possibility that the chosen composition overfits visible benchmarks rather than generalizing.
- [Methodology] The LoRA configuration experiments: the text states that 'various configurations of the LoRA technique' were tested, but supplies neither the specific rank/alpha/dropout values tried, nor their corresponding validation accuracies, nor any comparison to the final chosen setting.
- [Results] No quantitative outcomes on the challenge's hidden test set, no comparison to other challenge submissions or standard baselines (e.g., vanilla QLoRA or full fine-tuning), and no ablation on the contribution of Flash Attention 2 are provided, rendering the 'significant reduction in resource utilization' and 'high accuracy' assertions unverifiable.
minor comments (2)
- [Dataset Construction] The description of the custom dataset sources could include explicit licenses and exact proportions of each component to improve reproducibility.
- [Figures] Figure captions and axis labels in any performance plots should be expanded to state the exact metrics and evaluation splits used.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review of our manuscript on the nextAI entry to the NeurIPS 2023 LLM Efficiency Challenge. We appreciate the emphasis on substantiating empirical claims and will revise the paper to address the identified gaps in quantitative reporting while preserving the focus on the practical engineering approach.
Point-by-point responses
-
Referee: [Abstract] The claim that the final model achieves 'high accuracy across a range of QA benchmarks' is unsupported by any numerical results, baseline comparisons, error bars, or hidden-test performance; without these data the central empirical claim cannot be evaluated.
Authors: We agree that the abstract requires concrete supporting data. In the revised manuscript we will insert specific QA benchmark scores achieved by the final model, direct comparisons to standard baselines such as vanilla QLoRA, and any available hidden-test results from the challenge, along with brief mention of variability where relevant. revision: yes
-
Referee: [Dataset and Iterative Selection] Multiple dataset compositions are said to have been tested and one chosen for 'robust performance,' yet no accuracy numbers, ablation tables, or selection criteria from those iterations are reported, leaving open the possibility that the chosen composition overfits visible benchmarks rather than generalizing.
Authors: We will expand the dataset section with a table reporting validation accuracy for each composition tested, the quantitative selection criteria (e.g., average score across held-out tasks), and a short discussion of how we monitored for overfitting to visible benchmarks. This addition will make the iterative process transparent and address the generalization concern. revision: yes
-
Referee: [Methodology] The LoRA configuration experiments: the text states that 'various configurations of the LoRA technique' were tested, but supplies neither the specific rank/alpha/dropout values tried, nor their corresponding validation accuracies, nor any comparison to the final chosen setting.
Authors: We will add an explicit table or paragraph listing the rank, alpha, and dropout values explored, the corresponding validation accuracies obtained, and the rationale for selecting the final configuration. This will allow readers to evaluate the hyper-parameter search directly. revision: yes
-
Referee: [Results] No quantitative outcomes on the challenge's hidden test set, no comparison to other challenge submissions or standard baselines (e.g., vanilla QLoRA or full fine-tuning), and no ablation on the contribution of Flash Attention 2 are provided, rendering the 'significant reduction in resource utilization' and 'high accuracy' assertions unverifiable.
Authors: We acknowledge that the current results section is primarily descriptive. In revision we will report the hidden-test performance numbers, place our submission in context with other challenge entries and standard baselines, and include a brief ablation isolating the effect of Flash Attention 2 on training speed and memory usage. These additions will render the efficiency and accuracy claims verifiable. revision: yes
Circularity Check
No circularity: purely empirical methodology with no derivations or self-referential claims
Full rationale
The paper contains no equations, derivations, or mathematical claims. It describes an empirical workflow: assembling a custom dataset from open-source resources, applying QLoRA fine-tuning with Flash Attention 2 on LLaMa2-70B, experimenting with LoRA configurations, and iteratively selecting a dataset composition based on benchmark performance. All core techniques (QLoRA, Flash Attention) are referenced from prior external publications. Performance claims rest on external QA benchmarks rather than any internal reduction or self-definition. No load-bearing step reduces to its own inputs by construction, and no self-citations form a circular justification chain. This is a standard empirical report whose validity depends on external reproducibility, not internal tautology.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,...
2020
-
[2]
Chen, W., Yin, M., Ku, M., Lu, P., Wan, Y., Ma, X., Xu, J., Wang, X., and Xia, T. (2023a). Theoremqa: A theorem-driven question answering dataset
- [3]
-
[4]
Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., Schuh, P., Shi, K., Tsvyashchenko, S., Maynez, J., Rao, A., Barnes, P., Tay, Y., Shazeer, N., Prabhakaran, V., Reif, E., Du, N., Hutchinson, B., Pope, R., Bradbury, J., Austin, J., Isard, M., Gur-Ari, G., Yin, P., Duke, T., Levs...
2022
-
[5]
Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Castro-Ros, A., Pellat, M., Robinson, K., Valter, D., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Robe...
2022
-
[6]
Conover, M., Hayes, M., Mathur, A., Xie, J., Wan, J., Shah, S., Ghodsi, A., Wendell, P., Zaharia, M., and Xin, R. (2023). Free dolly: Introducing the world’s first truly open instruction-tuned llm
2023
-
[7]
Dao, T. (2023). Flashattention-2: Faster attention with better parallelism and work partitioning
2023
-
[8]
Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems
2022
-
[9]
Dettmers, T., Lewis, M., Belkada, Y., and Zettlemoyer, L. (2022). Llm.int8(): 8-bit matrix multiplication for transformers at scale
2022
-
[10]
Dettmers, T., Pagnoni, A., Holtzman, A., and Zettlemoyer, L. (2023). Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314
2023
- [11]
-
[12]
Fabbri, A. R., Li, I., She, T., Li, S., and Radev, D. R. (2019). Multi-news: a large-scale multi-document summarization dataset and abstractive hierarchical model
2019
- [13]
-
[14]
Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2020). Measuring massive multitask language understanding. CoRR, abs/2009.03300
2020
-
[15]
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., and Chen, W. (2021). Lora: Low-rank adaptation of large language models. CoRR, abs/2106.09685
2021
-
[16]
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., de las Casas, D., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., Lavaud, L. R., Lachaux, M.-A., Stock, P., Scao, T. L., Lavril, T., Wang, T., Lacroix, T., and Sayed, W. E. (2023). Mistral 7b
2023
-
[17]
Lee, A. N., Hunter, C. J., and Ruiz, N. (2023). Platypus: Quick, cheap, and powerful refinement of llms
2023
-
[18]
Lester, B., Al-Rfou, R., and Constant, N. (2021). The power of scale for parameter-efficient prompt tuning. CoRR, abs/2104.08691
2021
-
[19]
Li, X. L. and Liang, P. (2021). Prefix-tuning: Optimizing continuous prompts for generation. CoRR, abs/2101.00190
2021
-
[20]
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, ...
2023
-
[21]
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. (2023). Let's verify step by step
2023
-
[22]
Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. CoRR, abs/2109.07958
2021
-
[23]
Liu, H., Tam, D., Muqeeth, M., Mohta, J., Huang, T., Bansal, M., and Raffel, C. (2022). Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning
2022
- [24]
- [25]
-
[26]
Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., Tafjord, O., Clark, P., and Kalyan, A. (2022). Learn to explain: Multimodal reasoning via thought chains for science question answering
2022
-
[27]
Mangrulkar, S., Gugger, S., Debut, L., Belkada, Y., Paul, S., and Bossan, B. (2022). Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft
2022
-
[28]
mhenrichsen (2023). codegen
2023
-
[29]
Mukherjee, S., Mitra, A., Jawahar, G., Agarwal, S., Palangi, H., and Awadallah, A. (2023). Orca: Progressive learning from complex explanation traces of gpt-4
2023
-
[30]
Ni, J., Xue, F., Jain, K., Shah, M. H., Zheng, Z., and You, Y. (2023). Instruction in the wild: A user-based instruction dataset. https://github.com/XueFuzhao/InstructionWild
2023
-
[31]
OpenAccess-AI-Collective (2023). axolotl. https://github.com/OpenAccess-AI-Collective/axolotl
2023
-
[32]
OpenAI (2023). Gpt-4 technical report
2023
-
[33]
Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. R. (2021). BBQ: A hand-built bias benchmark for question answering. CoRR, abs/2110.08193
-
[34]
Perlitz, Y., Bandel, E., Gera, A., Arviv, O., Ein-Dor, L., Shnarch, E., Slonim, N., Shmueli-Scheuer, M., and Choshen, L. (2023). Efficient benchmarking (of language models)
2023
-
[35]
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. CoRR, abs/1908.10084
2019
-
[36]
Sawada, T., Paleka, D., Havrilla, A., Tadepalli, P., Vidas, P., Kranias, A., Nay, J. J., Gupta, K., and Komatsuzaki, A. (2023). Arb: Advanced reasoning benchmark for large language models
2023
-
[37]
See, A., Liu, P. J., and Manning, C. D. (2017). Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada. Association for Computational Linguistics
2017
-
[38]
skar02 (2023). codegen
2023
-
[39]
Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023). Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca
2023
-
[40]
TigerResearch (2023). tigerbot-kaggle-leetcodesolutions-en-2k
2023
-
[41]
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023a). Llama: Open and efficient foundation language models
-
[42]
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Kha...
-
[43]
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2017). Attention is all you need. CoRR, abs/1706.03762
2017
-
[44]
Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N. A., Khashabi, D., and Hajishirzi, H. (2022). Self-instruct: Aligning language model with self generated instructions
2022
-
[45]
Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., Du, N., Dai, A. M., and Le, Q. V. (2021). Finetuned language models are zero-shot learners. CoRR, abs/2109.01652
2021
-
[46]
Xu, Y., Xie, L., Gu, X., Chen, X., Chang, H., Zhang, H., Chen, Z., Zhang, X., and Tian, Q. (2023). Qa-lora: Quantization-aware low-rank adaptation of large language models
2023
- [47]
-
[48]
Zhang, Z., Zhang, A., Li, M., and Smola, A. (2022). Automatic chain of thought prompting in large language models
2022
-
[49]
Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., Ma, X., Efrat, A., Yu, P., Yu, L., Zhang, S., Ghosh, G., Lewis, M., Zettlemoyer, L., and Levy, O. (2023). Lima: Less is more for alignment
2023
discussion (0)