pith. sign in

arxiv: 2309.12071 · v1 · submitted 2023-09-21 · 💻 cs.AI · cs.CL

Benchmarking quantized LLaMa-based models on the Brazilian Secondary School Exam

Pith reviewed 2026-05-24 06:15 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords Large Language ModelsQuantizationLLaMAENEMBenchmarkPortugueseConsumer hardwareReasoning
0
0 comments X

The pith

Quantized LLaMA-based models reach 46 percent accuracy on Portuguese ENEM questions and 49 percent on English translations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests quantized Alpaca, Koala, and Vicuna models built on 7B and 13B LLaMA backbones against 1006 questions drawn from Brazil's national secondary school exam. It records accuracy on the original Portuguese wording and on English translations while the models run on ordinary consumer hardware. Processing times are also measured to show what response speed is realistic without specialized equipment. The work therefore supplies a concrete data point on how far current accessible LLMs have progressed toward handling real exam material.

Core claim

The best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.

What carries the argument

A database of 1006 ENEM questions used to score model accuracy after quantization enables execution on home hardware.

If this is right

  • Quantized LLaMA models can be executed on standard consumer processors for question-answering tasks.
  • Translating questions from Portuguese to English produces a modest accuracy gain in these models.
  • The 13B models require roughly 2.5 times longer to answer than the 7B models under the tested conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reported numbers supply a baseline against which later quantized models can be compared on Portuguese-language educational content.
  • The accuracy difference between languages points to the value of testing translated versions when evaluating models whose training data is unevenly distributed across languages.
  • Similar evaluations on other national exams could test whether the same performance pattern holds for different subject distributions and cultural contexts.

Load-bearing premise

The 1006 ENEM questions constitute an unbiased and uncontaminated test of the models' reasoning ability, with no overlap between the evaluation items and any data the models may have encountered during training.

What would settle it

Showing that many of the 1006 ENEM questions already appeared in the training data of the tested models, or that accuracy on a fresh set of comparable exam items falls well below the reported figures.

Figures

Figures reproduced from arXiv: 2309.12071 by Cl\'audio E. C. Campelo, Matheus L. O. Santos.

Figure 1
Figure 1. Figure 1: Performance degradation of quantized models. Chart available at: [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: No questions were extracted for the years 2010 and [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
read the original abstract

Although Large Language Models (LLMs) represent a revolution in the way we interact with computers, allowing the construction of complex questions and the ability to reason over a sequence of statements, their use is restricted due to the need for dedicated hardware for execution. In this study, we evaluate the performance of LLMs based on the 7 and 13 billion LLaMA models, subjected to a quantization process and run on home hardware. The models considered were Alpaca, Koala, and Vicuna. To evaluate the effectiveness of these models, we developed a database containing 1,006 questions from the ENEM (Brazilian National Secondary School Exam). Our analysis revealed that the best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. In addition, we evaluated the computational efficiency of the models by measuring the time required for execution. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper claims to evaluate the performance of quantized LLaMA-based models (Alpaca, Koala, and Vicuna, 7B and 13B) on 1006 questions from the Brazilian ENEM exam. The best performing models are reported to achieve approximately 46% accuracy on the original Portuguese questions and 49% on their English translations, with average execution times of 20 seconds for 7B models and 50 seconds for 13B models on an AMD Ryzen 5 3600x processor.

Significance. If the evaluation is methodologically sound, the results would offer a practical benchmark for quantized open-source LLMs on a culturally specific, non-English high-school exam, highlighting both capabilities and limitations for educational applications in Brazil. The emphasis on consumer hardware accessibility is a positive aspect.

major comments (1)
  1. [Abstract] The abstract states numerical results but supplies no evaluation protocol, prompting details, error bars, or data-exclusion rules; central performance claims cannot be verified from the given text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and constructive comment. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] The abstract states numerical results but supplies no evaluation protocol, prompting details, error bars, or data-exclusion rules; central performance claims cannot be verified from the given text.

    Authors: We agree that the provided abstract is too concise to allow verification of the central claims. The full manuscript contains a methods section describing the evaluation (all 1006 questions used with no exclusions, zero-shot multiple-choice prompting on the original Portuguese and English-translated versions, single deterministic runs on the specified hardware). To address the concern directly, we will revise the abstract to include a one-sentence summary of the protocol, the fact that error bars are omitted because inference is deterministic given fixed seeds and hardware, and a note that the complete prompting templates and data-handling rules appear in the methods section. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical reporting

full rationale

The available abstract describes a straightforward empirical evaluation: quantized LLaMA-based models (Alpaca, Koala, Vicuna) are run on a fixed set of 1006 ENEM questions, with direct measurement of accuracy (~46% Portuguese, ~49% English) and inference time (~20s/50s). No derivations, equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear. The central claims are observational results against an external benchmark, not reductions to the paper's own inputs by construction. This matches the default case of a self-contained empirical study (score 0-2).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study; the abstract introduces no mathematical derivations, free parameters, domain axioms, or new postulated entities.

pith-pipeline@v0.9.0 · 5707 in / 1156 out tokens · 26109 ms · 2026-05-24T06:15:50.663219+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

  1. [1]

    Attention is all you need,

    A. V aswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017

  2. [2]

    Transformers: State-of-the-art natural language processing,

    T. Wolf, L. Debut, V . Sanh, J. Chaumond, C. Delangue, A. Moi, P . Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P . von Platen, C. Ma, Y . Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Met...

  3. [3]

    Word embeddings: A survey,

    F. Almeida and G. Xex ´eo, “Word embeddings: A survey,” 2023

  4. [4]

    Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation,

    Z. Y ao, X. Wu, C. Li, S. Y oun, and Y . He, “Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation,” 2023

  5. [5]

    Llama: Open and efficient foundation language models,

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023

  6. [6]

    Emergent abilities of large language models,

    J. Wei, Y . Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Y ogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P . Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Transactions on Machine Learning Research , 2022. Survey Certification

  7. [7]

    Palm: Scaling language modeling with path- ways,

    A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, and E. A. Paul Barham, “Palm: Scaling language modeling with path- ways,” 2022

  8. [8]

    Large language models encode clinical knowledge,

    K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P . Payne, M. Senevi- ratne, P . Gamble, C. Kelly, N. Scharli, A. Chowdhery, P . Mansfield, B. A. y Arcas, D. Webster, G. S. Corrado, Y . Matias, K. Chou, J. Gottweis, N. Tomasev, Y . Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalingam,...

  9. [9]

    Internlm: A multilingual language model with progressively enhanced capabilities,

    I. Team, “Internlm: A multilingual language model with progressively enhanced capabilities,” 2023

  10. [10]

    Orca: Progressive learning from complex explanation traces of gpt-4,

    S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah, “Orca: Progressive learning from complex explanation traces of gpt-4,” 2023

  11. [11]

    Gpt-4 technical report,

    OpenAI, “Gpt-4 technical report,” 2023

  12. [12]

    Evaluating gpt-3.5 and gpt-4 models on brazilian university admission exams,

    D. Nunes, R. Primi, R. Pires, R. Lotufo, and R. Nogueira, “Evaluating gpt-3.5 and gpt-4 models on brazilian university admission exams,” 2023

  13. [13]

    Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization,

    J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Y oo, S. J. Kwon, and D. Lee, “Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization,” 2023

  14. [14]

    Squeezellm: Dense-and-sparse quantization,

    S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, “Squeezellm: Dense-and-sparse quantization,” 2023

  15. [15]

    Language models are few-shot learners,

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P . Dhariwal, A. Neelakantan, P . Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert- V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. ...

  16. [16]

    Brazilian Secondary School Exam (ENEM) Questions Dataset,

    M. L. O. dos Santos and C. E. C. Campelo, “Brazilian Secondary School Exam (ENEM) Questions Dataset,” Aug. 2023

  17. [17]

    Llama 2: Open foundation and fine-tuned chat models,

    H. Touvron, L. Martin, K. Stone, P . Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P . Bhargava, S. Bhosale, and D. B. et. al.s, “Llama 2: Open foundation and fine-tuned chat models,” 2023