Benchmarking quantized LLaMa-based models on the Brazilian Secondary School Exam
Pith reviewed 2026-05-24 06:15 UTC · model grok-4.3
The pith
The best quantized LLaMA-based models reach roughly 46 percent accuracy on Portuguese ENEM questions and 49 percent on their English translations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.
What carries the argument
A database of 1006 ENEM questions is used to score model accuracy after quantization, the step that makes execution on home hardware possible.
If this is right
- Quantized LLaMA models can be executed on standard consumer processors for question-answering tasks.
- Translating questions from Portuguese to English produces a modest accuracy gain in these models.
- The 13B models require roughly 2.5 times longer to answer than the 7B models under the tested conditions.
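The 2.5x figure follows directly from the two reported average latencies. The sketch below reproduces that arithmetic using only numbers quoted from the abstract. The derived question counts and total run times are rounded estimates, not values reported by the paper, and the function names are illustrative.

```python
# Figures quoted in the abstract; everything derived from them is approximate.
N_QUESTIONS = 1006
LATENCY_S = {"7B": 20.0, "13B": 50.0}             # average seconds per query
ACCURACY = {"portuguese": 0.46, "english": 0.49}  # best-performing models


def slowdown(latency, small="7B", large="13B"):
    """Ratio of the larger model's average latency to the smaller one's."""
    return latency[large] / latency[small]


def approx_correct(accuracy, n=N_QUESTIONS):
    """Estimated number of correct answers implied by each accuracy."""
    return {lang: round(acc * n) for lang, acc in accuracy.items()}


def total_hours(latency, n=N_QUESTIONS):
    """Estimated wall-clock hours for one pass over all questions."""
    return {size: secs * n / 3600 for size, secs in latency.items()}
```

If the reported averages apply per question, one pass over the full database would take roughly 5.6 hours for a 7B model and about 14 hours for a 13B model on the stated processor.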
Where Pith is reading between the lines
- The reported numbers supply a baseline against which later quantized models can be compared on Portuguese-language educational content.
- The accuracy difference between languages points to the value of testing translated versions when evaluating models whose training data is unevenly distributed across languages.
- Similar evaluations on other national exams could test whether the same performance pattern holds for different subject distributions and cultural contexts.
Load-bearing premise
The 1006 ENEM questions constitute an unbiased and uncontaminated test of the models' reasoning ability, with no overlap between the evaluation items and any data the models may have encountered during training.
What would settle it
Showing that many of the 1006 ENEM questions already appeared in the training data of the tested models, or that accuracy on a fresh set of comparable exam items falls well below the reported figures.
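One common way to probe for this kind of overlap is an n-gram match between each question and a sample of candidate training text. The sketch below is a hypothetical illustration, not a procedure described in the paper: `corpus_docs` stands in for whatever training-data sample is available, and the n-gram length and threshold are arbitrary assumptions.

```python
import re


def ngrams(text, n=8):
    """Set of lowercase word n-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question, corpus_ngrams, n=8):
    """Share of the question's n-grams that also occur in the corpus."""
    q = ngrams(question, n)
    if not q:
        return 0.0  # question shorter than n words: nothing to compare
    return len(q & corpus_ngrams) / len(q)


def flag_contaminated(questions, corpus_docs, n=8, threshold=0.5):
    """Indices of questions whose n-gram overlap meets the threshold.

    `threshold` and `n` are illustrative defaults, not calibrated values.
    """
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [
        i for i, q in enumerate(questions)
        if overlap_fraction(q, corpus_ngrams, n) >= threshold
    ]
```

Exact n-gram matching misses paraphrased or translated copies, so a low overlap score would not by itself rule out contamination.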
Original abstract
Although Large Language Models (LLMs) represent a revolution in the way we interact with computers, allowing the construction of complex questions and the ability to reason over a sequence of statements, their use is restricted due to the need for dedicated hardware for execution. In this study, we evaluate the performance of LLMs based on the 7 and 13 billion LLaMA models, subjected to a quantization process and run on home hardware. The models considered were Alpaca, Koala, and Vicuna. To evaluate the effectiveness of these models, we developed a database containing 1,006 questions from the ENEM (Brazilian National Secondary School Exam). Our analysis revealed that the best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. In addition, we evaluated the computational efficiency of the models by measuring the time required for execution. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to evaluate the performance of quantized LLaMA-based models (Alpaca, Koala, and Vicuna, 7B and 13B) on 1006 questions from the Brazilian ENEM exam. The best performing models are reported to achieve approximately 46% accuracy on the original Portuguese questions and 49% on their English translations, with average execution times of 20 seconds for 7B models and 50 seconds for 13B models on an AMD Ryzen 5 3600x processor.
Significance. If the evaluation is methodologically sound, the results would offer a practical benchmark for quantized open-source LLMs on a culturally specific, non-English high-school exam, highlighting both capabilities and limitations for educational applications in Brazil. The emphasis on consumer hardware accessibility is a positive aspect.
major comments (1)
- [Abstract] The abstract states numerical results but supplies no evaluation protocol, prompting details, error bars, or data-exclusion rules; central performance claims cannot be verified from the given text.
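To give a sense of scale for the missing error bars, the sketch below computes a normal-approximation (Wald) 95% interval for an accuracy measured on 1006 questions. It treats the questions as independent draws, which is an assumption made for illustration and not a claim of the paper; the interval reflects sampling uncertainty over questions only.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation confidence interval for a proportion.

    p_hat: observed accuracy in [0, 1]
    n:     number of questions
    z:     critical value (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


# Accuracies and question count as quoted in the abstract.
pt_low, pt_high = wald_interval(0.46, 1006)
en_low, en_high = wald_interval(0.49, 1006)
```

Under these assumptions each interval spans about three percentage points on either side, so the Portuguese and English intervals overlap. Because both conditions use the same questions, a paired comparison such as McNemar's test on per-question outcomes would be more informative, but those outcomes are not available in the abstract.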
Simulated Author's Rebuttal
We thank the referee for their review and constructive comment. We address the single major comment below.
- Referee: [Abstract] The abstract states numerical results but supplies no evaluation protocol, prompting details, error bars, or data-exclusion rules; central performance claims cannot be verified from the given text.
  Authors: We agree that the provided abstract is too concise to allow verification of the central claims. The full manuscript contains a methods section describing the evaluation (all 1006 questions used with no exclusions, zero-shot multiple-choice prompting on the original Portuguese and English-translated versions, single deterministic runs on the specified hardware). To address the concern directly, we will revise the abstract to include a one-sentence summary of the protocol, the fact that error bars are omitted because inference is deterministic given fixed seeds and hardware, and a note that the complete prompting templates and data-handling rules appear in the methods section.
  Revision: yes
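A zero-shot multiple-choice protocol of the kind the response describes could look roughly like the sketch below. The prompt template, the answer-parsing rule, and the five-letter option set are all assumptions made for illustration; the actual templates are said to appear in the paper's methods section and are not reproduced here.

```python
import re

LETTERS = "ABCDE"  # assumption: five alternatives per question


def build_prompt(question, options):
    """Format one question as a zero-shot multiple-choice prompt.

    The wording of this template is hypothetical.
    """
    lines = [question.strip(), ""]
    for letter, option in zip(LETTERS, options):
        lines.append(f"{letter}) {option}")
    lines += ["", "Answer with a single letter (A-E).", "Answer:"]
    return "\n".join(lines)


def extract_choice(completion):
    """Return the leading option letter of a completion, or None.

    Only a letter at the very start counts, so a sentence that begins
    with the Portuguese article "A" is not mistaken for option A.
    """
    match = re.match(r"\s*\(?([A-E])(?:[).:]|\s*$)", completion)
    return match.group(1) if match else None


def accuracy(predictions, gold):
    """Fraction of predictions equal to the gold letter.

    Unparseable completions (None) are counted as wrong.
    """
    if not gold:
        raise ValueError("gold answers must not be empty")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)
```

Counting unparseable completions as wrong is one possible data-handling rule; excluding them instead would change the denominator and therefore the reported accuracy, which is why the referee's request for explicit rules matters.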
Circularity Check
No significant circularity; purely empirical reporting
The available abstract describes a straightforward empirical evaluation: quantized LLaMA-based models (Alpaca, Koala, Vicuna) are run on a fixed set of 1006 ENEM questions, with direct measurement of accuracy (~46% Portuguese, ~49% English) and inference time (~20s/50s). No derivations, equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear. The central claims are observational results against an external benchmark, not reductions to the paper's own inputs by construction. This matches the default case of a self-contained empirical study (score 0-2).