Benchmarking quantized LLaMa-based models on the Brazilian Secondary School Exam

Cláudio E. C. Campelo; Matheus L. O. Santos

arxiv: 2309.12071 · v1 · submitted 2023-09-21 · 💻 cs.AI · cs.CL


Pith reviewed 2026-05-24 06:15 UTC · model grok-4.3


keywords Large Language Models · Quantization · LLaMA · ENEM · Benchmark · Portuguese · Consumer hardware · Reasoning


The pith

Quantized LLaMA-based models reach 46 percent accuracy on Portuguese ENEM questions and 49 percent on English translations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests quantized Alpaca, Koala, and Vicuna models built on 7B and 13B LLaMA backbones against 1006 questions drawn from Brazil's national secondary school exam. It records accuracy on the original Portuguese wording and on English translations while the models run on ordinary consumer hardware. Processing times are also measured to show what response speed is realistic without specialized equipment. The work therefore supplies a concrete data point on how far current accessible LLMs have progressed toward handling real exam material.

Core claim

The best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.

What carries the argument

A database of 1006 ENEM questions, used to score model accuracy once quantization has made the models runnable on home hardware.

If this is right

Quantized LLaMA models can be executed on standard consumer processors for question-answering tasks.
Translating questions from Portuguese to English produces a modest accuracy gain in these models.
The 13B models require roughly 2.5 times longer to answer than the 7B models under the tested conditions.
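The 2.5x figure follows directly from the two averages quoted in the abstract. The sketch below reproduces that arithmetic and, as a rough extension, estimates how long one pass over the full question set would take. It assumes the 20 s and 50 s averages are per-question times that apply uniformly across all 1006 questions, which the abstract does not state explicitly; the function names are illustrative only.

```python
# Approximate averages quoted in the abstract, in seconds.
# Assumption: these are per-question times.
SECONDS_7B = 20
SECONDS_13B = 50
NUM_QUESTIONS = 1006


def slowdown(slow_seconds: float, fast_seconds: float) -> float:
    """Return how many times longer the slower model takes."""
    return slow_seconds / fast_seconds


def total_hours(seconds_per_question: float, n_questions: int) -> float:
    """Estimate wall-clock hours for one sequential pass over the question set."""
    return seconds_per_question * n_questions / 3600


ratio = slowdown(SECONDS_13B, SECONDS_7B)
hours_7b = total_hours(SECONDS_7B, NUM_QUESTIONS)
hours_13b = total_hours(SECONDS_13B, NUM_QUESTIONS)

print(f"13B / 7B slowdown: {ratio:.1f}x")
print(f"Estimated full pass, 7B: {hours_7b:.1f} h; 13B: {hours_13b:.1f} h")
```

Under these assumptions a single language version of the benchmark is an overnight job for a 13B model on the tested processor, which gives a practical sense of the cost of repeating the evaluation.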

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The reported numbers supply a baseline against which later quantized models can be compared on Portuguese-language educational content.
The accuracy difference between languages points to the value of testing translated versions when evaluating models whose training data is unevenly distributed across languages.
Similar evaluations on other national exams could test whether the same performance pattern holds for different subject distributions and cultural contexts.

Load-bearing premise

The 1006 ENEM questions constitute an unbiased and uncontaminated test of the models' reasoning ability, with no overlap between the evaluation items and any data the models may have encountered during training.

What would settle it

Showing that many of the 1006 ENEM questions already appeared in the training data of the tested models, or that accuracy on a fresh set of comparable exam items falls well below the reported figures.
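One common way to look for such overlap is to compare word n-grams between each question and a sample of candidate training text. The following is a minimal sketch of that idea, not a procedure taken from the paper: the function names, the n-gram length of 8, and the 0.5 threshold are all assumptions chosen for illustration, and a real check would need access to the models' actual training corpora.

```python
from typing import List, Set, Tuple


def word_ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(question: str, corpus_ngrams: Set[Tuple[str, ...]], n: int = 8) -> float:
    """Fraction of the question's n-grams that also occur in the corpus."""
    question_ngrams = word_ngrams(question, n)
    if not question_ngrams:
        return 0.0  # question shorter than n words: nothing to compare
    return len(question_ngrams & corpus_ngrams) / len(question_ngrams)


def flag_contaminated(questions: List[str], corpus_text: str,
                      n: int = 8, threshold: float = 0.5) -> List[str]:
    """Return questions whose n-gram overlap with the corpus meets the threshold.

    The threshold is a hypothetical cut-off, not a value from the paper.
    """
    corpus_ngrams = word_ngrams(corpus_text, n)  # computed once for all questions
    return [q for q in questions
            if overlap_fraction(q, corpus_ngrams, n) >= threshold]
```

Exact n-gram matching only catches near-verbatim copies; paraphrased or translated versions of a question would pass unnoticed, so a clean result from a check like this would be suggestive rather than conclusive.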

Figures

Figures reproduced from arXiv: 2309.12071 by Cláudio E. C. Campelo, Matheus L. O. Santos.

**Figure 2.** No questions were extracted for the years 2010 and … (caption truncated in source)

Original abstract

Although Large Language Models (LLMs) represent a revolution in the way we interact with computers, allowing the construction of complex questions and the ability to reason over a sequence of statements, their use is restricted due to the need for dedicated hardware for execution. In this study, we evaluate the performance of LLMs based on the 7 and 13 billion LLaMA models, subjected to a quantization process and run on home hardware. The models considered were Alpaca, Koala, and Vicuna. To evaluate the effectiveness of these models, we developed a database containing 1,006 questions from the ENEM (Brazilian National Secondary School Exam). Our analysis revealed that the best performing models achieved an accuracy of approximately 46% for the original texts of the Portuguese questions and 49% on their English translations. In addition, we evaluated the computational efficiency of the models by measuring the time required for execution. On average, the 7 and 13 billion LLMs took approximately 20 and 50 seconds, respectively, to process the queries on a machine equipped with an AMD Ryzen 5 3600x processor.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Thin benchmark of existing quantized LLaMA models on ENEM questions that reports numbers but supplies no evaluation protocol.


The main thing to know about this paper is that it runs a few already-available quantized LLaMA-based models on the Brazilian ENEM exam and reports accuracy figures around 46 percent for the original Portuguese questions and 49 percent for English translations, along with some runtime measurements on consumer hardware.

It does one thing reasonably well: it supplies concrete performance numbers for open models on a non-English national exam dataset. The inclusion of both Portuguese and translated versions lets readers see a small difference, and the timing data on a Ryzen processor gives a practical sense of what local inference looks like for 7B and 13B models. For anyone tracking how these models handle real educational content outside English, this adds a data point.

The rest is standard application work. The models are Alpaca, Koala, and Vicuna in their quantized forms, which have been studied before. No new architecture, quantization method, or evaluation technique is introduced. The novelty is limited to the choice of the ENEM questions as the test bed.

The soft spots are significant given what's presented. The abstract gives the headline numbers but says nothing about the prompting strategy, whether examples were provided, how the model outputs were mapped to answer choices, or any steps taken to avoid or detect contamination from training data. There are no details on the number of runs, variance, or statistical significance of the results. Without an evaluation protocol, it's impossible to know if the reported accuracies reflect genuine reasoning or artifacts of the setup. The claim that the 1006 questions form an unbiased test is stated but not supported with evidence in the available text.

This paper would mainly interest researchers or practitioners doing applied evaluations of LLMs on Portuguese-language exams or similar educational benchmarks. It might serve as a quick reference for performance on consumer hardware, but it does not contain enough original insight or rigorous documentation to justify the time of peer reviewers. I would not send it for review.

Referee Report

1 major / 0 minor

Summary. The paper claims to evaluate the performance of quantized LLaMA-based models (Alpaca, Koala, and Vicuna, 7B and 13B) on 1006 questions from the Brazilian ENEM exam. The best performing models are reported to achieve approximately 46% accuracy on the original Portuguese questions and 49% on their English translations, with average execution times of 20 seconds for 7B models and 50 seconds for 13B models on an AMD Ryzen 5 3600x processor.

Significance. If the evaluation is methodologically sound, the results would offer a practical benchmark for quantized open-source LLMs on a culturally specific, non-English high-school exam, highlighting both capabilities and limitations for educational applications in Brazil. The emphasis on consumer hardware accessibility is a positive aspect.

major comments (1)

[Abstract] The abstract states numerical results but supplies no evaluation protocol, prompting details, error bars, or data-exclusion rules; central performance claims cannot be verified from the given text.
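To give a sense of what error bars on the headline figures might look like, the sketch below computes a normal-approximation 95% interval for an accuracy measured on 1006 questions. This is an illustration, not an analysis from the paper: it treats each question as an independent trial and treats the Portuguese and English results as separate samples. Because both results come from the same questions, a paired comparison on per-question outcomes would be more appropriate, but those outcomes are not available from the abstract. The function name is illustrative.

```python
import math
from typing import Tuple


def accuracy_interval(accuracy: float, n_questions: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation confidence interval for a proportion.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    standard_error = math.sqrt(accuracy * (1 - accuracy) / n_questions)
    return accuracy - z * standard_error, accuracy + z * standard_error


# Headline accuracies from the abstract, both on 1006 questions.
pt_low, pt_high = accuracy_interval(0.46, 1006)
en_low, en_high = accuracy_interval(0.49, 1006)

# True when the two intervals share at least one value.
intervals_overlap = pt_high >= en_low and en_high >= pt_low

print(f"Portuguese: [{pt_low:.3f}, {pt_high:.3f}]")
print(f"English:    [{en_low:.3f}, {en_high:.3f}]")
print(f"Intervals overlap: {intervals_overlap}")
```

Under these simplifying assumptions each interval is roughly three percentage points wide on either side, about the same size as the reported gap between the two languages, which is why per-question results would be needed to judge whether the gain from translation is reliable.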

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and constructive comment. We address the single major comment below.


Referee: [Abstract] The abstract states numerical results but supplies no evaluation protocol, prompting details, error bars, or data-exclusion rules; central performance claims cannot be verified from the given text.

Authors: We agree that the provided abstract is too concise to allow verification of the central claims. The full manuscript contains a methods section describing the evaluation (all 1006 questions used with no exclusions, zero-shot multiple-choice prompting on the original Portuguese and English-translated versions, single deterministic runs on the specified hardware). To address the concern directly, we will revise the abstract to include a one-sentence summary of the protocol, the fact that error bars are omitted because inference is deterministic given fixed seeds and hardware, and a note that the complete prompting templates and data-handling rules appear in the methods section.

revision: yes
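For readers unfamiliar with the kind of protocol described in this response, the sketch below shows one plausible shape for zero-shot multiple-choice prompting and scoring. It is not the authors' template: the prompt wording, the function names, and the answer-extraction rule are all assumptions, as is the use of five lettered alternatives (A to E) per question.

```python
import re
from typing import List, Optional

OPTION_LETTERS = "ABCDE"  # assumption: five alternatives per question


def build_prompt(question: str, options: List[str]) -> str:
    """Format a question and its alternatives as a zero-shot prompt (no worked examples)."""
    lines = [question, ""]
    for letter, option in zip(OPTION_LETTERS, options):
        lines.append(f"{letter}) {option}")
    lines.append("")
    lines.append("Answer with a single letter:")  # hypothetical instruction wording
    return "\n".join(lines)


def extract_choice(model_output: str) -> Optional[str]:
    """Return the first standalone capital letter A-E in the output, or None.

    This is a simple heuristic: a sentence starting with the article "A"
    would be misread as choice A, so a real protocol needs a stricter rule.
    """
    match = re.search(r"\b([A-E])\b", model_output)
    return match.group(1) if match else None


def accuracy(predictions: List[Optional[str]], gold: List[str]) -> float:
    """Share of questions whose extracted choice equals the answer key."""
    correct = sum(pred == answer for pred, answer in zip(predictions, gold))
    return correct / len(gold)
```

Counting an unparseable output (`None`) as wrong, as this sketch does, is itself a protocol decision that affects the reported accuracy, which is one reason the referee's request for data-handling rules matters.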

Circularity Check

0 steps flagged

No significant circularity; purely empirical reporting


The available abstract describes a straightforward empirical evaluation: quantized LLaMA-based models (Alpaca, Koala, Vicuna) are run on a fixed set of 1006 ENEM questions, with direct measurement of accuracy (~46% Portuguese, ~49% English) and inference time (~20s/50s). No derivations, equations, fitted parameters renamed as predictions, self-citations, or ansatzes appear. The central claims are observational results against an external benchmark, not reductions to the paper's own inputs by construction. This matches the default case of a self-contained empirical study (score 0-2).

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmarking study; the abstract introduces no mathematical derivations, free parameters, domain axioms, or new postulated entities.



Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2017

[2] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, J. Davison, S. Shleifer, P. von Platen, C. Ma, Y. Jernite, J. Plu, C. Xu, T. Le Scao, S. Gugger, M. Drame, Q. Lhoest, and A. Rush, “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 Conference on Empirical Met...

[3] F. Almeida and G. Xexéo, “Word embeddings: A survey,” 2023

[4] Z. Yao, X. Wu, C. Li, S. Youn, and Y. He, “Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation,” 2023

[5] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023

[6] J. Wei, Y. Tay, R. Bommasani, C. Raffel, B. Zoph, S. Borgeaud, D. Yogatama, M. Bosma, D. Zhou, D. Metzler, E. H. Chi, T. Hashimoto, O. Vinyals, P. Liang, J. Dean, and W. Fedus, “Emergent abilities of large language models,” Transactions on Machine Learning Research, 2022. Survey Certification

[7] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, and E. A. Paul Barham, “Palm: Scaling language modeling with pathways,” 2022

[8] K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, P. Payne, M. Seneviratne, P. Gamble, C. Kelly, N. Scharli, A. Chowdhery, P. Mansfield, B. A. y Arcas, D. Webster, G. S. Corrado, Y. Matias, K. Chou, J. Gottweis, N. Tomasev, Y. Liu, A. Rajkomar, J. Barral, C. Semturs, A. Karthikesalingam,...

[9] I. Team, “Internlm: A multilingual language model with progressively enhanced capabilities,” 2023

[10] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah, “Orca: Progressive learning from complex explanation traces of gpt-4,” 2023

[11] OpenAI, “Gpt-4 technical report,” 2023

[12] D. Nunes, R. Primi, R. Pires, R. Lotufo, and R. Nogueira, “Evaluating gpt-3.5 and gpt-4 models on brazilian university admission exams,” 2023

[13] J. Kim, J. H. Lee, S. Kim, J. Park, K. M. Yoo, S. J. Kwon, and D. Lee, “Memory-efficient fine-tuning of compressed large language models via sub-4-bit integer quantization,” 2023

[14] S. Kim, C. Hooper, A. Gholami, Z. Dong, X. Li, S. Shen, M. W. Mahoney, and K. Keutzer, “Squeezellm: Dense-and-sparse quantization,” 2023

[15] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. ...

[16] M. L. O. dos Santos and C. E. C. Campelo, “Brazilian Secondary School Exam (ENEM) Questions Dataset,” Aug. 2023

[17] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, and D. B. et al., “Llama 2: Open foundation and fine-tuned chat models,” 2023
