CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Ali Hamdi; Khaled Shaban; Mohamed Ehab

arxiv: 2604.07583 · v2 · submitted 2026-04-08 · 💻 cs.CL · cs.LG

CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Mohamed Ehab , Ali Hamdi , Khaled Shaban This is my paper

Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords class imbalanceensemble learninglanguage modelsmacro F1 scoreminority class optimizationtext classificationimbalanced data

0 comments

The pith

CAMO ensemble raises strict macro F1 on imbalanced language datasets by dynamically favoring minority classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CAMO, an ensemble technique designed to handle class imbalance in text categorization tasks where standard methods underperform on minority classes. It uses a hierarchical process involving vote distributions, confidence calibration, and inter-model uncertainty to boost predictions for underrepresented classes. Tested on two highly unbalanced, domain-specific benchmarks with eight language models in zero-shot and fine-tuned modes, CAMO achieves the highest strict macro F1 scores when models are refined. This matters because real-world data often has imbalances, and better minority performance can improve reliability in applications like emotion detection or bias evaluation.

Core claim

CAMO consistently earns the greatest strict macro F1-score on refined models across the benchmarks, proving it a reliable, domain-neutral framework for unbalanced categorization that works in concert with model adaptation.

What carries the argument

The hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost underrepresented classes while preserving minority forecasts.

If this is right

CAMO's advantage increases when used with fine-tuned models rather than zero-shot.
The optimal ensemble method varies based on the properties of the underlying language models.
It establishes a new benchmark for strict macro F1 on highly unbalanced domain-specific text datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could be adapted to other imbalanced classification problems beyond text, such as image or tabular data.
It suggests that ensemble design should account for model type and training stage to maximize gains on minorities.
Future tests on more diverse datasets might reveal if the method generalizes without additional tuning.

Load-bearing premise

The hierarchical procedure using vote distributions, confidence calibration, and inter-model uncertainty will reliably boost underrepresented classes across different domains and model types without introducing new biases or requiring extensive tuning.

What would settle it

Evaluating CAMO on a third highly imbalanced dataset from a new domain, using a different set of language models, and checking whether it still achieves the highest strict macro F1 compared to the seven other ensembles.

read the original abstract

Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized).Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings .With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties .This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CAMO is a practical hierarchical ensemble tweak for minority classes in imbalanced LM classification that delivers consistent macro-F1 gains on the two tested benchmarks without major internal flaws.

read the letter

The paper's core contribution is CAMO, a new ensemble method that stacks vote distributions, confidence calibration, and inter-model uncertainty in a hierarchical way to lift performance on underrepresented classes. It evaluates this on two highly imbalanced domain datasets using eight models under both zero-shot and fine-tuned conditions, and reports that CAMO tops the strict macro F1 against seven prior ensemble baselines when models are refined. The authors also note that the gains align with model adaptation rather than replacing it, which keeps the claim grounded. That combination of explicit steps and scoped results is the useful part here. The experiments are straightforward and cover a reasonable spread of model sizes and regimes, so a practitioner facing similar imbalance problems can see exactly how the method is applied and what the numbers look like. The full text supplies the implementation details that were missing from the abstract, and the reported improvements hold up within the stated scope without obvious contradictions or missing controls. The soft spots are modest. The approach is an incremental extension of existing calibration and uncertainty ideas rather than a conceptual leap, so the performance edge is real but not transformative. The paper does not claim universal superiority and ties benefits to the tested setups, which is honest, but wider testing on additional domains or larger-scale imbalance ratios would make the reliability clearer. No evidence of circularity or unfalsifiable claims appears. This work is aimed at researchers and engineers who handle imbalanced text classification with language models and need a concrete ensemble option with macro-F1 focus. A reader who wants empirical comparisons on real benchmarks will get direct value. It has enough concrete experiments and clear scoping to deserve a serious referee rather than a desk rejection.

Referee Report

1 major / 3 minor

Summary. The paper presents CAMO, a novel ensemble technique for imbalanced classification in language models. It employs a hierarchical procedure that integrates vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost performance on minority classes while maintaining overall accuracy. Evaluations are conducted on two imbalanced datasets (DIAR-AI/Emotion and BEA 2025) using eight language models (three LLMs, five SLMs) and seven baselines in both zero-shot and fine-tuned settings, with claims that CAMO achieves the highest strict macro F1 scores when models are refined, and that its benefits align with model adaptation.

Significance. Should the empirical results hold under further scrutiny, CAMO represents a valuable contribution to handling class imbalance in NLP tasks, offering a domain-neutral approach that complements model fine-tuning. The comprehensive benchmarking across multiple models, settings, and baselines, along with the observation that ensemble performance interacts with model properties, provides actionable insights for practitioners. The method's design avoids introducing free parameters, enhancing its practicality.

major comments (1)

[Results section (e.g., Table 2 and Table 3)] The results tables report higher macro F1 scores for CAMO than the seven baselines across the eight models, but lack error bars, standard deviations across runs, or statistical significance tests (such as McNemar's test or Wilcoxon signed-rank tests) to support the claim of consistent superiority; this weakens the central empirical assertion in the absence of variance quantification.

minor comments (3)

[§3] The hierarchical procedure in §3 would benefit from pseudocode or an explicit algorithm listing to improve reproducibility of the vote aggregation, calibration, and uncertainty weighting steps.
[Evaluation Metrics subsection] The term 'strict macro F1-score' is introduced in the abstract and results without a formal definition or distinction from standard macro F1; this should be clarified in the evaluation metrics subsection.
[Related Work and Experimental Setup] Some of the seven baseline ensemble algorithms are referenced only by name; adding the original citations would strengthen the related work and experimental setup sections.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the minor revision recommendation. We address the concern about variance and statistical support in the results below.

read point-by-point responses

Referee: [Results section (e.g., Table 2 and Table 3)] The results tables report higher macro F1 scores for CAMO than the seven baselines across the eight models, but lack error bars, standard deviations across runs, or statistical significance tests (such as McNemar's test or Wilcoxon signed-rank tests) to support the claim of consistent superiority; this weakens the central empirical assertion in the absence of variance quantification.

Authors: We agree that quantifying variance and providing statistical tests would strengthen the empirical claims. In the revised manuscript we will add standard deviations computed over multiple random seeds for the fine-tuned SLM experiments (where computational cost permits) and include Wilcoxon signed-rank tests comparing CAMO against each baseline across the eight models and two datasets. For the zero-shot LLM results, which are largely deterministic given fixed prompts, we will explicitly note the single-run nature and discuss this limitation. These additions will appear in the updated Tables 2 and 3 and the accompanying text. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes CAMO as a procedural ensemble method that applies vote distributions, confidence calibration, and inter-model uncertainty hierarchically to boost minority classes on imbalanced NLP tasks. This is presented as an algorithmic recipe with explicit steps, empirically validated on two fixed datasets against seven baselines using eight models in zero-shot and fine-tuned regimes. No equations, parameter fits, or derivations are shown that reduce by construction to the inputs themselves, and no load-bearing self-citations or uniqueness theorems are invoked to justify the core procedure. Performance claims rest on reported macro-F1 results rather than tautological redefinitions, rendering the contribution self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method is described at a high level without mathematical formulation or listed assumptions.

pith-pipeline@v0.9.0 · 5475 in / 1003 out tokens · 44038 ms · 2026-05-10T17:23:31.557308+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

[1]

write newline

" write newline " cite write " FUNCTION editor.postfix editor num.names #1 > "( )" "( )" if FUNCTION editor.trans.postfix editor num.names #1 > "( )" "( )" if FUNCTION trans.postfix translator num.names #1 > "( )" "( )" if FUNCTION authors.editors.reflist.apa5 'field := 'dot := field num.names 'numnames := numnames 'format.num.names := format.num.names na...

work page
[2]

sn-aps.bst

FUNCTION identify.aps.version "sn-aps.bst" " [2024/07/19 v1.1 APS bibliography style]" * top ENTRY address author booktitle chapter doi edition editor eid howpublished institution journal key keywords month note number organization pages publisher school series title type url volume year eprint archive archivePrefix primaryClass adsurl adsnote version lab...

work page 2024
[3]

write newline

" write newline "" before.all 'output.state := FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION n.separate 't := "" #0 'numnames := t empty not t #-1 #1 subs...

work page
[4]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin M, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Microsoft; 2024

work page 2024
[5]

Almazrouei E, Alobeidli H, Alshamsi A, Cappelli A, Cojocaru R, Debbah M, et al.: The Falcon Series of Open Language Models; 2023

work page 2023
[6]

Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, et al.: Training Verifiers to Solve Math Word Problems; 2021

work page 2021
[7]

Daheim N, Macina J, Kapur M, Gurevych I, Sachan M.: Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors; 2024

work page 2024
[8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI , Guo D, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature. 2025;645:633--638

work page 2025
[9]

Demszky D, Liu J, Mancenido Z, Cohen J, Hill H, Jurafsky D, et al.: Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions; 2021

work page 2021
[10]

Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L.: QLoRA: Efficient Finetuning of Quantized LLMs; 2023

work page 2023
[11]

Grattafiori A, et al.: The Llama 3 Herd of Models; 2024

work page 2024
[12]

Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al.: Measuring Massive Multitask Language Understanding; 2021

work page 2021
[13]

Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, et al.: Measuring Mathematical Problem Solving With the MATH Dataset; 2021

work page 2021
[14]

Henrichsen M, Krebs R.: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning; 2025

work page 2025
[15]

MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors

Hikal B, Basem M, Oshallah I, Hamdi A. MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BE...

work page 2025
[16]

Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al.: LoRA: Low-Rank Adaptation of Large Language Models; 2021

work page 2021
[17]

Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D, et al.: Mistral 7B; 2023

work page 2023
[18]

A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs

Kermani A, Perez-Rosas V, Metsis V. A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG . In: Zirikly A, Yates A, Desmet B, Ireland M, Bedrick S, MacAvaney S, et al., editors. Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025) Albuquerque, N...

work page 2025
[19]

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI -powered Tutors

Kochmar E, Maurya K, Petukhova K, Srivatsa KA, Tack A, Vasselli J. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI -powered Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2...

work page 2025
[20]

Lakshminarayanan B, Pritzel A, Blundell C.: Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles; 2017

work page 2017
[21]

AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring

Nye B, Graesser A, Hu X. AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring. International Journal of Artificial Intelligence in Education. 2014;24. doi:10.1007/s40593-014-0029-5

work page doi:10.1007/s40593-014-0029-5 2014
[22]

bea-jh at BEA 2025 Shared Task: Evaluating AI -powered Tutors through Pedagogically-Informed Reasoning

Roh J, Bang J. bea-jh at BEA 2025 Shared Task: Evaluating AI -powered Tutors through Pedagogically-Informed Reasoning. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) Vienna, Austria: Association for Comp...

work page 2025
[23]

NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors

Saha T, Ganguli S, Desarkar MS. NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) Vienna, Austria: Association for Computational...

work page 2025
[24]

Self-Explaining Emotion Classification through Preference-Aligned Large Language Models

Siddiqui MHF, Inkpen D, Gelbukh A. Self-Explaining Emotion Classification through Preference-Aligned Large Language Models. In: CS & IT Conference Proceedings, vol. 15 CS & IT Conference Proceedings; 2025

work page 2025
[25]

Tack A, Piech C.: The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues; 2022

work page 2022
[26]

Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al.: LLaMA: Open and Efficient Foundation Language Models; 2023

work page 2023
[27]

Touvron H, et al.: Llama 2: Open Foundation and Fine-Tuned Chat Models; 2023

work page 2023
[28]

Wang RE, Zhang Q, Robinson C, Loeb S, Demszky D.: Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes; 2024

work page 2024
[29]

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

Xu L, Xie H, Qin SJ, Tao X, Wang FL. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2026;p. 1--20. doi:10.1109/TPAMI.2026.3657354

work page doi:10.1109/tpami.2026.3657354 2026
[30]

Qwen3 Technical Report

Yang A, et al. Qwen3 Technical Report. arXiv. 2025;abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

sn-basic.bst

FUNCTION identify.basic.version "sn-basic.bst" " [2024/07/19 v1.1 bibliography style]" * top ENTRY address archive author booktitle chapter doi edition editor eid eprint howpublished institution journal key keywords month note number organization pages publisher school series title type url volume year archivePrefix primaryClass adsurl adsnote version lab...

work page 2024
[32]

write newline

" write newline "" before.all 'output.state := FUNCTION add.period duplicate empty 'skip "." * add.blank if FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION ...

work page
[33]

write newline

" write newline "" before.all 'output.state := FUNCTION output.doi doi empty skip "doi:" doi * "" * output if FUNCTION format.archive archivePrefix empty "" archivePrefix ":" * if FUNCTION format.primaryClass primaryClass empty "" " [" primaryClass * "] " * if FUNCTION format.eprint eprint empty "" archive empty " https://arxiv.org/abs/" eprint * " " * " ...

work page
[34]

write newline

" write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

work page
[35]

write newline

" write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

work page
[36]

sn-nature.bst

FUNCTION identify.nature.version "sn-nature.bst" " [2024/07/19 v1.1 bibliography style]" * top ENTRY address archive author booktitle chapter edition editor eprint howpublished institution journal key keywords month note number organization pages publisher school series title type url doi volume year archivePrefix primaryClass eid adsurl adsnote version l...

work page 2024
[37]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page
[38]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

work page
[39]

sn-vancouver-num.bst

FUNCTION identify.vancouver.version "sn-vancouver-num.bst" " [2024/07/19 v1.1 Vancouver bibliography style]" * top ENTRY address assignee author booktitle chapter cartographer day edition editor howpublished institution inventor journal key keywords month note number organization pages part publisher school series title type volume word year eprint doi ur...

work page 2024
[40]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

work page

[1] [1]

write newline

" write newline " cite write " FUNCTION editor.postfix editor num.names #1 > "( )" "( )" if FUNCTION editor.trans.postfix editor num.names #1 > "( )" "( )" if FUNCTION trans.postfix translator num.names #1 > "( )" "( )" if FUNCTION authors.editors.reflist.apa5 'field := 'dot := field num.names 'numnames := numnames 'format.num.names := format.num.names na...

work page

[2] [2]

sn-aps.bst

FUNCTION identify.aps.version "sn-aps.bst" " [2024/07/19 v1.1 APS bibliography style]" * top ENTRY address author booktitle chapter doi edition editor eid howpublished institution journal key keywords month note number organization pages publisher school series title type url volume year eprint archive archivePrefix primaryClass adsurl adsnote version lab...

work page 2024

[3] [3]

write newline

" write newline "" before.all 'output.state := FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION n.separate 't := "" #0 'numnames := t empty not t #-1 #1 subs...

work page

[4] [4]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin M, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Microsoft; 2024

work page 2024

[5] [5]

Almazrouei E, Alobeidli H, Alshamsi A, Cappelli A, Cojocaru R, Debbah M, et al.: The Falcon Series of Open Language Models; 2023

work page 2023

[6] [6]

Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, et al.: Training Verifiers to Solve Math Word Problems; 2021

work page 2021

[7] [7]

Daheim N, Macina J, Kapur M, Gurevych I, Sachan M.: Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors; 2024

work page 2024

[8] [8]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI , Guo D, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature. 2025;645:633--638

work page 2025

[9] [9]

Demszky D, Liu J, Mancenido Z, Cohen J, Hill H, Jurafsky D, et al.: Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions; 2021

work page 2021

[10] [10]

Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L.: QLoRA: Efficient Finetuning of Quantized LLMs; 2023

work page 2023

[11] [11]

Grattafiori A, et al.: The Llama 3 Herd of Models; 2024

work page 2024

[12] [12]

Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al.: Measuring Massive Multitask Language Understanding; 2021

work page 2021

[13] [13]

Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, et al.: Measuring Mathematical Problem Solving With the MATH Dataset; 2021

work page 2021

[14] [14]

Henrichsen M, Krebs R.: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning; 2025

work page 2025

[15] [15]

MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors

Hikal B, Basem M, Oshallah I, Hamdi A. MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BE...

work page 2025

[16] [16]

Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al.: LoRA: Low-Rank Adaptation of Large Language Models; 2021

work page 2021

[17] [17]

Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D, et al.: Mistral 7B; 2023

work page 2023

[18] [18]

A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs

Kermani A, Perez-Rosas V, Metsis V. A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG . In: Zirikly A, Yates A, Desmet B, Ireland M, Bedrick S, MacAvaney S, et al., editors. Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025) Albuquerque, N...

work page 2025

[19] [19]

Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI -powered Tutors

Kochmar E, Maurya K, Petukhova K, Srivatsa KA, Tack A, Vasselli J. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI -powered Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2...

work page 2025

[20] [20]

Lakshminarayanan B, Pritzel A, Blundell C.: Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles; 2017

work page 2017

[21] [21]

AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring

Nye B, Graesser A, Hu X. AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring. International Journal of Artificial Intelligence in Education. 2014;24. doi:10.1007/s40593-014-0029-5

work page doi:10.1007/s40593-014-0029-5 2014

[22] [22]

bea-jh at BEA 2025 Shared Task: Evaluating AI -powered Tutors through Pedagogically-Informed Reasoning

Roh J, Bang J. bea-jh at BEA 2025 Shared Task: Evaluating AI -powered Tutors through Pedagogically-Informed Reasoning. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) Vienna, Austria: Association for Comp...

work page 2025

[23] [23]

NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors

Saha T, Ganguli S, Desarkar MS. NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) Vienna, Austria: Association for Computational...

work page 2025

[24] [24]

Self-Explaining Emotion Classification through Preference-Aligned Large Language Models

Siddiqui MHF, Inkpen D, Gelbukh A. Self-Explaining Emotion Classification through Preference-Aligned Large Language Models. In: CS & IT Conference Proceedings, vol. 15 CS & IT Conference Proceedings; 2025

work page 2025

[25] [25]

Tack A, Piech C.: The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues; 2022

work page 2022

[26] [26]

Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al.: LLaMA: Open and Efficient Foundation Language Models; 2023

work page 2023

[27] [27]

Touvron H, et al.: Llama 2: Open Foundation and Fine-Tuned Chat Models; 2023

work page 2023

[28] [28]

Wang RE, Zhang Q, Robinson C, Loeb S, Demszky D.: Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes; 2024

work page 2024

[29] [29]

Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

Xu L, Xie H, Qin SJ, Tao X, Wang FL. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2026;p. 1--20. doi:10.1109/TPAMI.2026.3657354

work page doi:10.1109/tpami.2026.3657354 2026

[30] [30]

Qwen3 Technical Report

Yang A, et al. Qwen3 Technical Report. arXiv. 2025;abs/2505.09388

work page internal anchor Pith review Pith/arXiv arXiv 2025

[31] [31]

sn-basic.bst

FUNCTION identify.basic.version "sn-basic.bst" " [2024/07/19 v1.1 bibliography style]" * top ENTRY address archive author booktitle chapter doi edition editor eid eprint howpublished institution journal key keywords month note number organization pages publisher school series title type url volume year archivePrefix primaryClass adsurl adsnote version lab...

work page 2024

[32] [32]

write newline

" write newline "" before.all 'output.state := FUNCTION add.period duplicate empty 'skip "." * add.blank if FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION ...

work page

[33] [33]

write newline

" write newline "" before.all 'output.state := FUNCTION output.doi doi empty skip "doi:" doi * "" * output if FUNCTION format.archive archivePrefix empty "" archivePrefix ":" * if FUNCTION format.primaryClass primaryClass empty "" " [" primaryClass * "] " * if FUNCTION format.eprint eprint empty "" archive empty " https://arxiv.org/abs/" eprint * " " * " ...

work page

[34] [34]

write newline

" write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

work page

[35] [35]

write newline

" write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

work page

[36] [36]

sn-nature.bst

FUNCTION identify.nature.version "sn-nature.bst" " [2024/07/19 v1.1 bibliography style]" * top ENTRY address archive author booktitle chapter edition editor eprint howpublished institution journal key keywords month note number organization pages publisher school series title type url doi volume year archivePrefix primaryClass eid adsurl adsnote version l...

work page 2024

[37] [37]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

work page

[38] [38]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

work page

[39] [39]

sn-vancouver-num.bst

FUNCTION identify.vancouver.version "sn-vancouver-num.bst" " [2024/07/19 v1.1 Vancouver bibliography style]" * top ENTRY address assignee author booktitle chapter cartographer day edition editor howpublished institution inventor journal key keywords month note number organization pages part publisher school series title type volume word year eprint doi ur...

work page 2024

[40] [40]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

work page