CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data
Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3
The pith
CAMO ensemble raises strict macro F1 on imbalanced language datasets by dynamically favoring minority classes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On fine-tuned models, CAMO consistently achieves the highest strict macro F1-score across the benchmarks, which the authors present as evidence that it is a reliable, domain-neutral framework for imbalanced classification that works in concert with model adaptation.
What carries the argument
A hierarchical procedure that combines vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost underrepresented classes while preserving minority-class predictions.
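To make the three ingredients concrete, here is a minimal Python sketch of a class-aware vote. It is not the CAMO algorithm, whose exact steps are not given on this page; the function name `class_aware_vote`, the inverse-frequency boost, and the entropy-based uncertainty are all illustrative assumptions.

```python
import math
from collections import Counter


def class_aware_vote(model_probs, class_freq):
    """Hypothetical class-aware ensemble vote (illustrative, not CAMO itself).

    model_probs: list of dicts mapping label -> probability, one per model.
    class_freq:  dict mapping label -> relative frequency in the training data.
    Returns (winning_label, per_class_scores).
    """
    labels = list(class_freq)
    n = len(model_probs)

    # Ingredient 1: vote distribution (each model's top-probability label).
    votes = Counter(max(p, key=p.get) for p in model_probs)

    # Ingredient 2: mean confidence assigned to each class across models.
    mean_conf = {c: sum(p.get(c, 0.0) for p in model_probs) / n for c in labels}

    # Ingredient 3: inter-model uncertainty as normalised entropy of the votes
    # (0 when all models agree, 1 when votes are spread evenly over all classes).
    if len(labels) > 1:
        entropy = -sum((v / n) * math.log(v / n) for v in votes.values())
        uncertainty = entropy / math.log(len(labels))
    else:
        uncertainty = 0.0

    scores = {}
    for c in labels:
        base = votes.get(c, 0) / n + mean_conf[c]
        # Assumed boost: rarer classes gain more, but only when models disagree.
        boost = 1.0 + uncertainty * (1.0 - class_freq[c])
        scores[c] = base * boost
    return max(scores, key=scores.get), scores


freq = {"maj": 0.9, "min": 0.1}
agree = [{"maj": 0.9, "min": 0.1}] * 3
split = [{"maj": 0.6, "min": 0.4}] * 2 + [{"maj": 0.4, "min": 0.6}] * 2
print(class_aware_vote(agree, freq)[0])  # unanimous models: no boost is applied
print(class_aware_vote(split, freq)[0])  # split vote: the rarer class is favoured
```

In this sketch the boost is tied to disagreement, so a confident, unanimous majority prediction is left untouched; whether CAMO behaves this way is not stated in the material above.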
If this is right
- CAMO's advantage increases when used with fine-tuned models rather than zero-shot.
- The optimal ensemble method varies based on the properties of the underlying language models.
- It sets a new reference score for strict macro F1 on highly imbalanced, domain-specific text datasets.
Where Pith is reading between the lines
- This approach could be adapted to other imbalanced classification problems beyond text, such as image or tabular data.
- It suggests that ensemble design should account for model type and training stage to maximize gains on minorities.
- Future tests on more diverse datasets might reveal if the method generalizes without additional tuning.
Load-bearing premise
The hierarchical procedure using vote distributions, confidence calibration, and inter-model uncertainty will reliably boost underrepresented classes across different domains and model types without introducing new biases or requiring extensive tuning.
What would settle it
Evaluating CAMO on a third highly imbalanced dataset from a new domain, using a different set of language models, and checking whether it still achieves the highest strict macro F1 compared to the seven other ensembles.
Original abstract
Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized). Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings. With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties. This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CAMO, a novel ensemble technique for imbalanced classification in language models. It employs a hierarchical procedure that integrates vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost performance on minority classes while maintaining overall accuracy. Evaluations are conducted on two imbalanced datasets (DIAR-AI/Emotion and BEA 2025) using eight language models (three LLMs, five SLMs) and seven baselines in both zero-shot and fine-tuned settings. The paper claims that CAMO achieves the highest strict macro F1 scores when models are fine-tuned, and that its benefits complement model adaptation.
Significance. Should the empirical results hold under further scrutiny, CAMO represents a valuable contribution to handling class imbalance in NLP tasks, offering a domain-neutral approach that complements model fine-tuning. The comprehensive benchmarking across multiple models, settings, and baselines, along with the observation that ensemble performance interacts with model properties, provides actionable insights for practitioners. The method's design avoids introducing free parameters, enhancing its practicality.
major comments (1)
- [Results section (e.g., Table 2 and Table 3)] The results tables report higher macro F1 scores for CAMO than the seven baselines across the eight models, but lack error bars, standard deviations across runs, or statistical significance tests (such as McNemar's test or Wilcoxon signed-rank tests) to support the claim of consistent superiority; this weakens the central empirical assertion in the absence of variance quantification.
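As an illustration of the kind of check the referee asks for, the following is a minimal sketch of an exact two-sided McNemar test written with the Python standard library only. The function name `mcnemar_exact` and the toy predictions are hypothetical. Note that McNemar's test compares per-example correctness (that is, accuracy) of two systems on the same test set, so it complements rather than replaces a variance estimate for macro F1.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Exact two-sided McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts examples that only system A
    classifies correctly and c counts examples only system B gets right.
    """
    b = sum(a == t and p != t for t, a, p in zip(y_true, pred_a, pred_b))
    c = sum(a != t and p == t for t, a, p in zip(y_true, pred_a, pred_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # the systems never disagree on correctness
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (made up): A is right on all ten items, B on the first four only.
y_true = [1] * 10
pred_a = [1] * 10
pred_b = [1] * 4 + [0] * 6
print(mcnemar_exact(y_true, pred_a, pred_b))
```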
minor comments (3)
- [§3] The hierarchical procedure in §3 would benefit from pseudocode or an explicit algorithm listing to improve reproducibility of the vote aggregation, calibration, and uncertainty weighting steps.
- [Evaluation Metrics subsection] The term 'strict macro F1-score' is introduced in the abstract and results without a formal definition or distinction from standard macro F1; this should be clarified in the evaluation metrics subsection.
- [Related Work and Experimental Setup] Some of the seven baseline ensemble algorithms are referenced only by name; adding the original citations would strengthen the related work and experimental setup sections.
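For reference on the metric question raised above, standard macro F1 is the unweighted mean of per-class F1 scores. The sketch below computes it in plain Python; the function name `macro_f1` is illustrative, and how the paper's "strict" variant differs from this standard definition is not specified in the material reviewed here.

```python
def macro_f1(y_true, y_pred, labels=None):
    """Standard macro F1: the unweighted mean of per-class F1 scores."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present scores 0 here.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy data (made up): always predicting the majority class gives 80% accuracy,
# but the minority class contributes an F1 of 0 to the macro average.
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 10
print(round(macro_f1(y_true, y_pred), 4))
```

Because every class carries equal weight, a system that ignores a rare class is penalised heavily, which is why macro F1 is a common choice for imbalanced evaluation.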
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the minor revision recommendation. We address the concern about variance and statistical support in the results below.
- Referee: [Results section (e.g., Table 2 and Table 3)] The results tables report higher macro F1 scores for CAMO than the seven baselines across the eight models, but lack error bars, standard deviations across runs, or statistical significance tests (such as McNemar's test or Wilcoxon signed-rank tests) to support the claim of consistent superiority; this weakens the central empirical assertion in the absence of variance quantification.
  Authors: We agree that quantifying variance and providing statistical tests would strengthen the empirical claims. In the revised manuscript we will add standard deviations computed over multiple random seeds for the fine-tuned SLM experiments (where computational cost permits) and include Wilcoxon signed-rank tests comparing CAMO against each baseline across the eight models and two datasets. For the zero-shot LLM results, which are largely deterministic given fixed prompts, we will explicitly note the single-run nature and discuss this limitation. These additions will appear in the updated Tables 2 and 3 and the accompanying text.
  Revision: yes
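The paired comparison mentioned in the rebuttal could look roughly like the following standard-library sketch of an exact two-sided Wilcoxon signed-rank test. The function name `wilcoxon_signed_rank` and the scores are hypothetical placeholders, not results from the paper. Zero differences are dropped and tied absolute differences receive averaged ranks, which is one common convention among several.

```python
def wilcoxon_signed_rank(x, y, ndigits=10):
    """Exact two-sided Wilcoxon signed-rank test for paired scores.

    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    # Round to limit spurious float differences, then drop zero differences.
    diffs = [d for d in (round(a - b, ndigits) for a, b in zip(x, y)) if d != 0]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    stat = min(w_plus, sum(ranks) - w_plus)
    # Exact null distribution: each rank is signed + or - with probability 1/2.
    # Doubled ranks keep half-integer (tied) ranks integral.
    counts = {0: 1}
    for r in (int(round(2 * r)) for r in ranks):
        nxt = {}
        for s, c in counts.items():
            nxt[s] = nxt.get(s, 0) + c
            nxt[s + r] = nxt.get(s + r, 0) + c
        counts = nxt
    target = int(round(2 * stat))
    tail = sum(c for s, c in counts.items() if s <= target)
    return stat, min(1.0, 2 * tail / 2 ** n)


# Made-up paired scores for six model/dataset pairs (method vs. baseline).
method = [10, 12, 15, 19, 24, 30]
baseline = [9, 10, 12, 15, 19, 24]
print(wilcoxon_signed_rank(method, baseline))
```

With eight models and two datasets there are at most 16 pairs per baseline, so the exact null distribution is cheap to enumerate and no normal approximation is needed.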
Circularity Check
No significant circularity detected
The paper describes CAMO as a procedural ensemble method that applies vote distributions, confidence calibration, and inter-model uncertainty hierarchically to boost minority classes on imbalanced NLP tasks. This is presented as an algorithmic recipe with explicit steps, empirically validated on two fixed datasets against seven baselines using eight models in zero-shot and fine-tuned regimes. No equations, parameter fits, or derivations are shown that reduce by construction to the inputs themselves, and no load-bearing self-citations or uniqueness theorems are invoked to justify the core procedure. Performance claims rest on reported macro-F1 results rather than tautological redefinitions, rendering the contribution self-contained.