pith. sign in

arxiv: 2604.07583 · v2 · submitted 2026-04-08 · 💻 cs.CL · cs.LG

CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords class imbalanceensemble learninglanguage modelsmacro F1 scoreminority class optimizationtext classificationimbalanced data
0
0 comments X

The pith

CAMO ensemble raises strict macro F1 on imbalanced language datasets by dynamically favoring minority classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CAMO, an ensemble technique designed to handle class imbalance in text categorization tasks where standard methods underperform on minority classes. It uses a hierarchical process involving vote distributions, confidence calibration, and inter-model uncertainty to boost predictions for underrepresented classes. Tested on two highly unbalanced, domain-specific benchmarks with eight language models in zero-shot and fine-tuned modes, CAMO achieves the highest strict macro F1 scores when models are refined. This matters because real-world data often has imbalances, and better minority performance can improve reliability in applications like emotion detection or bias evaluation.

Core claim

CAMO consistently earns the greatest strict macro F1-score on refined models across the benchmarks, proving it a reliable, domain-neutral framework for unbalanced categorization that works in concert with model adaptation.

What carries the argument

The hierarchical procedure that incorporates vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost underrepresented classes while preserving minority forecasts.

If this is right

  • CAMO's advantage increases when used with fine-tuned models rather than zero-shot.
  • The optimal ensemble method varies based on the properties of the underlying language models.
  • It establishes a new benchmark for strict macro F1 on highly unbalanced domain-specific text datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could be adapted to other imbalanced classification problems beyond text, such as image or tabular data.
  • It suggests that ensemble design should account for model type and training stage to maximize gains on minorities.
  • Future tests on more diverse datasets might reveal if the method generalizes without additional tuning.

Load-bearing premise

The hierarchical procedure using vote distributions, confidence calibration, and inter-model uncertainty will reliably boost underrepresented classes across different domains and model types without introducing new biases or requiring extensive tuning.

What would settle it

Evaluating CAMO on a third highly imbalanced dataset from a new domain, using a different set of language models, and checking whether it still achieves the highest strict macro F1 compared to the seven other ensembles.

read the original abstract

Real-world categorization is severely hampered by class imbalance because traditional ensembles favor majority classes, which lowers minority performance and overall F1-score. We provide a unique ensemble technique for imbalanced problems called CAMO (Class-Aware Minority-Optimized).Through a hierarchical procedure that incorporates vote distributions, confidence calibration, and inter model uncertainty, CAMO dynamically boosts underrepresented classes while preserving and amplifying minority forecasts. We verify CAMO on two highly unbalanced, domain-specific benchmarks: the DIAR-AI/Emotion dataset and the ternary BEA 2025 dataset. We benchmark against seven proven ensemble algorithms using eight different language models (three LLMs and five SLMs) under zero-shot and fine-tuned settings .With refined models, CAMO consistently earns the greatest strict macro F1-score, setting a new benchmark. Its benefit works in concert with model adaptation, showing that the best ensemble choice depends on model properties .This proves that CAMO is a reliable, domain-neutral framework for unbalanced categorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper presents CAMO, a novel ensemble technique for imbalanced classification in language models. It employs a hierarchical procedure that integrates vote distributions, confidence calibration, and inter-model uncertainty to dynamically boost performance on minority classes while maintaining overall accuracy. Evaluations are conducted on two imbalanced datasets (DIAR-AI/Emotion and BEA 2025) using eight language models (three LLMs, five SLMs) and seven baselines in both zero-shot and fine-tuned settings, with claims that CAMO achieves the highest strict macro F1 scores when models are refined, and that its benefits align with model adaptation.

Significance. Should the empirical results hold under further scrutiny, CAMO represents a valuable contribution to handling class imbalance in NLP tasks, offering a domain-neutral approach that complements model fine-tuning. The comprehensive benchmarking across multiple models, settings, and baselines, along with the observation that ensemble performance interacts with model properties, provides actionable insights for practitioners. The method's design avoids introducing free parameters, enhancing its practicality.

major comments (1)
  1. [Results section (e.g., Table 2 and Table 3)] The results tables report higher macro F1 scores for CAMO than the seven baselines across the eight models, but lack error bars, standard deviations across runs, or statistical significance tests (such as McNemar's test or Wilcoxon signed-rank tests) to support the claim of consistent superiority; this weakens the central empirical assertion in the absence of variance quantification.
minor comments (3)
  1. [§3] The hierarchical procedure in §3 would benefit from pseudocode or an explicit algorithm listing to improve reproducibility of the vote aggregation, calibration, and uncertainty weighting steps.
  2. [Evaluation Metrics subsection] The term 'strict macro F1-score' is introduced in the abstract and results without a formal definition or distinction from standard macro F1; this should be clarified in the evaluation metrics subsection.
  3. [Related Work and Experimental Setup] Some of the seven baseline ensemble algorithms are referenced only by name; adding the original citations would strengthen the related work and experimental setup sections.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the positive evaluation and the minor revision recommendation. We address the concern about variance and statistical support in the results below.

read point-by-point responses
  1. Referee: [Results section (e.g., Table 2 and Table 3)] The results tables report higher macro F1 scores for CAMO than the seven baselines across the eight models, but lack error bars, standard deviations across runs, or statistical significance tests (such as McNemar's test or Wilcoxon signed-rank tests) to support the claim of consistent superiority; this weakens the central empirical assertion in the absence of variance quantification.

    Authors: We agree that quantifying variance and providing statistical tests would strengthen the empirical claims. In the revised manuscript we will add standard deviations computed over multiple random seeds for the fine-tuned SLM experiments (where computational cost permits) and include Wilcoxon signed-rank tests comparing CAMO against each baseline across the eight models and two datasets. For the zero-shot LLM results, which are largely deterministic given fixed prompts, we will explicitly note the single-run nature and discuss this limitation. These additions will appear in the updated Tables 2 and 3 and the accompanying text. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper describes CAMO as a procedural ensemble method that applies vote distributions, confidence calibration, and inter-model uncertainty hierarchically to boost minority classes on imbalanced NLP tasks. This is presented as an algorithmic recipe with explicit steps, empirically validated on two fixed datasets against seven baselines using eight models in zero-shot and fine-tuned regimes. No equations, parameter fits, or derivations are shown that reduce by construction to the inputs themselves, and no load-bearing self-citations or uniqueness theorems are invoked to justify the core procedure. Performance claims rest on reported macro-F1 results rather than tautological redefinitions, rendering the contribution self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method is described at a high level without mathematical formulation or listed assumptions.

pith-pipeline@v0.9.0 · 5475 in / 1003 out tokens · 44038 ms · 2026-05-10T17:23:31.557308+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 1 internal anchor

  1. [1]

    write newline

    " write newline " cite write " FUNCTION editor.postfix editor num.names #1 > "( )" "( )" if FUNCTION editor.trans.postfix editor num.names #1 > "( )" "( )" if FUNCTION trans.postfix translator num.names #1 > "( )" "( )" if FUNCTION authors.editors.reflist.apa5 'field := 'dot := field num.names 'numnames := numnames 'format.num.names := format.num.names na...

  2. [2]

    sn-aps.bst

    FUNCTION identify.aps.version "sn-aps.bst" " [2024/07/19 v1.1 APS bibliography style]" * top ENTRY address author booktitle chapter doi edition editor eid howpublished institution journal key keywords month note number organization pages publisher school series title type url volume year eprint archive archivePrefix primaryClass adsurl adsnote version lab...

  3. [3]

    write newline

    " write newline "" before.all 'output.state := FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION n.separate 't := "" #0 'numnames := t empty not t #-1 #1 subs...

  4. [4]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Abdin M, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. Microsoft; 2024

  5. [5]

    Almazrouei E, Alobeidli H, Alshamsi A, Cappelli A, Cojocaru R, Debbah M, et al.: The Falcon Series of Open Language Models; 2023

  6. [6]

    Cobbe K, Kosaraju V, Bavarian M, Chen M, Jun H, Kaiser L, et al.: Training Verifiers to Solve Math Word Problems; 2021

  7. [7]

    Daheim N, Macina J, Kapur M, Gurevych I, Sachan M.: Stepwise Verification and Remediation of Student Reasoning Errors with Large Language Model Tutors; 2024

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    DeepSeek-AI , Guo D, et al. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. Nature. 2025;645:633--638

  9. [9]

    Demszky D, Liu J, Mancenido Z, Cohen J, Hill H, Jurafsky D, et al.: Measuring Conversational Uptake: A Case Study on Student-Teacher Interactions; 2021

  10. [10]

    Dettmers T, Pagnoni A, Holtzman A, Zettlemoyer L.: QLoRA: Efficient Finetuning of Quantized LLMs; 2023

  11. [11]

    Grattafiori A, et al.: The Llama 3 Herd of Models; 2024

  12. [12]

    Hendrycks D, Burns C, Basart S, Zou A, Mazeika M, Song D, et al.: Measuring Massive Multitask Language Understanding; 2021

  13. [13]

    Hendrycks D, Burns C, Kadavath S, Arora A, Basart S, Tang E, et al.: Measuring Mathematical Problem Solving With the MATH Dataset; 2021

  14. [14]

    Henrichsen M, Krebs R.: Two-Stage Reasoning-Infused Learning: Improving Classification with LLM-Generated Reasoning; 2025

  15. [15]

    MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors

    Hikal B, Basem M, Oshallah I, Hamdi A. MSA at BEA 2025 Shared Task: Disagreement-Aware Instruction Tuning for Multi-Dimensional Evaluation of LLM s as Math Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BE...

  16. [16]

    Hu EJ, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al.: LoRA: Low-Rank Adaptation of Large Language Models; 2021

  17. [17]

    Jiang AQ, Sablayrolles A, Mensch A, Bamford C, Chaplot DS, de las Casas D, et al.: Mistral 7B; 2023

  18. [18]

    A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs

    Kermani A, Perez-Rosas V, Metsis V. A Systematic Evaluation of LLM Strategies for Mental Health Text Analysis: Fine-tuning vs. Prompt Engineering vs. RAG . In: Zirikly A, Yates A, Desmet B, Ireland M, Bedrick S, MacAvaney S, et al., editors. Proceedings of the 10th Workshop on Computational Linguistics and Clinical Psychology (CLPsych 2025) Albuquerque, N...

  19. [19]

    Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI -powered Tutors

    Kochmar E, Maurya K, Petukhova K, Srivatsa KA, Tack A, Vasselli J. Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI -powered Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2...

  20. [20]

    Lakshminarayanan B, Pritzel A, Blundell C.: Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles; 2017

  21. [21]

    AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring

    Nye B, Graesser A, Hu X. AutoTutor and Family: A Review of 17 Years of Natural Language Tutoring. International Journal of Artificial Intelligence in Education. 2014;24. doi:10.1007/s40593-014-0029-5

  22. [22]

    bea-jh at BEA 2025 Shared Task: Evaluating AI -powered Tutors through Pedagogically-Informed Reasoning

    Roh J, Bang J. bea-jh at BEA 2025 Shared Task: Evaluating AI -powered Tutors through Pedagogically-Informed Reasoning. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) Vienna, Austria: Association for Comp...

  23. [23]

    NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors

    Saha T, Ganguli S, Desarkar MS. NLIP at BEA 2025 Shared Task: Evaluation of Pedagogical Ability of AI Tutors. In: Kochmar E, Alhafni B, Bexte M, Burstein J, Horbach A, Laarmann-Quante R, et al., editors. Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025) Vienna, Austria: Association for Computational...

  24. [24]

    Self-Explaining Emotion Classification through Preference-Aligned Large Language Models

    Siddiqui MHF, Inkpen D, Gelbukh A. Self-Explaining Emotion Classification through Preference-Aligned Large Language Models. In: CS & IT Conference Proceedings, vol. 15 CS & IT Conference Proceedings; 2025

  25. [25]

    Tack A, Piech C.: The AI Teacher Test: Measuring the Pedagogical Ability of Blender and GPT-3 in Educational Dialogues; 2022

  26. [26]

    Touvron H, Lavril T, Izacard G, Martinet X, Lachaux MA, Lacroix T, et al.: LLaMA: Open and Efficient Foundation Language Models; 2023

  27. [27]

    Touvron H, et al.: Llama 2: Open Foundation and Fine-Tuned Chat Models; 2023

  28. [28]

    Wang RE, Zhang Q, Robinson C, Loeb S, Demszky D.: Bridging the Novice-Expert Gap via Models of Decision-Making: A Case Study on Remediating Math Mistakes; 2024

  29. [29]

    Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment

    Xu L, Xie H, Qin SJ, Tao X, Wang FL. Parameter-Efficient Fine-Tuning Methods for Pretrained Language Models: A Critical Review and Assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2026;p. 1--20. doi:10.1109/TPAMI.2026.3657354

  30. [30]

    Qwen3 Technical Report

    Yang A, et al. Qwen3 Technical Report. arXiv. 2025;abs/2505.09388

  31. [31]

    sn-basic.bst

    FUNCTION identify.basic.version "sn-basic.bst" " [2024/07/19 v1.1 bibliography style]" * top ENTRY address archive author booktitle chapter doi edition editor eid eprint howpublished institution journal key keywords month note number organization pages publisher school series title type url volume year archivePrefix primaryClass adsurl adsnote version lab...

  32. [32]

    write newline

    " write newline "" before.all 'output.state := FUNCTION add.period duplicate empty 'skip "." * add.blank if FUNCTION if.digit duplicate "0" = swap duplicate "1" = swap duplicate "2" = swap duplicate "3" = swap duplicate "4" = swap duplicate "5" = swap duplicate "6" = swap duplicate "7" = swap duplicate "8" = swap "9" = or or or or or or or or or FUNCTION ...

  33. [33]

    write newline

    " write newline "" before.all 'output.state := FUNCTION output.doi doi empty skip "doi:" doi * "" * output if FUNCTION format.archive archivePrefix empty "" archivePrefix ":" * if FUNCTION format.primaryClass primaryClass empty "" " [" primaryClass * "] " * if FUNCTION format.eprint eprint empty "" archive empty " https://arxiv.org/abs/" eprint * " " * " ...

  34. [34]

    write newline

    " write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

  35. [35]

    write newline

    " write newline "" before.all 'output.state := FUNCTION string.to.integer 't := t text.length 'k := #1 'char.num := t char.num #1 substring 's := s is.num s "." = or char.num k = not and char.num #1 + 'char.num := while char.num #1 - 'char.num := t #1 char.num substring FUNCTION find.integer 't := #0 'int := int not t empty not and t #1 #1 substring 's :=...

  36. [36]

    sn-nature.bst

    FUNCTION identify.nature.version "sn-nature.bst" " [2024/07/19 v1.1 bibliography style]" * top ENTRY address archive author booktitle chapter edition editor eprint howpublished institution journal key keywords month note number organization pages publisher school series title type url doi volume year archivePrefix primaryClass eid adsurl adsnote version l...

  37. [37]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  38. [38]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...

  39. [39]

    sn-vancouver-num.bst

    FUNCTION identify.vancouver.version "sn-vancouver-num.bst" " [2024/07/19 v1.1 Vancouver bibliography style]" * top ENTRY address assignee author booktitle chapter cartographer day edition editor howpublished institution inventor journal key keywords month note number organization pages part publisher school series title type volume word year eprint doi ur...

  40. [40]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize ":" * " " *...