Development of email classifier in Brazilian Portuguese using feature selection for automatic response
Pith reviewed 2026-05-25 01:20 UTC · model grok-4.3
The pith
Support Vector Machines with non-lemmatized verbs, nouns and adjectives classify Brazilian Portuguese business emails at 87.3% accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors present a novel corpus of Brazilian Portuguese business emails for automatic categorization. Baseline experiments compare Naive Bayes and Support Vector Machines and examine the effects of lemmatization and part-of-speech tagging. Support Vector Machines classification coupled with non-lemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization led to the lowest classification results.
What carries the argument
Support Vector Machines classification on non-lemmatized verbs, nouns and adjectives selected via part-of-speech tagging.
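A minimal sketch of the part-of-speech filtering step, to make the feature selection concrete. The lexicon, the tag names and the example sentence are hypothetical stand-ins for a real Portuguese tagger; nothing here reproduces the paper's actual tooling, and the SVM itself is left out.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a real Portuguese POS tagger.
# Tag names (V, N, ADJ, PREP, ART) are illustrative, not the paper's tagset.
TOY_LEXICON = {
    "gostaria": "V",
    "de": "PREP",
    "saber": "V",
    "o": "ART",
    "prazo": "N",
    "final": "ADJ",
    "da": "PREP",
    "inscrição": "N",
}

KEPT_TAGS = {"V", "N", "ADJ"}  # verbs, nouns, adjectives


def tag(tokens):
    """Look each token up in the toy lexicon; unknown words get 'UNK'."""
    return [(token, TOY_LEXICON.get(token, "UNK")) for token in tokens]


def select_features(text):
    """Keep surface forms (no lemmatization) of verbs, nouns and adjectives."""
    tokens = text.lower().split()
    kept = [token for token, pos in tag(tokens) if pos in KEPT_TAGS]
    return Counter(kept)


features = select_features("Gostaria de saber o prazo final da inscrição")
print(sorted(features))
```

The resulting counts would serve as the bag-of-words input to a classifier; function words such as "de" and "o" are dropped, while inflected forms such as "gostaria" are kept as written.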
Load-bearing premise
The new email corpus is representative of the target business domain and the train/test split does not introduce selection bias that inflates reported accuracy.
What would settle it
Evaluating the trained model on a held-out set of emails collected from a different time period or a different company in the same domain would reveal whether the 87.3% accuracy generalizes or is inflated by the specific corpus characteristics.
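The check described above could be sketched as a time-based split. The record fields (`received`, `text`, `label`), the example rows, the cutoff date and the majority-label placeholder model are all assumptions made for illustration; the paper's corpus format is not described here.

```python
from datetime import date

# Hypothetical records; field names, dates and labels are illustrative only.
emails = [
    {"received": date(2015, 1, 10), "text": "qual o prazo", "label": "prazo"},
    {"received": date(2015, 2, 3), "text": "prazo final", "label": "prazo"},
    {"received": date(2015, 3, 7), "text": "valor da bolsa", "label": "bolsa"},
    {"received": date(2015, 7, 1), "text": "prazo de envio", "label": "prazo"},
    {"received": date(2015, 8, 9), "text": "bolsa integral", "label": "bolsa"},
]


def temporal_split(messages, cutoff):
    """Train on messages received before `cutoff`, test on the rest."""
    train = [m for m in messages if m["received"] < cutoff]
    test = [m for m in messages if m["received"] >= cutoff]
    return train, test


def accuracy(predict, examples):
    """Fraction of examples whose predicted label matches the gold label."""
    if not examples:
        raise ValueError("no examples to evaluate")
    hits = sum(predict(m["text"]) == m["label"] for m in examples)
    return hits / len(examples)


train, test = temporal_split(emails, cutoff=date(2015, 6, 1))

# Placeholder model: always predicts the most frequent training label.
train_labels = [m["label"] for m in train]
majority = max(set(train_labels), key=train_labels.count)
later_accuracy = accuracy(lambda text: majority, test)
```

A large gap between accuracy on a random split and accuracy on the later period would suggest that the reported figure depends on corpus-specific characteristics.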
Original abstract
Automatic email categorization is an important application of text classification. We study the automatic reply of email business messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, and baseline categorization experiments using Naive Bayes and Support Vector Machines. We then discuss the effect of lemmatization and the role of part-of-speech tagging filtering on precision and recall. Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision in SVM and Naive Bayes respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a novel corpus of Brazilian Portuguese business emails and reports baseline text classification results for automatic reply categorization using Naive Bayes and SVM. It compares the effects of lemmatization versus non-lemmatized part-of-speech filtering (verbs, nouns, adjectives), concluding that SVM on the non-lemmatized filtered features yields the highest accuracy of 87.3%.
Significance. If the accuracy figures are reproducible under a properly documented protocol, the work supplies a new domain-specific corpus and some empirical guidance on preprocessing choices for Portuguese email classification. The contribution is primarily empirical and corpus-oriented rather than methodological.
major comments (2)
- [Abstract] Abstract: the reported maximum accuracies (87.3% for SVM non-lemmatized, 85.3% and 81.7% for lemmatized SVM/NB) are presented without any accompanying information on corpus size, number of classes, class distribution, train/test split procedure, or cross-validation scheme. These omissions render the central empirical claim unverifiable and prevent assessment of possible selection bias or leakage.
- [Results/Evaluation] Evaluation/results section: the superiority of the non-lemmatized POS-filtered feature set is asserted on the basis of single accuracy numbers; no statistical significance tests, confidence intervals, or variance across folds/runs are reported, so the claim that this configuration is reliably best cannot be evaluated.
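To illustrate why a single accuracy number is hard to interpret, the following sketch computes a Wilson score interval for the reported 87.3%. The test-set sizes (200 and 2000) are hypothetical, since the size of the test partition is not stated here; only the 0.873 figure comes from the paper.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion observed on n test items."""
    denom = 1 + z**2 / n
    center = (accuracy + z**2 / (2 * n)) / denom
    half = z * math.sqrt(accuracy * (1 - accuracy) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# 0.873 is the reported figure; both test-set sizes are hypothetical.
small = wilson_interval(0.873, 200)
large = wilson_interval(0.873, 2000)
print(f"n=200:  [{small[0]:.3f}, {small[1]:.3f}]")
print(f"n=2000: [{large[0]:.3f}, {large[1]:.3f}]")
```

With only a few hundred test messages, the interval is several points wide, so a gap of about two points between configurations would not by itself show that one is reliably better.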
minor comments (2)
- The title emphasizes 'feature selection' but the described experiments center on POS filtering and lemmatization; clarify whether any additional feature-selection algorithms (e.g., information gain, chi-squared) were applied beyond the POS step.
- [Abstract] The abstract states that 'straightforward lemmatization led to the lowest classification results' yet supplies only precision figures for the lemmatized case; recall or full precision-recall tables for all conditions would improve comparability.
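The full precision-recall table requested above could be produced from gold and predicted labels along these lines. The class names and label lists are hypothetical toy data, not values from the paper.

```python
from collections import Counter


def per_class_scores(gold, predicted):
    """Return {label: (precision, recall, f1)} from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p was wrong
            fn[g] += 1  # gold label g was missed
    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        prec_den = tp[label] + fp[label]
        rec_den = tp[label] + fn[label]
        precision = tp[label] / prec_den if prec_den else 0.0
        recall = tp[label] / rec_den if rec_den else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        scores[label] = (precision, recall, f1)
    return scores


# Hypothetical toy labels for three made-up reply categories.
gold = ["prazo", "prazo", "bolsa", "bolsa", "outro"]
predicted = ["prazo", "bolsa", "bolsa", "bolsa", "outro"]
scores = per_class_scores(gold, predicted)
for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting all three columns for every preprocessing condition would let readers compare the lemmatized and non-lemmatized runs on the same footing.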
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important issues of verifiability and statistical rigor in our empirical results. We respond to each major comment below.
-
Referee: [Abstract] Abstract: the reported maximum accuracies (87.3% for SVM non-lemmatized, 85.3% and 81.7% for lemmatized SVM/NB) are presented without any accompanying information on corpus size, number of classes, class distribution, train/test split procedure, or cross-validation scheme. These omissions render the central empirical claim unverifiable and prevent assessment of possible selection bias or leakage.
Authors: We agree that the abstract should include these details to support verifiability of the reported accuracies. The manuscript body already describes the corpus, class structure, and evaluation protocol; we will revise the abstract to concisely incorporate this information so that the central empirical claims can be properly assessed.
Revision: yes
-
Referee: [Results/Evaluation] Evaluation/results section: the superiority of the non-lemmatized POS-filtered feature set is asserted on the basis of single accuracy numbers; no statistical significance tests, confidence intervals, or variance across folds/runs are reported, so the claim that this configuration is reliably best cannot be evaluated.
Authors: The evaluation was performed with a single fixed train/test split, and the manuscript reports only the resulting point accuracies. We acknowledge that the absence of variance estimates or significance tests limits the strength of the superiority claim. In revision we will explicitly document the split procedure within the results section and note the single-run nature as a limitation of this baseline study. Performing additional cross-validation folds and statistical tests would require new experiments outside the current scope.
Revision: partial
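For reference, the cross-validation the exchange above refers to could be set up roughly as follows. This is a sketch under stated assumptions: the class labels are hypothetical, the fold assignment is a simple round-robin without shuffling, and no classifier is trained.

```python
from collections import defaultdict
from statistics import mean, stdev


def stratified_folds(labels, k):
    """Deal the indices of each class round-robin into k folds."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def summarize(fold_accuracies):
    """Mean and sample standard deviation of per-fold accuracies."""
    return mean(fold_accuracies), stdev(fold_accuracies)


labels = ["prazo"] * 6 + ["bolsa"] * 3  # hypothetical class labels
folds = stratified_folds(labels, k=3)
print(folds)
```

Each fold serves once as the test set while the remaining folds form the training set; passing the resulting per-fold accuracies to `summarize` gives the variance estimate that the referee asks for. In practice the indices within each class would be shuffled before dealing.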
Circularity Check
No circularity: purely empirical classifier evaluation on held-out data
The paper conducts standard supervised classification experiments (Naive Bayes and SVM) on a novel email corpus, reporting precision/recall/accuracy after varying preprocessing choices such as lemmatization and POS filtering. The 87.3% accuracy figure is a direct empirical measurement on a test partition; no equations, derivations, or first-principles claims are present that could reduce to fitted inputs or self-citations by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. This is the expected outcome for an applied ML comparison paper whose central claims rest on external data rather than internal definitional closure.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction
-
[2]
Related Work And Background: 2.1. Natural Language Processing And Text Classification; 2.2. Naive Bayes Classifier; 2.3. Support Vector Machines Classifier...
-
[3]
Corpus Collection: 3.1. Partnership With Not-Profit Organization To Obtain Data; 3.2. Email Is Structured In Tickets; 3.3. Macros As Cla...
-
[4]
Corpus Processing: 4.1. Text Filtering With Lemmatizer And Parts Of Speech; 4.2. Naive Bayes And SVM Classifiers
-
[5]
Evaluation Of Lemmatization And Part-Of-Speech Filtering Effect On Performance
-
[6]
Discussion: 6.1. The Choice Of The Classifiers; 6.2. Precision And Recall Are Consistent With Literature; 6.3. Colloquial Speech Red...
-
[7]
Conclusions
-
[8]
References
-
[9]
INTRODUCTION Electronic mail is a ubiquitous mode of communication in personal and work life [1], [2]. Email is quickly received, and it can be sent asynchronously at low cost. On the other hand, providing personalized and appropriate answers to questions sent by email is not an easy task, particularly as the number of messages scales up [3]. Messages a...
-
[10]
and Naïve Bayes (NB) classifiers. The latter two will be further detailed in Section 2. State-of-the-art algorithms vary depending on the type of classification being performed, which could be binary or between multiple categories, the text length, and the types of features to be taken into account in the statistical method [13], [14]. In this project we examin...
-
[11]
RELATED WORK AND BACKGROUND 2.1. NATURAL LANGUAGE PROCESSING AND TEXT CLASSIFICATION The most accurate algorithms for text classification today are Support Vector Machines (SVM), Naive Bayes (NB) and k-Nearest-Neighbors (kNN), including hybrid approaches that can achieve greater precision than these methods separately [14]. SVM is one of the top performe...
-
[12]
method for classifying texts into categories, despite the overly simplistic approach of assuming complete independence between words in a sentence, which does not even take into account the order of words in a text. A simple numerical example to facilitate the understanding of the Naive Bayes classifier was extracted from Manning, Raghavan and Schütze [6]...
-
[13]
CORPUS COLLECTION In this section we explain how we built the corpus we used in all experiments. 3.1. PARTNERSHIP WITH NOT-PROFIT ORGANIZATION TO OBTAIN DATA Even though email communication is important in many interactions between customers and companies, real-life data is of limited availability. To our knowledge, no public enterprise email corpus with...
-
[14]
CORPUS PROCESSING In this section we explain how we processed our email corpus to prepare the datasets used in the experiments, and the techniques applied to classify messages. 4.1. TEXT FILTERING WITH LEMMATIZER AND PARTS OF SPEECH We used different techniques to process the training corpus with the objective of assessing the impact on recall and precisi...
-
[15]
EVALUATION OF LEMMATIZATION AND PART-OF-SPEECH FILTERING EFFECT ON PERFORMANCE Table 9 presents the effect of the POS-Tagger filter and of the lemmatizer in precision, recall and F1 measurements with our different training and test data. Comparing both classifiers among all filters, the highest precision achieved was 87.5%, recall 87.2% and F1 87.3%, f...
-
[16]
DISCUSSION 6.1. THE CHOICE OF THE CLASSIFIERS In this project, we focused on the Naive Bayes and SVM algorithms for classification. A common application of these classifiers is separating “Spam and Ham” in email inboxes [44], but they have also attained high precision and recall in classification problems with more than two categories [12]. Naive Bayes ha...
-
[17]
CONCLUSIONS We successfully built a corpus of email messages in Brazilian Portuguese. That was accomplished in association with Fundação Estudar, a non-profit organization in education that provided us with their email logs. Based on the corpus created, we produced a study of email classification. We implemented a Naive Bayes and a Support Vector Machine...
-
[18]
Revised Naive Bayes Classifier for Combating the Focus Attack in Spam Filtering,
J. Peng and P. P. K. Chan, “Revised Naive Bayes Classifier for Combating the Focus Attack in Spam Filtering,” in 2013 International Conference on Machine Learning and Cybernetics, 2013, pp. 14–17
-
[19]
An Optimized k-NN Classifier based on Minimum Spanning Tree for Email Filtering,
A. Chakrabarty, “An Optimized k-NN Classifier based on Minimum Spanning Tree for Email Filtering,” Bus. Inf. Manag., pp. 47–52, 2014
-
[20]
System And Method For Message Process And Response,
K. D. Richardson, J. Greif, D. Buedel, and B. Aleksandrovsky, “System And Method For Message Process And Response,” US6278996 B1, 2001
-
[21]
M. C. Buskirk Jr, F. J. Damerau, D. H. Johnson, and M. Raaen, “Machine Learning Based Electronic Messaging System,” US6424997 B1, 2003.03
-
[22]
S. Ayyadurai, “System and Method for Content-Sensitive Automatic Reply Message Generation for Text-Based Asynchronous Communications,” US6718368 B1, 2004
-
[23]
C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge: Cambridge UP, pp. 154-157 and 261, 2009
-
[24]
Transductive Inference for Text Classification Using Support Vector Machines,
T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” ICML, vol. 99, pp. 200–209, 1999
-
[25]
W. W. Cohen, V. R. Carvalho, and T. M. Mitchell, Learning to Classify Email into “Speech Acts,” vol. 4, no. 11. 2004
-
[26]
Towards Designing an Email Classification System Using Multi-view Based Semi-Supervised Learning,
W. Li, W. Meng, Z. Tan, and Y. Xiang, “Towards Designing an Email Classification System Using Multi-view Based Semi-Supervised Learning,” 2014 IEEE 13th Int. Conf. Trust. Secur. Priv. Comput. Commun., pp. 174–181, 2014
-
[27]
Email Classification with Co-Training,
S. Kiritchenko and S. Matwin, “Email Classification with Co-Training,” Proc. 2001 Conf. Cent. Adv. Stud. Collab. Res., p. 8, 2001.
-
[28]
A comparative study for email classification,
S. Youn and D. McLeod, “A comparative study for email classification,” Adv. Innov. Syst. Comput. Sci. Softw. Eng., pp. 387–391, 2007
-
[29]
The Enron corpus: A new dataset for Email Classification Research,
B. Klimt and Y. Yang, “The Enron corpus: A new dataset for Email Classification Research,” Mach. Learn. ECML 2004, pp. 217–226, 2004
-
[30]
Baselines and bigrams: Simple, good sentiment and topic classification,
S. Wang and C. D. Manning, “Baselines and bigrams: Simple, good sentiment and topic classification,” ACL 12 Proc. 50th Annu. Meet. Assoc. Comput. Linguist. Short Pap. - Vol. 2, vol. 94305, no. 1, pp. 90–94, 2012
-
[31]
A Comparative Study on Different Types of Approaches to Text Categorization,
P. Y. Pawar and S. H. Gawande, “A Comparative Study on Different Types of Approaches to Text Categorization,” Int. J. Mach. Learn. Comput., vol. 2, no. 4, pp. 423–426, 2012
-
[32]
Classificação Automática de Emails,
M. J. A. Lima, “Classificação Automática de Emails,” Universidade do Porto, 2013
-
[33]
IBM’s Watson Now A Customer Service Agent
B. Upbin, “IBM’s Watson Now A Customer Service Agent,” Forbes Magazine, p. 1, accessed May 21, 2013. <http://www.forbes.com/sites/bruceupbin/2013/05/21/ibms-watson-now-a-customer-service-agent-coming-to-smartphones-soon/>
-
[34]
Reinforced Multicategory Support Vector Machines,
Y. Liu and M. Yuan, “Reinforced Multicategory Support Vector Machines,” J. Comput. Graph. Stat., vol. 20, no. 4, pp. 901–919, 2011
-
[35]
Joachims, Learning to Classify Text Using Support Vector Machines
T. Joachims, Learning to Classify Text Using Support Vector Machines. ICML, vol. 99, pp.200-209, 2001
-
[36]
Towards an Adaptive Mail Classifier,
E. Masciari, M. Ruffolo, and A. Tagarelli, “Towards an Adaptive Mail Classifier,” Ital. Assoc. Artif. Intell. Work. Su Apprendimento Autom. Metod. ed Appl., no. August, 2002
-
[37]
Classifying spam emails using text and readability features,
R. Shams and R. E. Mercer, “Classifying spam emails using text and readability features,” Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 657–666, 2013
-
[38]
An empirical performance comparison of machine learning methods for spam e-mail categorization,
C.-C. Lai and M.-C. Tsai, “An empirical performance comparison of machine learning methods for spam e-mail categorization,” Fourth Int. Conf. Hybrid Intell. Syst., pp. 0–4, 2004.
-
[39]
An approach to spam detection by Naive Bayes ensemble based on decision induction,
Y. Zhen, N. Xiangfei, X. Weiran, and G. Jun, “An approach to spam detection by Naive Bayes ensemble based on decision induction,” Proc. - ISDA 2006 Sixth Int. Conf. Intell. Syst. Des. Appl., vol. 2, pp. 861–866, 2006
-
[40]
Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,
I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,” Proc. Work. “Machine Learn. Textual Inf. Access,” no. September 2000, pp. 1–12, 2000
-
[41]
C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995
-
[42]
A Practical Guide to Support Vector Classification,
C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A Practical Guide to Support Vector Classification,” pp. 1–16, 2003
-
[43]
A Training Algorithm for Optimal Margin Classifiers,
B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” Proc. 5th Annu. ACM Work. Comput. Learn. Theory, pp. 144–152, 1992
-
[44]
Joachims, Learning to Classify Text Using Support Vector Machines
T. Joachims, Learning to Classify Text Using Support Vector Machines. 2001
-
[45]
Automatic Opinion Polarity Classification of Movie,
F. Salvetti, S. Lewis, and C. Reichenbach, “Automatic Opinion Polarity Classification of Movie,” Color. Res. Linguist., vol. 17, no. 1, p. 2, 2004
-
[46]
Natural language processing in support of decision-making: Phrases and part-of-speech tagging,
R. M. Losee, “Natural language processing in support of decision-making: Phrases and part-of-speech tagging,” Inf. Process. Manag., vol. 37, no. 6, pp. 769–787, 2001
-
[47]
Automatic Induction of Rules for e-mail Classification,
E. Crawford, J. Kay, and E. McCreath, “Automatic Induction of Rules for e-mail Classification,” in Sixth Australasian Document Computing Symposium, 2001
-
[48]
The Case Against Accuracy Estimation for Comparing Induction Algorithms,
F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Fifteenth Int. Conf. Mach. Learn., pp. 445–453, 1997
-
[49]
Enhanced email spam filtering through combining similarity graphs,
A. Dasgupta, M. Gurevich, and K. Punera, “Enhanced email spam filtering through combining similarity graphs,” Proc. fourth ACM Int. Conf. Web search data Min. - WSDM ’11, p. 785, 2011.
-
[50]
A Semi-Supervised Bayesian Network Model for Microblog Topic Classification,
Y. Chen, Z. Li, L. Nie, X. Hu, X. Wang, T. Chua, X. Zhang, L. Liqiang, N. X, W. Tat, S. Chua, and X. Zhang, “A Semi-Supervised Bayesian Network Model for Microblog Topic Classification,” Coling, vol. 1, no. December, pp. 561–576, 2012
-
[51]
Utilização De Redes Neurais Artificiais Para Classificação De Spam,
A. M. Da Silva, “Utilização De Redes Neurais Artificiais Para Classificação De Spam,” Centro Federal De Educação Tecnológica De Minas Gerais, 2009
-
[52]
Detecção De Mensagens Não Solicitadas Utilizando Mineração De Textos,
M. S. Moreira, “Detecção De Mensagens Não Solicitadas Utilizando Mineração De Textos,” COPPE UFRJ, 2010
-
[53]
Mineração de opinião em textos opinativos utilizando algoritmos de classificação,
F. Santos, “Mineração de opinião em textos opinativos utilizando algoritmos de classificação,” Universidade de Brasilia, 2013
-
[54]
Twitter Sentiment Analysis: The Good the Bad and the OMG!,
E. Kouloumpis, T. Wilson, and J. Moore, “Twitter Sentiment Analysis: The Good the Bad and the OMG!,” Proc. Fifth Int. AAAI Conf. Weblogs Soc. Media, pp. 538–541, 2011
-
[55]
Precise tweet classification and sentiment analysis,
R. Batool, A. M. Khattak, J. Maqbool, and S. Lee, “Precise tweet classification and sentiment analysis,” 2013 IEEE/ACIS 12th Int. Conf. Comput. Inf. Sci. ICIS 2013 - Proc., pp. 461–466, 2013
-
[56]
Thumbs up?: sentiment classification using machine learning techniques,
B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 79–86, 2002
-
[57]
Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese,
C. Freitas, C. Mota, D. Santos, H. G. Oliveira, and P. Carvalho, “Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese,” Proc. Seventh Int. Conf. Lang. Resour. Eval., no. 3, pp. 3630–3637, 2010
-
[58]
CHAVE: topics and questions on the Portuguese participation in CLEF,
D. Santos and P. Rocha, “CHAVE: topics and questions on the Portuguese participation in CLEF,” Cross Lang. Eval. Forum Work. Notes CLEF 2004 Work., pp. 639–648, 2004
-
[59]
Stemming para a língua portuguesa: estudo, análise e melhoria do algoritmo RSLP,
A. R. Coelho, “Stemming para a língua portuguesa: estudo, análise e melhoria do algoritmo RSLP,” Universidade Federal do Rio Grande do Sul, 2007
-
[60]
Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging,
E. R. Fonseca and G. Rosa, “Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging,” in Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, 2013, pp. 98–107.
-
[61]
Spam filtering with Naive Bayes - Which Naive Bayes?,
V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with Naive Bayes - Which Naive Bayes?,” CEAS, p. 9, 2006
-
[62]
The Form is the Substance: Classification of Genres in Text,
N. Dewdney, C. VanEss-Dykema, and R. MacMillan, “The Form is the Substance: Classification of Genres in Text,” in Proceedings of the workshop on Human Language Technology and Knowledge Management, 2001, pp. 1–8