
arxiv: 1907.04905 · v1 · pith:IMQ376G5new · submitted 2019-07-08 · 💻 cs.IR · cs.CL · cs.LG

Development of email classifier in Brazilian Portuguese using feature selection for automatic response

Pith reviewed 2026-05-25 01:20 UTC · model grok-4.3

classification 💻 cs.IR · cs.CL · cs.LG
keywords email classification · Brazilian Portuguese · support vector machines · feature selection · part-of-speech tagging · lemmatization · automatic response · text categorization

The pith

Support Vector Machines with non-lemmatized verbs, nouns and adjectives classify Brazilian Portuguese business emails at 87.3% accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an email classifier for automatic responses to business messages written in Brazilian Portuguese. It introduces a new corpus from a real application and tests Naive Bayes and Support Vector Machines under different preprocessing choices. The key finding is that SVM performs best when features are restricted to verbs, nouns and adjectives without lemmatization, reaching 87.3 percent accuracy. This matters because many companies receive large volumes of emails in Portuguese and could use such classifiers to speed up replies. The work also shows that standard lemmatization hurts performance while part-of-speech filtering helps.

Core claim

The authors present a novel corpus of Brazilian Portuguese business emails for automatic categorization. Baseline experiments compare Naive Bayes and Support Vector Machines, examining the effects of lemmatization and part-of-speech tagging. Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization led to the lowest classification results.
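As a rough illustration of the Naive Bayes baseline named above, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. It is not the authors' implementation: the function names `train_nb` and `predict_nb`, the smoothing constant, and the toy Portuguese tokens and class labels are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model with add-alpha smoothing."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc)
    model = {}
    for label, n_docs in class_docs.items():
        total = sum(token_counts[label].values())
        denom = total + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(docs)),
            "log_like": {
                t: math.log((token_counts[label][t] + alpha) / denom)
                for t in vocab
            },
        }
    return model


def predict_nb(model, doc):
    """Return the class with the highest log posterior; unseen tokens are skipped."""
    def score(label):
        params = model[label]
        return params["log_prior"] + sum(
            params["log_like"][t] for t in doc if t in params["log_like"]
        )
    return max(model, key=score)


# Hypothetical toy data, not taken from the paper's corpus.
docs = [
    ["bolsa", "inscrição", "prazo"],
    ["senha", "login", "erro"],
    ["prazo", "inscrição"],
    ["erro", "senha"],
]
labels = ["application", "account", "application", "account"]
model = train_nb(docs, labels)
prediction = predict_nb(model, ["senha", "erro", "login"])
```

The word-independence assumption is what lets the posterior reduce to a sum of per-token log likelihoods, which is why word order plays no role in the score.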

What carries the argument

Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives via part-of-speech tagging.
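The filtering step can be pictured with a small sketch: given tokens that a part-of-speech tagger has already labelled, keep only verbs, nouns and adjectives in their surface (non-lemmatized) form and count them as bag-of-words features. The coarse tag names `V`, `N` and `ADJ`, the lowercasing, and the hand-tagged sentence are assumptions for illustration; the tagger and tagset actually used in the paper may differ.

```python
from collections import Counter

# Assumed coarse tag names; a real tagger's tagset may differ.
KEPT_TAGS = {"V", "N", "ADJ"}


def pos_filter(tagged_tokens, kept_tags=KEPT_TAGS):
    """Keep surface forms (lowercased, not lemmatized) whose tag is in kept_tags."""
    return [word.lower() for word, tag in tagged_tokens if tag in kept_tags]


def bag_of_words(tokens):
    """Count token occurrences to form a sparse feature vector."""
    return Counter(tokens)


# Hypothetical hand-tagged sentence: "Não consigo acessar minha conta nova"
tagged = [
    ("Não", "ADV"),
    ("consigo", "V"),
    ("acessar", "V"),
    ("minha", "PRON"),
    ("conta", "N"),
    ("nova", "ADJ"),
]
features = bag_of_words(pos_filter(tagged))
```

Because no lemmatizer is applied, inflected forms such as "consigo" stay distinct from "conseguir", which is the configuration the claim above identifies as the best performer.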

Load-bearing premise

The new email corpus is representative of the target business domain and the train/test split does not introduce selection bias that inflates reported accuracy.

What would settle it

Evaluating the trained model on a held-out set of emails collected from a different time period or different company in the same domain would reveal if the 87.3% accuracy generalizes or is inflated by the specific corpus characteristics.
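One way to set up such a check is a temporal split, sketched below under stated assumptions: each record is a hypothetical `(received_on, text, label)` tuple, the cutoff date is arbitrary, and the keyword rule stands in for whatever trained classifier is being evaluated.

```python
from datetime import date


def temporal_split(records, cutoff):
    """Split records into train (before cutoff) and test (on or after cutoff)."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


def accuracy(predict, test):
    """Fraction of test records whose predicted label matches the true label."""
    if not test:
        raise ValueError("test set is empty")
    correct = sum(1 for _, text, label in test if predict(text) == label)
    return correct / len(test)


# Hypothetical records, not taken from the paper's corpus.
records = [
    (date(2018, 3, 1), "esqueci minha senha", "account"),
    (date(2018, 5, 10), "prazo da inscrição", "application"),
    (date(2019, 1, 15), "senha bloqueada", "account"),
    (date(2019, 2, 2), "dúvida sobre bolsa", "application"),
]
train, test = temporal_split(records, cutoff=date(2019, 1, 1))


def keyword_rule(text):
    """Placeholder predictor standing in for a trained model."""
    return "account" if "senha" in text else "application"


later_accuracy = accuracy(keyword_rule, test)
```

A large gap between accuracy on the original test partition and accuracy on the later period would point to corpus-specific effects rather than a generalizable result.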


Automatic email categorization is an important application of text classification. We study the automatic reply of email business messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, and baseline categorization experiments using Naive Bayes and Support Vector Machines. We then discuss the effect of lemmatization and the role of part-of-speech tagging filtering on precision and recall. Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision in SVM and Naive Bayes respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
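Since the abstract mixes accuracy, precision and recall, it may help to see how the quantities relate in a multi-class setting. The sketch below computes accuracy together with macro-averaged precision, recall and F1 from true and predicted labels. Macro averaging is an assumption here, since the abstract does not state which averaging the authors used, and the labels are made up.

```python
def per_class_counts(y_true, y_pred):
    """Count true positives, false positives and false negatives per class."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1
            counts[t]["fn"] += 1
    return counts


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall and F1."""
    counts = per_class_counts(y_true, y_pred)
    precisions, recalls, f1s = [], [], []
    for c in counts.values():
        predicted = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        p = c["tp"] / predicted if predicted else 0.0
        r = c["tp"] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels for three classes.
y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "c"]
scores = macro_scores(y_true, y_pred)
```

With imbalanced classes these four numbers can diverge noticeably, so a single headline figure is easier to interpret when the averaging scheme is stated alongside it.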

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a novel corpus of Brazilian Portuguese business emails and reports baseline text classification results for automatic reply categorization using Naive Bayes and SVM. It compares the effects of lemmatization versus non-lemmatized part-of-speech filtering (verbs, nouns, adjectives), concluding that SVM on the non-lemmatized filtered features yields the highest accuracy of 87.3%.

Significance. If the accuracy figures are reproducible under a properly documented protocol, the work supplies a new domain-specific corpus and some empirical guidance on preprocessing choices for Portuguese email classification. The contribution is primarily empirical and corpus-oriented rather than methodological.

major comments (2)
  1. [Abstract] The reported figures (87.3% maximum accuracy for non-lemmatized SVM; 85.3% and 81.7% precision for lemmatized SVM and NB) are presented without any accompanying information on corpus size, number of classes, class distribution, train/test split procedure, or cross-validation scheme. These omissions render the central empirical claim unverifiable and prevent assessment of possible selection bias or leakage.
  2. [Results/Evaluation] The superiority of the non-lemmatized POS-filtered feature set is asserted on the basis of single accuracy numbers; no statistical significance tests, confidence intervals, or variance across folds/runs are reported, so the claim that this configuration is reliably best cannot be evaluated.
minor comments (2)
  1. The title emphasizes 'feature selection' but the described experiments center on POS filtering and lemmatization; clarify whether any additional feature-selection algorithms (e.g., information gain, chi-squared) were applied beyond the POS step.
  2. [Abstract] The abstract states that 'straightforward lemmatization led to the lowest classification results' yet supplies only precision figures for the lemmatized case; recall or full precision-recall tables for all conditions would improve comparability.
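To make the second major comment concrete, a confidence interval on a single accuracy figure can be sketched with the Wilson score interval. The test-set size is not given in the material above, so the count of 1000 test emails below is purely a placeholder assumption; only the 87.3% proportion comes from the paper.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


# Placeholder: 873 correct out of an ASSUMED 1000 test emails.
low, high = wilson_interval(873, 1000)
```

The interval narrows as the test set grows, which is why the referee's request for the corpus and split sizes bears directly on whether a gap of about two points between configurations is meaningful.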

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important issues of verifiability and statistical rigor in our empirical results. We respond to each major comment below.

  1. Referee: [Abstract] The reported figures (87.3% maximum accuracy for non-lemmatized SVM; 85.3% and 81.7% precision for lemmatized SVM and NB) are presented without any accompanying information on corpus size, number of classes, class distribution, train/test split procedure, or cross-validation scheme. These omissions render the central empirical claim unverifiable and prevent assessment of possible selection bias or leakage.

    Authors: We agree that the abstract should include these details to support verifiability of the reported accuracies. The manuscript body already describes the corpus, class structure, and evaluation protocol; we will revise the abstract to concisely incorporate this information so that the central empirical claims can be properly assessed. (Revision: yes)

  2. Referee: [Results/Evaluation] The superiority of the non-lemmatized POS-filtered feature set is asserted on the basis of single accuracy numbers; no statistical significance tests, confidence intervals, or variance across folds/runs are reported, so the claim that this configuration is reliably best cannot be evaluated.

    Authors: The evaluation was performed with a single fixed train/test split, and the manuscript reports only the resulting point accuracies. We acknowledge that the absence of variance estimates or significance tests limits the strength of the superiority claim. In revision we will explicitly document the split procedure within the results section and note the single-run nature as a limitation of this baseline study. Performing additional cross-validation folds and statistical tests would require new experiments outside the current scope. (Revision: partial)
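For reference, the cross-validation protocol the referee asks about could be organized roughly as follows. This is a sketch only: `stratified_folds` is a hypothetical helper, the class labels are made up, and the per-fold scores are placeholder numbers used to show how a mean and standard deviation would be reported, not results from the paper.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds, spreading each class evenly across folds."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        for j, i in enumerate(idxs):
            folds[(offset + j) % k].append(i)
        offset += len(idxs)
    return folds


# Hypothetical labels: 10 emails of class "a" and 5 of class "b".
labels = ["a"] * 10 + ["b"] * 5
folds = stratified_folds(labels, k=5)

# PLACEHOLDER per-fold accuracies, only to illustrate the reporting step.
fold_scores = [0.86, 0.88, 0.87, 0.85, 0.89]
mean_score = statistics.mean(fold_scores)
std_score = statistics.stdev(fold_scores)
```

Each fold serves once as the test partition while the remaining folds form the training set; reporting the mean together with the spread across folds is what would let readers judge whether one configuration is reliably better than another.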

Circularity Check

0 steps flagged

No circularity: purely empirical classifier evaluation on held-out data


The paper conducts standard supervised classification experiments (Naive Bayes and SVM) on a novel email corpus, reporting precision/recall/accuracy after varying preprocessing choices such as lemmatization and POS filtering. The 87.3% accuracy figure is a direct empirical measurement on a test partition; no equations, derivations, or first-principles claims are present that could reduce to fitted inputs or self-citations by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. This is the expected outcome for an applied ML comparison paper whose central claims rest on external data rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical application of standard supervised classifiers; it relies on the usual i.i.d. assumption for train/test splits and on the correctness of the part-of-speech tagger used for filtering, none of which are derived inside the paper.

pith-pipeline@v0.9.0 · 5664 in / 975 out tokens · 18703 ms · 2026-05-25T01:20:16.668632+00:00 · methodology


Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

  1. [1]

    Introduction

  2. [2]

    Related Work And Background: 2.1. Natural Language Processing And Text Classification; 2.2. Naive Bayes Classifier; 2.3. Support Vector Machines Classifier...

  3. [3]

    Corpus Collection: 3.1. Partnership With Not-Profit Organization To Obtain Data; 3.2. Email Is Structured In Tickets; 3.3. Macros As Cla...

  4. [4]

    Corpus Processing: 4.1. Text Filtering With Lemmatizer And Parts Of Speech; 4.2. Naive Bayes And SVM Classifiers

  5. [5]

    Evaluation Of Lemmatization And Part-Of-Speech Filtering Effect On Performance

  6. [6]

    Discussion: 6.1. The Choice Of The Classifiers; 6.2. Precision And Recall Are Consistent With Literature; 6.3. Colloquial Speech Red...

  7. [7]

    Conclusions

  8. [8]

    References

  9. [9]

    Email is quickly received, and it can be sent asynchronously at low cost

    INTRODUCTION Electronic mail is an ubiquitous mode of communication in personal and work life [1], [2]. Email is quickly received, and it can be sent asynchronously at low cost. On the other hand, providing personalized and appropriate answers to questions sent by email is not an easy task, particularly as the number of messages scales up [3]. Messages a...

  10. [10]

    The latter two will be further detailed in Section 2

    and Naïve Bayes (NB) classifiers. The latter two will be further detailed in Section 2. State-of-the-art algorithms vary depending on the type of classification being performed, that could be binary or between multiple categories, text length and the types of features to be taken into account in the statistical method [13], [14]. In this project we examin...

  11. [11]

    bag of words

    RELATED WORK AND BACKGROUND 2.1. NATURAL LANGUAGE PROCESSING AND TEXT CLASSIFICATION The most accurate algorithms for text classification today are Support Vector Machines (SVM), Naive Bayes (NB) and k-Nearest-Neighbors (kNN), including hybrid approaches that can achieve greater precision than these methods separately [14]. SVM is one of the top performe...

  12. [12]

    method for classifying texts into categories, despite the overly simplistic approach of assuming complete independence between words in a sentence, which does not even take into account the order of words in a text. A simple numerical example to facilitate the understanding of the Naive-Bayes classifier was extracted from Manning, Raghavan and Schutze [6]...

  13. [13]

    CORPUS COLLECTION In this section we explain how we built the corpus we used in all experiments. 3.1. PARTNERSHIP WITH NOT-PROFIT ORGANIZATION TO OBTAIN DATA Even though email communication is important in many interactions between customers and companies, real-life data is of limited availability. To our knowledge, no public enterprise email corpus with...

  14. [14]

    CORPUS PROCESSING In this section we explain how we processed our email corpus to prepare the datasets used in the experiments, and the techniques applied to classify messages. 4.1. TEXT FILTERING WITH LEMMATIZER AND PARTS OF SPEECH We used different techniques to process the training corpus with the objective of assessing the impact on recall and precisi...

  15. [15]

    EVALUATION OF LEMMATIZATION AND PART-OF-SPEECH FILTERING EFFECT ON PERFORMANCE Table 9 presents the effect of the POS-Tagger filter and of the lemmatizer in precision, recall and F1 measurements with our different training and test data. Comparing both classifiers among all filters, the highest precision achieved was 87.5%, recall 87.2% and F1 87.3%, f...

  16. [16]

    Spam and Ham

    DISCUSSION 6.1. THE CHOICE OF THE CLASSIFIERS In this project, we focused on the Naive Bayes and SVM algorithms for classification. A common application of these classifiers is separating “Spam and Ham” in email inboxes [44], but they have also attained high precision and recall in classification problems with more than two categories [12]. Naive Bayes ha...

  17. [17]

    That was accomplished in association with Fundação Estudar, a non-profit organization in education that provided us with their email logs

    CONCLUSIONS We successfully built a corpus of email messages in Brazilian Portuguese. That was accomplished in association with Fundação Estudar, a non-profit organization in education that provided us with their email logs. Based on the corpus created, we produced a study of email classification. We implemented a Naive Bayes and a Support Vector Machine...

  18. [18]

    Revised Naive Bayes Classifier for Combating the Focus Attack in Spam Filtering,

    J. Peng and P. P. K. Chan, “Revised Naive Bayes Classifier for Combating the Focus Attack in Spam Filtering,” in 2013 International Conference on Machine Learning and Cybernetics, 2013, pp. 14–17

  19. [19]

    An Optimized k-NN Classifier based on Minimum Spanning Tree for Email Filtering,

    A. Chakrabarty, “An Optimized k-NN Classifier based on Minimum Spanning Tree for Email Filtering,” Bus. Inf. Manag., pp. 47–52, 2014

  20. [20]

    System And Method For Message Process And Response,

    K. D. Richardson, J. Greif, D. Buedel, and B. Aleksandrovsky, “System And Method For Message Process And Response,” US6278996 B1, 2001

  21. [21]

    M. C. Buskirk Jr, F. J. Damerau, D. H. Johnson, and M. Raaen, “Machine Learning Based Electronic Messaging System,” US6424997 B1, 2003.03

  22. [22]

    System and Method for Content-Sensitive Automatic Reply Message Generation for Text-Based Asynchronous Communications

    S. Ayyadurai, “System and Method for Content-Sensitive Automatic Reply Message Generation for Text-Based Asynchronous Communications,” US6718368 B1, 2004

  23. [23]

    C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge: Cambridge UP, pp. 154-157 and 261, 2009

  24. [24]

    Transductive Inference for Text Classification Using Support Vector Machines,

    T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” ICML, vol. 99, pp. 200–209, 1999

  25. [25]

    Speech Acts,

    W. W. Cohen, V. R. Carvalho, and T. M. Mitchell, Learning to Classify Email into “Speech Acts,” vol. 4, no. 11. 2004

  26. [26]

    Towards Designing an Email Classification System Using Multi-view Based Semi-Supervised Learning,

    W. Li, W. Meng, Z. Tan, and Y. Xiang, “Towards Designing an Email Classification System Using Multi-view Based Semi-Supervised Learning,” 2014 IEEE 13th Int. Conf. Trust. Secur. Priv. Comput. Commun., pp. 174–181, 2014

  27. [27]

    Email Classification with Co-Training,

    S. Kiritchenko and S. Matwin, “Email Classification with Co-Training,” Proc. 2001 Conf. Cent. Adv. Stud. Collab. Res., p. 8, 2001

  28. [28]

    A comparative study for email classification,

    S. Youn and D. McLeod, “A comparative study for email classification,” Adv. Innov. Syst. Comput. Sci. Softw. Eng., pp. 387–391, 2007

  29. [29]

    The Enron corpus: A new dataset for Email Classification Research,

    B. Klimt and Y. Yang, “The Enron corpus: A new dataset for Email Classification Research,” Mach. Learn. ECML 2004, pp. 217–226, 2004

  30. [30]

    Baselines and bigrams: Simple, good sentiment and topic classification,

    S. Wang and C. D. Manning, “Baselines and bigrams: Simple, good sentiment and topic classification,” ACL 12 Proc. 50th Annu. Meet. Assoc. Comput. Linguist. Short Pap. - Vol. 2, vol. 94305, no. 1, pp. 90–94, 2012

  31. [31]

    A Comparative Study on Different Types of Approaches to Text Categorization,

    P. Y. Pawar and S. H. Gawande, “A Comparative Study on Different Types of Approaches to Text Categorization,” Int. J. Mach. Learn. Comput., vol. 2, no. 4, pp. 423–426, 2012

  32. [32]

    Classificação Automática de Emails,

    M. J. A. Lima, “Classificação Automática de Emails,” Universidade do Porto, 2013

  33. [33]

    IBM’s Watson Now A Customer Service Agent

    B. Upbin, “IBM’s Watson Now A Customer Service Agent,” Forbes Magazine, p. 1, accessed May 21, 2013. <http://www.forbes.com/sites/bruceupbin/2013/05/21/ibms-watson-now-a-customer-service-agent-coming-to-smartphones-soon/>

  34. [34]

    Reinforced Multicategory Support Vector Machines,

    Y. Liu and M. Yuan, “Reinforced Multicategory Support Vector Machines,” J. Comput. Graph. Stat., vol. 20, no. 4, pp. 901–919, 2011

  35. [35]

    Joachims, Learning to Classify Text Using Support Vector Machines

    T. Joachims, Learning to Classify Text Using Support Vector Machines. ICML, vol. 99, pp.200-209, 2001

  36. [36]

    Towards an Adaptive Mail Classifier,

    E. Masciari, M. Ruffolo, and A. Tagarelli, “Towards an Adaptive Mail Classifier,” Ital. Assoc. Artif. Intell. Work. Su Apprendimento Autom. Metod. ed Appl., no. August, 2002

  37. [37]

    Classifying spam emails using text and readability features,

    R. Shams and R. E. Mercer, “Classifying spam emails using text and readability features,” Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 657–666, 2013

  38. [38]

    An empirical performance comparison of machine learning methods for spam e-mail categorization,

    C.-C. Lai and M.-C. Tsai, “An empirical performance comparison of machine learning methods for spam e-mail categorization,” Fourth Int. Conf. Hybrid Intell. Syst., pp. 0–4, 2004

  39. [39]

    An approach to spam detection by Naive Bayes ensemble based on decision induction,

    Y. Zhen, N. Xiangfei, X. Weiran, and G. Jun, “An approach to spam detection by Naive Bayes ensemble based on decision induction,” Proc. - ISDA 2006 Sixth Int. Conf. Intell. Syst. Des. Appl., vol. 2, pp. 861–866, 2006

  40. [40]

    Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,

    I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,” Proc. Work. “Machine Learn. Textual Inf. Access,” no. September 2000, pp. 1–12, 2000

  41. [41]

    Support-vector networks,

    C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995

  42. [42]

    A Practical Guide to Support Vector Classification,

    C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A Practical Guide to Support Vector Classification,” pp. 1–16, 2003

  43. [43]

    A Training Algorithm for Optimal Margin Classifiers,

    B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” Proc. 5th Annu. ACM Work. Comput. Learn. Theory, pp. 144–152, 1992

  44. [44]

    Joachims, Learning to Classify Text Using Support Vector Machines

    T. Joachims, Learning to Classify Text Using Support Vector Machines. 2001

  45. [45]

    Automatic Opinion Polarity Classification of Movie,

    F. Salvetti, S. Lewis, and C. Reichenbach, “Automatic Opinion Polarity Classification of Movie,” Color. Res. Linguist., vol. 17, no. 1, p. 2, 2004

  46. [46]

    Natural language processing in support of decision-making: Phrases and part-of-speech tagging,

    R. M. Losee, “Natural language processing in support of decision-making: Phrases and part-of-speech tagging,” Inf. Process. Manag., vol. 37, no. 6, pp. 769–787, 2001

  47. [47]

    Automatic Induction of Rules for e-mail Classification,

    E. Crawford, J. Kay, and E. McCreath, “Automatic Induction of Rules for e-mail Classification,” in Sixth Australasian Document Computing Symposium, 2001

  48. [48]

    The Case Against Accuracy Estimation for Comparing Induction Algorithms,

    F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Fifteenth Int. Conf. Mach. Learn., pp. 445–453, 1997

  49. [49]

    Enhanced email spam filtering through combining similarity graphs,

    A. Dasgupta, M. Gurevich, and K. Punera, “Enhanced email spam filtering through combining similarity graphs,” Proc. fourth ACM Int. Conf. Web search data Min. - WSDM ’11, p. 785, 2011

  50. [50]

    A Semi-Supervised Bayesian Network Model for Microblog Topic Classification,

    Y. Chen, Z. Li, L. Nie, X. Hu, X. Wang, T. Chua, and X. Zhang, “A Semi-Supervised Bayesian Network Model for Microblog Topic Classification,” Coling, vol. 1, no. December, pp. 561–576, 2012

  51. [51]

    Utilização De Redes Neurais Artificiais Para Classificação De Spam,

    A. M. Da Silva, “Utilização De Redes Neurais Artificiais Para Classificação De Spam,” Centro Federal De Educação Tecnológica De Minas Gerais, 2009

  52. [52]

    Detecção De Mensagens Não Solicitadas Utilizando Mineração De Textos,

    M. S. Moreira, “Detecção De Mensagens Não Solicitadas Utilizando Mineração De Textos,” COPPE UFRJ, 2010

  53. [53]

    Mineração de opinião em textos opinativos utilizando algoritmos de classificação,

    F. Santos, “Mineração de opinião em textos opinativos utilizando algoritmos de classificação,” Universidade de Brasilia, 2013

  54. [54]

    Twitter Sentiment Analysis: The Good the Bad and the OMG!,

    E. Kouloumpis, T. Wilson, and J. Moore, “Twitter Sentiment Analysis: The Good the Bad and the OMG!,” Proc. Fifth Int. AAAI Conf. Weblogs Soc. Media, pp. 538–541, 2011

  55. [55]

    Precise tweet classification and sentiment analysis,

    R. Batool, A. M. Khattak, J. Maqbool, and S. Lee, “Precise tweet classification and sentiment analysis,” 2013 IEEE/ACIS 12th Int. Conf. Comput. Inf. Sci. ICIS 2013 - Proc., pp. 461–466, 2013

  56. [56]

    Thumbs up?: sentiment classification using machine learning techniques,

    B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 79–86, 2002

  57. [57]

    Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese,

    C. Freitas, C. Mota, D. Santos, H. G. Oliveira, and P. Carvalho, “Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese,” Proc. Seventh Int. Conf. Lang. Resour. Eval., no. 3, pp. 3630–3637, 2010

  58. [58]

    CHAVE: topics and questions on the Portuguese participation in CLEF,

    D. Santos and P. Rocha, “CHAVE: topics and questions on the Portuguese participation in CLEF,” Cross Lang. Eval. Forum Work. Notes CLEF 2004 Work., pp. 639–648, 2004

  59. [59]

    Stemming para a língua portuguesa: estudo, análise e melhoria do algoritmo RSLP,

    A. R. Coelho, “Stemming para a língua portuguesa: estudo, análise e melhoria do algoritmo RSLP,” Universidade Federal do Rio Grande do Sul, 2007

  60. [60]

    Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging,

    E. R. Fonseca and G. Rosa, “Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging,” in Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, 2013, pp. 98–107

  61. [61]

    Spam filtering with Naive Bayes - Which naive bayes?,

    V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with Naive Bayes - Which naive bayes?,” Ceas, p. 9, 2006

  62. [62]

    The Form is the Substance: Classification of Genres in Text,

    N. Dewdney, C. VanEss-Dykema, and R. MacMillan, “The Form is the Substance: Classification of Genres in Text,” in Proceedings of the workshop on Human Language Technology and Knowledge Management, 2001, pp. 1–8