Development of email classifier in Brazilian Portuguese using feature selection for automatic response

Arthur Gola de Paula; Rogerio Bonatti

REVIEW 2 major objections 2 minor 62 references

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

Support Vector Machines with non-lemmatized verbs, nouns and adjectives classify Brazilian Portuguese business emails at 87.3% accuracy.

2026-05-25 01:20 UTC pith:IMQ376G5

Load-bearing objection: the paper presents a new Brazilian Portuguese business email corpus with baseline SVM/NB results showing that non-lemmatized POS filtering beats lemmatization at 87.3%, but gives no dataset size, class details, or split procedure.

arxiv 1907.04905 v1 pith:IMQ376G5 submitted 2019-07-08 cs.IR cs.CL cs.LG

classification cs.IR cs.CL cs.LG

keywords email classification, Brazilian Portuguese, support vector machines, feature selection, part-of-speech tagging, lemmatization, automatic response, text categorization

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an email classifier for automatic responses to business messages written in Brazilian Portuguese. It introduces a new corpus from a real application and tests Naive Bayes and Support Vector Machines under different preprocessing choices. The key finding is that SVM performs best when features are restricted to verbs, nouns and adjectives without lemmatization, reaching 87.3 percent accuracy. This matters because many companies receive large volumes of emails in Portuguese and could use such classifiers to speed up replies. The work also shows that standard lemmatization hurts performance while part-of-speech filtering helps.

Core claim

The authors present a novel corpus of Brazilian Portuguese business emails for automatic categorization. Baseline experiments compare Naive Bayes and Support Vector Machines, examining the effects of lemmatization and part-of-speech tagging. Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization led to the lowest classification results.

What carries the argument

Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives via part-of-speech tagging.
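
As a rough illustration of the filtering step named above, the sketch below keeps only tokens tagged as verbs, nouns, or adjectives and leaves their surface forms unlemmatized. The tag names (`V`, `N`, `ADJ`), the lowercasing step, the helper names, and the pre-tagged toy sentence are all assumptions made for this example; the tagger and tag set actually used in the paper are not described in this review.

```python
# Hypothetical coarse tags for the three word classes kept by the filter.
KEPT_TAGS = {"V", "N", "ADJ"}


def filter_by_pos(tagged_tokens, kept_tags=KEPT_TAGS):
    """Keep surface forms (no lemmatization) whose tag is in kept_tags."""
    # Lowercasing is an assumption of this sketch, not a documented step.
    return [word.lower() for word, tag in tagged_tokens if tag in kept_tags]


def bag_of_words(tokens):
    """Count token occurrences to form a simple bag-of-words feature dict."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    return counts


# Toy input: "Gostaria de saber o prazo final da inscrição", already tagged.
tagged = [
    ("Gostaria", "V"), ("de", "PREP"), ("saber", "V"), ("o", "ART"),
    ("prazo", "N"), ("final", "ADJ"), ("da", "PREP"), ("inscrição", "N"),
]

kept = filter_by_pos(tagged)
features = bag_of_words(kept)
print(kept)
```

The resulting feature dictionary is what a Naive Bayes or SVM classifier would consume in place of the full token stream; a lemmatized variant would map "gostaria" to its base form before counting, which is the step the paper reports as harmful.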

Load-bearing premise

The new email corpus is representative of the target business domain and the train/test split does not introduce selection bias that inflates reported accuracy.

What would settle it

Evaluating the trained model on a held-out set of emails collected from a different time period or different company in the same domain would reveal if the 87.3% accuracy generalizes or is inflated by the specific corpus characteristics.
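
A minimal sketch of the time-based hold-out described above, assuming each email record carries a date and a category label. The field names `date` and `label`, the category names, and the cutoff are hypothetical; the paper's corpus schema is not given in this review.

```python
from datetime import date


def temporal_split(emails, cutoff):
    """Train on emails dated before the cutoff, test on the rest."""
    train = [e for e in emails if e["date"] < cutoff]
    test = [e for e in emails if e["date"] >= cutoff]
    return train, test


# Toy records; the field names and category labels are hypothetical.
emails = [
    {"date": date(2015, 1, 10), "label": "deadline"},
    {"date": date(2015, 3, 2), "label": "payment"},
    {"date": date(2015, 7, 15), "label": "deadline"},
    {"date": date(2015, 9, 30), "label": "payment"},
]

train, test = temporal_split(emails, cutoff=date(2015, 6, 1))
print(len(train), "training emails,", len(test), "later held-out emails")
```

Because every test email postdates every training email, an accuracy measured this way cannot benefit from near-duplicate messages that a random split might place on both sides.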

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

New Brazilian Portuguese business email corpus with baseline SVM/NB results showing non-lemmatized POS filtering beats lemmatization at 87.3%, but no dataset size, class details, or split procedure given.

The paper's main contribution is a new corpus of real business emails in Brazilian Portuguese plus straightforward experiments comparing Naive Bayes and SVM under lemmatization and part-of-speech filtering. The clearest result is that SVM on non-lemmatized verbs, nouns, and adjectives reaches 87.3% accuracy, while lemmatization gives the lowest results in the group, at 85.3% precision for SVM and 81.7% for Naive Bayes. That is a usable data point for anyone working on Portuguese text classification, since most published numbers are for English and the lemmatization outcome runs counter to the usual assumption that it helps.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a novel corpus of Brazilian Portuguese business emails and reports baseline text classification results for automatic reply categorization using Naive Bayes and SVM. It compares the effects of lemmatization versus non-lemmatized part-of-speech filtering (verbs, nouns, adjectives), concluding that SVM on the non-lemmatized filtered features yields the highest accuracy of 87.3%.

Significance. If the accuracy figures are reproducible under a properly documented protocol, the work supplies a new domain-specific corpus and some empirical guidance on preprocessing choices for Portuguese email classification. The contribution is primarily empirical and corpus-oriented rather than methodological.

major comments (2)

[Abstract] Abstract: the reported maximum accuracies (87.3% for SVM non-lemmatized, 85.3% and 81.7% for lemmatized SVM/NB) are presented without any accompanying information on corpus size, number of classes, class distribution, train/test split procedure, or cross-validation scheme. These omissions render the central empirical claim unverifiable and prevent assessment of possible selection bias or leakage.
[Results/Evaluation] Evaluation/results section: the superiority of the non-lemmatized POS-filtered feature set is asserted on the basis of single accuracy numbers; no statistical significance tests, confidence intervals, or variance across folds/runs are reported, so the claim that this configuration is reliably best cannot be evaluated.
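
To make the second major comment concrete, the sketch below computes a Wilson score interval for a reported accuracy. The test-set sizes are hypothetical, since the review notes that the corpus size is not stated; only the 87.3% figure comes from the paper.

```python
import math


def wilson_interval(accuracy, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    centre = (accuracy + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(
        accuracy * (1 - accuracy) / n + z * z / (4 * n * n)
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical test-set sizes: the real size is not given in the review.
for n in (200, 1000, 5000):
    low, high = wilson_interval(0.873, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

Running the same function on the competing configurations would show whether their intervals overlap for a given test-set size, which is the kind of evidence the referee asks for before calling one configuration reliably best.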

minor comments (2)

The title emphasizes 'feature selection' but the described experiments center on POS filtering and lemmatization; clarify whether any additional feature-selection algorithms (e.g., information gain, chi-squared) were applied beyond the POS step.
[Abstract] The abstract states that 'straightforward lemmatization led to the lowest classification results' yet supplies only precision figures for the lemmatized case; recall or full precision-recall tables for all conditions would improve comparability.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important issues of verifiability and statistical rigor in our empirical results. We respond to each major comment below.

Referee: [Abstract] Abstract: the reported maximum accuracies (87.3% for SVM non-lemmatized, 85.3% and 81.7% for lemmatized SVM/NB) are presented without any accompanying information on corpus size, number of classes, class distribution, train/test split procedure, or cross-validation scheme. These omissions render the central empirical claim unverifiable and prevent assessment of possible selection bias or leakage.

Authors: We agree that the abstract should include these details to support verifiability of the reported accuracies. The manuscript body already describes the corpus, class structure, and evaluation protocol; we will revise the abstract to concisely incorporate this information so that the central empirical claims can be properly assessed. revision: yes

Referee: [Results/Evaluation] Evaluation/results section: the superiority of the non-lemmatized POS-filtered feature set is asserted on the basis of single accuracy numbers; no statistical significance tests, confidence intervals, or variance across folds/runs are reported, so the claim that this configuration is reliably best cannot be evaluated.

Authors: The evaluation was performed with a single fixed train/test split, and the manuscript reports only the resulting point accuracies. We acknowledge that the absence of variance estimates or significance tests limits the strength of the superiority claim. In revision we will explicitly document the split procedure within the results section and note the single-run nature as a limitation of this baseline study. Performing additional cross-validation folds and statistical tests would require new experiments outside the current scope. revision: partial
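
The variance estimate discussed in this exchange could be obtained with k-fold cross-validation, sketched below in plain Python. The function names, the majority-class placeholder scorer, and the toy labels are assumptions for illustration only; a real run would plug in the Naive Bayes or SVM pipeline and the actual corpus.

```python
import random
import statistics


def k_fold_indices(n_items, k, seed=0):
    """Shuffle item indices and deal them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train_and_score, k=5, seed=0):
    """Return mean and sample standard deviation of the k fold scores."""
    scores = []
    for held_out in k_fold_indices(len(items), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in held]
        scores.append(train_and_score(
            [items[i] for i in train_idx], [labels[i] for i in train_idx],
            [items[i] for i in held_out], [labels[i] for i in held_out],
        ))
    return statistics.mean(scores), statistics.stdev(scores)


def majority_baseline(train_x, train_y, test_x, test_y):
    """Placeholder scorer: always predict the most frequent training label."""
    majority = max(sorted(set(train_y)), key=train_y.count)
    return sum(y == majority for y in test_y) / len(test_y)


# Toy data with hypothetical labels, used only to exercise the functions.
labels = ["a"] * 12 + ["b"] * 8
items = list(range(len(labels)))
mean_acc, std_acc = cross_validate(items, labels, majority_baseline, k=5)
print(f"accuracy {mean_acc:.3f} +/- {std_acc:.3f} over 5 folds")
```

Reporting the mean together with the spread across folds is what would let a reader judge whether a two-point gap between configurations exceeds run-to-run noise.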

Circularity Check

0 steps flagged

No circularity: purely empirical classifier evaluation on held-out data

The paper conducts standard supervised classification experiments (Naive Bayes and SVM) on a novel email corpus, reporting precision/recall/accuracy after varying preprocessing choices such as lemmatization and POS filtering. The 87.3% accuracy figure is a direct empirical measurement on a test partition; no equations, derivations, or first-principles claims are present that could reduce to fitted inputs or self-citations by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. This is the expected outcome for an applied ML comparison paper whose central claims rest on external data rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical application of standard supervised classifiers; it relies on the usual i.i.d. assumption for train/test splits and on the correctness of the part-of-speech tagger used for filtering, none of which are derived inside the paper.

pith-pipeline@v0.9.0 · 5664 in / 975 out tokens · 18703 ms · 2026-05-25T01:20:16.668632+00:00 · methodology

Original abstract

Automatic email categorization is an important application of text classification. We study the automatic reply of email business messages in Brazilian Portuguese. We present a novel corpus containing messages from a real application, and baseline categorization experiments using Naive Bayes and Support Vector Machines. We then discuss the effect of lemmatization and the role of part-of-speech tagging filtering on precision and recall. Support Vector Machines classification coupled with nonlemmatized selection of verbs, nouns and adjectives was the best approach, with 87.3% maximum accuracy. Straightforward lemmatization in Portuguese led to the lowest classification results in the group, with 85.3% and 81.7% precision in SVM and Naive Bayes respectively. Thus, while lemmatization reduced precision and recall, part-of-speech filtering improved overall results.
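
The abstract mixes accuracy, precision, and recall, so a small worked example of how the combined F1 measure is obtained may help. The excerpt of the paper's evaluation section quoted in the reference graph on this page gives precision 87.5% and recall 87.2% for the best configuration; the function name below is an assumption, and the sketch simply applies the standard harmonic-mean definition of F1 to those two figures.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the paper's evaluation excerpt.
best_f1 = f1_score(0.875, 0.872)
print(f"{best_f1:.3f}")  # prints 0.873
```

The result matches the 87.3% F1 in that excerpt. This suggests, though the truncated excerpt cannot confirm it, that the abstract's "87.3% maximum accuracy" may refer to the same F1 value.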

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages

[1] Introduction

[2] Related Work And Background: 2.1. Natural Language Processing And Text Classification; 2.2. Naive Bayes Classifier; 2.3. Support Vector Machines Classifier...

[3] Corpus Collection: 3.1. Partnership With Non-Profit Organization To Obtain Data; 3.2. Email Is Structured In Tickets; 3.3. Macros As Cla...

[4] Corpus Processing: 4.1. Text Filtering With Lemmatizer And Parts Of Speech; 4.2. Naive Bayes And SVM Classifiers

[5] Evaluation Of Lemmatization And Part-Of-Speech Filtering Effect On Performance

[6] Discussion: 6.1. The Choice Of The Classifiers; 6.2. Precision And Recall Are Consistent With Literature; 6.3. Colloquial Speech Red...

[7] Conclusions

[8] References

[9] INTRODUCTION Electronic mail is a ubiquitous mode of communication in personal and work life [1], [2]. Email is quickly received, and it can be sent asynchronously at low cost. On the other hand, providing personalized and appropriate answers to questions sent by email is not an easy task, particularly as the number of messages scales up [3]. Messages a...

[10] and Naïve Bayes (NB) classifiers. The latter two will be further detailed in Section 2. State-of-the-art algorithms vary depending on the type of classification being performed, which could be binary or between multiple categories, text length and the types of features to be taken into account in the statistical method [13], [14]. In this project we examin...

[11] RELATED WORK AND BACKGROUND 2.1. NATURAL LANGUAGE PROCESSING AND TEXT CLASSIFICATION The most accurate algorithms for text classification today are Support Vector Machines (SVM), Naive Bayes (NB) and k-Nearest-Neighbors (kNN), including hybrid approaches that can achieve greater precision than these methods separately [14]. SVM is one of the top performe...

[12] method for classifying texts into categories, despite the overly simplistic approach of assuming complete independence between words in a sentence, which does not even take into account the order of words in a text. A simple numerical example to facilitate the understanding of the Naive-Bayes classifier was extracted from Manning, Raghavan and Schütze [6]...

[13] CORPUS COLLECTION In this section we explain how we built the corpus we used in all experiments. 3.1. PARTNERSHIP WITH NON-PROFIT ORGANIZATION TO OBTAIN DATA Even though email communication is important in many interactions between customers and companies, real-life data is of limited availability. To our knowledge, no public enterprise email corpus with...

[14] CORPUS PROCESSING In this section we explain how we processed our email corpus to prepare the datasets used in the experiments, and the techniques applied to classify messages. 4.1. TEXT FILTERING WITH LEMMATIZER AND PARTS OF SPEECH We used different techniques to process the training corpus with the objective of assessing the impact on recall and precisi...

[15] EVALUATION OF LEMMATIZATION AND PART-OF-SPEECH FILTERING EFFECT ON PERFORMANCE Table 9 presents the effect of the POS-Tagger filter and of the lemmatizer on precision, recall and F1 measurements with our different training and test data. Comparing both classifiers among all filters, the highest precision achieved was 87.5%, recall 87.2% and F1 87.3%, f...

[16] DISCUSSION 6.1. THE CHOICE OF THE CLASSIFIERS In this project, we focused on the Naive Bayes and SVM algorithms for classification. A common application of these classifiers is separating “Spam and Ham” in email inboxes [44], but they have also attained high precision and recall in classification problems with more than two categories [12]. Naive Bayes ha...

[17] CONCLUSIONS We successfully built a corpus of email messages in Brazilian Portuguese. That was accomplished in association with Fundação Estudar, a non-profit organization in education that provided us with their email logs. Based on the corpus created, we produced a study of email classification. We implemented a Naive Bayes and a Support Vector Machine...

[18] J. Peng and P. P. K. Chan, “Revised Naive Bayes Classifier for Combating the Focus Attack in Spam Filtering,” in 2013 International Conference on Machine Learning and Cybernetics, 2013, pp. 14–17

[19] A. Chakrabarty, “An Optimized k-NN Classifier based on Minimum Spanning Tree for Email Filtering,” Bus. Inf. Manag., pp. 47–52, 2014

[20] K. D. Richardson, J. Greif, D. Buedel, and B. Aleksandrovsky, “System And Method For Message Process And Response,” US6278996 B1, 2001

[21] M. C. Buskirk Jr, F. J. Damerau, D. H. Johnson, and M. Raaen, “Machine Learning Based Electronic Messaging System,” US6424997 B1, 2003

[22] S. Ayyadurai, “System and Method for Content-Sensitive Automatic Reply Message Generation for Text-Based Asynchronous Communications,” US6718368 B1, 2004

[23] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge: Cambridge UP, pp. 154–157 and 261, 2009

[24] T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” ICML, vol. 99, pp. 200–209, 1999

[25] W. W. Cohen, V. R. Carvalho, and T. M. Mitchell, “Learning to Classify Email into ‘Speech Acts’,” vol. 4, no. 11, 2004

[26] W. Li, W. Meng, Z. Tan, and Y. Xiang, “Towards Designing an Email Classification System Using Multi-view Based Semi-Supervised Learning,” 2014 IEEE 13th Int. Conf. Trust. Secur. Priv. Comput. Commun., pp. 174–181, 2014

[27] S. Kiritchenko and S. Matwin, “Email Classification with Co-Training,” Proc. 2001 Conf. Cent. Adv. Stud. Collab. Res., p. 8, 2001

[28] S. Youn and D. McLeod, “A comparative study for email classification,” Adv. Innov. Syst. Comput. Sci. Softw. Eng., pp. 387–391, 2007

[29] B. Klimt and Y. Yang, “The Enron corpus: A new dataset for Email Classification Research,” Mach. Learn. ECML 2004, pp. 217–226, 2004

[30] S. Wang and C. D. Manning, “Baselines and bigrams: Simple, good sentiment and topic classification,” ACL 12 Proc. 50th Annu. Meet. Assoc. Comput. Linguist. Short Pap. - Vol. 2, vol. 94305, no. 1, pp. 90–94, 2012

[31] P. Y. Pawar and S. H. Gawande, “A Comparative Study on Different Types of Approaches to Text Categorization,” Int. J. Mach. Learn. Comput., vol. 2, no. 4, pp. 423–426, 2012

[32] M. J. A. Lima, “Classificação Automática de Emails,” Universidade do Porto, 2013

[33] B. Upbin, “IBM’s Watson Now A Customer Service Agent,” Forbes Magazine, p. 1, accessed May 21, 2013 <http://www.forbes.com/sites/bruceupbin/2013/05/21/ibms-watson-now-a-customer-service-agent-coming-to-smartphones-soon/>

[34] Y. Liu and M. Yuan, “Reinforced Multicategory Support Vector Machines,” J. Comput. Graph. Stat., vol. 20, no. 4, pp. 901–919, 2011

[35] T. Joachims, Learning to Classify Text Using Support Vector Machines. ICML, vol. 99, pp. 200–209, 2001

[36] E. Masciari, M. Ruffolo, and A. Tagarelli, “Towards an Adaptive Mail Classifier,” Ital. Assoc. Artif. Intell. Work. Su Apprendimento Autom. Metod. ed Appl., no. August, 2002

[37] R. Shams and R. E. Mercer, “Classifying spam emails using text and readability features,” Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 657–666, 2013

[38] C.-C. Lai and M.-C. Tsai, “An empirical performance comparison of machine learning methods for spam e-mail categorization,” Fourth Int. Conf. Hybrid Intell. Syst., pp. 0–4, 2004

[39] Y. Zhen, N. Xiangfei, X. Weiran, and G. Jun, “An approach to spam detection by Naive Bayes ensemble based on decision induction,” Proc. - ISDA 2006 Sixth Int. Conf. Intell. Syst. Des. Appl., vol. 2, pp. 861–866, 2006

[40] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,” Proc. Work. “Machine Learn. Textual Inf. Access,” no. September 2000, pp. 1–12, 2000

[41] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995

[42] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A Practical Guide to Support Vector Classification,” pp. 1–16, 2003

[43] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” Proc. 5th Annu. ACM Work. Comput. Learn. Theory, pp. 144–152, 1992

[44] T. Joachims, Learning to Classify Text Using Support Vector Machines. 2001

[45] F. Salvetti, S. Lewis, and C. Reichenbach, “Automatic Opinion Polarity Classification of Movie,” Color. Res. Linguist., vol. 17, no. 1, p. 2, 2004

[46] R. M. Losee, “Natural language processing in support of decision-making: Phrases and part-of-speech tagging,” Inf. Process. Manag., vol. 37, no. 6, pp. 769–787, 2001

[47] E. Crawford, J. Kay, and E. McCreath, “Automatic Induction of Rules for e-mail Classification,” in Sixth Australasian Document Computing Symposium, 2001

[48] F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Fifteenth Int. Conf. Mach. Learn., pp. 445–453, 1997

[49] A. Dasgupta, M. Gurevich, and K. Punera, “Enhanced email spam filtering through combining similarity graphs,” Proc. fourth ACM Int. Conf. Web search data Min. - WSDM ’11, p. 785, 2011

[50] Y. Chen, Z. Li, L. Nie, X. Hu, X. Wang, T. Chua, and X. Zhang, “A Semi-Supervised Bayesian Network Model for Microblog Topic Classification,” Coling, vol. 1, no. December, pp. 561–576, 2012

[51] A. M. Da Silva, “Utilização De Redes Neurais Artificiais Para Classificação De Spam,” Centro Federal De Educação Tecnológica De Minas Gerais, 2009

[52] M. S. Moreira, “Detecção De Mensagens Não Solicitadas Utilizando Mineração De Textos,” COPPE UFRJ, 2010

[53] F. Santos, “Mineração de opinião em textos opinativos utilizando algoritmos de classificação,” Universidade de Brasilia, 2013

[54] E. Kouloumpis, T. Wilson, and J. Moore, “Twitter Sentiment Analysis: The Good the Bad and the OMG!,” Proc. Fifth Int. AAAI Conf. Weblogs Soc. Media, pp. 538–541, 2011

[55] R. Batool, A. M. Khattak, J. Maqbool, and S. Lee, “Precise tweet classification and sentiment analysis,” 2013 IEEE/ACIS 12th Int. Conf. Comput. Inf. Sci. ICIS 2013 - Proc., pp. 461–466, 2013

[56] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 79–86, 2002

[57] C. Freitas, C. Mota, D. Santos, H. G. Oliveira, and P. Carvalho, “Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese,” Proc. Seventh Int. Conf. Lang. Resour. Eval., no. 3, pp. 3630–3637, 2010

[58] D. Santos and P. Rocha, “CHAVE: topics and questions on the Portuguese participation in CLEF,” Cross Lang. Eval. Forum Work. Notes CLEF 2004 Work., pp. 639–648, 2004

[59] A. R. Coelho, “Stemming para a língua portuguesa: estudo, análise e melhoria do algoritmo RSLP,” Universidade Federal do Rio Grande do Sul, 2007

[60] E. R. Fonseca and G. Rosa, “Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging,” in Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, 2013, pp. 98–107

[61] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with Naive Bayes - Which naive bayes?,” Ceas, p. 9, 2006

[62] N. Dewdney, C. VanEss-Dykema, and R. MacMillan, “The Form is the Substance: Classification of Genres in Text,” in Proceedings of the workshop on Human Language Technology and Knowledge Management, 2001, pp. 1–8

[1] [1]

Introduction ............................................................................................................. 12

work page

[2] [2]

Related Work And Background............................................................................... 15 2.1. Natural Language Processing And Text Classification ................................. 15 2.2. Naive Bayes Classifier .................................................................................. 16 2.3. Support Vector Machines Classifier...

work page

[3] [3]

Corpus Collection ................................................................................................... 25 3.1. Partnership With Not-Profit Organization To Obtain Data ................................... 25 3.2. Email Is Structured In Tickets .............................................................................. 26 3.3. Macros As Cla...

work page

[4] [4]

Corpus Processing ................................................................................................. 32 4.1. Text Filtering With Lemmatizer And Parts Of Speech ................................... 32 4.2. Naive Bayes And Svm Classifiers ................................................................. 33

work page

[5] [5]

Evaluation Of Lemmatization And Part -Of-Speech Filtering Effect On Performance 34

work page

[6] [6]

Discussion .............................................................................................................. 35 6.1. The Choice Of The Classifiers ...................................................................... 35 6.2. Precision And Recall Are Consistent With Literature .................................... 35 6.3. Colloquial Speech Red...

work page

[7] [7]

Conclusions ............................................................................................................ 39

work page

[8] [8]

References ............................................................................................................. 40 12

work page

[9] [9]

Email is quickly received, and it can be sent asynchronousl y at low cost

INTRODUCTION Electronic mail is an ubiquitous mode of communication in personal and work life [1], [2]. Email is quickly received, and it can be sent asynchronousl y at low cost. On the other hand, providing personalized and appropriate answers to questions sent by email is not an easy task, particularly as the number of messages scales up [3]. Messages a...

work page

[10] [10]

The latter two will be further detailed in Section 2

and Naïve Bayes (NB) classifiers. The latter two will be further detailed in Section 2. State-of-the-art algorithms vary depending on the type of classification being performed, that could be binary or between multiple categories, text length and the types of features to be taken into account in the statistical method [13], [14]. In this project we examin...

work page

[11] [11]

bag of words

RELATED WORK AND BACKGROUND 2.1. NATURAL LANGUAGE PROCESSING AND TEXT CLASSIFICATION The most accurate algorithms for text classification today are Support Vector Machines (SVM), Naive Bayes (NB) and k -Nearest-Neighbors (kNN), including hybrid approaches that can achieve greater precision than these methods separately [14]. SVM is one of the top performe...

work page

[12] [12]

method for classifying texts into categories, despite the overly simplistic approach of assuming complete independence between words in a sentence, what does not even take into account the order of words in a text. A simple numerical example to facilitate the understanding of the Naive -Bayes classifier was extracted from Manning, Raghavan and Schutze [6]...

work page 2001

[13] [13]

CORPUS COLLECTION In this section we explain how we built the corpus we used in all experiments. 3.1. PARTNERSHIP WITH NOT-PROFIT ORGANIZATION TO OBTAIN DATA Even though email communication is important in many interactions between customers and companies, real -life data is of limited availability. To our knowledge, no public enterprise email corpus with...

work page 2081

[14] CORPUS PROCESSING In this section we explain how we processed our email corpus to prepare the datasets used in the experiments, and the techniques applied to classify messages. 4.1. TEXT FILTERING WITH LEMMATIZER AND PARTS OF SPEECH We used different techniques to process the training corpus with the objective of assessing the impact on recall and precisi...
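As a rough illustration of part-of-speech filtering without lemmatization, the sketch below keeps only verbs, nouns and adjectives from a message that has already been tagged. The tag names (`V`, `N`, `ADJ`, `PREP`, `ART`), the hand-tagged example sentence, the lowercasing step and the helper name `filter_by_pos` are assumptions made for this example; the tagger and tag set actually used in the paper are not shown in this excerpt.

```python
# Tags to keep: verbs, nouns and adjectives (tag names are assumed here).
KEPT_TAGS = {"V", "N", "ADJ"}


def filter_by_pos(tagged_tokens, kept_tags=KEPT_TAGS):
    """Keep the surface form of tokens whose tag is in kept_tags.

    Words are lowercased but not lemmatized, so inflected forms such as
    'gostaria' stay distinct from their dictionary form.
    """
    return [word.lower() for word, tag in tagged_tokens if tag in kept_tags]


# Hypothetical hand-tagged message: "Gostaria de saber o prazo final da bolsa"
TAGGED = [
    ("Gostaria", "V"),
    ("de", "PREP"),
    ("saber", "V"),
    ("o", "ART"),
    ("prazo", "N"),
    ("final", "ADJ"),
    ("da", "PREP"),
    ("bolsa", "N"),
]
FEATURES = filter_by_pos(TAGGED)
```

The resulting list can be fed to a bag-of-words vectorizer; function words such as prepositions and articles never reach the classifier.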

[15] EVALUATION OF LEMMATIZATION AND PART-OF-SPEECH FILTERING EFFECT ON PERFORMANCE Table 9 presents the effect of the POS-Tagger filter and of the lemmatizer on precision, recall and F1 measurements with our different training and test data. Comparing both classifiers among all filters, the highest precision achieved was 87.5%, recall 87.2% and F1 87.3%, f...
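For readers unfamiliar with the three measures, the following Python sketch computes per-class precision, recall and F1 and then averages them across classes. The macro-averaging choice, the toy labels and the function names `per_class_scores` and `macro_average` are assumptions for illustration; the excerpt does not state how the paper aggregated its scores.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label seen."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of (precision, recall, f1) over all classes."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Hypothetical gold labels and predictions for four messages.
Y_TRUE = ["a", "a", "b", "b"]
Y_PRED = ["a", "b", "b", "b"]
SCORES = per_class_scores(Y_TRUE, Y_PRED)
MACRO = macro_average(SCORES)
```

In this toy case class `a` has perfect precision but only half its messages are recovered, while class `b` has perfect recall and lower precision, which shows why the three figures are reported together.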

[16] DISCUSSION 6.1. THE CHOICE OF THE CLASSIFIERS In this project, we focused on the Naive Bayes and SVM algorithms for classification. A common application of these classifiers is separating “Spam and Ham” in email inboxes [44], but they have also attained high precision and recall in classification problems with more than two categories [12]. Naive Bayes ha...

[17] CONCLUSIONS We successfully built a corpus of email messages in Brazilian Portuguese. That was accomplished in association with Fundação Estudar, a non-profit organization in education that provided us with their email logs. Based on the corpus created, we produced a study of email classification. We implemented a Naive Bayes and a Support Vector Machine...

[18] J. Peng and P. P. K. Chan, “Revised Naive Bayes Classifier for Combating the Focus Attack in Spam Filtering,” in 2013 International Conference on Machine Learning and Cybernetics, 2013, pp. 14–17.

[19] A. Chakrabarty, “An Optimized k-NN Classifier based on Minimum Spanning Tree for Email Filtering,” Bus. Inf. Manag., pp. 47–52, 2014.

[20] K. D. Richardson, J. Greif, D. Buedel, and B. Aleksandrovsky, “System And Method For Message Process And Response,” US6278996 B1, 2001.

[21] M. C. Buskirk Jr, F. J. Damerau, D. H. Johnson, and M. Raaen, “Machine Learning Based Electronic Messaging System,” US6424997 B1, 2003.

[22] S. Ayyadurai, “System and Method for Content-Sensitive Automatic Reply Message Generation for Text-Based Asynchronous Communications,” US6718368 B1, 2004.

[23] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge: Cambridge UP, pp. 154–157 and 261, 2009.

[24] T. Joachims, “Transductive Inference for Text Classification Using Support Vector Machines,” ICML, vol. 99, pp. 200–209, 1999.

[25] W. W. Cohen, V. R. Carvalho, and T. M. Mitchell, “Learning to Classify Email into ‘Speech Acts’,” vol. 4, no. 11, 2004.

[26] W. Li, W. Meng, Z. Tan, and Y. Xiang, “Towards Designing an Email Classification System Using Multi-view Based Semi-Supervised Learning,” 2014 IEEE 13th Int. Conf. Trust. Secur. Priv. Comput. Commun., pp. 174–181, 2014.

[27] S. Kiritchenko and S. Matwin, “Email Classification with Co-Training,” Proc. 2001 Conf. Cent. Adv. Stud. Collab. Res., p. 8, 2001.

[28] S. Youn and D. McLeod, “A comparative study for email classification,” Adv. Innov. Syst. Comput. Sci. Softw. Eng., pp. 387–391, 2007.

[29] B. Klimt and Y. Yang, “The Enron corpus: A new dataset for Email Classification Research,” Mach. Learn. ECML 2004, pp. 217–226, 2004.

[30] S. Wang and C. D. Manning, “Baselines and bigrams: Simple, good sentiment and topic classification,” ACL ’12 Proc. 50th Annu. Meet. Assoc. Comput. Linguist. Short Pap. - Vol. 2, vol. 94305, no. 1, pp. 90–94, 2012.

[31] P. Y. Pawar and S. H. Gawande, “A Comparative Study on Different Types of Approaches to Text Categorization,” Int. J. Mach. Learn. Comput., vol. 2, no. 4, pp. 423–426, 2012.

[32] M. J. A. Lima, “Classificação Automática de Emails,” Universidade do Porto, 2013.

[33] B. Upbin, “IBM’s Watson Now A Customer Service Agent,” Forbes Magazine, p. 1, accessed May 21, 2013. <http://www.forbes.com/sites/bruceupbin/2013/05/21/ibms-watson-now-a-customer-service-agent-coming-to-smartphones-soon/>

[34] Y. Liu and M. Yuan, “Reinforced Multicategory Support Vector Machines,” J. Comput. Graph. Stat., vol. 20, no. 4, pp. 901–919, 2011.

[35] T. Joachims, Learning to Classify Text Using Support Vector Machines. ICML, vol. 99, pp. 200–209, 2001.

[36] E. Masciari, M. Ruffolo, and A. Tagarelli, “Towards an Adaptive Mail Classifier,” Ital. Assoc. Artif. Intell. Work. Su Apprendimento Autom. Metod. ed Appl., no. August, 2002.

[37] R. Shams and R. E. Mercer, “Classifying spam emails using text and readability features,” Proc. - IEEE Int. Conf. Data Mining, ICDM, pp. 657–666, 2013.

[38] C.-C. Lai and M.-C. Tsai, “An empirical performance comparison of machine learning methods for spam e-mail categorization,” Fourth Int. Conf. Hybrid Intell. Syst., pp. 0–4, 2004.

[39] Y. Zhen, N. Xiangfei, X. Weiran, and G. Jun, “An approach to spam detection by Naive Bayes ensemble based on decision induction,” Proc. - ISDA 2006 Sixth Int. Conf. Intell. Syst. Des. Appl., vol. 2, pp. 861–866, 2006.

[40] I. Androutsopoulos, G. Paliouras, V. Karkaletsis, G. Sakkis, C. D. Spyropoulos, and P. Stamatopoulos, “Learning to Filter Spam E-Mail: A Comparison of a Naive Bayesian and a Memory-Based Approach,” Proc. Work. “Machine Learn. Textual Inf. Access,” no. September 2000, pp. 1–12, 2000.

[41] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.

[42] C.-W. Hsu, C.-C. Chang, and C.-J. Lin, “A Practical Guide to Support Vector Classification,” pp. 1–16, 2003.

[43] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” Proc. 5th Annu. ACM Work. Comput. Learn. Theory, pp. 144–152, 1992.

[44] T. Joachims, Learning to Classify Text Using Support Vector Machines. 2001.

[45] F. Salvetti, S. Lewis, and C. Reichenbach, “Automatic Opinion Polarity Classification of Movie,” Color. Res. Linguist., vol. 17, no. 1, p. 2, 2004.

[46] R. M. Losee, “Natural language processing in support of decision-making: Phrases and part-of-speech tagging,” Inf. Process. Manag., vol. 37, no. 6, pp. 769–787, 2001.

[47] E. Crawford, J. Kay, and E. McCreath, “Automatic Induction of Rules for e-mail Classification,” in Sixth Australasian Document Computing Symposium, 2001.

[48] F. Provost, T. Fawcett, and R. Kohavi, “The Case Against Accuracy Estimation for Comparing Induction Algorithms,” Proc. Fifteenth Int. Conf. Mach. Learn., pp. 445–453, 1997.

[49] A. Dasgupta, M. Gurevich, and K. Punera, “Enhanced email spam filtering through combining similarity graphs,” Proc. fourth ACM Int. Conf. Web search data Min. - WSDM ’11, p. 785, 2011.

[50] Y. Chen, Z. Li, L. Nie, X. Hu, X. Wang, T. Chua, and X. Zhang, “A Semi-Supervised Bayesian Network Model for Microblog Topic Classification,” Coling, vol. 1, no. December, pp. 561–576, 2012.

[51] A. M. Da Silva, “Utilização De Redes Neurais Artificiais Para Classificação De Spam,” Centro Federal De Educação Tecnológica De Minas Gerais, 2009.

[52] M. S. Moreira, “Detecção De Mensagens Não Solicitadas Utilizando Mineração De Textos,” COPPE UFRJ, 2010.

[53] F. Santos, “Mineração de opinião em textos opinativos utilizando algoritmos de classificação,” Universidade de Brasilia, 2013.

[54] E. Kouloumpis, T. Wilson, and J. Moore, “Twitter Sentiment Analysis: The Good the Bad and the OMG!,” Proc. Fifth Int. AAAI Conf. Weblogs Soc. Media, pp. 538–541, 2011.

[55] R. Batool, A. M. Khattak, J. Maqbool, and S. Lee, “Precise tweet classification and sentiment analysis,” 2013 IEEE/ACIS 12th Int. Conf. Comput. Inf. Sci. ICIS 2013 - Proc., pp. 461–466, 2013.

[56] B. Pang, L. Lee, and S. Vaithyanathan, “Thumbs up?: sentiment classification using machine learning techniques,” Proc. Conf. Empir. Methods Nat. Lang. Process., pp. 79–86, 2002.

[57] C. Freitas, C. Mota, D. Santos, H. G. Oliveira, and P. Carvalho, “Second HAREM: Advancing the State of the Art of Named Entity Recognition in Portuguese,” Proc. Seventh Int. Conf. Lang. Resour. Eval., no. 3, pp. 3630–3637, 2010.

[58] D. Santos and P. Rocha, “CHAVE: topics and questions on the Portuguese participation in CLEF,” Cross Lang. Eval. Forum Work. Notes CLEF 2004 Work., pp. 639–648, 2004.

[59] A. R. Coelho, “Stemming para a língua portuguesa: estudo, análise e melhoria do algoritmo RSLP,” Universidade Federal do Rio Grande do Sul, 2007.

[60] E. R. Fonseca and G. Rosa, “Mac-Morpho Revisited: Towards Robust Part-of-Speech Tagging,” in Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, 2013, pp. 98–107.

[61] V. Metsis, I. Androutsopoulos, and G. Paliouras, “Spam filtering with Naive Bayes - Which Naive Bayes?,” CEAS, p. 9, 2006.

[62] N. Dewdney, C. VanEss-Dykema, and R. MacMillan, “The Form is the Substance: Classification of Genres in Text,” in Proceedings of the workshop on Human Language Technology and Knowledge Management, 2001, pp. 1–8.