A New Statistical Approach for Comparing Algorithms for Lexicon Based Sentiment Analysis

Evandro Ruiz; Kuruvilla Joseph Abraham; Mateus Machado

arxiv: 1906.08717 · v1 · pith:QXM74WOGnew · submitted 2019-06-20 · 💻 cs.CL

Pith reviewed 2026-05-25 19:37 UTC · model grok-4.3

classification 💻 cs.CL

keywords lexicon-based sentiment analysis · statistical comparison · marginal homogeneity tests · log linear models · algorithm comparison without labels · Portuguese sentiment analysis · maximum likelihood estimation

The pith

Statistical methods using marginal homogeneity tests and log linear models enable direct comparison of lexicon-based sentiment algorithms without human annotations or known labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops methods to statistically compare different lexicon-based sentiment analysis algorithms directly with each other. This avoids the need for human-annotated texts, which are scarce in languages like Portuguese. It uses marginal homogeneity tests and log linear models under maximum likelihood estimation to assess agreement and variability between algorithms. This matters because it provides a way to rank or establish equivalence between algorithms when gold standard labels are unavailable. The approach also notes similarities between uncertainties in lexicon outputs and human annotations.

Core claim

The central claim is that marginal homogeneity tests and log linear models within a maximum likelihood framework can compare the raw outputs of lexicon-based sentiment algorithms, producing rankings or equivalence statements without reference to human judgments or known class labels. The paper demonstrates that output variability is lexicon-dependent and can be quantified in the log linear model framework, while also showing that uncertainties in these algorithms resemble those in human-annotated tweets.

What carries the argument

Marginal homogeneity tests and log linear models for direct comparison of algorithm outputs.
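As a rough illustration of what a marginal homogeneity test computes, the sketch below implements the Stuart-Maxwell statistic for a square table that cross-classifies the labels of two algorithms on the same texts. This is an assumption about which test is meant, not the paper's confirmed procedure. The function name `stuart_maxwell` and the toy counts are hypothetical, and the closed-form p-value is only given for three classes (two degrees of freedom).

```python
import math
import numpy as np

def stuart_maxwell(table):
    """Stuart-Maxwell test of marginal homogeneity for a K x K table.

    table[i, j] counts texts labelled i by algorithm A and j by algorithm B.
    Returns (statistic, degrees of freedom, p-value or None).
    """
    n = np.asarray(table, dtype=float)
    k = n.shape[0]
    row, col = n.sum(axis=1), n.sum(axis=0)
    # Differences between row and column margins; drop the last (redundant) one.
    d = (row - col)[:-1]
    # Covariance matrix of the margin differences under the null hypothesis.
    s = -(n + n.T)
    np.fill_diagonal(s, row + col - 2.0 * np.diag(n))
    s = s[:-1, :-1]
    stat = float(d @ np.linalg.solve(s, d))
    df = k - 1
    # Chi-square survival function has the closed form exp(-x/2) when df == 2.
    p_value = math.exp(-stat / 2.0) if df == 2 else None
    return stat, df, p_value

# Hypothetical counts (rows: algorithm A, columns: algorithm B),
# classes ordered negative, neutral, positive.
toy_table = [[20, 10, 5],
             [3, 30, 15],
             [0, 5, 40]]
stat, df, p_value = stuart_maxwell(toy_table)
print(stat, df, p_value)
```

A small p-value only says that the two algorithms distribute texts across the classes differently; it does not say which of them is closer to the true sentiment.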

If this is right

Algorithms can be ranked or declared equivalent based on statistical agreement alone.
Lexicon-dependent variability in outputs can be quantified and compared.
The approach works in settings where human annotation is scarce or absent.
Uncertainties in lexicon-based methods can be analyzed similarly to those in annotated data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might apply to comparing other unsupervised text classifiers without ground truth.
In languages with some annotations available, the statistical rankings could be cross-checked for consistency.
This opens comparisons across different domains where labeled data is limited.

Load-bearing premise

The raw outputs of different lexicon-based algorithms can be directly compared via marginal homogeneity tests and log linear models in a way that produces meaningful rankings or equivalence statements, without any external validation against human judgments.

What would settle it

An experiment showing that statistical rankings from these tests disagree with human-annotated rankings in a language where annotated data is available.

Original abstract

Lexicon based sentiment analysis usually relies on the identification of various words to which a numerical value corresponding to sentiment can be assigned. In principle, classifiers can be obtained from these algorithms by comparison with human annotation, which is considered the gold standard. In practice this is difficult in languages such as Portuguese where there is a paucity of human annotated texts. Thus in order to compare algorithms, a next best step is to directly compare different algorithms with each other without referring to human annotation. In this paper we develop methods for a statistical comparison of algorithms which does not rely on human annotation or on known class labels. We will motivate the use of marginal homogeneity tests, as well as log linear models within the framework of maximum likelihood estimation. We will also show how some uncertainties present in lexicon based sentiment analysis may be similar to those which occur in human annotated tweets. We will also show how the variability in the output of different algorithms is lexicon dependent, and quantify this variability in the output within the framework of log linear models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper proposes marginal homogeneity tests and log-linear models to compare lexicon-based sentiment algorithms without labels, but these only flag output differences and do not show the differences track actual sentiment quality.

The main point is that the authors want to compare different lexicon-based sentiment algorithms using marginal homogeneity tests and log-linear models fitted by maximum likelihood, all without human annotations or known labels. This targets the practical issue in languages like Portuguese where annotated data is scarce. The framing is new in this specific setting even though the tests themselves are established tools from categorical data analysis. The paper also notes that some uncertainties in lexicon outputs resemble those in human annotations, which is a reasonable observation to make. It further quantifies variability across algorithms as lexicon-dependent within the log-linear framework. These elements show honest engagement with the constraints of low-resource settings and a clear attempt to adapt existing statistical methods rather than invent new ones. The citation pattern in the abstract stays within standard references for sentiment analysis and categorical modeling, with no obvious self-promotion or circular fitting.

The central soft spot is the missing link between statistical discrepancy and algorithm quality. The tests can detect whether two algorithms produce different marginal distributions of positive, negative, or neutral labels, but nothing in the description derives or demonstrates that a significant difference means one lexicon is closer to true sentiment polarity. No small validation set, no auxiliary check against even limited human judgments, and no simulation results are referenced to close that gap. The claim that this yields a valid comparison therefore rests on an unstated assumption that distributional match or mismatch substitutes for performance.

This is a methodological proposal rather than a completed empirical study. It is aimed at researchers who handle sentiment analysis in annotation-poor languages and who might be looking for label-free evaluation ideas. A reader already familiar with log-linear models will see routine extensions, while someone focused on practical workarounds could pick up the specific combination. The work shows clear thinking on its own terms and does not contain internal contradictions or invented entities. It deserves a serious referee to check whether the full manuscript supplies the missing validation steps or concrete examples that would make the method usable in practice. I would send it for peer review.
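The gap between distributional agreement and quality can be made concrete with a toy construction. The labels below are entirely hypothetical (they are not data from the paper): two label sequences share exactly the same marginal distribution over negative, neutral, and positive, so any marginal homogeneity test would find no difference, yet their accuracy against a known truth differs sharply.

```python
import numpy as np

# Hypothetical ground truth: 30 negative (0), 40 neutral (1), 30 positive (2).
truth = np.repeat([0, 1, 2], [30, 40, 30])

# Algorithm A reproduces the truth; algorithm B emits the same labels
# shifted by 15 positions, which keeps the label counts but breaks alignment.
algo_a = truth.copy()
algo_b = np.roll(truth, 15)

marginal_a = np.bincount(algo_a, minlength=3)
marginal_b = np.bincount(algo_b, minlength=3)

accuracy_a = float(np.mean(algo_a == truth))
accuracy_b = float(np.mean(algo_b == truth))

print("marginals A:", marginal_a, "marginals B:", marginal_b)
print("accuracy A:", accuracy_a, "accuracy B:", accuracy_b)
```

The construction is deliberately artificial. It shows only that equal marginals are compatible with unequal accuracy, which is the unstated assumption the letter points to.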

Referee Report

3 major / 1 minor

Summary. The paper proposes statistical methods for comparing lexicon-based sentiment analysis algorithms without human annotations or known class labels. It motivates the application of marginal homogeneity tests and log-linear models (under maximum likelihood estimation) to the categorical outputs of different algorithms, claims that uncertainties in lexicon-based outputs resemble those in human annotations, and states that output variability is lexicon-dependent and can be quantified via log-linear models.

Significance. If validated, the approach could enable algorithm comparison in low-resource settings such as Portuguese where annotated corpora are scarce. The statistical procedures themselves are standard and well-defined for testing distributional differences, but the manuscript provides no derivation, auxiliary result, or empirical demonstration that such differences correspond to differences in sentiment-analysis quality.

major comments (3)

[Abstract] Abstract and introduction: the central claim that marginal homogeneity tests and log-linear models applied to raw algorithm outputs yield a 'valid statistical comparison' of the algorithms (i.e., one that can substitute for human-annotation-based evaluation) is unsupported; no derivation or auxiliary result establishes that a statistically significant difference in marginal distributions implies one lexicon is closer to true sentiment polarity.
[Abstract] Abstract: the statement that 'some uncertainties present in lexicon based sentiment analysis may be similar to those which occur in human annotated tweets' is noted but does not close the gap, as it only observes shared noise sources without showing that the proposed tests recover quality rankings.
[Abstract] Abstract: the manuscript supplies no validation data, fitted models, p-values, or cross-algorithm rankings demonstrating that the methods produce meaningful equivalence or superiority statements; the central claim therefore remains untested within the provided text.

minor comments (1)

[Abstract] The abstract uses future tense ('We will motivate', 'We will also show') which is atypical for a completed manuscript; revise to present tense once results are included.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We respond to each major comment below, clarifying the scope of our methodological contribution.

Referee: [Abstract] Abstract and introduction: the central claim that marginal homogeneity tests and log-linear models applied to raw algorithm outputs yield a 'valid statistical comparison' of the algorithms (i.e., one that can substitute for human-annotation-based evaluation) is unsupported; no derivation or auxiliary result establishes that a statistically significant difference in marginal distributions implies one lexicon is closer to true sentiment polarity.

Authors: The manuscript introduces these statistical methods as tools for comparing the output distributions of different lexicon-based algorithms without requiring human annotations. We draw an analogy to their use in comparing human raters but do not provide a derivation showing that distributional differences correspond to differences in accuracy relative to true sentiment. We will revise the abstract and introduction to emphasize that the methods enable statistical comparison of outputs rather than claiming they substitute for quality evaluation based on ground truth.

Revision: yes

Referee: [Abstract] Abstract: the statement that 'some uncertainties present in lexicon based sentiment analysis may be similar to those which occur in human annotated tweets' is noted but does not close the gap, as it only observes shared noise sources without showing that the proposed tests recover quality rankings.

Authors: This statement serves to justify the applicability of marginal homogeneity tests and log-linear models by highlighting similarities in uncertainty sources. We agree that it does not demonstrate recovery of quality rankings, which would indeed require ground truth labels not available in the target low-resource settings. The paper does not claim to show such recovery.

Revision: no

Referee: [Abstract] Abstract: the manuscript supplies no validation data, fitted models, p-values, or cross-algorithm rankings demonstrating that the methods produce meaningful equivalence or superiority statements; the central claim therefore remains untested within the provided text.

Authors: The focus of the manuscript is on developing and motivating the statistical framework. It does not include specific empirical validations or numerical results from applications, as the goal is to present the methods themselves. We acknowledge that without such demonstrations, the practical utility remains to be shown in follow-up work.

Revision: no

Circularity Check

0 steps flagged

No circularity: standard external statistical tests applied to outputs

The paper proposes applying marginal homogeneity tests and log-linear models (under MLE) directly to the categorical outputs of lexicon-based algorithms as a way to compare them without human labels. These are standard, externally defined statistical procedures whose validity does not depend on any derivation, fit, or self-citation internal to the paper. No equations, parameters, or predictions in the described method reduce to the inputs by construction, nor is any load-bearing premise justified solely by prior work from the same authors. The claim that distributional comparisons can substitute for quality evaluation is a substantive (and debatable) assumption about what the tests measure, but it is not a circular reduction. This is the most common honest non-finding for papers that import off-the-shelf statistical machinery.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that algorithm outputs behave like categorical data suitable for homogeneity testing.

axioms (1)

Domain assumption: Outputs of lexicon-based algorithms can be treated as categorical variables amenable to marginal homogeneity testing without reference to external labels.
Invoked when proposing direct comparison of algorithms.
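To show what this assumption means in practice, here is a minimal sketch of turning lexicon scores into categorical labels and cross-classifying two lexicons on the same texts. The lexicons `toy_a` and `toy_b`, the sign-based thresholds, and the helper names are all hypothetical choices for illustration; the paper's own scoring rules are not given in the abstract.

```python
import numpy as np

LABELS = ("negative", "neutral", "positive")

def classify(text, lexicon):
    """Sum the lexicon values of the tokens and map the sign to a class index."""
    score = sum(lexicon.get(token, 0) for token in text.lower().split())
    if score < 0:
        return 0  # negative
    if score > 0:
        return 2  # positive
    return 1      # neutral (no sentiment words, or scores cancel out)

def cross_table(texts, lexicon_a, lexicon_b):
    """Build the 3 x 3 table: rows are labels from A, columns labels from B."""
    table = np.zeros((3, 3), dtype=int)
    for text in texts:
        table[classify(text, lexicon_a), classify(text, lexicon_b)] += 1
    return table

# Hypothetical Portuguese toy lexicons; lexicon B lacks the word "ruim".
toy_a = {"bom": 1, "ótimo": 2, "ruim": -1}
toy_b = {"bom": 1, "ótimo": 1}
texts = ["filme bom", "filme ruim", "filme ótimo", "um filme"]

table = cross_table(texts, toy_a, toy_b)
print(table)
```

Once the outputs are in this square-table form, no external labels are needed to apply tests for categorical data; whether those tests say anything about quality is a separate question.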

pith-pipeline@v0.9.0 · 5704 in / 1131 out tokens · 21505 ms · 2026-05-25T19:37:23.885359+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

We will motivate the use of marginal homogeneity tests, as well as log linear models within the framework of maximum likelihood estimation
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

the quasi independence model ... $\log n_{ij} = \lambda + \lambda^{1}_{i} + \lambda^{2}_{j} + \delta_{i}\, I(i = j)$
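As an illustration of the quoted model, the sketch below fits a quasi-independence log-linear model to a square table by Poisson maximum likelihood, using plain Newton iterations in NumPy. It assumes all cell counts are positive and uses a hypothetical toy table; the function name `fit_quasi_independence` and the corner-point coding (first row and column effects set to zero) are illustrative choices, not the paper's implementation.

```python
import numpy as np

def fit_quasi_independence(table, n_iter=50, tol=1e-10):
    """Fit log m_ij = lambda + lambda1_i + lambda2_j + delta_i * I(i == j).

    Returns (fitted counts, coefficient vector, likelihood-ratio statistic G2).
    The last K coefficients are the diagonal agreement terms delta_i.
    """
    n = np.asarray(table, dtype=float)
    k = n.shape[0]
    rows, cols = np.indices((k, k))
    rows, cols, y = rows.ravel(), cols.ravel(), n.ravel()

    # Design matrix: intercept, row effects, column effects, diagonal terms.
    columns = [np.ones(k * k)]
    columns += [(rows == i).astype(float) for i in range(1, k)]
    columns += [(cols == j).astype(float) for j in range(1, k)]
    columns += [((rows == i) & (cols == i)).astype(float) for i in range(k)]
    X = np.column_stack(columns)

    # Start from a least-squares fit on the log counts, then apply Newton steps.
    beta = np.linalg.lstsq(X, np.log(np.maximum(y, 0.5)), rcond=None)[0]
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        step = np.linalg.solve(X.T @ (mu[:, None] * X), X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break

    mu = np.exp(X @ beta)
    positive = y > 0
    g2 = 2.0 * float(np.sum(y[positive] * np.log(y[positive] / mu[positive])))
    return mu.reshape(k, k), beta, g2

# Hypothetical agreement table for two algorithms over three sentiment classes.
toy_table = np.array([[40, 6, 4],
                      [5, 30, 9],
                      [8, 3, 25]])
fitted, beta, g2 = fit_quasi_independence(toy_table)
print(np.round(fitted, 2))
print("delta terms:", np.round(beta[-3:], 3), "G2:", round(g2, 3))
```

For a 3 x 3 table this model leaves one residual degree of freedom. The diagonal cells are reproduced exactly, so G2 measures how far the off-diagonal disagreements depart from independence.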

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1] A. Agresti. Modelling patterns of agreement and disagreement. Statistical Methods in Medical Research, 1:201–218, 1992.

[2] A. Agresti. Categorical Data Analysis. Wiley, 3rd edition, 2013.

[3] Pedro P. Balage Filho, Thiago Alexandre Salgueiro Pardo, and Sandra Maria Aluisio. An evaluation of the Brazilian Portuguese LIWC dictionary for sentiment analysis. In Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology, pages 215–219, 2013.

[4] J.R. Bergan. Measuring observer agreement using the quasi-independence concept. Journal of Educational Measurement, 17:59–69, 1980.

[5] Y. M. Bishop, S. E. Fienberg, and P. W. Holland. Discrete Multivariate Analysis: Theory and Practice. Springer Science & Business Media, 2007.

[6] B. Bostanci and E. Bostanci. An evaluation of classification algorithms using McNemar's test. In J.C. Bansal et al., editors, Proceedings of Seventh International Conference on Bio-Inspired Computing: Theories and Applications (BIC-TA 2012), pages 15–26. Springer India, Hyderabad, India, 2013.

[7] I. Brown and C. Mues. An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39:3446–3453, 2012.

[8] J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:213–220, 1968.

[9] J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006.

[10] C. Freitas, E. Motta, R.L. Milidiú, et al. Que brilha... rá! Desafios na anotação de opinião em um corpus de resenhas de livros. In XI Encontro de Linguística de Corpus (ELC 2012). São Paulo, Brazil, 2012.

[11] M. Friedman. A comparison of alternative tests of significance for the problem of m rankings. Annals of Mathematical Statistics, 11:86–92, 1940.

[12] Matthias Gamer, Jim Lemon, and Ian Fellows Puspendra Singh. irr: Various Coefficients of Interrater Reliability and Agreement, 2019. R package version 0.84.1.

[13] Olga Kolchyna, Thársis T. P. Souza, Philip Treleaven, and Tomaso Aste. Twitter sentiment analysis: Lexicon method, machine learning method and their combination. In Gautam Mitra and Xiang Yu, editors, Handbook of Sentiment Analysis in Finance, chapter 5. 2016.

[14] M. T. Machado, T.A.S. Pardo, and E.E.S. Ruiz. Creating a Portuguese context sensitive lexicon for sentiment analysis. In A. Villavicencio, V. Moreira, A. Abad, et al., editors, International Conference on Computational Processing of the Portuguese Language, pages 335–344. Springer, Canela, Brazil, 2018.

[15] A.E. Maxwell. Comparing the classification of subjects by two independent judges. The British Journal of Psychiatry, 116:651–655, 1970.

[16] I. Mozetič, M. Grčar, and J. Smailović. Multilingual Twitter sentiment classification: The role of human annotators. PLOS ONE, 11:1–26, 2016.

[17] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria, 2019.

[18] F. Rapallo. Algebraic exact inference for rater agreement models. Statistical Methods & Applications, 14:45–66, 2005.

[19] Mário J. Silva, Paula Carvalho, Carlos Costa, and Luís Sarmento. Automatic expansion of a social judgment lexicon for sentiment analysis. 2010.

[20] Marlo Souza, Renata Viera, Débora Busetti, Rove Chishman, and Isa Mara Alves. Construction of a Portuguese opinion lexicon from multiple resources. In Proceedings of the 8th Brazilian Symposium in Information and Human Language Technology, 2011.

[21] R.A. Stine. Sentiment analysis. Annual Review of Statistics and Its Application, 6:287–308, 2019.

[22] A. Stuart. A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42:412–416, 1955.

[23] E.S. Tellez, M. Graff, R.R. Suarez, et al. A simple approach to multilingual polarity classification in Twitter. Pattern Recognition Letters, 94:68–74, 2017.
