
arxiv: 1906.08717 · v1 · pith:QXM74WOGnew · submitted 2019-06-20 · 💻 cs.CL

A New Statistical Approach for Comparing Algorithms for Lexicon Based Sentiment Analysis

Pith reviewed 2026-05-25 19:37 UTC · model grok-4.3

classification 💻 cs.CL
keywords lexicon-based sentiment analysis · statistical comparison · marginal homogeneity tests · log linear models · algorithm comparison without labels · Portuguese sentiment analysis · maximum likelihood estimation

The pith

Statistical methods using marginal homogeneity tests and log linear models enable direct comparison of lexicon-based sentiment algorithms without human annotations or known labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops methods to statistically compare different lexicon-based sentiment analysis algorithms directly with each other. This avoids the need for human-annotated texts, which are scarce in languages like Portuguese. It uses marginal homogeneity tests and log linear models under maximum likelihood estimation to assess agreement and variability between algorithms. This matters because it provides a way to rank or establish equivalence between algorithms when gold standard labels are unavailable. The approach also notes similarities between uncertainties in lexicon outputs and human annotations.

Core claim

The central claim is that marginal homogeneity tests and log linear models within a maximum likelihood framework can compare the raw outputs of lexicon-based sentiment algorithms, producing rankings or equivalence statements without reference to human judgments or known class labels. The paper demonstrates that output variability is lexicon-dependent and can be quantified in the log linear model framework, while also showing that uncertainties in these algorithms resemble those in human-annotated tweets.

What carries the argument

Marginal homogeneity tests and log linear models for direct comparison of algorithm outputs.
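
To make the first ingredient concrete, here is a minimal Python sketch of one standard marginal homogeneity test, the Stuart-Maxwell test, for two algorithms that each label the same texts as negative, neutral, or positive. The abstract does not say which test the paper uses, so the choice of test, the three-category closed form, the function name `stuart_maxwell_3x3`, and the counts are all illustrative assumptions.

```python
import math


def stuart_maxwell_3x3(table):
    """Stuart-Maxwell test of marginal homogeneity for a 3x3 table.

    table[i][j] counts texts labelled category i by algorithm A and
    category j by algorithm B. Returns (statistic, p_value); df = 2.
    """
    n = table
    # Differences between row marginals (A) and column marginals (B).
    d = [sum(n[i]) - sum(n[r][i] for r in range(3)) for i in range(3)]
    # Averaged off-diagonal (disagreement) counts for each category pair.
    m12 = (n[0][1] + n[1][0]) / 2
    m13 = (n[0][2] + n[2][0]) / 2
    m23 = (n[1][2] + n[2][1]) / 2
    denom = 2 * (m12 * m23 + m12 * m13 + m13 * m23)
    if denom == 0:
        raise ValueError("too few disagreements to compute the statistic")
    stat = (m23 * d[0] ** 2 + m13 * d[1] ** 2 + m12 * d[2] ** 2) / denom
    # The chi-square survival function with 2 degrees of freedom is exp(-x/2).
    p_value = math.exp(-stat / 2)
    return stat, p_value


# Hypothetical counts: rows = algorithm A, columns = algorithm B,
# categories ordered (negative, neutral, positive).
counts = [
    [20, 10, 5],
    [3, 30, 7],
    [2, 4, 19],
]
stat, p_value = stuart_maxwell_3x3(counts)
print(f"chi2 = {stat:.3f}, df = 2, p = {p_value:.4f}")
```

A small p-value would suggest that the two algorithms assign the three categories in different proportions. It would not, by itself, say which algorithm is closer to human judgment.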

If this is right

  • Algorithms can be ranked or declared equivalent based on statistical agreement alone.
  • Lexicon-dependent variability in outputs can be quantified and compared.
  • The approach works in settings where human annotation is scarce or absent.
  • Uncertainties in lexicon-based methods can be analyzed similarly to those in annotated data.
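
The second bullet refers to quantifying variability in a log linear framework. The abstract does not say which log linear models the paper fits, so the sketch below assumes one common agreement model, quasi-independence, in which diagonal (agreement) cells get their own parameters and off-diagonal cells follow independence. The function names, the use of iterative proportional fitting, and the counts are illustrative assumptions.

```python
import math


def fit_quasi_independence(table, max_iter=500, tol=1e-10):
    """Fitted counts under log m_ij = l + a_i + b_j + delta_i * [i == j].

    Diagonal cells are fitted exactly; off-diagonal cells are fitted by
    iterative proportional fitting to the off-diagonal row and column totals.
    Assumes every row and every column has at least one off-diagonal count.
    """
    k = len(table)
    off = [[0.0 if i == j else float(table[i][j]) for j in range(k)]
           for i in range(k)]
    row_tot = [sum(row) for row in off]
    col_tot = [sum(off[i][j] for i in range(k)) for j in range(k)]
    # Start from 1 in every off-diagonal cell, 0 on the diagonal.
    fit = [[0.0 if i == j else 1.0 for j in range(k)] for i in range(k)]
    for _ in range(max_iter):
        for i in range(k):  # rescale rows to match off-diagonal row totals
            s = sum(fit[i])
            fit[i] = [v * row_tot[i] / s for v in fit[i]]
        for j in range(k):  # rescale columns to match off-diagonal column totals
            s = sum(fit[i][j] for i in range(k))
            for i in range(k):
                fit[i][j] *= col_tot[j] / s
        gap = max(abs(sum(fit[i]) - row_tot[i]) for i in range(k))
        if gap < tol:
            break
    for i in range(k):
        fit[i][i] = float(table[i][i])
    return fit


def deviance(table, fit):
    """Likelihood-ratio statistic G^2 = 2 * sum n_ij * log(n_ij / m_ij)."""
    return 2 * sum(
        n * math.log(n / m)
        for row_n, row_m in zip(table, fit)
        for n, m in zip(row_n, row_m)
        if n > 0
    )


# Hypothetical counts: rows = algorithm A, columns = algorithm B.
counts = [
    [20, 10, 5],
    [3, 30, 7],
    [2, 4, 19],
]
fit = fit_quasi_independence(counts)
k = len(counts)
df = (k - 1) ** 2 - k  # 1 degree of freedom for three categories
print(f"G2 = {deviance(counts, fit):.3f} on {df} df")
```

A large G2 relative to its degrees of freedom would suggest that the disagreements between the two algorithms are structured rather than random, which is one way to read "lexicon-dependent variability".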

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method might apply to comparing other unsupervised text classifiers without ground truth.
  • In languages with some annotations available, the statistical rankings could be cross-checked for consistency.
  • This opens comparisons across different domains where labeled data is limited.

Load-bearing premise

The raw outputs of different lexicon-based algorithms can be directly compared via marginal homogeneity tests and log linear models in a way that produces meaningful rankings or equivalence statements, without any external validation against human judgments.

What would settle it

An experiment showing that statistical rankings from these tests disagree with human-annotated rankings in a language where annotated data is available.

original abstract

Lexicon based sentiment analysis usually relies on the identification of various words to which a numerical value corresponding to sentiment can be assigned. In principle, classifiers can be obtained from these algorithms by comparison with human annotation, which is considered the gold standard. In practice this is difficult in languages such as Portuguese where there is a paucity of human annotated texts. Thus in order to compare algorithms, a next best step is to directly compare different algorithms with each other without referring to human annotation. In this paper we develop methods for a statistical comparison of algorithms which does not rely on human annotation or on known class labels. We will motivate the use of marginal homogeneity tests, as well as log linear models within the framework of maximum likelihood estimation. We will also show how some uncertainties present in lexicon based sentiment analysis may be similar to those which occur in human annotated tweets. We will also show how the variability in the output of different algorithms is lexicon dependent, and quantify this variability in the output within the framework of log linear models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes statistical methods for comparing lexicon-based sentiment analysis algorithms without human annotations or known class labels. It motivates the application of marginal homogeneity tests and log-linear models (under maximum likelihood estimation) to the categorical outputs of different algorithms, claims that uncertainties in lexicon-based outputs resemble those in human annotations, and states that output variability is lexicon-dependent and can be quantified via log-linear models.

Significance. If validated, the approach could enable algorithm comparison in low-resource settings such as Portuguese where annotated corpora are scarce. The statistical procedures themselves are standard and well-defined for testing distributional differences, but the manuscript provides no derivation, auxiliary result, or empirical demonstration that such differences correspond to differences in sentiment-analysis quality.
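
The gap between distributional agreement and quality can be shown with a toy case. In the Python sketch below, two hypothetical algorithms produce exactly the same marginal label distribution, so any marginal homogeneity test would find no difference, yet one matches the gold labels on every text and the other on none. All labels and names are invented for illustration.

```python
from collections import Counter

# Hypothetical gold labels and two algorithms' outputs for nine texts.
gold = ["neg", "neg", "neg", "neu", "neu", "neu", "pos", "pos", "pos"]
alg_a = list(gold)  # agrees with gold on every text
alg_b = ["neu", "neu", "neu", "pos", "pos", "pos", "neg", "neg", "neg"]


def marginals(labels):
    """Count how often each category is assigned."""
    return Counter(labels)


def accuracy(predicted, reference):
    """Fraction of texts where the prediction matches the reference label."""
    hits = sum(p == r for p, r in zip(predicted, reference))
    return hits / len(reference)


# Identical marginals: three texts per category for both algorithms.
print(marginals(alg_a) == marginals(alg_b))
# Very different quality once gold labels are available.
print(accuracy(alg_a, gold), accuracy(alg_b, gold))
```

The example is deliberately extreme, but it shows why equal marginals cannot be read as equal quality without some external reference.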

major comments (3)
  1. [Abstract] Abstract and introduction: the central claim that marginal homogeneity tests and log-linear models applied to raw algorithm outputs yield a 'valid statistical comparison' of the algorithms (i.e., one that can substitute for human-annotation-based evaluation) is unsupported; no derivation or auxiliary result establishes that a statistically significant difference in marginal distributions implies one lexicon is closer to true sentiment polarity.
  2. [Abstract] Abstract: the statement that 'some uncertainties present in lexicon based sentiment analysis may be similar to those which occur in human annotated tweets' is noted but does not close the gap, as it only observes shared noise sources without showing that the proposed tests recover quality rankings.
  3. [Abstract] Abstract: the manuscript supplies no validation data, fitted models, p-values, or cross-algorithm rankings demonstrating that the methods produce meaningful equivalence or superiority statements; the central claim therefore remains untested within the provided text.
minor comments (1)
  1. [Abstract] The abstract uses future tense ('We will motivate', 'We will also show') which is atypical for a completed manuscript; revise to present tense once results are included.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments. We respond to each major comment below, clarifying the scope of our methodological contribution.

point-by-point responses
  1. Referee: [Abstract] Abstract and introduction: the central claim that marginal homogeneity tests and log-linear models applied to raw algorithm outputs yield a 'valid statistical comparison' of the algorithms (i.e., one that can substitute for human-annotation-based evaluation) is unsupported; no derivation or auxiliary result establishes that a statistically significant difference in marginal distributions implies one lexicon is closer to true sentiment polarity.

    Authors: The manuscript introduces these statistical methods as tools for comparing the output distributions of different lexicon-based algorithms without requiring human annotations. We draw an analogy to their use in comparing human raters but do not provide a derivation showing that distributional differences correspond to differences in accuracy relative to true sentiment. We will revise the abstract and introduction to emphasize that the methods enable statistical comparison of outputs rather than claiming they substitute for quality evaluation based on ground truth. revision: yes

  2. Referee: [Abstract] Abstract: the statement that 'some uncertainties present in lexicon based sentiment analysis may be similar to those which occur in human annotated tweets' is noted but does not close the gap, as it only observes shared noise sources without showing that the proposed tests recover quality rankings.

    Authors: This statement serves to justify the applicability of marginal homogeneity tests and log-linear models by highlighting similarities in uncertainty sources. We agree that it does not demonstrate recovery of quality rankings, which would indeed require ground truth labels not available in the target low-resource settings. The paper does not claim to show such recovery. revision: no

  3. Referee: [Abstract] Abstract: the manuscript supplies no validation data, fitted models, p-values, or cross-algorithm rankings demonstrating that the methods produce meaningful equivalence or superiority statements; the central claim therefore remains untested within the provided text.

    Authors: The focus of the manuscript is on developing and motivating the statistical framework. It does not include specific empirical validations or numerical results from applications, as the goal is to present the methods themselves. We acknowledge that without such demonstrations, the practical utility remains to be shown in follow-up work. revision: no

Circularity Check

0 steps flagged

No circularity: standard external statistical tests applied to outputs

full rationale

The paper proposes applying marginal homogeneity tests and log-linear models (under MLE) directly to the categorical outputs of lexicon-based algorithms as a way to compare them without human labels. These are standard, externally defined statistical procedures whose validity does not depend on any derivation, fit, or self-citation internal to the paper. No equations, parameters, or predictions in the described method reduce to the inputs by construction, nor is any load-bearing premise justified solely by prior work from the same authors. The claim that distributional comparisons can substitute for quality evaluation is a substantive (and debatable) assumption about what the tests measure, but it is not a circular reduction. This is the most common honest non-finding for papers that import off-the-shelf statistical machinery.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that algorithm outputs behave like categorical data suitable for homogeneity testing.

axioms (1)
  • domain assumption: Outputs of lexicon-based algorithms can be treated as categorical variables amenable to marginal homogeneity testing without reference to external labels.
    Invoked when proposing direct comparison of algorithms.
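
As a rough illustration of this assumption, the Python sketch below turns numeric lexicon scores into three categories and cross-tabulates two lexicons over the same texts, giving the kind of square table a homogeneity test would take as input. The lexicons, texts, threshold, whitespace tokenization, and function names are all hypothetical; the paper's actual scoring rules are not given in the abstract.

```python
CATEGORIES = ("neg", "neu", "pos")

# Two tiny hypothetical lexicons mapping words to polarity scores.
lexicon_a = {"bom": 1.0, "lindo": 1.0, "ruim": -1.0, "feio": -1.0}
lexicon_b = {"bom": 1.0, "lindo": 1.0, "ruim": -1.0, "caro": -1.0}


def classify(text, lexicon, threshold=0.0):
    """Sum word scores and map the total to neg / neu / pos."""
    score = sum(lexicon.get(word, 0.0) for word in text.lower().split())
    if score > threshold:
        return "pos"
    if score < -threshold:
        return "neg"
    return "neu"


def cross_tabulate(labels_a, labels_b, categories=CATEGORIES):
    """Square contingency table: rows = first labelling, columns = second."""
    index = {c: i for i, c in enumerate(categories)}
    table = [[0] * len(categories) for _ in categories]
    for a, b in zip(labels_a, labels_b):
        table[index[a]][index[b]] += 1
    return table


texts = ["filme bom", "produto ruim e caro", "lindo mas caro", "nada a dizer"]
labels_a = [classify(t, lexicon_a) for t in texts]
labels_b = [classify(t, lexicon_b) for t in texts]
table = cross_tabulate(labels_a, labels_b)
for name, row in zip(CATEGORIES, table):
    print(name, row)
```

The third text is positive under the first lexicon and neutral under the second, because only the second lexicon scores "caro". Off-diagonal cells of this kind are what the categorical treatment is meant to capture.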

pith-pipeline@v0.9.0 · 5704 in / 1131 out tokens · 21505 ms · 2026-05-25T19:37:23.885359+00:00 · methodology

