A New Semisupervised Technique for Polarity Analysis using Masked Language Models

Kohei Watanabe

arxiv: 2604.26230 · v1 · submitted 2026-04-29 · 💻 cs.CL · stat.ME


Pith reviewed 2026-05-07 13:29 UTC · model grok-4.3

classification 💻 cs.CL stat.ME

keywords polarity analysis · masked language models · Latent Semantic Scaling · semisupervised learning · sentiment analysis · text polarity · COVID media analysis


The pith

A masked-language-model version of Latent Semantic Scaling assigns polarity as the predicted probability of seed words; the paper claims greater accuracy and consistency than spatial models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a semisupervised method for analyzing text polarity by adapting Latent Semantic Scaling to use masked language models. It replaces spatial vector distances with the probability that positive or negative seed words would appear in a document's context according to the model. This probabilistic approach is tested on China Daily articles about health achievements during the COVID pandemic, where it shows advantages in accuracy, interpretability, and consistency over traditional methods. Readers should care because it simplifies scaling sentiment without needing much labeled data and could extend to other text analysis tasks.

Core claim

By employing word2vec as a masked language model, the new Latent Semantic Scaling technique assigns polarity scores to words and documents based on the predicted probability of seed words occurring in given contexts rather than their positions in vector space. These probabilistic scores prove more accurate, interpretable, and consistent when applied to media coverage of China and other countries' health issues in the COVID era.
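For contrast, the spatial scoring that the paper moves away from can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's or any package's implementation: the toy two-dimensional embeddings, the signed seed weights, and the function names `cosine` and `spatial_polarity` are all hypothetical.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def spatial_polarity(word_vec, seed_vecs, seed_weights):
    """Mean of seed-weighted cosine similarities between a word and the seeds."""
    sims = np.array([cosine(word_vec, s) for s in seed_vecs])
    return float(np.sum(sims * seed_weights) / len(seed_vecs))


# Toy 2-d embeddings (invented for illustration).
emb = {
    "good": np.array([1.0, 0.1]),
    "bad": np.array([-1.0, 0.1]),
    "success": np.array([0.9, 0.3]),
    "failure": np.array([-0.8, 0.4]),
}
seeds = {"good": 1.0, "bad": -1.0}  # assumed signed seed weights
seed_vecs = [emb[w] for w in seeds]
seed_weights = np.array(list(seeds.values()))

scores = {w: spatial_polarity(v, seed_vecs, seed_weights) for w, v in emb.items()}
```

A score of this kind is a position in vector space rather than a probability, which is the distinction the core claim turns on.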

What carries the argument

The probabilistic polarity scoring mechanism, where a masked language model predicts the likelihood of seed word occurrence to determine document and word polarity.

Load-bearing premise

That the probabilities predicted by the masked language model for seed words accurately and reliably represent the underlying polarity dimension across different types of real-world text.

What would settle it

Running the probabilistic and spatial models on a separate dataset with human-rated polarity labels and finding that the probabilistic scores do not outperform the spatial ones in accuracy or consistency.
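The comparison described above could be set up along these lines. It is a sketch only: the helper names `pearson` and `compare_models` are hypothetical, the use of Pearson's correlation as the agreement measure is an assumption, and the numbers are placeholders rather than results from the paper.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.size < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return float(np.corrcoef(x, y)[0, 1])


def compare_models(human, scores_by_model):
    """Correlation of each model's document scores with the human ratings."""
    return {name: pearson(human, s) for name, s in scores_by_model.items()}


# Placeholder numbers for illustration only; these are not results from the paper.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = {
    "probabilistic": [0.1, 0.3, 0.2, 0.6, 0.7],
    "spatial": [0.2, 0.1, 0.4, 0.3, 0.5],
}
result = compare_models(human, scores)
```

On a real held-out dataset, the claim would be in doubt if the probabilistic model's correlation with the human ratings were not reliably higher than the spatial model's across seed samples.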

Figures

Figures reproduced from arXiv: 2604.26230 by Kohei Watanabe.

**Figure 1.** Word2vec algorithms

**Figure 2.** Correlation of document polarity scores. The vertical axis is Pearson’s correlation coefficient. The boxes show the quantile range of the correlation coefficients. The perplexity scores of the probabilistic models tend to be smaller in ‘health’ than in ‘achievement’ (…

**Figure 3.** Perplexity of probabilistic models. Colors indicate 10 samples of seed words; horizontal axis is perplexity scores for seed words; vertical axis is correlation coefficients. Models were trained with 10 different samples of seed words. Example. I selected a model for each concept that achieved the lowest perplexity scores in the above evaluation: the model for ‘achievement’ is 150-dimensional and seeded with sam…

**Figure 5.** Polarity words for ‘health’. The horizontal axis is the polarity scores; the vertical axis is the frequency in the corpus. The seed words and full-dictionary words are highlighted in red and blue, respectively. To reveal China Daily’s coverage during the COVID crisis, I smoothed document polarity scores separately for the articles about China or other countries (…

**Figure 6.** Document polarity in China Daily. The plots on top are document polarity scores for ‘achievement’ and ‘health’. The plot on the bottom is the combined document polarity scores for ‘achievement in health’. Red and blue lines are the polarity scores of documents about China and other countries, respectively. Bands around the lines are the 95% confidence intervals of local regression smoothing.
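The Figure 3 caption refers to perplexity scores for seed words. The paper's exact definition is not reproduced on this page; a common definition, assumed here, is the exponential of the mean negative log-probability that a model assigns to the observed items. The function name and the toy probabilities below are illustrative only.

```python
import math


def perplexity(probs):
    """exp of the mean negative log-probability of the given probabilities."""
    if not probs:
        raise ValueError("probs must not be empty")
    if any(p <= 0.0 or p > 1.0 for p in probs):
        raise ValueError("probabilities must lie in (0, 1]")
    mean_neg_log = -sum(math.log(p) for p in probs) / len(probs)
    return math.exp(mean_neg_log)


# Toy example: a model that spreads probability evenly over four seed words.
uniform_over_four = perplexity([0.25, 0.25, 0.25, 0.25])
# A model that is certain about every seed word.
certain = perplexity([1.0, 1.0])
```

Under this definition a lower value means the model assigns higher probability to the seed words, which is consistent with the caption's use of the lowest perplexity to select a model.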

Original abstract

I developed a new version of Latent Semantic Scaling (LSS) employing word2vec as a masked language model. Unlike original spatial models, it assigns polarity scores to words and documents as predicted probabilities of seed words to occur in given contexts. These probabilistic polarity scores are more accurate, interpretable and consistent than those spatial polarity models can produce in text analysis. I demonstrate these advantages by applying both probabilistic and spatial models to China Daily's coverage of China and other countries during the coronavirus disease (COVID) pandemic in terms of achievement in health issues. The result suggests that more advanced masked language models would further improve the semisupervised machine learning technique.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes LSS with word2vec-derived probabilities but calls word2vec a masked language model, which it is not, so the claimed accuracy and interpretability gains lack a solid basis.


The main thing to know is that this paper tries to give Latent Semantic Scaling a probabilistic makeover by using word2vec to predict seed word occurrences, but it describes word2vec as a masked language model, which it is not.

The new element is replacing the spatial distance calculations in LSS with probabilities. Instead of measuring how close a word is to positive or negative seeds in embedding space, the method uses the model's prediction of how likely a seed word is to appear in a given context. The author applies both versions to China Daily coverage during the COVID pandemic, focusing on health achievements for China versus other countries. The results are said to show better accuracy, interpretability, and consistency for the probabilistic approach.

The paper does a decent job of illustrating the method on a timely dataset and of pointing out that more advanced models, such as true masked LMs, could make it even better. For someone already working with semisupervised polarity tools in social science, this could be a straightforward extension to consider.

The main weakness is in the technical setup. Word2vec models like CBOW or skip-gram learn embeddings through prediction tasks, but they do not output probabilities conditioned on masked contexts in the sense of modern masked language models. There is no mention of how exactly the probabilities are computed from the embeddings, and without that, the interpretability advantage is hard to evaluate. The abstract also skips any quantitative comparison or validation steps, leaving the superiority claim unsupported by numbers. Relying on one corpus also limits how far the findings generalize.

This paper would appeal to computational social scientists and media analysts who use tools like LSS for large-scale text polarity work. It is not going to change core NLP techniques, but it might offer a practical tweak for existing workflows. I would recommend sending it to peer review. The idea has enough substance to warrant feedback, though the author will need to address the modeling details and provide more evidence for the performance claims.

Referee Report

2 major / 1 minor

Summary. The paper proposes a new semisupervised polarity analysis technique by adapting Latent Semantic Scaling (LSS) to employ word2vec as a masked language model. It computes polarity scores for words and documents as the predicted probabilities of seed words occurring in given contexts, claiming these probabilistic scores are more accurate, interpretable, and consistent than those produced by spatial polarity models. The approach is demonstrated on China Daily coverage of health achievements during the COVID pandemic, with the suggestion that more advanced masked language models would yield further improvements.

Significance. If the probabilistic construction can be rigorously defined from embeddings and the superiority claims are supported by quantitative validation, the work could offer a more interpretable alternative to spatial embedding methods for semisupervised text analysis. The core idea of deriving polarity from context-conditioned predictions has potential, but the manuscript provides no evidence that this advantage is realized.

major comments (2)

[Abstract] The assertion that the probabilistic polarity scores are 'more accurate, interpretable and consistent' than spatial models is made without any quantitative metrics, error analysis, baseline comparisons, or validation procedures. The single China Daily demonstration supplies no numerical results to support the superiority claim.

[Abstract] The central construction treats word2vec (CBOW/skip-gram) 'as a masked language model' to obtain 'predicted probabilities of seed words to occur in given contexts.' Standard word2vec produces embedding similarities rather than a conditional distribution P(seed | context) via masking. The manuscript must explicitly define the procedure (including any normalization or softmax) that converts embeddings into these probabilities; without it, the probabilistic semantics and the claimed advantages over spatial LSS rest on an undefined quantity.
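For reference, one way to write the kind of conditional distribution the second comment asks for is the textbook CBOW full softmax. The notation is an assumption introduced here, not taken from the manuscript: $C$ is the set of context words, $v_c$ the input vector of context word $c$, $u_w$ the output vector of word $w$, $V$ the vocabulary, and $S \subseteq V$ a set of seed words.

```latex
% Averaged context vector and full-softmax probability of a target word w
h = \frac{1}{|C|} \sum_{c \in C} v_c,
\qquad
P(w \mid C) = \frac{\exp\!\left(u_w^{\top} h\right)}
                   {\sum_{w' \in V} \exp\!\left(u_{w'}^{\top} h\right)}

% One possible variant: normalization restricted to the seed set S
P_S(s \mid C) = \frac{\exp\!\left(u_s^{\top} h\right)}
                     {\sum_{s' \in S} \exp\!\left(u_{s'}^{\top} h\right)},
\qquad s \in S
```

In practice word2vec is usually trained with negative sampling or hierarchical softmax rather than the full softmax, so a manuscript adopting either form would also need to say which training objective produced the vectors.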

minor comments (1)

[Abstract] The phrasing 'in terms of achievement in health issues' is unclear and should be revised for precision.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. These have highlighted important areas where the presentation and validation can be strengthened. We address each major comment below and commit to revisions that will incorporate explicit definitions and additional quantitative support.


Referee: [Abstract] The assertion that the probabilistic polarity scores are 'more accurate, interpretable and consistent' than spatial models is made without any quantitative metrics, error analysis, baseline comparisons, or validation procedures. The single China Daily demonstration supplies no numerical results to support the superiority claim.

Authors: We acknowledge that the abstract asserts superiority without accompanying quantitative evidence, and the China Daily case study is presented primarily as an illustrative application rather than a formal validation. The manuscript does not include numerical metrics, baselines, or error analysis. To address this, we will revise the abstract and add a dedicated evaluation section with quantitative comparisons (e.g., polarity classification accuracy against human annotations or spatial LSS baselines), consistency measures across runs, and error analysis on the COVID-related corpus. revision: yes

Referee: [Abstract] The central construction treats word2vec (CBOW/skip-gram) 'as a masked language model' to obtain 'predicted probabilities of seed words to occur in given contexts.' Standard word2vec produces embedding similarities rather than a conditional distribution P(seed | context) via masking. The manuscript must explicitly define the procedure (including any normalization or softmax) that converts embeddings into these probabilities; without it, the probabilistic semantics and the claimed advantages over spatial LSS rest on an undefined quantity.

Authors: We agree that the current manuscript does not provide an explicit mathematical definition of how word2vec embeddings are converted into conditional probabilities P(seed | context). Although word2vec (particularly CBOW) is trained to predict target words from context and thus admits a probabilistic interpretation via its output softmax, the manuscript relies on this without detailing the normalization step. In the revised version, we will add a precise description in the Methods section: the context embedding is used to compute logits via the output weight matrix, followed by softmax normalization restricted to the seed-word vocabulary to obtain the desired probabilities. This will rigorously ground the probabilistic construction and clarify its relation to (but distinction from) true masked language models. revision: yes
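The procedure described in this response (context embedding, logits via the output weight matrix, softmax restricted to the seed-word vocabulary) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random matrices `W_in` and `W_out` stand in for trained word2vec weights, and the function name `seed_probabilities`, the toy vocabulary, and the CBOW-style averaging of context vectors are assumptions.

```python
import numpy as np


def seed_probabilities(context_ids, seed_ids, W_in, W_out):
    """Softmax over seed words only, given the ids of the context words."""
    h = W_in[context_ids].mean(axis=0)   # averaged context embedding (CBOW-style)
    logits = W_out[seed_ids] @ h         # one logit per seed word
    logits = logits - logits.max()       # subtract max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()       # normalise over the seed set only


# Toy setup: random weights stand in for a trained model (assumption).
vocab = ["progress", "success", "recovery", "outbreak", "the", "hospital"]
index = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), 8))    # input (context) embeddings
W_out = rng.normal(size=(len(vocab), 8))   # output (target) embeddings

seed_ids = [index["progress"], index["success"], index["recovery"]]
context_ids = [index["the"], index["hospital"]]
probs = seed_probabilities(context_ids, seed_ids, W_in, W_out)
```

Because the normalisation runs over the seed set only, the values sum to one across seeds and say nothing about how likely a seed is relative to the rest of the vocabulary.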

Circularity Check

0 steps flagged

No significant circularity in the derivation chain


The paper introduces a new variant of LSS by reinterpreting word2vec embeddings to produce polarity scores framed as predicted probabilities of seed-word occurrence in context, then empirically compares these to the original spatial LSS on the China Daily COVID corpus. No equations or steps are shown that define the output polarity directly in terms of itself, rename a fitted parameter as an independent prediction, or rest the central claim solely on a self-citation whose content is unverified. The probabilistic construction is presented as a distinct computational choice from the spatial baseline, and the demonstration supplies an external benchmark, rendering the derivation self-contained against the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be identified from the provided text.



Reference graph

Works this paper leans on

6 extracted references · 2 canonical work pages

[1]

LIWC contains keywords about ‘health’ (294 words) and ‘achievement’ (213 words), which were manually collected for social and psychological research.9

I perform full dictionary analysis of the articles using LIWC (Pennebaker et al., 2015) to obtain the benchmark scores. LIWC contains keywords about ‘health’ (294 words) and ‘achievement’ (213 words), which were manually collected for social and psychological research.9

2015
[2]

These seed words are not optimal but allow evaluation of models in situations where seed words chosen by users are relevant but not necessarily most extreme words.10

I randomly sample 10 sets of 10 unipolar seed words from the full dictionary for LSS. These seed words are not optimal but allow evaluation of models in situations where seed words chosen by users are relevant but not necessarily most extreme words.10
[3]

This evaluation results in sets of scores in 3,100 conditions in total

I apply spatial and probabilistic models to the same articles with different hyperparameters.11 These models are trained using SVD or word2vec on the corpus with a sample of the unipolar seed words taken from the full dictionary. This evaluation results in sets of scores in 3,100 conditions in total
[4]

In this comparison, I also included mini dictionaries that only comprise the seed words to highlight the contribution of the LSS models

I correlate the polarity scores of documents produced by the full dictionary and the three types of LSS models without any aggregation. In this comparison, I also included mini dictionaries that only comprise the seed words to highlight the contribution of the LSS models
[5]

opportunities

I analyze the relationship between the correlation coefficient and perplexity scores. If strong correlation is found between them, perplexity scores can be used to optimize

9 The LIWC dictionaries are created in three steps: (1) collect candidate words from English dictionaries and thesauri, (2) select only relevant words from the candidate words by emplo...

work page doi:10.1109/bts- 2015
[6]

inadequa*

https://doi.org/10.1017/S0305741022001722 Chen, L. (2012). Reporting news in China: Evaluation as an indicator of change in the China Daily. China Information, 26(3), 303–329. https://doi.org/10.1177/0920203X12456338 Deerwester, S. C., Dumais, S. T., Landauer, T. K., Furnas, G. W., & Harshman, R. A. (1990). Indexing by latent semantic analysis. Journal of...

work page doi:10.1017/s0305741022001722 2012
