Modeling Semantic Compositionality with Sememe Knowledge

Chenghao Yang; Fanchao Qi; Junjie Huang; Maosong Sun; Qun Liu; Xiao Chen; Zhiyuan Liu

arxiv: 1907.04744 · v1 · pith:2INM4VYAnew · submitted 2019-07-10 · 💻 cs.CL

Fanchao Qi, Junjie Huang, Chenghao Yang, Zhiyuan Liu, Xiao Chen, Qun Liu, Maosong Sun

Pith reviewed 2026-05-24 23:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords: semantic compositionality, sememes, HowNet, multiword expressions, word representations, knowledge integration, natural language processing

The pith

Incorporating sememe knowledge from HowNet significantly improves models of semantic compositionality for multiword expressions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that sememes, the smallest semantic units of language, can be used to strengthen models of how word meanings combine into larger phrases. Prior approaches to semantic compositionality have emphasized complex mathematical functions while rarely drawing on external knowledge sources. The authors run a confirmatory test and then integrate sememe information drawn from the HowNet knowledge base into existing composition models. They apply the resulting models to the task of building representations for multiword expressions and report clear gains over baselines that ignore sememe knowledge. Intrinsic and extrinsic evaluations plus case studies support the benefit of this addition.

Core claim

We verify the effectiveness of sememes in modeling semantic compositionality by a confirmatory experiment and make the first attempt to incorporate sememe knowledge from HowNet into SC models. When these models are used to learn representations of multiword expressions, they achieve significant performance improvements over baseline methods that do not consider sememe knowledge, as measured by both intrinsic and extrinsic evaluations.

What carries the argument

Sememe knowledge drawn from HowNet and inserted into semantic compositionality functions for enriching multiword expression representations.

If this is right

Sememe knowledge produces measurable gains on both intrinsic and extrinsic evaluations of multiword expression representations.
Quantitative analysis and case studies confirm that sememe information aids semantic compositionality modeling.
The approach demonstrates that external sememe knowledge can be added to composition models without task-specific retuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same sememe integration technique could be tested on other compositionality problems such as full sentence embeddings.
If HowNet sememes prove useful, comparable gains might appear when other structured knowledge bases are substituted.
Sememe signals could help models distinguish compositional from non-compositional phrases such as idioms.

Load-bearing premise

Sememe annotations in HowNet are accurate, complete, and directly relevant to the compositionality task so that adding them improves performance without introducing noise.

What would settle it

An experiment that applies the same models and datasets but replaces HowNet sememe labels with random labels and finds no performance difference or a reversal of the reported gains.
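
One way to build such a control, as a minimal sketch: permute the sememe sets across words, so that every word keeps a sememe set of realistic size but loses its real word-to-sememe alignment. The dictionary `toy`, its entries, and the function name `shuffle_sememe_labels` are hypothetical illustrations; they are not taken from the paper's released code or from HowNet.

```python
import random

def shuffle_sememe_labels(sememes, seed=0):
    """Return a copy of `sememes` with the sememe sets permuted across words.

    `sememes` maps each word to its set of sememe labels. This is a toy
    stand-in for a HowNet lookup; the real data format is an assumption.
    """
    rng = random.Random(seed)        # fixed seed keeps the control repeatable
    words = sorted(sememes)          # deterministic word order
    label_sets = [sememes[w] for w in words]
    rng.shuffle(label_sets)          # break the word-to-sememe alignment
    # Note: a random permutation may leave a few words with their own labels.
    return dict(zip(words, label_sets))

# Toy annotations, made up for illustration (not real HowNet entries).
toy = {
    "farmer": {"human", "occupation", "agricultural"},
    "land": {"earth", "place"},
    "bank": {"institution", "finance"},
    "river": {"waters", "linear"},
}
control = shuffle_sememe_labels(toy, seed=13)
```

Training the same model once on the original mapping and once on `control`, with identical data splits, would indicate whether the reported gain depends on the real annotations or only on the extra input features.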

Original abstract

Semantic compositionality (SC) refers to the phenomenon that the meaning of a complex linguistic unit can be composed of the meanings of its constituents. Most related works focus on using complicated compositionality functions to model SC while few works consider external knowledge in models. In this paper, we verify the effectiveness of sememes, the minimum semantic units of human languages, in modeling SC by a confirmatory experiment. Furthermore, we make the first attempt to incorporate sememe knowledge into SC models, and employ the sememe-incorporated models in learning representations of multiword expressions, a typical task of SC. In experiments, we implement our models by incorporating knowledge from a famous sememe knowledge base HowNet and perform both intrinsic and extrinsic evaluations. Experimental results show that our models achieve significant performance boost as compared to the baseline methods without considering sememe knowledge. We further conduct quantitative analysis and case studies to demonstrate the effectiveness of applying sememe knowledge in modeling SC. All the code and data of this paper can be obtained on https://github.com/thunlp/Sememe-SC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows a concrete lift from adding HowNet sememes to composition models on MWE tasks and ships the code.

The main point is that sememe knowledge from HowNet improves multiword expression representations when plugged into semantic composition models, and the authors make the code and data public. They position this as the first such attempt, starting with a confirmatory check that sememes matter, then building and testing integrated models on both intrinsic and extrinsic tasks plus some analysis and cases. The work is straightforward: it takes an existing knowledge base and shows it helps where pure composition functions fall short. Releasing the implementation is the clearest practical value here, since anyone can inspect how the integration was done and try it on their own data. The experiments follow a reasonable pattern for this kind of knowledge-injection paper.

The soft spots are mostly around the strength of the evidence. The abstract claims a significant boost but gives no effect sizes, no significance tests, and no breakdown of baselines or splits, so it is difficult to judge robustness from the summary alone. The whole result also rests on HowNet annotations being accurate and relevant enough not to add noise; that assumption is plausible but untested in the reported work.

This is a paper for people already working on compositionality or external-knowledge methods in NLP who want a working example rather than a new theoretical framework. It is not revolutionary, but the empirical demonstration plus code makes it worth a referee's time to check the details and see whether the gains hold up under closer scrutiny. I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The paper proposes incorporating sememe knowledge from the HowNet knowledge base into semantic compositionality (SC) models, with a focus on learning representations for multiword expressions. It conducts a confirmatory experiment on sememes' effectiveness and implements sememe-augmented models, reporting performance gains over baselines in both intrinsic and extrinsic evaluations. Code and data are released on GitHub.

Significance. If the reported gains prove robust, the work demonstrates the value of external sememe knowledge for SC tasks and provides a reproducible baseline via the public code release, which supports verification and extension in knowledge-augmented NLP models.

major comments (1)

[§4] §4 (experimental results): the central claim of a 'significant performance boost' lacks any reported statistical significance tests, error bars, exact baseline implementations, or data-split details, which directly undermines assessment of whether the gains are reliable and load-bearing for the paper's main contribution.

minor comments (2)

[§3] The description of how sememes are integrated into the composition functions (likely §3) could include a clearer formal definition or pseudocode to aid replication.
Table captions and axis labels in the results figures should explicitly state the evaluation metrics and number of runs for clarity.
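
As an illustration of the kind of pseudocode the first minor comment asks for, the following is a sketch based only on the one-line model description quoted further down this page, where constituent embeddings are concatenated with sememe embeddings. Averaging the sememe vectors, the single tanh layer, the input layout, and every name here are assumptions made for illustration; the paper's actual definitions may differ.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def compose(w1, w2, sememes1, sememes2, weight, bias):
    """Compose two constituent embeddings together with their sememe embeddings.

    Assumed input layout: [w1; w2; mean(sememes1); mean(sememes2)].
    `weight` is a d x 4d matrix given as a list of rows; `bias` has length d.
    """
    # `+` on Python lists concatenates, giving one vector of length 4d.
    x = w1 + w2 + mean_vector(sememes1) + mean_vector(sememes2)
    return [
        math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b)
        for row, b in zip(weight, bias)
    ]

# Toy 2-dimensional example with hand-picked numbers.
w1, w2 = [1.0, 0.0], [0.0, 1.0]
s1 = [[0.5, 0.5], [1.5, 0.5]]    # two sememe vectors for constituent 1
s2 = [[0.0, 2.0]]                # one sememe vector for constituent 2
W = [
    [0, 0, 0, 0, 1, 0, 0, 0],    # reads the first averaged sememe feature
    [0, 0, 0, 0, 0, 0, 0, 1],    # reads the last averaged sememe feature
]
b = [0.0, 0.0]
phrase = compose(w1, w2, s1, s2, W, b)
```

In a trained model `W` and `b` would be learned; the fixed values above only make it easy to see which input features reach the output.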

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive feedback. We address the single major comment below and will revise the manuscript accordingly to strengthen the experimental reporting.

Referee: [§4] §4 (experimental results): the central claim of a 'significant performance boost' lacks any reported statistical significance tests, error bars, exact baseline implementations, or data-split details, which directly undermines assessment of whether the gains are reliable and load-bearing for the paper's main contribution.

Authors: We agree that the absence of statistical significance tests, error bars, and more explicit details on baselines and splits weakens the presentation of results. In the revised version we will add paired t-tests (or equivalent) with p-values for all reported improvements, include error bars on performance figures where feasible, and expand §4 with precise descriptions of baseline re-implementations and the train/dev/test splits used. The public GitHub release already contains the code and data, but we will make the experimental protocol clearer in the text itself.

Revision: yes
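
The rebuttal promises paired t-tests "or equivalent". One such equivalent, which needs no distribution tables, is an exact paired sign-flip (permutation) test; note that this is a swapped-in technique, not the t-test itself. The sketch below uses only the Python standard library. The function name and the scores are made up for illustration and are not results from the paper.

```python
from itertools import product

def paired_sign_flip_test(baseline, system):
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis each paired difference is equally likely to
    have either sign, so all 2**n sign assignments are enumerated and the
    function returns the fraction whose absolute summed difference is at
    least as large as the observed one.
    """
    diffs = [s - b for b, s in zip(baseline, system)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * d for sign, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:   # small tolerance for float rounding
            hits += 1
    return hits / total

# Made-up scores for five paired runs (not results from the paper).
baseline_scores = [0.60, 0.62, 0.58, 0.61, 0.59]
sememe_scores = [0.64, 0.65, 0.61, 0.66, 0.62]
p_value = paired_sign_flip_test(baseline_scores, sememe_scores)
```

Because the enumeration grows as 2**n, this form only suits a small number of paired runs. With five pairs the smallest attainable two-sided p-value is 2/32, so more runs would be needed before a result could fall below 0.05.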

Circularity Check

0 steps flagged

No significant circularity

The paper presents an empirical modeling approach that incorporates external sememe annotations from HowNet into compositionality functions for multiword expressions, with performance gains shown via intrinsic and extrinsic evaluations against baselines. No derivation chain, uniqueness theorem, or prediction reduces by construction to self-defined quantities, fitted inputs, or self-citation load-bearing premises; results depend on external knowledge base and reported comparisons.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach depends on the external HowNet sememe knowledge base as an unexamined input and on standard neural composition functions whose hyperparameters are not detailed in the abstract. No new entities are postulated.

free parameters (1)

sememe integration hyperparameters
Weights or gating parameters controlling how sememe vectors are combined with word vectors in the composition function; these are fitted during training.

axioms (1)

domain assumption: Sememes are the minimum semantic units of human languages and HowNet provides a reliable mapping from words to sememes.
Invoked in the abstract as the justification for using sememe knowledge to model SC.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

We propose two sememe-incorporated SC models... SCAS model concatenates the embeddings of the MWE’s constituents and their sememes, while SCMSA model considers the mutual attention between a constituent’s sememes and the other constituent.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear

Sememes are defined as the minimum semantic units of human languages... HowNet... annotates over 100,000 Chinese words

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.