A Data-Driven Approach to Idiomaticity Based on Experts' Criteria in Theoretical Linguistics
Pith reviewed 2026-05-20 05:54 UTC · model grok-4.3
The pith
Expert ratings of 286 multi-word expressions show none qualify as absolutely idiomatic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; and the presence of obsolete words and grammar influences whether a multi-word expression (MWE) can be replaced with one word.
What carries the argument
The 16 criteria drawn from theoretical linguistics sources, applied through expert annotation to MWEs collected from the same sources.
If this is right
- Lexical criteria provide the primary signal for distinguishing degrees of idiomaticity in multi-word expressions.
- Grammatical criteria only contribute to idiomaticity judgments when specific contextual conditions are met.
- The presence of obsolete words or grammar affects whether an expression can be replaced by a single word.
- Idiomaticity should be treated as a matter of degree in linguistic analysis rather than a binary property.
Where Pith is reading between the lines
- The annotation method could be adapted to train computational systems for grading idiomaticity in large text corpora.
- The absence of absolute idiomaticity may affect how language models handle multi-word expressions during parsing or generation.
- Testing the same criteria on expressions drawn from non-theoretical sources would check whether the patterns hold beyond the original collection.
Load-bearing premise
The 16 criteria drawn from theoretical sources are sufficient to capture the full notion of idiomaticity, and expert annotations provide a reliable, unbiased measurement of those criteria.
What would settle it
Finding even one multi-word expression that multiple independent groups of linguistics experts unanimously rate as satisfying every one of the 16 criteria would falsify the claim that no absolutely idiomatic expressions exist.
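The falsification test above can be sketched in a few lines. This is a minimal illustration, assuming each annotator group's judgment of an expression is stored as 16 booleans, one per criterion. The data layout, the function names, and the placeholder expressions are hypothetical and do not come from the paper.

```python
from typing import Dict, List

N_CRITERIA = 16  # number of criteria reported in the paper


def is_absolutely_idiomatic(marks: List[bool]) -> bool:
    """Return True only if every one of the 16 criteria is marked as satisfied."""
    if len(marks) != N_CRITERIA:
        raise ValueError(f"expected {N_CRITERIA} marks, got {len(marks)}")
    return all(marks)


def find_counterexamples(annotations: Dict[str, List[List[bool]]]) -> List[str]:
    """List expressions that every annotator group marks positive on all criteria.

    `annotations` maps an expression to one list of 16 booleans per group
    (a hypothetical layout chosen for this sketch).
    """
    return [
        mwe
        for mwe, groups in annotations.items()
        if groups and all(is_absolutely_idiomatic(g) for g in groups)
    ]


# Toy data, not taken from the paper: two annotator groups per expression.
toy = {
    "expression_a": [[True] * 16, [True] * 16],
    "expression_b": [[True] * 16, [True] * 15 + [False]],
}
print(find_counterexamples(toy))  # prints ['expression_a']
```

A non-empty result on real, independently collected annotations would be the kind of counterexample described above; an empty result on one dataset would not by itself establish the claim.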
Original abstract
The article observes data analysis of 286 multi-word expressions (MWEs) based on 16 lexical, grammatical and other criteria described in theoretical books and papers on the notion of idiomaticity. MWEs were collected from the same theoretical sources, and a set of experts in linguistics annotated them with these categories. The distribution of categories shows that there are no absolutely idiomatic expressions. Lexical criteria seem to be the most influential; grammatical criteria are bound to certain conditions; presence of obsolete words and grammar influence ability of an MWE to be replaced with one word.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper collects 286 multi-word expressions (MWEs) from theoretical linguistics sources and has linguistics experts annotate them according to 16 lexical, grammatical, and other criteria drawn from the literature. Analysis of the resulting annotation distributions supports the claims that no MWEs qualify as absolutely idiomatic, that lexical criteria exert the strongest influence, that grammatical criteria apply only under specific conditions, and that the presence of obsolete words or grammar affects an MWE's replaceability by a single word.
Significance. If the annotation protocol were fully operationalized and shown to be reliable, the work would offer a useful empirical test of theoretical criteria for idiomaticity and could inform both linguistic theory and computational models of MWEs. At present the absence of a clear mapping from the 16 criteria to the notion of 'absolute idiomaticity' and the lack of reliability metrics limit the strength of the distributional conclusions.
major comments (2)
- [Abstract] Abstract: the claim that 'there are no absolutely idiomatic expressions' is not logically entailed by the reported distributions without an explicit threshold or combination rule (e.g., positive annotation on all 16 criteria, or a minimum aggregate score) that defines absolute idiomaticity. The manuscript must specify this mapping before the zero-count observation can support the stated conclusion.
- [Annotation procedure] Annotation procedure (presumably §3 or §4): no information is given on inter-annotator agreement, the precise operational definitions applied to each of the 16 criteria, how disagreements were resolved, or any statistical tests used to rank criterion influence. These details are load-bearing for the reliability of the distribution claims and the assertion that lexical criteria are most influential.
minor comments (1)
- [Abstract] The abstract and summary statements would benefit from a brief table or figure summarizing the 16 criteria and their observed frequencies across the 286 MWEs.
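The summary table suggested in the minor comment could be produced along these lines. This is a sketch under stated assumptions: annotations are stored as a mapping from each expression to the names of the criteria it was marked with, and the criterion names and expressions below are placeholders, not the paper's actual labels.

```python
from collections import Counter
from typing import Dict, List


def criterion_frequencies(annotations: Dict[str, List[str]]) -> Counter:
    """Count how many expressions were marked with each criterion."""
    counts: Counter = Counter()
    for criteria in annotations.values():
        counts.update(set(criteria))  # set(): count a criterion once per expression
    return counts


def format_table(counts: Counter, total: int) -> str:
    """Render counts and shares as a plain-text table, most frequent first."""
    lines = [f"{'criterion':<20}{'count':>6}{'share':>8}"]
    for name, n in counts.most_common():
        lines.append(f"{name:<20}{n:>6}{n / total:>8.1%}")
    return "\n".join(lines)


# Toy data with hypothetical criterion names, not taken from the paper.
toy = {
    "expression_a": ["lexical_1", "grammatical_1"],
    "expression_b": ["lexical_1"],
    "expression_c": ["lexical_1", "other_1"],
}
counts = criterion_frequencies(toy)
print(format_table(counts, total=len(toy)))
```

With the real data, `total` would be 286 and the table would have one row per criterion.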
Simulated Author's Rebuttal
We thank the referee for their careful and constructive review of our manuscript. Below we provide point-by-point responses to the major comments and indicate how we plan to revise the paper.
- Referee: [Abstract] Abstract: the claim that 'there are no absolutely idiomatic expressions' is not logically entailed by the reported distributions without an explicit threshold or combination rule (e.g., positive annotation on all 16 criteria, or a minimum aggregate score) that defines absolute idiomaticity. The manuscript must specify this mapping before the zero-count observation can support the stated conclusion.
  Authors: We agree that specifying the mapping from the criteria to absolute idiomaticity is essential for the validity of our conclusion. We will revise the manuscript to explicitly state that an expression is considered absolutely idiomatic only if it receives annotations indicating idiomaticity on every one of the 16 criteria. Given that our annotations of the 286 MWEs yielded no instances meeting this condition, the data supports the claim of no absolutely idiomatic expressions. This clarification will be added to the abstract and the results section.
  Revision: yes
- Referee: [Annotation procedure] Annotation procedure (presumably §3 or §4): no information is given on inter-annotator agreement, the precise operational definitions applied to each of the 16 criteria, how disagreements were resolved, or any statistical tests used to rank criterion influence. These details are load-bearing for the reliability of the distribution claims and the assertion that lexical criteria are most influential.
  Authors: The referee is correct that the annotation procedure section lacks several key details. We will expand this section in the revised manuscript to include the precise operational definitions for each of the 16 criteria, which are based directly on the theoretical linguistics literature cited in the paper. We will also detail how disagreements among the expert annotators were resolved through iterative discussions leading to consensus. Additionally, we will describe the analytical approach used to assess the relative influence of lexical versus grammatical criteria, including the distributional comparisons performed. However, formal inter-annotator agreement statistics were not computed as part of the original study, limiting our ability to report them.
  Revision: partial
- Formal inter-annotator agreement metrics
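For reference, one common agreement statistic for more than two annotators is Fleiss' kappa. The sketch below is an illustration only: the paper reports no such computation, the number of annotators is not stated, and the function name and input layout are assumptions made here. It expects, for each item (for example, one criterion judged on one expression), the number of annotators who chose each category before any consensus discussion.

```python
from typing import Sequence


def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa; ratings[i][j] = annotators who put item i in category j."""
    if not ratings:
        raise ValueError("need at least one item")
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    n_categories = len(ratings[0])

    # Share of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Observed agreement: per-item pairwise agreement, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Agreement expected by chance from the category shares.
    p_exp = sum(p * p for p in p_cat)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_obs - p_exp) / (1 - p_exp)


# Toy data, not from the paper: 3 annotators, categories [not satisfied, satisfied].
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
print(fleiss_kappa(perfect))  # prints 1.0
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance. Since the authors describe resolving disagreements by consensus, such a statistic would require the annotators' independent labels from before that discussion.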
Circularity Check
Empirical annotation study with no circular derivation
The paper selects 286 MWEs and 16 criteria from existing theoretical linguistics sources, then obtains fresh expert annotations on those criteria for the collected expressions. The central claim that no expressions are absolutely idiomatic is presented as following directly from the resulting category distributions in this new annotated dataset. No equations, fitted parameters, self-citations, or uniqueness theorems are invoked to derive the result; the observations rest on independent expert judgments rather than reducing to the paper's own inputs or prior author work by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Expert linguists can reliably and consistently apply the 16 theoretical criteria to MWEs.
- domain assumption: The 286 MWEs collected from theoretical sources form a representative sample for studying idiomaticity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  unclear: Relation between the paper passage and the cited Recognition theorem.
The distribution of categories shows that there are no absolutely idiomatic expressions. ... vector sum ... higher is the vector sum of an annotated MWE, the more idiomatic an MWE is.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.