Estimating Commonsense Plausibility through Semantic Shifts
Pith reviewed 2026-05-23 02:36 UTC · model grok-4.3
The pith
ComPaSS quantifies commonsense plausibility by the magnitude of semantic shifts after sentence augmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ComPaSS quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information; plausible augmentations induce minimal shifts while implausible ones result in substantial deviations, and this yields stronger fine-grained discrimination than generative methods across language and vision-language backbones.
What carries the argument
ComPaSS framework, which scores plausibility by the size of the semantic shift in model embeddings after augmentation with commonsense information.
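The shift-based scoring idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: `toy_embed` is a hypothetical hashed bag-of-words encoder standing in for a real LM or VLM backbone, and `semantic_shift` and `rank_by_plausibility` are illustrative names. Using cosine distance as the shift measure is also an assumption.

```python
import hashlib
import math


def toy_embed(text, dim=64):
    """Hypothetical stand-in encoder: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        # Use a stable hash so results do not vary between runs.
        idx = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[idx] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def semantic_shift(anchor, augmented, embed=toy_embed):
    """Cosine distance between the anchor sentence and its augmented version."""
    a, b = embed(anchor), embed(augmented)
    cosine = sum(x * y for x, y in zip(a, b))
    return 1.0 - cosine


def rank_by_plausibility(anchor, augmented_candidates, embed=toy_embed):
    """Sort candidates from smallest shift (most plausible) to largest."""
    scored = [(semantic_shift(anchor, c, embed), c) for c in augmented_candidates]
    return sorted(scored)
```

The toy encoder only reflects word overlap, so it does not capture commonsense. To try the idea with a real backbone, pass a sentence-embedding function as the `embed` argument.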
If this is right
- ComPaSS outperforms generative baselines on two types of fine-grained commonsense plausibility tasks across LLMs and VLMs.
- VLMs combined with ComPaSS outperform LMs on vision-grounded commonsense tasks.
- Contrastive pre-training on the backbone improves capture of semantic nuances and thereby strengthens ComPaSS performance.
Where Pith is reading between the lines
- The same shift-based signal could be tested on other reasoning domains such as factual consistency or causal inference.
- The method supplies a potential unsupervised signal for detecting when a model lacks relevant knowledge.
- Varying the source or scale of the augmentation information offers a direct route to measure robustness of the shift metric.
Load-bearing premise
The magnitude of the semantic shift after augmentation with commonsense-related information reliably separates plausible statements from implausible ones.
What would settle it
A controlled experiment in which known plausible statements produce larger embedding shifts than known implausible statements after the same augmentation procedure.
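One way to summarise such an experiment is a pairwise separation score, sketched below. The function name `pairwise_separation` and the choice of statistic are assumptions for illustration; the paper may evaluate separation differently. The score is the fraction of (plausible, implausible) pairs in which the plausible statement has the smaller shift, with ties counted as half.

```python
def pairwise_separation(plausible_shifts, implausible_shifts):
    """Fraction of pairs where the plausible shift is smaller (ties count half)."""
    if not plausible_shifts or not implausible_shifts:
        raise ValueError("both groups need at least one shift value")
    wins = 0.0
    for p in plausible_shifts:
        for q in implausible_shifts:
            if p < q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(plausible_shifts) * len(implausible_shifts))
```

A score near 1.0 would be consistent with the load-bearing premise, a score near 0.5 would indicate no separation, and a score well below 0.5 would correspond to the falsifying outcome described above.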
Original abstract
Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches--reliant on likelihoods or verbalized judgments--struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. It demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs yield superior performance to LMs, when integrated with ComPaSS, on vision-grounded commonsense tasks; (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ComPaSS, a discriminative framework for commonsense plausibility estimation that measures semantic shifts induced by augmenting input sentences with commonsense-related information. Plausible augmentations are claimed to produce minimal shifts while implausible ones produce large deviations. The manuscript reports that ComPaSS outperforms generative baselines on two types of fine-grained plausibility tasks across LLM and VLM backbones, with further gains from VLMs on vision-grounded tasks and from contrastive pre-training.
Significance. If the empirical claims are robust, the work supplies evidence that a shift-based discriminative metric can outperform likelihood- or verbalization-based generative methods on fine-grained commonsense discrimination and identifies practical advantages for VLMs and contrastively trained encoders.
major comments (1)
- [Abstract] The central premise, that the magnitude of the semantic shift after augmentation reliably separates plausible from implausible statements, requires explicit clarification of the provenance of the augmentation text. If the commonsense-related information is itself produced by the backbone (or a similar LM), the measured shift largely reflects the model's internal consistency rather than an independent commonsense signal; this circularity risk directly undermines the claim that the discriminative shift metric is the source of the reported gains over generative baselines.
minor comments (1)
- The abstract refers to “two types of fine-grained commonsense plausibility estimation tasks” without naming the tasks or citing the datasets; this information should appear in the introduction or experimental setup.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the opportunity to address the concern about augmentation provenance. We provide clarification below and will revise the abstract accordingly.
Point-by-point responses
- Referee: [Abstract] The central premise, that the magnitude of the semantic shift after augmentation reliably separates plausible from implausible statements, requires explicit clarification of the provenance of the augmentation text. If the commonsense-related information is itself produced by the backbone (or a similar LM), the measured shift largely reflects the model's internal consistency rather than an independent commonsense signal; this circularity risk directly undermines the claim that the discriminative shift metric is the source of the reported gains over generative baselines.
  Authors: The commonsense-related augmentation text is sourced from external, independent knowledge bases (ConceptNet and similar structured resources) rather than generated by the backbone model or any similar LM. This design ensures the measured semantic shift reflects alignment with an external commonsense signal, not internal model consistency. Consequently, the reported gains over generative baselines arise from the discriminative shift metric itself. We will revise the abstract to state this provenance explicitly.
  Revision: yes
Circularity Check
No significant circularity; ComPaSS is a self-contained discriminative proposal
Full rationale
The paper's central derivation defines plausibility via measured semantic shift magnitude after augmentation with commonsense information, with plausible cases expected to show minimal shift. This premise is introduced as a novel discriminative framework and does not reduce by the paper's own equations or citations to quantities already fitted from the target data. No self-citation chain is load-bearing for the uniqueness of the shift metric, no fitted parameter is relabeled as a prediction, and the augmentation/shift procedure is not shown to be definitionally equivalent to the plausibility labels. The method remains independent of the reported evaluation outcomes.
Figure 5: The prompt for converting a question-answer pair into a sentence. The blue part is the instruction, the green part is the 3-shot example, and the red part is the placeholder for the specific input.
[...]
Answers 1: 1. person, 2. chauffeur, 3. taxi driver, 4. a person, 5. or a driver.
Sentences 1:
1. [...]
2. A chauffeur was driving through the night, shooting blurred lights out of focus.
3. A taxi driver was driving through the night, shooting blurred lights out of focus.
4. A person was driving through the night, shooting blurred lights out of focus.
5. A driver was driving through the night, shooting blurred lights out of focus.
Question 2: why would a goat eat hay in a stable?
Answers 2: 1. gain energy, 2. to fulfill hunger, 3. to get nutrition, 4. get nutrition
Sentences 2:
1. a goat eats hay in a stable to gain energy.
2. a goat eats hay in a stable to fulfill hunger.
3. a goat eats hay in a stable to get nutrition.
4. a goat eats hay in a stable to get nutrition.
Question 3: why would an aircraft receive fuel from a cargo aircraft?
Answers 3: 1. longer flight times, 2. takeoff, 3. traveling, 4. enable travel, 5. refill fuel
Sentences 3:
1. an aircraft receives fuel from a cargo aircraft because of longer flight times.
2. an aircraft receives fuel from a cargo aircraft for takeoff.
3. an aircraft receives fuel from a cargo aircraft for traveling.
4. an aircraft receives fuel from a cargo aircraft to enable travel.
5. an aircraft receives fuel from a cargo aircraft to refill fuel.
New Task:
Question 4: <Q>
Answers 4: <A>
Sentences 4:
Property Templates for anchor Templ...