
arxiv: 2502.13464 · v2 · submitted 2025-02-19 · 💻 cs.CL · cs.AI

Estimating Commonsense Plausibility through Semantic Shifts

Pith reviewed 2026-05-23 02:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords commonsense plausibility · semantic shifts · ComPaSS · discriminative framework · language models · vision-language models · sentence augmentation

The pith

ComPaSS quantifies commonsense plausibility by the magnitude of semantic shifts after sentence augmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ComPaSS as a discriminative framework for estimating how plausible statements are under commonsense. It augments input sentences with commonsense-related information and tracks the resulting change in semantic embeddings. Plausible statements produce only small shifts while implausible ones produce large deviations. This approach is shown to outperform likelihood-based and verbalized generative baselines on two kinds of fine-grained tasks, and it benefits from vision-language models on grounded cases as well as from contrastive pre-training.

Core claim

ComPaSS quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information; plausible augmentations induce minimal shifts while implausible ones result in substantial deviations, and this yields stronger fine-grained discrimination than generative methods across language and vision-language backbones.

What carries the argument

ComPaSS framework, which scores plausibility by the size of the semantic shift in model embeddings after augmentation with commonsense information.
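
The scoring idea can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the distance measure or the encoder, so the cosine-based `cosine_shift`, the placeholder encoder `toy_embed`, and the helper `rank_by_shift` are all assumptions introduced here. In practice `toy_embed` would be swapped for a real LM or VLM sentence encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_shift(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity (assumed shift measure; 0 means no shift)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cannot measure a shift for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def toy_embed(sentence: str, dim: int = 64) -> List[float]:
    """Hypothetical stand-in encoder: hashed bag of words, NOT a semantic model."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        # Sum of code points gives a hash that is stable across processes.
        vec[sum(ord(c) for c in token) % dim] += 1.0
    return vec


def rank_by_shift(
    anchor: str,
    augmented: Sequence[str],
    embed: Callable[[str], Vector] = toy_embed,
) -> List[Tuple[float, str]]:
    """Sort augmented sentences by shift from the anchor, smallest first.

    Under the paper's premise, a smaller shift means a more plausible
    augmentation, so the first item is the most plausible candidate.
    """
    base = embed(anchor)
    return sorted((cosine_shift(base, embed(s)), s) for s in augmented)


if __name__ == "__main__":
    for shift, sentence in rank_by_shift(
        "a sheep", ["a white sheep", "a green sheep", "a black sheep"]
    ):
        print(f"{shift:.4f}  {sentence}")
```

With the placeholder encoder the ordering carries no commonsense meaning; the sketch only shows the shape of the computation (embed the anchor, embed each augmented sentence, rank by distance).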

If this is right

  • ComPaSS outperforms generative baselines on two types of fine-grained commonsense plausibility tasks across LLMs and VLMs.
  • VLMs combined with ComPaSS outperform LMs on vision-grounded commonsense tasks.
  • Contrastive pre-training on the backbone improves capture of semantic nuances and thereby strengthens ComPaSS performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same shift-based signal could be tested on other reasoning domains such as factual consistency or causal inference.
  • The method supplies a potential unsupervised signal for detecting when a model lacks relevant knowledge.
  • Varying the source or scale of the augmentation information offers a direct route to measure robustness of the shift metric.

Load-bearing premise

The magnitude of the semantic shift after augmentation with commonsense-related information reliably separates plausible statements from implausible ones.

What would settle it

A controlled experiment in which known plausible statements produce larger embedding shifts than known implausible statements after the same augmentation procedure.
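
One way to read such an experiment is to compare shifts pairwise. The sketch below is an assumption about how the check could be scored, not a procedure from the paper: the hypothetical `separation_rate` returns the fraction of (plausible, implausible) pairs in which the plausible statement has the smaller shift, which is the same quantity as a pairwise AUC.

```python
from itertools import product
from typing import Sequence


def separation_rate(
    plausible_shifts: Sequence[float],
    implausible_shifts: Sequence[float],
) -> float:
    """Fraction of pairs where the plausible shift is smaller; ties count half."""
    if not plausible_shifts or not implausible_shifts:
        raise ValueError("both groups must contain at least one shift value")
    wins = 0.0
    for p, q in product(plausible_shifts, implausible_shifts):
        if p < q:
            wins += 1.0
        elif p == q:
            wins += 0.5
    return wins / (len(plausible_shifts) * len(implausible_shifts))
```

A rate near 1 would support the load-bearing premise, a rate near 0.5 would mean the shift carries no signal, and a rate consistently below 0.5 would be the falsifying outcome described above.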

Figures

Figures reproduced from arXiv: 2502.13464 by Jiafeng Guo, Keping Bi, Wanqing Cui, Wei Huang, Xueqi Cheng.

Figure 1: How ComPaSS works on different tasks.
Figure 2: Binary classification accuracy of models with ComPaSS on different groups.
Figure 3: ComPaSS performance with different template types and template ensemble settings.
Figure 4: The ranking of sheep colors by humans and different models, along with corresponding images from the …
Figure 5: The prompt for converting question-answer pair into sentence. The blue part is the instruction, the green …
Figure 6: The prompt for attribute value ranking task and commonsense frame completion task.
Original abstract

Commonsense plausibility estimation is critical for evaluating language models (LMs), yet existing generative approaches--reliant on likelihoods or verbalized judgments--struggle with fine-grained discrimination. In this paper, we propose ComPaSS, a novel discriminative framework that quantifies commonsense plausibility by measuring semantic shifts when augmenting sentences with commonsense-related information. Plausible augmentations induce minimal shifts in semantics, while implausible ones result in substantial deviations. Evaluations on two types of fine-grained commonsense plausibility estimation tasks across different backbones, including LLMs and vision-language models (VLMs), show that ComPaSS consistently outperforms baselines. It demonstrates the advantage of discriminative approaches over generative methods in fine-grained commonsense plausibility evaluation. Experiments also show that (1) VLMs yield superior performance to LMs, when integrated with ComPaSS, on vision-grounded commonsense tasks. (2) contrastive pre-training sharpens backbone models' ability to capture semantic nuances, thereby further enhancing ComPaSS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes ComPaSS, a discriminative framework for commonsense plausibility estimation that measures semantic shifts induced by augmenting input sentences with commonsense-related information. Plausible augmentations are claimed to produce minimal shifts while implausible ones produce large deviations. The manuscript reports that ComPaSS outperforms generative baselines on two types of fine-grained plausibility tasks across LLM and VLM backbones, with further gains from VLMs on vision-grounded tasks and from contrastive pre-training.

Significance. If the empirical claims are robust, the work supplies evidence that a shift-based discriminative metric can outperform likelihood- or verbalization-based generative methods on fine-grained commonsense discrimination and identifies practical advantages for VLMs and contrastively trained encoders.

major comments (1)
  1. [Abstract] Abstract: the central premise—that magnitude of semantic shift after augmentation reliably separates plausible from implausible statements—requires explicit clarification on the provenance of the augmentation text. If the commonsense-related information is itself produced by the backbone (or a similar LM), the measured shift largely reflects the model’s internal consistency rather than an independent commonsense signal; this circularity risk directly undermines the claim that the discriminative shift metric is the source of the reported gains over generative baselines.
minor comments (1)
  1. The abstract refers to “two types of fine-grained commonsense plausibility estimation tasks” without naming the tasks or citing the datasets; this information should appear in the introduction or experimental setup.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and the opportunity to address the concern about augmentation provenance. We provide clarification below and will revise the abstract accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central premise—that magnitude of semantic shift after augmentation reliably separates plausible from implausible statements—requires explicit clarification on the provenance of the augmentation text. If the commonsense-related information is itself produced by the backbone (or a similar LM), the measured shift largely reflects the model’s internal consistency rather than an independent commonsense signal; this circularity risk directly undermines the claim that the discriminative shift metric is the source of the reported gains over generative baselines.

    Authors: The commonsense-related augmentation text is sourced from external, independent knowledge bases (ConceptNet and similar structured resources) rather than generated by the backbone model or any similar LM. This design ensures the measured semantic shift reflects alignment with an external commonsense signal, not internal model consistency. Consequently, the reported gains over generative baselines arise from the discriminative shift metric itself. We will revise the abstract to state this provenance explicitly.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; ComPaSS is a self-contained discriminative proposal

full rationale

The paper's central derivation defines plausibility via measured semantic shift magnitude after augmentation with commonsense information, with plausible cases expected to show minimal shift. This premise is introduced as a novel discriminative framework and does not reduce by the paper's own equations or citations to quantities already fitted from the target data. No self-citation chain is load-bearing for the uniqueness of the shift metric, no fitted parameter is relabeled as a prediction, and the augmentation/shift procedure is not shown to be definitionally equivalent to the plausibility labels. The method remains independent of the reported evaluation outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no concrete free parameters, axioms, or invented entities; full text would be required to audit any modeling choices or background assumptions.


