
arxiv: 2605.22542 · v2 · pith:FBZT6EPUnew · submitted 2026-05-21 · 💻 cs.CL

Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

Pith reviewed 2026-05-22 06:32 UTC · model grok-4.3

classification 💻 cs.CL
keywords: scene abstraction · lexical semantics · situated meaning · interpretive scenes · contextual scenes · expression profiles · few-shot prompting · human evaluation

The pith

Structured scene representations capture the situated meanings of words more effectively than standard embeddings or ATOMIC-based profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Scene Abstraction, a framework for building structured representations of the interpretive scenes that words evoke in usage contexts. Each scene comprises a Contextual Scene (events, entities, and setting) and an Expression Profile (engaged events, generalizable properties, and evoked emotions), both built by few-shot prompting of a large language model. On a new dataset of 520 real usage instances covering 26 keywords, human observers identify the scenes at 82.4 percent accuracy, 11.8 percentage points above text-only embeddings, and prefer the scene profiles over ATOMIC-based alternatives 86.4 percent of the time as matches to their interpretation of words in context. The results suggest that situated dimensions of meaning are real and systematic and can be made explicit in computational representations. This matters because most current lexical models leave these dimensions implicit, which limits how well systems handle context-dependent word use.

Core claim

Scene Abstraction is a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Empirical evidence from two experiments on the COCA-Scenes dataset of 520 usage instances across 26 keywords shows that scenes are reliably identifiable across human observers at 82.4 percent accuracy, exceeding text-only embeddings by 11.8 percentage points, and that scene profiles align more closely with human word-a

What carries the argument

The Scene Abstraction framework, which decomposes word meaning into a Contextual Scene (Events, Entities, Setting) and an Expression Profile (Engaged events, Generalizable properties, Evoked emotions) constructed via few-shot prompting of a large language model.
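
The two-part decomposition can be pictured as a small data structure. The sketch below is only an illustration: the component names come from the paper, but the Python types (lists of strings, a single setting string), the class names, and the example values are assumptions made here, and the example is invented rather than taken from COCA-Scenes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextualScene:
    """Overall situation described by the usage context u."""
    events: List[str] = field(default_factory=list)
    entities: List[str] = field(default_factory=list)
    setting: str = ""


@dataclass
class ExpressionProfile:
    """Scene-grounded meaning of the target expression x."""
    engaged_events: List[str] = field(default_factory=list)
    generalizable_properties: List[str] = field(default_factory=list)
    evoked_emotions: List[str] = field(default_factory=list)


@dataclass
class Scene:
    """S(u, x): a Contextual Scene C paired with an Expression Profile E."""
    usage_context: str
    target: str
    contextual: ContextualScene
    profile: ExpressionProfile


# Invented illustration; not an instance from the COCA-Scenes dataset.
example = Scene(
    usage_context="She grabbed a coffee before the morning meeting.",
    target="coffee",
    contextual=ContextualScene(
        events=["buying a drink", "heading to a meeting"],
        entities=["she", "coffee", "meeting"],
        setting="workday morning",
    ),
    profile=ExpressionProfile(
        engaged_events=["being grabbed on the way"],
        generalizable_properties=["quick", "routine"],
        evoked_emotions=["alertness"],
    ),
)
```

Keeping the Contextual Scene and the Expression Profile as separate types mirrors the paper's split between the situation as a whole and the target word's role in it.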

If this is right

  • Human observers identify scenes from usage instances at 82.4 percent accuracy, exceeding text-only embeddings by 11.8 percentage points.
  • Scene profiles are preferred 86.4 percent of the time over ATOMIC-based alternatives across three semantic dimensions of human interpretation.
  • The COCA-Scenes dataset of 520 usage instances across 26 keywords provides a benchmark for testing situated lexical representations.
  • Structured scene representations make implicit situated dimensions of word meaning explicit and usable in computational models.
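
Two derived quantities follow from the reported figures by simple arithmetic. The short calculation below is a reading aid, not a result stated in the paper: the baseline accuracy is implied by the reported gain, and the per-keyword count is only an average, since the summary does not say whether instances are split evenly across keywords.

```python
# Figures as reported in the summary above.
scene_accuracy = 82.4      # percent, human scene identification
gain_pp = 11.8             # percentage points over text-only embeddings
instances, keywords = 520, 26

# Implied accuracy of the text-only embedding baseline (not stated directly).
implied_baseline = round(scene_accuracy - gain_pp, 1)

# Average instances per keyword; an even split is an assumption.
mean_per_keyword = instances / keywords

print(implied_baseline)   # 70.6
print(mean_per_keyword)   # 20.0
```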

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to tasks like word sense disambiguation by grounding senses in typical evoked scenes rather than static definitions.
  • Scene profiles might improve dialogue systems by enabling generation of responses that better match the atmospheres and associations a speaker intends.
  • Testing the framework across languages could reveal whether evoked scenes vary systematically with cultural context.

Load-bearing premise

The interpretive scenes that words participate in can be accurately and consistently operationalized through few-shot prompting of a large language model.
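
To make the premise concrete, few-shot prompting generally means placing a handful of worked examples ahead of the new input. The sketch below shows that general pattern only. The instruction wording, the `Context:` / `Target:` / `Scene:` layout, and the function name `build_prompt` are all hypothetical; the paper's actual prompt (its Figure 2) is not reproduced here, and no model call is made.

```python
from typing import List, Tuple

# Placeholder instruction; NOT the paper's prompt.
INSTRUCTION = (
    "For the target expression in the context, describe the Contextual Scene "
    "(Events, Entities, Setting) and the Expression Profile "
    "(Engaged events, Generalizable properties, Evoked emotions)."
)

# One shot = (usage context, target expression, worked scene output).
Shot = Tuple[str, str, str]


def build_prompt(shots: List[Shot], context: str, target: str,
                 instruction: str = INSTRUCTION) -> str:
    """Assemble instruction, worked examples, and the new query into one prompt."""
    parts = [instruction.strip()]
    for shot_context, shot_target, shot_output in shots:
        parts.append(
            f"Context: {shot_context}\nTarget: {shot_target}\nScene:\n{shot_output}"
        )
    # The final block is left open for the model to complete.
    parts.append(f"Context: {context}\nTarget: {target}\nScene:")
    return "\n\n".join(parts)


# Invented shot and query, for illustration only.
shots = [("He poured tea for his guests.", "tea", "<worked scene for this shot>")]
prompt = build_prompt(shots, "She grabbed a coffee before the meeting.", "coffee")
```

Every choice in such a builder (which shots, how many, in what order) is one of the unstated decisions the premise depends on.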

What would settle it

A replication on a new set of words and contexts would falsify the central claim if human raters showed no preference for scene profiles over ATOMIC alternatives, or if scene identification accuracy fell to chance levels.

Figures

Figures reproduced from arXiv: 2605.22542 by Katrin Erk, Yejin Cho.

Figure 1: The Scene Abstraction framework. Given a usage context u and target expression x (coffee) in it, an LLM produces S(u, x) comprising a Contextual Scene C (Events, Entities, Setting) that captures the overall situation described by u, and an Expression Profile E (Engaged events, Generalizable properties, Evoked emotions) that characterizes the scene-grounded meaning of x.
Figure 2: The prompt instruction for the scene abstraction process.
Original abstract

Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Scene Abstraction, a framework for structured representations of situated lexical meaning. Each representation consists of a Contextual Scene (Events, Entities, Setting) and an Expression Profile (Engaged events, Generalizable properties, Evoked emotions), both operationalized via few-shot prompting of a large language model. The paper introduces the COCA-Scenes dataset (520 usage instances across 26 keywords) and reports two experiments claiming 82.4% human accuracy in identifying scenes (+11.8 pp over text-only embeddings) and 86.4% human preference for the generated scene profiles over ATOMIC-based alternatives across three semantic dimensions.

Significance. If the empirical results hold under reproducible conditions, the work would supply a useful structured alternative for capturing interpretive and situated dimensions of word meaning that are typically implicit in embeddings or knowledge bases. The new dataset and direct human preference comparisons constitute concrete contributions. The significance is currently limited by the absence of methodological specifics required to verify that the reported accuracies and preferences arise from stable scene representations rather than prompting artifacts.

major comments (2)
  1. [Abstract and §3] The central claims rest on scenes generated exclusively by few-shot LLM prompting, yet the manuscript provides no prompt templates, shot count, model version, temperature, or inter-run stability metrics. This information is load-bearing for interpreting the 82.4% identification accuracy and 86.4% preference results; without it, the human judgments could reflect LLM priors rather than the intended interpretive scenes.
  2. [Experiments] Experiments section (referenced in abstract): The abstract states concrete accuracy (82.4%) and preference (86.4%) figures from two experiments, but supplies no details on experimental design, statistical tests, participant demographics, task instructions, or how the scene profiles were presented to annotators. These omissions leave only moderate evidential support for the claims that scenes are reliably identifiable across observers and align more closely with human interpretation than ATOMIC alternatives.
minor comments (1)
  1. [Abstract] The abstract could more explicitly separate the three listed contributions (framework, dataset, empirical evidence) to improve readability.
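
One way to read "inter-run stability" in major comment 1 is as agreement between repeated generations for the same input. The sketch below uses mean pairwise Jaccard overlap between the item sets produced in each run. This is one possible metric chosen here for illustration, not a measure the paper or the referee specifies, and the run contents are invented.

```python
from itertools import combinations
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_jaccard(runs: List[Set[str]]) -> float:
    """Average Jaccard overlap over all unordered pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented outputs: evoked emotions from three repeated generations.
runs = [{"warmth", "focus"}, {"warmth", "focus"}, {"warmth", "calm"}]
stability = mean_pairwise_jaccard(runs)  # (1 + 1/3 + 1/3) / 3 = 5/9
```

Exact string matching is a crude proxy for free-text outputs, since paraphrases count as disagreement; a semantic similarity measure would be a natural alternative.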

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important gaps in methodological transparency. We agree that these details are essential for reproducibility and will revise the manuscript to address them directly.

Point-by-point responses
  1. Referee: [Abstract and §3] The central claims rest on scenes generated exclusively by few-shot LLM prompting, yet the manuscript provides no prompt templates, shot count, model version, temperature, or inter-run stability metrics. This information is load-bearing for interpreting the 82.4% identification accuracy and 86.4% preference results; without it, the human judgments could reflect LLM priors rather than the intended interpretive scenes.

    Authors: We agree that the absence of these implementation details limits the ability to verify the results. In the revised manuscript, we will add the complete prompt templates for both Contextual Scene and Expression Profile generation, specify the exact LLM (including version), number of shots, temperature setting, and include inter-run stability analysis by reporting consistency metrics across repeated generations with different random seeds. revision: yes

  2. Referee: [Experiments] Experiments section (referenced in abstract): The abstract states concrete accuracy (82.4%) and preference (86.4%) figures from two experiments, but supplies no details on experimental design, statistical tests, participant demographics, task instructions, or how the scene profiles were presented to annotators. These omissions leave only moderate evidential support for the claims that scenes are reliably identifiable across observers and align more closely with human interpretation than ATOMIC alternatives.

    Authors: We concur that fuller experimental details are needed to support the reported accuracies and preferences. The revised Experiments section will include participant demographics and recruitment procedures, complete task instructions and interface descriptions, details on how scene profiles were presented to annotators (including the ATOMIC comparison setup), and the specific statistical tests performed along with any significance results. revision: yes
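
As an example of the kind of statistical test the second response refers to, a preference rate can be checked against the 50% expected under no preference with an exact one-sided binomial test. The sketch below is an assumption about what such a test could look like, not the analysis the authors ran or promised. The counts are hypothetical, because the number of judgments behind the 86.4% figure is not given in this summary.

```python
from math import comb


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the exact one-sided p-value."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical counts for illustration only; not taken from the paper.
n_judgments = 10
n_prefer_scene = 9

p_value = binomial_tail(n_prefer_scene, n_judgments)  # 11/1024, about 0.0107
```

The test assumes independent judgments. If each annotator rates many items, or items are rated on all three semantic dimensions, that assumption fails and a model with annotator and item effects would be more appropriate.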

Circularity Check

0 steps flagged

No significant circularity; claims rest on new dataset and independent human judgments

Full rationale

The paper proposes a framework for scene abstraction, operationalizes it via few-shot LLM prompting to build the COCA-Scenes dataset of 520 instances, and validates via two fresh human experiments reporting 82.4% identification accuracy and 86.4% preference over ATOMIC. No equations, fitted parameters, or self-citations reduce the central empirical claims to inputs by construction. The LLM step is a generative method whose outputs are then tested against external human annotations, satisfying the criterion for self-contained evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the domain assumption that situated meaning decomposes cleanly into the listed scene components and that LLM few-shot prompting can extract them faithfully; no numerical free parameters are reported, but the prompting procedure itself introduces unstated choices.

axioms (1)
  • domain assumption: Situated word meaning can be decomposed into a Contextual Scene (Events, Entities, Setting) plus an Expression Profile (Engaged events, Generalizable properties, Evoked emotions).
    This decomposition is the core structuring choice of the proposed framework.
invented entities (2)
  • Contextual Scene (no independent evidence)
    purpose: To represent the situational context (events, entities, setting) in which a word is used.
    New conceptual container introduced by the framework; no independent evidence outside this work is cited.
  • Expression Profile (no independent evidence)
    purpose: To capture expression-centered aspects including engaged events, generalizable properties, and evoked emotions.
    New conceptual container introduced by the framework; no independent evidence outside this work is cited.

pith-pipeline@v0.9.0 · 5722 in / 1546 out tokens · 77321 ms · 2026-05-22T06:32:30.703749+00:00 · methodology

