Scene Abstraction for Lexical Semantics: Structured Representations of Situated Meaning

Katrin Erk; Yejin Cho

arxiv: 2605.22542 · v2 · pith:FBZT6EPUnew · submitted 2026-05-21 · 💻 cs.CL


Pith reviewed 2026-05-22 06:32 UTC · model grok-4.3

classification 💻 cs.CL

keywords: scene abstraction · lexical semantics · situated meaning · interpretive scenes · contextual scenes · expression profiles · few-shot prompting · human evaluation


The pith

Structured scene representations capture the situated meanings of words more effectively than standard embeddings or ATOMIC-based profiles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Scene Abstraction, a framework for building structured representations of the interpretive scenes that words evoke in usage contexts. Each scene comprises a Contextual Scene covering events, entities, and setting, plus an Expression Profile covering engaged events, generalizable properties, and evoked emotions, all built via few-shot prompting of a large language model. On a new dataset of 520 real usage instances across 26 keywords, human observers identify these scenes at 82.4 percent accuracy, 11.8 percentage points above text-only embeddings, and the profiles are preferred over ATOMIC-based alternatives 86.4 percent of the time as matches to human interpretations of words in context. The work argues that situated dimensions of meaning are real, systematic, and can be made explicit in computational representations. This matters because most current lexical models leave these aspects implicit, limiting how well systems grasp context-dependent word use.

Core claim

Scene Abstraction is a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Empirical evidence from two experiments on the COCA-Scenes dataset of 520 usage instances across 26 keywords shows that scenes are reliably identifiable across human observers at 82.4 percent accuracy, exceeding text-only embeddings by 11.8 percentage points, and that scene profiles align more closely with human word-a

What carries the argument

The Scene Abstraction framework, which decomposes word meaning into a Contextual Scene (Events, Entities, Setting) and an Expression Profile (Engaged events, Generalizable properties, Evoked emotions) constructed via few-shot prompting of a large language model.
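The two-part decomposition can be pictured as a small data structure. The sketch below is illustrative only: the class names, field names, and example values are assumptions made for this page, not the authors' schema or data from COCA-Scenes.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ContextualScene:
    """Overall situation described by the usage context."""
    events: list[str] = field(default_factory=list)
    entities: list[str] = field(default_factory=list)
    setting: str = ""


@dataclass
class ExpressionProfile:
    """Scene-grounded meaning of the target expression."""
    engaged_events: list[str] = field(default_factory=list)
    generalizable_properties: list[str] = field(default_factory=list)
    evoked_emotions: list[str] = field(default_factory=list)


@dataclass
class Scene:
    """One scene for a (usage context, target expression) pair."""
    usage_context: str
    target: str
    contextual_scene: ContextualScene
    expression_profile: ExpressionProfile

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses.
        return json.dumps(asdict(self), indent=2)


# Invented example values, not taken from the paper or its dataset.
example = Scene(
    usage_context="She grabbed a coffee on the way to the morning meeting.",
    target="coffee",
    contextual_scene=ContextualScene(
        events=["commuting to work", "buying a drink"],
        entities=["a commuter", "a coffee"],
        setting="weekday morning, on the way to an office",
    ),
    expression_profile=ExpressionProfile(
        engaged_events=["being bought", "being drunk on the go"],
        generalizable_properties=["quick", "energizing"],
        evoked_emotions=["alertness", "mild urgency"],
    ),
)
```

Keeping the Contextual Scene and the Expression Profile as separate objects mirrors the paper's split between the situation as a whole and the target expression's role in it.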

If this is right

Human observers identify scenes from usage instances at 82.4 percent accuracy, exceeding text-only embeddings by 11.8 percentage points.
Scene profiles are preferred 86.4 percent of the time over ATOMIC-based alternatives across three semantic dimensions of human interpretation.
The COCA-Scenes dataset of 520 usage instances across 26 keywords provides a benchmark for testing situated lexical representations.
Structured scene representations make implicit situated dimensions of word meaning explicit and usable in computational models.
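The headline numbers imply a few quantities that the summary does not state outright. The following is a minimal sketch of that arithmetic, assuming the 11.8 point gain is measured on the same accuracy scale as the 82.4 percent figure and that usage instances are spread evenly over keywords; neither assumption is confirmed on this page, and the variable names are illustrative.

```python
scene_accuracy = 82.4   # percent, as reported
gain_pp = 11.8          # percentage points over text-only embeddings, as reported
instances = 520         # usage instances in COCA-Scenes, as reported
keywords = 26           # keywords in COCA-Scenes, as reported

# Implied accuracy of the text-only embedding baseline (assumption: same metric).
baseline_accuracy = round(scene_accuracy - gain_pp, 1)

# The same gain expressed relative to the implied baseline.
relative_gain = round(100 * gain_pp / baseline_accuracy, 1)

# Average instances per keyword (assumption: even split).
per_keyword = instances / keywords

print(baseline_accuracy)  # 70.6
print(relative_gain)      # 16.7
print(per_keyword)        # 20.0
```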

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to tasks like word sense disambiguation by grounding senses in typical evoked scenes rather than static definitions.
Scene profiles might improve dialogue systems by enabling generation of responses that better match the atmospheres and associations a speaker intends.
Testing the framework across languages could reveal whether evoked scenes vary systematically with cultural context.

Load-bearing premise

The interpretive scenes that words participate in can be accurately and consistently operationalized through few-shot prompting of a large language model.
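To make the premise concrete, here is a minimal sketch of how a few-shot prompt is typically assembled: an instruction, a handful of worked examples, then the new query left open for the model to complete. The function name, the `Context` / `Target` / `Scene` labels, the instruction wording, and the examples are all hypothetical; the paper's actual prompt is the one shown in its Figure 2 and is not reproduced here.

```python
def build_few_shot_prompt(instruction, examples, usage_context, target):
    """Assemble instruction + worked examples + an open query (hypothetical layout)."""
    parts = [instruction.strip(), ""]
    for ex in examples:
        parts += [
            f"Context: {ex['context']}",
            f"Target: {ex['target']}",
            f"Scene: {ex['scene']}",
            "",
        ]
    # The final block has no scene; the model is expected to fill it in.
    parts += [f"Context: {usage_context}", f"Target: {target}", "Scene:"]
    return "\n".join(parts)


# Invented instruction and examples, for illustration only.
instruction = "Describe the scene evoked by the target expression in the given text."
examples = [
    {"context": "We lingered over tea in the garden.",
     "target": "tea",
     "scene": "a slow afternoon gathering outdoors"},
    {"context": "He downed a coffee before the exam.",
     "target": "coffee",
     "scene": "a rushed moment before a stressful task"},
]

prompt = build_few_shot_prompt(
    instruction, examples, "They shared a coffee after the funeral.", "coffee"
)
```

Because the output depends on which examples are chosen and how they are worded, this step is where the referee's stability concerns apply.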

What would settle it

A replication experiment on a new set of words and contexts where human raters show no preference for scene profiles over ATOMIC alternatives or where scene identification accuracy falls to chance levels would falsify the central claim.
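Whether accuracy "falls to chance levels" can be checked with an exact one-sided binomial test. The sketch below assumes a forced-choice task with four options (chance = 0.25) and 100 trials; both numbers are hypothetical, since the number of options and trials in the paper's identification task is not given on this page.

```python
from math import comb


def binomial_tail(successes: int, trials: int, p_chance: float) -> float:
    """Exact P(X >= successes) for X ~ Binomial(trials, p_chance)."""
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical design: 100 trials, 4 answer options per trial.
trials = 100
p_chance = 0.25

# 27 correct out of 100 is close to the 25 expected by guessing.
p_near_chance = binomial_tail(27, trials, p_chance)

# 82 correct out of 100 is far above what guessing would produce.
p_high = binomial_tail(82, trials, p_chance)
```

A large tail probability, as in the first case, means the result is compatible with guessing; a vanishingly small one, as in the second, is not.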

Figures

Figures reproduced from arXiv: 2605.22542 by Katrin Erk, Yejin Cho.

**Figure 1.** The Scene Abstraction framework. Given a usage context u and target expression x (coffee) in it, an LLM produces S(u, x) comprising a Contextual Scene C (Events, Entities, Setting) that captures the overall situation described by u, and an Expression Profile E (Engaged events, Generalizable properties, Evoked emotions) that characterizes the scene-grounded meaning of x. they reflect structured interpretive…

**Figure 2.** The prompt instruction for the scene abstraction process.

Original abstract

Coffee and tea share many properties, yet they evoke strikingly different situations, atmospheres, and affective associations. These situated dimensions of word meaning are real and systematic, but they remain implicit in most computational representations of lexical meaning. We propose Scene Abstraction, a framework for constructing structured representations of the interpretive scenes that words participate in across usage contexts. Each scene consists of a Contextual Scene (Events, Entities, Setting) and an expression-centered Expression Profile (Engaged events, Generalizable properties, Evoked emotions), operationalized through few-shot prompting of a large language model. Our contributions are three-fold: (1) a structured representation framework for situated lexical meaning; (2) COCA-Scenes, a dataset of 520 usage instances across 26 keywords for distinct scene identification; and (3) empirical evidence from two experiments suggesting that scenes are reliably identifiable across human observers (82.4% accuracy, +11.8 pp over text-only embeddings) and that our scene profiles more closely align with human interpretation of words in context than ATOMIC-based alternatives (86.4% preference across three semantic dimensions).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces Scene Abstraction to capture situated word meanings through LLM-generated Contextual Scenes and Expression Profiles, supported by a new dataset and human preference results, though the prompting details remain thin.


The main thing to know is that this paper puts forward Scene Abstraction as a way to build structured representations of the interpretive scenes words evoke in context. It combines a Contextual Scene covering events, entities, and setting with an Expression Profile covering engaged events, properties, and emotions. The authors generate these using few-shot LLM prompts and test them on a new COCA-Scenes dataset of 520 instances from 26 keywords. The reported results show 82.4% human accuracy in identifying scenes, an 11.8 point lift over text-only embeddings, and 86.4% preference for their profiles over ATOMIC across semantic dimensions.

This directly targets the gap between static embeddings and the situated, affective layers of meaning that examples like coffee versus tea highlight. The dataset and the head-to-head human comparisons against both embeddings and ATOMIC are concrete steps that go beyond prior work on commonsense graphs. The framework itself gives a clear structure for what counts as a scene, which could help in applications that need context-sensitive lexical understanding, such as narrative or dialogue systems.

The soft spot is the heavy reliance on few-shot LLM prompting to produce the scenes in the first place. No specifics appear on the exact prompt templates, number of shots, model version, or checks for output stability across runs. If the generated scenes shift with small prompt tweaks, the human alignment numbers could partly reflect the LLM's own priors rather than independent interpretive structure. The abstract supplies the accuracy and preference figures, but fuller experimental design details, participant information, and any statistical tests would make the evidence easier to assess.

The reasoning is straightforward and engages honestly with the existing lexical semantics and commonsense literature, without obvious internal contradictions. The paper is aimed at computational linguists working on richer representations for context-dependent tasks. A reader interested in moving past static vectors would get practical value from the dataset and the evaluation setup.

I would send it to peer review. The core idea and initial results are substantial enough to deserve referee time, with the main revisions likely focused on adding methodological transparency around the prompting process.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Scene Abstraction, a framework for structured representations of situated lexical meaning. Each representation consists of a Contextual Scene (Events, Entities, Setting) and an Expression Profile (Engaged events, Generalizable properties, Evoked emotions), both operationalized via few-shot prompting of a large language model. The paper introduces the COCA-Scenes dataset (520 usage instances across 26 keywords) and reports two experiments claiming 82.4% human accuracy in identifying scenes (+11.8 pp over text-only embeddings) and 86.4% human preference for the generated scene profiles over ATOMIC-based alternatives across three semantic dimensions.

Significance. If the empirical results hold under reproducible conditions, the work would supply a useful structured alternative for capturing interpretive and situated dimensions of word meaning that are typically implicit in embeddings or knowledge bases. The new dataset and direct human preference comparisons constitute concrete contributions. The significance is currently limited by the absence of methodological specifics required to verify that the reported accuracies and preferences arise from stable scene representations rather than prompting artifacts.

major comments (2)

[Abstract and §3] The central claims rest on scenes generated exclusively by few-shot LLM prompting, yet the manuscript provides no prompt templates, shot count, model version, temperature, or inter-run stability metrics. This information is load-bearing for interpreting the 82.4% identification accuracy and 86.4% preference results; without it, the human judgments could reflect LLM priors rather than the intended interpretive scenes.
[Experiments] Experiments section (referenced in abstract): The abstract states concrete accuracy (82.4%) and preference (86.4%) figures from two experiments, but supplies no details on experimental design, statistical tests, participant demographics, task instructions, or how the scene profiles were presented to annotators. These omissions leave only moderate evidential support for the claims that scenes are reliably identifiable across observers and align more closely with human interpretation than ATOMIC alternatives.
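One way to see why the missing design details matter: a reported proportion such as 86.4% carries very different uncertainty depending on how many judgments it rests on. The sketch below computes a Wilson score interval for a proportion. The sample sizes are hypothetical, because the number of judgments behind the reported figure is not stated here, and this is not presented as the test the authors ran.

```python
from math import sqrt


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical sample sizes for the same reported preference rate.
small_low, small_high = wilson_interval(0.864, 50)
large_low, large_high = wilson_interval(0.864, 1500)
```

With only a few dozen judgments the interval spans a wide range of plausible preference rates, while with over a thousand it is much tighter, which is why the number of annotators and items belongs in the report.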

minor comments (1)

[Abstract] The abstract could more explicitly separate the three listed contributions (framework, dataset, empirical evidence) to improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important gaps in methodological transparency. We agree that these details are essential for reproducibility and will revise the manuscript to address them directly.


Referee: [Abstract and §3] The central claims rest on scenes generated exclusively by few-shot LLM prompting, yet the manuscript provides no prompt templates, shot count, model version, temperature, or inter-run stability metrics. This information is load-bearing for interpreting the 82.4% identification accuracy and 86.4% preference results; without it, the human judgments could reflect LLM priors rather than the intended interpretive scenes.

Authors: We agree that the absence of these implementation details limits the ability to verify the results. In the revised manuscript, we will add the complete prompt templates for both Contextual Scene and Expression Profile generation, specify the exact LLM (including version), number of shots, temperature setting, and include inter-run stability analysis by reporting consistency metrics across repeated generations with different random seeds.
revision: yes

Referee: [Experiments] Experiments section (referenced in abstract): The abstract states concrete accuracy (82.4%) and preference (86.4%) figures from two experiments, but supplies no details on experimental design, statistical tests, participant demographics, task instructions, or how the scene profiles were presented to annotators. These omissions leave only moderate evidential support for the claims that scenes are reliably identifiable across observers and align more closely with human interpretation than ATOMIC alternatives.

Authors: We concur that fuller experimental details are needed to support the reported accuracies and preferences. The revised Experiments section will include participant demographics and recruitment procedures, complete task instructions and interface descriptions, details on how scene profiles were presented to annotators (including the ATOMIC comparison setup), and the specific statistical tests performed along with any significance results.
revision: yes
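The "consistency metrics across repeated generations" promised in the first response could take many forms. One simple option, sketched here as an assumption rather than as the authors' planned analysis, is the mean pairwise Jaccard overlap between the item sets produced for the same field on different runs. The function names and example outputs are hypothetical, and exact string matching is a crude proxy that would miss paraphrases.

```python
from itertools import combinations


def jaccard(a, b) -> float:
    """Jaccard overlap between two collections of strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty outputs agree trivially
    return len(set_a & set_b) / len(set_a | set_b)


def mean_pairwise_jaccard(runs) -> float:
    """Average Jaccard overlap over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure stability")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Invented outputs for one field (evoked emotions) across three runs.
runs = [
    ["warm", "calming"],
    ["calming", "warm"],
    ["warm", "bitter"],
]
stability = mean_pairwise_jaccard(runs)  # 5/9, about 0.556
```

A score near 1 would indicate that repeated generations yield nearly identical profiles; a low score would suggest that the human evaluation results depend on which sample happened to be shown.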

Circularity Check

0 steps flagged

No significant circularity; claims rest on new dataset and independent human judgments


The paper proposes a framework for scene abstraction, operationalizes it via few-shot LLM prompting to build the COCA-Scenes dataset of 520 instances, and validates via two fresh human experiments reporting 82.4% identification accuracy and 86.4% preference over ATOMIC. No equations, fitted parameters, or self-citations reduce the central empirical claims to inputs by construction. The LLM step is a generative method whose outputs are then tested against external human annotations, satisfying the criterion for self-contained evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the domain assumption that situated meaning decomposes cleanly into the listed scene components and that LLM few-shot prompting can extract them faithfully; no numerical free parameters are reported, but the prompting procedure itself introduces unstated choices.

axioms (1)

domain assumption: Situated word meaning can be decomposed into a Contextual Scene (Events, Entities, Setting) plus an Expression Profile (Engaged events, Generalizable properties, Evoked emotions).
This decomposition is the core structuring choice of the proposed framework.

invented entities (2)

Contextual Scene (no independent evidence)
purpose: To represent the situational context (events, entities, setting) in which a word is used.
New conceptual container introduced by the framework; no independent evidence outside this work is cited.

Expression Profile (no independent evidence)
purpose: To capture expression-centered aspects including engaged events, generalizable properties, and evoked emotions.
New conceptual container introduced by the framework; no independent evidence outside this work is cited.

pith-pipeline@v0.9.0 · 5722 in / 1546 out tokens · 77321 ms · 2026-05-22T06:32:30.703749+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: operationalized through few-shot prompting of a large language model... scene profiles more closely align with human interpretation... 86.4% preference

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Contextual Scene (Events, Entities, Setting) and Expression Profile

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.