arxiv: 2604.03877 · v1 · submitted 2026-04-04 · 💻 cs.CL · cs.AI· cs.LG

Recognition: no theorem link

When Models Know More Than They Say: Probing Analogical Reasoning in LLMs

Hope McGovern , Caroline Craig , Thomas Lippincott , Hale Sirin

Authors on Pith no claims yet

Pith reviewed 2026-05-13 16:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG

keywords analogical reasoningLLMsprobingpromptingnarrative analogiesrhetorical analogiesinternal representationstask-dependent performance

0 comments

The pith

LLMs internally represent more analogical reasoning than they express through prompting, especially for rhetorical analogies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the gap between what large language models know about analogies and what they reveal when directly prompted. It compares internal probing of model representations against standard prompting on tasks involving rhetorical and narrative analogies. For rhetorical analogies, probing yields significantly higher performance in open-source models, while narrative analogies show similarly low success rates in both approaches. This asymmetry suggests that prompting does not always fully access the information available in the model's internal states, and that this limitation varies by task type. Understanding this could help explain why LLMs struggle with deeper abstraction even when relevant knowledge is present.

Core claim

The authors establish that probing significantly outperforms prompting for rhetorical analogies in open-source LLMs, but both methods perform similarly poorly on narrative analogies. This reveals a task-dependent relationship between internal representations and prompted outputs, indicating that prompting may have limitations in surfacing available analogical knowledge.

What carries the argument

The probing of internal model representations for analogy detection, contrasted with direct prompting.

If this is right

Prompting techniques may underestimate the analogical capabilities encoded in LLMs for certain analogy types.
The performance difference highlights task-specific barriers in accessing internal knowledge through language generation.
Limitations in generalization to latent analogies persist regardless of access method.
Open-source models contain more usable information about rhetorical analogies than their prompted responses indicate.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improved methods for eliciting model knowledge, such as advanced probing, could enhance performance on rhetorical analogy tasks.
Narrative analogy detection may require models trained on deeper structural patterns rather than surface cues.
These results point to broader challenges in LLM interpretability, where internal states hold knowledge not reflected in outputs.
Future work could test whether fine-tuning on probed representations improves prompted performance.

Load-bearing premise

The probing technique accurately captures true latent analogical knowledge without introducing its own artifacts, and the analogy test cases cleanly distinguish rhetorical from narrative types based on surface versus latent information.

What would settle it

Observing no significant performance difference between probing and prompting on a refined set of analogies where potential artifacts in probing are controlled for, or finding that narrative analogies can be solved at high accuracy by prompting in newer models.

Figures

Figures reproduced from arXiv: 2604.03877 by Caroline Craig, Hale Sirin, Hope McGovern, Thomas Lippincott.

**Figure 2.** Figure 2: MAP for narrative (left) and rhetorical (right) parallelism tasks across different [PITH_FULL_IMAGE:figures/full_fig_p018_2.png] view at source ↗

**Figure 3.** Figure 3: Individual layer performance vs. all-layers configuration on Llama-3.2-1B base [PITH_FULL_IMAGE:figures/full_fig_p019_3.png] view at source ↗

**Figure 4.** Figure 4: MAP on prompted ranking across different models [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of branch sizes based on the number of spans per parallel set [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Distribution of proverb sizes based on the number of narratives per proverb [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: 1/3/8B Llama F1 vs Lit-Bank Task (Pooling: Mean, Classifier: MLP) [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗

**Figure 8.** Figure 8: 1/3/8B Llama F1 vs Lit-Bank Task (Pooling: Mean, Classifier: LogReg) [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗

**Figure 9.** Figure 9: 1/3/8B Llama F1 vs Lit-Bank Task (Pooling: Max, Classifier: MLP) [PITH_FULL_IMAGE:figures/full_fig_p021_9.png] view at source ↗

**Figure 10.** Figure 10: 1/3/8B Llama F1 vs Lit-Bank Task (Pooling: Max, Classifier: LogReg) [PITH_FULL_IMAGE:figures/full_fig_p022_10.png] view at source ↗

**Figure 11.** Figure 11: Similarity Scores across 446 ARN Document Pairs (Normalized with Min-Max [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗

**Figure 12.** Figure 12: Similarity Scores across 564 ASP Span Pairs (Normalized with Min-Max Scaling) [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗

read the original abstract

Analogical reasoning is a core cognitive faculty essential for narrative understanding. While LLMs perform well when surface and structural cues align, they struggle in cases where an analogy is not apparent on the surface but requires latent information, suggesting limitations in abstraction and generalisation. In this paper we compare a model's probed representations with its prompted performance at detecting narrative analogies, revealing an asymmetry: for rhetorical analogies, probing significantly outperforms prompting in open-source models, while for narrative analogies, they achieve a similar (low) performance. This suggests that the relationship between internal representations and prompted behavior is task-dependent and may reflect limitations in how prompting accesses available information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows probing outperforming prompting on rhetorical analogies but both weak on narrative ones, yet the lack of methods details leaves the asymmetry hard to trust.

read the letter

The main takeaway is that for rhetorical analogies probing beats prompting in open-source models, while both hit the same low performance on narrative analogies. This suggests prompting sometimes fails to surface knowledge that is already in the representations, and the gap depends on the type of analogy task. The authors frame it around surface versus latent cues, which is a reasonable way to think about abstraction limits in LLMs. What is actually new is the split result across analogy types rather than a blanket claim about all reasoning. They earn credit for running the direct comparison and tying it to narrative understanding without inflating the findings. The work stays grounded in the experimental contrast they describe. The soft spots are the missing pieces on how the datasets were constructed, how rhetorical and narrative cases were balanced for lexical or syntactic overlap, and what statistical tests support the outperformance claim. Without those, the stress-test concern holds weight: a shallow probe could simply exploit easier surface patterns in the rhetorical examples that prompting does not leverage the same way. No model sizes, example counts, or confound checks are mentioned in the abstract, so the central claim feels preliminary rather than robust. This is for researchers working on LLM interpretability and reasoning benchmarks. Someone already running probes on analogy tasks could pick up the asymmetry observation and test it with tighter controls. It deserves peer review because the core idea is worth checking with full methods and the paper engages the literature honestly, even if revisions on the experimental details will be needed.

Referee Report

2 major / 1 minor

Summary. The paper compares internal representations (via probing) with prompted outputs in LLMs on analogical reasoning tasks. It reports an asymmetry: probing significantly outperforms prompting on rhetorical analogies for open-source models, while both methods yield similarly low performance on narrative analogies. This is interpreted as evidence that the relationship between latent representations and prompted behavior is task-dependent and that prompting may fail to access available analogical information.

Significance. If the asymmetry holds after proper controls, the result would clarify how LLMs encode analogical structure and why prompting often underperforms relative to what representations contain, with direct implications for inference-time methods and evaluation of abstraction capabilities.

major comments (2)

[§4] §4 (Experimental Setup): The construction of rhetorical vs. narrative analogy test cases does not report explicit matching or balancing on lexical overlap, syntactic structure, or surface cue strength; without these controls the reported probing advantage on rhetorical cases is compatible with the probe exploiting easier surface signals that prompting must also generate from, rather than accessing genuinely latent analogical knowledge.
[§5] §5 (Results): The abstract and results sections provide no details on model sizes, number of test instances, statistical tests, confidence intervals, or effect sizes for the claimed performance gaps; this prevents assessment of whether the asymmetry is robust or driven by small samples or particular model families.

minor comments (1)

[Introduction] The abstract and introduction use 'narrative analogies' and 'rhetorical analogies' without a concise operational definition or example pair that distinguishes surface vs. latent cues; a short table of examples would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments, which have helped us strengthen the manuscript. We address each major comment below and have made revisions to improve clarity, controls, and reporting of results.

read point-by-point responses

Referee: [§4] §4 (Experimental Setup): The construction of rhetorical vs. narrative analogy test cases does not report explicit matching or balancing on lexical overlap, syntactic structure, or surface cue strength; without these controls the reported probing advantage on rhetorical cases is compatible with the probe exploiting easier surface signals that prompting must also generate from, rather than accessing genuinely latent analogical knowledge.

Authors: We acknowledge that the original manuscript did not explicitly report balancing or matching on lexical overlap, syntactic structure, or surface cue strength between the rhetorical and narrative test cases. This is a valid concern, as it leaves open the possibility that the probe benefits from surface-level features. In the revised version, we have added a new subsection detailing the construction process, including quantitative measures of lexical overlap (e.g., Jaccard similarity) and syntactic complexity (e.g., parse tree depth) across categories, along with an additional control experiment that regresses out surface features from the probe predictions. The asymmetry remains significant after these controls, supporting that the probe accesses deeper analogical structure. revision: yes
Referee: [§5] §5 (Results): The abstract and results sections provide no details on model sizes, number of test instances, statistical tests, confidence intervals, or effect sizes for the claimed performance gaps; this prevents assessment of whether the asymmetry is robust or driven by small samples or particular model families.

Authors: We agree that these statistical details are necessary for evaluating robustness. The revised manuscript now includes: (i) exact model sizes (e.g., 7B, 13B, and 70B parameter variants for the open-source models), (ii) the number of test instances (n=240 per category), (iii) results of paired t-tests with p-values for the probing vs. prompting gaps, (iv) 95% confidence intervals around all accuracy scores, and (v) effect sizes (Cohen's d > 0.8 for the rhetorical asymmetry). These additions confirm the asymmetry is statistically reliable and holds across model families rather than being an artifact of small samples. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison of probe vs prompt performance

full rationale

The paper reports an experimental comparison between probing internal representations and prompting for analogical detection, split by rhetorical vs narrative cases. The asymmetry result is presented as an observed performance difference from direct measurement, not derived from any equation, fitted parameter renamed as prediction, or self-citation chain. No self-definitional steps, ansatz smuggling, or uniqueness theorems appear in the abstract or described method. The work is self-contained as a standard probing study whose claims rest on the experimental data rather than reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only, no explicit free parameters, axioms, or invented entities are described; the work implicitly assumes standard probing validity in ML without detailing them.

pith-pipeline@v0.9.0 · 5406 in / 1052 out tokens · 49118 ms · 2026-05-13T16:47:16.605008+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

6 extracted references · 6 canonical work pages · 1 internal anchor

[1]

doi: 10.18653/v1/N19-1220

Association for Computational Linguistics. doi: 10.18653/v1/N19-1220. David Bamman, Olivia Lewke, and Anya Mansoor. An Annotated Dataset of Coreference in English Literature, May 2020. Yonatan Belinkov. Probing Classifiers: Promises, Shortcomings, and Advances.Com- putational Linguistics, 48(1):207–219, April 2022. ISSN 0891-2017, 1530-9312. doi: 10.1162/...

work page doi:10.18653/v1/n19-1220 2020
[2]

doi: 10.18653/v1/2022.acl-long.276

Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.276. Sil Hamilton, Matthew Wilkens, and Andrew Piper. Narrabench: A comprehensive frame- work for narrative benchmarking, 2025. URLhttps://arxiv.org/abs/2510.09869. Linyang He, Peili Chen, Ercong Nie, Yuanning Li, and Jonathan R. Brennan. Decoding Probing: Revealing Internal Linguis...

work page doi:10.18653/v1/2022.acl-long.276 2022
[3]

Hope McGovern, Hale Sirin, and Tom Lippincott

URLhttps://aclanthology.org/2024.ml4al-1.26/. Hope McGovern, Hale Sirin, and Tom Lippincott. Computational discovery of chias- mus in ancient religious text. In Luis Chiruzzo, Alan Ritter, and Lu Wang (eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associ- ation for Computational Linguistics: Human Language Technol...

work page doi:10.18653/v1/2025.naacl-short.13 2024
[4]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

doi: 10.1162/tacl_a_00688. Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, All...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1162/tacl_a_00688 2023
[5]

doi: 10.48550/arXiv.2310.01714

arXiv, March 2024. doi: 10.48550/arXiv.2310.01714. A Dataset Details We evaluate narrative and rhetorical parallelism using two complementary datasets that operationalize parallel structure at markedly different scales. Analogical Reasoning over Narratives.To probe analogical reasoning at the document level, we adopt the Analogical Reasoning over Narrativ...

work page doi:10.48550/arxiv.2310.01714 2024
[6]

A pair of spans (s1, s2) drawn from a shared discourse context (coreference and quote attribution)

A single span (s1) (event detection, entity detection) 2. A pair of spans (s1, s2) drawn from a shared discourse context (coreference and quote attribution). The classifier then predicts a binary label y∈0, 1 indicating whether the span (or span pair) satisfies the task-specific criterion. Importantly, these tasks donotrequire ranking; they serve to verif...

work page 2019