
arxiv: 2510.19028 · v3 · submitted 2025-10-21 · 💻 cs.CL

Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues

Pith reviewed 2026-05-18 04:19 UTC · model grok-4.3

keywords: social reasoning, LLM evaluation, dialogue understanding, relationship inference, multilingual, English, Korean, chain-of-thought

The pith

LLMs infer social relationships like friends or lovers at 75-80% accuracy in English but only 58-69% in Korean.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a dataset of 1,100 dialogues in English and Korean taken from movie scripts to test whether LLMs can correctly identify the social relationship between speakers in a conversation. Models reach reasonable accuracy on English examples yet perform noticeably worse on Korean ones, and techniques such as chain-of-thought prompting or thinking models add little benefit while sometimes increasing biased outputs. These findings matter because LLMs are now used in everyday interactions where correctly reading social signals affects how natural or appropriate the responses feel. The results indicate that current models still fall short on reliable social reasoning when language and cultural context change.

Core claim

We introduce the SCRIPTS dataset of 1.1k dialogues in English and Korean sourced from movie scripts and propose a task to evaluate LLMs on inferring social relationships between speakers. Nine models achieve 75-80% accuracy on the English portion and 58-69% on the Korean portion. Thinking models and chain-of-thought prompting provide minimal benefits for social reasoning and occasionally amplify social biases.

What carries the argument

The SCRIPTS dataset of movie-script dialogues paired with a task that requires models to classify the social relationship between the two speakers.

If this is right

  • LLMs still have clear limits when reasoning about social relationships in dialogue.
  • Performance drops when moving from English to Korean dialogues.
  • Chain-of-thought prompting and thinking models do not reliably raise accuracy on this task.
  • Certain prompting methods can increase the chance of biased relationship predictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real spoken exchanges outside scripted material may produce different error patterns if movie writing follows artificial social rules.
  • Closing the language gap may require training data that includes more varied everyday interactions rather than film dialogue alone.
  • Chat systems used across cultures could need targeted adjustments to avoid misreading social cues in non-English settings.

Load-bearing premise

Dialogues written for movies serve as a good stand-in for real interpersonal exchanges, and that stand-in works equally well in English and Korean.

What would settle it

Testing the same models on transcripts of actual spoken conversations in English and Korean and finding accuracy rates that differ substantially from the movie-script results.

Original abstract

As LLMs are increasingly deployed in real-world interactions, their social reasoning in interpersonal communication becomes critical. To explore their capabilities, we introduce SCRIPTS, a 1.1k-dialogue dataset in English and Korean, sourced from movie scripts and propose a social reasoning task based on SCRIPTS that evaluates the capacity of LLMs to infer the social relationships (e.g., friends, lovers) between speakers in each dialogue. Evaluating nine models on our task, current LLMs achieve around 75--80% on the English dataset and 58--69% in Korean, and models predict an Unlikely relationship in 10--25% of responses in both languages. Furthermore, we find that thinking models and chain-of-thought prompting provide minimal benefits for social reasoning and occasionally amplify social biases. In sum, there are significant limitations in current LLMs' social reasoning capabilities, especially for Korean, highlighting the need for efforts to develop socially-aware LLMs across languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the SCRIPTS dataset consisting of 1.1k dialogues sourced from movie scripts in both English and Korean. It proposes a task to evaluate LLMs' ability to infer social relationships (such as friends or lovers) between dialogue speakers. The authors evaluate nine LLMs, reporting accuracies of 75-80% on English and 58-69% on Korean, observe that models often predict 'Unlikely' relationships, and find that chain-of-thought prompting and thinking models provide minimal benefits while sometimes amplifying social biases. The conclusion emphasizes limitations in current LLMs' social reasoning, especially across languages.

Significance. This work offers a new multilingual benchmark for assessing social reasoning in LLMs, which is increasingly important as these models are used in interactive settings. The cross-lingual comparison between English and Korean is valuable, and the negative results on CoT and thinking models challenge assumptions about prompting strategies. If the dataset proves to be a reliable proxy, the findings could guide future development of socially aware AI systems. The concrete, empirical results are a strength, though their generalizability remains to be established.

major comments (3)
  1. [Dataset section (likely §3)] The central accuracy claims (75-80% English, 58-69% Korean) depend on the quality and validity of the SCRIPTS labels derived from movie scripts. However, movie scripts often follow narrative conventions and dramatic tropes that may not reflect natural interpersonal social reasoning, potentially making the task easier or biased toward explicit cues. This assumption is load-bearing for the generalization claims, particularly for the Korean data where cultural adaptation or translation could introduce additional distortions. The paper should include a discussion or validation study comparing script-based inferences to real-world dialogues.
  2. [Methods or Evaluation section (likely §4)] The reported accuracies lack error bars, confidence intervals, or statistical tests, and no details are provided on the label collection process, inter-annotator agreement, or baseline models (e.g., rule-based or majority-class predictors). Without these, it is difficult to assess the reliability and significance of the performance differences and the claim that CoT provides minimal benefits.
  3. [Results section (likely §5)] The finding that thinking models and CoT occasionally amplify social biases is interesting but requires more specific evidence, such as examples of biased predictions or quantitative measures of bias before and after CoT. Currently, it is stated at a high level without supporting data or analysis.
minor comments (2)
  1. [Abstract] The abstract mentions 'models predict an Unlikely relationship in 10--25% of responses' but does not specify what 'Unlikely' refers to in the task setup; clarifying the label set early would improve readability.
  2. [Throughout] Some figures or tables (if present) showing per-model breakdowns or error analyses would help readers interpret the aggregate accuracies.
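
Major comment 2 asks for a majority-class baseline and confidence intervals around the reported accuracies. The following is a minimal sketch of how both could be computed, assuming per-dialogue gold and predicted labels are available as parallel lists. The function names, the percentile bootstrap variant, and the toy labels are illustrative assumptions, not the paper's procedure or data.

```python
import random
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def majority_class_baseline(gold):
    """Accuracy obtained by always predicting the most frequent gold label."""
    most_common_count = Counter(gold).most_common(1)[0][1]
    return most_common_count / len(gold)


def bootstrap_ci(gold, pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy.

    Resamples items with replacement and takes the alpha/2 and
    1 - alpha/2 quantiles of the resampled accuracies.
    """
    rng = random.Random(seed)
    n = len(gold)
    correct = [int(g == p) for g, p in zip(gold, pred)]
    stats = []
    for _ in range(n_resamples):
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    low = stats[int((alpha / 2) * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Toy, made-up labels purely for illustration (not SCRIPTS data).
gold = ["friends"] * 6 + ["lovers"] * 4
pred = ["friends"] * 5 + ["lovers"] * 4 + ["friends"] * 1

acc = accuracy(gold, pred)
baseline = majority_class_baseline(gold)
ci_low, ci_high = bootstrap_ci(gold, pred)
print(f"accuracy={acc:.2f} baseline={baseline:.2f} CI=[{ci_low:.2f}, {ci_high:.2f}]")
```

With only ten toy items the interval is very wide; on a benchmark of roughly a thousand dialogues it would be much tighter, which is what makes the comparison against the baseline informative.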

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which have helped us identify areas to strengthen the rigor and clarity of our manuscript. We address each major comment in turn below.

Point-by-point responses
  1. Referee: [Dataset section (likely §3)] The central accuracy claims (75-80% English, 58-69% Korean) depend on the quality and validity of the SCRIPTS labels derived from movie scripts. However, movie scripts often follow narrative conventions and dramatic tropes that may not reflect natural interpersonal social reasoning, potentially making the task easier or biased toward explicit cues. This assumption is load-bearing for the generalization claims, particularly for the Korean data where cultural adaptation or translation could introduce additional distortions. The paper should include a discussion or validation study comparing script-based inferences to real-world dialogues.

    Authors: We agree that movie scripts can incorporate narrative conventions and dramatic tropes that differ from everyday social interactions, and that this may affect the generalizability of results, particularly for the Korean portion where translation and cultural adaptation could add further variables. In the revised manuscript we will add a dedicated limitations subsection to §3 that explicitly discusses these issues, explains our choice of movie scripts (availability of verifiable ground-truth relationships from plot context that would be difficult to obtain ethically in real-world data), and notes the potential for more explicit cues. We will also flag a full-scale validation study against real-world dialogues as valuable future work. A new empirical validation study is not feasible within the current project scope due to resource and privacy constraints. revision: partial

  2. Referee: [Methods or Evaluation section (likely §4)] The reported accuracies lack error bars, confidence intervals, or statistical tests, and no details are provided on the label collection process, inter-annotator agreement, or baseline models (e.g., rule-based or majority-class predictors). Without these, it is difficult to assess the reliability and significance of the performance differences and the claim that CoT provides minimal benefits.

    Authors: We acknowledge that the current version omits statistical details and baseline comparisons. The ground-truth labels are extracted directly from the narrative context and character descriptions provided in the original movie scripts rather than through separate human annotation of the inference task; we will clarify this extraction process in the revision. In the updated manuscript we will add bootstrap-derived error bars and confidence intervals to all accuracy figures, include majority-class and simple rule-based baselines, and report statistical significance tests (e.g., McNemar’s test) for key comparisons between models, languages, and prompting conditions. These changes will appear in the Methods and Results sections. revision: yes

  3. Referee: [Results section (likely §5)] The finding that thinking models and CoT occasionally amplify social biases is interesting but requires more specific evidence, such as examples of biased predictions or quantitative measures of bias before and after CoT. Currently, it is stated at a high level without supporting data or analysis.

    Authors: We agree that concrete examples and quantitative support would make this observation more robust. In the revision we will expand the relevant Results subsection to include two to three illustrative examples of outputs where CoT or thinking models produced more stereotypical inferences (e.g., over-inferring romantic relationships from mixed-gender dialogues). We will also add a quantitative analysis based on manual review of a sampled subset of model responses, reporting the frequency of biased predictions across prompting conditions. These additions will be supported by the data we already collected during our experiments. revision: yes
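
Response 2 proposes McNemar's test for comparing models and prompting conditions. A minimal sketch of the exact (binomial) form of that test follows, assuming each condition is scored on the same dialogues and reduced to per-item correctness flags. The function name and the toy flags are hypothetical; the authors do not say which variant of the test they would use.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired per-item correctness flags.

    Returns (b, c, p_value), where b counts items only condition A got
    right and c counts items only condition B got right.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if not x and y)
    n = b + c
    if n == 0:
        # No discordant pairs: the two conditions are indistinguishable.
        return b, c, 1.0
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split 50/50, so the
    # p-value is a doubled binomial tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy, made-up correctness flags (1 = correct), not results from the paper.
direct_prompt = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1]
cot_prompt = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]

b, c, p_value = mcnemar_exact(direct_prompt, cot_prompt)
print(f"b={b} c={c} p={p_value:.3f}")
```

The test looks only at dialogues where the two conditions disagree, which suits a claim like "CoT provides minimal benefits": two conditions can have similar aggregate accuracy while differing on many individual items.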

Circularity Check

0 steps flagged

No circularity in empirical benchmark evaluation

Full rationale

The paper constructs the SCRIPTS dataset from movie scripts and reports direct accuracy measurements of LLMs on a held-out social relationship inference task. These results are computed as standard performance metrics without any fitted parameters, self-referential equations, or derivations that reduce the reported accuracies to quantities defined inside the paper. No load-bearing self-citations, ansatzes, or uniqueness claims are invoked to justify the central findings, making the work a self-contained empirical benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that movie scripts contain representative social cues and that the relationship-type labels are reliable ground truth. No free parameters are introduced; the only domain assumption is that script dialogue generalizes to real social reasoning.

axioms (1)
  • domain assumption: Movie scripts provide ecologically valid examples of interpersonal dialogue for social-relationship inference.
    The dataset is sourced from movie scripts; this premise is required for the task to measure real-world social reasoning.




Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.

  2. LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.