Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
Pith reviewed 2026-05-18 04:19 UTC · model grok-4.3
The pith
LLMs infer social relationships like friends or lovers at 75-80% accuracy in English but only 58-69% in Korean.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce the SCRIPTS dataset of 1.1k dialogues in English and Korean sourced from movie scripts and propose a task to evaluate LLMs on inferring social relationships between speakers. Nine models achieve 75-80% accuracy on the English portion and 58-69% on the Korean portion. Thinking models and chain-of-thought prompting provide minimal benefits for social reasoning and occasionally amplify social biases.
What carries the argument
The SCRIPTS dataset of movie-script dialogues paired with a task that requires models to classify the social relationship between the two speakers.
If this is right
- LLMs still have clear limits when reasoning about social relationships in dialogue.
- Performance drops when moving from English to Korean dialogues.
- Chain-of-thought prompting and thinking models do not reliably raise accuracy on this task.
- Certain prompting methods can increase the chance of biased relationship predictions.
Where Pith is reading between the lines
- Real spoken exchanges outside scripted material may produce different error patterns if movie writing follows artificial social rules.
- Closing the language gap may require training data that includes more varied everyday interactions rather than film dialogue alone.
- Chat systems used across cultures could need targeted adjustments to avoid misreading social cues in non-English settings.
Load-bearing premise
Dialogues written for movies are a good stand-in for real interpersonal exchanges, and that stand-in works equally well in English and Korean.
What would settle it
Testing the same models on transcripts of actual spoken conversations in English and Korean and finding accuracy rates that differ substantially from the movie-script results.
Original abstract
As LLMs are increasingly deployed in real-world interactions, their social reasoning in interpersonal communication becomes critical. To explore their capabilities, we introduce SCRIPTS, a 1.1k-dialogue dataset in English and Korean, sourced from movie scripts and propose a social reasoning task based on SCRIPTS that evaluates the capacity of LLMs to infer the social relationships (e.g., friends, lovers) between speakers in each dialogue. Evaluating nine models on our task, current LLMs achieve around 75-80% on the English dataset and 58-69% in Korean, and models predict an Unlikely relationship in 10-25% of responses in both languages. Furthermore, we find that thinking models and chain-of-thought prompting provide minimal benefits for social reasoning and occasionally amplify social biases. In sum, there are significant limitations in current LLMs' social reasoning capabilities, especially for Korean, highlighting the need for efforts to develop socially-aware LLMs across languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the SCRIPTS dataset consisting of 1.1k dialogues sourced from movie scripts in both English and Korean. It proposes a task to evaluate LLMs' ability to infer social relationships (such as friends or lovers) between dialogue speakers. The authors evaluate nine LLMs, reporting accuracies of 75-80% on English and 58-69% on Korean, observe that models often predict 'Unlikely' relationships, and find that chain-of-thought prompting and thinking models provide minimal benefits while sometimes amplifying social biases. The conclusion emphasizes limitations in current LLMs' social reasoning, especially across languages.
Significance. This work offers a new multilingual benchmark for assessing social reasoning in LLMs, which is increasingly important as these models are used in interactive settings. The cross-lingual comparison between English and Korean is valuable, and the negative results on CoT and thinking models challenge assumptions about prompting strategies. If the dataset proves to be a reliable proxy, the findings could guide future development of socially aware AI systems. The empirical nature with concrete numbers is a strength, though generalizability remains to be established.
major comments (3)
- [Dataset section (likely §3)] The central accuracy claims (75-80% English, 58-69% Korean) depend on the quality and validity of the SCRIPTS labels derived from movie scripts. However, movie scripts often follow narrative conventions and dramatic tropes that may not reflect natural interpersonal social reasoning, potentially making the task easier or biased toward explicit cues. This assumption is load-bearing for the generalization claims, particularly for the Korean data where cultural adaptation or translation could introduce additional distortions. The paper should include a discussion or validation study comparing script-based inferences to real-world dialogues.
- [Methods or Evaluation section (likely §4)] The reported accuracies lack error bars, confidence intervals, or statistical tests, and there are no details provided on label collection process, inter-annotator agreement, or baseline models (e.g., rule-based or majority-class predictors). Without these, it is difficult to assess the reliability and significance of the performance differences and the claim that CoT provides minimal benefits.
- [Results section (likely §5)] The finding that thinking models and CoT occasionally amplify social biases is interesting but requires more specific evidence, such as examples of biased predictions or quantitative measures of bias before and after CoT. Currently, it is stated at a high level without supporting data or analysis.
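The majority-class baseline requested in the second major comment can be illustrated with a minimal sketch. It assumes the gold labels are available as a list of strings; the function name, the label names, and the toy data are hypothetical and are not taken from SCRIPTS.

```python
from collections import Counter

def majority_class_baseline(gold_labels):
    """Return (majority_label, accuracy) for always predicting the most frequent label."""
    if not gold_labels:
        raise ValueError("gold_labels must be non-empty")
    counts = Counter(gold_labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(gold_labels)

# Hypothetical toy labels, for illustration only (not SCRIPTS data).
toy_gold = ["friends", "friends", "lovers", "colleagues", "friends"]
majority_label, baseline_acc = majority_class_baseline(toy_gold)
print(majority_label, baseline_acc)
```

A model accuracy is only informative relative to this floor: if one relationship label dominates a language split, a high raw accuracy may reflect label imbalance rather than social reasoning.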
minor comments (2)
- [Abstract] The abstract mentions 'models predict an Unlikely relationship in 10-25% of responses' but does not specify what 'Unlikely' refers to in the task setup; clarifying the label set early would improve readability.
- [Throughout] Some figures or tables (if present) showing per-model breakdowns or error analyses would help readers interpret the aggregate accuracies.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which have helped us identify areas to strengthen the rigor and clarity of our manuscript. We address each major comment in turn below.
Point-by-point responses
- Referee: [Dataset section (likely §3)] The central accuracy claims (75-80% English, 58-69% Korean) depend on the quality and validity of the SCRIPTS labels derived from movie scripts. However, movie scripts often follow narrative conventions and dramatic tropes that may not reflect natural interpersonal social reasoning, potentially making the task easier or biased toward explicit cues. This assumption is load-bearing for the generalization claims, particularly for the Korean data where cultural adaptation or translation could introduce additional distortions. The paper should include a discussion or validation study comparing script-based inferences to real-world dialogues.
  Authors: We agree that movie scripts can incorporate narrative conventions and dramatic tropes that differ from everyday social interactions, and that this may affect the generalizability of results, particularly for the Korean portion where translation and cultural adaptation could add further variables. In the revised manuscript we will add a dedicated limitations subsection to §3 that explicitly discusses these issues, explains our choice of movie scripts (availability of verifiable ground-truth relationships from plot context that would be difficult to obtain ethically in real-world data), and notes the potential for more explicit cues. We will also flag a full-scale validation study against real-world dialogues as valuable future work. A new empirical validation study is not feasible within the current project scope due to resource and privacy constraints.
  Revision: partial
- Referee: [Methods or Evaluation section (likely §4)] The reported accuracies lack error bars, confidence intervals, or statistical tests, and there are no details provided on label collection process, inter-annotator agreement, or baseline models (e.g., rule-based or majority-class predictors). Without these, it is difficult to assess the reliability and significance of the performance differences and the claim that CoT provides minimal benefits.
  Authors: We acknowledge that the current version omits statistical details and baseline comparisons. The ground-truth labels are extracted directly from the narrative context and character descriptions provided in the original movie scripts rather than through separate human annotation of the inference task; we will clarify this extraction process in the revision. In the updated manuscript we will add bootstrap-derived error bars and confidence intervals to all accuracy figures, include majority-class and simple rule-based baselines, and report statistical significance tests (e.g., McNemar’s test) for key comparisons between models, languages, and prompting conditions. These changes will appear in the Methods and Results sections.
  Revision: yes
- Referee: [Results section (likely §5)] The finding that thinking models and CoT occasionally amplify social biases is interesting but requires more specific evidence, such as examples of biased predictions or quantitative measures of bias before and after CoT. Currently, it is stated at a high level without supporting data or analysis.
  Authors: We agree that concrete examples and quantitative support would make this observation more robust. In the revision we will expand the relevant Results subsection to include two to three illustrative examples of outputs where CoT or thinking models produced more stereotypical inferences (e.g., over-inferring romantic relationships from mixed-gender dialogues). We will also add a quantitative analysis based on manual review of a sampled subset of model responses, reporting the frequency of biased predictions across prompting conditions. These additions will be supported by the data we already collected during our experiments.
  Revision: yes
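The statistics promised in the second response (bootstrap confidence intervals and McNemar's test) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis code: it assumes per-dialogue 0/1 correctness flags are available, uses a percentile bootstrap and the exact binomial form of McNemar's test, and all function names and toy data are hypothetical.

```python
import random
from math import comb

def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy from per-item 0/1 correctness flags."""
    rng = random.Random(seed)
    n = len(correct)
    stats = []
    for _ in range(n_resamples):
        # Resample items with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lo, hi

def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for two systems scored on the same items."""
    # Only discordant pairs (one system right, the other wrong) carry information.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical toy flags for one model with and without CoT (not SCRIPTS results).
base = [1, 1, 1, 0, 1, 0, 1, 1, 0, 1]
cot = [1, 1, 0, 0, 1, 1, 1, 1, 0, 1]
acc, ci_lo, ci_hi = bootstrap_accuracy_ci(base)
p_value = mcnemar_exact(base, cot)
print(acc, ci_lo, ci_hi, p_value)
```

McNemar's test assumes paired items, so this sketch applies to two models or two prompting conditions scored on the same dialogues. A comparison between the English and Korean splits would need an unpaired test unless the dialogues are matched across languages.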
Circularity Check
No circularity in empirical benchmark evaluation
Full rationale
The paper constructs the SCRIPTS dataset from movie scripts and reports direct accuracy measurements of LLMs on a held-out social relationship inference task. These results are computed as standard performance metrics without any fitted parameters, self-referential equations, or derivations that reduce the reported accuracies to quantities defined inside the paper. No load-bearing self-citations, ansatzes, or uniqueness claims are invoked to justify the central findings, making the work a self-contained empirical benchmark study.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Movie scripts provide ecologically valid examples of interpersonal dialogue for social-relationship inference.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We introduce SCRIPTS, a 1k-dialogue dataset in English and Korean, sourced from movie scripts... evaluating LLMs’ social relationship reasoning abilities"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- Pseudo-Deliberation in Language Models: When Reasoning Fails to Align Values and Actions
  LLMs exhibit pseudo-deliberation, with consistent value-action misalignment in generated dialogues despite reasoning, as measured by the new VALDI framework across 4941 scenarios.
- LoCar: Localization-Aware Evaluation of In-Vehicle Assistants through Fine-Grained Sociolinguistic Control
  LoCar is a localization-aware evaluation framework for in-vehicle assistants that identifies unstable Korean honorific control and weaker performance on strategic metrics like clarification and proactivity in current LLMs.