What sentiment analysis can't see: Measuring whether customers were helped, and what went wrong, across 70,000 support conversations

Jason Potteiger

arxiv: 2606.19698 · v1 · pith:2MTOL7BXnew · submitted 2026-06-18 · 💻 cs.CL

Pith reviewed 2026-06-26 18:05 UTC · model grok-4.3

classification 💻 cs.CL

keywords sentiment analysis · customer satisfaction · LLM annotation · support conversations · problem detection · customer feedback · tone vs satisfaction

The pith

LLM satisfaction estimates from support conversations correlate better with customer ratings than sentiment analysis does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests a richer alternative to sentiment analysis for reading customer support data at scale. It applies GPT-5.4 to 70,450 conversations to estimate each customer's satisfaction level and to flag whether they reported a concrete problem, then validates the outputs against the 1-to-5 ratings customers left afterward. The satisfaction estimate tracks the ratings more closely than sentiment does and reveals that tone and satisfaction disagree in 44 percent of cases. It also identifies a large category of customers who are satisfied yet still report fixable problems, a pattern that tone-based measures cannot surface. The work shows that LLM annotation can extract customer state and problem causes directly from interaction text.

Core claim

On 70,450 support conversations, an LLM estimate of customer satisfaction correlates at 0.47 with the 1-to-5 ratings customers left, compared with 0.36 for sentiment analysis, while also identifying reported problems and showing that tone and satisfaction disagree in 44 percent of cases, with tolerated friction as the largest group.
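The comparison behind this claim can be sketched in a few lines. The abstract does not say which correlation coefficient was used, so this sketch assumes Pearson. The toy rows and the variable names (`ratings`, `pred_satisfaction`, `sentiment`) are invented for illustration and are not the paper's data or code.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(vx * vy)

# Made-up toy rows (NOT the paper's data): one entry per rated conversation.
ratings = [5, 5, 4, 1, 2, 5, 3, 1]            # customer's 1-to-5 rating
pred_satisfaction = [5, 4, 4, 2, 2, 5, 3, 1]  # hypothetical LLM estimate, 1-to-5
sentiment = [1, 0, 0, 0, -1, 1, 0, 0]         # assumed coding: -1 neg, 0 neutral, 1 pos

r_sat = pearson(pred_satisfaction, ratings)
r_sent = pearson(sentiment, ratings)
print(f"satisfaction vs rating: {r_sat:.2f}")
print(f"sentiment vs rating:    {r_sent:.2f}")
```

Because ratings are ordinal and, per the paper's Limitations excerpt, skewed toward the top of the scale, a rank-based coefficient would be a reasonable alternative to the Pearson assumption made here.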

What carries the argument

LLM-based separate estimation of customer satisfaction with the outcome and of concrete problems reported in the conversation, validated directly against post-interaction 1-to-5 ratings.

If this is right

Satisfaction estimates flag unhappy customers with fewer false alarms than sentiment scores.
Structured reads can surface tolerated friction cases that no sentiment dashboard detects.
Neutral tone labels often conceal both quietly satisfied and quietly dissatisfied customers.
Business metrics can be built on extracted customer state and problem causes instead of language tone alone.
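One way to picture the structured read is as a two-by-two grid over the satisfaction estimate and the problem flag. The sketch below assumes both have already been reduced to booleans. Only "tolerated friction" is the paper's term; the other three cell names, the function names, and the toy rows are hypothetical.

```python
from collections import Counter

def state_label(satisfied: bool, problem_reported: bool) -> str:
    """Map one conversation's (satisfied, problem_reported) pair to a grid cell."""
    if satisfied and problem_reported:
        return "tolerated_friction"         # the paper's term
    if satisfied:
        return "satisfied_no_problem"       # hypothetical label
    if problem_reported:
        return "dissatisfied_with_problem"  # hypothetical label
    return "dissatisfied_no_problem"        # hypothetical label

def summarize(rows):
    """Return the share of conversations falling in each grid cell."""
    counts = Counter(state_label(s, p) for s, p in rows)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

# Made-up toy rows: (satisfied, problem_reported) per conversation.
toy_rows = [
    (True, True), (True, False), (True, True),
    (False, True), (False, False), (True, True),
]
shares = summarize(toy_rows)
for label, share in sorted(shares.items()):
    print(f"{label}: {share:.0%}")
```

A tone label alone cannot populate this grid, since it carries neither the satisfaction estimate nor the problem flag.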

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Teams could prioritize fixing problems reported by customers who still give positive ratings.
The same annotation approach might apply to other text records such as email threads or review comments to uncover hidden issues.
Linking these estimates to longer-term customer behavior data could test whether tolerated friction predicts churn.

Load-bearing premise

The 1-to-5 ratings customers leave after conversations serve as an unbiased and complete record of whether they were satisfied and what problems occurred.

What would settle it

Repeating the validation on a fresh collection of rated conversations where the satisfaction estimate shows equal or weaker correlation with ratings than sentiment does, or where the ratings themselves prove unrelated to the actual help received.

Original abstract

Most companies read their customer support data at scale using sentiment analysis, which measures how customers sound rather than whether they were satisfied with the result. We tested a richer alternative on 70,450 support conversations from a leading online fundraising platform: alongside tone, we used GPT-5.4 to estimate each customer's satisfaction and to flag whether they reported a concrete problem, then validated all three readings against the 1-to-5 ratings customers left on the conversations they rated. The satisfaction estimate tracked those ratings far better than sentiment did, correlating at 0.47 against 0.36 and flagging unhappy customers with far fewer false alarms. The structured read also sees what sentiment cannot: tone and satisfaction disagree in 44% of conversations, a single "Neutral" label hides everything from quietly satisfied customers to ones who quietly gave up, and the largest group of all is "tolerated friction," customers who are satisfied but still reporting a fixable problem, a standing issue that no sentiment-based dashboard can surface. The broader finding is that LLM-based annotation can capture far more than the tonality of a customer's language, offering strong potential for new business metrics grounded instead in the customer's state (whether they were satisfied) and the cause of their problem extracted directly from the raw textual data of interactions and feedback.
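The "false alarms" comparison can be made concrete with a small confusion-matrix sketch. The abstract defines neither "unhappy" nor "false alarm", so this example assumes that a rating of 2 or below counts as unhappy and reports two common readings of the false-alarm idea. The function name, the cutoff, and the toy flags are all assumptions for illustration.

```python
def alarm_metrics(flagged, unhappy):
    """Compare boolean 'unhappy' flags against boolean ground truth."""
    if len(flagged) != len(unhappy):
        raise ValueError("flagged and unhappy must have the same length")
    tp = fp = tn = fn = 0
    for f, u in zip(flagged, unhappy):
        if f and u:
            tp += 1
        elif f and not u:
            fp += 1
        elif not f and u:
            fn += 1
        else:
            tn += 1
    return {
        # Share of customers who were not unhappy but got flagged anyway.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else None,
        # Share of raised flags that were wrong.
        "false_discovery_rate": fp / (tp + fp) if (tp + fp) else None,
        # Share of truly unhappy customers that were caught.
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }

# Made-up toy data (NOT the paper's): ratings and two competing flaggers.
ratings = [5, 4, 1, 2, 5, 3, 5, 1]
unhappy = [r <= 2 for r in ratings]  # assumed cutoff for "unhappy"
flag_by_satisfaction = [False, False, True, True, False, True, False, True]
flag_by_sentiment = [False, True, True, False, True, False, False, True]

sat_metrics = alarm_metrics(flag_by_satisfaction, unhappy)
sent_metrics = alarm_metrics(flag_by_sentiment, unhappy)
print("satisfaction:", sat_metrics)
print("sentiment:   ", sent_metrics)
```

Which of the two rates matters more depends on the use: the false discovery rate describes how often a follow-up team would chase a flag that turns out to be wrong, while the false positive rate describes how many content customers get flagged.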

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

LLM satisfaction estimates beat sentiment at matching ratings on this dataset, but the validation covers only the self-selected rated subset.

The one or two things to know: this work shows an LLM can estimate customer satisfaction from support chats in a way that lines up better with actual 1-5 ratings than sentiment analysis does, and it flags a category of "tolerated friction" that no tone-based system would catch. On their 70,450 conversations, satisfaction correlates at 0.47 with ratings while sentiment hits 0.36, and tone and satisfaction disagree in 44% of cases.

What the paper does well is take a large, real dataset and run a direct head-to-head on the rated conversations. The idea of pulling out whether a concrete problem was mentioned, separate from how the customer sounded, is a useful distinction. They also point out that a neutral tone can mean very different things, which is obvious once said but worth quantifying on this scale.

The main soft spot is the validation sample. All the correlations and false-alarm comparisons are computed only on conversations that received a customer rating. The abstract does not report what share of the 70,450 conversations were rated (the Limitations excerpt in the reference graph puts it at 5.55%) or whether rated chats differ from unrated ones in problem rate or satisfaction. If customers who had issues are more likely to rate, the advantage of the satisfaction estimate measured on the rated subset may not carry over to the full population. The stress-test note on this is on point, and the abstract does not address it with any robustness check such as repeat-contact rates.

The ground-truth assumption that the 1-5 ratings are unbiased is also accepted without much probing.

This paper is for people in industry NLP or support analytics who are already using or considering LLMs for annotation. A reader who wants concrete numbers on how much better one approach is over another on real data will find something here, even if the scope is one platform.

It deserves a serious referee. The central empirical claim is clear enough to review, and the selection issue is fixable with more analysis. I would recommend sending it out rather than desk rejecting.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLM-based estimates (using GPT-5.4) of customer satisfaction and reported problems on 70,450 support conversations outperform sentiment analysis when validated against customer 1-5 ratings, with correlations of 0.47 vs. 0.36, lower false alarms for unhappy customers, 44% disagreement between tone and satisfaction, and the ability to surface categories like 'tolerated friction' that sentiment cannot detect.

Significance. If the results hold after addressing methodological transparency and selection bias, the work could meaningfully advance customer-support analytics by shifting from tone-based to outcome-based metrics derived directly from interaction text, with potential for new business dashboards; the large scale and direct comparison to independent ratings are strengths.

major comments (2)

[Abstract] Abstract: the headline correlations (0.47 vs. 0.36) and false-alarm comparison are reported without any description of prompting methods, model parameters, statistical tests, or preprocessing steps, leaving the central empirical claim without visible methodological support.
[Abstract] Abstract: validation occurs exclusively on the rated subset, yet the manuscript provides no rating rate, no comparison of rated vs. unrated conversations, and no robustness check (e.g., via repeat-contact proxy), so the reported advantage over sentiment may be conditional on self-selection if rating propensity correlates with dissatisfaction or problem presence.

minor comments (1)

The model is referred to as 'GPT-5.4'; if this is a hypothetical or internal version, a brief clarification on its relation to publicly available models would aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for these constructive comments highlighting the need for greater methodological transparency in the abstract and for flagging potential selection bias in the validation. We address each point below and will revise accordingly.

Referee: [Abstract] Abstract: the headline correlations (0.47 vs. 0.36) and false-alarm comparison are reported without any description of prompting methods, model parameters, statistical tests, or preprocessing steps, leaving the central empirical claim without visible methodological support.

Authors: We agree the abstract would be strengthened by including key methodological details. In the revision we will add a concise clause noting the use of GPT-5.4 with structured prompting for satisfaction and problem estimation, the direct validation against customer-provided 1-5 ratings, and that full prompting templates, model parameters, preprocessing, and statistical tests (Pearson correlations with confidence intervals) appear in the Methods and Results sections.

revision: yes

Referee: [Abstract] Abstract: validation occurs exclusively on the rated subset, yet the manuscript provides no rating rate, no comparison of rated vs. unrated conversations, and no robustness check (e.g., via repeat-contact proxy), so the reported advantage over sentiment may be conditional on self-selection if rating propensity correlates with dissatisfaction or problem presence.

Authors: This concern is valid. The current validation is restricted to rated conversations because they supply the independent ground-truth labels. In the revised manuscript we will report the overall rating rate, provide a table comparing rated versus unrated conversations on available metadata (length, topic distribution), and add an explicit limitations paragraph discussing possible self-selection. Our dataset does not contain repeat-contact identifiers, so a repeat-contact proxy robustness check cannot be performed; we will note this as a limitation and suggest it for future work.

revision: partial
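A rated-versus-unrated comparison of the kind proposed in this response might look like the following minimal sketch. The record fields (`rating`, `n_messages`, `topic`), the function name, and the toy conversations are hypothetical; they stand in for whatever metadata the dataset actually carries.

```python
from collections import Counter
from statistics import mean

def rated_vs_unrated(conversations):
    """Summarize the rating rate and simple metadata for rated vs. unrated chats."""
    if not conversations:
        raise ValueError("no conversations to summarize")
    rated = [c for c in conversations if c["rating"] is not None]
    unrated = [c for c in conversations if c["rating"] is None]

    def describe(group):
        # Mean conversation length and topic shares for one group.
        if not group:
            return {"mean_length": None, "topic_shares": {}}
        topics = Counter(c["topic"] for c in group)
        return {
            "mean_length": mean(c["n_messages"] for c in group),
            "topic_shares": {t: n / len(group) for t, n in topics.items()},
        }

    return {
        "rating_rate": len(rated) / len(conversations),
        "rated": describe(rated),
        "unrated": describe(unrated),
    }

# Made-up toy conversations (NOT the paper's data).
toy = [
    {"rating": 5, "n_messages": 4, "topic": "refund"},
    {"rating": None, "n_messages": 6, "topic": "setup"},
    {"rating": 1, "n_messages": 10, "topic": "refund"},
    {"rating": None, "n_messages": 2, "topic": "setup"},
    {"rating": None, "n_messages": 4, "topic": "refund"},
]
summary = rated_vs_unrated(toy)
print(summary)
```

Large gaps between the two groups in length or topic mix would suggest that results validated on the rated subset need caution before being generalized to unrated conversations.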

Circularity Check

0 steps flagged

No circularity; claims rest on direct external validation against customer ratings

The paper estimates satisfaction and problem flags via GPT-5.4, then directly correlates those outputs (0.47) against independent 1-5 customer ratings left on conversations, outperforming sentiment (0.36). No equations, fitted parameters, or self-citations appear in the provided text; the central comparison uses an external ground truth that is not derived from or defined by the LLM outputs. Selection bias on the rated subset is a validity concern but does not constitute circularity under the enumerated patterns. The derivation chain is self-contained against an external benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Based on abstract only; the central validation depends on treating customer ratings as ground truth with no other free parameters or invented entities described.

axioms (1)

domain assumption Customer-provided 1-to-5 ratings accurately reflect true satisfaction and can serve as ground truth.
Invoked when validating LLM estimates against the ratings.

pith-pipeline@v0.9.1-grok · 5767 in / 1072 out tokens · 33540 ms · 2026-06-26T18:05:10.737402+00:00 · methodology

Reference graph

Works this paper leans on

10 extracted references

[1]

too convoluted

Introduction A customer writes a few civil messages about a refund that will not go through, thanks the agent for the help, and then rates the interaction a 1, with the remark that it was "too convoluted." Read as tone, the conversation is unremarkable; read as a verdict (the self-reported rating), it failed. The two readings describe the same customer an...

2025
[2]

is this text positive or negative

Related research The conventional way to summarize unstructured customer feedback at scale is sentiment analysis, and the field that built it has spent two decades moving away from the version still running on most dashboards. Pang and Lee (2008) and Liu (2015) codified the task as classifying the polarity of a document or sentence as positive, negative, ...

2008
[3]

The customer's tone, the presence of a concrete problem, and the customer's overall satisfaction are not separate data streams collected by separate instruments

Data, annotation, and validation design Every measure in this study is recovered from a single artifact: the text of a support conversation. The customer's tone, the presence of a concrete problem, and the customer's overall satisfaction are not separate data streams collected by separate instruments. They are three readings of the same words, produced by...

2025
[4]

They do not, and the gap is the foundation everything else in this paper rests on

Predicted satisfaction tracks the customer's rating; sentiment trails it If predicted satisfaction and sentiment were interchangeable summaries of a support conversation, they would track the customer's own rating about equally well. They do not, and the gap is the foundation everything else in this paper rests on. Predicted satisfaction correlates with r...

2018
[5]

something went wrong

Where tone and state disagree A measure can beat its rival on a criterion and still agree with it most of the time; the winner is just right more often in the cases where they split. But that is not the case here. Across the analyzable corpus, the two readings of the same conversation disagree 44% of the time. That rate is too high to support the common w...
[6]

I need help setting up my fundraising account. I have 2 campaigns on my page that are the same, how do I delete one? How do I access the QR code? Where is the sharing tab?

What the structured representation recovers Consider the conversation that sounds fine and contains a problem. A customer writes in about an accidental tip on a donation, the agent explains the refund path, the customer says thanks, and the exchange closes politely. Sentiment files this as Positive or Neutral and moves on. But a problem was reported and r...

2004
[7]

ok, thanks,

Why tone and state diverge: salience versus verdict Tone and the rating diverge because they answer different questions: sentiment reads the language a customer chose, while the rating judges how the whole interaction went. A customer can be unfailingly polite about an experience they would not repeat: three civil messages about a refund that will not go ...

1999
[8]

Conclusions A sentiment label is the wrong instrument for customer support, and the cost is not subtle. A structured read of the customer's state and the operational cause recovers what a single tone label compresses away: across 70,450 conversations the two readings disagree 44% of the time, a single Neutral label stretches across average ratings from 4....
[9]

nothing to see here

Implications for data practitioners Monitor state and cause, not tone. A sentiment dashboard is not a satisfaction dashboard. The two disagree 44% of the time here, often enough that using one to stand in for the other will mislead you. Track two things instead: whether the customer was satisfied (predicted satisfaction) and whether they reported a proble...
[10]

This is too convoluted. Why can't I just make a refund from the transactions page?

Limitations The criterion is limited and skews positive. Customer ratings exist for 5.55% of conversations and are heavily concentrated at the top of the scale, which inflates raw agreement and makes estimates least stable exactly where they matter most, at the low end. We mitigated this by leading with skew-robust metrics and reporting collapsed agreemen...

arXiv 2025
