OCC: A Smart Reply System for Efficient In-App Communications

Franziska Bell; Gokhan Tur; Huaixiu Zheng; Yue Weng

arxiv: 1907.08167 · v1 · pith:OCO3VQX6new · submitted 2019-07-18 · 💻 cs.CL · stat.ML

Pith reviewed 2026-05-24 19:39 UTC · model grok-4.3

keywords: smart reply · intent detection · in-app chat · ride sharing · nearest neighbor · distributed embedding · production deployment

The pith

OCC smart reply system detects intents at 76% accuracy using embeddings and nearest-neighbor classification for Uber rider-driver chats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OCC (one-click-chat), Uber's smart reply system for in-app messaging between riders and drivers. It uses a two-stage design that first detects the intent of a message and then retrieves replies, rather than predicting replies directly. Intent detection relies on unsupervised distributed embeddings and a nearest-neighbor classifier, chosen for low labeled-data needs, simplicity, and fast inference. The reply stage pairs detected intents with historically popular responses. The system reaches 76% intent accuracy, performs comparably to a word-level CNN, and is deployed in production for English-speaking countries, where 71% of in-app communications adopt the suggested replies.

Core claim

The paper claims that separating intent detection via unsupervised distributed embeddings and nearest-neighbor classification from popularity-based reply retrieval produces effective smart replies for short, non-canonical mobile messages. This yields 76% accuracy on intent detection while requiring only small labeled training sets, enabling simple development and scalable serving. In production the system drives 71% adoption of suggested replies to speed rider-driver communications.

What carries the argument

An unsupervised distributed embedding combined with a nearest-neighbor classifier for intent detection, paired with historical popularity data to select replies.
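To make the intent-detection mechanism concrete, here is a minimal sketch, not Uber's implementation. It assumes a toy bag-of-words vector in place of the learned distributed embedding, and a nearest-centroid variant of nearest-neighbor classification with cosine similarity. The intent names, example messages, and all function names are hypothetical.

```python
import math
from collections import defaultdict


def build_vocab(texts):
    """Assign each distinct token an index (toy stand-in for a trained embedding model)."""
    vocab = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def embed(text, vocab):
    """Map a message to an L2-normalised bag-of-words vector; unknown tokens are ignored."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def intent_centroids(labeled, vocab):
    """Average the embeddings of the labeled messages of each intent."""
    sums = {}
    counts = defaultdict(int)
    for text, intent in labeled:
        vec = embed(text, vocab)
        if intent not in sums:
            sums[intent] = [0.0] * len(vocab)
        sums[intent] = [a + b for a, b in zip(sums[intent], vec)]
        counts[intent] += 1
    return {i: [x / counts[i] for x in s] for i, s in sums.items()}


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def detect_intent(message, centroids, vocab):
    """Return the intent whose centroid is closest to the message embedding."""
    vec = embed(message, vocab)
    return max(centroids, key=lambda intent: cosine(vec, centroids[intent]))


# Hypothetical labeled set; note how few examples per intent are needed.
LABELED = [
    ("where are you", "ask_location"),
    ("where are you now", "ask_location"),
    ("i am here", "arrived"),
    ("i am at the corner", "arrived"),
    ("how long will you be", "ask_eta"),
    ("how long until you arrive", "ask_eta"),
]
VOCAB = build_vocab(text for text, _ in LABELED)
CENTROIDS = intent_centroids(LABELED, VOCAB)

print(detect_intent("where are you right now", CENTROIDS, VOCAB))
```

Inference here is one embedding plus one similarity per intent, which is the kind of cost profile that makes this family of methods cheap to serve.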

If this is right

Intent detection reaches 76% accuracy while matching word-level CNN performance.
Only small amounts of labeled data are needed for training.
Fast inference supports scalable production serving.
71% of in-app communications adopt the suggested replies, which speeds up the exchange.
The approach targets short non-canonical messages typical of mobile apps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The two-stage split lets reply options be refreshed without retraining the intent model.
The same architecture could apply to other customer-support or ride-hailing chat domains with limited labeled data.
If the nearest-neighbor method transfers across languages, deployment could expand beyond English markets without new labeled sets.
Simplicity of the deployed solution suggests viability on devices with tight compute budgets.

Load-bearing premise

Pairings of intents to replies drawn from historical chat popularity remain effective and appropriate for current conversations.
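The premise can be illustrated with a short sketch of popularity-based retrieval. It assumes the historical data is available as (intent, reply) pairs and that the table keeps the most frequent replies per intent; the function names, the `top_n` parameter, and the sample data are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict


def build_reply_table(history, top_n=3):
    """Count how often each reply followed each intent and keep the top_n per intent."""
    counts = defaultdict(Counter)
    for intent, reply in history:
        counts[intent][reply] += 1
    return {
        intent: [reply for reply, _ in counter.most_common(top_n)]
        for intent, counter in counts.items()
    }


def suggest_replies(intent, table):
    """Look up the stored replies for a detected intent; unknown intents get no suggestions."""
    return table.get(intent, [])


# Hypothetical historical pairs, for illustration only.
HISTORY = (
    [("ask_location", "I am on my way")] * 3
    + [("ask_location", "I am at the pickup spot")] * 2
    + [("ask_location", "Stuck in traffic")]
    + [("ask_eta", "5 minutes")] * 2
    + [("ask_eta", "Almost there")]
)
TABLE = build_reply_table(HISTORY, top_n=2)
print(suggest_replies("ask_location", TABLE))
```

Because the table is plain counts, it goes stale silently: nothing in the lookup signals when the popular replies of the past no longer fit current conversations.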

What would settle it

Compare adoption rates and user satisfaction before and after swapping the reply database for one built only from messages collected after the original training period.
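One way to run the adoption-rate half of that comparison is a two-proportion z-test, sketched below. This is an assumption about how such a comparison could be analysed, not a procedure described in the paper; the function name and the counts in the usage line are hypothetical, and user satisfaction would need a separate measure.

```python
import math


def two_proportion_z_test(adopted_before, total_before, adopted_after, total_after):
    """Two-sided z-test for a difference between two adoption rates (pooled variance)."""
    if total_before <= 0 or total_after <= 0:
        raise ValueError("both samples must be non-empty")
    p_before = adopted_before / total_before
    p_after = adopted_after / total_after
    pooled = (adopted_before + adopted_after) / (total_before + total_after)
    variance = pooled * (1.0 - pooled) * (1.0 / total_before + 1.0 / total_after)
    if variance == 0.0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p_after - p_before) / math.sqrt(variance)
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical counts: 710 of 1000 conversations adopted a reply before the swap,
# 650 of 1000 after it.
z, p = two_proportion_z_test(710, 1000, 650, 1000)
print(z, p)
```

A negative `z` with a small `p` would indicate that adoption dropped after the swap by more than sampling noise would explain.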

Figures

Figures reproduced from arXiv: 1907.08167 by Franziska Bell, Gokhan Tur, Huaixiu Zheng, Yue Weng.

**Figure 3.** The architecture for Uber’s smart reply system,

**Figure 4.** Chat message length frequency, on average

**Figure 6.** Illustration of the process of nearest neighbor clas

**Figure 7.** (a) Precision and (b) Recall of the models for top-

**Figure 8.** Top-K accuracy of intent detection.

**Figure 9.** Model prediction comparison: (a) Word-CNN vs

**Figure 10.** In this two-dimensional t-SNE projection of sen

**Figure 11.** Hyper-parameter search for Doc2vec model with
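Figure 8 reports top-K accuracy for intent detection. As a reminder of what that metric measures, here is a minimal sketch under the usual definition: a prediction counts as correct if the true intent appears among the K highest-ranked candidates. The function name and the sample rankings are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label appears in the first k ranked predictions."""
    if not true_labels or len(ranked_predictions) != len(true_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical rankings (best guess first) and ground-truth intents.
RANKED = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
TRUTH = ["a", "a", "a", "b"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(RANKED, TRUTH, k))
```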

Original abstract

Smart reply systems have been developed for various messaging platforms. In this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which is a key enhanced feature on top of the Uber in-app chat system. It enables driver-partners to quickly respond to rider messages using smart replies. The smart replies are dynamically selected according to conversation content using machine learning algorithms. Our system consists of two major components: intent detection and reply retrieval, which are very different from standard smart reply systems where the task is to directly predict a reply. It is designed specifically for mobile applications with short and non-canonical messages. Reply retrieval utilizes pairings between intent and reply based on their popularity in chat messages as derived from historical data. For intent detection, a set of embedding and classification techniques are experimented with, and we choose to deploy a solution using unsupervised distributed embedding and nearest-neighbor classifier. It has the advantage of only requiring a small amount of labeled training data, simplicity in developing and deploying to production, and fast inference during serving and hence highly scalable. At the same time, it performs comparably with deep learning architectures such as word-level convolutional neural network. Overall, the system achieves a high accuracy of 76% on intent detection. Currently, the system is deployed in production for English-speaking countries and 71% of in-app communications between riders and driver-partners adopted the smart replies to speedup the communication process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

This is a straightforward industry report on a deployed two-stage smart reply system at Uber, useful for practitioners but short on evaluation details.

The main takeaway is that Uber built and shipped a smart reply feature for driver-rider chat by splitting the problem into intent detection and then popularity-based reply lookup from historical data. The system runs in production for English markets and claims 71% of in-app messages used the suggested replies. They picked unsupervised embeddings plus nearest-neighbor classification for the intent stage because it needs little labeled data, is simple to maintain, and runs fast at scale. It reaches 76% accuracy and matches a word-level CNN on their test, which is the practical win they highlight. The split architecture itself is a reasonable adaptation for short, informal mobile messages rather than trying to predict full replies directly. That choice and the deployment numbers are the concrete things the paper contributes.

The evaluation stays thin. The abstract gives the 76% figure but supplies no information on how the test set was built, what the exact baselines were, or any error breakdown. The reply retrieval step rests entirely on historical popularity counts without any reported check that those pairings still produce appropriate suggestions or any end-to-end A/B result that isolates the effect of the suggestions. The 71% adoption number is presented as evidence of success, yet the measurement method is not described. These gaps make it hard to judge how robust the claims are.

The paper is aimed at engineers who need to ship similar features in high-volume consumer apps and want to see one working example with real constraints. It does not introduce new algorithms or open fresh research questions. I would send it to peer review for a venue that accepts industry systems papers, with the expectation that referees will press for more on the metrics and validation of the retrieval stage.

Referee Report

3 major / 1 minor

Summary. The paper describes Uber's One-Click-Chat (OCC) smart reply system for in-app rider-driver communications. It separates intent detection (unsupervised distributed embeddings + nearest-neighbor classifier, reported 76% accuracy and comparable to word-level CNN) from reply retrieval (pairings derived from historical popularity counts). The system is deployed in production for English-speaking countries, with a claimed 71% adoption rate that speeds up communication.

Significance. If the performance and adoption claims are substantiated, the work shows a practical, low-labeled-data, scalable alternative to direct reply prediction for short non-canonical mobile messages, and illustrates that simple embedding methods can match deeper architectures in this constrained domain.

major comments (3)

Abstract: the 76% intent-detection accuracy is stated without any information on test-set construction, size, class balance, or the precise CNN baseline implementation and comparison protocol, which is load-bearing for both the accuracy claim and the comparability statement.
Abstract: reply retrieval is defined solely via historical popularity pairings, yet no validation, A/B test, expert review, or end-to-end usefulness metric is supplied to show that these pairings remain appropriate; this assumption directly supports the 71% adoption claim.
Abstract: the 71% adoption figure is reported without describing the measurement method (e.g., logged clicks, A/B experiment, or survey), leaving the central deployment-success claim unevaluable.

minor comments (1)

Abstract: the number of intents, replies, or labeled examples used for the nearest-neighbor model is not stated, which would help readers assess data efficiency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on the abstract. We address each major comment below. Where the comments identify gaps in substantiation, we agree that revisions are warranted and will update the abstract accordingly while preserving its conciseness.

Referee: Abstract: the 76% intent-detection accuracy is stated without any information on test-set construction, size, class balance, or the precise CNN baseline implementation and comparison protocol, which is load-bearing for both the accuracy claim and the comparability statement.

Authors: We agree that the abstract would benefit from additional context on the evaluation protocol to make the 76% figure and CNN comparison more readily evaluable. The main text (Section 4) describes the test set as a held-out collection of labeled messages drawn from production logs, the class distribution, and the CNN baseline trained and evaluated under an identical protocol. In the revised manuscript we will add a concise clause to the abstract summarizing the test-set construction and noting that the CNN comparison follows the same data split and metrics.

revision: yes

Referee: Abstract: reply retrieval is defined solely via historical popularity pairings, yet no validation, A/B test, expert review, or end-to-end usefulness metric is supplied to show that these pairings remain appropriate; this assumption directly supports the 71% adoption claim.

Authors: The referee correctly observes that the abstract presents the historical-popularity method without separate validation steps. The manuscript relies on the observed production adoption rate as the primary end-to-end indicator of appropriateness. We will revise the abstract to explicitly state that the pairings are derived from historical frequency counts and that their suitability is supported by the measured deployment usage, thereby linking the retrieval component more directly to the adoption result.

revision: yes

Referee: Abstract: the 71% adoption figure is reported without describing the measurement method (e.g., logged clicks, A/B experiment, or survey), leaving the central deployment-success claim unevaluable.

Authors: We agree that the measurement method for the 71% figure should be stated. This percentage is computed from production logs as the fraction of in-app rider-driver communications in which a smart reply was selected and sent. We will revise the abstract to include this definition of the metric, making the deployment-success claim directly interpretable from the abstract alone.

revision: yes
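The metric definition given in the last simulated response can be written out as a short sketch. It assumes each logged conversation carries a boolean flag recording whether a smart reply was selected and sent; the record schema, the field name `smart_reply_sent`, and the sample log are hypothetical and do not describe Uber's logging.

```python
def adoption_rate(conversations):
    """Fraction of conversations in which at least one smart reply was selected and sent."""
    if not conversations:
        return 0.0
    adopted = sum(1 for conv in conversations if conv["smart_reply_sent"])
    return adopted / len(conversations)


# Hypothetical log of seven conversations, five of which used a smart reply.
LOG = [
    {"conversation_id": 1, "smart_reply_sent": True},
    {"conversation_id": 2, "smart_reply_sent": True},
    {"conversation_id": 3, "smart_reply_sent": False},
    {"conversation_id": 4, "smart_reply_sent": True},
    {"conversation_id": 5, "smart_reply_sent": False},
    {"conversation_id": 6, "smart_reply_sent": True},
    {"conversation_id": 7, "smart_reply_sent": True},
]
print(adoption_rate(LOG))
```

Defined this way, the rate counts conversations rather than individual messages, so a single tapped suggestion in a long exchange is enough for that conversation to count as adopted.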

Circularity Check

0 steps flagged

No circularity: engineering system description with empirical metrics only

The paper is a deployment report on OCC, a two-component smart-reply system (intent detection via unsupervised embeddings + nearest-neighbor, reply retrieval via historical popularity pairings). No equations, derivations, fitted parameters, or predictions appear anywhere in the provided text. The 76% intent accuracy and 71% adoption figures are stated as observed production outcomes, not outputs of any self-referential chain. Reply retrieval is explicitly described as a design choice based on historical counts, not as a derived or predicted result. No self-citations, uniqueness theorems, or ansatzes are invoked. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about data and embeddings; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)

domain assumption: Historical chat data provides reliable intent-reply pairings based on popularity.
Directly invoked for the reply retrieval component in the abstract.
domain assumption: Unsupervised distributed embeddings capture intents sufficiently for nearest-neighbor classification in short non-canonical messages.
Basis for choosing the deployed intent detection method that requires only small labeled data.

pith-pipeline@v0.9.0 · 5789 in / 1367 out tokens · 26087 ms · 2026-05-24T19:39:45.212003+00:00 · methodology
