
arxiv: 1907.08167 · v1 · pith:OCO3VQX6new · submitted 2019-07-18 · 💻 cs.CL · stat.ML

OCC: A Smart Reply System for Efficient In-App Communications

Pith reviewed 2026-05-24 19:39 UTC · model grok-4.3

classification 💻 cs.CL stat.ML
keywords smart reply · intent detection · in-app chat · ride sharing · nearest neighbor · distributed embedding · production deployment

The pith

The OCC smart reply system detects intents with 76% accuracy, using embeddings and nearest-neighbor classification, for Uber rider-driver chats.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents OCC (one-click-chat), Uber's smart reply system for in-app messaging between riders and driver-partners. It uses a two-stage design that first detects the intent of an incoming message and then retrieves replies, rather than predicting replies directly. Intent detection relies on unsupervised distributed embeddings and a nearest-neighbor classifier, chosen for low labeled-data needs, simplicity, and fast inference. The reply stage pairs detected intents with historically popular responses. The system reaches 76% intent-detection accuracy, performs comparably with a word-level CNN, and reports 71% adoption in its production deployment for English-speaking countries.

Core claim

The paper claims that separating intent detection via unsupervised distributed embeddings and nearest-neighbor classification from popularity-based reply retrieval produces effective smart replies for short, non-canonical mobile messages. This yields 76% accuracy on intent detection while requiring only small labeled training sets, enabling simple development and scalable serving. In production the system drives 71% adoption of suggested replies to speed rider-driver communications.

What carries the argument

Unsupervised distributed embedding combined with nearest-neighbor classifier for intent detection, paired with historical popularity data to select replies.
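
As a rough illustration of the intent-detection half of this machinery, the sketch below classifies a message by majority vote among its nearest labeled neighbors in embedding space. Everything in it is an assumption made for illustration: the intent names, the example messages, and especially `embed`, which is a toy bag-of-words stand-in for the learned unsupervised embedding the paper describes, not the deployed model.

```python
import math
from collections import Counter

# Hypothetical labeled examples; the messages and intent names are illustrative only.
LABELED = [
    ("where are you", "ask_location"),
    ("where are you right now", "ask_location"),
    ("i am here", "announce_arrival"),
    ("i am outside", "announce_arrival"),
    ("ok thanks", "acknowledge"),
]

VOCAB = sorted({tok for text, _ in LABELED for tok in text.split()})
INDEX = {tok: i for i, tok in enumerate(VOCAB)}


def embed(text):
    """Toy stand-in for a learned sentence embedding: L2-normalised bag of words."""
    vec = [0.0] * len(VOCAB)
    for tok in text.lower().split():
        if tok in INDEX:  # out-of-vocabulary tokens are ignored
            vec[INDEX[tok]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    """Cosine similarity; both inputs are assumed to be already normalised."""
    return sum(x * y for x, y in zip(a, b))


# Embed the small labeled set once; no further training step is needed.
EMBEDDED = [(embed(text), intent) for text, intent in LABELED]


def detect_intent(message, k=3):
    """Majority vote among the k labeled examples most similar to the message."""
    query = embed(message)
    ranked = sorted(EMBEDDED, key=lambda pair: cosine(query, pair[0]), reverse=True)
    votes = Counter(intent for _, intent in ranked[:k])
    return votes.most_common(1)[0][0]


print(detect_intent("where are you now"))
```

With so few examples per intent, a large `k` can let unrelated classes outvote a rare one, so `k` would have to be tuned against the size of the labeled set.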

If this is right

  • Intent detection reaches 76% accuracy while performing comparably with a word-level CNN.
  • Only small amounts of labeled data are needed for training.
  • Fast inference supports scalable production serving.
  • 71% of in-app rider-driver communications adopt the smart replies, which are intended to speed up communication.
  • The approach targets short non-canonical messages typical of mobile apps.
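
The paper's Figure 8 reports top-K accuracy for intent detection. As a hedged illustration of what such a metric measures (not the authors' evaluation code), the sketch below counts a prediction as correct when the true intent appears among the first K ranked candidates. The function name and the toy labels are hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of examples whose true label is among the first k ranked predictions."""
    if len(ranked_predictions) != len(true_labels):
        raise ValueError("predictions and labels must have the same length")
    if not true_labels:
        return 0.0
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_labels)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Toy data: each inner list is a model's intent ranking for one message, best first.
ranked = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
truth = ["a", "a", "a"]
for k in (1, 2, 3):
    print(k, top_k_accuracy(ranked, truth, k))
```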

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The two-stage split lets reply options be refreshed without retraining the intent model.
  • The same architecture could apply to other customer-support or ride-hailing chat domains with limited labeled data.
  • If the nearest-neighbor method transfers across languages, deployment could expand beyond English markets without new labeled sets.
  • Simplicity of the deployed solution suggests viability on devices with tight compute budgets.

Load-bearing premise

Pairings of intents to replies drawn from historical chat popularity remain effective and appropriate for current conversations.
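
To make the premise concrete, the sketch below builds an intent-to-reply table from historical frequency counts and returns the most popular replies for a detected intent. It is a minimal illustration under stated assumptions: the history records, intent names, reply texts, and the `top_n` cutoff are all hypothetical, and the paper's actual pairing procedure may differ.

```python
from collections import Counter, defaultdict

# Hypothetical history: (intent of the incoming message, reply that was actually sent).
HISTORY = [
    ("ask_location", "I'm on my way"),
    ("ask_location", "I'm on my way"),
    ("ask_location", "I'm on my way"),
    ("ask_location", "Be there in 2 minutes"),
    ("ask_location", "Be there in 2 minutes"),
    ("ask_location", "I've arrived"),
    ("announce_arrival", "Coming out now"),
    ("announce_arrival", "Coming out now"),
    ("announce_arrival", "OK"),
]


def build_reply_table(history, top_n=2):
    """Map each intent to its top_n most frequent historical replies."""
    counts = defaultdict(Counter)
    for intent, reply in history:
        counts[intent][reply] += 1
    return {
        intent: [reply for reply, _ in counter.most_common(top_n)]
        for intent, counter in counts.items()
    }


REPLY_TABLE = build_reply_table(HISTORY)


def suggest_replies(intent):
    """Return the stored replies for an intent, or no suggestions if it is unknown."""
    return REPLY_TABLE.get(intent, [])


print(suggest_replies("ask_location"))
```

Because the table is rebuilt from counts alone, it can be regenerated from newer logs without touching the intent classifier, which is also where the premise can fail: a stale table keeps serving replies that were popular only in the past.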

What would settle it

Compare adoption rates and user satisfaction before and after swapping the reply database for one built only from messages collected after the original training period.

Figures

Figures reproduced from arXiv: 1907.08167 by Franziska Bell, Gokhan Tur, Huaixiu Zheng, Yue Weng.

Figure 2: The machine learning algorithm empowers the …
Figure 3: The architecture for Uber’s smart reply system, …
Figure 4: Chat message length frequency, on average …
Figure 6: Illustration of the process of nearest neighbor clas…
Figure 7: (a) Precision and (b) Recall of the models for top-…
Figure 8: Top-K accuracy of intent detection. 4.1 Results. Model Accuracy: One of the most important metrics for evaluating our system is the overall accuracy for intent detection as we strictly control the number of replies for each intent in the product …
Figure 9: Model prediction comparison: (a) Word-CNN vs …
Figure 10: In this two-dimensional t-SNE projection of sen…
Figure 11: Hyper-parameter search for Doc2vec model with …
Original abstract

Smart reply systems have been developed for various messaging platforms. In this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which is a key enhanced feature on top of the Uber in-app chat system. It enables driver-partners to quickly respond to rider messages using smart replies. The smart replies are dynamically selected according to conversation content using machine learning algorithms. Our system consists of two major components: intent detection and reply retrieval, which are very different from standard smart reply systems where the task is to directly predict a reply. It is designed specifically for mobile applications with short and non-canonical messages. Reply retrieval utilizes pairings between intent and reply based on their popularity in chat messages as derived from historical data. For intent detection, a set of embedding and classification techniques are experimented with, and we choose to deploy a solution using unsupervised distributed embedding and nearest-neighbor classifier. It has the advantage of only requiring a small amount of labeled training data, simplicity in developing and deploying to production, and fast inference during serving and hence highly scalable. At the same time, it performs comparably with deep learning architectures such as word-level convolutional neural network. Overall, the system achieves a high accuracy of 76% on intent detection. Currently, the system is deployed in production for English-speaking countries and 71% of in-app communications between riders and driver-partners adopted the smart replies to speedup the communication process.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper describes Uber's One-Click-Chat (OCC) smart reply system for in-app rider-driver communications. It separates intent detection (unsupervised distributed embeddings + nearest-neighbor classifier, reported 76% accuracy and comparable to word-level CNN) from reply retrieval (pairings derived from historical popularity counts). The system is deployed in production for English-speaking countries, with a claimed 71% adoption rate that speeds up communication.

Significance. If the performance and adoption claims are substantiated, the work shows a practical, low-labeled-data, scalable alternative to direct reply prediction for short non-canonical mobile messages, and illustrates that simple embedding methods can match deeper architectures in this constrained domain.

major comments (3)
  1. Abstract: the 76% intent-detection accuracy is stated without any information on test-set construction, size, class balance, or the precise CNN baseline implementation and comparison protocol, which is load-bearing for both the accuracy claim and the comparability statement.
  2. Abstract: reply retrieval is defined solely via historical popularity pairings, yet no validation, A/B test, expert review, or end-to-end usefulness metric is supplied to show that these pairings remain appropriate; this assumption directly supports the 71% adoption claim.
  3. Abstract: the 71% adoption figure is reported without describing the measurement method (e.g., logged clicks, A/B experiment, or survey), leaving the central deployment-success claim unevaluable.
minor comments (1)
  1. Abstract: the number of intents, replies, or labeled examples used for the nearest-neighbor model is not stated, which would help readers assess data efficiency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the careful review and constructive feedback on the abstract. We address each major comment below. Where the comments identify gaps in substantiation, we agree that revisions are warranted and will update the abstract accordingly while preserving its conciseness.

Point-by-point responses
  1. Referee: Abstract: the 76% intent-detection accuracy is stated without any information on test-set construction, size, class balance, or the precise CNN baseline implementation and comparison protocol, which is load-bearing for both the accuracy claim and the comparability statement.

    Authors: We agree that the abstract would benefit from additional context on the evaluation protocol to make the 76% figure and CNN comparison more readily evaluable. The main text (Section 4) describes the test set as a held-out collection of labeled messages drawn from production logs, the class distribution, and the CNN baseline trained and evaluated under an identical protocol. In the revised manuscript we will add a concise clause to the abstract summarizing the test-set construction and noting that the CNN comparison follows the same data split and metrics. revision: yes

  2. Referee: Abstract: reply retrieval is defined solely via historical popularity pairings, yet no validation, A/B test, expert review, or end-to-end usefulness metric is supplied to show that these pairings remain appropriate; this assumption directly supports the 71% adoption claim.

    Authors: The referee correctly observes that the abstract presents the historical-popularity method without separate validation steps. The manuscript relies on the observed production adoption rate as the primary end-to-end indicator of appropriateness. We will revise the abstract to explicitly state that the pairings are derived from historical frequency counts and that their suitability is supported by the measured deployment usage, thereby linking the retrieval component more directly to the adoption result. revision: yes

  3. Referee: Abstract: the 71% adoption figure is reported without describing the measurement method (e.g., logged clicks, A/B experiment, or survey), leaving the central deployment-success claim unevaluable.

    Authors: We agree that the measurement method for the 71% figure should be stated. This percentage is computed from production logs as the fraction of in-app rider-driver communications in which a smart reply was selected and sent. We will revise the abstract to include this definition of the metric, making the deployment-success claim directly interpretable from the abstract alone. revision: yes
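
The metric definition in the simulated response above (the fraction of communications in which a smart reply was selected and sent) can be sketched as follows. This is an illustration only: the `Conversation` record, its field names, and the sample log are hypothetical and do not reflect Uber's logging schema.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    conversation_id: str
    smart_reply_sent: bool  # True if any suggested reply was selected and sent


def adoption_rate(conversations):
    """Share of conversations in which at least one smart reply was sent."""
    conversations = list(conversations)
    if not conversations:
        return 0.0
    adopted = sum(1 for c in conversations if c.smart_reply_sent)
    return adopted / len(conversations)


# Hypothetical log sample.
logs = [
    Conversation("c1", True),
    Conversation("c2", True),
    Conversation("c3", False),
    Conversation("c4", True),
]
print(f"{adoption_rate(logs):.0%}")
```

Whether the denominator counts all conversations or only those in which a suggestion was actually shown changes the figure materially, which is part of why the referee asks for the measurement method.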

Circularity Check

0 steps flagged

No circularity: engineering system description with empirical metrics only

Full rationale

The paper is a deployment report on OCC, a two-component smart-reply system (intent detection via unsupervised embeddings + nearest-neighbor, reply retrieval via historical popularity pairings). No equations, derivations, fitted parameters, or predictions appear anywhere in the provided text. The 76% intent accuracy and 71% adoption figures are stated as observed evaluation and production outcomes, not outputs of any self-referential chain. Reply retrieval is explicitly described as a design choice based on historical counts, not as a derived or predicted result. No self-citations, uniqueness theorems, or ansatzes are invoked. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on two domain assumptions about data and embeddings; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Historical chat data provides reliable intent-reply pairings based on popularity.
    Directly invoked for the reply retrieval component in the abstract.
  • domain assumption: Unsupervised distributed embeddings capture intents sufficiently for nearest-neighbor classification in short non-canonical messages.
    Basis for choosing the deployed intent detection method that requires only small labeled data.

pith-pipeline@v0.9.0 · 5789 in / 1367 out tokens · 26087 ms · 2026-05-24T19:39:45.212003+00:00 · methodology

