OCC: A Smart Reply System for Efficient In-App Communications
Pith reviewed 2026-05-24 19:39 UTC · model grok-4.3
The pith
The OCC smart reply system detects message intents with 76% accuracy, using unsupervised embeddings and nearest-neighbor classification on Uber rider-driver chats.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that separating intent detection (unsupervised distributed embeddings plus nearest-neighbor classification) from popularity-based reply retrieval produces effective smart replies for short, non-canonical mobile messages. The deployed intent detector reaches 76% accuracy while requiring only a small labeled training set, which keeps development simple and serving scalable. In production, 71% of in-app communications between riders and driver-partners adopt the suggested replies.
What carries the argument
Unsupervised distributed embedding combined with nearest-neighbor classifier for intent detection, paired with historical popularity data to select replies.
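The two stages described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the bag-of-words `embed` function is only a stand-in for the unsupervised distributed embedding, and the intent labels, example messages, and historical reply pairs are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def embed(message):
    """Stand-in embedding (assumption): a bag-of-words count vector.
    The paper uses a learned unsupervised distributed embedding instead."""
    return Counter(message.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical small labeled set: (message, intent).
LABELED = [
    ("where are you", "ask_location"),
    ("where r u now", "ask_location"),
    ("i am on my way", "report_eta"),
    ("be there in 5 min", "report_eta"),
]

# Hypothetical historical log: (detected intent of incoming message, reply sent).
HISTORY = [
    ("ask_location", "I am at the pickup spot"),
    ("ask_location", "I am at the pickup spot"),
    ("ask_location", "On my way"),
    ("report_eta", "OK, thanks"),
    ("report_eta", "OK, thanks"),
    ("report_eta", "See you soon"),
]


def detect_intent(message):
    """Stage 1: 1-nearest-neighbor over the labeled examples."""
    query = embed(message)
    _, intent = max(LABELED, key=lambda ex: cosine(query, embed(ex[0])))
    return intent


def build_reply_table(history):
    """Count how often each reply followed each intent."""
    table = defaultdict(Counter)
    for intent, reply in history:
        table[intent][reply] += 1
    return table


def suggest_replies(message, table, k=2):
    """Stage 2: return the k most popular historical replies for the intent."""
    intent = detect_intent(message)
    return intent, [reply for reply, _ in table[intent].most_common(k)]


table = build_reply_table(HISTORY)
print(suggest_replies("where are you now", table))
```

Because the reply table is built separately from the labeled examples, it can be rebuilt from newer logs without touching the intent detector.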
If this is right
- Intent detection reaches 76% accuracy while performing comparably to a word-level CNN.
- Only small amounts of labeled data are needed for training.
- Fast inference supports scalable production serving.
- 71% of in-app rider-driver communications adopt the smart replies, which are intended to speed up the exchange.
- The approach targets short non-canonical messages typical of mobile apps.
Where Pith is reading between the lines
- The two-stage split lets reply options be refreshed without retraining the intent model.
- The same architecture could apply to other customer-support or ride-hailing chat domains with limited labeled data.
- If the nearest-neighbor method transfers across languages, deployment could expand beyond English markets without new labeled sets.
- Simplicity of the deployed solution suggests viability on devices with tight compute budgets.
Load-bearing premise
Pairings of intents to replies drawn from historical chat popularity remain effective and appropriate for current conversations.
What would settle it
Compare adoption rates and user satisfaction before and after swapping the reply database for one built only from messages collected after the original training period.
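One way to make the adoption-rate half of that comparison concrete is a two-proportion z-test. The sketch below is an assumption about how such a check might be run, not something the paper describes; the function name and the counts in the usage line are hypothetical, and user satisfaction would need a separate measure.

```python
import math


def two_proportion_z(adopted_a, total_a, adopted_b, total_b):
    """Two-sided z-test for a difference between two adoption rates.

    Group A: conversations served by the original reply database.
    Group B: conversations served by the refreshed reply database.
    Assumes the pooled rate is strictly between 0 and 1.
    """
    rate_a = adopted_a / total_a
    rate_b = adopted_b / total_b
    pooled = (adopted_a + adopted_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z(710, 1000, 650, 1000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value would indicate that the two reply databases produce different adoption rates; it would not by itself say which set of replies is more appropriate.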
Original abstract
Smart reply systems have been developed for various messaging platforms. In this paper, we introduce Uber's smart reply system: one-click-chat (OCC), which is a key enhanced feature on top of the Uber in-app chat system. It enables driver-partners to quickly respond to rider messages using smart replies. The smart replies are dynamically selected according to conversation content using machine learning algorithms. Our system consists of two major components: intent detection and reply retrieval, which are very different from standard smart reply systems where the task is to directly predict a reply. It is designed specifically for mobile applications with short and non-canonical messages. Reply retrieval utilizes pairings between intent and reply based on their popularity in chat messages as derived from historical data. For intent detection, a set of embedding and classification techniques are experimented with, and we choose to deploy a solution using unsupervised distributed embedding and nearest-neighbor classifier. It has the advantage of only requiring a small amount of labeled training data, simplicity in developing and deploying to production, and fast inference during serving and hence highly scalable. At the same time, it performs comparably with deep learning architectures such as word-level convolutional neural network. Overall, the system achieves a high accuracy of 76% on intent detection. Currently, the system is deployed in production for English-speaking countries and 71% of in-app communications between riders and driver-partners adopted the smart replies to speed up the communication process.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper describes Uber's One-Click-Chat (OCC) smart reply system for in-app rider-driver communications. It separates intent detection (unsupervised distributed embeddings + nearest-neighbor classifier, reported 76% accuracy and comparable to word-level CNN) from reply retrieval (pairings derived from historical popularity counts). The system is deployed in production for English-speaking countries, with a claimed 71% adoption rate that speeds up communication.
Significance. If the performance and adoption claims are substantiated, the work shows a practical, low-labeled-data, scalable alternative to direct reply prediction for short non-canonical mobile messages, and illustrates that simple embedding methods can match deeper architectures in this constrained domain.
major comments (3)
- Abstract: the 76% intent-detection accuracy is stated without any information on test-set construction, size, class balance, or the precise CNN baseline implementation and comparison protocol, which is load-bearing for both the accuracy claim and the comparability statement.
- Abstract: reply retrieval is defined solely via historical popularity pairings, yet no validation, A/B test, expert review, or end-to-end usefulness metric is supplied to show that these pairings remain appropriate; this assumption directly supports the 71% adoption claim.
- Abstract: the 71% adoption figure is reported without describing the measurement method (e.g., logged clicks, A/B experiment, or survey), leaving the central deployment-success claim unevaluable.
minor comments (1)
- Abstract: the number of intents, replies, or labeled examples used for the nearest-neighbor model is not stated, which would help readers assess data efficiency.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback on the abstract. We address each major comment below. Where the comments identify gaps in substantiation, we agree that revisions are warranted and will update the abstract accordingly while preserving its conciseness.
Point-by-point responses
- Referee: Abstract: the 76% intent-detection accuracy is stated without any information on test-set construction, size, class balance, or the precise CNN baseline implementation and comparison protocol, which is load-bearing for both the accuracy claim and the comparability statement.
  Authors: We agree that the abstract would benefit from additional context on the evaluation protocol to make the 76% figure and CNN comparison more readily evaluable. The main text (Section 4) describes the test set as a held-out collection of labeled messages drawn from production logs, the class distribution, and the CNN baseline trained and evaluated under an identical protocol. In the revised manuscript we will add a concise clause to the abstract summarizing the test-set construction and noting that the CNN comparison follows the same data split and metrics. Revision: yes
- Referee: Abstract: reply retrieval is defined solely via historical popularity pairings, yet no validation, A/B test, expert review, or end-to-end usefulness metric is supplied to show that these pairings remain appropriate; this assumption directly supports the 71% adoption claim.
  Authors: The referee correctly observes that the abstract presents the historical-popularity method without separate validation steps. The manuscript relies on the observed production adoption rate as the primary end-to-end indicator of appropriateness. We will revise the abstract to explicitly state that the pairings are derived from historical frequency counts and that their suitability is supported by the measured deployment usage, thereby linking the retrieval component more directly to the adoption result. Revision: yes
- Referee: Abstract: the 71% adoption figure is reported without describing the measurement method (e.g., logged clicks, A/B experiment, or survey), leaving the central deployment-success claim unevaluable.
  Authors: We agree that the measurement method for the 71% figure should be stated. This percentage is computed from production logs as the fraction of in-app rider-driver communications in which a smart reply was selected and sent. We will revise the abstract to include this definition of the metric, making the deployment-success claim directly interpretable from the abstract alone. Revision: yes
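The metric definition given in the last response (the fraction of communications in which a smart reply was selected and sent) can be written out as a short sketch. The `Conversation` record and its field names are hypothetical; the actual log schema is not described in the text above.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    """Hypothetical log record for one rider-driver communication."""
    conversation_id: str
    smart_reply_sent: bool  # True if any suggested reply was selected and sent


def adoption_rate(conversations):
    """Fraction of conversations in which a smart reply was sent."""
    if not conversations:
        return 0.0
    adopted = sum(1 for c in conversations if c.smart_reply_sent)
    return adopted / len(conversations)


# Illustrative records only.
logs = [
    Conversation("c1", True),
    Conversation("c2", True),
    Conversation("c3", False),
    Conversation("c4", True),
]
print(adoption_rate(logs))
```

Under this definition a conversation counts as adopted if at least one suggested reply was sent, regardless of how many typed messages it also contains; other definitions (for example, per-message rather than per-conversation) would give different numbers.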
Circularity Check
No circularity: engineering system description with empirical metrics only
The paper is a deployment report on OCC, a two-component smart-reply system (intent detection via unsupervised embeddings + nearest-neighbor, reply retrieval via historical popularity pairings). No equations, derivations, fitted parameters, or predictions appear anywhere in the provided text. The 76% intent accuracy and 71% adoption figures are stated as observed production outcomes, not outputs of any self-referential chain. Reply retrieval is explicitly described as a design choice based on historical counts, not as a derived or predicted result. No self-citations, uniqueness theorems, or ansatzes are invoked. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: historical chat data provides reliable intent-reply pairings based on popularity.
- Domain assumption: unsupervised distributed embeddings capture intents well enough for nearest-neighbor classification of short, non-canonical messages.