
arxiv: 2606.01099 · v1 · pith:RU7SO3ROnew · submitted 2026-05-31 · 💻 cs.CL · cs.AI

MiCU: End-to-End Smart Home Command Understanding with Large Language Model

Pith reviewed 2026-06-28 17:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords smart home, command understanding, large language models, data synthesis, curriculum learning, reinforcement learning, token compression, domain adaptation

The pith

MiCU adapts large language models to smart home commands by synthesizing data from user logs and using curriculum learning plus reinforcement learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to build a domain-specific model that turns vague user requests such as making a room cozy into correct device actions. It starts with an automated workflow that turns existing user logs into training examples with the help of large language models. Curriculum learning then embeds device knowledge, while cold-start training followed by reinforcement learning guided by domain rules improves step-by-step reasoning. A compression step replaces lengthy device descriptions with one special token, cutting inference cost. These steps together produce measurable gains in accuracy and lower correction rates once the model runs in a live app.

Core claim

MiCU is built in four steps. First, training data is generated automatically from user logs with the aid of large language models. Second, curriculum learning injects domain knowledge into the base model. Third, cold-start training combined with reinforcement learning guided by explicit domain thinking rules strengthens reasoning. Finally, device descriptions are replaced with a single special token for efficiency. On this foundation the model reaches an average accuracy gain of 20.01 percent across device categories over prior baselines. When deployed in the Xiaomi Home app it lowers the user correction rate by 1.57 percent and raises human-audited accuracy by 32.05 percent.

What carries the argument

The automated data synthesis workflow from user logs combined with curriculum learning, reinforcement learning guided by domain rules, and single-token compression of device descriptions.
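
As a rough illustration of the compression idea, the sketch below swaps each device description for one placeholder token and compares prompt lengths. It is not the paper's implementation: the `<DEV_i>` token format, the function names, the sample device descriptions, and the whitespace-based token count are all assumptions made for illustration, and the training that lets a model interpret such a token is not shown.

```python
def full_prompt(command, device_descriptions):
    """Baseline prompt: every device description is spelled out in full."""
    return "Devices: " + " ".join(device_descriptions) + "\nCommand: " + command


def compress_prompt(command, device_descriptions):
    """Swap each description for one placeholder token (hypothetical format)."""
    token_table = {f"<DEV_{i}>": desc for i, desc in enumerate(device_descriptions)}
    prompt = "Devices: " + " ".join(token_table) + "\nCommand: " + command
    return prompt, token_table


def count_tokens(text):
    """Crude proxy for tokenizer length: whitespace-separated pieces."""
    return len(text.split())


# Hypothetical device descriptions, not taken from the paper's data.
devices = [
    "bedroom lamp: supports on, off, brightness 1-100, color temperature 2700-6500",
    "air purifier: supports on, off, fan level 1-3",
]
command = "make the bedroom cozy"

long_prompt = full_prompt(command, devices)
short_prompt, table = compress_prompt(command, devices)
print(count_tokens(long_prompt), "->", count_tokens(short_prompt))
```

The compressed prompt grows by one token per device regardless of how long each description is, which is the property that would make long device lists cheaper to serve.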

If this is right

  • The model delivers a 20.01 percent average accuracy increase across all device categories relative to baselines.
  • Production use cuts the rate at which users must correct the system by 1.57 percent.
  • Human-audited accuracy in the deployed app rises by 32.05 percent.
  • The token compression step yields a faster variant, MiCU-fast, that handles long device lists at substantially lower inference cost.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same log-driven synthesis and rule-guided reinforcement learning could be reused for command understanding in other structured environments such as vehicle controls or industrial interfaces.
  • Single-token compression of repeated structured text may lower serving costs in any large-language-model application that repeatedly references catalogs or inventories.
  • The app's roughly 1.7 million page views per day supply a continuous stream of new logs that could be fed back into the synthesis loop for ongoing adaptation.

Load-bearing premise

The examples created automatically from user logs and large language models match the distribution of real ambiguous commands well enough that gains observed on test sets carry over to live users.

What would settle it

Run the trained model on a fresh collection of ambiguous commands gathered directly from app users that were never seen during synthesis and check whether the 20 percent accuracy margin over baselines still holds.
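
One way to run the check described above is a paired bootstrap over per-command correctness flags. The sketch below is a minimal version under stated assumptions: the function name, the 0/1 flag lists, and the reading of the margin as absolute accuracy points are illustrative choices, and the sample data is invented rather than drawn from the paper.

```python
import random


def accuracy_margin_ci(model_correct, baseline_correct, n_boot=2000, seed=0):
    """Paired bootstrap 95% interval for (model accuracy - baseline accuracy).

    Both inputs are lists of 0/1 flags for the same commands, in the same order.
    """
    if len(model_correct) != len(baseline_correct):
        raise ValueError("flag lists must cover the same commands")
    n = len(model_correct)
    rng = random.Random(seed)
    # Per-command difference keeps the pairing between the two systems.
    diffs = [m - b for m, b in zip(model_correct, baseline_correct)]
    point = sum(diffs) / n
    boots = []
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        boots.append(sum(resample) / n)
    boots.sort()
    lower = boots[int(0.025 * n_boot)]
    upper = boots[int(0.975 * n_boot) - 1]
    return point, lower, upper


# Invented flags for 100 fresh commands (hypothetical data).
model_flags = [1] * 90 + [0] * 10
baseline_flags = [1] * 70 + [0] * 30
margin, lower, upper = accuracy_margin_ci(model_flags, baseline_flags)
print(f"margin={margin:.2f}, 95% interval=({lower:.2f}, {upper:.2f})")
```

If the lower end of the interval on genuinely unseen commands stays near the reported margin, the synthetic-data premise is supported; if it collapses toward zero, the gains were likely tied to the synthetic distribution.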

Figures

Figures reproduced from arXiv: 2606.01099 by Bin Qin, Bo Du, Debiao Zhang, Haowei Han, Jiawei Jiang, Kexin Hu, Weiwei Cai, Xiao Yan, Yuxiang Wang.

Figure 1: Illustration of the LLM-based end-to-end command…
Figure 2: Overview of the key techniques of MiCU. We evaluate our MiCU on datasets with multiple device categories, and compare it with baselines including traditional rule-based selection and popular generally pretrained LLMs with few-shot contextual learning. Results show that MiCU outperforms all baselines across various device categories: it achieves accuracy gains of 28.29% and 20.01% over the rule-based metho…
Figure 4: Curriculum learning stage. It begins with contin…
Figure 5: CoT enhancement stage. We first cold start model…
Figure 6: Token compression technique. The model under…
Figure 8: Scalability study. Figure (a) shows performance…
Figure 9: Failure analysis for the main category "Light".

Original abstract

Command understanding systems in smart home ecosystems can automate device control and substantially improve user experience. However, while they perform well on precise utterances (e.g., "turn on the bedroom light"), they struggle with ambiguous or misaligned commands (e.g., "make the bedroom cozy"). Large language models (LLMs) generalize well across various domains and can outperform traditional rule-based systems on such tasks, but their effectiveness is often constrained by scarce domain-specific data, insufficient task-specific adaptation, and high computational costs. In this paper, we propose an automated training data synthesis workflow using user logs and LLMs; then we build MiCU, a domain-specific LLM that excels at command understanding. Specifically, we employ curriculum learning to inject domain knowledge into the base LLM, then we enhance its reasoning ability via cold-start training combined with reinforcement learning (RL) guided by domain-specific thinking rules. Additionally, we introduce a token compression technique that condenses device description into a single special token, substantially reducing inference overhead and enabling MiCU-fast, an efficient variant optimized for long inputs. Extensive experiments show that MiCU significantly outperforms baselines, with an average accuracy gain of 20.01% across all device categories. We have deployed MiCU in the Xiaomi Home app, receiving approximately 1.7 million page views per day. Production evaluations show that MiCU reduces user correction rate by 1.57% and increases human audited accuracy by 32.05%. Our data and code are available at https://github.com/xiaomi-research/iot_spec_llm

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes MiCU, a domain-specific LLM for smart home command understanding. It introduces an automated data synthesis workflow from user logs using LLMs, followed by curriculum learning to inject domain knowledge, cold-start training combined with RL guided by domain-specific thinking rules, and a token compression technique to create an efficient variant (MiCU-fast). Experiments report an average 20.01% accuracy gain over baselines across device categories, with production deployment in the Xiaomi Home app yielding a 1.57% reduction in user correction rate and 32.05% increase in human audited accuracy; code and data are released.

Significance. If the results hold under proper controls, the work demonstrates a practical pipeline for adapting LLMs to ambiguous command understanding in IoT settings with scarce domain data, combining synthetic data generation, curriculum/RL stages, and inference optimization. The production deployment metrics provide evidence of real-world utility beyond academic benchmarks.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (data synthesis workflow): The headline 20.01% accuracy gain and production metrics rest on the claim that LLM-driven synthesis from user logs produces training examples whose ambiguity distribution matches real deployment commands, yet no quantitative validation (feature histograms, KL divergence, or human alignment scores on held-out real logs) is reported; without this, the gains risk being artifacts of the synthetic distribution rather than evidence of robust generalization.
  2. [Abstract, Experiments] Abstract and experimental section: The accuracy, correction-rate, and audited-accuracy figures are stated without dataset sizes, baseline model details, error bars, ablation results, or statistical significance tests, preventing assessment of whether the reported improvements are load-bearing or sensitive to experimental controls.
minor comments (2)
  1. [§4] The token-compression method is described as condensing device descriptions into a single special token, but the exact training objective and any impact on downstream accuracy are not quantified separately from the RL stage.
  2. [Production evaluation] Production metrics are given as point estimates (1.57%, 32.05%) without confidence intervals or description of the audit sample size and selection criteria.
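
The confidence intervals requested in the second minor comment could be computed with a Wilson score interval once the audit sample size is known. The sketch below is a minimal illustration; the function name and the sample counts are assumptions, since the paper's audit sample size is not given here.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Hypothetical audit: 50 of 100 sampled commands judged correct.
low, high = wilson_interval(50, 100)
print(f"audited accuracy 0.50, 95% interval=({low:.3f}, {high:.3f})")
```

With only 100 audited samples the interval spans roughly twenty percentage points, which is why the sample size matters for interpreting point estimates such as 1.57% or 32.05%.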

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which highlight important aspects for improving the clarity and rigor of our work. We address each major comment below and commit to revisions that incorporate the suggested enhancements.

Point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (data synthesis workflow): The headline 20.01% accuracy gain and production metrics rest on the claim that LLM-driven synthesis from user logs produces training examples whose ambiguity distribution matches real deployment commands, yet no quantitative validation (feature histograms, KL divergence, or human alignment scores on held-out real logs) is reported; without this, the gains risk being artifacts of the synthetic distribution rather than evidence of robust generalization.

    Authors: We concur that providing quantitative evidence of distribution alignment would strengthen the claims regarding the synthetic data's fidelity. Although the workflow is constructed to mirror real user log patterns and the production results offer supporting evidence of practical effectiveness, we will revise the manuscript to include comparative feature histograms, KL divergence metrics on ambiguity-related attributes, and human-rated alignment scores between synthetic and held-out real logs. These additions will appear in an expanded §3.

    revision: yes

  2. Referee: [Abstract, Experiments] Abstract and experimental section: The accuracy, correction-rate, and audited-accuracy figures are stated without dataset sizes, baseline model details, error bars, ablation results, or statistical significance tests, preventing assessment of whether the reported improvements are load-bearing or sensitive to experimental controls.

    Authors: We agree that comprehensive reporting of experimental details is essential for reproducibility and assessment of robustness. In the revised version, we will expand the experimental section to report training and test dataset sizes, full descriptions of all baseline models, standard deviations or error bars from repeated experiments, complete ablation results, and appropriate statistical significance tests for the accuracy gains. Updated tables and text will reflect these additions.

    revision: yes
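
The KL divergence check on ambiguity-related attributes mentioned in the first response can be sketched as follows. This is an illustration under assumptions: the attribute labels ("explicit", "vague", "misaligned"), the counts, and the add-one smoothing are hypothetical choices, not details from the paper.

```python
import math
from collections import Counter


def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """KL(P || Q) in nats between two categorical histograms.

    Additive smoothing keeps the result finite when a category is
    missing from one of the two samples.
    """
    keys = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + smoothing * len(keys)
    q_total = sum(q_counts.values()) + smoothing * len(keys)
    kl = 0.0
    for key in keys:
        p = (p_counts.get(key, 0) + smoothing) / p_total
        q = (q_counts.get(key, 0) + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical ambiguity labels for synthetic and held-out real commands.
synthetic = Counter({"explicit": 600, "vague": 300, "misaligned": 100})
real = Counter({"explicit": 400, "vague": 400, "misaligned": 200})

print(f"KL(synthetic || real) = {kl_divergence(synthetic, real):.4f}")
```

A value near zero would indicate that the synthetic commands are distributed much like the real ones on the chosen attribute; the measure is asymmetric, so reporting both directions is a common precaution.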

Circularity Check

0 steps flagged

No circularity; empirical training and evaluation pipeline

Full rationale

The paper describes an automated data synthesis workflow from user logs and LLMs, followed by curriculum learning, cold-start training, RL with domain rules, and token compression. All headline results (20.01% accuracy gain, production metrics) are presented as direct outcomes of training and deployment evaluation rather than any derivation, fitted parameter, or self-citation that reduces to the inputs by construction. No equations or mathematical claims appear; the work is self-contained as standard empirical ML.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the unstated assumption that synthetic data from logs faithfully represents real user ambiguity.

pith-pipeline@v0.9.1-grok · 5835 in / 1176 out tokens · 21701 ms · 2026-06-28T17:11:37.232335+00:00 · methodology


Reference graph

Works this paper leans on

42 extracted references · 17 canonical work pages · 11 internal anchors

  1. [1]

    Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. In Findings of the Association for Computational Linguistics: EMNLP

  2. [2]

    Andrew Drozdov, Nathanael Schärli, Ekin Akyürek, Nathan Scales, Xinying Song, Xinyun Chen, Olivier Bousquet, and Denny Zhou. 2022. Compositional semantic parsing with large language models. arXiv preprint arXiv:2209.15003 (2022)

  3. [3]

    Yi Gao, Kaijie Xiao, Fu Li, Weifeng Xu, Jiaming Huang, and Wei Dong. 2024. ChatIoT: Zero-code Generation of Trigger-action Based IoT Programs. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 8, 3, Article 103 (2024)

  4. [4]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  5. [5]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  6. [6]

    Chuang Hu, Wei Bao, Dan Wang, and Fengming Liu. 2019. Dynamic Adaptive DNN Surgery for Inference Acceleration on the Edge. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM)

  7. [7]

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  8. [8]

    Yidi Jiang, Bidisha Sharma, Maulik Madhavi, and Haizhou Li. 2021. Knowledge distillation from bert transformer to speech transformer for intent classification. arXiv preprint arXiv:2108.02598 (2021)

  9. [9]

    Ehsan Kamalloo, Nouha Dziri, Charles Clarke, and Davood Rafiei. 2023. Evaluating open-domain question answering in the era of large language models. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

  10. [10]

    Minjeong Kim, Gyuwan Kim, Sang-Woo Lee, and Jung-Woo Ha. 2021. St-bert: Cross-modal language model pre-training for end-to-end spoken language understanding. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 7478–7482

  11. [11]

    Evan King, Haoxiang Yu, Sangsu Lee, and Christine Julien. 2024. Sasha: Creative Goal-Oriented Reasoning in Smart Homes with Large Language Models. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT) 8, 1, Article 12 (2024)

  12. [12, 13]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. In Advances in Neural Information Processing Systems (NeurIPS)

  14. [14]

    Yuhao Lin, Zhipeng Tang, Jiayan Tong, Junqing Xiao, Bin Lu, Yuhang Li, Chao Li, Zhiguo Zhang, Junhua Wang, Hao Luo, et al. 2026. FuxiShuffle: An Adaptive and Resilient Shuffle Service for Distributed Data Processing on Alibaba Cloud. arXiv preprint arXiv:2602.22580 (2026).

    KDD 2026, August 9–13, 2026, Jeju Island, Republic of Korea. Haowei Han et al

  15. [15]

    Aixin Liu, Aoxue Mei, Bangcai Lin, Bing Xue, Bingxuan Wang, Bingzheng Xu, Bochao Wu, Bowei Zhang, Chaofan Lin, Chen Dong, et al. 2025. Deepseek-v3.2: Pushing the frontier of open large language models. arXiv preprint arXiv:2512.02556 (2025)

  16. [16]

    Renze Lou, Kai Zhang, and Wenpeng Yin. 2024. Large language model instruction following: A survey of progresses and challenges. Computational Linguistics 50, 3 (2024), 1053–1095

  17. [17, 18]

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback. Advances in Neural Information Processing Systems (NeurIPS) (2023)

  19. [19]

    Mahda Noura, Sebastian Heil, and Martin Gaedke. 2020. VISH: Does Your Smart Home Dialogue System Also Need Training Data?. In Web Engineering, Maria Bielikova, Tommi Mikkonen, and Cesare Pautasso (Eds.). 171–187

  20. [20]

    Dmitriy Rivkin, Francois Hogan, Amal Feriani, Abhisek Konar, Adam Sigal, Steve Liu, and Greg Dudek. 2023. Sage: Smart home agent with grounded execution. arXiv preprint arXiv:2311.00772 (2023)

  21. [21]

    Subhro Roy, Samuel Thomson, Tongfei Chen, Richard Shin, Adam Pauls, Jason Eisner, and Benjamin Van Durme. 2023. Benchclamp: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing. In Advances in Neural Information Processing Systems (NeurIPS)

  22. [22]

    Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv preprint arXiv:2402.07927 (2024)

  23. [23]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024)

  24. [24]

    Tom Silver, Varun Hariprasad, Reece S Shuttleworth, Nishanth Kumar, Tomás Lozano-Pérez, and Leslie Pack Kaelbling. 2022. PDDL planning with pretrained large language models. In Proceedings of the NeurIPS 2022 Workshop on Foundation Models for Decision Making

  25. [25]

    Stefanie Tellex, Thomas Kollar, Steven Dickerson, Matthew Walter, Ashis Banerjee, Seth Teller, and Nicholas Roy. 2011. Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)

  26. [26]

    Jianing Wang, Wenkang Huang, Minghui Qiu, Qiuhui Shi, Hongbin Wang, Xiang Li, and Ming Gao. 2022. Knowledge Prompting in Pre-trained Language Model for Natural Language Understanding. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

  27. [27]

    Yuxiang Wang, Chi Ma, Xiao Yan, Mincong Huang, Xiaoguang Li, Ruidong Han, Bin Yin, Shangyu Chen, Xiang Li, Fei Jiang, Lei Yu, Chuan Liu, Wei Lin, Haowei Han, Xiaokai Zhou, Bo Du, and Jiawei Jiang. 2026. MTGenRec: An Efficient Distributed Training System for Generative Recommendation Models in Meituan. In Proceedings of the ACM SIGKDD Conference on Knowledg...

  28. [28]

    Gang Wu, Liang Hu, Yuxiao Hu, Xingbo Xiong, and Feng Wang. 2025. LLM4TAP: LLM-Enhanced TAP Rule Recommendation. IEEE Internet of Things Journal 12, 10 (2025), 13157–13169

  29. [29]

    Yaqi Xie, Chen Yu, Tongyao Zhu, Jinbin Bai, Ze Gong, and Harold Soh. 2023. Translating natural language to planning goals with large-language models. arXiv preprint arXiv:2302.05128 (2023)

  30. [30]

    Benfeng Xu, Licheng Zhang, Zhendong Mao, Quan Wang, Hongtao Xie, and Yongdong Zhang. 2020. Curriculum learning for natural language understanding. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)

  31. [31]

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. 2023. Wizardlm: Empowering large language models to follow complex instructions. In Proceedings of International Conference on Learning Representations (ICLR)

  32. [32]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)

  33. [33]

    Fangkai Yang, Pu Zhao, Zezhong Wang, Lu Wang, Bo Qiao, Jue Zhang, Mohit Garg, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. 2023. Empower large language model to perform better on industrial domain-specific question answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP)

  34. [34]

    Jianxin Yang. 2023. Firefly. https://github.com/yangjianxin1/Firefly

  35. [35]

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. 2025. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476 (2025)

  36. [36]

    John M. Zelle and Raymond J. Mooney. 1993. Learning semantic grammars with constructive inductive logic programming. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)

  37. [37]

    Shengyu Zhang, Linfeng Dong, Xiaoya Li, Sen Zhang, Xiaofei Sun, Shuhe Wang, Jiwei Li, Runyi Hu, Tianwei Zhang, Guoyin Wang, and Fei Wu. 2026. Instruction Tuning for Large Language Models: A Survey. ACM Computing Surveys (2026)

  38. [38]

    Zhuosheng Zhang, Yuwei Wu, Hai Zhao, Zuchao Li, Shuailiang Zhang, Xi Zhou, and Xiang Zhou. 2020. Semantics-aware BERT for language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI)

  39. [39]

    Xue Zhao, Hanyu Zhao, Sha Yuan, and Yequan Wang. 2022. WuDaoCorpora Text. doi:10.57760/sciencedb.o00126.00004

  40. [40]

    Haoyu Zheng, Fangcheng Fu, Jia Wu, Binhang Yuan, Yongqiang Zhang, Hao Wang, Yuanyuan Zhu, Xiao Yan, and Jiawei Jiang. 2026. Efficient Serving for Dynamic Agent Workflows with Prediction-based KV-Cache Management. arXiv preprint arXiv:2605.06472 (2026)

  41. [41]

    Haoyu Zheng, Yongqiang Zhang, Fangcheng Fu, Xiaokai Zhou, Hao Luo, Hongchao Zhu, Yuanyuan Zhu, Hao Wang, Xiao Yan, and Jiawei Jiang. 2026. Scheduling LLM Inference with Uncertainty-Aware Output Length Predictions. arXiv preprint arXiv:2604.00499 (2026)

  42. [42]

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. 2023. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911 (2023).

A Dataset Detail

The DevCmd dataset comprises 50K samples across 28 distinct smart device categories, covering a comprehensive range of modern sma...