arXiv: 2605.23958 · v1 · pith:T327JEZZnew · submitted 2026-05-11 · 💻 cs.CY · cs.AI · econ.GN · q-fin.EC

AI in the Enterprise: How People Use M365 Copilot Chat

Pith reviewed 2026-06-30 22:02 UTC · model grok-4.3

classification 💻 cs.CY · cs.AI · econ.GN · q-fin.EC
keywords enterprise AI · M365 Copilot · user intent classification · knowledge work · O*NET work activities · AI adoption patterns · chat session analysis

The pith

M365 Copilot Chat functions as an everyday assistant for knowledge work: writing leads, and search-style use loses relative share over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper classifies user intents in a sample of 5.5 million M365 Copilot Chat sessions using learned models and O*NET work activity categories. It establishes that writing is the dominant activity while information retrieval, analysis, decision making, and system evaluation also occur frequently. Time trends indicate a relative move away from search-style queries toward content and communication tasks. Usage spans many occupations yet remains uneven when compared to the broader labor market, pointing to areas of future expansion.

Core claim

Based on direct classification of approximately 5.5 million sessions, M365 Copilot Chat supports a range of knowledge work activities with writing as the primary use, substantial reliance on information retrieval and analysis, and a measurable shift over time from chat-as-search toward content creation and communication. Patterns differ across occupational groups and diverge from standard labor market distributions in some activities.

What carries the argument

Learned classification of user intent from chat sessions combined with O*NET work activity mapping.

If this is right

  • Writing tasks form the core use case for this class of enterprise AI.
  • Information-seeking interactions are common but lose relative share to content and communication work.
  • Some uses cut across jobs while others remain occupation-specific, so usage is broad but uneven.
  • Work activities underrepresented in current usage indicate the next areas for enterprise AI expansion.
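One way to make "underrepresented" concrete is a ratio of usage share to labor-market share per work activity. The sketch below is an illustration only, not the paper's method: the function name `representation_ratio`, the activity labels, and all numbers are hypothetical.

```python
def representation_ratio(usage_share, labor_share):
    """Divide each activity's usage share by its labor-market share.

    A ratio below 1.0 means the activity takes a smaller share of
    assistant usage than of work in the labor market.
    Activities with a zero labor share are skipped.
    """
    ratios = {}
    for activity, labor in labor_share.items():
        if labor <= 0:
            continue
        ratios[activity] = usage_share.get(activity, 0.0) / labor
    return ratios


# Hypothetical shares, invented for illustration only.
usage = {"activity_a": 0.40, "activity_b": 0.20, "activity_c": 0.00}
labor = {"activity_a": 0.20, "activity_b": 0.20, "activity_c": 0.10}

ratios = representation_ratio(usage, labor)
underrepresented = sorted(a for a, r in ratios.items() if r < 1.0)
print(ratios)
print(underrepresented)
```

A ratio is only one possible design choice; a difference in shares would weight large activities more heavily than small ones.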

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Design of future assistants could prioritize features that support the observed shift to content creation.
  • Occupational differences may require profession-specific prompting or integration options.
  • The data could be used to model which work activities are most ready for AI augmentation.

Load-bearing premise

The automated intent classification from chat logs accurately reflects users' real goals and the sampled sessions represent typical usage across the companies involved.

What would settle it

A controlled comparison where users directly report their goal for each session and the reports are matched against the model's classifications, or a replication study on sessions from a new set of companies that shows substantially different activity shares or time trends.
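The controlled comparison described above could be scored with a chance-corrected agreement statistic. The sketch below computes Cohen's kappa between self-reported goals and model labels; the function name, label names, and data are hypothetical, and the paper does not say which agreement measure it would use.

```python
from collections import Counter


def cohens_kappa(reported, predicted):
    """Cohen's kappa between two equal-length label sequences."""
    if len(reported) != len(predicted) or not reported:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(reported)
    # Observed agreement: fraction of sessions where the labels match.
    observed = sum(r == p for r, p in zip(reported, predicted)) / n
    # Expected agreement if both labelings were independent.
    reported_counts = Counter(reported)
    predicted_counts = Counter(predicted)
    expected = sum(
        reported_counts[label] * predicted_counts[label]
        for label in reported_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: what users said they wanted vs. the classifier output.
reported = ["write", "write", "search", "search"]
predicted = ["write", "search", "search", "search"]
kappa = cohens_kappa(reported, predicted)
print(kappa)
```

Kappa is used here instead of raw accuracy because a dominant class such as writing would inflate raw agreement even for a weak classifier.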

Figures

Figures reproduced from arXiv: 2605.23958 by Alperen Kok, Andrey Zaikin, Gorkem Ozer Yilmaz, Himanshu Sharma, Jing Dong, Kiran Tomlinson, Rui Hu, Scott Counts, Siddharth Suri, Sonia Jaffe, Will Wang, Yan Chen.

Figure 1
Figure 1 shows the distribution of user prompts in each top-level intent class. Information Inquiry and Content Refinement account for almost 60% of user prompts. Other forms of content work, including generation, as well as programming assistance and analytical reasoning account for about 20% of user prompts. Content summarization, ideation and planning, and meeting-related activity account for about 7% of usage, …
Figure 2
Figure 2: Second-level intents for each of the top 6 most frequent top-level intents. See Appendix Figure A1 for all second and first-level intent frequencies. Section 3.1.3, Intent Comparison to ChatGPT: For comparison and generalization, we compare user intent in M365 Copilot to user intent in ChatGPT, as reported by Chatterji et al. [2025]. Because we do not use the same intent taxonomy, we manually mapped our second-level…
Figure 3
Figure 3: M365 Copilot and ChatGPT intents, mapped to the OpenAI topic taxonomy. Section 3.1.4, Intent Over Time: Given the rapid pace of adoption of AI (e.g., Chatterji et al. [2025], Microsoft [2025], Misra et al. [2025]), how is usage changing as more people use, learn to use, and develop usage habits with AI? We look for changes in user intent over the 114-day period (June 1, 2025 to September 22, 2025) in our intent time …
Figure 4
Figure 4: Fractional share of top-level intents over time. Data were sampled daily from June 1 to September 22, 2025. Values are averaged over the rolling past 28 days. Section 3.1.5, Intent by Industry and Occupation: We subset our intent data based on the originating company of each M365 Copilot prompt and aggregate to the same three example industries as in our O*NET industry dataset: Banking, Manufacturing and Consulting …
Figure 5
Figure 5: Top-level intents for three example industries, with the average across all 37 industries shown for reference. Data were sampled daily from Feb 16-22, 2026. We turn now to our examination of top-level user intent across Occupational Groups (…
Figure 6
Figure 6: Share of top-level user intent by Occupational Group. For the full intersection of all Occupational Groups by all top-level user intents, see Appendix Figure A2. We generalize the analysis of relationships in the share distributions of the intents over Occupational Groups by clustering the intents. The resulting 4-cluster solution is shown in …
Figure 7
Figure 7: Fraction of each generalized work activity (GWA) for User Goal and Copilot Action.
Figure 8
Figure 8: Top 25 most frequent IWAs for user goal (a) and Copilot action (b). Note different x-axis scales. The top 25 IWAs reflect “head of curve” information work tasks more common across occupations, but they don’t tell the whole story. In fact, these top 25 account for only two-thirds of all user goal IWA instances and three-quarters of Copilot action IWA instances. In other words, about 30% of all work activiti…
Figure 9
Figure 9 highlights the IWAs that skew toward one or the other. User goals tend to emphasize data and information gathering, financial work, and administrative tasks, whereas Copilot actions predominantly focus on advising, assisting, and providing information. The shape of this asymmetry (users bringing tasks and information needs, and Copilot responding with advice, assistance, and explanation) reflects Copilot…
Figure 10
Figure 11
Figure 12
Figure 12: Weighted activity share for GWAs. Bars show activity share as in …
Figure 13
Figure 13: IWA comparison across three example industries. IWAs shown differ in M365 Copilot activity share by industry and/or from estimated share of work done in the industries based on breakdown of per-occupation workforce share within each industry taken from Occupational Employment and Wage Statistics. See Appendix Figure A5 for comparison across a larger set of IWAs.
Figure 14
Figure 14: AI applicability score by O*NET job family. To understand the makeup of work done in each job family we decompose the AI Applicability score for each job family into constituent GWAs. This allows us to understand not just the overall degree of AI applicability to each job family, but a “profile” of the ways in which AI is applicable based on usage. Although the Administrative Support and …
Figure 15
Figure 15: AI applicability score and breakdown by GWA for the Computer and Mathematical (a) and Educational Instruction and Library (b) job families. See Figure A6 for more job families. Job families with lower AI applicability scores can also be differentiated. Life & Social Sciences is spiky, with considerable applicability in Analyzing Data & Information and Documenting & Recording Information. Personal Services…
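The Figure 4 caption describes daily fractional shares averaged over a trailing 28-day window. A minimal sketch of that smoothing follows, assuming daily per-intent counts as input. The function names and the example numbers are hypothetical, and averaging the first days over fewer than 28 observations is an assumption made here, not something the caption states.

```python
def daily_shares(counts_by_day):
    """Convert per-day intent counts into per-day fractional shares."""
    shares = []
    for counts in counts_by_day:
        total = sum(counts.values())
        shares.append(
            {intent: (c / total if total else 0.0) for intent, c in counts.items()}
        )
    return shares


def trailing_mean(values, window=28):
    """Mean of each value and the preceding values, up to `window` days.

    Early days are averaged over the days available so far (an assumption).
    """
    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the day leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


# Hypothetical counts for three days and two intents.
days = [
    {"search": 50, "writing": 50},
    {"search": 25, "writing": 75},
    {"search": 75, "writing": 25},
]
search_share = [day["search"] for day in daily_shares(days)]
print(trailing_mean(search_share, window=2))
```

Averaging shares rather than pooling raw counts gives every day equal weight, which matters when daily sample sizes differ.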
Original abstract

M365 Copilot is used every week by millions of people across more than a million companies around the world as part of their workflows. Uniquely positioned in the AI landscape given its near-exclusive use for work purposes, M365 Copilot can offer a clear picture of how people use AI for work and where that usage may expand next. This paper characterizes that usage through direct classification of user interactions with M365 Copilot Chat. Based on an anonymized and privacy-preserving analysis of a sample of approximately 5.5 million sessions, we combine a learned classification of user intent with a classification of O*NET work activities done with M365 Copilot Chat. We find that M365 Copilot is emerging as an everyday assistant for knowledge work: writing dominates, but users also rely on it for information retrieval, analysis, decision making and strategizing, and evaluating and diagnosing programs and systems, among others. Information-seeking tasks remain common, but time trends suggest a relative shift away from “chat as search” and toward content and communication-related work. Comparisons across occupational groupings and to work done in the labor market further show that usage is broad but uneven, where the relative share of work done with M365 Copilot Chat cuts across jobs in some cases and is occupation-specific in others. Areas of relative underrepresentation in the labor market suggest the next frontier for enterprise AI adoption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes a sample of approximately 5.5 million anonymized M365 Copilot Chat sessions using a learned intent classifier combined with O*NET work activity mappings. It claims that writing dominates usage for knowledge work, information-seeking tasks are common but declining relatively over time in favor of content and communication tasks, and usage is broad but uneven across occupational groups, with implications for future enterprise AI adoption.

Significance. If the intent classification is shown to be reliable, the study would provide one of the largest-scale empirical characterizations of real-world enterprise generative AI usage, enabling direct comparisons to labor-market task distributions via O*NET and highlighting adoption gaps.

major comments (2)
  1. [Methods] Methods section (classification procedure): the abstract states that results rest on a 'learned classification of user intent' applied to 5.5M sessions, yet no training data description, model details, human validation metrics (precision, recall, F1, inter-annotator agreement), or error analysis are supplied. Because every reported proportion, time trend, and occupational comparison derives from these labels, the absence of validation metrics renders the central empirical claims impossible to assess for bias or accuracy.
  2. [Data] Data and sample section: the paper asserts the 5.5M sessions are drawn from usage across more than a million companies and are representative, but provides no information on sampling frame, inclusion/exclusion criteria, stratification by company size or role, or checks for selection bias. This directly undermines the generalizability of the 'broad but uneven' usage findings and the labor-market comparisons.
minor comments (1)
  1. [Abstract] Abstract: the phrase 'learned classification' is introduced without any accompanying performance qualifier, which could be clarified in one sentence to help readers immediately gauge the evidential basis.
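The validation metrics requested in major comment 1 can be computed from a human-labeled sample. The sketch below derives per-class precision, recall, and F1; the function name, labels, and data are hypothetical and do not come from the paper.

```python
def per_class_prf(gold, predicted):
    """Return {label: (precision, recall, f1)} for two label sequences."""
    if len(gold) != len(predicted):
        raise ValueError("sequences must have equal length")
    metrics = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        # Define a metric as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        metrics[label] = (precision, recall, f1)
    return metrics


# Hypothetical human labels vs. classifier output for four sessions.
gold = ["write", "write", "search", "search"]
predicted = ["write", "search", "search", "search"]
metrics = per_class_prf(gold, predicted)
for label, (p, r, f1) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting these per class matters here because every downstream proportion inherits the error of the class it counts.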

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which identify key areas where additional transparency is needed to support the paper's claims. We respond to each major comment below and indicate planned revisions.

point-by-point responses
  1. Referee: [Methods] Methods section (classification procedure): the abstract states that results rest on a 'learned classification of user intent' applied to 5.5M sessions, yet no training data description, model details, human validation metrics (precision, recall, F1, inter-annotator agreement), or error analysis are supplied. Because every reported proportion, time trend, and occupational comparison derives from these labels, the absence of validation metrics renders the central empirical claims impossible to assess for bias or accuracy.

    Authors: We agree that the absence of these details in the current manuscript is a significant limitation that prevents readers from evaluating the reliability of the intent labels. The referee's point is correct. In the revised version we will expand the methods section with a description of the training data, model architecture and training procedure, human validation metrics (precision, recall, F1), inter-annotator agreement, and an error analysis to the extent permitted by data-access constraints. revision: yes

  2. Referee: [Data] Data and sample section: the paper asserts the 5.5M sessions are drawn from usage across more than a million companies and are representative, but provides no information on sampling frame, inclusion/exclusion criteria, stratification by company size or role, or checks for selection bias. This directly undermines the generalizability of the 'broad but uneven' usage findings and the labor-market comparisons.

    Authors: The referee correctly identifies that the data section provides insufficient information on sampling and potential biases. We will revise the section to add available details on the sampling frame, inclusion/exclusion criteria, and any representativeness checks. Full stratification information by company size or role cannot be disclosed due to privacy and enterprise confidentiality agreements; we will explicitly note this limitation and its implications for generalizability. revision: partial

standing simulated objections not resolved
  • Complete disclosure of the full sampling frame, company-level stratification, and certain model-training details is not possible due to binding privacy and confidentiality constraints on enterprise usage data.

Circularity Check

0 steps flagged

No circularity: purely empirical analysis of usage data

full rationale

The paper performs direct classification of 5.5M anonymized sessions via a learned intent model mapped to external O*NET work activities, then reports observed distributions and trends. No equations, fitted parameters relabeled as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain. All claims rest on observable session data and an independent taxonomy rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on accuracy of intent classifier and representativeness of session sample; relies on domain assumption that O*NET taxonomy maps appropriately to AI-assisted tasks.

free parameters (1)
  • Intent classification model parameters
    Learned model for classifying user intent from sessions likely involves parameters fitted to training data.
axioms (1)
  • domain assumption: O*NET taxonomy provides a valid categorization of work activities performed with AI tools
    Paper maps classified sessions to O*NET activities to characterize usage.

pith-pipeline@v0.9.1-grok · 5827 in / 1083 out tokens · 43730 ms · 2026-06-30T22:02:18.975710+00:00 · methodology

Reference graph

Works this paper leans on

16 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1]

    Accessed: 2025-10-05

    URL https://www.anthropic.com/research/anthropic-economic-index-september-2025-report. Accessed: 2025-10-05. Ruth Appel, Maxim Massenkoff, Peter McCrory, Miles McCain, Ryan Heller, Tyler Neylon, and Alex Tamkin. Anthropic economic index report: economic primitives,

  2. [2]

    Alexander Bick, Adam Blandin, and David J Deming

    URL https://www.anthropic.com/research/anthropic-economic-index-january-2026-report. Alexander Bick, Adam Blandin, and David J Deming. The rapid adoption of generative AI. Management Science,

  3. [3]

    engines of growth

    doi: 10.1287/mnsc.2025.02523. Timothy F Bresnahan and Manuel Trajtenberg. General purpose technologies: “engines of growth”? Journal of Econometrics, 65(1):83–108,

  4. [4]

    Accessed: 2025-10-12

    URL https://www.wsj.com/opinion/ais-overlooked-97-billion-contribution-to-the-economy-users-service-da6e8f55. Accessed: 2025-10-12. Kevin Zheyuan Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz. The effects of generative AI on high-skilled work: Evidence from three field experiments with software developers. Management Science,

  5. [5]

    Early impacts of M365 Copilot. arXiv preprint arXiv:2504.11443,

    Eleanor Wiske Dillon, Sonia Jaffe, Sida Peng, and Alexia Cambon. Early impacts of M365 Copilot. arXiv preprint arXiv:2504.11443,

  6. [6]

    Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli

    Kunal Handa, Alex Tamkin, Miles McCain, Saffron Huang, Esin Durmus, Sarah Heck, Jared Mueller, Jerry Hong, Stuart Ritchie, Tim Belonax, Kevin K. Troy, Dario Amodei, Jared Kaplan, Jack Clark, and Deep Ganguli. Which economic tasks are performed with AI? evidence from millions of Claude conversations. arXiv:2503.04761 [cs.CY],

  7. [7]

    Accessed: 2025-10-12

    URL https://www.microsoft.com/en-us/research/wp-content/uploads/2024/07/Generative-AI-in-Real-World-Workplaces.pdf. Accessed: 2025-10-12. Maxim Massenkoff, Eva Lyubich, Peter McCrory, Ruth Appel, and Ryan Heller. Anthropic economic index report: Learning curves,

  8. [8]

    com/research/economic-index-march-2026-report

    URL https://www.anthropic.com/research/economic-index-march-2026-report. Microsoft. AI diffusion report: Where AI is most used, developed, and built. Technical report, Microsoft Corporation,

  9. [9]

    Accessed: 2025-11-06

    URL https://www.microsoft.com/en-us/research/wp-content/uploads/2025/10/Microsoft-AI-Diffusion-Report.pdf. Accessed: 2025-11-06. Amit Misra, Jane Wang, Scott McCullers, Kevin White, and Juan Lavista Ferres. Measuring AI diffusion: A population-normalized metric for tracking global AI usage,

  10. [10]

    National Center for O*NET Development

    URL https://arxiv.org/abs/2511.02781. National Center for O*NET Development. O*NET Database Version 29.0,

  11. [11]

    Accessed: 2025-05-29

    URL https://www.onetcenter.org/db_releases.html. Accessed: 2025-05-29. Shakked Noy and Whitney Zhang. Experimental evidence on the productivity effects of generative artificial intelligence. Science, 381(6654):187–192,

  12. [12]

    The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

    URL https://www.gallup.com/workplace/691643/work-nearly-doubled-two-years.aspx. Accessed: 2025-10-19. Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. The impact of AI on developer productivity: Evidence from GitHub copilot. arXiv preprint arXiv:2302.06590,

  13. [13]

    Working with AI: Measuring the applicability of AI to occupations

    Kiran Tomlinson, Sonia Jaffe, Will Wang, Scott Counts, and Siddharth Suri. Working with AI: Measuring the applicability of AI to occupations. arXiv:2507.07935 [cs.AI],

  14. [14]

    Bureau of Labor Statistics

    U.S. Bureau of Labor Statistics. Occupational Employment and Wage Statistics (OEWS), May 2024,

  15. [15]

    Accessed: 2025-05-29

    URL https://www.bls.gov/oes/tables.htm. Accessed: 2025-05-29. Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W. White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, and Nagu Rangan. Tnt-llm: Text mining at scale with large language models. In KDD, March

  16. [16]

    Appendix A, Figures & Tables

    [Figure residue: intent-frequency bar charts; axis ticks and category labels such as Chat Inquiry, Proofread, and Edit omitted.]