pith. machine review for the scientific record.

arxiv: 2604.28181 · v1 · submitted 2026-04-30 · 💻 cs.AI · cs.CL · cs.LG

Recognition: unknown

Synthetic Computers at Scale for Long-Horizon Productivity Simulation

Authors on Pith · no claims yet

Pith reviewed 2026-05-07 06:59 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords synthetic computers · long-horizon simulation · productivity agents · synthetic data · agent self-improvement · AI agents · reinforcement learning

The pith

Synthetic computers at scale let agents simulate month-long productivity work to generate training signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a method to create thousands of synthetic computers with realistic folder structures and documents. Agents then generate and complete complex objectives that would take a human about a month, navigating files and producing artifacts over thousands of steps. The resulting interaction data improves agent performance on productivity tasks in both familiar and new settings. A reader would care because long-horizon work on personal computers is central to many real jobs, yet realistic training examples for such work are scarce.

Core claim

The approach constructs 1,000 synthetic computers with realistic folder hierarchies and content-rich artifacts, then runs long-horizon simulations on them: one agent creates user-specific objectives that require multiple professional deliverables and roughly a month of human work, while another agent fulfills those objectives by navigating the system, coordinating with simulated collaborators, and producing outputs. The resulting experiential learning signals yield significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations, and the method can in principle scale to millions or billions of such environments given sufficient compute.

What carries the argument

Synthetic Computers at Scale: the creation of synthetic user computers with realistic folder hierarchies and artifacts, on which long-horizon agent simulations are run to produce training data for productivity tasks.
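The persona-to-filesystem step can be made concrete with a minimal sketch. This is a stand-in, not the paper's pipeline: the paper expands personas into environments with a model, whereas here a fixed occupation template (all names hypothetical) plays that role so the structure is runnable.

```python
import json
import os
import tempfile

# Hypothetical persona; the paper samples these at scale from a persona pool.
PERSONA = {"name": "Margaret Forsythe", "occupation": "financial advisor"}

# Stand-in for model-driven generation: a fixed folder/artifact template per
# occupation. A real system would let an LLM propose hierarchy and contents.
TEMPLATES = {
    "financial advisor": {
        "Clients": ["household_summary.docx"],
        "Models": ["portfolio_2026.xlsx"],
        "Committee": ["ic_agenda.pptx"],
    }
}

def build_synthetic_computer(persona, root):
    """Materialize a toy synthetic computer for a persona under `root`."""
    layout = TEMPLATES[persona["occupation"]]
    for folder, files in layout.items():
        path = os.path.join(root, folder)
        os.makedirs(path, exist_ok=True)
        for fname in files:
            # Stub artifact; the paper generates content-rich documents here.
            with open(os.path.join(path, fname), "w") as f:
                json.dump({"owner": persona["name"], "stub": True}, f)
    return sorted(os.listdir(root))

root = tempfile.mkdtemp()
print(build_synthetic_computer(PERSONA, root))
# prints ['Clients', 'Committee', 'Models']
```

The long-horizon simulation would then run on top of such a tree, with the objective-setting and work agents reading and writing these files over thousands of turns.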

If this is right

  • Significant improvements appear in agent performance on both in-domain and out-of-domain productivity evaluations.
  • The method can scale in principle to millions or billions of synthetic user worlds with enough compute.
  • It supplies a substrate for agent self-improvement and agentic reinforcement learning.
  • Broader coverage of diverse professions, roles, contexts, and productivity needs becomes possible.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could reduce the need to collect real user data, lowering privacy risks in agent training.
  • The same environment-generation idea might extend to other long-horizon domains such as software development or research.
  • If the simulations prove effective, they could serve as a testbed for measuring how well agents sustain work across changing contexts.

Load-bearing premise

The synthetic computer environments and simulations must be realistic enough that the learning signals transfer effectively to real user computers and actual productivity work.

What would settle it

If agents that improve from these synthetic simulations show no performance gains when tested on real user computers performing actual productivity tasks, the transfer of learning signals would be falsified.
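That falsification test amounts to a paired comparison on real-computer tasks. A minimal sketch, with purely illustrative scores (not the paper's numbers): if the synthetic-trained agent's per-task win rate over the baseline is near chance, the transfer claim fails.

```python
from statistics import mean

def paired_win_rate(baseline, trained):
    """Fraction of real tasks where the synthetic-trained agent scores higher."""
    assert len(baseline) == len(trained)
    wins = sum(t > b for b, t in zip(baseline, trained))
    return wins / len(baseline)

# Hypothetical per-task rubric scores on real user computers.
baseline = [0.52, 0.47, 0.61, 0.40, 0.55]
trained  = [0.58, 0.45, 0.70, 0.51, 0.62]

print(paired_win_rate(baseline, trained))   # prints 0.8
print(round(mean(trained) - mean(baseline), 3))
```

A real study would add a significance test over many tasks; the point here is only that the settling criterion is a concrete, paired measurement, not a qualitative judgment.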

Figures

Figures reproduced from arXiv: 2604.28181 by Baolin Peng, Hao Cheng, Jianfeng Gao, Tao Ge.

Figure 1. Overview of our methodology: we create user-specific synthetic computers from personas …
Figure 2. Overview of synthetic computer creation: a persona is first expanded into a detailed user …
Figure 3. Screenshots of the artifacts created in the synthetic computer.
Figure 4. Occupation distribution of the 1,000 sampled personas; a companion panel shows the share of files by type (docx, xlsx, pdf, py, txt, pptx, png, json, md, toml, jsonl, others).
Figure 6. Score distribution against the rubric. Left: per-computer aggregate. Right: per-deliverable.
Figure 7. Per-computer win/lose share of skill-augmented variants against the baseline as the number of training computers N grows. Each row aggregates 100 paired comparisons.
Figure 8. Out-of-domain evaluation on the GDPVal gold set (220 tasks).
Figure 9. A self-improving loop for productivity agents. Synthetic computers enable long-horizon …
original abstract

Realistic long-horizon productivity work is strongly conditioned on user-specific computer environments, where much of the work context is stored and organized through directory structures and content-rich artifacts. To scale synthetic data creation for such productivity scenarios, we introduce Synthetic Computers at Scale, a scalable methodology for creating such environments with realistic folder hierarchies and content-rich artifacts (e.g., documents, spreadsheets, and presentations). Conditioned on each synthetic computer, we run long-horizon simulations: one agent creates productivity objectives that are specific to the computer's user and require multiple professional deliverables and about a month of human work; another agent then acts as that user and keeps working across the computer -- for example, navigating the filesystem for grounding, coordinating with simulated collaborators, and producing professional artifacts -- until these objectives are completed. In preliminary experiments, we create 1,000 synthetic computers and run long-horizon simulations on them; each run requires over 8 hours of agent runtime and spans more than 2,000 turns on average. These simulations produce rich experiential learning signals, whose effectiveness is validated by significant improvements in agent performance on both in-domain and out-of-domain productivity evaluations. Given that personas are abundant at billion scale, this methodology can in principle scale to millions or even billions of synthetic user worlds with sufficient compute, enabling broader coverage of diverse professions, roles, contexts, environments, and productivity needs. We argue that scalable synthetic computer creation, together with at-scale simulations, is highly promising as a foundational substrate for agent self-improvement and agentic reinforcement learning in long-horizon productivity scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces Synthetic Computers at Scale, a methodology for generating large numbers of realistic synthetic user computer environments complete with folder hierarchies and content-rich artifacts such as documents and spreadsheets. Conditioned on these environments, it runs long-horizon simulations in which one agent generates user-specific productivity objectives requiring multiple deliverables over roughly a month of work, while a second agent simulates the user completing those objectives through filesystem navigation, collaboration, and artifact production. Preliminary experiments create 1,000 such computers and execute simulations averaging over 2,000 turns each (more than 8 hours of runtime), producing experiential data that yields significant performance gains on both in-domain and out-of-domain productivity evaluations. The authors argue this approach can scale to millions or billions of synthetic worlds and serves as a foundational substrate for agent self-improvement and agentic RL in long-horizon productivity tasks.

Significance. If the generated trajectories prove sufficiently close in distribution to real user-computer interactions, the method could supply scalable, diverse training signals that are currently scarce for long-horizon agent training. The reported improvements on out-of-domain tasks would indicate that synthetic environments can support generalization beyond the training distribution, addressing a key bottleneck in agentic RL. The explicit scaling argument to billion-scale personas is a concrete strength, as is the focus on full computer states rather than isolated tasks.

major comments (3)
  1. [Experiments / preliminary results] Experiments section (preliminary results paragraph): the abstract states that the 1,000 simulations produce 'significant improvements' on in-domain and out-of-domain evaluations, yet provides no quantitative metrics, baseline agents, statistical tests, or error analysis. Because the central claim rests on these performance gains demonstrating effective learning signals, the absence of these details makes it impossible to assess effect size or reliability.
  2. [Methodology / Synthetic Computers at Scale] Synthetic computer generation subsection: the description of folder hierarchies and artifact distributions is high-level and does not specify the exact prompting strategies, base models, or any quantitative validation (e.g., distributional similarity to real user data or human realism ratings). This construction is load-bearing for the realism assumption required for transfer to genuine productivity tasks.
  3. [Evaluation / out-of-domain] Out-of-domain evaluation paragraph: the manuscript must define the out-of-domain test suite explicitly (e.g., held-out professions, different artifact types, or real user traces) and report any measured overlap with the synthetic training distribution. Without this, the generalization claim cannot be evaluated.
minor comments (3)
  1. [Abstract] Abstract: the phrase 'about a month of human work' is vague; replace with a concrete estimate of total turns or wall-clock time per objective.
  2. [Introduction / Related Work] Related work: the manuscript should cite prior synthetic data generation efforts for agents (e.g., work on synthetic environments for RL or computer-use agents) to clarify novelty.
  3. [Figures] Figure captions: any diagrams of folder hierarchies or simulation traces should include scale bars or example counts to convey the realism and diversity achieved.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our work. We address each of the major comments below, indicating the revisions we plan to make to the manuscript.

point-by-point responses
  1. Referee: [Experiments / preliminary results] Experiments section (preliminary results paragraph): the abstract states that the 1,000 simulations produce 'significant improvements' on in-domain and out-of-domain evaluations, yet provides no quantitative metrics, baseline agents, statistical tests, or error analysis. Because the central claim rests on these performance gains demonstrating effective learning signals, the absence of these details makes it impossible to assess effect size or reliability.

    Authors: We acknowledge that the preliminary nature of the experiments section does not include the detailed quantitative metrics, baselines, statistical tests, or error analysis that would be expected in a full paper. The current manuscript emphasizes the scale and methodology of the synthetic computer simulations, with performance claims stated at a high level. In the revised manuscript, we will expand this section to report specific metrics showing the performance gains, include comparisons against baseline agents without the synthetic data, perform statistical tests where appropriate, and provide error analysis. This will strengthen the evidence for the effectiveness of the learning signals. revision: yes

  2. Referee: [Methodology / Synthetic Computers at Scale] Synthetic computer generation subsection: the description of folder hierarchies and artifact distributions is high-level and does not specify the exact prompting strategies, base models, or any quantitative validation (e.g., distributional similarity to real user data or human realism ratings). This construction is load-bearing for the realism assumption required for transfer to genuine productivity tasks.

    Authors: The referee correctly identifies that the synthetic computer generation is described at a high level in the current version. We will revise the methodology subsection to include more specific details on the prompting strategies used to generate folder hierarchies and content-rich artifacts, as well as the base models employed for this generation process. Regarding quantitative validation, we will add any distributional similarity measures we have computed and discuss the limitations of the current validation approach. We note that human realism ratings were not conducted in this preliminary work but plan to include them in future extensions. revision: partial

  3. Referee: [Evaluation / out-of-domain] Out-of-domain evaluation paragraph: the manuscript must define the out-of-domain test suite explicitly (e.g., held-out professions, different artifact types, or real user traces) and report any measured overlap with the synthetic training distribution. Without this, the generalization claim cannot be evaluated.

    Authors: We agree that the out-of-domain evaluation requires clearer definition. In the revision, we will explicitly describe the out-of-domain test suite, specifying that it consists of held-out professions and different artifact types not used in training. We will also report any analysis of overlap with the synthetic training distribution using embedding-based similarity metrics. This will help substantiate the generalization claims. revision: yes
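The promised embedding-based overlap analysis can be sketched simply: embed each held-out task and report its maximum cosine similarity to the synthetic training tasks. A learned embedder would be used in practice; a hashed bag-of-words vector stands in here so the example is self-contained, and all task strings are hypothetical.

```python
import math
from collections import Counter

def embed(text, dim=64):
    """Toy stand-in for a text embedder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for tok, cnt in Counter(text.lower().split()).items():
        vec[hash(tok) % dim] += cnt
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def max_train_similarity(test_task, train_tasks):
    """Max cosine similarity of one test task to any training task."""
    t = embed(test_task)
    return max(sum(a * b for a, b in zip(t, embed(tr))) for tr in train_tasks)

train = ["refresh the 2026 model portfolio", "draft the committee presentation"]
print(round(max_train_similarity("refresh the 2026 model portfolio", train), 2))
```

Reporting the distribution of these per-task maxima over the out-of-domain suite would quantify how far the generalization claim actually reaches beyond the training distribution.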

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents a methodology for constructing synthetic computer environments with folder hierarchies and artifacts, then running long-horizon agent simulations (objective generation followed by user-agent execution) to produce experiential trajectories. These trajectories are used to improve agent policies, with effectiveness shown via reported performance gains on separate in-domain and out-of-domain productivity evaluations. No load-bearing step reduces by definition or construction to its own inputs: the synthetic creation process, simulation mechanics, and validation metrics are described as independent procedures whose outputs are empirically tested rather than presupposed. No self-citations, fitted parameters renamed as predictions, or uniqueness theorems appear in the provided text. The chain from environment generation to simulation data to measured improvements is self-contained and externally checkable through the stated experimental setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on a domain assumption about the transferability of synthetic simulation data to real tasks, without introducing new physical entities or numerous fitted parameters beyond standard simulation choices.

axioms (1)
  • Domain assumption: synthetic computer environments with generated folder hierarchies and artifacts are sufficiently realistic to produce transferable learning signals for AI agents in productivity scenarios.
    Invoked when the abstract claims validation through significant improvements on both in-domain and out-of-domain productivity evaluations.

pith-pipeline@v0.9.0 · 5586 in / 1602 out tokens · 64985 ms · 2026-05-07T06:59:30.451005+00:00 · methodology

discussion (0)

