arxiv: 2502.02904 · v5 · pith:3LZCSWS4new · submitted 2025-02-05 · 💻 cs.HC · cs.CL · q-bio.NC

ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

classification 💻 cs.HC · cs.CL · q-bio.NC
keywords writing, scholarly, dataset, end-to-end, scholawrite, assistants, capturing, cognitive

Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks with different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions, which includes LaTeX-based edits from five computer science preprints and captures nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Privacy-Preserving Proof of Human Authorship via Zero-Knowledge Process Attestation

    cs.CR · 2026-02 · unverdicted · novelty 6.0

    ZK-PoP uses Groth16 proofs, Pedersen commitments, and Bulletproof range proofs to attest that behavioral feature vectors and content evolution match human patterns without exposing the raw data.

  2. Detecting Cognitive Signatures in Typing Behavior for Non-Intrusive Authorship Verification

    cs.CR · 2026-02 · unverdicted · novelty 6.0

    Cognitive Load Correlation from keystroke timings distinguishes genuine human composition from mechanical transcription with estimated 85-95% accuracy in a non-intrusive framework.