Exposía: Teaching and Assessment of Academic Writing Skills for Research Project Proposals and Peer Feedback
Pith reviewed 2026-05-16 15:21 UTC · model grok-4.3
The pith
Exposía supplies the first public dataset of student research proposals paired with peer feedback and human assessment scores for testing AI writing tools.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exposía is the first public dataset connecting writing and feedback in higher education, including student research project proposals, peer and instructor comments and free-text reviews, and human assessment scores from a fine-grained schema. Benchmarking state-of-the-art LLMs shows that different models perform best on scoring proposals versus reviews, closed-source models consistently outperform open-weight models, and a prompting strategy that scores multiple aspects together is most effective.
What carries the argument
The Exposía dataset, which captures the multi-stage academic writing process of drafting, receiving feedback, and revising, together with the fine-grained, pedagogically grounded assessment schema.
If this is right
- Researchers can now develop and test educationally grounded computational approaches using real student data.
- Automated scoring of proposals and reviews becomes feasible with current LLMs, though model selection matters.
- Joint multi-aspect prompting improves reliability for classroom use.
- Open-weight models need improvement to match closed-source performance in educational settings.
Where Pith is reading between the lines
- AI systems trained on this data could provide instant feedback to help students improve their proposals before submission.
- The dataset's structure might generalize to other writing tasks like lab reports or literature reviews.
- Future work could explore using the feedback comments to train models that generate helpful peer reviews.
Load-bearing premise
The data collected from a single computer science course and the human assessments based on the developed schema are representative enough to support general claims about LLM performance for academic writing assessment.
What would settle it
A follow-up study that applies the same LLMs and prompting strategies to a comparable dataset from a different field or university and finds significantly lower agreement with human scores would challenge the generalizability.
We present Exposía, the first public dataset that connects writing and feedback in higher education, enabling research on educationally grounded computational approaches to teaching and evaluating academic writing. Exposía includes student research project proposals and peer and instructor feedback consisting of comments and free-text reviews. The dataset was collected in the "Introduction to Scientific Work" course of the Computer Science. Exposía reflects the multi-stage nature of the academic writing process that includes drafting, receiving feedback, and revising the writing based on the feedback received. Both the project proposals and peer feedback are accompanied by human assessment scores based on a fine-grained, pedagogically-grounded schema for writing and feedback assessment that we develop. We use Exposía to benchmark state-of-the-art large language models (LLMs) on two tasks: automated scoring of (1) the proposals and (2) the student reviews. We find that the two tasks benefit from different LLMs. Furthermore, closed-source models consistently outperform open-weight models, motivating further research on improving the performance of open-weight models preferred in classroom settings. Finally, we establish that a prompting strategy that scores multiple aspects of the writing together is the most effective, an important finding for classroom deployment.
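To make the abstract's last finding concrete, here is a minimal Python sketch contrasting a prompt that scores all aspects together with one prompt per aspect. The criterion names, score ranges, and prompt wording are hypothetical placeholders chosen for illustration; they are not the paper's schema or its actual prompts.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Criterion:
    name: str
    allowed_scores: Tuple[int, ...]


# Hypothetical criteria for illustration only (not taken from the paper).
CRITERIA = (
    Criterion("Research question", (0, 1, 2)),
    Criterion("Related work", (0, 1, 2, 3)),
    Criterion("Methodology", (0, 1, 2)),
)


def describe(criterion: Criterion) -> str:
    """Render one criterion with its allowed point values."""
    options = ", ".join(str(s) for s in criterion.allowed_scores)
    return f"- {criterion.name} (allowed scores: {options})"


def joint_prompt(text: str, criteria: Sequence[Criterion]) -> str:
    """Build a single prompt that asks for scores on all criteria together."""
    listing = "\n".join(describe(c) for c in criteria)
    return (
        "Score the text on every criterion below.\n"
        f"{listing}\n\nText:\n{text}"
    )


def per_aspect_prompts(text: str, criteria: Sequence[Criterion]) -> List[str]:
    """Build one prompt per criterion, so each aspect is scored in isolation."""
    return [joint_prompt(text, (c,)) for c in criteria]
```

Under these assumptions, the joint strategy issues one model call per document, while the per-aspect strategy issues one call per criterion and gives the model no view of the other criteria.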
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Exposía, the first public dataset linking student research project proposals with peer and instructor feedback (comments and free-text reviews) collected from a single 'Introduction to Scientific Work' course in Computer Science. Proposals and reviews are accompanied by human assessment scores using a new fine-grained, pedagogically grounded schema. The authors benchmark state-of-the-art LLMs on two tasks, automated scoring of proposals and of student reviews, and report that the tasks benefit from different models, that closed-source models consistently outperform open-weight models, and that a multi-aspect prompting strategy is most effective.
Significance. If the dataset construction and benchmarking results prove reliable, the work supplies a novel, educationally grounded resource for research on computational support for academic writing instruction and assessment. The empirical findings on model selection and prompting could inform practical LLM deployment in classrooms, particularly by highlighting trade-offs between closed- and open-weight models.
major comments (2)
- [Dataset collection and benchmarking sections] The dataset and all reported LLM performance differentials derive exclusively from one CS course at a single institution. This narrow sampling frame makes the central claims—that different LLMs excel at proposal versus review scoring, that closed-source models are superior, and that multi-aspect prompting is optimal—vulnerable to course-specific confounds (topic distribution, student demographics, feedback norms). The benchmarking conclusions therefore require either explicit limitation statements or additional cross-cohort validation to support general statements about LLM utility for academic writing assessment.
- [Abstract and evaluation description] No quantitative results, error analysis, inter-rater agreement statistics for the human schema, or evaluation-setup details (metrics, sample sizes, statistical tests) appear in the provided abstract or description. Without these, the reliability of the reported model rankings and the superiority of multi-aspect prompting cannot be verified, undermining the empirical contribution.
minor comments (1)
- [Title and abstract] The rendering of the dataset name 'Expos'ia' (with accent) should be checked for consistency across title, abstract, and body text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, agreeing where revisions are needed to improve clarity and rigor while defending the core contributions of the Exposía dataset as the first public resource of its kind.
-
Referee: [Dataset collection and benchmarking sections] The dataset and all reported LLM performance differentials derive exclusively from one CS course at a single institution. This narrow sampling frame makes the central claims—that different LLMs excel at proposal versus review scoring, that closed-source models are superior, and that multi-aspect prompting is optimal—vulnerable to course-specific confounds (topic distribution, student demographics, feedback norms). The benchmarking conclusions therefore require either explicit limitation statements or additional cross-cohort validation to support general statements about LLM utility for academic writing assessment.
Authors: We agree that the single-institution, single-course origin of Exposía represents a genuine limitation on generalizability. The reported model rankings and prompting findings are observations within this specific educational context rather than universal claims. In the revised manuscript we will add a dedicated Limitations section that explicitly discusses potential confounds including topic distribution, student demographics, and institutional feedback norms. We will also qualify all benchmarking statements to emphasize the need for future cross-cohort validation. As the first public dataset linking proposals, peer feedback, and pedagogically grounded assessments, we maintain that the resource still provides a valuable foundation for the community, but we accept that broader sampling is required before stronger generalization statements can be made.
revision: yes
-
Referee: [Abstract and evaluation description] No quantitative results, error analysis, inter-rater agreement statistics for the human schema, or evaluation-setup details (metrics, sample sizes, statistical tests) appear in the provided abstract or description. Without these, the reliability of the reported model rankings and the superiority of multi-aspect prompting cannot be verified, undermining the empirical contribution.
Authors: We acknowledge that the current abstract omits key quantitative details. The full manuscript already reports evaluation metrics (Pearson and Spearman correlations, accuracy), sample sizes, and statistical comparisons in the benchmarking sections. We will revise the abstract to include the main performance numbers (e.g., best-model scores for each task and the advantage of multi-aspect prompting) and will ensure that inter-rater agreement statistics for the assessment schema are reported prominently in the revised version. If an error analysis is not yet present, we will add a concise summary of common failure modes. These changes will allow readers to verify the model rankings without needing to consult the full text.
revision: yes
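For readers unfamiliar with the metrics named in the rebuttal, the following Python sketch shows how Pearson correlation, Spearman correlation, and exact-match accuracy between human and model scores could be computed. It is an illustration using only the standard library, not the authors' evaluation code; the function names and the choice of average ranks for ties are assumptions.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def accuracy(human: Sequence[int], model: Sequence[int]) -> float:
    """Fraction of items where the model score equals the human score."""
    if len(human) != len(model) or not human:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)
```

Ordinal rubric scores usually contain many ties, which is why the rank step averages tied positions rather than breaking ties arbitrarily.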
Circularity Check
No significant circularity: empirical dataset release and external LLM benchmarking
The paper collects a new dataset from one CS course, develops a scoring schema, and evaluates off-the-shelf LLMs on proposal and review scoring tasks. No equations, parameter fitting, or derivations exist. Claims rest on direct comparison to external models and human annotations rather than any self-referential reduction. Self-citations, if present, are not load-bearing for the central results. The work is self-contained against external benchmarks.