arxiv: 2601.06536 · v2 · submitted 2026-01-10 · 💻 cs.CL

Exposía: Teaching and Assessment of Academic Writing Skills for Research Project Proposals and Peer Feedback

Pith reviewed 2026-05-16 15:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords academic writing, dataset, peer feedback, LLM evaluation, automated scoring, research proposals, higher education

The pith

Exposía is the first public dataset of student research proposals paired with peer feedback and human assessment scores, and it serves as a testbed for AI writing tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Exposía, the first public dataset that links student research project proposals with the peer and instructor feedback they receive. It includes detailed human assessment scores, based on a pedagogically grounded schema, for both the writing and the reviews. The authors use the dataset to benchmark large language models on automatically scoring proposals and feedback. They find that the two scoring tasks favor different models, that closed-source models outperform open-weight ones, and that scoring multiple aspects together works best. The resource supports the development of computational tools for teaching and assessing academic writing in higher education.

Core claim

Exposía is the first public dataset connecting writing and feedback in higher education, including student research project proposals, peer and instructor comments and free-text reviews, and human assessment scores from a fine-grained schema. Benchmarking state-of-the-art LLMs shows that different models perform best on scoring proposals versus reviews, closed-source models consistently outperform open-weight models, and a prompting strategy that scores multiple aspects together is most effective.

What carries the argument

The Exposía dataset, which captures the multi-stage academic writing process of drafting, receiving feedback, and revising, along with the fine-grained, pedagogically grounded assessment schema.

If this is right

  • Researchers can now develop and test educationally grounded computational approaches using real student data.
  • Automated scoring of proposals and reviews becomes feasible with current LLMs, though model selection matters.
  • Joint multi-aspect prompting improves reliability for classroom use.
  • Open-weight models need improvement to match closed-source performance in educational settings.
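To make the contrast between joint multi-aspect prompting and criterion-by-criterion prompting concrete, here is a minimal sketch. It is not the authors' prompt or pipeline: the `Criterion` class, the helper names, the prompt wording, and the point values are all hypothetical. Only the response keys (`criterion_name`, `assigned_score`, `justification`) and the rubric names follow what the paper's excerpts show. No model is called; the sketch only builds prompts and validates a response.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    name: str
    allowed_points: tuple  # hypothetical point values, not the paper's rubric


def build_joint_prompt(criteria, text):
    """Build one prompt that asks for a score on every criterion at once."""
    lines = ["Score the text below on every listed criterion.", "Criteria:"]
    for c in criteria:
        options = ", ".join(str(p) for p in c.allowed_points)
        lines.append(f"- {c.name} (allowed points: {options})")
    lines.append(
        'Return a JSON list of objects with the keys "criterion_name", '
        '"assigned_score" and "justification".'
    )
    lines.append("Text:")
    lines.append(text)
    return "\n".join(lines)


def build_per_criterion_prompts(criteria, text):
    """Baseline for comparison: one separate prompt per criterion."""
    return [build_joint_prompt([c], text) for c in criteria]


def parse_scores(raw_response, criteria):
    """Validate a model response and return {criterion name: score}."""
    allowed = {c.name: set(c.allowed_points) for c in criteria}
    scores = {}
    for item in json.loads(raw_response):
        name = item["criterion_name"]
        score = int(item["assigned_score"])  # accepts 2 as well as "2"
        if name not in allowed:
            raise ValueError(f"unknown criterion: {name!r}")
        if score not in allowed[name]:
            raise ValueError(f"score {score} not allowed for {name!r}")
        scores[name] = score
    missing = set(allowed) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return scores
```

With two criteria, the joint variant produces a single prompt and the baseline produces two, so the joint variant needs fewer model calls per submission. Whether it also scores more reliably is the paper's empirical claim, which this sketch does not test.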

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI systems trained on this data could provide instant feedback to help students improve their proposals before submission.
  • The dataset's structure might generalize to other writing tasks like lab reports or literature reviews.
  • Future work could explore using the feedback comments to train models that generate helpful peer reviews.

Load-bearing premise

The data collected from a single computer science course and the human assessments based on the developed schema are representative enough to support general claims about LLM performance for academic writing assessment.

What would settle it

A follow-up study that applies the same LLMs and prompting strategies to a comparable dataset from a different field or university and finds significantly lower agreement with human scores would challenge the generalizability.

Figures

Figures reproduced from arXiv: 2601.06536 by Alla Rozovskaya, Dennis Zyska, Ilia Kuznetsov, Iryna Gurevych.

Figure 1: Overview of Exposía. In the university course …
Figure 2: Example instance from Exposía. Top: A student …
Figure 3: Per-rubric score improvements from draft …
Figure 4: Human–LLM agreement (QWA) for exposé scoring by expertise level. The dotted line shows human-human QWA. [Plot labels: x-axis Δ(QWA) = (G1 vs LLM) − (G2 vs LLM); rubrics Methodology, Motivation, Bibliography, Approach, Metadata, Language/Q, Schedule; models Qwen3 (80B), Llama 3.3 (70B), GPT OSS (120B), GPT OSS (20B).]
Figure 6: Daily number of actions across the semester for instructors (n=9) and students (n=45). Bars show per-day …
Figure 7: Human–LLM agreement (QWA) for review scoring by expertise level. The dotted line shows human-human QWA. [Plot labels: x-axis Δ(QWA) = (G1 vs LLM) − (G2 vs LLM); rubrics Feed Forward, Feed Up, Structure and Clarity, Content Quality, Feed Back, Errors, Language; models Qwen3 (80B), Llama 3.3 (70B), GPT OSS (120B), GPT OSS (20B).]
Figure 8: Group-asymmetry in human–LLM agreement by review rubric. Each point shows the difference in agreement between an LLM and human raters (Group 1 (G1) vs. Group 2 (G2)). The vertical line at Δ = 0 indicates equal agreement of the LLM with human G1 and G2 raters. Points to the right indicate the model agrees more with G1 raters than with G2 raters.
Figure 9: Runtime per submission for criterion-based …
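
The group-asymmetry quantity Δ(QWA) = (G1 vs LLM) − (G2 vs LLM) named in the figure captions can be sketched in a few lines. This page does not expand "QWA", so the sketch assumes it is a quadratically weighted agreement statistic and uses Cohen's kappa with quadratic weights as a stand-in. The function names and the toy scores in the usage note are hypothetical.

```python
def quadratic_weighted_kappa(a, b, min_score, max_score):
    """Cohen's kappa with quadratic weights for two integer score lists."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and of equal length")
    k = max_score - min_score + 1
    if k < 2:
        raise ValueError("need at least two score categories")
    n = len(a)
    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        observed[x - min_score][y - min_score] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * hist_a[i] * hist_b[j] / n
    if weighted_expected == 0:
        # Both raters gave one identical constant score: no disagreement possible.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def delta_agreement(g1, g2, llm, min_score, max_score):
    """(G1 vs LLM) minus (G2 vs LLM); positive means closer to G1."""
    return (quadratic_weighted_kappa(g1, llm, min_score, max_score)
            - quadratic_weighted_kappa(g2, llm, min_score, max_score))
```

For example, with `llm = [0, 1, 0, 1]`, `g1 = [0, 1, 0, 1]` and `g2 = [0, 0, 1, 1]` on a 0 to 1 scale, the model matches G1 exactly and agrees with G2 only at chance level, so `delta_agreement` is positive, which corresponds to a point to the right of the vertical line in Figure 8.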
Original abstract

We present Exposía, the first public dataset that connects writing and feedback in higher education, enabling research on educationally grounded computational approaches to teaching and evaluating academic writing. Exposía includes student research project proposals and peer and instructor feedback consisting of comments and free-text reviews. The dataset was collected in the "Introduction to Scientific Work" course of the Computer Science. Exposía reflects the multi-stage nature of the academic writing process that includes drafting, receiving feedback, and revising the writing based on the feedback received. Both the project proposals and peer feedback are accompanied by human assessment scores based on a fine-grained, pedagogically-grounded schema for writing and feedback assessment that we develop. We use Exposía to benchmark state-of-the-art large language models (LLMs) on two tasks: automated scoring of (1) the proposals and (2) the student reviews. We find that the two tasks benefit from different LLMs. Furthermore, closed-source models consistently outperform open-weight models, motivating further research on improving the performance of open-weight models preferred in classroom settings. Finally, we establish that a prompting strategy that scores multiple aspects of the writing together is the most effective, an important finding for classroom deployment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Exposía, the first public dataset linking student research project proposals with peer and instructor feedback (comments and free-text reviews) collected from a single 'Introduction to Scientific Work' course in Computer Science. Proposals and reviews are accompanied by human assessment scores using a new fine-grained, pedagogically-grounded schema. The authors benchmark state-of-the-art LLMs on two tasks, automated scoring of proposals and of student reviews, and report that the tasks benefit from different models, that closed-source models consistently outperform open-weight models, and that a multi-aspect prompting strategy is most effective.

Significance. If the dataset construction and benchmarking results prove reliable, the work supplies a novel, educationally grounded resource for research on computational support for academic writing instruction and assessment. The empirical findings on model selection and prompting could inform practical LLM deployment in classrooms, particularly by highlighting trade-offs between closed- and open-weight models.

major comments (2)
  1. [Dataset collection and benchmarking sections] The dataset and all reported LLM performance differentials derive exclusively from one CS course at a single institution. This narrow sampling frame makes the central claims—that different LLMs excel at proposal versus review scoring, that closed-source models are superior, and that multi-aspect prompting is optimal—vulnerable to course-specific confounds (topic distribution, student demographics, feedback norms). The benchmarking conclusions therefore require either explicit limitation statements or additional cross-cohort validation to support general statements about LLM utility for academic writing assessment.
  2. [Abstract and evaluation description] No quantitative results, error analysis, inter-rater agreement statistics for the human schema, or evaluation-setup details (metrics, sample sizes, statistical tests) appear in the provided abstract or description. Without these, the reliability of the reported model rankings and the superiority of multi-aspect prompting cannot be verified, undermining the empirical contribution.
minor comments (1)
  1. [Title and abstract] The rendering of the dataset name 'Expos'ia' (with accent) should be checked for consistency across title, abstract, and body text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below, agreeing where revisions are needed to improve clarity and rigor while defending the core contributions of the Exposía dataset as the first public resource of its kind.

Point-by-point responses
  1. Referee: [Dataset collection and benchmarking sections] The dataset and all reported LLM performance differentials derive exclusively from one CS course at a single institution. This narrow sampling frame makes the central claims—that different LLMs excel at proposal versus review scoring, that closed-source models are superior, and that multi-aspect prompting is optimal—vulnerable to course-specific confounds (topic distribution, student demographics, feedback norms). The benchmarking conclusions therefore require either explicit limitation statements or additional cross-cohort validation to support general statements about LLM utility for academic writing assessment.

    Authors: We agree that the single-institution, single-course origin of Exposía represents a genuine limitation on generalizability. The reported model rankings and prompting findings are observations within this specific educational context rather than universal claims. In the revised manuscript we will add a dedicated Limitations section that explicitly discusses potential confounds including topic distribution, student demographics, and institutional feedback norms. We will also qualify all benchmarking statements to emphasize the need for future cross-cohort validation. As the first public dataset linking proposals, peer feedback, and pedagogically grounded assessments, we maintain that the resource still provides a valuable foundation for the community, but we accept that broader sampling is required before stronger generalization statements can be made.

    Revision: yes

  2. Referee: [Abstract and evaluation description] No quantitative results, error analysis, inter-rater agreement statistics for the human schema, or evaluation-setup details (metrics, sample sizes, statistical tests) appear in the provided abstract or description. Without these, the reliability of the reported model rankings and the superiority of multi-aspect prompting cannot be verified, undermining the empirical contribution.

    Authors: We acknowledge that the current abstract omits key quantitative details. The full manuscript already reports evaluation metrics (Pearson and Spearman correlations, accuracy), sample sizes, and statistical comparisons in the benchmarking sections. We will revise the abstract to include the main performance numbers (e.g., best-model scores for each task and the advantage of multi-aspect prompting) and will ensure that inter-rater agreement statistics for the assessment schema are reported prominently in the revised version. If an error analysis is not yet present, we will add a concise summary of common failure modes. These changes will allow readers to verify the model rankings without needing to consult the full text.

    Revision: yes
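
For orientation, the three metrics named in the response above (Pearson correlation, Spearman correlation, exact-match accuracy) can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, and nothing here is taken from the paper's evaluation code or confirms which metrics the paper actually reports.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def accuracy(human, model):
    """Fraction of items where the model score equals the human score."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty lists of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)
```

Correlations reward getting the ordering of submissions right even when absolute scores are offset, while accuracy requires the exact rubric score, so the two kinds of metric can rank models differently.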

Circularity Check

0 steps flagged

No significant circularity: empirical dataset release and external LLM benchmarking

Full rationale

The paper collects a new dataset from one CS course, develops a scoring schema, and evaluates off-the-shelf LLMs on proposal and review scoring tasks. No equations, parameter fitting, or derivations exist. Claims rest on direct comparison to external models and human annotations rather than any self-referential reduction. Self-citations, if present, are not load-bearing for the central results. The work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work relies on standard practices for dataset curation and LLM evaluation.

pith-pipeline@v0.9.0 · 5530 in / 1039 out tokens · 41662 ms · 2026-05-16T15:21:56.316618+00:00 · methodology

