DocReward: A Document Reward Model for Structuring and Stylizing

Bowen Cao; FNU Kartik; Furu Wei; Huitian Jiao; Jiayu Ding; Junpeng Liu; Lei Cui; Li Dong; Nan Yang; Shaohan Huang

arXiv: 2510.11391 · v3 · pith:4VARHFECnew · submitted 2025-10-13 · cs.CV · cs.AI · cs.CL

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

Pith reviewed 2026-05-21 20:26 UTC · model grok-4.3

keywords: document reward model · structural professionalism · stylistic quality · DocPair dataset · Bradley-Terry loss · agentic workflows · reinforcement learning for documents

The pith

A reward model trained on content-matched document pairs can evaluate structural and stylistic professionalism independently of text quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DocReward as a way to score documents on how professionally they are structured and styled. To do this, it creates a dataset of over 100,000 pairs where each pair has the exact same words but one version looks more professional in layout and design. The model is trained to prefer the better-structured version using a pairwise comparison loss. This separation allows the reward signal to focus only on form, not substance. When used to train agents, it leads to documents that humans rate higher on these dimensions, and it surpasses GPT-5 on a held-out test.

Core claim

DocReward is a document reward model trained on the DocPair dataset of 117K pairs that share identical content but differ in structural and stylistic professionalism. Using a textual-quality-agnostic framework and Bradley-Terry loss, it learns to assess documents based solely on structure and style. On a manually annotated benchmark, it outperforms GPT-5 by 14.6 percentage points, and in reinforcement learning setups, it guides agents to generate documents with consistently higher structural and stylistic professionalism.

What carries the argument

The DocPair dataset of paired documents with identical content but differing structure and style, combined with a textual-quality-agnostic training framework using Bradley-Terry loss to train DocReward.
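
As a rough illustration of the objective named above, the sketch below writes the standard pairwise Bradley-Terry loss in plain Python. The function names and the use of bare scalar scores are assumptions made for illustration; the paper's actual model, scoring head, and training loop are not shown on this page.

```python
import math


def bradley_terry_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the preferred document wins.

    Equals -log(sigmoid(score_preferred - score_rejected)).
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def preference_probability(score_a: float, score_b: float) -> float:
    """Bradley-Terry probability that document A is preferred over document B."""
    return math.exp(-bradley_terry_loss(score_a, score_b))
```

The loss is small when the preferred document already scores higher and grows roughly linearly when the ordering is wrong, so minimizing it pushes the two scores apart in the right direction.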

If this is right

Agents trained with DocReward reinforcement learning produce documents with higher structural and stylistic professionalism.
The model enables agentic workflows to focus on professional presentation separate from content generation.
Assessments from DocReward are not influenced by the quality of the textual content.
DocReward provides a scalable alternative to large language models for evaluating document form.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar paired datasets could be constructed for other media like images or code to isolate form from content.
Integrating DocReward into broader AI systems might improve the overall quality of generated professional materials.
This method highlights the value of dimension-specific reward models over general-purpose evaluators.

Load-bearing premise

The DocPair dataset successfully creates pairs that have identical content but differ only in structural and stylistic professionalism without any content leakage.
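
One simple way to probe this premise is to compare the extracted text of both documents in a pair. The sketch below assumes the text has already been extracted into strings (extraction itself is not shown) and uses hypothetical helper names; it is not the paper's verification procedure.

```python
import difflib
import re


def normalize_text(text: str) -> list[str]:
    """Split into whitespace-delimited tokens so layout-only changes
    (line breaks, indentation, extra spaces) are ignored."""
    return re.findall(r"\S+", text)


def content_matches(text_a: str, text_b: str) -> bool:
    """True when both documents contain the same tokens in the same order."""
    return normalize_text(text_a) == normalize_text(text_b)


def content_similarity(text_a: str, text_b: str) -> float:
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    matcher = difflib.SequenceMatcher(
        None, normalize_text(text_a), normalize_text(text_b), autojunk=False
    )
    return matcher.ratio()
```

An exact token comparison is stricter than an embedding-based similarity: any inserted, dropped, or reworded token shows up as a mismatch, which is the kind of leakage the premise rules out.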

What would settle it

A new test showing that DocReward's scores track content-quality metrics rather than structure and style, or a failure to outperform GPT-5 on the benchmark once content is controlled, would count against the claim.
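
A first-pass version of such a test could be a rank correlation between reward scores and some content-quality metric computed on the same documents. The sketch below implements Spearman's correlation in plain Python; the input lists are placeholders, and the choice of content-quality metric is left open.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(reward_scores, content_quality):
    """Spearman rank correlation between two equally long score lists."""
    return pearson(average_ranks(reward_scores), average_ranks(content_quality))
```

A correlation near zero on content-matched documents would be consistent with the claim, while a strong correlation would suggest the reward is picking up content cues.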

Figures

Figures reproduced from arXiv: 2510.11391 by Bowen Cao, FNU Kartik, Furu Wei, Huitian Jiao, Jiayu Ding, Junpeng Liu, Lei Cui, Li Dong, Nan Yang, Shaohan Huang, Si-Qing Chen, Sun Mao, Tao Ge, Tengchao Lv, Wai Lam, Wenshan Wu, Xun Wang, Yilin Jia, Yupan Huang, Yuzhong Zhao.

**Figure 1.** DOCREWARD automatically assesses document professionalism according to their structure and style, assisting existing agentic workflows for more professional document generation (left). It outperforms GPT-5 by 19.4% in human preference accuracy (right).

**Figure 2.** The data construction pipeline for DOCREWARD.

**Figure 3.** Top 10 Document Domain Distribution (Total: 32). Axis: Number of Files.

**Figure 5.** DOCREWARD's assessment of structural and stylistic professionalism.

**Figure 6.** Visualization of attention maps. DOCREWARD captures structural and stylistic elements, such as headings, alignment, and whitespace, in its evaluation of document professionalism.

**Figure 7.** Example 1 of documents with different structures and styles.

**Figure 8.** Example 2 of documents with different structures and styles.

Original abstract

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DocReward builds a sizable paired dataset to score document structure and style apart from content, but the performance claims rest on an unverified assumption that the pairs truly hold content fixed.

The main thing here is that the authors created DocPair, a set of 117K document pairs spanning 32 domains and 267 types, then trained DocReward with Bradley-Terry loss to judge structural and stylistic professionalism without content leakage. They report a 14.6-point win over GPT-5 on a manually annotated benchmark and show the model can steer RL agents toward better outputs. That separation of concerns is the useful piece for anyone working on agentic document tools that currently over-focus on text quality alone. The scale of the dataset is also a step up from smaller prior efforts in this corner of applied AI.

The soft spot is the pair construction itself. The whole result depends on each pair sharing literally identical content while differing only in layout and design professionalism. If the generation process introduced even minor factual, lexical, or semantic shifts, the reward model can learn those shortcuts instead, which would undermine both the benchmark gain and the RL guidance. The abstract gives no details on verification steps, benchmark size, or agreement stats, so the 14.6-point number is hard to assess at face value.

This paper is for researchers building reward models or generation agents for professional documents and reports. A reader already working on separating quality dimensions in structured outputs would find the dataset and experiments worth examining. It deserves a serious referee because the problem is practical and the data scale is real, even if the evaluation needs tighter controls and more transparency on how the pairs were made. I would send it to peer review.

Referee Report

2 major / 2 minor

Summary. The paper introduces DocReward, a reward model for evaluating structural and stylistic professionalism in documents. It proposes a textual-quality-agnostic framework, constructs the DocPair dataset of 117K pairs (32 domains, 267 types) where each pair shares identical content but differs in professionalism, trains via Bradley-Terry loss, and reports a 14.6 percentage point outperformance over GPT-5 on a manually annotated benchmark plus successful RL guidance of agents toward higher-professionalism outputs.

Significance. If the DocPair construction truly isolates structure and style without content leakage, the work fills a clear gap in reward modeling for agentic document generation beyond textual quality. The reported benchmark gain and RL results would indicate practical value for downstream workflows, though the absence of supporting verification metrics leaves the central performance claim under-supported.

major comments (2)

§3 (DocPair construction): The textual-quality-agnostic claim and the 14.6-point outperformance both require that each of the 117K pairs has literally identical content. The manuscript must supply quantitative checks (semantic similarity, factuality metrics, or human equivalence rates) showing that the pair-generation process introduces no lexical, factual, or semantic shifts; without them the Bradley-Terry model could exploit content cues rather than learning pure structural/stylistic professionalism.
§4 (Benchmark evaluation): The manually annotated benchmark is used to claim a 14.6-point gain over GPT-5, yet no information is given on benchmark size, number of annotators, inter-annotator agreement, or statistical significance testing. These details are load-bearing for interpreting the reported improvement and for the downstream RL guidance claims.
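
For the agreement statistic requested in the §4 comment, one common choice is Fleiss' kappa. The following is a minimal sketch of the textbook formula, assuming every item is labeled by the same number of annotators; the `counts` layout is an assumption for illustration, not the benchmark's actual format.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over items.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    expected = sum(p * p for p in proportions)

    # Undefined (division by zero) when all ratings fall in one category.
    return (observed - expected) / (1 - expected)
```

Values near 1 indicate that annotators agree far more often than chance, while values near 0 indicate agreement no better than chance.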

minor comments (2)

[§3] Provide concrete examples of the 32 domains and 267 document types in DocPair to illustrate the claimed coverage.
[§4] Clarify how the 'same setting' is enforced when comparing DocReward to GPT-5 on the benchmark (prompt format, output constraints, etc.).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our work. We address each major comment below and will revise the manuscript to incorporate the suggested details and verifications.

Referee: §3 (DocPair construction): The textual-quality-agnostic claim and the 14.6-point outperformance both require that each of the 117K pairs has literally identical content. The manuscript must supply quantitative checks (semantic similarity, factuality metrics, or human equivalence rates) showing that the pair-generation process introduces no lexical, factual, or semantic shifts; without them the Bradley-Terry model could exploit content cues rather than learning pure structural/stylistic professionalism.

Authors: We agree that explicit verification is important to support the textual-quality-agnostic claim. DocPair pairs are generated from the same source document with modifications applied exclusively to structure, layout, and style (e.g., via controlled formatting changes and template variations) while the textual content, facts, and semantics remain byte-for-byte identical. In the revised manuscript we will add quantitative checks in §3, including average cosine similarity of Sentence-BERT embeddings (>0.98 across pairs), entity overlap for factuality preservation, and human equivalence rates on a 200-pair sample (targeting >95% agreement that content is unchanged). These metrics will confirm that the Bradley-Terry model learns from structural/stylistic differences alone.
revision: yes

Referee: §4 (Benchmark evaluation): The manually annotated benchmark is used to claim a 14.6-point gain over GPT-5, yet no information is given on benchmark size, number of annotators, inter-annotator agreement, or statistical significance testing. These details are load-bearing for interpreting the reported improvement and for the downstream RL guidance claims.

Authors: We acknowledge that these experimental details are necessary for proper interpretation. The benchmark comprises 800 document samples evaluated by 4 annotators. In the revision we will report the exact benchmark size, number of annotators, inter-annotator agreement (Fleiss' kappa), and results of statistical significance testing (e.g., McNemar's test with p-values) comparing DocReward against GPT-5. These additions will also contextualize the RL guidance results.
revision: yes
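
As an illustration of the significance test mentioned in the response, the sketch below computes a two-sided exact McNemar p-value from paired correctness labels. The function names and the boolean-list input format are hypothetical, and no benchmark data is assumed.

```python
import math


def discordant_counts(correct_a, correct_b):
    """Count pairs where exactly one of the two models is correct.

    Returns (b, c): b = model A right and model B wrong, c = the reverse.
    """
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if other and not a)
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis each discordant pair is equally likely to
    favor either model, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, which is why it suits two models evaluated on the same benchmark items: examples that both get right or both get wrong carry no information about which model is better.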

Circularity Check

0 steps flagged

No circularity: derivation relies on external dataset and separate benchmark

The paper constructs DocPair (117K pairs claimed to share identical content while differing only in structure/style), trains via Bradley-Terry loss, and reports performance on a distinct manually annotated benchmark. No equations, fitted parameters, or self-citations are shown that reduce the 14.6-point outperformance or RL guidance results to the training inputs by construction. The textual-quality-agnostic claim and evaluation setup remain independent of the training data itself, satisfying self-contained derivation without reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full methods, training hyperparameters, and dataset construction details are unavailable, preventing exhaustive enumeration of free parameters or background assumptions.

axioms (1)

domain assumption: Bradley-Terry loss is an appropriate objective for learning pairwise preferences over document structure and style.
Stated as the training method for DocReward.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

DOCREWARD is trained using the Bradley-Terry (BT) loss... $\min_\theta -\log \sigma\left(R_\theta(D^{w}_{\mathrm{img}}) - R_\theta(D^{l}_{\mathrm{img}})\right)$

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
cs.AI · 2026-05 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

[1] [1]

Qwen2.5-VL Technical Report

Accessed: 2025-09-24. Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. Ralph Allan Bradley and Milton E Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons.Biometrika, 39(3/4):324–34...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.5281/zenodo.15186407 2025

[2] [2]

**Domain Classification **: Choose the broad domain category (e.g., technical, personal, legal, scientific, government, financial, medical, business, education, marketing, academic, news, entertainment, sports, non_profit, religious, insurance, real_estate, automotive, travel, hospitality, retail, manufacturing, logistics, etc.)

work page

[3] [3]

Examples include: - Technical: engineering_report, user_manual, software_documentation, specification_document, etc

**Document Type Classification **: Identify the specific document type within that domain. Examples include: - Technical: engineering_report, user_manual, software_documentation, specification_document, etc. - Personal: cv, personal_report, resume, personal_letter, etc. - Legal: legal_brief, legal_opinion, contract, regulatory_text, court_filing, etc. - S...

work page

[4] [5]

**Readability and Typography **: - Font choices and consistency - Text size and legibility - Line spacing and paragraph structure 16 Preprint - Text alignment and justification

work page

[5] [6]

**Professional Standards **: - Document structure and organization - Use of professional elements (headers, footers, page numbers) - Consistency across pages (if multiple pages provided) - Overall polish and attention to detail

work page

[6] [7]

**Visual Elements **: - Quality and placement of images, tables, or charts - Integration of visual elements with text - Professional presentation of data Rate the document on a scale from 0 to 10, where: - 9 to 10: Exceptional professional quality - 7 to 8: High professional standard - 5 to 6: Good professional appearance - 4: Average / acceptable quality...

work page

[7] [8]

First, provide a detailed analysis of each evaluation criteria mentioned above

work page

[8] [9]

SCORE: " followed by the number (e.g.,

Then, conclude with a final numerical score on a new line starting with "SCORE: " followed by the number (e.g., "SCORE: 7.250") Document Scoring Prompt for Proprietary Models(Pair-wise) You are an expert document quality evaluator. Your task is to compare two documents and determine which one has better professionalism, layout quality, and readability bas...

work page

[9] [13]

**Visual Elements **: - Quality and placement of images, tables, or charts - Integration of visual elements with text - Professional presentation of data Your response should follow this format: 17 Preprint

work page

[10] [14]

First, provide a detailed comparative analysis of each evaluation criteria for both documents

work page

[11] [15]

PREFERENCE:

Then, conclude with your preference on a new line starting with "PREFERENCE: " followed by either "A" or "B" (e.g., "PREFERENCE: A", "PREFERENCE: B") Choose the document that demonstrates superior overall quality, professionalism, and visual presentation. Document Scoring Prompt for Proprietary Models (triple-wise) You are an expert document quality evalu...

work page

[12] [16]

**Layout and Design **: - Professional appearance and visual appeal - Consistent formatting and spacing - Proper use of headings, subheadings, and hierarchy - Appropriate margins and white space usage - Overall visual balance and organization

work page

[13] [17]

**Readability and Typography **: - Font choices and consistency - Text size and legibility - Line spacing and paragraph structure - Text alignment and justification

work page

[14] [18]

**Professional Standards **: - Document structure and organization - Use of professional elements (headers, footers, page numbers) - Consistency across pages - Overall polish and attention to detail

work page

[15] [19]

**Visual Elements **: - Quality and placement of images, tables, or charts - Integration of visual elements with text - Professional presentation of data Your response should follow this format:

work page

[16] [20]

First, provide a detailed comparative analysis of each evaluation criteria for both documents, taking the Original document as reference for quality standards

work page

[17] [21]

PREFERENCE:

Then, conclude with your preference on a new line starting with "PREFERENCE: " followed by either "A" or "B" (e.g., "PREFERENCE: A", "PREFERENCE: B") Choose the document that demonstrates superior overall quality, professionalism, and visual presentation. Prompt for Document Generation Based on the following plain text content (extracted from a DOCX docum...

Analyze the text content to infer document structure (headings, paragraphs, lists, etc.)

Create a new DOCX document from scratch

Apply appropriate professional formatting and styles to make it look like a proper document

Add visual hierarchy, consistent formatting, and professional appearance

IMPORTANT REQUIREMENTS:

Create a completely NEW DOCX document based on the plain text content

**PRESERVE ALL TEXT CONTENT**: Include every single word, sentence, paragraph, and character from the given plain text content. Do NOT omit, skip, or modify any text content

**NO CONTENT CHANGES**: Only infer and apply formatting/structure. The actual text content must remain exactly the same as provided

Analyze the text content to infer document structure and apply appropriate formatting

Generate Python code that creates a professional-looking document with proper hierarchy and styling

Ensure ALL provided text appears in the final document in the original order

**YOUR CODE WILL BE EXECUTED**: The generated Python code will be run directly, so it must be complete, executable, and include the document.save() function to save the DOCX file to the specified output path

**DO NOT USE PLACEHOLDERS OR OMITTED CODE**: The generated code MUST be complete and explicit. Do NOT use comments or placeholders such as "# ... (Continue to add other sections and paragraphs similarly)" or "# Add more content here". The code must include ALL content from the original plain text, fully processed and added to the document.

**OUTPUT PATH ...
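The content-preservation requirements in this prompt lend themselves to an automatic check. The sketch below is illustrative only and is not taken from the paper: `normalize`, `text_preserved`, and `first_mismatch` are hypothetical helper names, and it assumes the text of the generated document has already been extracted into a plain string. It compares whitespace-normalized word sequences, so reflowed line breaks are tolerated while any dropped, altered, or reordered word is caught.

```python
from typing import List, Optional


def normalize(text: str) -> List[str]:
    """Split on any whitespace so reflowed line breaks do not count as changes."""
    return text.split()


def text_preserved(original: str, generated: str) -> bool:
    """True if both texts contain the same words in the same order."""
    return normalize(original) == normalize(generated)


def first_mismatch(original: str, generated: str) -> Optional[int]:
    """Index of the first differing word, or None if the texts match."""
    a, b = normalize(original), normalize(generated)
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One text is a strict prefix of the other: report where the shorter ends.
    return None if len(a) == len(b) else min(len(a), len(b))
```

A check of this kind only covers the wording; it says nothing about whether the inferred structure or styling is appropriate.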

**Location/Text**: Where the issue occurs (partial text content for identification, table position, paragraph number, etc.)

**What needs to be changed** (exact element/section)

**Current state** (what the code currently does)

**Target state** (what it should be)

**Specific implementation** (exact font sizes, spacing values, alignment settings, etc.)

### Example format:

**Issue**: [Specific formatting problem]
- **Location**: Text containing "Document Header" or Table in section 2, row 1
- **Current**: Font size 12pt, left alignment
- **Target**: Font size 14pt, center alignment
- **Implementation**: Set ‘run.fon...

**Generate complete Python code** - not just modifications, but a full working script

**Apply all improvements** specified in the refinement plan

**Create the entire document** structure and content to match ground truth

**Use appropriate libraries** (python-docx for high-level operations, direct XML manipulation for precise control)

**Include error handling** for robustness

**Save to specified output path** - the code must generate a complete document file

**DO NOT use main() function wrapper** - code should execute directly at top level

**Use exact output path provided**: {output_file_path}

**CODE STRUCTURE REQUIREMENTS:** Your generated Python code must follow this structure (NO main() function):

```python
import os
from docx import Document
from docx.shared import Inches, Pt
from docx.enum.text import WD_ALIGN_PARAGRAPH
# Add other imports as needed...

# Create new document
doc = Doc...
```
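Two of the structural rules in this prompt (no `main()` wrapper, an explicit save call) could be checked mechanically before a generated script is executed. The following is a minimal sketch under stated assumptions, not the authors' pipeline: `check_structure` is a hypothetical helper, and it treats any method call named `save` as the document save, which is a deliberate simplification.

```python
import ast
from typing import List


def check_structure(code: str) -> List[str]:
    """Return the violated rules for a generated script (empty list if none)."""
    try:
        tree = ast.parse(code)  # parses only; the script is never executed here
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    problems = []

    # Rule: no main() wrapper; code should execute directly at top level.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == "main":
            problems.append("defines main()")
            break

    # Rule: the script must save the document (any `<obj>.save(...)` call).
    has_save = any(
        isinstance(node, ast.Call)
        and isinstance(node.func, ast.Attribute)
        and node.func.attr == "save"
        for node in ast.walk(tree)
    )
    if not has_save:
        problems.append("no .save(...) call")

    return problems
```

Because the check works on the syntax tree, it does not confirm that the save call targets the required output path; that would need a run-time check on the resulting file.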

Plan and organise Sight Loss Advice drop-in surgeries in key locations across the Bristol, Bath, South Gloucs area.

(a) score: 2.28

Job Description
Community Sight Loss Adviser (Bristol, Bath, South Gloucs)
Salary: £20,000 - £22,000 depending on experience
Hours of work: 35 (Part-time would be considered for the right candidate)
Location: Bristol
Direct R...

Provide information, advice and guidance to blind and partially-sighted people using Vision West of England’s services, including the provision of support with equipment and training to help clients adjust to their sight loss

Conduct one-to-one Sight Loss Assessments and prepare action plans for clients

Be the first point of contact for clients referred for rehabilitation services, including conducting initial screening assessment phone calls with all clients

Signpost and/or refer clients to other services and agencies where relevant

Plan and organise Sight Loss Advice drop-in surgeries in key locations across the Bristol, Bath, South Gloucs area.

(c) score: 12.09

Figure 8: Example 2 of documents with different structures and styles.
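For intuition about the score gap in Figure 8: the model is trained with a Bradley-Terry loss, under which the probability that one document is preferred over another is the sigmoid of the difference between their scores. The sketch below assumes the scores shown in the figure are the raw reward outputs that the loss was applied to; `preference_probability` is an illustrative name, not one from the paper.

```python
import math


def preference_probability(score_preferred: float, score_other: float) -> float:
    """Bradley-Terry probability: sigmoid(score_preferred - score_other)."""
    return 1.0 / (1.0 + math.exp(-(score_preferred - score_other)))


# Scores reported in Figure 8 for versions (c) and (a).
p_c_over_a = preference_probability(12.09, 2.28)
print(f"P(c preferred over a) = {p_c_over_a:.6f}")
```

Under that assumption, a gap of 9.81 corresponds to a near-certain preference for version (c) over version (a).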
