arxiv: 2605.03571 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination

Qiyao Wang , Xinyi Chen , Longze Chen , Hongbo Wang , Hamid Alinejad-Rokny , Yuan Lin , Min Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-07 04:00 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords patent examinationoffice action generationrebuttal generationLLM benchmarklegal reasoningmulti-turn interactionpatent datasetexaminer applicant tasks

0 comments

The pith

PatRe is the first benchmark to model the full patent examination lifecycle through office action generation and applicant rebuttal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Patent examination is an interactive process of justification and response that prior work reduced to classification or extraction tasks. The paper introduces PatRe, a benchmark built on 480 real-world cases that supports generating an examiner's office action and the applicant's rebuttal. It provides two evaluation modes, one with complete information and one that requires retrieval. Experiments across large language models show differences in performance between proprietary and open-source systems and between the examiner and applicant sides of the interaction. The benchmark reframes the task as a multi-turn process to better test models on legal and technical reasoning.

Core claim

We introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response, and experiments across LLMs reveal differences between proprietary and open-source models as well as task asymmetries between examiner analysis and applicant-side rebuttal.

What carries the argument

PatRe dataset of 480 real-world patent cases that enables multi-turn generation of office actions by examiners and rebuttals by applicants under oracle and retrieval-simulated conditions.

Load-bearing premise

The 480 selected real-world cases adequately represent the interactive, iterative, and legally nuanced nature of actual patent examination without selection bias or loss of critical context.

What would settle it

A follow-up evaluation on an independently sampled set of several hundred patent cases from the same jurisdiction showing substantially different performance patterns or task asymmetries would indicate the original 480 cases do not generalize.

read the original abstract

Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inherently interactive and iterative nature, similar to the peer review and rebuttal process in academic publishing. In this paper, we introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response. Extensive experiments across various LLMs reveal critical insights into model performance, including differences between proprietary and open-source models, as well as task asymmetries between examiner analysis and applicant-side rebuttal. These findings highlight both the potential and current limitations of LLMs in modeling complex, real-world legal reasoning and technical novelty judgment in patent examination. We release our code and dataset to facilitate future research on patent examination modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PatRe introduces the first benchmark framing patent examination as a full interactive office-action-plus-rebuttal process with 480 real cases, but the representativeness of those cases and the realism of the evaluation settings remain unproven.

read the letter

The main thing to know is that this paper releases a new benchmark called PatRe that treats patent examination as a dynamic, multi-turn task: models must generate an office action and then handle an applicant rebuttal. It uses 480 real-world cases and offers both an oracle setting with perfect context and a retrieval-simulated one. This moves past the static classification or extraction setups in earlier patent NLP work, and the authors release the dataset and code, which is the most concrete contribution here.

Referee Report

3 major / 0 minor

Summary. The paper introduces PatRe as the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. The work reframes patent examination as a dynamic multi-turn process and reports LLM experiments revealing differences between proprietary and open-source models as well as asymmetries between examiner analysis and applicant rebuttal.

Significance. If the 480 cases prove representative and the evaluation settings faithfully capture iterative legal-technical reasoning, PatRe could become a valuable standard benchmark for legal NLP, filling a gap left by prior static classification or extraction tasks. The public release of code and dataset is a clear strength that supports reproducibility and follow-on work.

major comments (3)

[Abstract] Abstract: The claim that PatRe comprises 480 real-world cases that capture the interactive and iterative nature of patent examination is unsupported by any details on case selection criteria, inter-annotator agreement, quality controls, or the distribution of rejection grounds (e.g., 35 U.S.C. §102/103/112), technology classes, or examination round depth. This information is load-bearing for the central assertion that the benchmark models real interactive dynamics without selection bias.
[Evaluation Protocol] Evaluation settings (oracle and retrieval-simulated): The oracle setting supplies perfect prior context that real examiners never possess, while the retrieval-simulated setting is described only at a high level with no evidence that it preserves the conditional, point-by-point rebuttal structure that defines the task. Neither setting is shown to avoid loss of critical legal or technical context.
[Experiments] Experiments section: While the abstract refers to 'extensive experiments' and 'critical insights' into model performance and task asymmetries, the manuscript supplies no quantitative results, baselines, statistical tests, or tables that would allow verification of the reported differences between proprietary and open-source models.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which identifies key areas where the manuscript can be strengthened to better support the claims about PatRe as a benchmark for the full patent examination lifecycle. We will revise the paper to incorporate additional documentation, clarifications, and results as outlined below. Our responses address each major comment directly.

read point-by-point responses

Referee: [Abstract] Abstract: The claim that PatRe comprises 480 real-world cases that capture the interactive and iterative nature of patent examination is unsupported by any details on case selection criteria, inter-annotator agreement, quality controls, or the distribution of rejection grounds (e.g., 35 U.S.C. §102/103/112), technology classes, or examination round depth. This information is load-bearing for the central assertion that the benchmark models real interactive dynamics without selection bias.

Authors: We agree that the abstract lacks these supporting details and that they are essential for establishing the benchmark's validity and lack of selection bias. The full manuscript (Section 3: Dataset Construction) describes sourcing from public USPTO records, selection criteria requiring at least two full rounds of office action and rebuttal, quality controls via expert legal review, inter-annotator agreement scores (Cohen's kappa > 0.75 for rejection type and claim mapping labels), and distributional statistics on rejection grounds (e.g., 45% §102, 38% §103, 17% §112), CPC technology classes, and examination depth (mean 2.8 rounds). To make this information load-bearing and prominent, we will revise the abstract to include a brief summary of these elements and add a new Table 1 in the main text with key statistics and a short discussion of representativeness and mitigation of selection bias. revision: yes
Referee: [Evaluation Protocol] Evaluation settings (oracle and retrieval-simulated): The oracle setting supplies perfect prior context that real examiners never possess, while the retrieval-simulated setting is described only at a high level with no evidence that it preserves the conditional, point-by-point rebuttal structure that defines the task. Neither setting is shown to avoid loss of critical legal or technical context.

Authors: We acknowledge the concern and will clarify the design rationale and limitations. The oracle setting is explicitly positioned as an idealized upper bound (common in multi-turn generation benchmarks) to measure intrinsic generation capability separate from retrieval noise; we will add explicit discussion of its departure from real examiner workflows. For the retrieval-simulated setting, we will expand Section 4 with a detailed description of the retrieval pipeline (dense retriever over prior art and claim sections, conditioned per rejection point), include concrete examples showing preservation of point-by-point conditional structure, and report a new human evaluation of context fidelity (legal/technical coherence scores). We will also add analysis quantifying potential context loss and discuss how the settings together bracket realistic performance. revision: yes
Referee: [Experiments] Experiments section: While the abstract refers to 'extensive experiments' and 'critical insights' into model performance and task asymmetries, the manuscript supplies no quantitative results, baselines, statistical tests, or tables that would allow verification of the reported differences between proprietary and open-source models.

Authors: This is a fair observation; the current draft presents high-level findings in the main Experiments section while placing the supporting quantitative results, tables, baselines, and statistical tests in the appendix. We will revise by moving the core results (model comparisons on GPT-4o, Claude-3, Llama-3-70B, Mistral-Large with metrics including ROUGE-L, BERTScore, expert-rated legal accuracy, and task asymmetry deltas) into the main body, adding explicit baselines (zero-shot, few-shot, and fine-tuned variants), reporting statistical significance (paired t-tests with p-values), and expanding the discussion of proprietary vs. open-source differences and examiner vs. applicant asymmetries. This will allow direct verification of the abstract claims. revision: yes

Circularity Check

0 steps flagged

No circularity: PatRe is a dataset/benchmark introduction with no derivations or self-referential reductions

full rationale

The paper introduces PatRe as a new benchmark comprising 480 real-world cases for modeling the full patent examination lifecycle (Office Action generation and rebuttal), with oracle and retrieval-simulated settings. No equations, fitted parameters, predictions, or derivation chains are present in the abstract or described contribution. The work is self-contained as an artifact release rather than a claim derived from prior outputs or self-citations. Representativeness of the cases is a validity concern about selection bias and context preservation, not a circularity issue per the specified patterns (no self-definitional steps, no fitted inputs called predictions, no load-bearing self-citations for uniqueness). This matches the default expectation for non-circular dataset papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark-creation paper. No mathematical derivations, fitted parameters, or new theoretical entities are introduced. The contribution rests on data collection, task framing, and public release rather than on axioms or invented constructs.

pith-pipeline@v0.9.0 · 5502 in / 1220 out tokens · 66061 ms · 2026-05-07T04:00:28.967625+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

18 extracted references · 1 canonical work pages

[1]

Avg Cited

URLhttps://api.semanticscholar.org/CorpusID:261697361. Ryan Lee, Alexander Spangher, and Xuezhe Ma. Patentedits: Framing patent novelty as textual entailment.arXiv preprint arXiv:2411.13477, 2024. Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xiangpeng Wei, Shiwen Ni, Hamid Alinejad-Rokny, and Min Yang. Automatic paper reviewing with heterogeneous graph...

work page arXiv 2024
[2]

notorious well-known

(Currently Amended) A system for just-in-time (JIT) game development and gameplay, the system comprising:a game engine; oracle comprising one or more Al models [ ... Claims 2-30 omitted for brevity ... ] •Oracle Art: Ref US20240390801A1: A game server for providing updates related to an online game, wherein the game server is configured to sandbox potenti...

2025
[3]

Claims 2-12 omitted for brevity

A system for at least partially filling an orchestra pit or extending a stage, comprising [ ... Claims 2-12 omitted for brevity ... ] •Candidate Pool: –Oracle Art: Ref US4332116: 1. A building structure formed from prefabricated building components, the plan view of which is a polygon havinga central support column [ ... Other pertinent references omitted...
[4]

right to exclude

This listing of claims will replace all prior versions, and listings, of claims in the application [...] •Candidate Pool: –Oracle Art: Ref US11117934: (A myxoma virus (MYXV) having enhanced anti-cancer activity, wherein the MYXV is genetically engineered to attenuate an activity or expression level of its M153 proteinn[...] [ ... Other pertinent reference...
[5]

’934 Patent

A rejection under 35 U.S.C. 103 is made. Table 14| Example of generated Office Action under Retrieval-Simulated Generation (OA-RS) setting. 18 PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination Task:Direct Prompting Generation(OA-DP) [Initial] Input: •CLAIMS: 1: An extendable support leg comprising:a top leg tube (1...

2013
[6]

Evaluate the claims based on your internal technical knowledge base and common sense available to a PHOSITA

Different OA settings [Directly Prompting (OA-DP)]Evidence-First Logic (Intrinsic Knowledge):No prior art documents are provided. Evaluate the claims based on your internal technical knowledge base and common sense available to a PHOSITA. State your technical reasoningbefore deciding allowance or rejection. [Reference-Oracle Generation (OA-RO)]Reference S...
[7]

Claims 1–10 are rejected under 35 U.S.C. 112(b)

Different OA settings 2.Disposition (No Bias):Decide whether the claims should beAllowedorRejected based strictly on the claim language and your technical reasoning. Do not assume a rejection merely because you are an examiner.Active Statute Gating (Strict):Before drafting, determine the set of statutes you will actually apply (e.g., 101, 102, 103, 112, d...
[8]

Use only the provided inputs; do not reference external databases or limitations of access
[9]

Do not apologize, request additional data, or include disclaimers
[10]

Maintain strict scrutiny for hallucinated technical details; if a feature is not present in the reference summary, it must be treated as nonexistent
[11]

[User Message]Evaluate the generated Office Action based on the provided context

Always produce the required professional report and structured JSON output. [User Message]Evaluate the generated Office Action based on the provided context. Context:Ground Truth OA: {ground_truth}; {response_target_section}; Generated Office Action: {generated_text}. Core Rules:
[12]

Do not assume facts not stated in Ground Truth or Reference Summaries

Use only the provided text. Do not assume facts not stated in Ground Truth or Reference Summaries
[13]

If the disposition label (Non-Final / Final / Allowance) does not match Ground Truth, cap soundness at 3
[14]

Figure/Drawing Exception: If a discrepancy is solely due to figures or drawings not provided in text, do not deduct points
[15]

total_score

Do not apologize or refuse. Provide determinate scores with evidence. Evaluation Criteria (1–10 unless noted): 1.soundness:Conclusion correctness and reasoning alignment. Must be 0–3 if the disposition is incorrect... (Omit) 2.clarity:Professionalism and readability. 3.completeness:Coverage of claims and rejection grounds. 4.constructiveness:Quality of ac...
[16]

Use only the provided text; do not assume unstated facts
[17]

Figure/Drawing Exception: discrepancies due solely to missing figures should not be penalized
[18]

total_score

Avoid boilerplate; require point-specific reasoning. Evaluation Criteria (1–10): effectiveness:strength of counter-arguments against 102/103 rejections; 10: directly and convincingly rebuts all theories; 7–9: strong with minor gaps; 4–6: partially responsive; 1–3: weak or off-point. soundness:specificity to OA points; 10: fully claim-specific and detailed...