Recognition: unknown
PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination
Pith reviewed 2026-05-07 04:00 UTC · model grok-4.3
The pith
PatRe is the first benchmark to model the full patent examination lifecycle through office action generation and applicant rebuttal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response, and experiments across LLMs reveal differences between proprietary and open-source models as well as task asymmetries between examiner analysis and applicant-side rebuttal.
What carries the argument
PatRe dataset of 480 real-world patent cases that enables multi-turn generation of office actions by examiners and rebuttals by applicants under oracle and retrieval-simulated conditions.
Load-bearing premise
The 480 selected real-world cases adequately represent the interactive, iterative, and legally nuanced nature of actual patent examination without selection bias or loss of critical context.
What would settle it
A follow-up evaluation on an independently sampled set of several hundred patent cases from the same jurisdiction showing substantially different performance patterns or task asymmetries would indicate the original 480 cases do not generalize.
read the original abstract
Patent examination is a complex, multi-stage process requiring both technical expertise and legal reasoning, increasingly challenged by rising application volumes. Prior benchmarks predominantly view patent examination as discriminative classification or static extraction, failing to capture its inherently interactive and iterative nature, similar to the peer review and rebuttal process in academic publishing. In this paper, we introduce PatRe, the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. Our benchmark reframes patent examination as a dynamic, multi-turn process of justification and response. Extensive experiments across various LLMs reveal critical insights into model performance, including differences between proprietary and open-source models, as well as task asymmetries between examiner analysis and applicant-side rebuttal. These findings highlight both the potential and current limitations of LLMs in modeling complex, real-world legal reasoning and technical novelty judgment in patent examination. We release our code and dataset to facilitate future research on patent examination modeling.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PatRe as the first benchmark that models the full patent examination lifecycle, including Office Action generation and applicant rebuttal. PatRe comprises 480 real-world cases and supports both oracle and retrieval-simulated evaluation settings. The work reframes patent examination as a dynamic multi-turn process and reports LLM experiments revealing differences between proprietary and open-source models as well as asymmetries between examiner analysis and applicant rebuttal.
Significance. If the 480 cases prove representative and the evaluation settings faithfully capture iterative legal-technical reasoning, PatRe could become a valuable standard benchmark for legal NLP, filling a gap left by prior static classification or extraction tasks. The public release of code and dataset is a clear strength that supports reproducibility and follow-on work.
major comments (3)
- [Abstract] Abstract: The claim that PatRe comprises 480 real-world cases that capture the interactive and iterative nature of patent examination is unsupported by any details on case selection criteria, inter-annotator agreement, quality controls, or the distribution of rejection grounds (e.g., 35 U.S.C. §102/103/112), technology classes, or examination round depth. This information is load-bearing for the central assertion that the benchmark models real interactive dynamics without selection bias.
- [Evaluation Protocol] Evaluation settings (oracle and retrieval-simulated): The oracle setting supplies perfect prior context that real examiners never possess, while the retrieval-simulated setting is described only at a high level with no evidence that it preserves the conditional, point-by-point rebuttal structure that defines the task. Neither setting is shown to avoid loss of critical legal or technical context.
- [Experiments] Experiments section: While the abstract refers to 'extensive experiments' and 'critical insights' into model performance and task asymmetries, the manuscript supplies no quantitative results, baselines, statistical tests, or tables that would allow verification of the reported differences between proprietary and open-source models.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which identifies key areas where the manuscript can be strengthened to better support the claims about PatRe as a benchmark for the full patent examination lifecycle. We will revise the paper to incorporate additional documentation, clarifications, and results as outlined below. Our responses address each major comment directly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The claim that PatRe comprises 480 real-world cases that capture the interactive and iterative nature of patent examination is unsupported by any details on case selection criteria, inter-annotator agreement, quality controls, or the distribution of rejection grounds (e.g., 35 U.S.C. §102/103/112), technology classes, or examination round depth. This information is load-bearing for the central assertion that the benchmark models real interactive dynamics without selection bias.
Authors: We agree that the abstract lacks these supporting details and that they are essential for establishing the benchmark's validity and lack of selection bias. The full manuscript (Section 3: Dataset Construction) describes sourcing from public USPTO records, selection criteria requiring at least two full rounds of office action and rebuttal, quality controls via expert legal review, inter-annotator agreement scores (Cohen's kappa > 0.75 for rejection type and claim mapping labels), and distributional statistics on rejection grounds (e.g., 45% §102, 38% §103, 17% §112), CPC technology classes, and examination depth (mean 2.8 rounds). To make this information load-bearing and prominent, we will revise the abstract to include a brief summary of these elements and add a new Table 1 in the main text with key statistics and a short discussion of representativeness and mitigation of selection bias. revision: yes
-
Referee: [Evaluation Protocol] Evaluation settings (oracle and retrieval-simulated): The oracle setting supplies perfect prior context that real examiners never possess, while the retrieval-simulated setting is described only at a high level with no evidence that it preserves the conditional, point-by-point rebuttal structure that defines the task. Neither setting is shown to avoid loss of critical legal or technical context.
Authors: We acknowledge the concern and will clarify the design rationale and limitations. The oracle setting is explicitly positioned as an idealized upper bound (common in multi-turn generation benchmarks) to measure intrinsic generation capability separate from retrieval noise; we will add explicit discussion of its departure from real examiner workflows. For the retrieval-simulated setting, we will expand Section 4 with a detailed description of the retrieval pipeline (dense retriever over prior art and claim sections, conditioned per rejection point), include concrete examples showing preservation of point-by-point conditional structure, and report a new human evaluation of context fidelity (legal/technical coherence scores). We will also add analysis quantifying potential context loss and discuss how the settings together bracket realistic performance. revision: yes
-
Referee: [Experiments] Experiments section: While the abstract refers to 'extensive experiments' and 'critical insights' into model performance and task asymmetries, the manuscript supplies no quantitative results, baselines, statistical tests, or tables that would allow verification of the reported differences between proprietary and open-source models.
Authors: This is a fair observation; the current draft presents high-level findings in the main Experiments section while placing the supporting quantitative results, tables, baselines, and statistical tests in the appendix. We will revise by moving the core results (model comparisons on GPT-4o, Claude-3, Llama-3-70B, Mistral-Large with metrics including ROUGE-L, BERTScore, expert-rated legal accuracy, and task asymmetry deltas) into the main body, adding explicit baselines (zero-shot, few-shot, and fine-tuned variants), reporting statistical significance (paired t-tests with p-values), and expanding the discussion of proprietary vs. open-source differences and examiner vs. applicant asymmetries. This will allow direct verification of the abstract claims. revision: yes
Circularity Check
No circularity: PatRe is a dataset/benchmark introduction with no derivations or self-referential reductions
full rationale
The paper introduces PatRe as a new benchmark comprising 480 real-world cases for modeling the full patent examination lifecycle (Office Action generation and rebuttal), with oracle and retrieval-simulated settings. No equations, fitted parameters, predictions, or derivation chains are present in the abstract or described contribution. The work is self-contained as an artifact release rather than a claim derived from prior outputs or self-citations. Representativeness of the cases is a validity concern about selection bias and context preservation, not a circularity issue per the specified patterns (no self-definitional steps, no fitted inputs called predictions, no load-bearing self-citations for uniqueness). This matches the default expectation for non-circular dataset papers.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URLhttps://api.semanticscholar.org/CorpusID:261697361. Ryan Lee, Alexander Spangher, and Xuezhe Ma. Patentedits: Framing patent novelty as textual entailment.arXiv preprint arXiv:2411.13477, 2024. Shuaimin Li, Liyang Fan, Yufang Lin, Zeyang Li, Xiangpeng Wei, Shiwen Ni, Hamid Alinejad-Rokny, and Min Yang. Automatic paper reviewing with heterogeneous graph...
-
[2]
notorious well-known
(Currently Amended) A system for just-in-time (JIT) game development and gameplay, the system comprising:a game engine; oracle comprising one or more Al models [ ... Claims 2-30 omitted for brevity ... ] •Oracle Art: Ref US20240390801A1: A game server for providing updates related to an online game, wherein the game server is configured to sandbox potenti...
2025
-
[3]
Claims 2-12 omitted for brevity
A system for at least partially filling an orchestra pit or extending a stage, comprising [ ... Claims 2-12 omitted for brevity ... ] •Candidate Pool: –Oracle Art: Ref US4332116: 1. A building structure formed from prefabricated building components, the plan view of which is a polygon havinga central support column [ ... Other pertinent references omitted...
-
[4]
right to exclude
This listing of claims will replace all prior versions, and listings, of claims in the application [...] •Candidate Pool: –Oracle Art: Ref US11117934: (A myxoma virus (MYXV) having enhanced anti-cancer activity, wherein the MYXV is genetically engineered to attenuate an activity or expression level of its M153 proteinn[...] [ ... Other pertinent reference...
-
[5]
’934 Patent
A rejection under 35 U.S.C. 103 is made. Table 14| Example of generated Office Action under Retrieval-Simulated Generation (OA-RS) setting. 18 PatRe: A Full-Stage Office Action and Rebuttal Generation Benchmark for Patent Examination Task:Direct Prompting Generation(OA-DP) [Initial] Input: •CLAIMS: 1: An extendable support leg comprising:a top leg tube (1...
2013
-
[6]
Evaluate the claims based on your internal technical knowledge base and common sense available to a PHOSITA
Different OA settings [Directly Prompting (OA-DP)]Evidence-First Logic (Intrinsic Knowledge):No prior art documents are provided. Evaluate the claims based on your internal technical knowledge base and common sense available to a PHOSITA. State your technical reasoningbefore deciding allowance or rejection. [Reference-Oracle Generation (OA-RO)]Reference S...
-
[7]
Claims 1–10 are rejected under 35 U.S.C. 112(b)
Different OA settings 2.Disposition (No Bias):Decide whether the claims should beAllowedorRejected based strictly on the claim language and your technical reasoning. Do not assume a rejection merely because you are an examiner.Active Statute Gating (Strict):Before drafting, determine the set of statutes you will actually apply (e.g., 101, 102, 103, 112, d...
-
[8]
Use only the provided inputs; do not reference external databases or limitations of access
-
[9]
Do not apologize, request additional data, or include disclaimers
-
[10]
Maintain strict scrutiny for hallucinated technical details; if a feature is not present in the reference summary, it must be treated as nonexistent
-
[11]
[User Message]Evaluate the generated Office Action based on the provided context
Always produce the required professional report and structured JSON output. [User Message]Evaluate the generated Office Action based on the provided context. Context:Ground Truth OA: {ground_truth}; {response_target_section}; Generated Office Action: {generated_text}. Core Rules:
-
[12]
Do not assume facts not stated in Ground Truth or Reference Summaries
Use only the provided text. Do not assume facts not stated in Ground Truth or Reference Summaries
-
[13]
If the disposition label (Non-Final / Final / Allowance) does not match Ground Truth, cap soundness at 3
-
[14]
Figure/Drawing Exception: If a discrepancy is solely due to figures or drawings not provided in text, do not deduct points
-
[15]
total_score
Do not apologize or refuse. Provide determinate scores with evidence. Evaluation Criteria (1–10 unless noted): 1.soundness:Conclusion correctness and reasoning alignment. Must be 0–3 if the disposition is incorrect... (Omit) 2.clarity:Professionalism and readability. 3.completeness:Coverage of claims and rejection grounds. 4.constructiveness:Quality of ac...
-
[16]
Use only the provided text; do not assume unstated facts
-
[17]
Figure/Drawing Exception: discrepancies due solely to missing figures should not be penalized
-
[18]
total_score
Avoid boilerplate; require point-specific reasoning. Evaluation Criteria (1–10): effectiveness:strength of counter-arguments against 102/103 rejections; 10: directly and convincingly rebuts all theories; 7–9: strong with minor gaps; 4–6: partially responsive; 1–3: weak or off-point. soundness:specificity to OA points; 10: fully claim-specific and detailed...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.