An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

Jingjing Huo; Jun Li; Pan Liu; Qian Zhu; Wanqing Xu; Wenyan Yang; Xinnan Guo; Xuan Lin

arxiv: 2603.14463 · v1 · submitted 2026-03-15 · 💻 cs.CL

Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin

Pith reviewed 2026-05-15 11:14 UTC · model grok-4.3

keywords: insurance LLM, domain specialization, hallucination control, SFT-RL curriculum, verifiable data synthesis, INSEva benchmark, RLVR, RLAIF

The pith

INS-S1 reaches state-of-the-art insurance domain performance while preserving general capabilities and holding hallucinations to 0.6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents INS-S1, a family of insurance-specific large language models trained through a new end-to-end alignment process. Two core innovations drive the work: a Verifiable Data Synthesis System that builds hierarchical datasets for actuarial reasoning and compliance checks, and a Progressive SFT-RL Curriculum Framework that combines dynamic data annealing with verified reasoning rewards and AI feedback. Experiments show the resulting models lead on insurance tasks, beat general models such as DeepSeek-R1 and Gemini-2.5-Pro, keep top-tier scores on broad benchmarks, and record a 0.6% hallucination rate on HHEM. The authors also release INSEva, a 39k-sample insurance benchmark, to support further testing. A sympathetic reader would care because the results suggest domain mastery in regulated fields need not come at the cost of broad intelligence or require ongoing retrieval systems.

Core claim

Through a Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance together with a Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing and a mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF), domain specialization for insurance can be achieved without competence trade-offs, yielding SOTA performance on domain tasks, maintained top-tier general capabilities, and a 0.6% hallucination rate on HHEM.

What carries the argument

The Progressive SFT-RL Curriculum Framework that optimizes data ratios and reward signals via dynamic data annealing combined with RLVR and RLAIF to enforce domain constraints while avoiding catastrophic forgetting.
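
The summary does not spell out the annealing schedule, so the following is only a minimal sketch of what "dynamic data annealing" could look like, assuming a linear ramp of the insurance-data share over training steps. The names `domain_fraction` and `sample_batch`, the start and end fractions, and the linear shape are all illustrative assumptions, not values from the paper.

```python
import random

def domain_fraction(step: int, total_steps: int,
                    start: float = 0.2, end: float = 0.8) -> float:
    """Hypothetical linear schedule for the share of insurance data in a batch."""
    if total_steps <= 1:
        return end
    # Clamp progress to [0, 1] so out-of-range steps stay on the schedule.
    progress = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + (end - start) * progress

def sample_batch(general, domain, step, total_steps, batch_size, rng):
    """Draw a mixed batch whose domain share follows the schedule above."""
    n_domain = round(batch_size * domain_fraction(step, total_steps))
    n_general = batch_size - n_domain
    # Sampling with replacement keeps the sketch short.
    return rng.choices(domain, k=n_domain) + rng.choices(general, k=n_general)

if __name__ == "__main__":
    rng = random.Random(0)
    general = [f"general-{i}" for i in range(100)]
    domain = [f"insurance-{i}" for i in range(100)]
    for step in (0, 50, 99):
        batch = sample_batch(general, domain, step, 100, 10, rng)
        share = sum(x.startswith("insurance") for x in batch) / len(batch)
        print(step, share)
```

An ablation of the kind the referee asks for would, under this assumption, replace the schedule with a constant fraction and compare general-benchmark scores.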

If this is right

Insurance tasks can be performed with high accuracy and near-zero tolerance for regulatory errors using a single model.
General capabilities on non-insurance benchmarks remain competitive with leading frontier models.
Hallucination rates on domain-specific factual checks can be driven below 1% without external retrieval.
A 39k-sample insurance benchmark (INSEva) is now available for standardized evaluation of future models.
The same data-synthesis and curriculum approach can be applied to other regulated verticals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method may reduce dependence on retrieval-augmented generation for high-stakes factual accuracy.
Similar curriculum designs could be tested in other regulated fields such as legal or medical reasoning.
If the annealing schedule generalizes, larger models could be specialized with less risk of forgetting base capabilities.

Load-bearing premise

The curriculum framework with dynamic data annealing and the RLVR-RLAIF mix can enforce domain constraints and prevent catastrophic forgetting without any hidden trade-offs or later degradation.

What would settle it

A post-training evaluation on standard general benchmarks that shows scores falling below the top tier of general models would falsify the no-trade-off claim. An independently measured HHEM hallucination rate above 0.6% would undercut the hallucination-control claim.

Original abstract

Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces INS-S1 and the INSEva benchmark with a curriculum meant to deliver domain mastery without general capability loss, but supplies no ablations or before-after checks to confirm the curriculum actually prevents forgetting.

The main new pieces are the INS-S1 model family and the INSEva benchmark of 39k+ insurance samples. The method combines a verifiable hierarchical data synthesis system with a progressive SFT-RL curriculum that mixes RLVR and RLAIF plus dynamic data annealing. The authors report SOTA domain results that beat DeepSeek-R1 and Gemini-2.5-Pro, plus top-tier general performance and a 0.6% hallucination rate on HHEM. Releasing the benchmark is a practical step that gives the field something concrete to test against. The curriculum idea itself is a reasonable attempt to address the common trade-off problem through controlled data ratios and reward signals.

The soft spot is exactly what the stress-test flagged: no ablation that removes the annealing schedule or the RL mix, no checkpointed scores on standard general benchmarks comparing the base model to INS-S1, and no quantitative forgetting measure on non-insurance tasks. Without those controls it is difficult to separate the claimed training dynamics from simple data selection or base-model strength. The hallucination number is striking if the measurement protocol holds up, but the abstract gives little detail on how it was obtained.

This work is aimed at people building LLMs for regulated domains where reliability matters more than raw scale. It deserves a serious referee because the problem is important and the framework is described in enough detail to review, even though the current evidence would need the missing controls before the central claim can be taken as settled.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces INS-S1, a family of insurance-specific large language models trained through a Verifiable Data Synthesis System for hierarchical datasets on actuarial reasoning and compliance, and a Progressive SFT-RL Curriculum Framework that uses dynamic data annealing combined with Verified Reasoning (RLVR) and AI Feedback (RLAIF). The authors release INSEva, a comprehensive insurance benchmark comprising over 39,000 samples, and report that INS-S1 achieves state-of-the-art performance on domain-specific tasks, outperforming models such as DeepSeek-R1 and Gemini-2.5-Pro, while preserving top-tier general capabilities and attaining a 0.6% hallucination rate as measured by HHEM.

Significance. If the central claims hold after additional controls, this would represent a meaningful advance in high-stakes domain adaptation of LLMs, showing that specialization in regulated fields like insurance can be achieved with low hallucination rates and without the usual loss of general capabilities. The public release of the INSEva benchmark is a concrete positive contribution that could support future work in insurance NLP and related verticals.

major comments (2)

[Methods (Progressive SFT-RL Curriculum Framework) and Results] The central claim that the Progressive SFT-RL Curriculum prevents catastrophic forgetting while enforcing domain constraints lacks direct supporting evidence. No ablation removing the dynamic data annealing schedule is reported, and no checkpointed evaluations on general benchmarks (MMLU, GSM8K, etc.) comparing the base model to INS-S1 are provided in the results or methods sections. Without these, it remains possible that preserved general performance is an artifact of data selection rather than the claimed training dynamics.
[Results and Experiments] The SOTA performance claims on domain tasks and the 0.6% HHEM hallucination rate are presented without baseline tables that include error bars, statistical tests, or a detailed measurement protocol for HHEM. This information is required to assess whether the reported outperformance over DeepSeek-R1 and Gemini-2.5-Pro is robust.
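
The error bars requested above could take several forms. One common option for a small rate such as 0.6% is a Wilson score interval, sketched below. The sample sizes are illustrative assumptions; this summary does not state how many HHEM evaluations were run.

```python
import math

def wilson_interval(failures: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)

if __name__ == "__main__":
    # Hypothetical counts chosen to give a 0.6% observed rate.
    for n in (1_000, 10_000):
        failures = round(0.006 * n)
        low, high = wilson_interval(failures, n)
        print(f"n={n}: {low:.4%} to {high:.4%}")
```

The interval narrows as the number of evaluated samples grows, which is why the evaluation set size matters when comparing a 0.6% rate against other models.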

minor comments (2)

[Abstract] The abstract states that data ratios and reward signals are optimized but does not specify the concrete values, search procedure, or sensitivity analysis used.
[Benchmark Description] Additional details on the construction, scenario diversity, and inter-annotator agreement for the INSEva benchmark would help readers evaluate its coverage and reliability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and commit to revisions that directly strengthen the supporting evidence for our claims.

Referee: [Methods (Progressive SFT-RL Curriculum Framework) and Results] The central claim that the Progressive SFT-RL Curriculum prevents catastrophic forgetting while enforcing domain constraints lacks direct supporting evidence. No ablation removing the dynamic data annealing schedule is reported, and no checkpointed evaluations on general benchmarks (MMLU, GSM8K, etc.) comparing the base model to INS-S1 are provided in the results or methods sections. Without these, it remains possible that preserved general performance is an artifact of data selection rather than the claimed training dynamics.

Authors: We agree that explicit ablations and checkpointed evaluations would provide stronger direct support for the role of the curriculum dynamics. In the revised manuscript we will add an ablation that removes the dynamic data annealing schedule and report checkpointed performance on MMLU and GSM8K at multiple training stages, comparing the base model to INS-S1. These additions will isolate the contribution of the Progressive SFT-RL Curriculum from data selection effects.
Revision: yes
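
One simple way to report such checkpointed comparisons is the per-benchmark score change relative to the base model. The sketch below assumes scores are stored as plain dictionaries; the function names, the checkpoint names, and every number are made-up placeholders, not results from the paper.

```python
def forgetting_report(base, checkpoints):
    """Return {checkpoint: {benchmark: score - base_score}} for shared benchmarks."""
    report = {}
    for name, scores in checkpoints.items():
        report[name] = {
            bench: scores[bench] - base[bench]
            for bench in base
            if bench in scores
        }
    return report

def worst_drop(report):
    """Most negative delta across all checkpoints, or 0.0 if nothing regressed."""
    deltas = [d for per_ckpt in report.values() for d in per_ckpt.values()]
    return min(deltas + [0.0])

if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    base = {"MMLU": 80.0, "GSM8K": 90.0}
    checkpoints = {
        "sft": {"MMLU": 79.0, "GSM8K": 91.0},
        "rl": {"MMLU": 80.5, "GSM8K": 90.5},
    }
    report = forgetting_report(base, checkpoints)
    print(report)
    print("worst drop:", worst_drop(report))
```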
Referee: [Results and Experiments] The SOTA performance claims on domain tasks and the 0.6% HHEM hallucination rate are presented without baseline tables that include error bars, statistical tests, or a detailed measurement protocol for HHEM. This information is required to assess whether the reported outperformance over DeepSeek-R1 and Gemini-2.5-Pro is robust.

Authors: We will revise the results section to include error bars from multiple runs, statistical significance tests (e.g., paired t-tests), and a detailed HHEM measurement protocol specifying the exact evaluation setup, verification steps, and scoring criteria. These changes will allow readers to assess the robustness of the reported outperformance and the 0.6% hallucination rate.
Revision: yes
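
For reference, the paired t-test mentioned in the response reduces to a one-sample test on per-item score differences. The sketch below computes only the t statistic and degrees of freedom with the standard library; turning that into a p-value needs a t distribution, for example via SciPy's `scipy.stats.ttest_rel`. The function name `paired_t_statistic` and the sample scores are illustrative assumptions.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired samples (a minus b)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1

if __name__ == "__main__":
    # Made-up per-task scores for two models on the same four tasks.
    model_a = [1.0, 2.0, 3.0, 4.0]
    model_b = [0.0, 0.0, 1.0, 1.0]
    t, dof = paired_t_statistic(model_a, model_b)
    print(f"t = {t:.3f}, degrees of freedom = {dof}")
```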

Circularity Check

0 steps flagged

No circularity: empirical results on independent benchmark

The paper describes a Progressive SFT-RL Curriculum Framework using dynamic data annealing, RLVR, and RLAIF, then reports final performance numbers on the newly released INSEva benchmark (39k+ samples) plus general-capability suites. No equations, derivations, or self-citations are shown that reduce any reported metric to a fitted data ratio or reward weight by construction. The central claim of maintained general intelligence is presented as an empirical outcome rather than a tautological identity. Absence of ablations is a limitation of evidence, not a circular reduction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The central claim depends on the unverified effectiveness of the Verifiable Data Synthesis System and the Progressive SFT-RL Curriculum Framework, plus several unstated training hyperparameters such as data ratios and reward weighting.

free parameters (2)

data ratios: optimized dynamically in the curriculum to balance domain constraints against general capability retention
reward signals: synergistic mix of RLVR and RLAIF weights chosen to enforce compliance
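
How the two reward sources are combined is not stated in this summary. One plausible form, shown purely as an assumption, is a convex combination of a binary verifier outcome and a scalar judge score. The weight `alpha` and the function name `mixed_reward` are hypothetical.

```python
def mixed_reward(verified: bool, judge_score: float, alpha: float = 0.7) -> float:
    """Convex mix of a rule-verified reward (RLVR-style) and a judge score (RLAIF-style).

    alpha is a hypothetical weight on the verifiable component.
    judge_score is assumed to lie in [0, 1].
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if not 0.0 <= judge_score <= 1.0:
        raise ValueError("judge_score must be in [0, 1]")
    return alpha * (1.0 if verified else 0.0) + (1.0 - alpha) * judge_score

if __name__ == "__main__":
    # A verified answer with a middling judge score, then an unverified one.
    print(mixed_reward(True, 0.5))
    print(mixed_reward(False, 0.5))
```

Under this assumption `alpha` is exactly the kind of free parameter the ledger flags: its value would need to be reported along with a sensitivity analysis.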

axioms (2)

Domain assumption: the Verifiable Data Synthesis System constructs hierarchical datasets that accurately capture actuarial reasoning and regulatory compliance. Invoked as the foundation for all subsequent training.
Domain assumption: the Progressive SFT-RL Curriculum with dynamic data annealing prevents catastrophic forgetting while locking in domain constraints. Central premise of the end-to-end alignment paradigm.

invented entities (2)

INS-S1 (no independent evidence). Purpose: insurance-specific LLM family. Produced by the new training paradigm.
INSEva (no independent evidence). Purpose: comprehensive insurance benchmark. 39k+ samples used to demonstrate SOTA performance.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery theorem · unclear
Paper passage: Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.