arxiv: 2602.04113 · v3 · submitted 2026-02-04 · 💻 cs.CR · cs.LG

Recognition: 3 theorem links

· Lean Theorem

ZKBoost: Zero-Knowledge Verifiable Training for XGBoost

Nikolas Melissaris , Antigoni Polychroniadou , Akira Takahashi , Chenkai Weng , Jiayi Xu

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:12 UTC · model grok-4.3

classification 💻 cs.CR cs.LG

keywords zero-knowledge proofsXGBoostproof of trainingverifiable machine learninggradient boosted treescryptographic protocols

0 comments

The pith

ZKBoost enables the first zero-knowledge proof that an XGBoost model was trained correctly on a committed dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ZKBoost as a protocol that lets a model owner prove they ran the full XGBoost training procedure on a specific private dataset without disclosing either the data or the learned model parameters. This matters for tabular-data applications where downstream users or regulators need cryptographic assurance that the model was not manipulated after training. The protocol avoids the prohibitive cost of executing the entire gradient-boosting process inside a general zero-knowledge circuit by supplying a specialized template that decomposes the training steps into cheaper sub-proofs. A second contribution is a VOLE-based concrete instantiation that closes previous soundness and topology-leakage gaps while preserving efficiency. The authors also replace floating-point arithmetic with a fixed-point version whose accuracy loss is bounded by 1 percent on standard benchmarks.

Core claim

ZKBoost is the first zkPoT protocol for XGBoost; it supplies a generic template that works with any ZKP backend and reduces prover cost relative to naive re-execution, together with a VOLE instantiation that eliminates topology leakage and soundness gaps, all resting on a fixed-point XGBoost variant whose accuracy remains within 1 percent of the original floating-point implementation.

What carries the argument

A generic zkPoT template that decomposes XGBoost training into oblivious partitioning and split-finding sub-proofs, instantiated via VOLE for security.

If this is right

Model owners can publish a commitment to both data and model while allowing anyone to verify training integrity.
The same template can be reused with future general-purpose ZKP systems to further lower prover time.
Fixed-point training opens the door to similar verifiable pipelines for other gradient-boosted tree methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decomposition may extend to proving inference on the resulting trees without additional topology leakage.
If the 1 percent accuracy gap holds across more domains, regulators could accept fixed-point models as drop-in substitutes for floating-point ones in audited settings.

Load-bearing premise

That the fixed-point arithmetic version of XGBoost reproduces standard accuracy to within 1 percent and that the VOLE instantiation introduces no new soundness or leakage vulnerabilities.

What would settle it

An experiment in which a malicious prover submits a proof for training on a different committed dataset or with altered splits; the verifier must reject the proof with overwhelming probability.

Figures

Figures reproduced from arXiv: 2602.04113 by Akira Takahashi, Antigoni Polychroniadou, Chenkai Weng, Jiayi Xu, Nikolas Melissaris.

read the original abstract

Gradient boosted decision trees, particularly XGBoost, are among the most effective methods for tabular data. As deployment in sensitive settings increases, cryptographic guarantees of model integrity become essential. We present ZKBoost, the first zero-knowledge proof of training (zkPoT) protocol for XGBoost, enabling model owners to prove correct training on a committed dataset without revealing data or model parameters. Naively re-executing XGBoost training in ZK would incur prohibitive costs, primarily due to the oblivious partitioning of training samples and unknown tree splits. Moreover, previous work on ZKP of training and inference had subtle security issues, such as leakage of tree topology and soundness gaps allowing cheating model providers to deviate from the correct execution of training and inference. We make two key contributions to address these challenges: (1) a generic zkPoT template for XGBoost that can be instantiated with any general-purpose ZKP backend, significantly improving prover costs compared to naive re-execution of the training process; and (2) a VOLE-based instantiation that overcomes the security issues of previous ZK proofs of training at minimal costs. To maximize efficiency, we develop a fixed-point version of XGBoost, which is particularly well suited for efficient instantiation of ZKP, and show it matches standard XGBoost accuracy to within 1\% on real-world datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents ZKBoost as the first zero-knowledge proof of training (zkPoT) protocol for XGBoost. It enables model owners to prove correct training on a committed dataset without revealing the data or model parameters. The two main contributions are (1) a generic zkPoT template for XGBoost that can be instantiated with any general-purpose ZKP backend and significantly reduces prover costs relative to naive re-execution of oblivious partitioning and split selection, and (2) a VOLE-based instantiation that closes prior soundness gaps and topology-leakage vulnerabilities. The paper also introduces a fixed-point arithmetic version of XGBoost claimed to retain accuracy within 1% of floating-point XGBoost on real-world datasets.

Significance. If the security proofs and efficiency claims hold, the work would constitute a meaningful advance in verifiable machine learning for gradient-boosted trees. XGBoost is widely deployed on tabular data in regulated domains; a sound zkPoT protocol would allow cryptographic attestation of training integrity without data or model disclosure. The generic template plus VOLE instantiation, together with the fixed-point approximation, directly targets the dominant cost and security obstacles identified in prior ZKP-of-training literature.

major comments (2)

[§4] §4 (Fixed-point XGBoost): the claim that the fixed-point implementation 'matches standard XGBoost accuracy to within 1%' must be accompanied by per-dataset tables reporting exact test accuracy, AUC, or log-loss for both versions, together with the precise fixed-point bit widths and rounding modes used. Without these numbers it is impossible to judge whether the 1% bound is uniform or dataset-dependent.
[§5.2] §5.2 (VOLE instantiation): the security argument that the new construction eliminates the soundness gaps and topology leakage of prior ZK proofs of training is load-bearing for the central claim. The manuscript must supply a formal security theorem (in the random-oracle or ideal-cipher model) together with a proof sketch showing that any deviation from correct training is detected with overwhelming probability; a high-level description alone is insufficient.

minor comments (2)

Define the acronym VOLE at first use and state the precise security assumptions (e.g., correlation robustness) under which the instantiation is proven secure.
Figure 2 (protocol diagram) should include explicit labels for the committed dataset, the tree topology commitment, and the final model commitment so that the flow from template to VOLE instantiation is immediately readable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate the requested details and formal analysis.

read point-by-point responses

Referee: [§4] §4 (Fixed-point XGBoost): the claim that the fixed-point implementation 'matches standard XGBoost accuracy to within 1%' must be accompanied by per-dataset tables reporting exact test accuracy, AUC, or log-loss for both versions, together with the precise fixed-point bit widths and rounding modes used. Without these numbers it is impossible to judge whether the 1% bound is uniform or dataset-dependent.

Authors: We agree that detailed per-dataset results are necessary for a rigorous evaluation. In the revised manuscript we will add a new table in Section 4 that reports exact test accuracy, AUC, and log-loss for both the fixed-point and floating-point XGBoost versions on every evaluated dataset. We will also specify the exact fixed-point parameters used, including total bit width, number of fractional bits, and the rounding mode (e.g., round-to-nearest-even). This will make the 1% bound fully transparent and allow readers to assess its uniformity. revision: yes
Referee: [§5.2] §5.2 (VOLE instantiation): the security argument that the new construction eliminates the soundness gaps and topology leakage of prior ZK proofs of training is load-bearing for the central claim. The manuscript must supply a formal security theorem (in the random-oracle or ideal-cipher model) together with a proof sketch showing that any deviation from correct training is detected with overwhelming probability; a high-level description alone is insufficient.

Authors: We acknowledge that a formal theorem and proof sketch are required. In the revised version we will insert a formal security theorem in Section 5.2 (stated in the random-oracle model) asserting that the VOLE-based instantiation detects any deviation from correct training with overwhelming probability. A corresponding proof sketch will be added to the main text or appendix, explicitly showing how the construction closes the soundness gaps and eliminates topology leakage present in prior work. The proof will rely on the binding and hiding properties of the underlying VOLE protocol together with the generic template. revision: yes

Circularity Check

0 steps flagged

No significant circularity in ZKBoost derivation

full rationale

The paper presents a new generic zkPoT template for XGBoost plus a VOLE instantiation, relying on standard ZKP backends and addressing prior security gaps via explicit protocol design rather than any self-referential reduction. The fixed-point XGBoost variant is introduced for ZKP compatibility and validated by direct empirical comparison to floating-point accuracy on external datasets; no equations or claims reduce by construction to fitted inputs, self-citations, or renamed known results. The central claims remain independent of the paper's own outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The work relies on standard cryptographic assumptions for ZK proofs and empirical validation of the fixed-point XGBoost variant.

free parameters (1)

fixed point precision parameters
The fixed-point approximation requires choosing bit widths or scaling factors to achieve the claimed accuracy match.

axioms (1)

domain assumption Cryptographic security of the ZKP backend and VOLE protocol
The protocol's security depends on the underlying zero-knowledge proof system's assumptions.

pith-pipeline@v0.9.0 · 5549 in / 1186 out tokens · 32479 ms · 2026-05-16T08:12:11.239488+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J uniquely satisfies the reciprocal cost functional equation) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We develop a fixed-point version of XGBoost... SigmoidWideFP(z) = 0/1 or (z+2)/4; LogitFromProbFP via truncated atanh Taylor series; gain = ½(GL²/(HL+λ) + ... ) − γ with fixed-point division.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean LogicNat recovery and embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

CertXGB validates each tree independently by leaf-to-root histogram reconstruction and gain-maximality checks; uses dummy splits for pruning to hide topology.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking (D=3 from linking) unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

VOLE-based gadgets for comparison via MSB(bit-decomposition) with extra nonzero check (1−w)·s=0, division via residue r with range checks 0≤r<y and z<⌊p/y⌋.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages

[1]

A Additional Preliminaries A.1 Further Details of XGBoost Theoretical Background XGBoost is a scalable implementation of gradient boosted decision trees

ACM, 2020. A Additional Preliminaries A.1 Further Details of XGBoost Theoretical Background XGBoost is a scalable implementation of gradient boosted decision trees. Given a datasetDand a differentiable convex loss functionℓ:R×Y→R, the algorithm trains an ensemble ofmregression treesT={Tk}m k=1 such that the final prediction is p=σ(z)z=z 0− m∑ k=1 η·wk,i w...

work page 2020
[2]

zero-knowledge proofs of training

For each boosting roundk= 1,...,m, it builds a new treeTk with the following key steps. (a)Prediction and derivatives.For every data pointi∈[n], the algorithm computes the current probabilitypi =σ(zk−1,i), the gradientgi =p i−yi, and the Hessianhi =p i(1−pi). (b)Split search.For each node, containing samples indexed byI⊆[n], the algorithm scans candidate ...

work page
[3]

We show thatCertXGB(x,y,T,z 0) = 0

= TrainXGB(x,y). We show thatCertXGB(x,y,T,z 0) = 0. Ifz 0̸=z′ 0: SinceValidateLogitcomputes the logit fromyas inTrainXGB,CertXGBderives the samez′ 0 fromy. Thus, it holds thatz 0̸=z′ 0, and thusCertXGBreturns 0. IfT 1̸=T′ 1 and(f 1,t 1)̸= (f′ 1,t′ 1)(i.e., the first tree contains an incorrect split):SinceCertXGBruns InitHists, which computespi,g i, andh ...

work page
[4]

Otherwise, it commits tou=x −1

Ifx= 0,Pcommits toJuKsuch thatu= 0. Otherwise, it commits tou=x −1

work page
[5]

Comparison.On inputJxK,JyK,JsK, prove thats=1{JxK<JyK}

Prove thatJyK−JuK·JxK= 0and(1−JyK)·JxK= 0. Comparison.On inputJxK,JyK,JsK, prove thats=1{JxK<JyK}. Define parameters(d,t)such thatn−1 =d·t. Construct a public lookup tableT= (0,...,2d−1)

work page
[6]

Decomposezinto(J˜z 0K,...,J˜zt−1K)such that ˜zi = ∑t j=0 2j·zid+j

Commits toJzK=JxK−JyKandJsK=MSB(JzK). Decomposezinto(J˜z 0K,...,J˜zt−1K)such that ˜zi = ∑t j=0 2j·zid+j

work page
[7]

Prove that fori∈[0,t),˜zi =T[˜zi]

work page
[8]

Prove thatJzK= 2 n−1·JsK+ ∑t−1 i=0 2id·J˜ziK

work page
[9]

Prove that(1−JwK)·JsK= 0

Commit toJwK=1{JzK̸= 0}and prove its correctness using the above nonzero check. Prove that(1−JwK)·JsK= 0. Division.On inputJxK,JyK,JzK, prove thatz=⌊x/y⌋. Definemq andm d as the upper bound of the abstract value of the quotient and denominator. 1.Pcommits to abstract valuesJ¯xK,J¯yK,J¯zKand prove that¯i= (1−2·MSB(i))·ifori∈{x,y,z}. 2.Pcommits to the resid...

work page
[10]

Prove that0≤J¯rK<J¯yK,J¯yK≤md andJ¯zK<m q

work page
[11]

24 T runcation.On inputJxK,JzK,f, prove thatz=⌊x/2 f⌋

Prove thatMSB(z) =MSB(x)⊕MSB(y)using the fact that for any binary valuesα,β,α⊕β= α+β−2α·β. 24 T runcation.On inputJxK,JzK,f, prove thatz=⌊x/2 f⌋. The proof is the same as the division except thaty= 2 f is public so that its range check is avoided. Additionally,zis bounded by mq =⌊p/2f+1⌋. Ifpis a Mersenne prime, the check of0≤JrK<2f andz <mq can both be d...

work page
[12]

Upon receivinginitfrom both parties, and additional inputxand decommitment message decomx fromP, check whethercom x correctly decommits tox

work page
[13]

Send(m) toPand(k,∆)toV

Upon receivingcvolefrom both parties, sample a VOLE correlationm=k+x·∆. Send(m) toPand(k,∆)toV. IfPis corrupted, receive(m)fromSand recomputek=m−x·∆. IfVis corrupted, receive(k,∆)fromVand recomputem=k+x·∆. Send these values to corresponding parties. Functionality 3:F RAMZK The functionality interacts with three parties: a proverP, a verifierV, and an idea...

work page
[14]

If any relation is not satisfied by x,w, it aborts

For arithmetic components, it verifies arithmetic relations. If any relation is not satisfied by x,w, it aborts

work page
[15]

It checks the data output byreadaccesses, and modifies the table according towriteaccesses

For RAM components, it initializes the memory using data fromxorw. It checks the data output byreadaccesses, and modifies the table according towriteaccesses. If any access operation is inconsistent, it aborts. Theorem 1.Define the relation by the proof of fixed-point XGBoost Training. Protocol 1 securely realizesF CP in the(F CVOLE,F RAMZK)-hybrid model....

work page