Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

George Koomullil

arxiv: 2605.16407 · v1 · pith:HVHB2WFJnew · submitted 2026-05-13 · 💻 cs.LO · cs.CL· cs.CR· cs.PL

Proof-Carrying Certificates for LLM Pipelines: A Trust-Boundary Architecture

George Koomullil This is my paper

Pith reviewed 2026-05-20 21:26 UTC · model grok-4.3

classification 💻 cs.LO cs.CLcs.CRcs.PL

keywords proof-carrying certificatesLLM pipelinestrust boundariesLean 4 verificationbilattice groundingHoare-style actionscompositional stabilityassurance cards

0 comments

The pith

LLM pipelines can be made auditable for high-stakes use by certifying only their deterministic structured parts with Lean 4 proofs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes a trust-boundary architecture that verifies the deterministic computations around large language models rather than the models themselves. Certificates are valid when they pass a Lean 4 kernel type check and a sorry-free audit against the small trusted axiom set consisting of propext, Classical.choice, and Quot.sound, while all other assumptions are explicitly declared and tiered. The framework supplies three certificate families together with two operators that together produce a per-call Universal Assurance Card. If the approach holds, regulated applications such as clinical decision support and agentic systems with irreversible actions could proceed with explicit, auditable guarantees instead of opaque model trust.

Core claim

The central claim is that isolating the deterministic structured computations surrounding an LLM allows three local certificate families—conflict-aware bilattice grounding with an emission-gate soundness lemma, embedding sensitivity and paraphrase stability, and Hoare-style agent action—plus the operators Maximal Certifiable Residue and Compositional Stability theorem to deliver pipeline-wide assurance. Validity reduces to a Lean 4 kernel type-check together with a transitive axiom audit against the fixed trusted set, all other assumptions being partitioned by tier and declared. A compiled reference artifact covers all 22 certificate types with 17 axiom-free and the remainder depending only,

What carries the argument

The trust-boundary architecture that partitions assumptions by tier and reduces certificate validity to Lean 4 kernel type-checking plus a sorry-free transitive audit against the trusted axiom set.

If this is right

High-stakes deployments such as clinical decision support and regulated finance can use per-call Universal Assurance Cards.
The Compositional Stability theorem supplies a closed-form pipeline-wide perturbation budget from per-layer gains and margins.
Abstentions convert to a maximum-weight certifiable residue with an audit log of dropped claims.
The three families were tested on adversarially perturbed HotpotQA and prompt-injection scenarios.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar tiered-assumption partitions could be applied in other proof assistants to handle mixed human and machine oracles.
If the isolation succeeds, the same certificate families might extend to non-LLM agent pipelines that contain deterministic control flow.
The approach suggests a route to regulatory acceptance that treats the LLM as an untrusted oracle while still producing auditable outputs.

Load-bearing premise

The deterministic structured computations surrounding an LLM can be cleanly isolated from the model and verified independently using the three proposed certificate families without the isolation itself introducing new failure modes in the overall pipeline.

What would settle it

An adversarial prompt injection that succeeds against the Hoare-style agent action certificate inside the filesystem sandbox while still passing the kernel audit and axiom check would show the isolation does not hold.

Figures

Figures reproduced from arXiv: 2605.16407 by George Koomullil.

read the original abstract

We present a framework for verifying the deterministic structured computations surrounding a large language model rather than the model itself, extending a Lean 4 trust-boundary architecture to the generic interfaces of modern LLM pipelines. Certificate validity is a Lean 4 kernel type-check plus a sorry-free transitive axiom audit against the trusted set {propext, Classical.choice, Quot.sound}; other assumptions are declared and partitioned by tier (mathematical placeholders, cryptographic assumptions, ML/human oracles). The technical contribution comprises three local certificate families and two operators. The families are conflict-aware bilattice grounding (with an emission-gate soundness lemma), embedding sensitivity and paraphrase stability, and Hoare-style agent action. The operators are a Maximal Certifiable Residue, which turns abstention into the maximum-weight certifiable residue with audit-logged dropped claims, and a Compositional Stability theorem, which yields a closed-form pipeline-wide perturbation budget from per-layer gains and margins. The three families plus a Universal Assurance Card consolidator form the per-call deliverable for high-stakes deployments: patent and legal retrieval, regulated finance, clinical decision support, and agentic systems with irreversible side effects. A compiled Lean 4 reference artifact (Lean v4.30.0-rc2, Mathlib) covers all 22 certificate types, with 17 of 46 kernel-audited declarations axiom-free, the rest depending only on the trusted set and declared assumptions, and zero uses of sorryAx or Lean.ofReduceBool. The three families are empirically tested through four registered pilots: bilattice grounding on adversarially perturbed HotpotQA, embedding sensitivity in short- and long-form settings, and Hoare-style agent action on a filesystem sandbox with adversarial prompt injection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper delivers a Lean 4 artifact for certifying deterministic layers around LLMs with three certificate families and two operators, but the boundary isolation claim needs closer checks.

read the letter

The main takeaway is that this work ships a sorry-free Lean 4 reference artifact covering 22 certificate types, with 17 declarations axiom-free and the rest tied only to a small trusted set. That is concrete and worth looking at if you care about auditable pipelines in regulated settings. The new pieces are the conflict-aware bilattice grounding with its emission-gate lemma, the embedding sensitivity and paraphrase stability family, the Hoare-style agent actions, plus the Maximal Certifiable Residue operator and the Compositional Stability theorem that gives a pipeline-wide perturbation budget. The four pilots on perturbed HotpotQA, short- and long-form embeddings, and a filesystem sandbox with prompt injection show some empirical grounding. The trusted axiom set is kept minimal and external, which avoids circularity. The soft spot is the assumption that deterministic structured computations can be cleanly cut out from the LLM without new failure modes at the interfaces. Parsing, state updates, or prompt handling could introduce dependencies that sit outside the three certificate families and the emission-gate soundness lemma. The stress-test note on boundary soundness is reasonable to probe; the artifact certifies the isolated fragments, but end-to-end coverage depends on whether those interfaces are fully captured by the declared oracles. This is aimed at people working on trustworthy AI and formal methods who need practical certificates for high-stakes uses like clinical support or agentic systems. It is not a full end-to-end verification of the model itself, but the formal part is handled properly. I would send it to peer review because the artifact and the registered pilots give referees something solid to evaluate.

Referee Report

2 major / 2 minor

Summary. The paper proposes a trust-boundary architecture for verifying deterministic structured computations surrounding LLMs in pipelines, rather than the models themselves. It introduces three certificate families (conflict-aware bilattice grounding with emission-gate soundness lemma, embedding sensitivity and paraphrase stability, Hoare-style agent actions) and two operators (Maximal Certifiable Residue and Compositional Stability theorem). Certificate validity reduces to Lean 4 kernel type-checking plus a sorry-free transitive audit against the minimal trusted axiom set {propext, Classical.choice, Quot.sound}, with other assumptions partitioned by tier. A Lean 4 artifact covers 22 certificate types (17 of 46 declarations axiom-free), and the approach is tested in four registered pilots including adversarially perturbed HotpotQA and filesystem sandbox agent actions.

Significance. If the isolation and boundary claims hold, the work offers a concrete route to adding machine-checked assurance to high-stakes LLM deployments (patent retrieval, regulated finance, clinical support, irreversible agentic actions) without attempting to verify the LLM core. Strengths include the sorry-free Lean 4 artifact with explicit trusted-set audit, the closed-form perturbation budget derived from per-layer gains, and the tiered assumption partitioning that keeps cryptographic and oracle assumptions separate from the kernel.

major comments (2)

Abstract and section on Hoare-style agent action family: the claim that deterministic computations can be extracted and certified independently rests on the assumption that LLM outputs enter the deterministic layer only through explicitly declared oracles whose effects are fully captured by the emission-gate soundness lemma and perturbation budget. The skeptic concern that parsing, state-update, or prompt-injection interfaces may introduce implicit dependencies not covered by the three families is load-bearing; without an explicit argument or lemma showing that the transitive audit extends across these boundaries, the end-to-end pipeline guarantee does not follow from the isolated fragments.
Section describing the four registered pilots: the empirical tests (bilattice grounding on HotpotQA, embedding sensitivity, Hoare-style actions in sandbox) must demonstrate that no new failure modes arise at the LLM-deterministic interface. The current description leaves open whether the pilots exercise the boundary conditions presupposed by the Compositional Stability theorem; if they do not, the pilots do not yet support the central claim that the certificate families suffice for the full pipeline.

minor comments (2)

The abstract states both '22 certificate types' and '17 of 46 kernel-audited declarations'; a brief clarifying sentence or table reconciling these two counts would prevent reader confusion about the artifact's scope.
A summary table listing the assumption tiers, their dependencies on the trusted set, and which certificate families rely on each tier would improve readability of the assurance architecture.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on the trust-boundary architecture. The comments correctly identify that the end-to-end guarantee depends on explicit interface handling and that the pilot descriptions should more clearly demonstrate coverage of boundary conditions. We address both points below and will incorporate clarifications and an additional bridging argument in the revised manuscript.

read point-by-point responses

Referee: Abstract and section on Hoare-style agent action family: the claim that deterministic computations can be extracted and certified independently rests on the assumption that LLM outputs enter the deterministic layer only through explicitly declared oracles whose effects are fully captured by the emission-gate soundness lemma and perturbation budget. The skeptic concern that parsing, state-update, or prompt-injection interfaces may introduce implicit dependencies not covered by the three families is load-bearing; without an explicit argument or lemma showing that the transitive audit extends across these boundaries, the end-to-end pipeline guarantee does not follow from the isolated fragments.

Authors: We agree that the end-to-end claim requires an explicit bridging argument to rule out implicit dependencies at parsing, state-update, and prompt-injection points. The emission-gate soundness lemma and oracle declarations are intended to capture these interfaces, with assumptions partitioned by tier. In the revised manuscript we will insert a new lemma (placed after the Hoare-style family) that composes the transitive audit across the declared boundaries by invoking the Compositional Stability theorem and the Maximal Certifiable Residue operator. This addition will make the skeptic concern directly addressable without altering the core architecture. revision: yes
Referee: Section describing the four registered pilots: the empirical tests (bilattice grounding on HotpotQA, embedding sensitivity, Hoare-style actions in sandbox) must demonstrate that no new failure modes arise at the LLM-deterministic interface. The current description leaves open whether the pilots exercise the boundary conditions presupposed by the Compositional Stability theorem; if they do not, the pilots do not yet support the central claim that the certificate families suffice for the full pipeline.

Authors: The pilots were explicitly constructed to probe the interfaces: the adversarially perturbed HotpotQA case exercises embedding sensitivity and bilattice grounding under perturbation budgets, while the filesystem sandbox pilot includes adversarial prompt injection to test the Hoare-style action boundary. No new failure modes were observed at these points. To remove any ambiguity, the revised pilot section will add a short mapping table that links each pilot to the specific boundary conditions and the assumptions of the Compositional Stability theorem, together with a statement confirming that the observed certificates remained valid under the declared oracles. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation anchored in external Lean kernel and fixed trusted axioms

full rationale

The paper defines certificate validity explicitly as a Lean 4 kernel type-check plus a sorry-free transitive axiom audit against the fixed external trusted set {propext, Classical.choice, Quot.sound}, with 17 of 46 declarations axiom-free and the remainder depending only on declared assumptions partitioned by tier. The three certificate families (bilattice grounding with emission-gate soundness lemma, embedding sensitivity, Hoare-style actions) and the Compositional Stability theorem are implemented as Lean artifacts checked against Mathlib and the kernel rather than derived from fitted parameters, self-referential definitions, or load-bearing self-citations. No equations or theorems reduce by construction to their own inputs; the isolation of deterministic computations is presented as an engineering claim verified through the formalization and pilots, not as a mathematical identity that presupposes its conclusion. This is the normal case of a self-contained formal development against an external proof assistant.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the Lean 4 kernel plus a small trusted axiom set and tiered declared assumptions; no numerical free parameters or newly postulated physical entities are introduced.

axioms (1)

standard math propext, Classical.choice, Quot.sound form the trusted set for axiom audit
Explicitly listed as the base against which all other assumptions are audited.

pith-pipeline@v0.9.0 · 5848 in / 1344 out tokens · 67222 ms · 2026-05-20T21:26:42.046354+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absoluteFloorClosureCert echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Certificate validity is a Lean 4 kernel type-check plus a sorry-free transitive axiom audit against the trusted set {propext, Classical.choice, Quot.sound}
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

17 of 46 kernel-audited declarations axiom-free, the rest depending only on the trusted set and declared assumptions, and zero uses of sorryAx

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 8 internal anchors

[1]

Safereinforcementlearningviashielding

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, andUfukTopcu. Safereinforcementlearningviashielding. InProceedingsof the AAAI Conference on Artificial Intelligence, pages 2669–2678. AAAI Press, 2018

work page 2018
[2]

Angelopoulos and Stephen Bates

Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction.Foundations and Trends in Machine Learning, 16(4):494–591, 2023

work page 2023
[3]

Formal certification ofcode-basedcryptographicproofs

Gilles Barthe, Benjamin Grégoire, and Santiago Zanella Béguelin. Formal certification ofcode-basedcryptographicproofs. InProceedingsofthe36thAnnualACMSIGPLAN- SIGACT Symposium on Principles of Programming Languages (POPL), pages 90–101. ACM, 2009

work page 2009
[4]

Nuel D. Belnap. A useful four-valued logic. In J. Michael Dunn and George Epstein, editors,Modern Uses of Multiple-Valued Logic, pages 5–37. Reidel, Dordrecht, 1977

work page 1977
[5]

Bohnet, V

Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, et al. Attributed question answer- ing: Evaluation and modeling for attributed large language models.arXiv preprint arXiv:2212.08037, 2022

work page arXiv 2022
[6]

OxfordUniversityPress,Oxford,1stedition, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: ANonasymptoticTheoryofIndependence. OxfordUniversityPress,Oxford,1stedition, 2013

work page 2013
[7]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642. Association for Computational Linguistics, 2015

work page 2015
[8]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), volume 33, pages 1877–1901. Curran Associates, Inc., 2020

work page 2020
[9]

Lean4Lean: VerifyingatypecheckerforLean, inLean.arXivpreprint arXiv:2403.14064, 2024

MarioCarneiro. Lean4Lean: VerifyingatypecheckerforLean, inLean.arXivpreprint arXiv:2403.14064, 2024

work page arXiv 2024
[10]

Daggitt, Omri Isac, Guy Katz, Verena Rieser, and Oliver Lemon

MarcoCasadio,TanviDinkar,EkaterinaKomendantskaya,LucaArnaboldi,MatthewL. Daggitt, Omri Isac, Guy Katz, Verena Rieser, and Oliver Lemon. NLP verification: To- wardsageneralmethodologyforcertifyingrobustness.arXivpreprintarXiv:2403.10144, 2024. 70

work page arXiv 2024
[11]

Cohen, Elan Rosenfeld, and J

Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. InProceedings of the 36th International Conference on Machine Learning (ICML), pages 1310–1320. PMLR, 2019

work page 2019
[12]

Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023

work page 2023
[13]

Abstract interpretation: A unified lattice model for staticanalysisofprogramsbyconstructionorapproximationoffixpoints

Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for staticanalysisofprogramsbyconstructionorapproximationoffixpoints. InConference Record of the Fourth ACM Symposium on Principles of Programming Languages (POPL), pages 238–252. ACM, 1977

work page 1977
[14]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks,KeshaHietala,EleftheriosIoannidis,etal

Joseph W. Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks,KeshaHietala,EleftheriosIoannidis,etal. Cedar: Anewlanguageforexpressive, fast, safe, and analyzable authorization.Proceedings of the ACM on Programming Languages, 8(OOPSLA1):670–697, 2024

work page 2024
[16]

AsurveyoftheprojectAUTOMATH

NicolaasGovertdeBruijn. AsurveyoftheprojectAUTOMATH. InJonathanP.Seldin and J. Roger Hindley, editors,To H. B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, pages 579–606. Academic Press, New York, 1980

work page 1980
[17]

The lean 4 theorem prover and programming language

LeonardodeMouraandSebastianUllrich. TheLean4theoremproverandprogramming language. In André Platzer and Geoff Sutcliffe, editors,Automated Deduction – CADE 28, volume 12699 ofLecture Notes in Computer Science, pages 625–635, Cham, 2021. Springer. doi: 10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021
[18]

Dijkstra.A Discipline of Programming

Edsger W. Dijkstra.A Discipline of Programming. Prentice-Hall, Englewood Cliffs, NJ, 1976

work page 1976
[19]

A mathematical framework for transformer circuits.Transformer Circuits Thread, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, et al. A mathematical framework for transformer circuits.Transformer Circuits Thread, 2021

work page 2021
[20]

What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.Transactions of the Association for Computational Linguistics, 8:34–48, 2020

Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.Transactions of the Association for Computational Linguistics, 8:34–48, 2020

work page 2020
[21]

A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics.Advances in Mathematics, 370:107239, 2020

Tobias Fritz. A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics.Advances in Mathematics, 370:107239, 2020. 71

work page 2020
[22]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

AI2: Safety and robustness certification of neural networks with abstract interpretation

Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaud- huri, and Martin Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In2018 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE Computer Society, 2018

work page 2018
[24]

Multivaluedlogics: Auniformapproachtoreasoninginartificial intelligence.Computational Intelligence, 4(3):265–316, 1988

MatthewL.Ginsberg. Multivaluedlogics: Auniformapproachtoreasoninginartificial intelligence.Computational Intelligence, 4(3):265–316, 1988

work page 1988
[25]

Notwhatyou’vesignedupfor: Compromisingreal-worldLLM-integrated applications with indirect prompt injection

KaiGreshake,SaharAbdelnabi,ShaileshMishra,ChristophEndres,ThorstenHolz,and MarioFritz. Notwhatyou’vesignedupfor: Compromisingreal-worldLLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23), pages 79–90. ACM, 2023

work page 2023
[26]

Cautious reinforcementlearningwithlogicalconstraints

Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Cautious reinforcementlearningwithlogicalconstraints. InProceedingsofthe19thInternational Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 483–491. IFAAMAS, 2020

work page 2020
[27]

C. A. R. Hoare. An axiomatic basis for computer programming.Communications of the ACM, 12(10):576–580, 1969. doi: 10.1145/363235.363259

work page doi:10.1145/363235.363259 1969
[28]

Probability inequalities for sums of bounded random variables

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963

work page 1963
[29]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, QianglongChen,WeihuaPeng,etal.Asurveyonhallucinationinlargelanguagemodels: Principles,taxonomy,challenges,andopenquestions.arXivpreprintarXiv:2311.05232, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38,

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Delong Chen, Wenliang Dai, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38,

work page
[31]

doi: 10.1145/3571730

work page doi:10.1145/3571730
[32]

Certified robustness to adversarial word substitutions

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. Certified robustness to adversarial word substitutions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4129–4142. Association for Computational Linguistics, 2019

work page 2019
[33]

Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu

Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: 72 Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023
[34]

Cliff B. Jones. Specification and design of (parallel) programs. InInformation Processing 83, Proceedings of the IFIP 9th World Computer Congress, pages 321–332. North-Holland, 1983

work page 1983
[35]

Weakest precondition reasoning for expected runtimes of randomized algo- rithms.Journal of the ACM, 65(5):30:1–30:68, 2018

Benjamin Lucien Kaminski, Joost-Pieter Katoen, Christoph Matheja, and Federico Olmedo. Weakest precondition reasoning for expected runtimes of randomized algo- rithms.Journal of the ACM, 65(5):30:1–30:68, 2018

work page 2018
[36]

Dense Passage Retrieval for Open-Domain Question Answering

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain ques- tionanswering. InProceedingsofthe2020ConferenceonEmpiricalMethodsinNatural Language Processing (EMNLP), pages 6769–6781. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp...

work page doi:10.18653/v1/2020.emnlp-main.550 2020
[37]

Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly

Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. InProceedings of the 58th Annual MeetingoftheAssociationforComputationalLinguistics,pages7811–7818.Association for Computational Linguistics, 2020

work page 2020
[38]

Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI + Lean 4 Pipeline

George Koomullil. Formally verified patent analysis via dependent type theory: Machine-checkable certificates from a hybrid AI + Lean 4 pipeline.arXiv preprint arXiv:2604.18882, April 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Semantics of probabilistic programs.Journal of Computer and System Sciences, 22(3):328–350, 1981

Dexter Kozen. Semantics of probabilistic programs.Journal of Computer and System Sciences, 22(3):328–350, 1981. doi: 10.1016/0022-0000(81)90036-2

work page doi:10.1016/0022-0000(81)90036-2 1981
[40]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, et al. Measuring faithfulness in chain-of- thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), volume 33, pages 9459–9474. Cu...

work page 2020
[42]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, et al. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 73 (EMNLP), pages 9004–9017. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.557

work page doi:10.18653/v1/2023.emnlp-main.557 2023
[44]

Proceedings of the 2023

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12076–12100. Association f...

work page doi:10.18653/v1/2023.emnlp-main.741 2023
[45]

George C. Necula. Proof-carrying code. InProceedings of the 24th ACM SIGPLAN- SIGACTSymposiumonPrinciplesofProgrammingLanguages(POPL),pages106–119. ACM, 1997

work page 1997
[46]

Peter W. O’Hearn. Resources, concurrency, and local reasoning.Theoretical Computer Science, 375(1–3):271–307, 2007. doi: 10.1016/j.tcs.2006.12.035

work page doi:10.1016/j.tcs.2006.12.035 2007
[47]

Zoom in: An introduction to circuits.Distill, 5(3), 2020

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits.Distill, 5(3), 2020

work page 2020
[48]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[49]

Conformal language modeling.arXiv preprint arXiv:2306.10193, 2023

Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling.arXiv preprint arXiv:2306.10193, 2023

work page arXiv 2023
[50]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992. Association for Computational Linguistics, 2019

work page 2019
[51]

Rondon, Ming Kawaguchi, and Ranjit Jhala

Patrick M. Rondon, Ming Kawaguchi, and Ranjit Jhala. Liquid types. InProceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 159–169. ACM, 2008

work page 2008
[52]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), volume 36. Curran Associates, Inc., 2023

work page 2023
[53]

Dependent types and multi-monadic effects in F*

Nikhil Swamy, Cătălin Hriţcu, Chantal Keller, Aseem Rastogi, Antoine Delignat- Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet, et al. Dependent types and multi-monadic effects in F*. InProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 256–270. ACM, 2016. 74

work page 2016
[54]

TheMathlib4mathematicslibraryforLean4,2025

TheMathlibCommunity. TheMathlib4mathematicslibraryforLean4,2025. Software repository, leanprover-community/mathlib4

work page 2025
[55]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’talwayssaywhattheythink: Unfaithfulexplanationsinchain-of-thoughtprompting. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), volume 36. Curran Associates, Inc., 2023

work page 2023
[56]

Seidel, Ranjit Jhala, Dimitrios Vytiniotis, and Simon Peyton-Jones

Niki Vazou, Eric L. Seidel, Ranjit Jhala, Dimitrios Vytiniotis, and Simon Peyton-Jones. Refinement types for Haskell. InProceedings of the 19th ACM SIGPLAN International Conference on Functional Programming (ICFP), pages 269–282. ACM, 2014

work page 2014
[57]

Lipschitz regularity of deep neural networks: Analysis and efficient estimation

Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. InAdvances in Neural Information Processing Systems 31 (NeurIPS 2018). Curran Associates, Inc., 2018

work page 2018
[58]

Springer, New York, 1st edition, 2005

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer.Algorithmic Learning in a Random World. Springer, New York, 1st edition, 2005

work page 2005
[59]

Frontiers Comput

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, et al. A survey on large language model based autonomous agents. FrontiersofComputerScience,18(6):186345,2024. doi: 10.1007/s11704-024-40231-1

work page doi:10.1007/s11704-024-40231-1 2024
[60]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InProceedings of the 11th International Conference on Learning Representations (ICLR), 2023

work page 2023
[61]

Chain-of-thoughtpromptingelicitsreasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, EdH.Chi,QuocV.Le,andDennyZhou. Chain-of-thoughtpromptingelicitsreasoning in large language models. InAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), volume 35, pages 24824–24837. Curran Associates, Inc., 2022

work page 2022
[62]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380. Association for Computational Linguistics, 2018

work page 2018
[63]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023

work page 2023
[64]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 75 A Notation Symbols used in the paper, in roughly the order they appear. Symbol Meaning Trust boundary and certificate validity ΩTrusted axiom set{prope...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Safereinforcementlearningviashielding

Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, andUfukTopcu. Safereinforcementlearningviashielding. InProceedingsof the AAAI Conference on Artificial Intelligence, pages 2669–2678. AAAI Press, 2018

work page 2018

[2] [2]

Angelopoulos and Stephen Bates

Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction.Foundations and Trends in Machine Learning, 16(4):494–591, 2023

work page 2023

[3] [3]

Formal certification ofcode-basedcryptographicproofs

Gilles Barthe, Benjamin Grégoire, and Santiago Zanella Béguelin. Formal certification ofcode-basedcryptographicproofs. InProceedingsofthe36thAnnualACMSIGPLAN- SIGACT Symposium on Principles of Programming Languages (POPL), pages 90–101. ACM, 2009

work page 2009

[4] [4]

Nuel D. Belnap. A useful four-valued logic. In J. Michael Dunn and George Epstein, editors,Modern Uses of Multiple-Valued Logic, pages 5–37. Reidel, Dordrecht, 1977

work page 1977

[5] [5]

Bohnet, V

Bernd Bohnet, Vinh Q. Tran, Pat Verga, Roee Aharoni, Daniel Andor, Livio Baldini Soares, Massimiliano Ciaramita, Jacob Eisenstein, et al. Attributed question answer- ing: Evaluation and modeling for attributed large language models.arXiv preprint arXiv:2212.08037, 2022

work page arXiv 2022

[6] [6]

OxfordUniversityPress,Oxford,1stedition, 2013

Stéphane Boucheron, Gábor Lugosi, and Pascal Massart.Concentration Inequalities: ANonasymptoticTheoryofIndependence. OxfordUniversityPress,Oxford,1stedition, 2013

work page 2013

[7] [7]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. InProceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 632–642. Association for Computational Linguistics, 2015

work page 2015

[8] [8]

Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, et al. Language models are few-shot learners. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), volume 33, pages 1877–1901. Curran Associates, Inc., 2020

work page 2020

[9] [9]

Lean4Lean: VerifyingatypecheckerforLean, inLean.arXivpreprint arXiv:2403.14064, 2024

MarioCarneiro. Lean4Lean: VerifyingatypecheckerforLean, inLean.arXivpreprint arXiv:2403.14064, 2024

work page arXiv 2024

[10] [10]

Daggitt, Omri Isac, Guy Katz, Verena Rieser, and Oliver Lemon

MarcoCasadio,TanviDinkar,EkaterinaKomendantskaya,LucaArnaboldi,MatthewL. Daggitt, Omri Isac, Guy Katz, Verena Rieser, and Oliver Lemon. NLP verification: To- wardsageneralmethodologyforcertifyingrobustness.arXivpreprintarXiv:2403.10144, 2024. 70

work page arXiv 2024

[11] [11]

Cohen, Elan Rosenfeld, and J

Jeremy M. Cohen, Elan Rosenfeld, and J. Zico Kolter. Certified adversarial robustness via randomized smoothing. InProceedings of the 36th International Conference on Machine Learning (ICML), pages 1310–1320. PMLR, 2019

work page 2019

[12] [12]

Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso

Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. Towards automated circuit discovery for mechanistic interpretability. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023). Curran Associates, Inc., 2023

work page 2023

[13] [13]

Abstract interpretation: A unified lattice model for staticanalysisofprogramsbyconstructionorapproximationoffixpoints

Patrick Cousot and Radhia Cousot. Abstract interpretation: A unified lattice model for staticanalysisofprogramsbyconstructionorapproximationoffixpoints. InConference Record of the Fourth ACM Symposium on Principles of Programming Languages (POPL), pages 238–252. ACM, 1977

work page 1977

[14] [14]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks,KeshaHietala,EleftheriosIoannidis,etal

Joseph W. Cutler, Craig Disselkoen, Aaron Eline, Shaobo He, Kyle Headley, Michael Hicks,KeshaHietala,EleftheriosIoannidis,etal. Cedar: Anewlanguageforexpressive, fast, safe, and analyzable authorization.Proceedings of the ACM on Programming Languages, 8(OOPSLA1):670–697, 2024

work page 2024

[16] [16]

AsurveyoftheprojectAUTOMATH

NicolaasGovertdeBruijn. AsurveyoftheprojectAUTOMATH. InJonathanP.Seldin and J. Roger Hindley, editors,To H. B. Curry: Essays on Combinatory Logic, Lambda Calculus and Formalism, pages 579–606. Academic Press, New York, 1980

work page 1980

[17] [17]

The lean 4 theorem prover and programming language

LeonardodeMouraandSebastianUllrich. TheLean4theoremproverandprogramming language. In André Platzer and Geoff Sutcliffe, editors,Automated Deduction – CADE 28, volume 12699 ofLecture Notes in Computer Science, pages 625–635, Cham, 2021. Springer. doi: 10.1007/978-3-030-79876-5_37

work page doi:10.1007/978-3-030-79876-5_37 2021

[18] [18]

Dijkstra.A Discipline of Programming

Edsger W. Dijkstra.A Discipline of Programming. Prentice-Hall, Englewood Cliffs, NJ, 1976

work page 1976

[19] [19]

A mathematical framework for transformer circuits.Transformer Circuits Thread, 2021

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, et al. A mathematical framework for transformer circuits.Transformer Circuits Thread, 2021

work page 2021

[20] [20]

What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.Transactions of the Association for Computational Linguistics, 8:34–48, 2020

Allyson Ettinger. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models.Transactions of the Association for Computational Linguistics, 8:34–48, 2020

work page 2020

[21] [21]

A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics.Advances in Mathematics, 370:107239, 2020

Tobias Fritz. A synthetic approach to Markov kernels, conditional independence and theorems on sufficient statistics.Advances in Mathematics, 370:107239, 2020. 71

work page 2020

[22] [22]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

AI2: Safety and robustness certification of neural networks with abstract interpretation

Timon Gehr, Matthew Mirman, Dana Drachsler-Cohen, Petar Tsankov, Swarat Chaud- huri, and Martin Vechev. AI2: Safety and robustness certification of neural networks with abstract interpretation. In2018 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE Computer Society, 2018

work page 2018

[24] [24]

Multivaluedlogics: Auniformapproachtoreasoninginartificial intelligence.Computational Intelligence, 4(3):265–316, 1988

MatthewL.Ginsberg. Multivaluedlogics: Auniformapproachtoreasoninginartificial intelligence.Computational Intelligence, 4(3):265–316, 1988

work page 1988

[25] [25]

Notwhatyou’vesignedupfor: Compromisingreal-worldLLM-integrated applications with indirect prompt injection

KaiGreshake,SaharAbdelnabi,ShaileshMishra,ChristophEndres,ThorstenHolz,and MarioFritz. Notwhatyou’vesignedupfor: Compromisingreal-worldLLM-integrated applications with indirect prompt injection. InProceedings of the 16th ACM Workshop on Artificial Intelligence and Security (AISec ’23), pages 79–90. ACM, 2023

work page 2023

[26] [26]

Cautious reinforcementlearningwithlogicalconstraints

Mohammadhosein Hasanbeig, Alessandro Abate, and Daniel Kroening. Cautious reinforcementlearningwithlogicalconstraints. InProceedingsofthe19thInternational Conference on Autonomous Agents and Multiagent Systems (AAMAS), pages 483–491. IFAAMAS, 2020

work page 2020

[27] [27]

C. A. R. Hoare. An axiomatic basis for computer programming.Communications of the ACM, 12(10):576–580, 1969. doi: 10.1145/363235.363259

work page doi:10.1145/363235.363259 1969

[28] [28]

Probability inequalities for sums of bounded random variables

Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963

work page 1963

[29] [29]

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, QianglongChen,WeihuaPeng,etal.Asurveyonhallucinationinlargelanguagemodels: Principles,taxonomy,challenges,andopenquestions.arXivpreprintarXiv:2311.05232, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [30]

Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38,

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Delong Chen, Wenliang Dai, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38,

work page

[31] [31]

doi: 10.1145/3571730

work page doi:10.1145/3571730

[32] [32]

Certified robustness to adversarial word substitutions

Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. Certified robustness to adversarial word substitutions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4129–4142. Association for Computational Linguistics, 2019

work page 2019

[33] [33]

Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu

Albert Q. Jiang, Sean Welleck, Jin Peng Zhou, Timothée Lacroix, Jiacheng Liu, Wenda Li, Mateja Jamnik, Guillaume Lample, and Yuhuai Wu. Draft, sketch, and prove: 72 Guiding formal theorem provers with informal proofs. InThe Eleventh International Conference on Learning Representations (ICLR), 2023

work page 2023

[34] [34]

Cliff B. Jones. Specification and design of (parallel) programs. InInformation Processing 83, Proceedings of the IFIP 9th World Computer Congress, pages 321–332. North-Holland, 1983

work page 1983

[35] [35]

Weakest precondition reasoning for expected runtimes of randomized algo- rithms.Journal of the ACM, 65(5):30:1–30:68, 2018

Benjamin Lucien Kaminski, Joost-Pieter Katoen, Christoph Matheja, and Federico Olmedo. Weakest precondition reasoning for expected runtimes of randomized algo- rithms.Journal of the ACM, 65(5):30:1–30:68, 2018

work page 2018

[36] [36]

Dense Passage Retrieval for Open-Domain Question Answering

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain ques- tionanswering. InProceedingsofthe2020ConferenceonEmpiricalMethodsinNatural Language Processing (EMNLP), pages 6769–6781. Association for Computational Linguistics, 2020. doi: 10.18653/v1/2020.emnlp...

work page doi:10.18653/v1/2020.emnlp-main.550 2020

[37] [37]

Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly

Nora Kassner and Hinrich Schütze. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. InProceedings of the 58th Annual MeetingoftheAssociationforComputationalLinguistics,pages7811–7818.Association for Computational Linguistics, 2020

work page 2020

[38] [38]

Formally Verified Patent Analysis via Dependent Type Theory: Machine-Checkable Certificates from a Hybrid AI + Lean 4 Pipeline

George Koomullil. Formally verified patent analysis via dependent type theory: Machine-checkable certificates from a hybrid AI + Lean 4 pipeline.arXiv preprint arXiv:2604.18882, April 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Semantics of probabilistic programs.Journal of Computer and System Sciences, 22(3):328–350, 1981

Dexter Kozen. Semantics of probabilistic programs.Journal of Computer and System Sciences, 22(3):328–350, 1981. doi: 10.1016/0022-0000(81)90036-2

work page doi:10.1016/0022-0000(81)90036-2 1981

[40] [40]

Measuring Faithfulness in Chain-of-Thought Reasoning

Tamera Lanham, Anna Chen, Ansh Radhakrishnan, Benoit Steiner, Carson Denison, Danny Hernandez, Dustin Li, Esin Durmus, et al. Measuring faithfulness in chain-of- thought reasoning.arXiv preprint arXiv:2307.13702, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[41] [41]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. InAdvances in Neural Information Processing Systems 33 (NeurIPS 2020), volume 33, pages 9459–9474. Cu...

work page 2020

[42] [42]

Let's Verify Step by Step

Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, et al. Let’s verify step by step.arXiv preprint arXiv:2305.20050, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 73 (EMNLP), pages 9004–9017. Association for Computational Linguistics, 2023. doi: 10.18653/v1/2023.emnlp-main.557

work page doi:10.18653/v1/2023.emnlp-main.557 2023

[44] [44]

Proceedings of the 2023

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 12076–12100. Association f...

work page doi:10.18653/v1/2023.emnlp-main.741 2023

[45] [45]

George C. Necula. Proof-carrying code. InProceedings of the 24th ACM SIGPLAN- SIGACTSymposiumonPrinciplesofProgrammingLanguages(POPL),pages106–119. ACM, 1997

work page 1997

[46] [46]

Peter W. O’Hearn. Resources, concurrency, and local reasoning.Theoretical Computer Science, 375(1–3):271–307, 2007. doi: 10.1016/j.tcs.2006.12.035

work page doi:10.1016/j.tcs.2006.12.035 2007

[47] [47]

Zoom in: An introduction to circuits.Distill, 5(3), 2020

Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits.Distill, 5(3), 2020

work page 2020

[48] [48]

Ignore Previous Prompt: Attack Techniques For Language Models

Fábio Perez and Ian Ribeiro. Ignore previous prompt: Attack techniques for language models.arXiv preprint arXiv:2211.09527, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[49] [49]

Conformal language modeling.arXiv preprint arXiv:2306.10193, 2023

Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling.arXiv preprint arXiv:2306.10193, 2023

work page arXiv 2023

[50] [50]

Sentence-BERT: Sentence embeddings using Siamese BERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992. Association for Computational Linguistics, 2019

work page 2019

[51] [51]

Rondon, Ming Kawaguchi, and Ranjit Jhala

Patrick M. Rondon, Ming Kawaguchi, and Ranjit Jhala. Liquid types. InProceedings of the 29th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 159–169. ACM, 2008

work page 2008

[52] [52]

Toolformer: Language models can teach themselves to use tools

Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), volume 36. Curran Associates, Inc., 2023

work page 2023

[53] [53]

Dependent types and multi-monadic effects in F*

Nikhil Swamy, Cătălin Hriţcu, Chantal Keller, Aseem Rastogi, Antoine Delignat- Lavaud, Simon Forest, Karthikeyan Bhargavan, Cédric Fournet, et al. Dependent types and multi-monadic effects in F*. InProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), pages 256–270. ACM, 2016. 74

work page 2016

[54] [54]

TheMathlib4mathematicslibraryforLean4,2025

TheMathlibCommunity. TheMathlib4mathematicslibraryforLean4,2025. Software repository, leanprover-community/mathlib4

work page 2025

[55] [55]

Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. Language models don’talwayssaywhattheythink: Unfaithfulexplanationsinchain-of-thoughtprompting. InAdvances in Neural Information Processing Systems 36 (NeurIPS 2023), volume 36. Curran Associates, Inc., 2023

work page 2023

[56] [56]

Seidel, Ranjit Jhala, Dimitrios Vytiniotis, and Simon Peyton-Jones

Niki Vazou, Eric L. Seidel, Ranjit Jhala, Dimitrios Vytiniotis, and Simon Peyton-Jones. Refinement types for Haskell. InProceedings of the 19th ACM SIGPLAN International Conference on Functional Programming (ICFP), pages 269–282. ACM, 2014

work page 2014

[57] [57]

Lipschitz regularity of deep neural networks: Analysis and efficient estimation

Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: Analysis and efficient estimation. InAdvances in Neural Information Processing Systems 31 (NeurIPS 2018). Curran Associates, Inc., 2018

work page 2018

[58] [58]

Springer, New York, 1st edition, 2005

Vladimir Vovk, Alexander Gammerman, and Glenn Shafer.Algorithmic Learning in a Random World. Springer, New York, 1st edition, 2005

work page 2005

[59] [59]

Frontiers Comput

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, et al. A survey on large language model based autonomous agents. FrontiersofComputerScience,18(6):186345,2024. doi: 10.1007/s11704-024-40231-1

work page doi:10.1007/s11704-024-40231-1 2024

[60] [60]

Le, Ed H

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. InProceedings of the 11th International Conference on Learning Representations (ICLR), 2023

work page 2023

[61] [61]

Chain-of-thoughtpromptingelicitsreasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, EdH.Chi,QuocV.Le,andDennyZhou. Chain-of-thoughtpromptingelicitsreasoning in large language models. InAdvances in Neural Information Processing Systems 35 (NeurIPS 2022), volume 35, pages 24824–24837. Curran Associates, Inc., 2022

work page 2022

[62] [62]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2369–2380. Association for Computational Linguistics, 2018

work page 2018

[63] [63]

ReAct: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023

work page 2023

[64] [64]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023. 75 A Notation Symbols used in the paper, in roughly the order they appear. Symbol Meaning Trust boundary and certificate validity ΩTrusted axiom set{prope...

work page internal anchor Pith review Pith/arXiv arXiv 2023