pith. machine review for the scientific record.

arxiv: 2605.12694 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.AI · cs.PL

Recognition: 2 theorem links · Lean Theorem

Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:07 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.PL
keywords: agentic interpretation · lattice-based analysis · LLM program reasoning · worklist algorithm · static analysis · third-party components · program verification
0 comments

The pith

LLM program analysis gains reliability by tracking localized claims as evolving elements in a finite-height lattice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes agentic interpretation as a way to make LLM-driven program reasoning more disciplined by borrowing techniques from lattice-based static analysis. It breaks high-level analysis goals into localized claims, records the LLM's judgment on each claim as a lattice element, and uses a worklist algorithm to refine those judgments over time. This replaces brittle one-shot whole-program queries with an explicit record of which conclusions are supported or disputed. The approach incorporates external information such as documentation and security advisories that fixed analyzers miss, and is demonstrated with a formal model plus an example involving opaque third-party components.
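To make the worklist mechanism concrete, here is a minimal sketch of how lattice-valued judgments over localized claims might be refined. It is an illustration under our own assumptions, not the paper's implementation; the Claim structure, its deps field, and the query_llm callback are hypothetical names.

```python
from dataclasses import dataclass, field

# Hypothetical per-claim assessment: a (support, refutation) pair, each graded
# on the chain BOT < WEAK < STRONG (cf. the AGraded lattice of Figure 4).
BOT, WEAK, STRONG = 0, 1, 2

def join(a, b):
    """Least upper bound: pointwise maximum of two assessments."""
    return (max(a[0], b[0]), max(a[1], b[1]))

@dataclass
class Claim:
    text: str                                   # a localized claim about the program
    value: tuple = (BOT, BOT)                   # current lattice element for this claim
    deps: list = field(default_factory=list)    # claims whose findings feed this one

def analyze(claims, query_llm):
    """Worklist refinement: re-judge a claim in the context of its dependencies,
    join the result into its current value, and requeue dependents on change."""
    worklist = list(claims)
    while worklist:
        claim = worklist.pop()
        context = [(d.text, d.value) for d in claim.deps]
        new_value = join(claim.value, query_llm(claim.text, context))
        if new_value != claim.value:            # value strictly increased in the lattice
            claim.value = new_value
            worklist.extend(c for c in claims if claim in c.deps)
    return {c.text: c.value for c in claims}
```

Because each update joins the new judgment with the previous value, a claim's value can only move upward in a finite-height lattice, which bounds the number of re-queries per claim and guarantees termination.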

Core claim

Agentic interpretation decomposes a high-level analysis goal into localized claims and tracks the LLM's judgment about each claim in a finite-height lattice; a worklist algorithm governs how claims and their judgments evolve during the analysis.

What carries the argument

Agentic interpretation: the framework that decomposes analysis goals into localized claims whose LLM judgments are recorded and refined as elements of a finite-height lattice under a worklist schedule.

If this is right

  • Whole-program LLM queries can be replaced by incremental refinement of smaller, checkable claims.
  • Analyses gain the ability to incorporate documentation, advisories, and informal contracts without losing traceability.
  • Different lattice heights and worklist policies become design choices that trade precision against query cost; a short sketch after this list illustrates the tradeoff.
  • The same machinery can be applied to code whose behavior depends on opaque third-party libraries.
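A hedged sketch of that tradeoff, using only the two domains the paper names (the two-bit evidence-presence lattice of Figure 6 and the graded ⊥ ⊑ w ⊑ s chain of Figure 4); the join functions and example values are ours:

```python
# Two assessment domains for a single claim, each a pair (support, refutation).

# Coarser: the two-bit evidence-presence lattice -- each bit records only
# whether any supporting / refuting evidence has been found at all.
def join_presence(a, b):
    return (a[0] or b[0], a[1] or b[1])

# Finer: the graded chain bot < weak < strong per component (Figure 4).
ORDER = {"bot": 0, "weak": 1, "strong": 2}
def join_graded(a, b):
    return tuple(max(x, y, key=ORDER.get) for x, y in zip(a, b))

# The graded domain keeps a distinction the presence domain collapses:
assert join_graded(("weak", "bot"), ("strong", "bot")) == ("strong", "bot")
assert join_presence((True, False), (True, False)) == (True, False)
# Weak and strong support are the same point in the presence lattice, but
# distinct refinement steps in the graded one -- more precision per claim
# at the cost of more LLM queries before the worklist stabilizes.
```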

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The lattice discipline may also serve as a verification layer that flags when an LLM's answers become inconsistent across related claims.
  • Hybrid tools could run conventional static analyzers on the parts of the lattice they can resolve and delegate only the remaining claims to the LLM.
  • Empirical tests could measure how often the worklist terminates on real-world codebases and whether final lattice values match manual review.

Load-bearing premise

LLM judgments on localized claims can be structured and evolved meaningfully inside a finite-height lattice.

What would settle it

A concrete run in which repeated LLM queries on related claims produce judgments that never converge to a single lattice element or that violate the lattice ordering under the worklist steps.
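A minimal sketch of how such a falsifying run could be detected from a logged trace of lattice values for one claim; the trace format and the leq ordering test are assumptions, not part of the paper:

```python
def leq(a, b):
    """Pointwise order on (support, refutation) pairs of comparable grades."""
    return a[0] <= b[0] and a[1] <= b[1]

def check_trace(trace, max_steps=100):
    """trace: successive lattice values recorded for one claim across worklist steps.
    Flags the two failure modes that would undermine the load-bearing premise."""
    for prev, curr in zip(trace, trace[1:]):
        if not leq(prev, curr):
            return "ordering violated: a judgment moved down or sideways in the lattice"
    if len(trace) > max_steps and trace[-1] != trace[-2]:
        return "no convergence within the step budget"
    return "trace is monotone and converged"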

Figures

Figures reproduced from arXiv: 2605.12694 by Chao Wang, Jacqueline L. Mitchell.

Figure 1. Five target categories for agentic interpretation.
Figure 2. A handler with two opaque third-party components. The source of parseJSON and verifySignature is unavailable to the analyzer; the source of handleRequest and processPayload is available.
Figure 3. Evaluation graph and claims for the motivating example. Context edges flow from component claims to the cross-component claim (cC) and from cC and cR to cG. Feedback edges let a refuted cross-component claim focus later searches about opaque components. The table summarizes the claims associated with each node. Hashed nodes indicate opaque components.
Figure 4. The AGraded lattice. Each component takes values from {⊥, w, s}, ordered ⊥ ⊑ w ⊑ s. Joins are pointwise maxima. A state such as ⟨w, s⟩ records weak support and strong refutation; it does not mean the claim is both true and false.
Figure 5. Workflow of agentic interpretation.
Figure 6. Alternative assessment domains.
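Figure 4's caption is concrete enough to transcribe; the sketch below is our reading of the AGraded lattice, with joins as pointwise maxima over ⊥ ⊑ w ⊑ s (the Grade and Assessment names are ours, not the paper's):

```python
from enum import IntEnum

class Grade(IntEnum):
    BOT = 0     # ⊥ : no evidence yet
    WEAK = 1    # w : weak evidence
    STRONG = 2  # s : strong evidence

class Assessment:
    """A pair ⟨support, refutation⟩; ⟨w, s⟩ means weak support and strong
    refutation have both been observed, not that the claim is true and false."""
    def __init__(self, support=Grade.BOT, refutation=Grade.BOT):
        self.support, self.refutation = support, refutation

    def join(self, other):
        """Least upper bound: pointwise maximum of the two components."""
        return Assessment(max(self.support, other.support),
                          max(self.refutation, other.refutation))

    def __le__(self, other):    # the lattice order ⊑, pointwise
        return self.support <= other.support and self.refutation <= other.refutation

    def __repr__(self):
        return f"⟨{self.support.name.lower()}, {self.refutation.name.lower()}⟩"

# ⟨w, ⊥⟩ ⊔ ⟨⊥, s⟩ = ⟨w, s⟩ : weak support and strong refutation coexist.
print(Assessment(Grade.WEAK, Grade.BOT).join(Assessment(Grade.BOT, Grade.STRONG)))
```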
original abstract

Large language models can consult information that fixed static analyzers cannot, such as documentation, current security advisories, version-specific metadata, and informal API contracts. This makes LLMs a compelling option for program analyses that depend on information beyond the source program, or that are otherwise not amenable to conventional static analyzers. However, directly asking an LLM for a one-shot whole-program analysis is brittle because it compresses many evidence-dependent judgments into a single opaque answer, rather than exposing which conclusions are supported or disputed and using intermediate findings to guide later, more focused searches. In this paper, we propose agentic interpretation, a framework that brings the discipline of lattice-based static analysis to LLM-driven program reasoning. At a high level, agentic interpretation decomposes a high-level analysis goal into localized claims, and tracks the LLM's judgment about each claim in a finite-height lattice. A worklist algorithm governs how claims and their judgments evolve during the analysis. We introduce a formal model of agentic interpretation, explore the design space it opens, and illustrate the approach with a worked example analyzing code that depends on opaque third-party components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes agentic interpretation, a framework that applies lattice-based static analysis discipline to LLM-driven program reasoning. High-level analysis goals are decomposed into localized claims; LLM judgments on each claim are tracked as elements of a finite-height lattice; and a worklist algorithm governs how claims and judgments evolve. The paper supplies a formal model of the framework, explores the resulting design space, and illustrates the approach via a worked example that analyzes code depending on opaque third-party components.

Significance. If the framework can be implemented and validated, it would offer a principled way to make LLM-based program analysis more structured, transparent, and monotonic than direct one-shot queries. The explicit formal model together with the concrete worked example provides a reusable foundation that future empirical studies could build upon, particularly for analyses that must incorporate external information such as documentation or security advisories.

major comments (1)
  1. [§3, Formal Model] The projection of raw LLM responses onto lattice elements is described only schematically; without an explicit mapping function or soundness argument, it is unclear whether the claimed monotonicity of the worklist algorithm is preserved under realistic LLM output variability.
minor comments (2)
  1. The manuscript would benefit from citing the foundational Cousot & Cousot papers on abstract interpretation to clarify how the proposed lattice discipline relates to existing static-analysis theory.
  2. [worked example] In the worked example, the sequence of LLM queries and the exact lattice updates they induce should be tabulated so that readers can reproduce the evolution steps.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the single major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [§3, Formal Model] The projection of raw LLM responses onto lattice elements is described only schematically; without an explicit mapping function or soundness argument, it is unclear whether the claimed monotonicity of the worklist algorithm is preserved under realistic LLM output variability.

    Authors: We agree that the projection step in §3 is presented schematically. The formal model defines the lattice and worklist algorithm under the assumption that LLM outputs are mapped to lattice elements in an order-preserving manner, but does not supply an explicit mapping or a dedicated soundness argument. In the revised version we will add a concrete mapping function (parameterized by output parsing rules such as keyword extraction or confidence thresholding) together with a short lemma establishing that the worklist algorithm preserves monotonicity whenever the chosen mapping is monotonic with respect to the lattice order. This will make the dependence on LLM output variability explicit while preserving the framework's generality. revision: yes
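To illustrate what such a mapping might look like, a confidence-thresholding projection is monotone by construction: raising the reported confidence can only push a component upward in the ⊥ ⊑ w ⊑ s chain, so joining successive projections never decreases a claim's lattice value. This is a sketch under our own assumptions (the verdict labels and thresholds are illustrative), not the authors' promised revision:

```python
BOT, WEAK, STRONG = 0, 1, 2

def project(verdict, confidence, weak_cut=0.4, strong_cut=0.8):
    """Map a parsed LLM response (verdict in {'supports', 'refutes'}, confidence
    in [0, 1]) to a lattice element (support, refutation). Thresholds are illustrative."""
    grade = STRONG if confidence >= strong_cut else WEAK if confidence >= weak_cut else BOT
    return (grade, BOT) if verdict == "supports" else (BOT, grade)

# Monotone in confidence: raising confidence never lowers either component,
# so a worklist that joins successive projections stays monotone as well.
assert project("supports", 0.5) == (WEAK, BOT)
assert project("supports", 0.9) == (STRONG, BOT)
assert project("refutes", 0.3) == (BOT, BOT)
```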

Circularity Check

0 steps flagged

No significant circularity; framework is a definitional construction

full rationale

The manuscript introduces agentic interpretation as an explicit formal model that decomposes analysis goals into localized claims tracked in a finite-height lattice, governed by a worklist algorithm. This structure is presented as a modeling choice that imports standard lattice theory from static analysis rather than deriving new results from fitted data or self-citations. No equations or central claims reduce to their own inputs by construction, and the load-bearing assumption (that judgments can be lattice-structured) is treated as definitional within the proposed framework. The paper supplies an independent formalization plus a concrete worked example, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Relies on domain assumptions about LLM reliability and lattice applicability.

axioms (1)
  • domain assumption: an LLM can provide judgments on localized claims that fit a lattice structure
    Core to tracking evidence evolution.
invented entities (1)
  • Agentic interpretation (no independent evidence)
    purpose: Structured LLM program reasoning
    Newly introduced framework.

pith-pipeline@v0.9.0 · 19697 in / 798 out tokens · 203844 ms · 2026-05-14T20:07:33.308099+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1] Alchourrón, C.E., Gärdenfors, P., Makinson, D.: On the logic of theory change: Partial meet contraction and revision functions. J. Symb. Log. 50(2), 510–530 (1985). https://doi.org/10.2307/2274239

  2. [2] Allen, B.P., Chhikara, P., Ferguson, T.M., Ilievski, F., Groth, P.: Sound and complete neurosymbolic reasoning with LLM-grounded interpretations. In: Gilpin, L.H., Giunchiglia, E., Hitzler, P., van Krieken, E. (eds.) Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning (NeSy 2025), 8-10 September 2025, Santa Cruz, CA, ...

  3. [3] Bembenek, A.: Current practices for building LLM-powered reasoning tools are ad hoc-and we can do better. CoRR abs/2507.05886 (2025). https://doi.org/10.48550/arXiv.2507.05886

  4. [4] Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., Hoefler, T.: Graph of thoughts: Solving elaborate problems with large language models. In: Wooldridge, M.J., Dy, J.G., Natarajan, S. (eds.) Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-S...

  5. [5] Bourdoncle, F.: Efficient chaotic iteration strategies with widenings. In: Bjørner, D., Broy, M., Pottosin, I.V. (eds.) Formal Methods in Programming and Their Applications, International Conference, Akademgorodok, Novosibirsk, Russia, June 28 - July 2, 1993, Proceedings. pp. 128–141. Lecture Notes in Computer Science, Springer (1993). https://doi.org/10....

  6. [6] Cai, Y., Hou, Z., Sanán, D., Luan, X., Lin, Y., Sun, J., Dong, J.S.: Automated program refinement: Guide and verify code large language model with refinement calculus. Proc. ACM Program. Lang. 9(POPL), 2057–2089 (2025). https://doi.org/10.1145/3704905

  7. [7] Chapman, P.J., Rubio-González, C., Thakur, A.V.: Interleaving static analysis and LLM prompting with applications to error specification inference. Int. J. Softw. Tools Technol. Transf. 27(2), 239–254 (2025). https://doi.org/10.1007/s10009-025-00780-7

  8. [8] Chechik, M., Gurfinkel, A., Devereux, B., Lai, A.Y.C., Easterbrook, S.M.: Data structures for symbolic multi-valued model-checking. Formal Methods Syst. Des. 29(3), 295–344 (2006). https://doi.org/10.1007/s10703-006-0016-z

  9. [9] Chen, T., Lu, S., Lu, S., Gong, Y., Yang, C., Li, X., Misu, M.R.H., Yu, H., Duan, N., Cheng, P., Yang, F., Lahiri, S.K., Xie, T., Zhou, L.: Automated proof generation for Rust code via self-evolution. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net (2025), https://openrev...

  10. [10] Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Graham, R.M., Harrison, M.A., Sethi, R. (eds.) Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977. p...

  11. [11] Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Aho, A.V., Zilles, S.N., Rosen, B.K. (eds.) Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, January 1979. pp. 269–282. ACM Press (1979). https://doi.org/10.1145/567752.567778

  12. [12] Doyle, J.: A truth maintenance system. Artif. Intell. 12(3), 231–272 (1979). https://doi.org/10.1016/0004-3702(79)90008-0

  13. [13] Du, Y., Tian, M., Ronanki, S., Rongali, S., Bodapati, S.B., Galstyan, A., Wells, A., Schwartz, R., Huerta, E.A., Peng, H.: Context length alone hurts LLM performance despite perfect retrieval. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025...

  14. [14] Endres, M., Fakhoury, S., Chakraborty, S., Lahiri, S.K.: Can large language models transform natural language intent into formal method postconditions? Proc. ACM Softw. Eng. 1(FSE), 1889–1912 (2024). https://doi.org/10.1145/3660791

  15. [15] First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repair with large language models. In: Chandra, S., Blincoe, K., Tonella, P. (eds.) Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, ...

  16. [16] Fitting, M.: Bilattices and the semantics of logic programming. J. Log. Program. 11(1&2), 91–116 (1991). https://doi.org/10.1016/0743-1066(91)90014-G

  17. [17] Hao, Y., Chen, W., Zhou, Z., Cui, W.: E&V: Prompting large language models to perform static analysis by pseudo-code execution and verification. CoRR abs/2312.08477 (2023). https://doi.org/10.48550/arXiv.2312.08477

  18. [18] Iranmanesh, M., Sabet, S.M., Marefat, S., Ghasr, A.J., Wilson, A., Sharafaldin, I., Tayebi, M.A.: ZeroFalse: Improving precision in static analysis with LLMs. CoRR abs/2510.02534 (2025). https://doi.org/10.48550/arXiv.2510.02534

  19. [19] Jenkins, B.: AGM-bench: Do large language models revise beliefs rationally? In: ICLR 2026 Workshop on Logical Reasoning of Large Language Models (2026), https://openreview.net/forum?id=2s1BujG84C

  20. [20] Kildall, G.A.: A unified approach to global program optimization. In: Fischer, P.C., Ullman, J.D. (eds.) Conference Record of the ACM Symposium on Principles of Programming Languages, Boston, Massachusetts, USA, October

  21. [21] pp. 194–206. ACM Press (1973). https://doi.org/10.1145/512927.512945

  22. [22] de Kleer, J.: An assumption-based TMS. Artif. Intell. 28(2), 127–162 (1986). https://doi.org/10.1016/0004-3702(86)90080-9

  23. [23] Li, H., Hao, Y., Zhai, Y., Qian, Z.: Enhancing static analysis for practical bug detection: An LLM-integrated approach. Proc. ACM Program. Lang. 8(OOPSLA1), 474–499 (2024). https://doi.org/10.1145/3649828

  24. [24] Li, J., Cao, P., Chen, Y., Xu, J., Li, H., Jiang, X., Liu, K., Zhao, J.: Towards better chain-of-thought: A reflection on effectiveness and faithfulness. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1,

  25. [25] pp. 10747–10765. Findings of ACL, Association for Computational Linguistics (2025), https://aclanthology.org/2025.findings-acl.560/

  26. [26] Li, J., Yan, H., He, Y.: Drift: Enhancing LLM faithfulness in rationale generation via dual-reward probabilistic inference. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025. pp. 6850–6866. Ass...

  27. [27] Li, Z., Dutta, S., Naik, M.: IRIS: LLM-assisted static analysis for detecting security vulnerabilities. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net (2025), https://openreview.net/forum?id=9LdJDU7E91

  28. [28] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638

  29. [29] Liu, P., Sun, C., Zheng, Y., Feng, X., Qin, C., Wang, Y., Li, Z., Sun, L.: Harnessing the power of LLM to support binary taint analysis. CoRR abs/2310.08275 (2023). https://doi.org/10.48550/arXiv.2310.08275

  30. [30] Lou, J., Sun, Y.: Anchoring bias in large language models: an experimental study. J. Comput. Soc. Sci. 9(1), 11 (2026). https://doi.org/10.1007/s42001-025-00435-2

  31. [31] O'Leary, D.E.: An anchoring effect in large language models. IEEE Intell. Syst. 40(2), 23–26 (2025). https://doi.org/10.1109/MIS.2025.3544939

  32. [32] Pei, K., Bieber, D., Shi, K., Sutton, C., Yin, P.: Can large language models reason about program invariants? In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. pp. 27496–27520. Proceedings of Machine Learning Research, P...

  33. [33] Richter, C., Wehrheim, H.: Beyond postconditions: Can large language models infer formal contracts for automatic software verification? CoRR abs/2510.12702 (2025). https://doi.org/10.48550/arXiv.2510.12702

  34. [34] Sagiv, S., Reps, T.W., Wilhelm, R.: Parametric shape analysis via 3-valued logic. ACM Trans. Program. Lang. Syst. 24(3), 217–298 (2002). https://doi.org/10.1145/514188.514190

  35. [35] Stechly, K., Valmeekam, K., Kambhampati, S.: On the self-verification limitations of large language models on reasoning and planning tasks. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=4O0v4s3IzY

  36. [36] Tian, R., Li, Y., Fu, Y., Deng, S., Luo, Q., Qian, C., Wang, S., Cong, X., Zhang, Z., Wu, Y., Lin, Y., Wang, H., Liu, X.: Distance between relevant information pieces causes bias in long-context LLMs. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 -...

  37. [37] Turpin, M., Michael, J., Perez, E., Bowman, S.R.: Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Ne...

  38. [38] Tutek, M., Chaleshtori, F.H., Marasovic, A., Belinkov, Y.: Measuring chain of thought faithfulness by unlearning reasoning steps. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025. pp. 9935–9960. Asso...

  39. [39] Ugare, S., Chandra, S.: Agentic code reasoning. CoRR abs/2603.01896 (2026). https://doi.org/10.48550/arXiv.2603.01896

  40. [40] Valencia-Clavijo, F.: Anchors in the machine: Behavioral and attributional evidence of anchoring bias in LLMs. CoRR abs/2511.05766 (2025). https://doi.org/10.48550/arXiv.2511.05766

  41. [42] Wang, C., Gao, Y., Zhang, W., Liu, X., Shi, Q., Zhang, X.: LLMSA: A compositional neuro-symbolic approach to compilation-free and customizable static analysis. CoRR abs/2412.14399 (2024). https://doi.org/10.48550/arXiv.2412.14399

  42. [43] Wang, M., Pei, K., Solar-Lezama, A.: ABSINT-AI: Language models for abstract interpretation. In: ICLR 2025 Workshop: VerifAI: AI Verification in the Wild (2025), https://openreview.net/forum?id=3RP6YmKo59

  43. [44] Wang, Z., Lin, T., Chen, M., Yang, M., Li, H., Yi, X., Qin, S., Yin, J.: Preguss: It analyzes, it specifies, it verifies. CoRR abs/2508.14532 (2025). https://doi.org/10.48550/arXiv.2508.14532

  44. [45] Wei, A., Suresh, T., Sun, T., Wu, H., Wang, K., Aiken, A.: Quokka: Accelerating program verification with LLMs via invariant synthesis. In: ICLR 2026 Workshop: VerifAI-2: The Second Workshop on AI Verification in the Wild (2026), https://openreview.net/forum?id=nq03VvBki2

  45. [46] Wu, H., Barrett, C.W., Narodytska, N.: Lemur: Integrating large language models in automated program verification. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net (2024), https://openreview.net/forum?id=Q3YaCghZNt

  46. [47] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net (2023), https://openreview.net/forum?id=WE_vluYUL-X

  47. [48] Yu, Y., Jiang, H., Luo, X., Wu, Q., Lin, C., Li, D., Yang, Y., Huang, Y., Qiu, L.: Mitigate position bias in LLMs via scaling a single hidden states channel. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1,

  48. [49] pp. 6092–6111. Findings of ACL, Association for Computational Linguistics (2025), https://aclanthology.org/2025.findings-acl.316/