pith. machine review for the scientific record.

arxiv: 2605.12694 · v1 · submitted 2026-05-12 · 💻 cs.SE · cs.AI · cs.PL

Recognition: 2 theorem links · Lean Theorem

Agentic Interpretation: Lattice-Structured Evidence for LLM-Based Program Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:07 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.PL
keywords: agentic interpretation · lattice-based analysis · LLM program reasoning · worklist algorithm · static analysis · third-party components · program verification
0 comments

The pith

LLM program analysis gains reliability by tracking localized claims as evolving elements in a finite-height lattice.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes agentic interpretation as a way to make LLM-driven program reasoning more disciplined by borrowing techniques from lattice-based static analysis. It breaks high-level analysis goals into localized claims, records the LLM's judgment on each claim as a lattice element, and uses a worklist algorithm to refine those judgments over time. This replaces brittle one-shot whole-program queries with an explicit record of which conclusions are supported or disputed. The approach incorporates external information such as documentation and security advisories that fixed analyzers miss, and is demonstrated with a formal model plus an example involving opaque third-party components.
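To make the worklist mechanism concrete, here is a minimal sketch of how lattice-valued judgments over localized claims might be refined. It is an illustration under our own assumptions, not the paper's implementation; the Claim structure, its deps field, and the query_llm callback are hypothetical names.

```python
from dataclasses import dataclass, field

# Hypothetical per-claim assessment: a (support, refutation) pair, each graded
# on the chain BOT < WEAK < STRONG (cf. the AGraded lattice of Figure 4).
BOT, WEAK, STRONG = 0, 1, 2

def join(a, b):
    """Least upper bound: pointwise maximum of two assessments."""
    return (max(a[0], b[0]), max(a[1], b[1]))

@dataclass
class Claim:
    text: str                                   # a localized claim about the program
    value: tuple = (BOT, BOT)                   # current lattice element for this claim
    deps: list = field(default_factory=list)    # claims whose findings feed this one

def analyze(claims, query_llm):
    """Worklist refinement: re-judge a claim in the context of its dependencies,
    join the result into its current value, and requeue dependents on change."""
    worklist = list(claims)
    while worklist:
        claim = worklist.pop()
        context = [(d.text, d.value) for d in claim.deps]
        new_value = join(claim.value, query_llm(claim.text, context))
        if new_value != claim.value:            # value strictly increased in the lattice
            claim.value = new_value
            worklist.extend(c for c in claims if claim in c.deps)
    return {c.text: c.value for c in claims}
```

Because each update joins the new judgment with the previous value, a claim's value can only move upward in a finite-height lattice, which bounds the number of re-queries per claim and guarantees termination.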

Core claim

Agentic interpretation decomposes a high-level analysis goal into localized claims and tracks the LLM's judgment about each claim in a finite-height lattice; a worklist algorithm governs how claims and their judgments evolve during the analysis.

What carries the argument

Agentic interpretation: the framework that decomposes analysis goals into localized claims whose LLM judgments are recorded and refined as elements of a finite-height lattice under a worklist schedule.

If this is right

  • Whole-program LLM queries can be replaced by incremental refinement of smaller, checkable claims.
  • Analyses gain the ability to incorporate documentation, advisories, and informal contracts without losing traceability.
  • Different lattice heights and worklist policies become design choices that trade precision against query cost; a short sketch after this list illustrates the tradeoff.
  • The same machinery can be applied to code whose behavior depends on opaque third-party libraries.
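A hedged sketch of that tradeoff, using only the two domains the paper names (the two-bit evidence-presence lattice of Figure 6 and the graded ⊥ ⊑ w ⊑ s chain of Figure 4); the join functions and example values are ours:

```python
# Two assessment domains for a single claim, each a pair (support, refutation).

# Coarser: the two-bit evidence-presence lattice -- each bit records only
# whether any supporting / refuting evidence has been found at all.
def join_presence(a, b):
    return (a[0] or b[0], a[1] or b[1])

# Finer: the graded chain bot < weak < strong per component (Figure 4).
ORDER = {"bot": 0, "weak": 1, "strong": 2}
def join_graded(a, b):
    return tuple(max(x, y, key=ORDER.get) for x, y in zip(a, b))

# The graded domain keeps a distinction the presence domain collapses:
assert join_graded(("weak", "bot"), ("strong", "bot")) == ("strong", "bot")
assert join_presence((True, False), (True, False)) == (True, False)
# Weak and strong support are the same point in the presence lattice, but
# distinct refinement steps in the graded one -- more precision per claim
# at the cost of more LLM queries before the worklist stabilizes.
```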

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The lattice discipline may also serve as a verification layer that flags when an LLM's answers become inconsistent across related claims.
  • Hybrid tools could run conventional static analyzers on the parts of the lattice they can resolve and delegate only the remaining claims to the LLM.
  • Empirical tests could measure how often the worklist terminates on real-world codebases and whether final lattice values match manual review.

Load-bearing premise

LLM judgments on localized claims can be structured and evolved meaningfully inside a finite-height lattice.

What would settle it

A concrete run in which repeated LLM queries on related claims produce judgments that never converge to a single lattice element or that violate the lattice ordering under the worklist steps.
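A minimal sketch of how such a falsifying run could be detected from a logged trace of lattice values for one claim; the trace format and the leq ordering test are assumptions, not part of the paper:

```python
def leq(a, b):
    """Pointwise order on (support, refutation) pairs of comparable grades."""
    return a[0] <= b[0] and a[1] <= b[1]

def check_trace(trace, max_steps=100):
    """trace: successive lattice values recorded for one claim across worklist steps.
    Flags the two failure modes that would undermine the load-bearing premise."""
    for prev, curr in zip(trace, trace[1:]):
        if not leq(prev, curr):
            return "ordering violated: a judgment moved down or sideways in the lattice"
    if len(trace) > max_steps and trace[-1] != trace[-2]:
        return "no convergence within the step budget"
    return "trace is monotone and converged"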

Figures

Figures reproduced from arXiv: 2605.12694 by Chao Wang, Jacqueline L. Mitchell.

Figure 1. Five target categories for agentic interpretation.
Figure 2. A handler with two opaque third-party components. The source of parseJSON and verifySignature is unavailable to the analyzer; the source of handleRequest and processPayload is available.
Figure 3. Evaluation graph and claims for the motivating example. Context edges flow from component claims to the cross-component claim (cC) and from cC and cR to cG. Feedback edges let a refuted cross-component claim focus later searches about opaque components. The table summarizes the claims associated with each node. Hashed nodes indicate opaque components.
Figure 4. The AGraded lattice. Each component takes values from {⊥, w, s}, ordered ⊥ ⊑ w ⊑ s. Joins are pointwise maxima. A state such as ⟨w, s⟩ records weak support and strong refutation; it does not mean the claim is both true and false.
Figure 5. Workflow of agentic interpretation.
Figure 6. Alternative assessment domains.
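Figure 4's caption is concrete enough to transcribe; the sketch below is our reading of the AGraded lattice, with joins as pointwise maxima over ⊥ ⊑ w ⊑ s (the Grade and Assessment names are ours, not the paper's):

```python
from enum import IntEnum

class Grade(IntEnum):
    BOT = 0     # ⊥ : no evidence yet
    WEAK = 1    # w : weak evidence
    STRONG = 2  # s : strong evidence

class Assessment:
    """A pair ⟨support, refutation⟩; ⟨w, s⟩ means weak support and strong
    refutation have both been observed, not that the claim is true and false."""
    def __init__(self, support=Grade.BOT, refutation=Grade.BOT):
        self.support, self.refutation = support, refutation

    def join(self, other):
        """Least upper bound: pointwise maximum of the two components."""
        return Assessment(max(self.support, other.support),
                          max(self.refutation, other.refutation))

    def __le__(self, other):    # the lattice order ⊑, pointwise
        return self.support <= other.support and self.refutation <= other.refutation

    def __repr__(self):
        return f"⟨{self.support.name.lower()}, {self.refutation.name.lower()}⟩"

# ⟨w, ⊥⟩ ⊔ ⟨⊥, s⟩ = ⟨w, s⟩ : weak support and strong refutation coexist.
print(Assessment(Grade.WEAK, Grade.BOT).join(Assessment(Grade.BOT, Grade.STRONG)))
```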
original abstract

Large language models can consult information that fixed static analyzers cannot, such as documentation, current security advisories, version-specific metadata, and informal API contracts. This makes LLMs a compelling option for program analyses that depend on information beyond the source program, or that are otherwise not amenable to conventional static analyzers. However, directly asking an LLM for a one-shot whole-program analysis is brittle because it compresses many evidence-dependent judgments into a single opaque answer, rather than exposing which conclusions are supported or disputed and using intermediate findings to guide later, more focused searches. In this paper, we propose agentic interpretation, a framework that brings the discipline of lattice-based static analysis to LLM-driven program reasoning. At a high level, agentic interpretation decomposes a high-level analysis goal into localized claims, and tracks the LLM's judgment about each claim in a finite-height lattice. A worklist algorithm governs how claims and their judgments evolve during the analysis. We introduce a formal model of agentic interpretation, explore the design space it opens, and illustrate the approach with a worked example analyzing code that depends on opaque third-party components.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes agentic interpretation, a framework that applies lattice-based static analysis discipline to LLM-driven program reasoning. High-level analysis goals are decomposed into localized claims; LLM judgments on each claim are tracked as elements of a finite-height lattice; and a worklist algorithm governs how claims and judgments evolve. The paper supplies a formal model of the framework, explores the resulting design space, and illustrates the approach via a worked example that analyzes code depending on opaque third-party components.

Significance. If the framework can be implemented and validated, it would offer a principled way to make LLM-based program analysis more structured, transparent, and monotonic than direct one-shot queries. The explicit formal model together with the concrete worked example provides a reusable foundation that future empirical studies could build upon, particularly for analyses that must incorporate external information such as documentation or security advisories.

major comments (1)
  1. [§3, Formal Model] The projection of raw LLM responses onto lattice elements is described only schematically; without an explicit mapping function or soundness argument, it is unclear whether the claimed monotonicity of the worklist algorithm is preserved under realistic LLM output variability.
minor comments (2)
  1. The manuscript would benefit from citing the foundational Cousot & Cousot papers on abstract interpretation to clarify how the proposed lattice discipline relates to existing static-analysis theory.
  2. [worked example] In the worked example, the sequence of LLM queries and the exact lattice updates they induce should be tabulated so that readers can reproduce the evolution steps.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive recommendation. We address the single major comment below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [§3, Formal Model] The projection of raw LLM responses onto lattice elements is described only schematically; without an explicit mapping function or soundness argument, it is unclear whether the claimed monotonicity of the worklist algorithm is preserved under realistic LLM output variability.

    Authors: We agree that the projection step in §3 is presented schematically. The formal model defines the lattice and worklist algorithm under the assumption that LLM outputs are mapped to lattice elements in an order-preserving manner, but does not supply an explicit mapping or a dedicated soundness argument. In the revised version we will add a concrete mapping function (parameterized by output parsing rules such as keyword extraction or confidence thresholding) together with a short lemma establishing that the worklist algorithm preserves monotonicity whenever the chosen mapping is monotonic with respect to the lattice order. This will make the dependence on LLM output variability explicit while preserving the framework's generality. revision: yes
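To illustrate what such a mapping might look like, a confidence-thresholding projection is monotone by construction: raising the reported confidence can only push a component upward in the ⊥ ⊑ w ⊑ s chain, so joining successive projections never decreases a claim's lattice value. This is a sketch under our own assumptions (the verdict labels and thresholds are illustrative), not the authors' promised revision:

```python
BOT, WEAK, STRONG = 0, 1, 2

def project(verdict, confidence, weak_cut=0.4, strong_cut=0.8):
    """Map a parsed LLM response (verdict in {'supports', 'refutes'}, confidence
    in [0, 1]) to a lattice element (support, refutation). Thresholds are illustrative."""
    grade = STRONG if confidence >= strong_cut else WEAK if confidence >= weak_cut else BOT
    return (grade, BOT) if verdict == "supports" else (BOT, grade)

# Monotone in confidence: raising confidence never lowers either component,
# so a worklist that joins successive projections stays monotone as well.
assert project("supports", 0.5) == (WEAK, BOT)
assert project("supports", 0.9) == (STRONG, BOT)
assert project("refutes", 0.3) == (BOT, BOT)
```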

Circularity Check

0 steps flagged

No significant circularity; framework is a definitional construction

full rationale

The manuscript introduces agentic interpretation as an explicit formal model that decomposes analysis goals into localized claims tracked in a finite-height lattice, governed by a worklist algorithm. This structure is presented as a modeling choice that imports standard lattice theory from static analysis rather than deriving new results from fitted data or self-citations. No equations or central claims reduce to their own inputs by construction, and the load-bearing assumption (that judgments can be lattice-structured) is treated as definitional within the proposed framework. The paper supplies an independent formalization plus a concrete worked example, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Relies on domain assumptions about LLM reliability and lattice applicability.

axioms (1)
  • domain assumption: an LLM can provide judgments on localized claims that fit a lattice structure
    Core to tracking evidence evolution.
invented entities (1)
  • Agentic interpretation (no independent evidence)
    purpose: Structured LLM program reasoning
    Newly introduced framework.

pith-pipeline@v0.9.0 · 19697 in / 798 out tokens · 203844 ms · 2026-05-14T20:07:33.308099+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 29 canonical work pages · 2 internal anchors

  1. [1] Alchourrón, C.E., Gärdenfors, P., Makinson, D.: On the logic of theory change: Partial meet contraction and revision functions. J. Symb. Log. 50(2), 510–530 (1985). https://doi.org/10.2307/2274239

  2. [2] Allen, B.P., Chhikara, P., Ferguson, T.M., Ilievski, F., Groth, P.: Sound and complete neurosymbolic reasoning with LLM-grounded interpretations. In: Gilpin, L.H., Giunchiglia, E., Hitzler, P., van Krieken, E. (eds.) Proceedings of The 19th International Conference on Neurosymbolic Learning and Reasoning (NeSy 2025), 8-10 September 2025, Santa Cruz, CA, ...

  3. [3] Bembenek, A.: Current practices for building LLM-powered reasoning tools are ad hoc-and we can do better. CoRR abs/2507.05886 (2025). https://doi.org/10.48550/arXiv.2507.05886

  4. [4] Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Podstawski, M., Gianinazzi, L., Gajda, J., Lehmann, T., Niewiadomski, H., Nyczyk, P., Hoefler, T.: Graph of thoughts: Solving elaborate problems with large language models. In: Wooldridge, M.J., Dy, J.G., Natarajan, S. (eds.) Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-S...

  5. [5] Bourdoncle, F.: Efficient chaotic iteration strategies with widenings. In: Bjørner, D., Broy, M., Pottosin, I.V. (eds.) Formal Methods in Programming and Their Applications, International Conference, Akademgorodok, Novosibirsk, Russia, June 28 - July 2, 1993, Proceedings. pp. 128–141. Lecture Notes in Computer Science, Springer (1993). https://doi.org/10....

  6. [6] Cai, Y., Hou, Z., Sanán, D., Luan, X., Lin, Y., Sun, J., Dong, J.S.: Automated program refinement: Guide and verify code large language model with refinement calculus. Proc. ACM Program. Lang. 9(POPL), 2057–2089 (2025). https://doi.org/10.1145/3704905

  7. [7] Chapman, P.J., Rubio-González, C., Thakur, A.V.: Interleaving static analysis and LLM prompting with applications to error specification inference. Int. J. Softw. Tools Technol. Transf. 27(2), 239–254 (2025). https://doi.org/10.1007/s10009-025-00780-7

  8. [8] Chechik, M., Gurfinkel, A., Devereux, B., Lai, A.Y.C., Easterbrook, S.M.: Data structures for symbolic multi-valued model-checking. Formal Methods Syst. Des. 29(3), 295–344 (2006). https://doi.org/10.1007/s10703-006-0016-z

  9. [9] Chen, T., Lu, S., Lu, S., Gong, Y., Yang, C., Li, X., Misu, M.R.H., Yu, H., Duan, N., Cheng, P., Yang, F., Lahiri, S.K., Xie, T., Zhou, L.: Automated proof generation for Rust code via self-evolution. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net (2025), https://openrev...

  10. [10] Cousot, P., Cousot, R.: Abstract interpretation: A unified lattice model for static analysis of programs by construction or approximation of fixpoints. In: Graham, R.M., Harrison, M.A., Sethi, R. (eds.) Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977. p...

  11. [11] Cousot, P., Cousot, R.: Systematic design of program analysis frameworks. In: Aho, A.V., Zilles, S.N., Rosen, B.K. (eds.) Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, January 1979. pp. 269–282. ACM Press (1979). https://doi.org/10.1145/567752.567778

  12. [12] Doyle, J.: A truth maintenance system. Artif. Intell. 12(3), 231–272 (1979). https://doi.org/10.1016/0004-3702(79)90008-0

  13. [13] Du, Y., Tian, M., Ronanki, S., Rongali, S., Bodapati, S.B., Galstyan, A., Wells, A., Schwartz, R., Huerta, E.A., Peng, H.: Context length alone hurts LLM performance despite perfect retrieval. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China, November 4-9, 2025...

  14. [14] Endres, M., Fakhoury, S., Chakraborty, S., Lahiri, S.K.: Can large language models transform natural language intent into formal method postconditions? Proc. ACM Softw. Eng. 1(FSE), 1889–1912 (2024). https://doi.org/10.1145/3660791

  15. [15] First, E., Rabe, M.N., Ringer, T., Brun, Y.: Baldur: Whole-proof generation and repair with large language models. In: Chandra, S., Blincoe, K., Tonella, P. (eds.) Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, ...

  16. [16] Fitting, M.: Bilattices and the semantics of logic programming. J. Log. Program. 11(1&2), 91–116 (1991). https://doi.org/10.1016/0743-1066(91)90014-G

  17. [17] Hao, Y., Chen, W., Zhou, Z., Cui, W.: E&V: Prompting large language models to perform static analysis by pseudo-code execution and verification. CoRR abs/2312.08477 (2023). https://doi.org/10.48550/arXiv.2312.08477

  18. [18] Iranmanesh, M., Sabet, S.M., Marefat, S., Ghasr, A.J., Wilson, A., Sharafaldin, I., Tayebi, M.A.: ZeroFalse: Improving precision in static analysis with LLMs. CoRR abs/2510.02534 (2025). https://doi.org/10.48550/arXiv.2510.02534

  19. [19] Jenkins, B.: AGM-bench: Do large language models revise beliefs rationally? In: ICLR 2026 Workshop on Logical Reasoning of Large Language Models (2026), https://openreview.net/forum?id=2s1BujG84C

  20. [20] Kildall, G.A.: A unified approach to global program optimization. In: Fischer, P.C., Ullman, J.D. (eds.) Conference Record of the ACM Symposium on Principles of Programming Languages, Boston, Massachusetts, USA, October

  21. [21] pp. 194–206. ACM Press (1973). https://doi.org/10.1145/512927.512945

  22. [22] de Kleer, J.: An assumption-based TMS. Artif. Intell. 28(2), 127–162 (1986). https://doi.org/10.1016/0004-3702(86)90080-9

  23. [23] Li, H., Hao, Y., Zhai, Y., Qian, Z.: Enhancing static analysis for practical bug detection: An LLM-integrated approach. Proc. ACM Program. Lang. 8(OOPSLA1), 474–499 (2024). https://doi.org/10.1145/3649828

  24. [24] Li, J., Cao, P., Chen, Y., Xu, J., Li, H., Jiang, X., Liu, K., Zhao, J.: Towards better chain-of-thought: A reflection on effectiveness and faithfulness. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1,

  25. [25] pp. 10747–10765. Findings of ACL, Association for Computational Linguistics (2025), https://aclanthology.org/2025.findings-acl.560/

  26. [26] Li, J., Yan, H., He, Y.: Drift: Enhancing LLM faithfulness in rationale generation via dual-reward probabilistic inference. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2025, Vienna, Austria, July 27 - August 1, 2025. pp. 6850–6866. Ass...

  27. [27] Li, Z., Dutta, S., Naik, M.: IRIS: LLM-assisted static analysis for detecting security vulnerabilities. In: The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net (2025), https://openreview.net/forum?id=9LdJDU7E91

  28. [28] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., Liang, P.: Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics 12, 157–173 (2024). https://doi.org/10.1162/tacl_a_00638

  29. [29] Liu, P., Sun, C., Zheng, Y., Feng, X., Qin, C., Wang, Y., Li, Z., Sun, L.: Harnessing the power of LLM to support binary taint analysis. CoRR abs/2310.08275 (2023). https://doi.org/10.48550/arXiv.2310.08275

  30. [30] Lou, J., Sun, Y.: Anchoring bias in large language models: an experimental study. J. Comput. Soc. Sci. 9(1), 11 (2026). https://doi.org/10.1007/s42001-025-00435-2

  31. [31] O'Leary, D.E.: An anchoring effect in large language models. IEEE Intell. Syst. 40(2), 23–26 (2025). https://doi.org/10.1109/MIS.2025.3544939

  32. [32] Pei, K., Bieber, D., Shi, K., Sutton, C., Yin, P.: Can large language models reason about program invariants? In: Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., Scarlett, J. (eds.) International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. pp. 27496–27520. Proceedings of Machine Learning Research, P...

  33. [33] Richter, C., Wehrheim, H.: Beyond postconditions: Can large language models infer formal contracts for automatic software verification? CoRR abs/2510.12702 (2025). https://doi.org/10.48550/arXiv.2510.12702

  34. [34] Sagiv, S., Reps, T.W., Wilhelm, R.: Parametric shape analysis via 3-valued logic. ACM Trans. Program. Lang. Syst. 24(3), 217–298 (2002). https://doi.org/10.1145/514188.514190

  35. [35] Stechly, K., Valmeekam, K., Kambhampati, S.: On the self-verification limitations of large language models on reasoning and planning tasks. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=4O0v4s3IzY

  36. [36] Tian, R., Li, Y., Fu, Y., Deng, S., Luo, Q., Qian, C., Wang, S., Cong, X., Zhang, Z., Wu, Y., Lin, Y., Wang, H., Liu, X.: Distance between relevant information pieces causes bias in long-context LLMs. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 -...

  37. [37] Turpin, M., Michael, J., Perez, E., Bowman, S.R.: Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. In: Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., Levine, S. (eds.) Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Ne...

  38. [38] Tutek, M., Chaleshtori, F.H., Marasovic, A., Belinkov, Y.: Measuring chain of thought faithfulness by unlearning reasoning steps. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, November 4-9, 2025. pp. 9935–9960. Asso...

  39. [39] Ugare, S., Chandra, S.: Agentic code reasoning. CoRR abs/2603.01896 (2026). https://doi.org/10.48550/arXiv.2603.01896

  40. [40] Valencia-Clavijo, F.: Anchors in the machine: Behavioral and attributional evidence of anchoring bias in LLMs. CoRR abs/2511.05766 (2025). https://doi.org/10.48550/arXiv.2511.05766

  41. [42] Wang, C., Gao, Y., Zhang, W., Liu, X., Shi, Q., Zhang, X.: LLMSA: A compositional neuro-symbolic approach to compilation-free and customizable static analysis. CoRR abs/2412.14399 (2024). https://doi.org/10.48550/arXiv.2412.14399

  42. [43] Wang, M., Pei, K., Solar-Lezama, A.: ABSINT-AI: Language models for abstract interpretation. In: ICLR 2025 Workshop: VerifAI: AI Verification in the Wild (2025), https://openreview.net/forum?id=3RP6YmKo59

  43. [44] Wang, Z., Lin, T., Chen, M., Yang, M., Li, H., Yi, X., Qin, S., Yin, J.: Preguss: It analyzes, it specifies, it verifies. CoRR abs/2508.14532 (2025). https://doi.org/10.48550/arXiv.2508.14532

  44. [45] Wei, A., Suresh, T., Sun, T., Wu, H., Wang, K., Aiken, A.: Quokka: Accelerating program verification with LLMs via invariant synthesis. In: ICLR 2026 Workshop: VerifAI-2: The Second Workshop on AI Verification in the Wild (2026), https://openreview.net/forum?id=nq03VvBki2

  45. [46] Wu, H., Barrett, C.W., Narodytska, N.: Lemur: Integrating large language models in automated program verification. In: The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net (2024), https://openreview.net/forum?id=Q3YaCghZNt

  46. [47] Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: ReAct: Synergizing reasoning and acting in language models. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net (2023), https://openreview.net/forum?id=WE_vluYUL-X

  47. [48] Yu, Y., Jiang, H., Luo, X., Wu, Q., Lin, C., Li, D., Yang, Y., Huang, Y., Qiu, L.: Mitigate position bias in LLMs via scaling a single hidden states channel. In: Che, W., Nabende, J., Shutova, E., Pilehvar, M.T. (eds.) Findings of the Association for Computational Linguistics, ACL 2025, Vienna, Austria, July 27 - August 1,

  48. [49] pp. 6092–6111. Findings of ACL, Association for Computational Linguistics (2025), https://aclanthology.org/2025.findings-acl.316/