Recognition: unknown
TRUST: A Framework for Decentralized AI Service v.0.1
Pith reviewed 2026-05-07 08:15 UTC · model grok-4.3
The pith
TRUST decentralizes auditing of AI reasoning chains by breaking them into hierarchical graphs and aligning incentives so honest participants profit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TRUST decomposes Chain-of-Thought reasoning into five abstraction levels via Hierarchical Directed Acyclic Graphs for distributed auditing, projects multi-agent interactions into Causal Interaction Graphs through the DAAN protocol for deterministic root-cause attribution, and employs a multi-tier consensus with stake-weighted voting among computational checkers, LLM evaluators, and human experts to guarantee correctness under 30 percent adversarial participation. The Safety-Profitability Theorem ensures honest auditors profit while malicious actors incur losses, with all decisions recorded on-chain and privacy preserved by segmentation. Empirical tests show 72.4 percent accuracy, resilience against 20 percent corruption, 70 percent root-cause attribution, and 60 percent token savings.
What carries the argument
Hierarchical Directed Acyclic Graphs (HDAGs) for decomposing reasoning, the DAAN protocol for mapping to Causal Interaction Graphs, and the multi-tier stake-weighted consensus among checkers, evaluators, and experts.
Load-bearing premise
The multi-tier consensus with stake-weighted voting guarantees correctness under 30 percent adversarial participation and the DAAN protocol projects interactions to causal graphs without loss of information for root-cause attribution.
What would settle it
An experiment in which 35 percent of participants act adversarially and the system either drops below 60 percent accuracy or allows a malicious actor to profit, violating the Safety-Profitability Theorem.
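A minimal sketch of what that falsification run could look like, assuming a flat stake-weighted majority vote and a fixed per-audit reward and slash; every name and parameter below is illustrative, not taken from the paper.

# Toy falsification harness: 35% adversarial stake, then check whether accuracy
# stays above 60% and whether any adversarial auditor ends the run in profit.
# The vote, reward, and slash rules are assumptions made for illustration.
import random

def run_trial(n_auditors=100, adv_fraction=0.35, n_audits=1000,
              reward=1.0, slash=1.5, seed=0):
    rng = random.Random(seed)
    adversaries = set(rng.sample(range(n_auditors), int(adv_fraction * n_auditors)))
    stake = {i: 1.0 for i in range(n_auditors)}         # equal stake for simplicity
    profit = {i: 0.0 for i in range(n_auditors)}
    correct_decisions = 0

    for _ in range(n_audits):
        truth = rng.random() < 0.5                       # ground-truth verdict on a reasoning step
        votes = {i: (not truth if i in adversaries else truth) for i in range(n_auditors)}
        yes_stake = sum(stake[i] for i, v in votes.items() if v)
        decision = yes_stake > sum(stake.values()) / 2   # stake-weighted majority
        correct_decisions += int(decision == truth)
        for i, v in votes.items():                       # reward the winning side, slash the rest
            profit[i] += reward if v == decision else -slash

    accuracy = correct_decisions / n_audits
    adversary_in_profit = any(profit[i] > 0 for i in adversaries)
    return accuracy, adversary_in_profit

print(run_trial())                                       # e.g. (1.0, False) in this toy setting

In this naive model the 65 percent honest stake always outvotes the adversaries, so accuracy stays at 100 percent and every adversary ends up slashed; the test only becomes informative once correlated honest errors, collusion, and Sybil stake-splitting are modeled, which is exactly what the paper leaves unspecified.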
Original abstract
Large Reasoning Models (LRMs) and Multi-Agent Systems (MAS) in high-stakes domains demand reliable verification, yet centralized approaches suffer four limitations: (1) Robustness, with single points of failure vulnerable to attacks and bias; (2) Scalability, as reasoning complexity creates bottlenecks; (3) Opacity, as hidden auditing erodes trust; and (4) Privacy, as exposed reasoning traces risk model theft. We introduce TRUST (Transparent, Robust, and Unified Services for Trustworthy AI), a decentralized framework with three innovations: (i) Hierarchical Directed Acyclic Graphs (HDAGs) that decompose Chain-of-Thought reasoning into five abstraction levels for parallel distributed auditing; (ii) the DAAN protocol, which projects multi-agent interactions into Causal Interaction Graphs (CIGs) for deterministic root-cause attribution; and (iii) a multi-tier consensus mechanism among computational checkers, LLM evaluators, and human experts with stake-weighted voting that guarantees correctness under 30% adversarial participation. We prove a Safety-Profitability Theorem ensuring honest auditors profit while malicious actors incur losses. All decisions are recorded on-chain, while privacy-by-design segmentation prevents reconstruction of proprietary logic. Across multiple LLMs and benchmarks, TRUST attains 72.4% accuracy (4-18% above baselines) and remains resilient against 20% corruption. DAAN reaches 70% root-cause attribution (vs. 54-63% for standard methods) with 60% token savings. Human studies validate the design (F1 = 0.89, Brier = 0.074). The framework supports (A1) decentralized auditing, (A2) tamper-proof leaderboards, (A3) trustless data annotation, and (A4) governed autonomous agents, pioneering decentralized AI auditing for safe, accountable deployment of reasoning-capable systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TRUST, a decentralized framework for auditing Large Reasoning Models and Multi-Agent Systems. It proposes three main innovations: Hierarchical Directed Acyclic Graphs (HDAGs) to decompose Chain-of-Thought reasoning into five abstraction levels for distributed auditing, the DAAN protocol that projects multi-agent interactions into Causal Interaction Graphs (CIGs) for root-cause attribution, and a multi-tier consensus mechanism involving computational checkers, LLM evaluators, and human experts using stake-weighted voting. The authors assert a proved Safety-Profitability Theorem that ensures honest auditors profit while malicious actors lose, and report empirical results of 72.4% accuracy (4-18% above baselines), 70% root-cause attribution (vs. 54-63% for baselines), 60% token savings, and resilience to 20% corruption, with human validation metrics (F1=0.89, Brier=0.074). The framework is positioned to support decentralized auditing, tamper-proof leaderboards, trustless annotation, and governed agents.
Significance. If the Safety-Profitability Theorem were rigorously derived and the performance claims validated with reproducible experiments, the work would offer a novel approach to addressing robustness, scalability, opacity, and privacy issues in AI auditing through decentralization and on-chain recording. The combination of HDAG decomposition with CIG-based attribution and hybrid consensus could advance trustworthy AI deployment in high-stakes domains. However, the manuscript provides no formal derivations, experimental protocols, or data, so the potential impact cannot be assessed at present.
major comments (4)
- [Abstract] Abstract: The Safety-Profitability Theorem is asserted as proved, ensuring honest auditors profit under 30% adversarial participation, but the manuscript contains no proof sketch, lemmas, utility functions for participants, or analysis of false-positive/negative rates under collusion or sybil attacks. This is load-bearing for the central claim.
- [Abstract] Abstract: Performance claims (72.4% accuracy, 70% root-cause attribution, resilience against 20% corruption, 60% token savings) are stated without any experimental setup, datasets, baselines, number of runs, error bars, or statistical tests. These numbers cannot be evaluated or reproduced from the given text.
- [Framework Description] Framework / DAAN protocol: The claim that projection of multi-agent traces to Causal Interaction Graphs enables deterministic root-cause attribution without loss of information is presented as an axiom, but no formal argument, information-theoretic bound, or ablation study shows that the five-level HDAG decomposition preserves causal information.
- [Consensus Mechanism] Consensus mechanism: The multi-tier stake-weighted voting is said to guarantee correctness under 30% adversarial participation, yet there is no explicit adversary model, game-theoretic analysis, or bounds on detection rates when adversaries control up to 30% of stake.
minor comments (2)
- [Abstract] Abstract: The human studies are summarized only by aggregate metrics (F1 = 0.89, Brier = 0.074) with no details on study design, participant expertise, or task description.
- [Notation and Definitions] Notation: The definitions of HDAGs, the five abstraction levels, and CIGs would benefit from explicit mathematical notation or pseudocode to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review of our manuscript on the TRUST framework. We appreciate the recognition of its potential to address key challenges in decentralized AI auditing. We agree that the current version requires additional formal details and experimental documentation to substantiate the central claims. Below we respond point-by-point to the major comments and outline the revisions we will make.
Point-by-point responses
-
Referee: [Abstract] Abstract: The Safety-Profitability Theorem is asserted as proved, ensuring honest auditors profit under 30% adversarial participation, but the manuscript contains no proof sketch, lemmas, utility functions for participants, or analysis of false-positive/negative rates under collusion or sybil attacks. This is load-bearing for the central claim.
Authors: We acknowledge that the Safety-Profitability Theorem is central to the framework and that the current manuscript states the theorem and its high-level implications without including the full derivation. This omission stems from length constraints in the initial v0.1 submission. In the revised manuscript we will add a dedicated Theoretical Analysis section containing a complete proof sketch, the utility functions for honest and malicious auditors, supporting lemmas on incentive compatibility, and an explicit analysis of false-positive and false-negative rates under collusion and Sybil attacks, all within the 30% adversarial participation bound. revision: yes
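As a rough illustration of the shape such a section could take (the symbols r, s, epsilon, and p are introduced here as assumptions, not the paper's notation): with reward r for voting with the accepted verdict, slash s for voting against it, probability \(\varepsilon\) that consensus mis-scores an honest vote, and probability \(p\) that it catches a dishonest one,

\[
\mathbb{E}[U_{\mathrm{honest}}] = r\,(1-\varepsilon) - s\,\varepsilon,
\qquad
\mathbb{E}[U_{\mathrm{malicious}}] = r\,(1-p) - s\,p,
\]

so safety-profitability in this toy form reduces to \(\varepsilon < r/(r+s) < p\). The promised proof would then need to show that the consensus mechanism keeps \(\varepsilon\) below and \(p\) above that break-even point whenever adversarial stake stays under 30%, including under collusion and Sybil splitting.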
-
Referee: [Abstract] Abstract: Performance claims (72.4% accuracy, 70% root-cause attribution, resilience against 20% corruption, 60% token savings) are stated without any experimental setup, datasets, baselines, number of runs, error bars, or statistical tests. These numbers cannot be evaluated or reproduced from the given text.
Authors: We agree that the reported performance figures require full experimental documentation for reproducibility and evaluation. The numbers derive from our internal experiments across multiple LLMs and standard benchmarks, but the manuscript omits the protocol details. In the revision we will insert a comprehensive Experiments section that specifies the datasets, baselines, number of independent runs, error bars, and statistical tests. We will also release the associated code and data artifacts to enable independent verification. revision: yes
-
Referee: [Framework Description] Framework / DAAN protocol: The claim that projection of multi-agent traces to Causal Interaction Graphs enables deterministic root-cause attribution without loss of information is presented as an axiom, but no formal argument, information-theoretic bound, or ablation study shows that the five-level HDAG decomposition preserves causal information.
Authors: The DAAN protocol is designed so that the projection onto Causal Interaction Graphs preserves the necessary causal structure via the five-level HDAG decomposition. While the current text presents this property concisely, we accept that a formal justification is needed. We will expand the Framework Description section with a formal argument, including an information-theoretic bound demonstrating preservation of causal information, together with ablation studies that quantify the contribution of each HDAG level to root-cause attribution accuracy. revision: yes
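A toy version of the projection and the attribution check, under an assumed event schema (event id, agent, causal parents, fault flag); this is only a sketch of the property being claimed, not the DAAN protocol itself.

# Toy projection of a multi-agent trace onto a causal interaction graph (CIG),
# then a check that the injected fault's root cause is still uniquely attributable.
# The event and edge schema here is an assumption for illustration, not the DAAN format.
from collections import defaultdict

# Each event: (event_id, agent, parent_event_ids, faulty)
trace = [
    ("e1", "planner",  [],           False),
    ("e2", "coder",    ["e1"],       True),   # injected fault
    ("e3", "coder",    ["e2"],       False),
    ("e4", "reviewer", ["e1", "e3"], False),
    ("e5", "reviewer", ["e4"],       False),  # observed failure
]

def project_to_cig(trace):
    """Collapse event-level causality to agent-level causal interaction edges."""
    agent_of = {eid: agent for eid, agent, _, _ in trace}
    edges = defaultdict(set)
    for eid, agent, parents, _ in trace:
        for p in parents:
            if agent_of[p] != agent:
                edges[agent_of[p]].add(agent)
    return dict(edges)

def root_cause_agents(trace):
    """Agents owning faulty events that causally precede the final observed failure."""
    parents = {eid: set(ps) for eid, _, ps, _ in trace}
    ancestors, frontier = set(), {trace[-1][0]}
    while frontier:
        eid = frontier.pop()
        for p in parents[eid]:
            if p not in ancestors:
                ancestors.add(p)
                frontier.add(p)
    return {agent for eid, agent, _, faulty in trace if faulty and eid in ancestors}

print(project_to_cig(trace))     # e.g. {'planner': {'coder', 'reviewer'}, 'coder': {'reviewer'}}
print(root_cause_agents(trace))  # {'coder'}

The point of the toy is the final check: attribution is deterministic only when the set of faulty ancestors of the observed failure remains a singleton after projection, and the promised bound should say when the agent-level collapse preserves that property.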
-
Referee: [Consensus Mechanism] Consensus mechanism: The multi-tier stake-weighted voting is said to guarantee correctness under 30% adversarial participation, yet there is no explicit adversary model, game-theoretic analysis, or bounds on detection rates when adversaries control up to 30% of stake.
Authors: We recognize that the current description of the multi-tier consensus mechanism is high-level and lacks an explicit adversary model. In the revised manuscript we will augment the Consensus Mechanism section with a formal adversary model, a game-theoretic analysis of the stake-weighted voting incentives, and derived bounds on detection rates and overall correctness under adversarial control of up to 30% of the stake, building on Byzantine fault tolerance principles adapted to the hybrid checker-LLM-human setting. revision: yes
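For reference, the quorum-intersection argument the authors would presumably adapt, restated here for stake weights as an assumption rather than as the paper's construction: with total stake \(S\) and adversarial stake \(\alpha S\), accept a verdict only when it gathers at least \(\tfrac{2}{3}S\) of stake. Two conflicting \(\tfrac{2}{3}\)-quorums must then overlap in at least \(\tfrac{1}{3}S\) of stake, and

\[
\alpha < \tfrac{1}{3} \;\Longrightarrow\; \tfrac{1}{3}S > \alpha S,
\]

so the overlap contains honest stake, and an honest auditor never endorses two conflicting verdicts; the paper's 30% threshold sits just inside this bound. What the revision still has to supply is the extension to the three heterogeneous tiers (checkers, LLM evaluators, human experts), where stake, accuracy, and collusion behave differently in each tier.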
Circularity Check
Safety-Profitability Theorem reduces to the framework's own 30% adversarial consensus assumptions by construction
specific steps
-
self-definitional
[Abstract]
"We prove a Safety-Profitability Theorem ensuring honest auditors profit while malicious actors incur losses. ... a multi-tier consensus mechanism among computational checkers, LLM evaluators, and human experts with stake-weighted voting that guarantees correctness under 30% adversarial participation."
The theorem's conclusion (honest profit, malicious loss) is identical to the built-in 'guarantees correctness' property of the stake-weighted voting rule at the 30% threshold. The claimed proof therefore reduces to the framework's own design assumptions without additional mathematical content or external adversary analysis.
full rationale
The paper's load-bearing theoretical result is the Safety-Profitability Theorem, which is asserted to follow from the multi-tier consensus mechanism. However, the mechanism is defined to 'guarantee correctness under 30% adversarial participation' via stake-weighted voting, and the theorem simply restates that honest participants profit while malicious ones lose under that same guarantee. No independent game-theoretic model, utility functions, or bounding lemmas are supplied; the profitability outcome is therefore entailed directly by the design choice rather than derived. Experimental accuracy and attribution figures are internal evaluations of the same unverified mechanism and do not break the circularity. This matches the self-definitional pattern exactly.
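The reduction can be written in one line under an assumed reward-and-slash rule (r and s are illustrative symbols, not the paper's): let C be the axiom that the consensus verdict is correct whenever adversarial stake is at most 30%, and let the rule pay \(r > 0\) to auditors who voted with the verdict and slash \(s > 0\) from those who voted against it. Then

\[
C \;\Longrightarrow\; U_{\mathrm{honest}} = r > 0
\quad\text{and}\quad
U_{\mathrm{malicious}} = -s < 0,
\]

which is the statement of the Safety-Profitability Theorem; the implication uses nothing beyond the reward rule itself, so the theorem carries no content that is not already packed into C.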
Axiom & Free-Parameter Ledger
free parameters (2)
- 30% adversarial participation threshold
- stake-weighted voting parameters
axioms (2)
- domain assumption: Multi-tier consensus with stake-weighted voting guarantees correctness under up to 30% adversarial participation
- ad hoc to paper: Projection of multi-agent interactions into Causal Interaction Graphs enables deterministic root-cause attribution
invented entities (3)
- Hierarchical Directed Acyclic Graphs (HDAGs): no independent evidence
- DAAN protocol: no independent evidence
- Causal Interaction Graphs (CIGs): no independent evidence
Reference graph
Works this paper leans on
-
[1]
Artificial intelligence risk management framework (ai rmf 1.0)
NIST. Artificial Intelligence Risk Management Framework (AI RMF 1.0). NIST AI 100-1, 2023. URL: https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-1.pdf
2023
-
[2]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
arXiv 2022
-
[3]
On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pp. 610--623, 2021
2021
-
[4]
Autonomous Chemical Research with Large Language Models
Boiko, D. A., MacKnight, R., Kline, B., and Gomes, G. Autonomous chemical research with large language models. Nature, 624(7992): 570--578, 2023
2023
-
[5]
The foundation model transparency index
Bommasani, R., Klyman, K., Longpre, S., Kapoor, S., Maslej, N., Xiong, B., Zhang, D., and Liang, P. The foundation model transparency index. arXiv preprint arXiv:2310.12941, 2023
-
[6]
Extracting training data from large language models
Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, U., et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), pp. 2633--2650, 2021
2021
-
[7]
Practical byzantine fault tolerance
Castro, M., Liskov, B., et al. Practical byzantine fault tolerance. In OSDI, volume 99, pp. 173--186, 1999
1999
-
[8]
Why Do Multi-Agent LLM Systems Fail?
Cemri, M., Pan, M. Z., Yang, S., Agrawal, L. A., Chopra, B., Tiwari, R., Keutzer, K., Parameswaran, A., Klein, D., Ramchandran, K., Zaharia, M., Gonzalez, J. E., and Stoica, I. Why do multi-agent llm systems fail? arXiv preprint arXiv:2503.13657, 2025. URL https://arxiv.org/abs/2503.13657
arXiv 2025
-
[9]
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
Chalkidis, I., Jana, A., Hartung, D., Bommarito, M., Androutsopoulos, I., Katz, D. M., and Aletras, N. Lexglue: A benchmark dataset for legal language understanding in english. arXiv preprint arXiv:2110.00976, 2021
-
[10]
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Chan, C.-M., Chen, W., Su, Y., Yu, J., Xue, W., Zhang, S., Fu, J., and Liu, Z. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023
arXiv 2023
-
[11]
Humans or LLMs as the Judge? A Study on Judgement Biases
Chen, G. H., Chen, S., Liu, Z., Jiang, F., and Wang, B. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669, 2024
-
[12]
Laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts
COM, E. Laying down harmonised rules on artificial intelligence (artificial intelligence act) and amending certain union legislative acts. Proposal for a regulation of the European parliament and of the council, 2021
2021
-
[13]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. Improving factuality and reasoning in language models through multiagent debate. In Forty-first International Conference on Machine Learning, 2023
2023
-
[14]
A Survey on LLM-as-a-Judge
Gu, J., Jiang, X., Shi, Z., Tan, H., Zhai, X., Xu, C., Li, W., Shen, Y., Ma, S., Liu, H., Wang, S., Zhang, K., Wang, Y., Gao, W., Ni, L., and Guo, J. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024. URL https://arxiv.org/abs/2411.15594
arXiv 2024
-
[15]
Deepseek-r1 incentivizes reasoning in llms through reinforcement learning
Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature, 645(8081): 633--638, 2025
2025
-
[16]
D3: Dissecting multi-agent debate for LLM evaluation, 2024
Harrasse, A., Roch, R., Valentini, E., Bontempi, G., Schmidhuber, J., et al. D3: Dissecting multi-agent debate for LLM evaluation, 2024. URL https://arxiv.org/abs/2410.04663
-
[17]
MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework
Hong, S., Zhuge, M., Chen, J., Zheng, X., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., et al. Metagpt: Meta programming for a multi-agent collaborative framework. In The Twelfth International Conference on Learning Representations, 2023
2023
-
[18]
OpenAI o1 System Card
Jaech, A., Kalai, A., Lerer, A., Richardson, A., El-Kishky, A., Low, A., Helyar, A., Madry, A., Beutel, A., Carney, A., et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024
arXiv 2024
-
[19]
Fault localization using interventional causal learning for cloud-native applications
Jha, S., Rios, J., Abe, N., Bagehorn, F., and Shwartz, L. Fault localization using interventional causal learning for cloud-native applications. In 2024 54th Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S), pp. 141--147. IEEE, 2024
2024
-
[20]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. SWE-bench: Can language models resolve real-world GitHub issues? 2024. URL https://arxiv.org/abs/2310.06770
arXiv 2024
-
[21]
CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought Learning from In-Context Demonstrations
Kothapalli, V., Firooz, H., and Sanjabi, M. Cot-icl lab: A synthetic framework for studying chain-of-thought learning from in-context demonstrations. arXiv preprint arXiv:2502.15132, 2025
-
[22]
The byzantine generals problem
Lamport, L., Shostak, R., and Pease, M. The byzantine generals problem. In Concurrency: the Works of Leslie Lamport, pp. 203--226. 2019
2019
-
[23]
Measuring Faithfulness in Chain-of-Thought Reasoning
Lanham, T., Chen, A., Radhakrishnan, A., Steiner, B., Denison, C., Hernandez, D., Li, D., Durmus, E., Hubinger, E., Kernion, J., et al. Measuring faithfulness in chain-of-thought reasoning. arXiv preprint arXiv:2307.13702, 2023
arXiv 2023
-
[24]
Holistic Evaluation of Language Models
Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022
arXiv 2022
-
[25]
Let's verify step by step
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., Leike, J., Schulman, J., Sutskever, I., and Cobbe, K. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023
2023
-
[26]
Deconstructing Long Chain-of-Thought: A Structured Reasoning Optimization Framework for Long CoT Distillation
Luo, Y., Song, Y., Zhang, X., Liu, J., Wang, W., Chen, G., Su, W., and Zheng, B. Deconstructing long chain-of-thought: A structured reasoning optimization framework for long cot distillation. arXiv preprint arXiv:2503.16385, 2025
-
[27]
Model Cards for Model Reporting
Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I. D., and Gebru, T. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220--229, 2019
2019
-
[28]
Scalable Extraction of Training Data from (Production) Language Models
Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E., Tramèr, F., and Lee, K. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035, 2023
-
[29]
GPT-4 Technical Report
OpenAI. GPT-4 technical report, 2023. URL https://arxiv.org/abs/2303.08774
arXiv 2023
-
[30]
Llm evaluators recognize and favor their own generations
Panickssery, A., Bowman, S., and Feng, S. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems, 37: 68772--68802, 2024
2024
-
[31]
Ignore Previous Prompt: Attack Techniques For Language Models
Perez, F. and Ribeiro, I. Ignore previous prompt: Attack techniques for language models. arXiv preprint arXiv:2211.09527, 2022
arXiv 2022
-
[32]
Risk Analysis Techniques for Governed LLM-based Multi-Agent Systems
Reid, A., O'Callaghan, S., Carroll, L., and Caetano, T. Risk analysis techniques for governed llm-based multi-agent systems. arXiv preprint arXiv:2508.05687, 2025
-
[33]
Hierarchical Reasoning by Neural Circuits in the Frontal Cortex
Sarafyazd, M. and Jazayeri, M. Hierarchical reasoning by neural circuits in the frontal cortex. Science, 364(6441): eaav8911, 2019. doi:10.1126/science.aav8911. URL https://www.science.org/doi/abs/10.1126/science.aav8911
-
[34]
Toolformer: Language Models Can Teach Themselves to Use Tools
Schick, T., Dwivedi-Yu, J., Dessi, R., Raileanu, R., Lombrozo, T., Zettlemoyer, L., Liang, P., Hwang, J., Lai, C., Tsvetkov, Y., Ranzato, M., and Kim, Y. Toolformer: Language models can teach themselves to use tools, 2023. URL https://arxiv.org/abs/2302.04761
arXiv 2023
-
[35]
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
Shi, L., Ma, C., Liang, W., Diao, X., Ma, W., and Vosoughi, S. Judging the judges: A systematic study of position bias in LLM-as-a-judge. In Inui, K., Sakti, S., Wang, H., Wong, D. F., Bhattacharyya, P., Banerjee, B., Ekbal, A., Chakraborty, T., and Singh, D. P. (eds.), Proceedings of the 14th International Joint Conference on Natural Language Processing...
2025
-
[36]
Large Language Models Encode Clinical Knowledge
Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al. Large language models encode clinical knowledge. Nature, 620(7972): 172--180, 2023
2023
-
[37]
Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting
Turpin, M., Michael, J., Perez, E., and Bowman, S. Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36: 74952--74965, 2023
2023
-
[39]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35: 24824--24837, 2022
2022
-
[40]
Cloud Atlas: Efficient Fault Localization for Cloud Systems Using Language Models and Causal Insight
Xie, Z., Zheng, Y., Ottens, L., Zhang, K., Kozyrakis, C., and Mace, J. Cloud atlas: Efficient fault localization for cloud systems using language models and causal insight, 2024. URL https://arxiv.org/abs/2407.08694
-
[41]
Tree of thoughts: Deliberate problem solving with large language models
Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36: 11809--11822, 2023a
2023
-
[42]
React: Synergizing reasoning and acting in language models
Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023b
2023
-
[43]
Justice or prejudice? quantifying biases in llm-as-a-judge
Ye, J., Wang, Y., Huang, Y., Chen, D., Zhang, Q., Moniz, N., Gao, T., Geyer, W., Huang, C., Chen, P.-Y., et al. Justice or prejudice? quantifying biases in llm-as-a-judge. arXiv preprint arXiv:2410.02736, 2024
-
[44]
Judging llm-as-a-judge with mt-bench and chatbot arena
Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36: 46595--46623, 2023
2023
-
[45]
Universal and Transferable Adversarial Attacks on Aligned Language Models
Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023
arXiv 2023
discussion (0)