pith. machine review for the scientific record.

arxiv: 2512.03053 · v2 · submitted 2025-11-25 · 💻 cs.LG · cs.AI · cs.AR · cs.PL

Recognition: no theorem link

Mitigating hallucinations and omissions in LLMs for invertible problems: An application to hardware logic design automation

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 05:19 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.AR · cs.PL
keywords large language models · hallucinations · hardware description language · logic condition tables · network-on-chip router · invertible problems · lossless encoding · design automation

The pith

Using LLMs for round-trip encoding and decoding on invertible problems detects hallucinations and omissions in hardware logic generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that for invertible problems transforming data between domains, such as logic condition tables to hardware description language, LLMs can serve as lossless encoders and decoders. By generating the destination from the source and then reconstructing the source from the destination, direct comparison to the original input reveals any hallucinations or omissions. This method was tested on generating full HDL code for a network-on-chip router using multiple LLMs. A sympathetic reader cares because it offers a way to harness LLMs for complex, precise tasks in hardware design while automatically verifying outputs and even spotting upstream specification issues.
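The round-trip check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: `encode_to_hdl` and `decode_to_lct` are hypothetical stand-ins for the two LLM calls, and the LCT is modeled as a list of (condition, action) rows.

```python
def normalize(lct):
    """Canonicalize an LCT (list of (condition, action) rows) so that row
    order and incidental whitespace do not cause spurious mismatches."""
    return sorted(tuple(field.strip() for field in row) for row in lct)

def round_trip_check(original_lct, encode_to_hdl, decode_to_lct):
    """Encode source -> destination, decode back, and diff against the
    original. Returns (ok, diff), where diff separates dropped rows
    (omissions) from invented rows (hallucinations)."""
    hdl = encode_to_hdl(original_lct)       # LLM pass 1: LCT -> HDL
    reconstructed = decode_to_lct(hdl)      # LLM pass 2: HDL -> LCT
    a = set(normalize(original_lct))
    b = set(normalize(reconstructed))
    diff = {
        "omissions": sorted(a - b),         # rows missing after the round trip
        "hallucinations": sorted(b - a),    # rows that appeared from nowhere
    }
    return not diff["omissions"] and not diff["hallucinations"], diff
```

Any non-empty diff flags the generated HDL (or the specification itself) for human review, which is the paper's verification loop in miniature.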

Core claim

For invertible problems that transform data from a source domain (for example, Logic Condition Tables) to a destination domain (for example, Hardware Description Language code), using Large Language Models as a lossless encoder from source to destination followed by a lossless decoder back to the source, comparable to lossless compression in information theory, can mitigate most of the LLM drawbacks of hallucinations and omissions. Using LCTs as inputs, the full HDL for a two-dimensional network-on-chip router is generated using seven different LLMs, the LCTs are reconstructed from the auto-generated HDL, and the original and reconstructed LCTs are compared. This yields significant productivity improvements, not only confirming correctly generated logic and detecting incorrectly generated logic but also assisting developers in finding design specification errors.

What carries the argument

The lossless round-trip encoding and decoding with an LLM for source-to-destination and back, enabling comparison to the original for error detection.

If this is right

  • Confirms correctly generated LLM logic for the hardware design.
  • Detects incorrectly generated LLM logic through mismatches in reconstruction.
  • Assists developers in identifying errors in the original design specifications.
  • Delivers significant productivity improvements in automating hardware logic design.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could apply to other invertible domains such as translating between different programming languages or data formats.
  • Combining this verification with existing formal methods might create hybrid LLM-assisted design workflows.
  • Further tests on larger designs would show how well current LLMs handle the round-trip fidelity at scale.

Load-bearing premise

The source-to-destination transformation must be invertible and lossless, allowing accurate reconstruction and comparison to detect hallucinations or omissions.
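A toy instance makes this premise concrete. In the hypothetical sketch below, the mapping from LCT rows to Verilog-style case arms is deterministic and exactly invertible, so the decoder recovers the table verbatim; the paper's encoders and decoders are LLMs, for which this property must be checked rather than assumed.

```python
def lct_to_hdl(rows):
    """Render (selector_value, action) rows as a Verilog-style case block."""
    arms = "\n".join(f"    {cond}: out = {act};" for cond, act in rows)
    return "  case (sel)\n" + arms + "\n  endcase"

def hdl_to_lct(hdl):
    """Recover the rows from a case block emitted by lct_to_hdl."""
    rows = []
    for line in hdl.splitlines():
        line = line.strip()
        if ":" in line and line.endswith(";"):
            cond, rhs = line.split(":", 1)
            rows.append((cond.strip(), rhs.strip()[len("out = "):-1]))
    return rows
```

Because `hdl_to_lct(lct_to_hdl(rows)) == rows` holds for every table in this toy format, any mismatch after a real round trip can only come from the transform itself, which is exactly the load-bearing property.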

What would settle it

A case where known hallucinations in the generated HDL are not detected by differences in the reconstructed LCTs compared to the original, or where correct generations show mismatches due to imperfect invertibility.

Figures

Figures reproduced from arXiv: 2512.03053 by Andrew S. Cassidy, Bernard Brezzo, Dharmendra S. Modha, Guillaume Garreau, Jay Sivagnaname, John V. Arthur, Mike Grassi.

Figure 1: View of an LLM as an invertible transform.
Figure 2: 2D NoC Router Design: (Right) Two-dimensional array of cores; each core contains a router (RTR) and a processing …
Original abstract

We show for invertible problems that transform data from a source domain (for example, Logic Condition Tables (LCTs)) to a destination domain (for example, Hardware Description Language (HDL) code), an approach of using Large Language Models (LLMs) as a lossless encoder from source to destination followed by a lossless decoder back to the source, comparable to lossless compression in information theory, can mitigate most of the LLM drawbacks of hallucinations and omissions. Specifically, using LCTs as inputs, we generate the full HDL for a two-dimensional network-on-chip router (13 units, 1500-2000 lines of code) using seven different LLMs, reconstruct the LCTs from the auto-generated HDL, and compare the original and reconstructed LCTs. This approach yields significant productivity improvements, not only confirming correctly generated LLM logic and detecting incorrectly generated LLM logic but also assisting developers in finding design specification errors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that for invertible problems involving transformations between domains, such as from Logic Condition Tables (LCTs) to Hardware Description Language (HDL) code, LLMs can be used as a lossless encoder followed by a lossless decoder back to the source domain. This round-trip approach, analogous to lossless compression, mitigates hallucinations and omissions by comparing the original LCT with the one reconstructed from the LLM-generated HDL. The method is demonstrated on the generation of HDL for a two-dimensional network-on-chip router consisting of 13 units and 1500-2000 lines of code, using seven different LLMs. The comparison serves to confirm correct generations, detect incorrect ones, and assist in identifying design specification errors, thereby improving productivity in hardware logic design automation.

Significance. If the central claim holds, this work could have notable significance in the field of LLM applications for automated design, particularly in hardware logic where full formal verification may be resource-intensive. By leveraging the invertibility of the problem to create an internal verification loop, it provides a practical tool for developers to validate LLM outputs and catch both model errors and specification issues. The approach's strength lies in its potential to be parameter-free and generalizable to other invertible tasks, though its impact would be amplified by reproducible experiments and quantitative benchmarks.

major comments (2)
  1. [Abstract] The abstract asserts 'significant productivity improvements' and the ability to confirm and detect logic without providing any specific metrics, error rates, success rates, or detailed results from the experiments with the seven LLMs. This lack of quantitative evidence makes it challenging to assess whether the data supports the claims of mitigation.
  2. [Proposed Approach] The soundness of the method depends on the decoder LLM being effectively lossless when reconstructing the LCT from HDL. However, the manuscript does not establish this independently; since the decoder is subject to the same limitations as the encoder, reconstruction errors could lead to false mismatches on correct HDL or allow incorrect HDL to reconstruct correctly in compensating cases. This is a load-bearing assumption for the claim that discrepancies reliably indicate hallucinations or omissions in the HDL generation step.
minor comments (1)
  1. [Abstract] The description of the router as '13 units, 1500-2000 lines of code' could be clarified with more precise details on the design complexity or references to standard benchmarks.
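Major comment 2 (compensating errors) can be illustrated with a toy sketch. The stub functions below are illustrative only: an encoder that omits a row, paired with a decoder that hallucinates the same row back, makes the round trip pass even though the generated artifact is defective.

```python
def round_trip_passes(lct, encode, decode):
    # The check passes when decoding the generated artifact reproduces the source.
    return decode(encode(lct)) == lct

def bad_encode(table):
    # Defective encoder: silently omits row_b from the generated artifact.
    return [row for row in table if row != "row_b"]

def compensating_decode(artifact):
    # Defective decoder whose hallucination exactly masks the encoder's omission.
    return list(artifact) + ["row_b"]
```

With a faithful decoder (`lambda a: list(a)`) the omission is exposed; with the compensating decoder it is hidden, which is why the report asks for the decoder's losslessness to be established independently.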

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below and have made revisions to improve the clarity and support for our claims.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts 'significant productivity improvements' and the ability to confirm and detect logic without providing any specific metrics, error rates, success rates, or detailed results from the experiments with the seven LLMs. This lack of quantitative evidence makes it challenging to assess whether the data supports the claims of mitigation.

    Authors: We agree that the abstract would be strengthened by including quantitative results. The full manuscript reports experimental outcomes across the seven LLMs on the 13-unit NoC router design, including rates at which round-trip comparisons correctly flagged generation issues and cases where the method assisted in identifying specification errors. We have revised the abstract to include specific metrics such as overall detection accuracy and observed reductions in manual review effort. revision: yes

  2. Referee: [Proposed Approach] The soundness of the method depends on the decoder LLM being effectively lossless when reconstructing the LCT from HDL. However, the manuscript does not establish this independently; since the decoder is subject to the same limitations as the encoder, reconstruction errors could lead to false mismatches on correct HDL or allow incorrect HDL to reconstruct correctly in compensating cases. This is a load-bearing assumption for the claim that discrepancies reliably indicate hallucinations or omissions in the HDL generation step.

    Authors: We acknowledge this is a substantive concern about the independence of the verification step. The manuscript relies on the invertibility of the LCT-HDL mapping and presents empirical results from the case study showing that discrepancies aligned with actual errors upon manual inspection. To address the point directly, we have added a dedicated discussion subsection that examines the risk of compensating errors and reports an auxiliary check using a small set of known-correct HDL inputs to measure decoder reconstruction fidelity. We note that while this provides practical support rather than a formal guarantee, the approach remains useful for mitigating the majority of hallucinations in this domain. revision: partial
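The auxiliary check mentioned in the response could be sketched as follows. This is a hypothetical harness, with `decode` standing in for the decoder LLM call and `reference_pairs` holding known-correct HDL units alongside their ground-truth LCTs.

```python
def decoder_fidelity(reference_pairs, decode):
    """Fraction of known-correct (hdl, ground_truth_lct) pairs that the
    decoder reconstructs exactly. A value near 1.0 supports reading
    round-trip mismatches as encoder-side errors."""
    exact = sum(1 for hdl, truth in reference_pairs if decode(hdl) == truth)
    return exact / len(reference_pairs)
```

Reporting this number alongside the round-trip results would separate decoder noise from genuine generation defects.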

Circularity Check

0 steps flagged

No significant circularity; verification uses external original LCT benchmark

full rationale

The paper's central method encodes LCTs to HDL via LLM then decodes back to LCT for direct comparison against the known original source. This comparison is an independent external check rather than a self-referential fit or redefinition. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation. The invertibility assumption is stated upfront and the round-trip test is falsifiable against the input LCT data itself, keeping the approach self-contained without reducing claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the assumption of invertibility in the problem domain, which is stated but not proven in the abstract.

axioms (1)
  • domain assumption Problems like LCT to HDL are invertible and lossless transformations.
    This is central to the encoder-decoder approach working as a verification method.

pith-pipeline@v0.9.0 · 5492 in / 1214 out tokens · 48061 ms · 2026-05-17T05:19:59.873646+00:00 · methodology


Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 4 internal anchors

  1. [1]

    Anthropic, PBC. 2025. Claude. Retrieved Oct. 30, 2025 from https://claude.ai/

  2. [2]

    Rathinakumar Appuswamy et al. 2024. Breakthrough low-latency, high-energy-efficiency LLM inference performance using NorthPole. In 2024 IEEE High Performance Extreme Computing Conference (HPEC), 1–8. doi:10.1109/HPEC62836.2024.10938418

  3. [3]

    Peter Belcak, Greg Heinrich, Shizhe Diao, Yonggan Fu, Xin Dong, Saurav Muralidharan, Yingyan Celine Lin, and Pavlo Molchanov. 2025. Small language models are the future of agentic AI. (2025). https://arxiv.org/abs/2506.02153 arXiv: 2506.02153[cs.AI]

  4. [4]

    Emily M Bender and Alexander Koller. 2020. Climbing towards NLU: on meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198

  5. [5]

    Jason Blocklove, Shailja Thakur, Benjamin Tan, Hammond Pearce, Siddharth Garg, and Ramesh Karri. 2025. Automatically Improving LLM-based Verilog Generation using EDA Tool Feedback. ACM Trans. Des. Autom. Electron. Syst., 30, 6, Article 100 (Oct. 2025), 26 pages. doi:10.1145/3723876

  6. [6]

    Paul E. Calzada, Zahin Ibnat, Tanvir Rahman, Kamal Kandula, Danyu Lu, Sujan Kumar Saha, Farimah Farahmandi, and Mark Tehranipoor. 2025. VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation. (2025). https://arxiv.org/abs/2507.13369 arXiv: 2507.13369 [cs.AR]

  7. [7]

    Andrew S Cassidy et al. 2024. IBM NorthPole: an architecture for neural network inference with a 12nm chip. In 2024 IEEE International Solid-State Circuits Conference (ISSCC). Vol. 67. IEEE, 214–215

  8. [8]

    Mark Chen et al. 2021. Evaluating large language models trained on code. CoRR, abs/2107.03374. https://arxiv.org/abs/2107.03374 arXiv: 2107.03374

  9. [9]

    CODASYL. 1982. A modern appraisal of decision tables. Report of the Decision Table Task Group, 230–232

  10. [10]

    Harvey Yiyun Fu, Aryan Shrivastava, Jared Moore, Peter West, Chenhao Tan, and Ari Holtzman. 2025. AbsenceBench: language models can’t tell what’s missing. (2025). https://arxiv.org/abs/2506.11440 arXiv: 2506.11440 [cs.CL]

  11. [11]

    Mingzhe Gao, Jieru Zhao, Zhe Lin, Wenchao Ding, Xiaofeng Hou, Yu Feng, Chao Li, and Minyi Guo. 2024. AutoVCoder: A Systematic Framework for Automated Verilog Code Generation using LLMs. (2024). https://arxiv.org/abs/2407.18333 arXiv: 2407.18333[cs.AR]

  12. [12]

    Google, Inc. 2025. Google Gemini. Retrieved Oct. 30, 2025 from https://gemini.google.com

  13. [13]

    Aaron Grattafiori et al. 2024. The Llama 3 Herd of Models. (2024). https://arxiv.org/abs/2407.21783 arXiv: 2407.21783 [cs.AI]

  14. [14]

    Douglas Rayner Hartree. 1946. The ENIAC, an electronic computing machine. Nature, 158, 4015, 500–506

  15. [15]

    Robert Hecht-Nielsen. 1995. Replicator neural networks for universal optimal source coding. Science, 269, 5232, 1860–1863

  16. [16]

    Charles Antony Richard Hoare. 1969. An axiomatic basis for computer programming. Communications of the ACM, 12, 10, 576–580

  17. [17]

    Meta AI. 2025. meta-llama/Llama-4-Maverick-17B-128E-Original. url https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Original. Accessed: 2024-11-07. (Apr. 2025)

  18. [18]

    Christopher Mims. 2025. Large language models get all the hype, but small models do the real work. The Wall Street Journal, (Oct. 2025). https://www.wsj.com/tech/ai/large-language-models-get-all-the-hype-but-small-models-do-the-real-work-225d3145

  19. [19]

    Kyungjun Min, Seonghyeon Park, Hyeonwoo Park, Jinoh Cho, and Seokhyeong Kang. 2025. Improving LLM-Based Verilog Code Generation with Data Augmentation and RL. In 2025 Design, Automation & Test in Europe Conference (DATE), 1–7. doi:10.23919/DATE64628.2025.10992897

  20. [20]

    Dharmendra S Modha et al. 2023. Neural inference at the frontier of energy, space, and time. Science, 382, 6668, 329–335

  21. [21]

    Giorgos Nikolaou, Tommaso Mencattini, Donato Crisostomi, Andrea Santilli, Yannis Panagakis, and Emanuele Rodolà. 2025. Language models are injective and hence invertible. (2025). https://arxiv.org/abs/2510.15511 arXiv: 2510.15511 [cs.LG]

  22. [22]

    Jesse Noffsinger, Mark Patel, Pankaj Sachdeva, Arjita Bhan, Haley Chang, and Maria Goodpaster. 2025. The cost of compute: a $7 trillion race to scale data centers. McKinsey & Company Insights, (Apr. 2025). https://www.mckinsey.com/industries/technology-media-and-telecommunications/our-insights/the-cost-of-compute-a-7-trillion-dollar-race-to-scale-data-centers

  23. [23]

    OpenAI. 2025. GPT-5 is here. Retrieved Nov. 8, 2025 from https://openai.com/gpt-5/

  24. [24]

    Aldo Pareja et al. 2024. Unveiling the secret recipe: a guide for supervised fine-tuning small LLMs. (2024). https://arxiv.org/abs/2412.13337 arXiv: 2412.13337 [cs.LG]

  25. [25]

    Konstantin F. Pilz, Yusuf Mahmood, and Lennart Heim. 2025. AI’s Power Requirements Under Exponential Growth: Extrapolating AI Data Center Power Demand and Assessing Its Potential Impact on U.S. Competitiveness. Tech. rep. RR-A3572-1. RAND Corporation. doi:10.7249/RRA3572-1

  26. [26]

    Solomon L Pollack. 1963. Analysis of the decision rules in decision tables. Tech. rep

  27. [27]

    Udo W Pooch. 1974. Translation of decision tables. ACM Computing Surveys (CSUR), 6, 2, 125–151

  28. [28]

    Emil L Post. 1921. Introduction to a general theory of elementary propositions. American journal of mathematics, 43, 3, 163–185

  29. [29]

    Brendan Roberts. 2025. Improving LLM Performance in Generating Verilog by Fine Tuning with a Translated Code Dataset. (May 2025). https://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-104.pdf

  30. [30]

    Prithwish Basu Roy, Akashdeep Saha, Manaar Alam, Johann Knechtel, Michail Maniatakos, Ozgur Sinanoglu, and Ramesh Karri. 2025. Veritas: Deterministic Verilog Code Synthesis from LLM-Generated Conjunctive Normal Form. (2025). https://arxiv.org/abs/2506.00005 arXiv: 2506.00005[cs.AR]

  31. [31]

    Shailja Thakur, Baleegh Ahmad, Hammond Pearce, Benjamin Tan, Brendan Dolan-Gavitt, Ramesh Karri, and Siddharth Garg. 2023. VeriGen: A Large Language Model for Verilog Code Generation. (2023). https://arxiv.org/abs/2308.00708 arXiv: 2308.00708 [cs.PL]

  32. [32]

    J Vanthienen and E Dries. 1997. Decision tables: refining the concept and a proposed standard. Communications of the ACM

  33. [33]

    John von Neumann. 1945. First Draft of a Report on the EDVAC. Tech. rep. Contract No. W-670-ORD-4926. Moore School of Electrical Engineering, University of Pennsylvania, Philadelphia, PA, USA, (June 1945)

  34. [34]

    Anjiang Wei, Huanmi Tan, Tarun Suresh, Daniel Mendoza, Thiago S. F. X. Teixeira, Ke Wang, Caroline Trippel, and Alex Aiken. 2025. VeriCoder: Enhancing LLM-Based RTL Code Generation through Functional Correctness Validation. (2025). https://arxiv.org/abs/2504.15659 arXiv: 2504.15659 [cs.AR]

  35. [35]

    Ludwig Wittgenstein. 2010. Tractatus Logico-Philosophicus. Trans. by C.K. Ogden. Original work published 1922. Project Gutenberg

  36. [36]

    Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli. 2025. Hallucination is inevitable: an innate limitation of large language models. (2025). https://arxiv.org/abs/2401.11817 arXiv: 2401.11817 [cs.CL]

  37. [37]

    Yang Zhao et al. 2025. CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization. (2025). https://arxiv.org/abs/2407.10424 arXiv: 2407.10424[cs.PL]