pith. machine review for the scientific record.

arxiv: 2604.11141 · v1 · submitted 2026-04-13 · 💻 cs.LG · cs.CR

Recognition: unknown

Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords hallucination reduction · minimum Bayes risk · LLM workflows · enterprise AI · self-consistency · legal benchmarks · semantic similarity · lexical precision

The pith

A hybrid semantic-lexical minimum Bayes risk method reduces LLM hallucinations in high-stakes enterprise workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-stakes enterprise AI workflows using LLMs face material risks from hallucinations in legal, risk, and compliance tasks. The paper casts hallucination reduction as a minimum Bayes risk problem and proposes the Hybrid Utility MBR framework to select consensus outputs by combining embedding-based semantic similarity with lexical precision, all without ground-truth references. It provides theoretical error bounds and demonstrates through benchmarks and real Meta production data that this approach outperforms standard self-consistency, with 81% of outputs preferred over human ground truth and near elimination of critical recall errors.

Core claim

Framing hallucination mitigation as a Minimum Bayes Risk problem allows dramatic risk reduction. The Hybrid Utility MBR (HUMBR) framework synthesizes semantic embedding similarity with lexical precision to identify consensus outputs without ground-truth references, and rigorous error bounds are derived for this hybrid utility. On TruthfulQA, LegalBench, and Meta production data, MBR outperforms Universal Self-Consistency, with 81% of suggestions preferred over human-crafted ground truth and critical recall failures virtually eliminated.
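
For orientation, the standard MBR decision rule that this framing instantiates can be written as follows; the notation here is ours, and the exact form of the utility U is described in the abstract only at a high level as a semantic-lexical hybrid.

    \hat{y} \;=\; \arg\max_{y \in \mathcal{C}} \; \frac{1}{|\mathcal{C}|} \sum_{y' \in \mathcal{C}} U(y, y')

Here \mathcal{C} is a set of candidate outputs sampled from the LLM for the same prompt. Each candidate is scored by its average utility against the other candidates, which act as pseudo-references, and the highest-scoring candidate is returned as the consensus output.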

What carries the argument

Hybrid Utility MBR (HUMBR), a framework that combines semantic embedding similarity and lexical precision to identify low-hallucination consensus outputs without references.
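
As a concrete illustration, the sketch below shows one way such reference-free selection could be implemented, assuming the hybrid utility is a weighted sum of embedding cosine similarity and unigram lexical precision; the weight alpha, the lexical metric, and all function names are illustrative placeholders rather than the paper's actual implementation.

    import numpy as np

    def lexical_precision(candidate: str, pseudo_ref: str) -> float:
        # Fraction of candidate tokens that also appear in the pseudo-reference (unigram precision).
        cand_tokens = candidate.lower().split()
        ref_tokens = set(pseudo_ref.lower().split())
        return sum(t in ref_tokens for t in cand_tokens) / len(cand_tokens) if cand_tokens else 0.0

    def hybrid_utility(emb_a, emb_b, text_a, text_b, alpha=0.5):
        # Assumed hybrid: alpha * embedding cosine similarity + (1 - alpha) * lexical precision.
        cos = float(np.dot(emb_a, emb_b)) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b) + 1e-9)
        return alpha * cos + (1.0 - alpha) * lexical_precision(text_a, text_b)

    def mbr_select(candidates, embeddings, alpha=0.5):
        # Return the candidate with the highest average hybrid utility against all other candidates.
        best_idx, best_score = 0, float("-inf")
        for i in range(len(candidates)):
            score = np.mean([hybrid_utility(embeddings[i], embeddings[j], candidates[i], candidates[j], alpha)
                             for j in range(len(candidates)) if j != i])
            if score > best_score:
                best_idx, best_score = i, score
        return candidates[best_idx]

The candidates would be several samples drawn for the same prompt and the embeddings would come from any sentence-embedding model; no gold answer is consulted at any point in the selection.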

If this is right

  • MBR significantly outperforms standard Universal Self-Consistency.
  • 81% of the pipeline's suggestions were preferred over human-crafted ground truth.
  • Critical recall failures were virtually eliminated.
  • The approach works on public benchmarks and real-world production data from enterprise settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adapting the utility function could extend benefits to additional regulated sectors like healthcare or finance.
  • Integration into existing LLM pipelines could enable safer automation without added human oversight.
  • Advances in embedding technology would likely improve the semantic component and overall performance.

Load-bearing premise

The hybrid semantic-lexical utility function can reliably select the output with lowest hallucination risk across enterprise domains without ground truth, with error bounds that hold for real LLM distributions.
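
The paper's bounds are not reproduced in what Pith extracted, but a distribution-agnostic guarantee of the kind this premise requires would typically take a Hoeffding-style form. Assuming (our assumptions, not the paper's stated ones) that the hybrid utility is bounded in [0, 1] and the n candidates are drawn i.i.d. from the model,

    P\left( \left| \frac{1}{n} \sum_{i=1}^{n} U(y, y'_i) - \mathbb{E}_{y'}\left[ U(y, y') \right] \right| \ge \varepsilon \right) \;\le\; 2 \exp\left( -2 n \varepsilon^{2} \right),

so each candidate's empirical consensus score concentrates around its expected utility whatever the model's output distribution. Whether the paper's derivation actually takes this form can only be checked against the full text.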

What would settle it

Run HUMBR and Universal Self-Consistency on a new set of enterprise legal queries and have experts count hallucinations in each system's outputs; if HUMBR does not show lower hallucination rates or higher preference, the central claim fails.
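
A minimal sketch of how such an audit could be scored, assuming experts flag each system's output per query as hallucinated (1) or clean (0); the pairing into discordant queries and the exact sign test are our choices, not a protocol taken from the paper.

    from math import comb

    def sign_test_p_value(wins_a: int, wins_b: int) -> float:
        # Two-sided exact binomial sign test over queries where exactly one system hallucinates.
        n, k = wins_a + wins_b, min(wins_a, wins_b)
        tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
        return min(1.0, 2.0 * tail)

    def compare_hallucination_flags(humbr_flags, usc_flags):
        # humbr_flags[i] / usc_flags[i] are 1 if experts found a hallucination in query i's output.
        humbr_cleaner = sum(h == 0 and u == 1 for h, u in zip(humbr_flags, usc_flags))
        usc_cleaner = sum(h == 1 and u == 0 for h, u in zip(humbr_flags, usc_flags))
        return humbr_cleaner, usc_cleaner, sign_test_p_value(humbr_cleaner, usc_cleaner)

If the discordant queries do not favor HUMBR, or its overall hallucination rate is not lower, the central claim fails on that data.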

Figures

Figures reproduced from arXiv: 2604.11141 by Abhishek Gulati, Arya Pudota, Chenhao Fang, Hervé Robert, Jason Nawrocki, Jay Minesh Shah, Jordi Mola, Katayoun Zand, Mansi Tripathi, Mark Harman, Matthew Becker, Vaibhav Shrivastava, Yue Cheng.

Figure 1. Workflow of the proposed HUMBR ensemble system. The system calculates consensus using a Hybrid Utility Function. [image omitted]
Figure 2. Cost–P_fail Pareto Frontier: an idealized depiction of the operational trade-off landscape. [image omitted]
Figure 3. Human-in-the-Loop Workflow. The AI generates a … [caption truncated; image omitted]
Figure 4. Outcome Distribution Analysis. The stacked bars … [caption truncated; image omitted]
original abstract

Although LLMs drive automation, it is critical to ensure immense consideration for high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance. For Meta, and other organizations like ours, a single hallucinated clause in such high stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely-used public benchmark suites (TruthfulQA and LegalBench) and also real world data from Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline's suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript frames hallucination mitigation in LLMs for high-stakes enterprise workflows as a Minimum Bayes Risk (MBR) problem. It introduces the Hybrid Utility MBR (HUMBR) method, which combines semantic embedding similarity with lexical precision to select consensus outputs without ground-truth references, derives rigorous error bounds for this utility, and reports empirical results on TruthfulQA, LegalBench, and Meta production data showing significant gains over Universal Self-Consistency, including an 81% preference rate for the pipeline's outputs over human-crafted ground truth and near-elimination of critical recall failures.

Significance. If the error bounds are valid for real LLM distributions and the hybrid utility generalizes without introducing circularity or data-dependent bias, the work would offer a practical, reference-free approach to improving reliability in legal, risk, and compliance applications. The combination of a theoretical guarantee with evaluation on both public benchmarks and production data is a positive feature.

major comments (2)
  1. [Theoretical analysis] Theoretical analysis section: the claim of 'rigorous error bounds' for the hybrid utility is central to the contribution, yet the provided derivation must explicitly demonstrate that the bounds remain valid and non-circular when the utility is applied to the actual output distributions of production LLMs rather than idealized or benchmark-specific cases.
  2. [Empirical evaluation] Empirical evaluation section: the reported 81% preference over human ground truth and elimination of critical recall failures are load-bearing for the practical claims; the preference collection protocol, inter-annotator agreement, and controls for evaluator bias must be detailed to confirm that the hybrid utility reliably identifies lower-hallucination outputs across domains.
minor comments (2)
  1. [Abstract] The abstract and introduction could more precisely define the hybrid utility function (e.g., the weighting between semantic and lexical components) to aid reproducibility.
  2. [Results] The figure or table presenting the benchmark results should include exact metrics (e.g., accuracy, F1) alongside the preference rate, to allow direct comparison with Universal Self-Consistency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.

point-by-point responses
  1. Referee: Theoretical analysis section: the claim of 'rigorous error bounds' for the hybrid utility is central to the contribution, yet the provided derivation must explicitly demonstrate that the bounds remain valid and non-circular when the utility is applied to the actual output distributions of production LLMs rather than idealized or benchmark-specific cases.

    Authors: The error bounds derived in the theoretical analysis (Section 3) rely only on the finite support of the sampled candidate set and the boundedness of the hybrid utility function; they do not assume any particular parametric form of the underlying LLM output distribution. Because the utility is evaluated after sampling, the same finite-set concentration arguments apply equally to the empirical distributions obtained from production LLMs. To address the request for explicit demonstration, we will insert a short clarifying paragraph immediately after the main theorem statement that restates the assumptions in distribution-agnostic terms and notes that the bounds hold for any fixed sampling procedure used in practice, including the production setting. This addition removes any potential ambiguity without altering the original derivation. revision: yes

  2. Referee: Empirical evaluation section: the reported 81% preference over human ground truth and elimination of critical recall failures are load-bearing for the practical claims; the preference collection protocol, inter-annotator agreement, and controls for evaluator bias must be detailed to confirm that the hybrid utility reliably identifies lower-hallucination outputs across domains.

    Authors: We agree that the current description of the human evaluation is insufficiently detailed. In the revised manuscript we will expand the relevant subsection to include: (i) the exact preference-collection protocol (task instructions given to annotators, number of candidates shown per query, and decision criteria); (ii) inter-annotator agreement statistics (pairwise agreement rate and Cohen’s kappa) computed on the overlapping subset of judgments; and (iii) bias-mitigation steps (randomized presentation order, blinding of model identity, and use of multiple annotators per item). For the Meta production data we will add a brief note on the constraints that prevent full public release of the raw annotations while still reporting the agreement and bias-control metrics that were computed internally. These additions will allow readers to assess the reliability of the 81 % preference figure and the recall-failure reduction across both benchmark and production domains. revision: yes
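
As a point of reference for the agreement statistics promised in response 2, Cohen's kappa over two annotators' labels on the same items can be computed with the standard formula below; this is generic bookkeeping, not code from the paper.

    from collections import Counter

    def cohens_kappa(labels_a, labels_b):
        # Cohen's kappa for two annotators labelling the same items (e.g., which output is preferred).
        assert len(labels_a) == len(labels_b) and labels_a
        n = len(labels_a)
        observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
        counts_a, counts_b = Counter(labels_a), Counter(labels_b)
        expected = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
        return 1.0 if expected == 1.0 else (observed - expected) / (1.0 - expected)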

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The provided abstract and framing describe hallucination mitigation as an MBR problem with a hybrid semantic-lexical utility (HUMBR) and independently derived error bounds, evaluated on external public benchmarks (TruthfulQA, LegalBench) plus production data. No equations, self-citations, or load-bearing steps are visible in the given text that reduce any claimed prediction or bound to a fitted input or prior self-result by construction. The central claims rest on standard MBR application plus empirical comparison to Universal Self-Consistency, which is externally verifiable and does not collapse to the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities. The hybrid utility is described only at a high level as a synthesis of semantic and lexical signals; whether this synthesis introduces tunable weights or other fitted constants cannot be determined.

pith-pipeline@v0.9.0 · 5527 in / 1384 out tokens · 63235 ms · 2026-05-10T16:29:55.916730+00:00 · methodology


Reference graph

Works this paper leans on

31 extracted references · 26 canonical work pages · 12 internal anchors

  1. [1]

    Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. 2023. Universal Self-Consistency for Large Language Model Generation. arXiv:2311.17311 [cs.CL] https://arxiv.org/abs/2311.17311

  2. [2]

    Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. 2024. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models. arXiv:2309.03883 [cs.CL] https://arxiv.org/abs/2309.03883

  3. [3]

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG] https://arxiv.org/abs/2110.14168

  4. [4]

    Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. 2023. Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv:2309.11495 [cs.CL] https://arxiv.org/abs/2309.11495

  5. [5]

    Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. 2023. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325 [cs.CL] https://arxiv.org/abs/2305.14325

  6. [6]

    Bryan Eikema and Wilker Aziz. 2020. Is MAP decoding all you need? The inadequacy of the mode in neural machine translation. arXiv preprint arXiv:2005.10283 (2020)

  7. [7]

    Chenhao Fang, Derek Larson, Shitong Zhu, Sophie Zeng, Wendy Summer, Yanqing Peng, Yuriy Hulovatyy, Rajeev Rao, Gabriel Forgues, Arya Pudota, Alex Goncalves, and Hervé Robert. 2024. Ingest-And-Ground: Dispelling Hallucinations from Continually-Pretrained LLMs with RAG. arXiv:2410.02825 [cs.CL] https://arxiv.org/abs/2410.02825

  8. [8]

    Chenhao Fang, Xiaohan Li, Zezhong Fan, Jianpeng Xu, Kaushiki Nag, Evren Korpeoglu, Sushant Kumar, and Kannan Achan. 2024. LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction. arXiv:2403.00863 [cs.IR] https://arxiv.org/abs/2403.00863

  9. [9]

    Chenhao Fang, Yanqing Peng, Rajeev Rao, Matt Sarmiento, Wendy Summer, Arya Pudota, Alex Goncalves, Jordi Mola, and Hervé Robert. 2025. Privacy Artifact ConnecTor (PACT): Embedding Enterprise Artifacts for Compliance AI Agents. arXiv:2507.21142 [cs.CR] https://arxiv.org/abs/2507.21142

  10. [10]

    Markus Freitag, Behrooz Ghorbani, and Patrick Fernandes. 2023. Epsilon Sampling Rocks: Investigating Sampling Strategies for Minimum Bayes Risk Decoding for Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2023, Houda Bouamor, Juan Pino, and Kalika Bali (Eds.). Association for Computational Linguistics, Singapore, ...

  11. [11]

    Vaibhava Goel. 2002. Minimum Bayes-risk automatic speech recognition. (2002)

  12. [12]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (March 2023), 1–38. doi:10.1145/3571730

  13. [13]

    Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. 2023. LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion. arXiv:2306.02561 [cs.CL] https://arxiv.org/abs/2306.02561

  14. [14]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation. arXiv:2302.09664 [cs.CL] https://arxiv.org/abs/2302.09664

  15. [15]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2021. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401 [cs.CL] https://arxiv.org/abs/2005.11401

  16. [16]

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. 2024. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. arXiv:2306.03341 [cs.LG] https://arxiv.org/abs/2306.03341

  17. [17]

    Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2024. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. arXiv:2305.19118 [cs.CL] https://arxiv.org/abs/2305.19118

  18. [18]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172

  19. [19]

    Xueguang Ma, Xi Victoria Lin, Barlas Oguz, Jimmy Lin, Wen-tau Yih, and Xilun Chen. 2025. DRAMA: diverse augmentation from large language models to smaller dense retrievers. arXiv preprint arXiv:2502.18460 (2025)

  20. [20]

    Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. 2023. SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models. arXiv:2303.08896 [cs.CL] https://arxiv.org/abs/2303.08896

  21. [21]

    Bo Ni, Zheyuan Liu, Leyao Wang, Yongjia Lei, Yuying Zhao, Xueqi Cheng, Qingkai Zeng, Luna Dong, Yinglong Xia, Krishnaram Kenthapadi, Ryan Rossi, Franck Dernoncourt, Md Mehrab Tanjim, Nesreen Ahmed, Xiaorui Liu, Wenqi Fan, Erik Blasch, Yu Wang, Meng Jiang, and Tyler Derr. 2025. Towards Trustworthy Retrieval Augmented Generation for Large Language Models: ...

  22. [22]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human...

  23. [23]

    Weijia Shi, Xiaochuang Han, Mike Lewis, Yulia Tsvetkov, Luke Zettlemoyer, and Scott Wen-tau Yih. 2023. Trusting Your Evidence: Hallucinate Less with Context-aware Decoding. arXiv:2305.14739 [cs.CL] https://arxiv.org/abs/2305.14739

  24. [24]

    Mirac Suzgun, Luke Melas-Kyriazi, and Dan Jurafsky. 2023. Follow the wisdom of the crowd: Effective text generation via minimum Bayes risk decoding. In Findings of the Association for Computational Linguistics: ACL 2023. 4265–4293

  25. [25]

    Miles Turpin, Julian Michael, Ethan Perez, and Samuel R. Bowman. 2023. Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. arXiv:2305.04388 [cs.CL] https://arxiv.org/abs/2305.04388

  26. [26]

    Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL] https://arxiv.org/abs/2203.11171

  27. [27]

    Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2025. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv:2309.01219 [cs.CL] https://arxiv.org/abs/2309.01219

  28. [28]

    Shitong Zhu, Chenhao Fang, Derek Larson, Neel Reddy Pochareddy, Rajeev Rao, Sophie Zeng, Yanqing Peng, Wendy Summer, Alex Goncalves, Arya Pudota, and Hervé Robert. 2025. Compliance Brain Assistant: Conversational Agentic AI for Assisting Compliance Tasks in Enterprise Environments. arXiv:2507.17289 [cs.AI] https://arxiv.org/abs/2507.17289
