Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
Pith reviewed 2026-05-10 16:29 UTC · model grok-4.3
The pith
A hybrid semantic-lexical minimum Bayes risk method reduces LLM hallucinations in high-stakes enterprise workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Framing hallucination mitigation as a Minimum Bayes Risk problem allows dramatic risk reduction. The Hybrid Utility MBR (HUMBR) framework combines semantic embedding similarity with lexical precision to identify consensus outputs without ground-truth references, and rigorous error bounds are derived for the utility. On TruthfulQA, LegalBench, and Meta production data, MBR outperforms Universal Self-Consistency: 81% of the pipeline's suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
What carries the argument
Hybrid Utility MBR (HUMBR), a framework that combines semantic embedding similarity and lexical precision to identify low-hallucination consensus outputs without references.
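The mechanism can be sketched in a few lines. This is a minimal illustration of reference-free MBR consensus selection with a hybrid utility, not the paper's implementation: the bag-of-words cosine stands in for the paper's semantic embedding model, the token-overlap precision for its lexical component, and the weight `alpha` is a hypothetical parameter (the actual weighting is not specified in the text shown here).

```python
import math
from collections import Counter

def _bow(text):
    # Bag-of-words vector; a crude stand-in for a semantic embedding.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two bag-of-words vectors.
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def lexical_precision(cand, ref):
    # Fraction of candidate tokens that also appear in the reference sample.
    c, r = cand.lower().split(), set(ref.lower().split())
    return sum(t in r for t in c) / len(c) if c else 0.0

def hybrid_utility(cand, ref, alpha=0.5):
    # Hypothetical blend of semantic and lexical terms (alpha is illustrative).
    return alpha * cosine(_bow(cand), _bow(ref)) \
        + (1 - alpha) * lexical_precision(cand, ref)

def mbr_select(candidates, alpha=0.5):
    # MBR consensus: pick the candidate with the highest average utility
    # against all other sampled outputs. No ground-truth reference needed;
    # an outlier (a likely hallucination) scores poorly against the rest.
    def expected_utility(c):
        others = [r for r in candidates if r is not c]
        return sum(hybrid_utility(c, r, alpha) for r in others) / len(others)
    return max(candidates, key=expected_utility)

samples = [
    "The contract requires 30 days notice before termination.",
    "The contract requires 30 days written notice before termination.",
    "The contract can be terminated at any time without notice.",  # outlier
]
print(mbr_select(samples))
# → The contract requires 30 days notice before termination.
```

The outlier third sample, disagreeing with the other two, receives low expected utility and is rejected; that is the consensus effect the framework relies on.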
If this is right
- MBR significantly outperforms standard Universal Self-Consistency.
- 81% of the pipeline's suggestions were preferred over human-crafted ground truth.
- Critical recall failures were virtually eliminated.
- The approach works on public benchmarks and real-world production data from enterprise settings.
Where Pith is reading between the lines
- Adapting the utility function could extend benefits to additional regulated sectors like healthcare or finance.
- Integration into existing LLM pipelines could enable safer automation without added human oversight.
- Advances in embedding technology would likely improve the semantic component and overall performance.
Load-bearing premise
The hybrid semantic-lexical utility function can reliably select the output with lowest hallucination risk across enterprise domains without ground truth, with error bounds that hold for real LLM distributions.
What would settle it
Run HUMBR and Universal Self-Consistency on a fresh set of enterprise legal queries and have domain experts count hallucinations in each system's outputs; if HUMBR does not show lower hallucination rates or higher expert preference, the central claim fails.
Original abstract
Although LLMs drive automation, it is critical to ensure immense consideration for high-stakes enterprise workflows such as those involving legal matters, risk management, and privacy compliance. For Meta, and other organizations like ours, a single hallucinated clause in such high stakes workflows risks material consequences. We show that by framing hallucination mitigation as a Minimum Bayes Risk (MBR) problem, we can dramatically reduce this risk. Specifically, we introduce a Hybrid Utility MBR (HUMBR) framework that synthesizes semantic embedding similarity with lexical precision to identify consensus without ground-truth references, for which we derive rigorous error bounds. We complement this theoretical analysis with a comprehensive empirical evaluation on widely-used public benchmark suites (TruthfulQA and LegalBench) and also real world data from Meta production deployment. The results from our empirical study show that MBR significantly outperforms standard Universal Self-Consistency. Notably, 81% of the pipeline's suggestions were preferred over human-crafted ground truth, and critical recall failures were virtually eliminated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript frames hallucination mitigation in LLMs for high-stakes enterprise workflows as a Minimum Bayes Risk (MBR) problem. It introduces the Hybrid Utility MBR (HUMBR) method, which combines semantic embedding similarity with lexical precision to select consensus outputs without ground-truth references, derives rigorous error bounds for this utility, and reports empirical results on TruthfulQA, LegalBench, and Meta production data showing significant gains over Universal Self-Consistency, including an 81% preference rate for the pipeline's outputs over human-crafted ground truth and near-elimination of critical recall failures.
Significance. If the error bounds are valid for real LLM distributions and the hybrid utility generalizes without introducing circularity or data-dependent bias, the work would offer a practical, reference-free approach to improving reliability in legal, risk, and compliance applications. The combination of a theoretical guarantee with evaluation on both public benchmarks and production data is a positive feature.
Major comments (2)
- [Theoretical analysis] Theoretical analysis section: the claim of 'rigorous error bounds' for the hybrid utility is central to the contribution, yet the provided derivation must explicitly demonstrate that the bounds remain valid and non-circular when the utility is applied to the actual output distributions of production LLMs rather than idealized or benchmark-specific cases.
- [Empirical evaluation] Empirical evaluation section: the reported 81% preference over human ground truth and elimination of critical recall failures are load-bearing for the practical claims; the preference collection protocol, inter-annotator agreement, and controls for evaluator bias must be detailed to confirm that the hybrid utility reliably identifies lower-hallucination outputs across domains.
Minor comments (2)
- [Abstract] The abstract and introduction could more precisely define the hybrid utility function (e.g., the weighting between semantic and lexical components) to aid reproducibility.
- [Results] Figure or table presenting the benchmark results should include exact metrics (e.g., accuracy, F1) alongside the preference rate for direct comparison with Universal Self-Consistency.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. We address each major comment below and describe the revisions we will incorporate to strengthen the manuscript.
Point-by-point responses
- Referee: Theoretical analysis section: the claim of 'rigorous error bounds' for the hybrid utility is central to the contribution, yet the provided derivation must explicitly demonstrate that the bounds remain valid and non-circular when the utility is applied to the actual output distributions of production LLMs rather than idealized or benchmark-specific cases.
  Authors: The error bounds derived in the theoretical analysis (Section 3) rely only on the finite support of the sampled candidate set and the boundedness of the hybrid utility function; they do not assume any particular parametric form of the underlying LLM output distribution. Because the utility is evaluated after sampling, the same finite-set concentration arguments apply equally to the empirical distributions obtained from production LLMs. To address the request for explicit demonstration, we will insert a short clarifying paragraph immediately after the main theorem statement that restates the assumptions in distribution-agnostic terms and notes that the bounds hold for any fixed sampling procedure used in practice, including the production setting. This addition removes any potential ambiguity without altering the original derivation. Revision: yes.
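The distribution-agnostic argument the authors invoke can be made concrete with a standard finite-sample concentration sketch. This is an assumed form of such a bound, not the paper's actual theorem: for a utility bounded in [0, 1], Hoeffding's inequality controls the gap between the empirical and true expected utility regardless of the underlying output distribution.

```latex
% Assumed illustrative form (not the paper's theorem): with utility
% u bounded in [0,1] and n i.i.d. samples y_1, ..., y_n from the LLM,
% Hoeffding's inequality gives, for each fixed candidate c,
\[
  \Pr\!\left( \left| \frac{1}{n}\sum_{i=1}^{n} u(c, y_i)
      \;-\; \mathbb{E}_{y \sim p_{\mathrm{LLM}}}\, u(c, y) \right|
      \ge \varepsilon \right)
  \;\le\; 2 \exp\!\left(-2 n \varepsilon^{2}\right),
\]
% which holds for any output distribution p_LLM, whether it comes from
% a public benchmark or a production deployment.
```

This is the sense in which bounds that depend only on boundedness of the utility and a fixed sampling procedure carry over to production LLM distributions.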
- Referee: Empirical evaluation section: the reported 81% preference over human ground truth and elimination of critical recall failures are load-bearing for the practical claims; the preference collection protocol, inter-annotator agreement, and controls for evaluator bias must be detailed to confirm that the hybrid utility reliably identifies lower-hallucination outputs across domains.
  Authors: We agree that the current description of the human evaluation is insufficiently detailed. In the revised manuscript we will expand the relevant subsection to include: (i) the exact preference-collection protocol (task instructions given to annotators, number of candidates shown per query, and decision criteria); (ii) inter-annotator agreement statistics (pairwise agreement rate and Cohen's kappa) computed on the overlapping subset of judgments; and (iii) bias-mitigation steps (randomized presentation order, blinding of model identity, and use of multiple annotators per item). For the Meta production data we will add a brief note on the constraints that prevent full public release of the raw annotations while still reporting the agreement and bias-control metrics that were computed internally. These additions will allow readers to assess the reliability of the 81% preference figure and the recall-failure reduction across both benchmark and production domains. Revision: yes.
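The Cohen's kappa statistic the rebuttal promises to report is simple to compute; a minimal sketch, with hypothetical preference judgments standing in for the paper's actual annotation data:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    # Chance-corrected agreement between two annotators on the same items.
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their
    # own marginal label rates.
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical "which output is better?" judgments on 10 shared items.
a = ["model", "model", "human", "model", "model",
     "human", "model", "model", "model", "human"]
b = ["model", "model", "human", "model", "human",
     "human", "model", "model", "model", "model"]
print(round(cohens_kappa(a, b), 3))  # → 0.524
```

Raw pairwise agreement here is 0.8, but kappa corrects for the agreement both annotators would reach by chance given their label frequencies, which is why reporting both, as the rebuttal proposes, is more informative than either alone.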
Circularity Check
No significant circularity identified
full rationale
The provided abstract and framing describe hallucination mitigation as an MBR problem with a hybrid semantic-lexical utility (HUMBR) and independently derived error bounds, evaluated on external public benchmarks (TruthfulQA, LegalBench) plus production data. No equations, self-citations, or load-bearing steps are visible in the given text that reduce any claimed prediction or bound to a fitted input or prior self-result by construction. The central claims rest on standard MBR application plus empirical comparison to Universal Self-Consistency, which is externally verifiable and does not collapse to the paper's own definitions.