arxiv: 2605.29523 · v1 · pith:QVEWFR33new · submitted 2026-05-28 · 💻 cs.LG

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

Pith reviewed 2026-06-29 08:52 UTC · model grok-4.3

classification 💻 cs.LG
keywords hallucination detection · multi-turn RAG · Korean finance · benchmark dataset · justified abstention · LLM evaluation · financial documents · retrieval augmented generation

The pith

The K-FinHallu benchmark finds that even frontier LLMs struggle to detect hallucinations and to recognize justified abstention in multi-turn Korean financial RAG.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents the first benchmark designed to detect hallucinations in multi-turn, retrieval-augmented conversations about Korean finance. The dialogues are built from real Korean financial documents, and hallucinations are injected according to a taxonomy that classifies errors by whether the provided context can answer the question, including cases where the model should refuse to answer. The evaluation shows that leading large language models have trouble with fine-grained financial diagnostics and with knowing when to abstain. Fine-tuning an 8-billion-parameter model on the benchmark's training split makes it competitive with frontier models, but abstention remains the hardest part. A benchmark like this matters because financial tools must not fabricate information in sensitive areas such as Korean markets and regulations.

Core claim

The K-FinHallu benchmark constructs multi-turn dialogues from authentic Korean financial documents and injects hallucinations using a hierarchical taxonomy based on context answerability that accounts for justified abstention. Benchmarking shows that frontier and open-source LLMs, used as hallucination detectors, struggle with fine-grained financial diagnostics and refusal behavior. Fine-tuning an 8B model yields performance competitive with frontier LLMs, but justified abstention remains the weakest axis.

What carries the argument

Hierarchical taxonomy based on context answerability for classifying hallucination types and justified abstention in multi-turn dialogues.
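To make the idea of branching on context answerability concrete, here is a minimal sketch of a two-level decision rule. It is an illustration only: the `Verdict` labels, the `Turn` fields, and the `classify` function are hypothetical names chosen for this example, and they are not the five hallucination types defined in the paper, which this page does not enumerate.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    # Hypothetical leaf labels for illustration; not the paper's categories.
    FAITHFUL = "faithful"
    HALLUCINATED = "hallucinated"
    JUSTIFIED_ABSTENTION = "justified_abstention"
    UNJUSTIFIED_ABSTENTION = "unjustified_abstention"


@dataclass(frozen=True)
class Turn:
    context_answerable: bool  # can the retrieved context answer the question?
    response_abstains: bool   # did the response decline to answer?
    response_supported: bool  # is every claim in the response backed by the context?


def classify(turn: Turn) -> Verdict:
    """First branch on answerability, then on what the response did."""
    if turn.context_answerable:
        if turn.response_abstains:
            return Verdict.UNJUSTIFIED_ABSTENTION
        return Verdict.FAITHFUL if turn.response_supported else Verdict.HALLUCINATED
    # The context cannot answer the question: declining is the correct behavior,
    # and any substantive answer is unsupported by construction.
    if turn.response_abstains:
        return Verdict.JUSTIFIED_ABSTENTION
    return Verdict.HALLUCINATED
```

Under these assumptions, a refusal is only correct on the unanswerable branch, which is why a detector must judge answerability before it can judge the response.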

If this is right

  • Fine-tuning smaller models on the benchmark training data can achieve performance close to that of much larger frontier models.
  • Justified abstention is the most challenging aspect for hallucination detection across all tested models.
  • Current LLMs require better capabilities for handling fine-grained financial information and refusal decisions in conversational settings.
  • The benchmark provides a way to evaluate and improve RAG systems in the Korean financial domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar benchmarks may be useful for financial RAG in other languages with specific regulatory contexts.
  • Training methods that specifically target uncertainty detection in dialogues could improve abstention rates.
  • Integrating this benchmark with retrieval evaluation might reveal whether errors come from bad retrieval or generation.

Load-bearing premise

The proposed hierarchical taxonomy based on context answerability accurately captures the different types of hallucinations and cases of justified abstention in the multi-turn financial dialogues.

What would settle it

Human annotators reviewing a subset of the dialogues and hallucination injections find that the taxonomy labels do not align with the actual nature of the errors or abstention needs.
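One way such a check could be quantified is with a chance-corrected agreement statistic between the benchmark's labels and an independent annotator's labels. The sketch below uses Cohen's kappa as a stand-in; the page does not say which statistic the authors report, and the label strings and sample data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each source's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sources used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical data: benchmark labels versus one human annotator.
benchmark = ["contradictory", "abstain", "faithful", "abstain", "contradictory", "faithful"]
annotator = ["contradictory", "faithful", "faithful", "abstain", "contradictory", "faithful"]
kappa = cohens_kappa(benchmark, annotator)
```

A kappa near zero on a reviewed subset would indicate that the taxonomy labels agree with human judgment no better than chance, which is the failure case described above.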

Figures

Figures reproduced from arXiv: 2605.29523 by Edward Choi, Eunbyeol Cho, Jeewon Yang, Mirae Kim, Youngjun Kwak, Yunseung Lee.

Figure 1: Example of a contradictory hallucination in ...
Figure 2: Hallucination taxonomy for multi-turn RAG. We categorize five types of hallucinations based on context ...
Figure 3: Overview of the K-FinHallu construction pipeline. We first generate faithful dialogues and then inject ...
Figure 4: Distribution of K-FinHallu test set characteristics. The figure presents the percentage distributions of ...
Figure 5: Error rates of the Qwen3 series on Contradictory sub-types (GPT-4o-injected subset). Each bar represents the proportion of hallucinations undetected by the model. Cross-source analysis: K-FinHallu employs both GPT-4o and Gemini-2.5-Flash as injection sources over the same faithful dialogues for the test split to mitigate model-specific artifacts (...
Figure 6: Document domain distribution in the test (left) and training (right) splits. The same domain is assigned a ...
Figure 7: Screenshot of the annotation interface for dataset quality control.

Original abstract

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. It constructs multi-turn dialogues from authentic Korean financial documents and injects hallucinations under a hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as detectors shows that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior; fine-tuning an 8B model on the training split yields performance competitive with frontier LLMs, but justified abstention remains the weakest axis across all models.

Significance. If the taxonomy and dialogue construction hold, the benchmark addresses an important gap in existing single-turn, English-centric hallucination evaluations by targeting multi-turn dynamics and Korean financial-regulatory nuances. This is relevant for high-stakes RAG deployment; the explicit treatment of justified abstention is a strength if the labeling is reliable.

minor comments (2)
  1. [Abstract] The abstract states the taxonomy is 'hierarchical' and 'based on context answerability' but provides no enumeration of categories, decision rules, or inter-annotator agreement statistics; without these, it is impossible to assess whether the taxonomy correctly separates hallucination types from justified abstention.
  2. [Abstract] No details are given on the number of dialogues, document sources, injection procedure, or exact metrics (e.g., precision/recall per axis, refusal rate); these omissions prevent verification of the claim that fine-tuned 8B performance is 'competitive' or that abstention is 'weakest'.
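As an illustration of the per-axis metrics the second comment asks for, the following sketch computes precision, recall, and support for each label from gold and predicted detector outputs. The function name, label strings, and sample data are assumptions made for this example and do not come from the paper.

```python
def per_label_precision_recall(gold, predicted):
    """Return precision, recall, and support for every label seen in either list."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    report = {}
    for label in sorted(set(gold) | set(predicted)):
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        report[label] = {
            # Define the ratio as 0.0 when its denominator is empty.
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "support": tp + fn,
        }
    return report


# Hypothetical detector outputs on six dialogue turns.
gold = ["faithful", "contradictory", "abstention", "abstention", "faithful", "contradictory"]
pred = ["faithful", "contradictory", "faithful", "abstention", "faithful", "faithful"]
report = per_label_precision_recall(gold, pred)
```

Reporting the breakdown per label, rather than a single accuracy figure, is what would let a reader verify a claim such as abstention being the weakest axis.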

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of K-FinHallu and for recognizing its relevance to multi-turn Korean financial RAG hallucination detection, including the explicit handling of justified abstention. We note that the report contains no specific major comments requiring point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is a benchmark construction paper with no mathematical derivations, equations, predictions, or first-principles claims. The work introduces a dataset and taxonomy for hallucination detection in Korean financial RAG dialogues; the taxonomy is explicitly proposed as part of the contribution rather than derived from prior results or self-citations. No load-bearing step reduces to a fit, self-definition, or author-overlapping citation chain. The evaluation of LLMs on the benchmark is a standard empirical comparison and does not rely on internal circular logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5706 in / 944 out tokens · 33572 ms · 2026-06-29T08:52:44.648529+00:00 · methodology
