
arxiv: 2606.00919 · v1 · pith:TECYL76Anew · submitted 2026-05-30 · 💻 cs.CL · cs.LG

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

Pith reviewed 2026-06-28 18:30 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords hallucination mitigation · soft prompts · large language models · contrastive learning · abstention · parameter-efficient tuning · generative QA

The pith

Soft prompts trained with contrastive loss can reduce hallucinations in LLMs while preserving factual recall on generative QA tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method that trains soft prompts on LLMs to suppress hallucinatory outputs, encourage abstention when uncertain, and maintain factual recall. It achieves this through a composite training loss that combines contrastive elements, curriculum learning, and KL regularization. Experiments on five generative QA datasets using Gemma 3 (12B) and Llama 3.1 (8B) backbones show RCSP yields higher F-scores than standard reasoning and instruction-based prompting baselines while updating only a small fraction of parameters.
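As a rough illustration of the mechanism summarized above, the following minimal sketch shows how a soft prompt (a small matrix of trainable vectors) could be prepended to the frozen embeddings of a question. The function names, the initialization scale, and the toy dimensions are assumptions made for illustration; none of them come from the paper.

```python
import random


def make_soft_prompt(num_virtual_tokens, hidden_size, seed=0):
    """Create a hypothetical soft prompt: one trainable vector per virtual token."""
    rng = random.Random(seed)
    return [
        [rng.gauss(0.0, 0.02) for _ in range(hidden_size)]
        for _ in range(num_virtual_tokens)
    ]


def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Return the sequence fed to the frozen model: soft prompt first, then the question."""
    if soft_prompt and token_embeddings:
        if len(soft_prompt[0]) != len(token_embeddings[0]):
            raise ValueError("soft prompt and token embeddings must share a hidden size")
    return soft_prompt + token_embeddings


# Toy sizes (assumed): 4 virtual tokens, hidden size 8, a 5-token question.
soft_prompt = make_soft_prompt(num_virtual_tokens=4, hidden_size=8)
question_embeddings = [[0.0] * 8 for _ in range(5)]
model_inputs = prepend_soft_prompt(soft_prompt, question_embeddings)
print(len(model_inputs))  # 9 positions: 4 virtual tokens + 5 question tokens
```

In a setup like this, only the entries of `soft_prompt` would receive gradient updates, while the backbone weights stay frozen.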

Core claim

RCSP trains soft prompts via a composite loss incorporating contrastive loss, curriculum learning, and KL regularization to simultaneously suppress hallucinations, promote abstention under uncertainty, and preserve factual recall, resulting in generally superior F-scores over baselines on five generative QA datasets when applied to Gemma 3 and Llama 3.1 models.

What carries the argument

Responsible Contrastive Soft Prompting (RCSP), a soft-prompt tuning procedure whose composite loss balances hallucination suppression, abstention, and factual recall.
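The summary names the ingredients of the composite loss but not its exact form. The sketch below shows one way such a loss could be assembled, assuming a margin-based contrastive term, a KL penalty against the base model's output distribution, and a linear curriculum ramp. The function names, the margin, the weights, and the ramp schedule are all hypothetical, and the paper's actual formulation may differ.

```python
import math


def contrastive_margin_loss(logp_factual, logp_hallucinated, margin=1.0):
    """Penalize cases where the factual answer is not preferred by at least `margin`."""
    return max(0.0, margin - (logp_factual - logp_hallucinated))


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def curriculum_weight(step, total_steps):
    """Assumed linear ramp from 0 to 1 over training."""
    return min(1.0, step / max(1, total_steps))


def composite_loss(logp_factual, logp_hallucinated, prompted_dist, base_dist,
                   step, total_steps, w_contrastive=1.0, w_kl=0.1):
    """Weighted sum of a ramped contrastive term and a KL regularizer (illustrative)."""
    ramp = curriculum_weight(step, total_steps)
    contrastive = contrastive_margin_loss(logp_factual, logp_hallucinated)
    regularizer = kl_divergence(prompted_dist, base_dist)
    return w_contrastive * ramp * contrastive + w_kl * regularizer


# Hallucinated answer currently preferred, halfway through the assumed curriculum.
print(composite_loss(-3.0, -1.0, [0.5, 0.5], [0.5, 0.5], step=5, total_steps=10))
```

The KL term is what keeps the prompted distribution close to the base model's, which is one plausible route to preserving factual recall. The scalar weights are exactly the kind of free parameters a reader would want reported.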

Load-bearing premise

The LLM-as-a-Judge evaluation on the five QA datasets accurately measures hallucination, abstention, and recall in a way that matches human judgment.

What would settle it

Human evaluation of model outputs on the same five datasets that shows the LLM judge systematically over- or under-estimates hallucination rates compared to human raters.
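A check of this kind could be scripted along the following lines. The sketch computes Cohen's kappa between judge and human labels, plus the signed gap in the rate of a given label, which is positive when the judge over-reports that label. The label names and the example annotations are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two equally long label sequences."""
    if not labels_a or len(labels_a) != len(labels_b):
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


def rate_gap(judge_labels, human_labels, label):
    """Judge rate minus human rate for one label; positive means the judge over-reports it."""
    judge_rate = judge_labels.count(label) / len(judge_labels)
    human_rate = human_labels.count(label) / len(human_labels)
    return judge_rate - human_rate


# Invented annotations for six answers (placeholders, not results).
judge = ["correct", "hallucinated", "abstained", "correct", "correct", "hallucinated"]
human = ["correct", "hallucinated", "abstained", "hallucinated", "correct", "hallucinated"]
print(cohens_kappa(judge, human))
print(rate_gap(judge, human, "hallucinated"))
```

A low kappa, or a hallucination-rate gap that keeps the same sign across datasets, would point to the systematic judge bias described above.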

Figures

Figures reproduced from arXiv: 2606.00919 by Akib Jawad Ononto, Anoop Singhal, Latifur Khan, S M Tahmid Siddiqui.

Figure 1: Two-phase RCSP workflow. Phase 1 (offline) trains soft prompts using a composite loss function, then evaluates the learned soft prompts using a judge LLM to identify the best-performing one. Phase 2 (deployment) prepends the soft prompt selected in Phase 1 to the user question and feeds the result to the base LLM to generate the final answer. Phase 1 is performed once prior to deployment while Phase 2 executes f…
Figure 2
Original abstract

Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient technique that trains soft prompts via a composite loss (contrastive + curriculum + KL regularization) to suppress hallucinations, encourage abstention under uncertainty, and preserve factual recall in generative QA. It reports results on five datasets using an LLM-as-a-Judge evaluation, claiming generally superior F-scores on Gemma 3 (12B) and Llama 3.1 (8B) backbones relative to standard reasoning and instruction-based prompting baselines, while tuning only a small fraction of parameters.

Significance. If the empirical claims hold under validated metrics, the work would be significant for demonstrating a modular, low-parameter path to improved LLM reliability that explicitly trades off recall against suppression and abstention. The explicit composite-loss design and focus on parameter efficiency are clear strengths that distinguish it from full fine-tuning approaches.

major comments (2)
  1. [Experimental results / LLM-as-a-Judge framework] The headline result (superior F-scores via RCSP) is measured exclusively through an LLM-as-a-Judge framework on the five generative QA datasets. No human validation, inter-annotator agreement, judge-prompt ablation, or calibration against human labels is reported, which directly undermines whether the reported balance of recall, suppression, and abstention reflects genuine model behavior rather than judge bias.
  2. [Experimental results] The abstract and results description provide no quantitative details on effect sizes, variance across runs, statistical significance tests, or confidence intervals for the F-score improvements. This information is required to assess whether the gains are robust or could arise from post-hoc component selection in the composite loss.
minor comments (2)
  1. [Abstract] The abstract would benefit from reporting the actual F-score values (or deltas) rather than the qualitative claim of 'generally superior.'
  2. [Method] Notation for the three loss terms and their weighting schedule should be introduced with explicit equations in the method section to improve reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate the corresponding revisions.

Point-by-point responses
  1. Referee: [Experimental results / LLM-as-a-Judge framework] The headline result (superior F-scores via RCSP) is measured exclusively through an LLM-as-a-Judge framework on the five generative QA datasets. No human validation, inter-annotator agreement, judge-prompt ablation, or calibration against human labels is reported, which directly undermines whether the reported balance of recall, suppression, and abstention reflects genuine model behavior rather than judge bias.

    Authors: We agree that the absence of human validation or calibration is a limitation. LLM-as-a-Judge is used for scalability on generative outputs, but we will add a limitations section discussing potential judge bias and include a judge-prompt ablation study in the revision. A full human evaluation on all datasets is not feasible within this revision cycle due to annotation costs and time, though we plan it for future work. revision: partial

  2. Referee: [Experimental results] The abstract and results description provide no quantitative details on effect sizes, variance across runs, statistical significance tests, or confidence intervals for the F-score improvements. This information is required to assess whether the gains are robust or could arise from post-hoc component selection in the composite loss.

    Authors: We accept this point. Experiments used three random seeds; we will revise the results section and abstract to report mean F-scores with standard deviations, Cohen's d effect sizes, and p-values from paired statistical tests (e.g., Wilcoxon) against baselines. These details were computed internally and will be added to demonstrate robustness. revision: yes
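The statistics promised in this response could be computed roughly as follows. This is a sketch under the assumption that results are paired across seeds; the F-score values are made-up placeholders, not numbers from the paper, and the helper names are hypothetical. The Wilcoxon routine is an exact enumeration that assumes no zero differences and no tied absolute differences.

```python
from itertools import product
from statistics import mean, stdev


def paired_cohens_d(treated, baseline):
    """Cohen's d for paired samples: mean difference over the SD of the differences."""
    diffs = [t - b for t, b in zip(treated, baseline)]
    return mean(diffs) / stdev(diffs)


def exact_wilcoxon_p(treated, baseline):
    """Exact two-sided Wilcoxon signed-rank p-value by enumerating all sign patterns."""
    diffs = [t - b for t, b in zip(treated, baseline)]
    if any(d == 0 for d in diffs):
        raise ValueError("zero differences are not handled in this sketch")
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(rank for rank, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return extreme / 2 ** n


# Placeholder F-scores for three seeds (illustrative only).
rcsp = [0.62, 0.64, 0.63]
baseline = [0.58, 0.59, 0.57]
print(f"RCSP: {mean(rcsp):.3f} +/- {stdev(rcsp):.3f}")
print(f"Cohen's d: {paired_cohens_d(rcsp, baseline):.2f}")
print(f"Wilcoxon p: {exact_wilcoxon_p(rcsp, baseline):.2f}")
```

One consequence worth noting: with only three paired observations, the exact two-sided Wilcoxon p-value cannot fall below 0.25, even when every seed favors RCSP. If the pairing is across seeds, the test cannot reach conventional significance thresholds; pairing across questions or datasets would give it more power.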

Circularity Check

0 steps flagged

No circularity: empirical method with explicitly defined composite loss and external evaluation

full rationale

The paper presents RCSP as a training procedure on soft prompts using a composite loss (contrastive + curriculum + KL) whose components are standard techniques applied to the three stated goals; the central claim is an empirical F-score comparison on Gemma 3 and Llama 3.1 backbones against prompting baselines, measured on five QA datasets via LLM-as-a-Judge. No equations, uniqueness theorems, or predictions are shown that reduce by construction to fitted parameters or self-citations. The derivation chain consists of method definition followed by independent experimental measurement, satisfying the self-contained criterion.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on standard assumptions in prompt tuning and contrastive learning rather than new invented entities. The composite loss weights are free parameters whose values are not reported.

free parameters (1)
  • composite loss weights
    The loss balances three goals (suppression, abstention, recall) and therefore requires scalar coefficients whose specific values are not stated in the abstract.
axioms (1)
  • domain assumption: Soft prompts can be trained to control high-level behavioral properties such as hallucination rate and abstention in frozen LLMs.
    This is the core premise that allows the method to work with only a small number of trainable parameters.
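To give a sense of scale for "a small number of trainable parameters", the sketch below counts the parameters of a soft prompt and compares them with a backbone of nominal size. The prompt length of 20 virtual tokens, the hidden size of 4096, and the round figure of 8 billion backbone parameters are assumed values chosen for illustration; the source does not state the paper's actual prompt length.

```python
def soft_prompt_param_count(num_virtual_tokens, hidden_size):
    """A soft prompt holds one trainable vector of size hidden_size per virtual token."""
    return num_virtual_tokens * hidden_size


def trainable_fraction(num_virtual_tokens, hidden_size, backbone_params):
    """Share of parameters that are trainable when the backbone itself is frozen."""
    return soft_prompt_param_count(num_virtual_tokens, hidden_size) / backbone_params


# Assumed illustrative values: 20 virtual tokens, hidden size 4096, 8e9 backbone parameters.
count = soft_prompt_param_count(20, 4096)
fraction = trainable_fraction(20, 4096, 8_000_000_000)
print(count)              # 81920
print(f"{fraction:.2e}")  # 1.02e-05
```

Under these assumptions the trainable share is on the order of one part in a hundred thousand, which is far smaller than typical adapter or low-rank tuning budgets.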

