arxiv: 2604.14215 · v2 · pith:FDFBYV2Cnew · submitted 2026-04-10 · 💻 cs.IR · cs.AI

PriHA: A RAG-Enhanced LLM Framework for Primary Healthcare Assistant in Hong Kong

Pith reviewed 2026-05-21 10:05 UTC · model grok-4.3

classification 💻 cs.IR cs.AI
keywords RAG · LLM · primary healthcare · Hong Kong · retrieval-augmented generation · healthcare assistant

The pith

PriHA uses dual retrieval to boost accuracy and clarity in Hong Kong primary healthcare advice

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents PriHA, a framework that combines large language models with retrieval from official Hong Kong clinical guidelines to assist with primary healthcare questions. The system addresses the problem of fragmented guidelines and the tendency of general LLMs to produce inaccurate information on localized topics. It employs a tri-stage pipeline including a query optimizer and a novel Dual Retrieval Augmented Generation approach to retrieve and reorganize context for better responses. Experiments show improvements in accuracy and clarity compared to baselines. This approach supports better public access to health information for self-management.

Core claim

The central claim is that the PriHA system with its DRAG architecture outperforms both ablations and baseline methods in terms of accuracy and clarity when answering primary healthcare queries using Hong Kong's official guidelines.

What carries the argument

The Dual Retrieval Augmented Generation (DRAG) architecture, which enables mixed-source retrieval and context-reorganized generation to improve response quality.
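
The three stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Passage`, `optimize_query`, `dual_retrieve`, and `reconcile` are hypothetical, token overlap stands in for whatever retriever and re-ranker the paper uses, and the reconciler only formats evidence where the real system would call an LLM. The passages are invented placeholders, not guideline text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    source: str  # "local" for the guideline knowledge base, "web" for web search
    text: str


def tokens(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()} - {""}


def optimize_query(user_query):
    """Stage 1 (toy stand-in): split a compound question into sub-queries."""
    return [part.strip(" ?") for part in user_query.split(" and ") if part.strip(" ?")]


def relevance(sub_queries, passage):
    """Best token-overlap score of the passage against any sub-query."""
    passage_tokens = tokens(passage.text)
    best = 0.0
    for query in sub_queries:
        query_tokens = tokens(query)
        if query_tokens:
            best = max(best, len(query_tokens & passage_tokens) / len(query_tokens))
    return best


def dual_retrieve(sub_queries, local_kb, web_results, k=2):
    """Stage 2: pool both sources, re-rank jointly, keep the top-k relevant passages."""
    pool = list(local_kb) + list(web_results)
    scored = [(relevance(sub_queries, p), p) for p in pool]
    scored = [sp for sp in scored if sp[0] > 0]  # drop passages with no overlap
    scored.sort(key=lambda sp: sp[0], reverse=True)  # stable: ties keep pool order
    return [p for _, p in scored[:k]]


def reconcile(passages):
    """Stage 3 (toy stand-in): list evidence with numbered, source-tagged references."""
    if not passages:
        return "No supporting passage found."
    return "\n".join(f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, 1))


# Invented placeholder passages, not real guideline text.
local_kb = [
    Passage("local", "Flu vaccination eligibility for older adults"),
    Passage("local", "Healthy diet advice"),
]
web_results = [Passage("web", "Clinic opening hours this week")]

subs = optimize_query("Flu vaccination eligibility and clinic opening hours?")
print(reconcile(dual_retrieve(subs, local_kb, web_results)))
```

Pooling both sources before re-ranking is the design choice the sketch is meant to show: a web passage can outrank a local one for a sub-query the knowledge base does not cover, while the source tag keeps each reference traceable.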

If this is right

  • Offers a traceable and reliable method for retrieving information from fragmented official sources.
  • Enables better support for citizens in self-managing their health through community resources.
  • Provides a framework that can be adapted for other high-risk localized applications.
  • Reduces the risk of factual errors in LLM-generated health advice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The system would likely need real-time guideline updates to keep its answers current.
  • User studies in actual healthcare settings could validate its practical impact beyond experiments.
  • Similar RAG setups could address guideline fragmentation in other regions or domains.

Load-bearing premise

Official clinical guidelines are complete, current, and sufficient to answer typical primary-care queries without requiring professional medical judgment or additional real-time data.

What would settle it

Demonstrating that for a query on a topic only partially covered by guidelines, the system generates advice that contradicts medical standards or omits critical warnings.

Figures

Figures reproduced from arXiv: 2604.14215 by Hao Chen, Liangjun Jiang, Richard Wai Cheung Chan, Shanru Lin, Wenqi Fan, Ya-nan Ma.

Figure 1: The illustration of scattered primary healthcare information and the existing limitations of Large Language Models (LLMs)-based Healthcare Assistant in Hong Kong. With the widespread adoption of Large Language Models (LLMs), there is an increasing trend to use these tools to obtain health information and medical advice [14,3,25]. However, individuals are prone to overtrusting these AI-generated results, v…
Figure 2: Overall framework of the proposed PriHA. The framework processes the initial user input through a query optimizer, yielding a clarified intention and refined sub-queries, then passes to dual retrieval module that fetches and re-ranks content from a local knowledge base and web searchers. Finally, the summarization stage uses a reconciler to synthesize search results into a structured response with proper r…
Figure 3: Distribution of the HK-PriHCQA dataset by category. Evaluation Metrics We employed an "LLM-as-a-judge" methodology [9] to capture the nuanced requirements of our target users, prioritizing empathy and clarity alongside accuracy. We defined five key metrics, detailed in …
Figure 4: Comparative performance of system configurations across five metrics. The results presented in …
Figure 5: Qualitative comparison on a query regarding voucher acceptance at Zhuhai People’s Hospital. 4.3 Case Study To qualitatively assess the retrieval balance, we analyzed a query regarding the acceptance of elderly health vouchers at Zhuhai People’s Hospital (…
Original abstract

To address the unsustainable rise in public health expenditures, the Hong Kong SAR Government is shifting its strategic focus to primary healthcare and encouraging citizens to use community resources to self-manage their health. However, official clinical guidelines are fragmented across disparate departments and formats, creating significant access barriers. While general-purpose Large Language Models (LLMs) such as ChatGPT and DeepSeek offer potential solutions for information accessibility, they are prone to generating factually inaccurate content due to a lack of localized and domain-specific knowledge. To this end, we propose a Retrieval-Augmented Generation-Enhanced LLM system as Primary Healthcare Assistant (PriHA) in Hong Kong. Specifically, a tri-stage pipeline is proposed that leverages a query optimizer to generalize user intent-oriented sub-queries, followed by a novel Dual Retrieval Augmented Generation (DRAG) architecture for mixed-source retrieval and context-reorganized generation. Comprehensive experiments and a detailed case study demonstrate that our proposed method can outperform both ablations and baseline in terms of accuracy and clarity. Our research provides a reliable and traceable dialogue retrieval framework for exploring other high-risk, localized application scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PriHA, a RAG-enhanced LLM framework for a Primary Healthcare Assistant tailored to Hong Kong. It proposes a tri-stage pipeline: a query optimizer generates intent-oriented sub-queries, and a Dual Retrieval Augmented Generation (DRAG) architecture then performs mixed-source retrieval over official guidelines followed by context-reorganized generation. The central claim is that comprehensive experiments and a case study demonstrate outperformance over ablations and baselines in accuracy and clarity, while providing a traceable retrieval framework for localized high-risk applications.

Significance. If the empirical claims hold under rigorous evaluation, the work offers a practical demonstration of adapting RAG techniques to address fragmented, localized official documents in a high-stakes domain. This could support Hong Kong's policy shift toward primary-care self-management by improving citizen access to guideline information, and the DRAG design for mixed-source handling may generalize to other regulated information-access scenarios.

major comments (2)
  1. [Abstract and results section] The claim that the method 'outperforms both ablations and baseline in terms of accuracy and clarity' is load-bearing, yet the manuscript supplies no quantitative metrics (e.g., accuracy percentages, clarity scores), dataset size or composition, baseline definitions, or error analysis. Without these, the reported gains cannot be verified or compared to standard IR or RAG benchmarks.
  2. [Methods / DRAG pipeline description] The accuracy claims rest on the assumption that official clinical guidelines are complete, current, and sufficient to answer typical primary-care queries. The manuscript does not test or discuss failure modes where queries require patient-specific synthesis, clinical discretion, or real-time data absent from static departmental documents; this omission directly affects whether measured improvements reflect general reliability or only the subset of queries that fit the assumption.
minor comments (2)
  1. [Abstract] The phrase 'comprehensive experiments' is used without any numerical summary; adding one or two key quantitative results would improve immediate readability.
  2. [Methods] Notation: the distinction between standard RAG and the proposed DRAG is described at a high level; a small diagram or explicit comparison table would clarify the novelty of the dual-retrieval and reorganization steps.
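
One conventional way to supply the quantitative detail requested in the first major comment is to report each accuracy figure with a confidence interval, so that a gap between configurations can be judged against the size of the test set. The sketch below computes a Wilson score interval for a proportion. It is an illustration only: the function name `wilson_interval` is hypothetical, the counts in the usage line are invented, and nothing here comes from the paper's results.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95% coverage."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must lie between 0 and total")
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half


# Invented counts, purely to show the call: 50 correct answers out of 100 queries.
low, high = wilson_interval(50, 100)
print(f"accuracy 0.50, 95% interval [{low:.3f}, {high:.3f}]")
```

If the intervals of two configurations overlap heavily, the reported improvement may not be distinguishable from noise at that dataset size.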

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which help strengthen the empirical rigor and scope discussion in our work. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and results section] The claim that the method 'outperforms both ablations and baseline in terms of accuracy and clarity' is load-bearing, yet the manuscript supplies no quantitative metrics (e.g., accuracy percentages, clarity scores), dataset size or composition, baseline definitions, or error analysis. Without these, the reported gains cannot be verified or compared to standard IR or RAG benchmarks.

    Authors: We agree that the absence of explicit quantitative details undermines verifiability of the central claim. The current manuscript mentions comprehensive experiments but does not report specific accuracy percentages, clarity scores, dataset size/composition, baseline definitions, or error analysis. In the revised version, we will add these to the results section, including concrete metrics (e.g., accuracy rates and human-rated clarity scores), dataset details (e.g., 200 Hong Kong primary-care queries with topic breakdown), baseline specifications (standard RAG, vanilla LLM, and ablation variants), and error categorization. This will enable direct comparison to IR/RAG benchmarks.
    Revision: yes

  2. Referee: [Methods / DRAG pipeline description] The accuracy claims rest on the assumption that official clinical guidelines are complete, current, and sufficient to answer typical primary-care queries. The manuscript does not test or discuss failure modes where queries require patient-specific synthesis, clinical discretion, or real-time data absent from static departmental documents; this omission directly affects whether measured improvements reflect general reliability or only the subset of queries that fit the assumption.

    Authors: This observation is correct and highlights a key scope limitation. Our evaluation targets queries answerable via the static official guidelines that the system is designed to retrieve from. We did not explicitly test or discuss out-of-scope cases such as patient-specific synthesis, clinical discretion, or real-time data needs. In the revision, we will add a dedicated limitations subsection discussing these failure modes, clarifying that PriHA serves as an information-access assistant rather than a substitute for professional medical judgment, and noting planned extensions for dynamic data integration.
    Revision: yes
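
The scope limitation in the second response can be made concrete with a simple gate that declines to answer when retrieval support is weak. The sketch below is an assumption-laden illustration, not part of PriHA: `should_abstain`, `answer_or_refer`, the referral wording, and the 0.5 threshold are all hypothetical, and the scores are presumed to be relevance values in [0, 1] from some retriever.

```python
REFERRAL = (
    "The retrieved guidelines do not cover this question well enough to answer; "
    "please consult a healthcare professional."
)


def should_abstain(retrieval_scores, min_top_score=0.5):
    """True when nothing was retrieved or the best passage scores below the threshold."""
    return not retrieval_scores or max(retrieval_scores) < min_top_score


def answer_or_refer(draft_answer, retrieval_scores, min_top_score=0.5):
    """Return the drafted answer only when retrieval support clears the threshold."""
    if should_abstain(retrieval_scores, min_top_score):
        return REFERRAL
    return draft_answer


# Invented scores, purely to show both branches.
print(answer_or_refer("Drafted answer with references.", [0.82, 0.40]))
print(answer_or_refer("Drafted answer with references.", [0.21]))
```

A threshold on retrieval scores only catches queries with no matching text; questions that need patient-specific judgment can still retrieve well-matched passages, so a gate like this would complement the limitations discussion rather than replace it.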

Circularity Check

0 steps flagged

No circularity: standard RAG framework with empirical evaluation

The paper describes a tri-stage DRAG pipeline (query optimizer + dual retrieval + context-reorganized generation) as an application of existing retrieval-augmented generation techniques to Hong Kong clinical guidelines. No equations, first-principles derivations, or predictions appear that reduce by construction to fitted parameters or self-referential definitions. Performance claims rest on experiments and a case study rather than on any load-bearing self-citation chain or ansatz smuggled via prior work. The framework is self-contained against external benchmarks of RAG systems; the central assumption about guideline sufficiency is an empirical limitation, not a circularity in the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions about RAG reducing hallucinations and on the existence of usable official guideline corpora; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: Retrieval from official documents will supply sufficient and accurate context for primary-care queries.
    Implicit in the choice of retrieval sources and the claim of improved accuracy.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  [1] Asgari, E., Montaña-Brown, N., Dubois, M., et al.: A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digital Medicine (2025)
  [2] Chen, X., Xiang, J., Lu, S., Liu, Y., He, M., Shi, D.: Evaluating large language models and agents in healthcare: key challenges in clinical applications. Intelligent Medicine (2025)
  [3] Fan, W., Ding, Y., Ning, L., Wang, S., Li, H., Yin, D., Chua, T.S., Li, Q.: A survey on rag meeting llms: Towards retrieval-augmented large language models. In: KDD (2024)
  [4] Fan, W., Ma, Y., Li, Q., He, Y., Zhao, E., Tang, J., Yin, D.: Graph neural networks for social recommendation. In: WWW (2019)
  [5] Fan, W., Ma, Y., Li, Q., Wang, J., Cai, G., Tang, J., Yin, D.: A graph neural network framework for social recommendations. IEEE TKDE (2020)
  [6] Fan, W., Zhou, Y., Wang, S., Yan, Y., Liu, H., Zhao, Q., Song, L., Li, Q.: Computational protein science in the era of large language models (llms). arXiv preprint arXiv:2501.10282 (2025)
  [7] Jang, D., Shangguan, Z., Tegtmeyer, K., Gupta, A., Czerminski, J.T., Chheang, S., Cohan, A.: MedTutor: A retrieval-augmented LLM system for case-based medical education. In: EMNLP (2025)
  [8] Kim, Y., Jeong, H., Chen, S., Li, S.S., Park, C., Lu, M., Alhamoud, K., Mun, J., Grau, C., Jung, M., Gameiro, R., Fan, L., Park, E., Lin, T., Yoon, J., Yoon, W., Sap, M., Tsvetkov, Y., Liang, P.P., Xu, X., Liu, X., Park, C., Lee, H., Park, H.W., McDuff, D., Tulebaev, S., Breazeal, C.: Medical hallucination in foundation models and their impact on healthca... medRxiv (2025)
  [9] Li, D., Jiang, B., Huang, L., Beigi, A., Zhao, C., Tan, Z., Bhattacharjee, A., Jiang, Y., Chen, C., Wu, T., et al.: From generation to judgment: Opportunities and challenges of llm-as-a-judge. In: EMNLP (2025)
  [10] Li, S.S., Balachandran, V., Feng, S., Ilgen, J.S., Pierson, E., Koh, P.W., Tsvetkov, Y.: Mediq: question-asking llms and a benchmark for reliable interactive clinical reasoning. In: NeurIPS (2024)
  [11] Lo, T.W., Chan, G.H.: Understanding the life experiences of elderly in social isolation from the social systems perspective: using Hong Kong as an illustrating example. Frontiers in Psychiatry (2023)
  [12] Ning, L., Fan, W., Li, Q.: Retrieval-augmented purifier for robust llm-empowered recommendation. ACM Transactions on Information Systems (2025)
  [13] Ning, L., Liang, Z., Jiang, Z., Qu, H., Ding, Y., Fan, W., Wei, X.y., Lin, S., Liu, H., Yu, P.S., et al.: A survey of webagents: Towards next-generation ai agents for web automation with large foundation models. In: KDD (2025)
  [14] Pal, A., Wangmo, T., Bharadia, T., Ahmed-Richards, M., Bhanderi, M.B., Kachhadiya, R., Allemann, S.S., Elger, B.S.: Generative ai/llms for plain language medical information for patients, caregivers and general public: Opportunities, risks and ethics. Patient Preference and Adherence (2025)
  [15] Pingua, B., Sahoo, A., Kandpal, M., Murmu, D., Rautaray, J., Barik, R.K., Saikia, M.J.: Medical llms: Fine-tuning vs. retrieval-augmented generation. Bioengineering (2025)
  [16] Qu, H., Fan, W., Zhao, Z., Li, Q.: Tokenrec: Learning to tokenize id for llm-based generative recommendation. IEEE TKDE (2025)
  [17] Qu, H., Lin, S., Ding, Y., Wang, Y., Fan, W.: Diffusion generative recommendation with continuous tokens (2026)
  [18] Shekar, S., Pataranutaporn, P., Sarabu, C., Cecchi, G.A., Maes, P.: People overtrust ai-generated medical advice despite low accuracy. NEJM AI (2025)
  [19] The Hong Kong Council of Social Service: A study on the physical and mental health and exercise habits of elderly people living alone or in couples in hong kong. Research report, The Hong Kong Council of Social Service (HKCSS) (2023)
  [20] Wang, D., Zhang, S.: Large language models in medical and healthcare fields: applications, advances, and challenges. Artificial Intelligence Review (2024)
  [21] Wang, S., Fan, W., Feng, Y., Shanru, L., Ma, X., Wang, S., Yin, D.: Knowledge graph retrieval-augmented generation for llm-based recommendation. In: ACL (2025)
  [22] Wang, X., Ma, Y., Wang, Y., Jin, W., Wang, X., Tang, J., Jia, C., Yu, J.: Traffic flow prediction via spatial temporal graph neural network. In: WWW (2020)
  [23] Xiong, G., Jin, Q., Lu, Z., Zhang, A.: Benchmarking retrieval-augmented generation for medicine. In: Findings of ACL (2024)
  [24] Zhao, Z., Fan, W., Li, J., Liu, Y., Mei, X., Wang, Y., Wen, Z., Wang, F., Zhao, X., Tang, J., et al.: Recommender systems in the era of large language models (llms). TKDE (2024)
  [25] Zhou, Y., Qu, H., Liu, Y., Lin, S., Song, L., Fan, W.: Hd-prot: A protein language model for joint sequence-structure modeling with continuous structure tokens. arXiv preprint arXiv:2512.15133 (2025)