CogRAG+: Cognitive-Level Guided Diagnosis and Remediation of Memory and Reasoning Deficiencies in Professional Exam QA

Xudong Wang; Zhaoyan Ming; Zilong Wang

arxiv: 2604.25928 · v1 · submitted 2026-04-01 · 💻 cs.CL

Pith reviewed 2026-05-13 22:25 UTC · model grok-4.3

classification 💻 cs.CL

keywords RAG · constrained reasoning · professional exam QA · dual-path retrieval · cognitive hierarchy · training-free · LLM knowledge gaps · Registered Dietitian exam

The pith

CogRAG+ separates retrieval from reasoning in LLMs using dual paths and structured templates to fix knowledge gaps on professional exams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CogRAG+, a training-free framework that aligns the retrieval-augmented generation process with human cognitive levels. It first applies Reinforced Retrieval through a judge that directs two separate paths—one for core facts and one for answer options—to fill missing knowledge before reasoning begins. It then replaces open-ended chain-of-thought with cognition-stratified Constrained Reasoning that uses fixed templates to limit logical drift and redundant output. Experiments on Qwen3-8B and Llama3.1-8B show clear accuracy gains on the Registered Dietitian qualification exam, reaching 85.8 percent and 60.3 percent respectively in single-question mode, while also lowering the rate of unanswered questions.

Core claim

CogRAG+ decouples the retrieval-augmented generation pipeline to align with human cognitive hierarchies through Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths, and cognition-stratified Constrained Reasoning that replaces unconstrained chain-of-thought generation with structured templates, yielding higher accuracy and fewer inconsistencies on specialized professional tasks.
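To make the Constrained Reasoning idea concrete, here is a minimal sketch of what a cognition-stratified template and its answer parser might look like. The level names (`remember`, `apply`), the slot wording, and the helper names are all assumptions made for illustration; the summary above does not give the paper's actual templates.

```python
import re
from typing import Dict, List, Optional, Set

# Hypothetical templates keyed by cognitive level. The level names and the
# step wording are illustrative assumptions, not the paper's prompts.
TEMPLATES: Dict[str, str] = {
    "remember": (
        "Question: {question}\nOptions: {options}\nEvidence: {evidence}\n"
        "Step 1 - Relevant fact:\n"
        "Final answer (one letter):"
    ),
    "apply": (
        "Question: {question}\nOptions: {options}\nEvidence: {evidence}\n"
        "Step 1 - Relevant fact:\n"
        "Step 2 - Apply the fact to the case:\n"
        "Step 3 - Eliminate inconsistent options:\n"
        "Final answer (one letter):"
    ),
}

def build_prompt(level: str, question: str, options: Dict[str, str], evidence: str) -> str:
    """Fill the fixed template for the given cognitive level."""
    opts = "; ".join(f"{label}. {text}" for label, text in sorted(options.items()))
    return TEMPLATES[level].format(question=question, options=opts, evidence=evidence)

# Accepts "Final answer: B", "Final answer (one letter): B", "Final answer: (B)".
ANSWER_RE = re.compile(r"Final answer[^:]*:\s*\(?([A-Za-z])\)?\b")

def parse_answer(output: str, valid: Set[str]) -> Optional[str]:
    """Return the chosen option letter, or None if the output is unanswered."""
    match = ANSWER_RE.search(output)
    if match is None:
        return None
    letter = match.group(1).upper()
    return letter if letter in valid else None

def unanswered_rate(outputs: List[str], valid: Set[str]) -> float:
    """Fraction of model outputs from which no valid option could be parsed."""
    return sum(parse_answer(o, valid) is None for o in outputs) / len(outputs)
```

A fixed final slot gives the parser a single place to look, which is one plausible reason a template could lower the unanswered rate; whether the paper's templates work this way is not stated here.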

What carries the argument

Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths, paired with cognition-stratified Constrained Reasoning that enforces structured templates.
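The dual-path idea can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the `MCQ` type, the keyword-overlap retriever, and the rule-based `rule_judge` (standing in for the LLM judge, whose identity and prompt are not described) are all hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

@dataclass
class MCQ:
    stem: str
    options: Dict[str, str]

Retriever = Callable[[str, int], List[str]]  # (query, k) -> top-k passages
Judge = Callable[[MCQ], str]                 # returns "fact" or "option"

def _tokens(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def keyword_retriever(corpus: List[str]) -> Retriever:
    """Toy retriever: rank passages by token overlap with the query."""
    def retrieve(query: str, k: int) -> List[str]:
        q = _tokens(query)
        ranked = sorted(corpus, key=lambda p: -len(q & _tokens(p)))
        return ranked[:k]
    return retrieve

def rule_judge(q: MCQ) -> str:
    """Stand-in for the LLM judge: a hard-coded rule, purely illustrative."""
    return "option" if "which of the following" in q.stem.lower() else "fact"

def reinforced_retrieve(q: MCQ, retrieve: Retriever, judge: Judge,
                        k: int = 1) -> Tuple[str, List[str]]:
    """Route the question down one of two retrieval paths chosen by the judge."""
    path = judge(q)
    if path == "fact":
        # Fact-centric path: query with the question stem only.
        return path, retrieve(q.stem, k)
    # Option-centric path: one query per answer option, de-duplicated.
    evidence: List[str] = []
    for label in sorted(q.options):
        for passage in retrieve(f"{q.stem} {q.options[label]}", k):
            if passage not in evidence:
                evidence.append(passage)
    return path, evidence

CORPUS = [
    "Vitamin C deficiency causes scurvy.",
    "Vitamin D deficiency causes rickets in children.",
    "Iron deficiency causes anemia.",
]
question = MCQ("Which of the following deficiencies causes rickets?",
               {"A": "Vitamin C", "B": "Vitamin D"})
path, evidence = reinforced_retrieve(question, keyword_retriever(CORPUS), rule_judge)
```

Because the judge and retriever are passed in as plain callables, either one can be swapped for a real model or index without touching the routing logic.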

If this is right

Overall accuracy on the Registered Dietitian exam rises to 85.8 percent for Qwen3-8B and 60.3 percent for Llama3.1-8B in single-question mode.
Constrained Reasoning reduces the unanswered rate from 7.6 percent to 1.4 percent.
The method outperforms both general-purpose models and standard RAG baselines without any fine-tuning.
It supplies a model-agnostic route to expert-level performance in other specialized domains.
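For scale, the unanswered-rate figures above can be restated as an absolute and a relative reduction. The short sketch below only does that arithmetic on the two reported percentages; the function name is illustrative, and the summary does not say which model or setting the 7.6 and 1.4 percent figures belong to.

```python
from typing import Tuple

def summarize_drop(before_pct: float, after_pct: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct * 100
    return round(absolute, 1), round(relative, 1)

# Reported unanswered rates: 7.6 percent before, 1.4 percent after.
absolute_points, relative_percent = summarize_drop(7.6, 1.4)
print(absolute_points, relative_percent)  # 6.2 81.6
```

Read this way, the reduction is 6.2 percentage points, or roughly 82 percent of the original unanswered questions. It also bounds how much of any accuracy gain could come simply from answering more questions: at most 6.2 points, if every newly answered question were correct.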

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same separation of fact retrieval from option evaluation could be applied to medical or legal question-answering tasks.
Fixed reasoning templates might reduce inconsistency in open-ended professional advice generation beyond exam settings.
Performance would likely degrade if the judge component were weaker than the main model or trained on mismatched data.

Load-bearing premise

A judge-driven dual-path retrieval strategy can reliably identify and supply missing foundational knowledge without domain-specific tuning or additional training data.

What would settle it

Replace the judge model with random retrieval selection and measure whether the accuracy gains on the dietitian exam disappear.
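A control of this kind could be wired up roughly as follows. This is a sketch under assumptions: `answer_fn` stands in for the full retrieve-then-reason pipeline, the question dictionaries with `gold` and `needs` keys are toy data, and none of these names come from the paper.

```python
import random
from typing import Callable, Dict, List

Question = Dict[str, str]
Selector = Callable[[Question], str]          # returns "fact" or "option"
AnswerFn = Callable[[Question, str], str]     # (question, path) -> predicted label

def make_random_selector(seed: int) -> Selector:
    """Control condition: pick a retrieval path uniformly at random (seeded)."""
    rng = random.Random(seed)
    return lambda question: rng.choice(["fact", "option"])

def accuracy_with_selector(questions: List[Question], select_path: Selector,
                           answer_fn: AnswerFn) -> float:
    """Run every question through the chosen path and score against gold labels."""
    correct = 0
    for q in questions:
        path = select_path(q)
        correct += int(answer_fn(q, path) == q["gold"])
    return correct / len(questions)

# Toy data: each question is only answerable via the path named in "needs".
toy_questions = [
    {"gold": "A", "needs": "fact"},
    {"gold": "B", "needs": "option"},
    {"gold": "C", "needs": "fact"},
    {"gold": "D", "needs": "option"},
]
toy_answer = lambda q, path: q["gold"] if path == q["needs"] else "X"
oracle_judge = lambda q: q["needs"]

judge_acc = accuracy_with_selector(toy_questions, oracle_judge, toy_answer)
random_acc = accuracy_with_selector(toy_questions, make_random_selector(0), toy_answer)
```

If the real judge's accuracy were indistinguishable from the random selector's on the actual exam, the gains would have to be attributed to something other than path selection.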

Original abstract

Professional domain knowledge underpins human civilization, serving as both the basis for industry entry and the core of complex decision-making and problem-solving. However, existing large language models often suffer from opaque inference processes in which retrieval and reasoning are tightly entangled, causing knowledge gaps and reasoning inconsistencies in professional tasks. To address this, we propose CogRAG+, a training-free framework that decouples and aligns the retrieval-augmented generation pipeline with human cognitive hierarchies. First, we introduce Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths that strengthens retrieval and mitigates cascading failures caused by missing foundational knowledge. We then develop cognition-stratified Constrained Reasoning, which replaces unconstrained chain-of-thought generation with structured templates to reduce logical inconsistency and generative redundancy. Experiments on two representative models, Qwen3-8B and Llama3.1-8B, show that CogRAG+ consistently outperforms general-purpose models and standard RAG methods on the Registered Dietitian qualification exam. In single-question mode, it raises overall accuracy to 85.8% for Qwen3-8B and 60.3% for Llama3.1-8B, with clear gains over vanilla baselines. Constrained Reasoning also reduces the unanswered rate from 7.6% to 1.4%. CogRAG+ offers a robust, model-agnostic path toward training-free expert-level performance in specialized domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

CogRAG+ gets solid accuracy gains on the dietitian exam by splitting retrieval into judge-chosen fact or option paths and then using structured reasoning templates, but the judge mechanism is too lightly described to assess its real contribution.

The main point is that this training-free pipeline lifts performance on the Registered Dietitian qualification exam. Qwen3-8B reaches 85.8% accuracy and Llama3.1-8B reaches 60.3% in single-question mode, with a drop in unanswered rate from 7.6% to 1.4%. The gains come from Reinforced Retrieval, where a judge picks between a fact-centric path and an option-centric path, plus cognition-stratified Constrained Reasoning that replaces free-form chain-of-thought with fixed templates. That specific pairing is new enough to be worth noting even though both pieces draw from existing RAG and constrained-generation work. The paper shows the method works across two different 8B models and beats standard RAG baselines, which is a practical data point for exam-style QA. The approach stays model-agnostic on paper and avoids any fine-tuning, which keeps the claims modest and focused.

The soft spots sit in the missing implementation details. The judge is the load-bearing part of the retrieval step, yet there is no description of which model it is, how it is prompted, or any check that it actually supplies the knowledge the base model lacks. No ablation isolates the judge's selection accuracy, and the abstract supplies no error analysis, statistical tests, or full protocol. Without those, the numeric improvements cannot be reproduced or stress-tested.

The work is aimed at people building prompt-level fixes for multiple-choice professional exams. Readers who already run RAG experiments on domain tests will find the template structure and dual-path idea easy to try. It deserves peer review because the results are positive and the framing is clear, but any referee will need to require explicit judge specifications and basic controls before the claims can be taken as settled.

Referee Report

3 major / 2 minor

Summary. The paper proposes CogRAG+, a training-free framework that decouples retrieval-augmented generation from reasoning in LLMs for professional exam QA. It introduces Reinforced Retrieval, a judge-driven dual-path strategy (fact-centric and option-centric paths) to mitigate missing foundational knowledge, followed by cognition-stratified Constrained Reasoning using structured templates to reduce logical inconsistencies. Experiments on Qwen3-8B and Llama3.1-8B models on the Registered Dietitian qualification exam report accuracy gains to 85.8% and 60.3% respectively in single-question mode, plus a reduction in unanswered rate from 7.6% to 1.4%.

Significance. If the empirical claims hold under rigorous verification, the work provides a model-agnostic, training-free method for aligning LLM pipelines with human-like cognitive hierarchies in specialized domains. This could offer a practical route to expert-level performance on professional exams without additional training data or fine-tuning, with potential extensions to other knowledge-intensive tasks.

major comments (3)

[Abstract, §3] Abstract and §3 (Reinforced Retrieval): The headline accuracy figures (85.8% for Qwen3-8B, 60.3% for Llama3.1-8B) are presented without any description of the experimental protocol, baseline definitions (e.g., what constitutes 'vanilla RAG' or 'general-purpose models'), statistical tests, number of runs, or error analysis. This absence makes the central performance claims impossible to evaluate or reproduce.
[§3] §3 (Reinforced Retrieval): The judge-driven dual-path mechanism is claimed to be training-free and model-agnostic, yet no details are given on the judge model identity, its prompting template, or any verification that the judge can reliably detect missing foundational knowledge the base 8B model lacks. If the judge is the same model, the selection step inherits the original gap; if external, the framework is no longer uniformly training-free.
[§4] §4 (Experiments): No ablation studies isolate the contribution of the judge component, path-selection accuracy against ground-truth missing-knowledge cases, or the effect of Constrained Reasoning templates. Without these, it is unclear whether the reported gains are attributable to the proposed mechanisms or to other unstated factors.

minor comments (2)

[Abstract] The abstract states 'clear gains over vanilla baselines' but provides no quantitative comparison table or specific baseline scores in the visible text.
[§3] Notation for the dual paths (fact-centric vs. option-centric) is introduced without a formal definition or pseudocode, making the retrieval strategy difficult to implement from the description alone.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. We agree that several aspects of the experimental description require expansion for clarity and reproducibility, and we will revise the manuscript accordingly.

Referee: [Abstract, §3] Abstract and §3 (Reinforced Retrieval): The headline accuracy figures (85.8% for Qwen3-8B, 60.3% for Llama3.1-8B) are presented without any description of the experimental protocol, baseline definitions (e.g., what constitutes 'vanilla RAG' or 'general-purpose models'), statistical tests, number of runs, or error analysis. This absence makes the central performance claims impossible to evaluate or reproduce.

Authors: We agree that the initial submission lacked sufficient detail on the experimental setup. In the revised manuscript we will expand §4 (Experiments) with a complete protocol description, explicit baseline definitions (vanilla RAG as single-path retrieval using the same retriever and embedding model; general-purpose models as zero-shot prompting without retrieval), results averaged over multiple runs with standard deviations, appropriate statistical significance tests, and a categorized error analysis distinguishing retrieval failures from reasoning inconsistencies. These additions will make the reported gains fully evaluable and reproducible.

Revision: yes

Referee: [§3] §3 (Reinforced Retrieval): The judge-driven dual-path mechanism is claimed to be training-free and model-agnostic, yet no details are given on the judge model identity, its prompting template, or any verification that the judge can reliably detect missing foundational knowledge the base 8B model lacks. If the judge is the same model, the selection step inherits the original gap; if external, the framework is no longer uniformly training-free.

Authors: We will clarify in the revised §3 that the judge is an external, higher-capacity model chosen to reliably identify knowledge gaps beyond the base 8B models' capabilities, preserving the training-free property for the evaluated models. The revision will include the exact judge model identity, the full prompting template for dual-path selection, and a verification analysis (e.g., agreement rate with human annotations on a held-out subset of questions). This directly addresses concerns about gap inheritance and framework uniformity.

Revision: yes

Referee: [§4] §4 (Experiments): No ablation studies isolate the contribution of the judge component, path-selection accuracy against ground-truth missing-knowledge cases, or the effect of Constrained Reasoning templates. Without these, it is unclear whether the reported gains are attributable to the proposed mechanisms or to other unstated factors.

Authors: We concur that ablations are essential to attribute performance gains. The revised §4 will incorporate new ablation experiments: (i) full CogRAG+ versus variants without the judge-driven path selector, (ii) path-selection accuracy measured against ground-truth labels for missing-knowledge cases (obtained via manual annotation of a question subset), and (iii) Constrained Reasoning templates versus standard unconstrained chain-of-thought. These studies will isolate each component's contribution.

Revision: yes
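Two of the checks promised in these responses are simple to compute once per-question results exist. The sketch below shows one reasonable choice for each, and both are assumptions rather than the authors' stated methods: an exact McNemar test for comparing two system variants on the same questions, and Cohen's kappa for agreement between the judge's path choices and human annotations.

```python
from math import comb
from typing import List, Sequence

def mcnemar_exact(correct_a: Sequence[bool], correct_b: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired per-question correctness."""
    # Discordant pairs: A right and B wrong, and the reverse.
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree
    k = min(b, c)
    # Binomial(n, 0.5) tail probability, doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def cohens_kappa(labels_a: List[str], labels_b: List[str]) -> float:
    """Chance-corrected agreement between two label sequences."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    expected = sum((labels_a.count(cat) / n) * (labels_b.count(cat) / n)
                   for cat in categories)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Toy example: judge path choices versus hypothetical human annotations.
judge_paths = ["fact", "fact", "option", "option"]
human_paths = ["fact", "option", "option", "option"]
kappa = cohens_kappa(judge_paths, human_paths)
```

A paired test fits here because every variant is scored on the same exam questions, and kappa is preferable to raw agreement when one path is chosen far more often than the other.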

Circularity Check

0 steps flagged

No circularity: purely empirical framework with no derivations or self-referential reductions

The paper introduces CogRAG+ as a training-free procedural framework (Reinforced Retrieval via judge-driven dual paths plus cognition-stratified Constrained Reasoning) and reports direct accuracy improvements on the Registered Dietitian exam for Qwen3-8B (85.8%) and Llama3.1-8B (60.3%). No equations, fitted parameters, predictions derived from inputs, or self-citations appear in the derivation chain. All claims reduce to explicit experimental comparisons against baselines rather than any self-definitional or load-bearing reduction. The framework is self-contained as a descriptive method evaluated externally.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no equations, parameters, or formal assumptions; the framework implicitly relies on an unstated alignment between human cognitive hierarchies and LLM behavior that is not further specified.

pith-pipeline@v0.9.0 · 5570 in / 1011 out tokens · 36192 ms · 2026-05-13T22:25:57.600245+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "Reinforced Retrieval, a judge-driven dual-path strategy with fact-centric and option-centric paths... cognition-stratified Constrained Reasoning... Bloom’s Taxonomy"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages

[1] Singhal, K., Tu, T., Gottweis, J., Sayres, R., Wulczyn, E., Amin, M., Hou, L., Clark, K., Pfohl, S.R., Cole-Lewis, H., et al.: Toward expert-level medical question answering with large language models. Nature Medicine 31(3), 943–950 (2025)
[2] Yin, Y., Qi, H., Zhu, B., Chen, J., Jiang, Y.-G., Ngo, C.-W.: FoodLMM: A versatile food assistant using large multi-modal model. IEEE Transactions on Multimedia (2025)
[3] Zhou, P., Min, W., Fu, C., Jin, Y., Huang, M., Li, X., Mei, S., Jiang, S.: FoodSky: A food-oriented large language model that can pass the chef and dietetic examinations. Patterns 6(5) (2025)
[4] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR 1(2), 3 (2022)
[5] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022)
[6] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning, pp. 2790–2799 (2019). PMLR
[7] Krathwohl, D.R.: A revision of Bloom’s taxonomy: An overview. Theory Into Practice 41(4), 212–218 (2002)
[8] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
[9] Xie, Q., Han, W., Chen, Z., Xiang, R., Zhang, X., He, Y., Xiao, M., Li, D., Dai, Y., Feng, D., et al.: FinBen: A holistic financial benchmark for large language models. Advances in Neural Information Processing Systems 37, 95716–95743 (2024)
[10] Liu, J., Huang, Z., Xiao, T., Sha, J., Wu, J., Liu, Q., Wang, S., Chen, E.: SocraticLM: Exploring socratic personalized teaching with large language models. Advances in Neural Information Processing Systems 37, 85693–85721 (2024)
[11] Liu, J., Shen, D., Zhang, Y., Dolan, W.B., Carin, L., Chen, W.: What makes good in-context examples for GPT-3? In: Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pp. 100–114 (2022)
[12] Ye, J., Wu, Z., Feng, J., Yu, T., Kong, L.: Compositional exemplars for in-context learning. In: International Conference on Machine Learning, pp. 39818–39833 (2023). PMLR
[13] Phan, L., Gatti, A., Li, N., Khoja, A., Kim, R., Ren, R., Hausenloy, J., Zhang, O., Mazeika, M., Hendrycks, D.: A benchmark of expert-level academic questions to assess AI capabilities. Nature 649(8099), 1139–1146 (2026)
[14] Agrawal, M., Chen, I.Y., Gulamali, F., Joshi, S.: The evaluation illusion of large language models in medicine. npj Digital Medicine 8(1), 600 (2025)
[15] Huber, T., Niklaus, C.: LLMs meet Bloom’s taxonomy: A cognitive view on large language model evaluations. In: Proceedings of the 31st International Conference on Computational Linguistics, pp. 5211–5246 (2025)
[16] Zhang, G., Ying, Y., Jiang, S., Liang, J., Yue, G., Fu, Y., Hu, H., Xiao, Y.: From remembering to metacognition: Do existing benchmarks accurately evaluate LLMs? In: Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 13440–13457 (2025)
[17] Yadav, A., Kashid, H., Sruthi, M., JayaPrakash, B., Kullayappa, C.R., Reddy, M.J., Bhattacharyya, P.: From recall to creation: Generating follow-up questions using Bloom’s taxonomy and Grice’s maxims. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track), pp. 1322–1338 (2025)
[18] Zoumpoulidi, M.-E., Batsi, E., Paraskevopoulos, G., Katsouros, V., Potamianos, A.: Bloomxplain: A framework and benchmark dataset for pedagogically sound LLM-generated explanations based on Bloom’s taxonomy. In: NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
[19] Robertson, S., Zaragoza, H., et al.: The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval 3(4), 333–389 (2009)
[20] Karpukhin, V., Oguz, B., Min, S., Lewis, P.S., Wu, L., Edunov, S., Chen, D., Yih, W.-t.: Dense passage retrieval for open-domain question answering. In: EMNLP (1), pp. 6769–6781 (2020)
[21] Xiong, L., Xiong, C., Li, Y., Tang, K.-F., Liu, J., Bennett, P., Ahmed, J., Overwijk, A.: Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808 (2020)
[22] Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., Zaharia, M.: ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In: Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 3715–3734 (2022)
[23] Bruch, S., Gai, S., Ingber, A.: An analysis of fusion functions for hybrid retrieval. ACM Transactions on Information Systems 42(1), 1–35 (2023)
[24] Tay, Y., Tran, V., Dehghani, M., Ni, J., Bahri, D., Mehta, H., Qin, Z., Hui, K., Zhao, Z., Gupta, J., et al.: Transformer memory as a differentiable search index. Advances in Neural Information Processing Systems 35, 21831–21843 (2022)
[25] Wang, Y., Hou, Y., Wang, H., Miao, Z., Wu, S., Chen, Q., Xia, Y., Chi, C., Zhao, G., Liu, Z., et al.: A neural corpus indexer for document retrieval. Advances in Neural Information Processing Systems 35, 25600–25614 (2022)
[26] Scholak, T., Schucher, N., Bahdanau, D.: PICARD: Parsing incrementally for constrained auto-regressive decoding from language models. arXiv preprint arXiv:2109.05093 (2021)
[27] Li, H., Zhang, J., Li, C., Chen, H.: RESDSQL: Decoupling schema linking and skeleton parsing for text-to-SQL. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 37, pp. 13067–13075 (2023)
