arXiv: 2309.01219 · v3 · submitted 2023-09-03 · 💻 cs.CL · cs.AI · cs.CY · cs.LG


Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Chen Xu, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, Shuming Shi


Pith reviewed 2026-05-12 14:16 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.LG

keywords large language models · hallucination · survey · detection · mitigation · benchmarks · reliability


The pith

This survey organizes hallucinations in large language models into taxonomies, catalogs the benchmarks used to evaluate them, and analyzes detection and mitigation methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models sometimes produce outputs that conflict with the given input, earlier generated text, or established facts, which the paper terms hallucination. The work reviews methods for spotting these cases, explaining their causes, and limiting their appearance, with special attention to issues that arise only with very large models. It introduces classifications for the different hallucination behaviors and for the tests used to measure them. Readers should care because these errors reduce how much real-world systems can rely on LLMs for tasks that require accuracy. The survey closes by outlining open research paths that could make models more consistent with reality.

Core claim

The authors state that hallucinations pose a major barrier to dependable use of LLMs and that taxonomies of the hallucination phenomena, together with evaluation benchmarks and a review of mitigation techniques, supply the structure needed to tackle the distinct problems these models introduce.

What carries the argument

Taxonomies of LLM hallucination phenomena and evaluation benchmarks that support systematic detection and mitigation analysis.

If this is right

Detection tools can be matched to specific hallucination types for higher precision.
Mitigation techniques can be compared directly using the shared benchmarks.
Gaps identified in current approaches point to concrete next steps for model improvement.
Standardized categories make it easier to track progress across different research groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could serve as a shared reference that speeds up development of automatic evaluation tools.
Similar classification efforts might apply to hallucinations in other generative models such as image or code generators.
Widespread use of the benchmarks could shift training objectives toward lower hallucination rates by default.

Load-bearing premise

The selected papers and the taxonomies built from them capture the main hallucination behaviors and remedies despite the field changing quickly.

What would settle it

An experiment or analysis that identifies a common hallucination pattern in current LLMs which fits none of the proposed categories or evades all reviewed mitigation strategies.

Original abstract

While large language models (LLMs) have demonstrated remarkable capabilities across a range of downstream tasks, a significant concern revolves around their propensity to exhibit hallucinations: LLMs occasionally generate content that diverges from the user input, contradicts previously generated context, or misaligns with established world knowledge. This phenomenon poses a substantial challenge to the reliability of LLMs in real-world scenarios. In this paper, we survey recent efforts on the detection, explanation, and mitigation of hallucination, with an emphasis on the unique challenges posed by LLMs. We present taxonomies of the LLM hallucination phenomena and evaluation benchmarks, analyze existing approaches aiming at mitigating LLM hallucination, and discuss potential directions for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A useful organizing survey on LLM hallucinations with clear taxonomies, but no documented literature search protocol.


This survey pulls together work on hallucinations in large language models and gives them a workable structure. It separates intrinsic hallucinations (contradicting the input or prior context) from extrinsic ones (misaligned with world knowledge), lists relevant benchmarks, and groups mitigation ideas into categories like retrieval-augmented generation, fine-tuning, and decoding adjustments. That organization is the real contribution here; it makes the scattered literature easier to navigate for someone entering the area.

Referee Report

1 major / 0 minor

Summary. The paper surveys recent efforts on the detection, explanation, and mitigation of hallucinations in large language models (LLMs). It presents taxonomies of hallucination phenomena (e.g., intrinsic vs. extrinsic) and evaluation benchmarks, analyzes mitigation approaches (organized into categories such as retrieval-augmented generation and fine-tuning), and discusses potential future research directions.

Significance. If the taxonomies and analysis prove representative, the survey would offer a useful organizing framework for a rapidly evolving area central to LLM reliability. The emphasis on LLM-specific challenges and the categorization of mitigation strategies could help researchers navigate the literature, though the absence of a documented selection protocol limits its value as a definitive reference.

major comments (1)

[Taxonomy and Mitigation sections (no dedicated Methods section)] The central claim that the taxonomies comprehensively capture the main hallucination issues and mitigation strategies requires the cited literature to be representative. However, the manuscript contains no explicit description of the literature search strategy (databases, keywords, date cutoffs, or inclusion/exclusion rules). This directly undermines verification of the intrinsic/extrinsic taxonomy and the completeness of the mitigation analysis sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our survey. The point about documenting the literature search process is well-taken and will be addressed in revision.


Referee: [Taxonomy and Mitigation sections (no dedicated Methods section)] The central claim that the taxonomies comprehensively capture the main hallucination issues and mitigation strategies requires the cited literature to be representative. However, the manuscript contains no explicit description of the literature search strategy (databases, keywords, date cutoffs, or inclusion/exclusion rules). This directly undermines verification of the intrinsic/extrinsic taxonomy and the completeness of the mitigation analysis sections.

Authors: We agree that an explicit description of the literature selection process would enhance transparency. Although the survey does not claim exhaustive coverage, providing details on how the literature was identified strengthens verifiability of the taxonomies and mitigation analysis. In the revised version, we will add a new subsection 'Literature Collection and Selection' (placed after the introduction) that specifies: databases searched (arXiv, Google Scholar, ACL Anthology), primary keywords ('LLM hallucination', 'hallucination in large language models', 'intrinsic hallucination', 'extrinsic hallucination'), time period (primarily 2021–2023 with key earlier works), and inclusion criteria (relevance to detection, explanation, or mitigation in LLMs). Selection was guided by relevance and citation impact rather than a formal PRISMA protocol, consistent with common practice in NLP surveys. This addition directly addresses the concern about representativeness.

Revision: yes
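The mechanical part of the selection criteria listed in the rebuttal could be expressed as a simple filter. This is a hedged sketch only: the constant values are copied from the rebuttal, while `Candidate`, `is_included`, and the `key_earlier_work` flag are hypothetical names introduced for illustration. The judgment-based parts (relevance and citation impact) are not modeled.

```python
from dataclasses import dataclass

# Values copied from the rebuttal; all identifiers are hypothetical.
DATABASES = ("arXiv", "Google Scholar", "ACL Anthology")
KEYWORDS = (
    "llm hallucination",
    "hallucination in large language models",
    "intrinsic hallucination",
    "extrinsic hallucination",
)
YEAR_RANGE = (2021, 2023)  # "primarily 2021-2023"
TOPICS = ("detection", "explanation", "mitigation")


@dataclass
class Candidate:
    title: str
    abstract: str
    year: int
    source: str
    key_earlier_work: bool = False  # manual flag for earlier exceptions


def is_included(paper: Candidate) -> bool:
    """Apply the source, period, keyword, and topic checks to one paper."""
    text = f"{paper.title} {paper.abstract}".lower()
    from_listed_source = paper.source in DATABASES
    in_period = (
        YEAR_RANGE[0] <= paper.year <= YEAR_RANGE[1] or paper.key_earlier_work
    )
    matches_keyword = any(keyword in text for keyword in KEYWORDS)
    on_topic = any(topic in text for topic in TOPICS)
    return from_listed_source and in_period and matches_keyword and on_topic
```

Substring matching is a deliberately crude stand-in for a real search query. A documented protocol would also record the date each search was run and how many papers each step excluded.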

Circularity Check

0 steps flagged

No circularity: survey compiles external literature without derivations or self-referential predictions


This paper is a literature survey that presents taxonomies of hallucination phenomena, benchmarks, and mitigation approaches drawn from cited prior work. It contains no equations, fitted parameters, predictions, or derivations that could reduce to its own inputs by construction. The central claims rest on external references rather than self-citation chains or ansatzes smuggled from the authors' prior results. Absence of an explicit search protocol is a methodological gap but does not create circularity under the defined patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This survey paper introduces no new free parameters, axioms, or invented entities; it reviews and taxonomizes existing research on LLM hallucinations.



Forward citations

Cited by 34 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations
cs.CL 2026-05 unverdicted novelty 8.0

REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
cs.LG 2026-05 unverdicted novelty 7.0

LLMs suppress factual corrections in task contexts despite internal knowledge of errors, with two training-free interventions shown to increase correction rates substantially.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Transformer hidden states encode facts as attractor basins; hallucinations occur from basin absence and conflicts from basin competition, detected cleanly by geometric margin rather than entropy.
Attractor Geometry of Transformer Memory: From Conflict Arbitration to Confident Hallucination
cs.AI 2026-05 unverdicted novelty 7.0

Attractor basins in transformer hidden states unify conflict and hallucination as basin competition or absence, with geometric margin outperforming entropy for detection and a scaling law governing confident hallucina...
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

SemGrad is a gradient-based uncertainty quantification technique for free-form LLM generation that operates in semantic space using a Semantic Preservation Score to select stable embeddings.
Foundation Models as Oracles for Refactoring Correctness Detection
cs.SE 2026-05 unverdicted novelty 7.0

Foundation models serve as effective oracles for detecting refactoring correctness issues in Java programs, achieving up to 93.8% accuracy in zero-shot evaluations on 226 real bugs.
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
cs.LG 2026-04 unverdicted novelty 7.0

Counterfactual Routing awakens dormant experts in MoE models via layer-wise perturbation and a new CEI metric, raising factual accuracy 3.1% on average across TruthfulQA, FACTOR, and TriviaQA without extra inference cost.
Dual Hierarchical Dialogue Policy Learning for Legal Inquisitive Conversational Agents
cs.CL 2026-05 unverdicted novelty 6.0

A dual hierarchical RL framework lets agents learn when and how to ask probing questions in U.S. Supreme Court arguments, outperforming baselines on a court dataset.
Knowing but Not Correcting: Routine Task Requests Suppress Factual Correction in LLMs
cs.LG 2026-05 unverdicted novelty 6.0

Task context suppresses factual correction in LLMs at the response-selection stage even when the model has encoded the error, and two training-free interventions raise correction rates substantially.
Bridging Generation and Training: A Systematic Review of Quality Issues in LLMs for Code
cs.SE 2026-05 accept novelty 6.0

A review of 114 studies creates taxonomies for code and data quality issues, formalizes 18 propagation mechanisms from training data defects to LLM-generated code defects, and synthesizes detection and mitigation techniques.
Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG 2026-05 unverdicted novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
LLM Ghostbusters: Surgical Hallucination Suppression via Adaptive Unlearning
cs.CR 2026-05 unverdicted novelty 6.0

Adaptive Unlearning suppresses package hallucinations in code-generating LLMs by 81% while preserving benchmark performance, using model-generated data and no human labels.
The Surprising Universality of LLM Outputs: A Real-Time Verification Primitive
cs.CR 2026-04 unverdicted novelty 6.0

LLM token rank-frequency distributions converge to a shared Mandelbrot distribution across models and domains, enabling a microsecond-scale statistical primitive for provenance verification and black-box anomaly triage.
The Illusion of Equivalence: Systematic FP16 Divergence in KV-Cached Autoregressive Inference
cs.LG 2026-04 unverdicted novelty 6.0

FP16 KV caching in transformers causes deterministic token divergence versus cache-free inference due to non-associative floating-point accumulation orderings.
Search-o1: Agentic Search-Enhanced Large Reasoning Models
cs.AI 2025-01 unverdicted novelty 6.0

Search-o1 integrates agentic retrieval-augmented generation and a Reason-in-Documents module into large reasoning models to dynamically supply missing knowledge and improve performance on complex science, math, coding...
Agentless: Demystifying LLM-based Software Engineering Agents
cs.SE 2024-07 conditional novelty 6.0

Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
Corrective Retrieval Augmented Generation
cs.CL 2024-01 unverdicted novelty 6.0

CRAG improves RAG robustness via a retrieval quality evaluator that triggers web augmentation and a decompose-recompose filter to focus on relevant information, yielding better results on short- and long-form generati...
HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs
cs.CL 2026-05 unverdicted novelty 5.0

HalluScan benchmark tests hallucination detectors on LLMs, identifies NLI Verification as top performer with 0.88 AUROC, and introduces HalluScore (r=0.41 with humans) plus a routing method for 2x cost savings.
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

SCM-GRPO grounds multi-hop fact verification in structural causal models and applies GRPO reinforcement learning to optimize reasoning chain length, outperforming baselines on HoVer and EX-FEVER.
IUQ: Interrogative Uncertainty Quantification for Long-Form Large Language Model Generation
cs.CL 2026-04 unverdicted novelty 5.0

IUQ quantifies claim-level uncertainty in long-form LLM generation by combining inter-sample consistency and intra-sample faithfulness through an interrogate-then-respond approach and outperforms baselines on two datasets.
Adapt to Thrive! Adaptive Power-Mean Policy Optimization for Improved LLM Reasoning
cs.CL 2026-04 unverdicted novelty 5.0

APMPO boosts average Pass@1 scores on math reasoning benchmarks by 3 points over GRPO by using an adaptive power-mean policy objective and feedback-driven clipping bounds in RLVR training.
Free Energy-Driven Reinforcement Learning with Adaptive Advantage Shaping for Unsupervised Reasoning in LLMs
cs.CL 2026-04 unverdicted novelty 5.0

FREIA applies free energy principles and adaptive advantage shaping to unsupervised RL, outperforming baselines by 0.5-3.5 Pass@1 points on math reasoning with a 1.5B model.
Policy-Aware Edge LLM-RAG Framework for Internet of Battlefield Things Mission Orchestration
cs.NI 2026-04 unverdicted novelty 5.0

PA-LLM-RAG adds policy retrieval and dual-LLM verification to enable reliable low-latency mission orchestration in simulated IoBT environments, with Gemma-2B reaching 100% policy compliance at 4.17s latency.
A Graph-Enhanced Defense Framework for Explainable Fake News Detection with LLM
cs.CL 2026-04 unverdicted novelty 5.0

G-Defense builds claim-centered graphs from sub-claims, applies RAG for evidence and competing explanations, then uses graph inference to detect fake news veracity and generate intuitive explanation graphs, claiming S...
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective
cs.AI 2026-05 unverdicted novelty 4.0

Reliable AI needs structured Knowledge Objects to externalize and enable human validation of implicit knowledge that current methods cannot verify.
Grounding Multi-Hop Reasoning in Structural Causal Models via Group Relative Policy Optimization
cs.AI 2026-05 unverdicted novelty 4.0

The SCM-GRPO framework models multi-hop fact verification as causal inference and applies reinforcement learning to optimize reasoning depth, reporting outperformance on HoVer and EX-FEVER.
Recommendations for Efficient and Responsible LLM Adoption within Industrial Software Development
cs.SE 2026-04 conditional novelty 4.0

A multi-case study plus survey produces seven actionable recommendations for efficient and responsible LLM use in industrial software engineering.
Reducing Hallucination in Enterprise AI Workflows via Hybrid Utility Minimum Bayes Risk (HUMBR)
cs.LG 2026-04 unverdicted novelty 4.0

HUMBR reduces LLM hallucinations in enterprise workflows by using a hybrid semantic-lexical utility within minimum Bayes risk decoding to identify consensus outputs, with derived error bounds and reported outperforman...
Understanding the planning of LLM agents: A survey
cs.AI 2024-02 accept novelty 4.0

A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
The Rise and Potential of Large Language Model Based Agents: A Survey
cs.AI 2023-09 accept novelty 4.0

The paper surveys the origins, frameworks, applications, and open challenges of AI agents built on large language models.
A Survey on the Memory Mechanism of Large Language Model based Agents
cs.AI 2024-04 accept novelty 3.0

A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
A Survey on Hallucination in Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
