arxiv: 2308.05374 · v2 · pith:ZGTKTZFWnew · submitted 2023-08-10 · 💻 cs.AI · cs.LG

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, Hang Li

Pith reviewed 2026-05-17 22:26 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords LLM alignment · trustworthiness evaluation · large language models · reliability · safety · fairness · robustness · social norms

The pith

A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys seven key categories of trustworthiness in large language models, expanding them into twenty-nine sub-categories to provide evaluation guidance. It selects eight sub-categories for concrete measurement experiments on several common LLMs. Results show that models with more alignment work tend to score better across trustworthiness measures overall. Yet this improvement is inconsistent, stronger in some areas than in others. This variation underscores the value of detailed, ongoing testing rather than assuming broad alignment fixes everything.

Core claim

By organizing trustworthiness into reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness, and measuring eight sub-areas, the authors establish that greater alignment correlates with better overall performance but with category-dependent effectiveness, calling for finer-grained analysis and continued alignment refinements.

What carries the argument

The seven-category taxonomy with twenty-nine sub-categories that structures the survey and directs the selection of measurement studies.

If this is right

More aligned models can be expected to deliver higher overall trustworthiness in practice.
Alignment efforts must address variation by targeting specific categories separately.
Evaluation should include fine-grained tests rather than relying on general alignment metrics.
Deployment decisions benefit from checking performance across multiple trustworthiness dimensions.
Practitioners gain a structured guideline for iterating on LLM alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending measurements to additional sub-categories could confirm or refine the observed patterns.
The framework might apply to assessing trustworthiness in multimodal or other AI models beyond text-based LLMs.
Prioritizing categories where alignment shows weaker effects could improve overall system reliability.
Real-world deployment might reveal gaps not captured by the current sub-category selections.

Load-bearing premise

That the chosen seven categories, twenty-nine sub-categories, and the eight selected for measurement accurately represent the full scope of trustworthiness in real-world LLM use.

What would settle it

A replication study that applies alternative trustworthiness categories or different evaluation methods to the same models and finds no general advantage for aligned models, or uniform effects across categories, would challenge the main findings.

Original abstract

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey organizes LLM trustworthiness into seven categories and 29 sub-categories and adds some new measurements, but its empirical claim that more-aligned models score higher is not backed by a clearly independent alignment ranking or by enough methodological detail.

This paper organizes existing work on LLM alignment evaluation into seven top-level categories—reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness—split into 29 sub-categories total. It then selects eight of those sub-categories and reports measurements across several widely used models. The main takeaway from the measurements is that models with more alignment effort generally come out ahead on overall trustworthiness, though the gains vary by category.

Referee Report

2 major / 2 minor

Summary. The paper surveys seven major categories of LLM trustworthiness (reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness), subdivided into 29 sub-categories in total. It then selects a subset of 8 sub-categories, designs corresponding measurements, and applies them to several widely-used LLMs. The central empirical finding is that more aligned models tend to perform better overall in trustworthiness, although the effectiveness of alignment varies across categories. The work positions itself as providing guidance for systematic evaluation and improvement of LLM alignment.

Significance. If the directional findings hold after methodological clarification, the survey offers a structured taxonomy that consolidates key trustworthiness dimensions and supplies concrete measurement examples. The explicit enumeration of 29 sub-categories and the cross-model comparisons add practical value for practitioners seeking to iterate on alignment. The observation that alignment success is uneven across categories is a useful falsifiable pointer for future targeted work.

major comments (2)

[Results / Empirical evaluation] Results section (and abstract): the claim that 'more aligned models tend to perform better in terms of overall trustworthiness' rests on an implicit ordering of the tested LLMs by alignment strength. The manuscript does not state an a-priori, externally validated ranking (e.g., base models vs. RLHF-tuned vs. further safety-tuned) constructed independently of the eight trustworthiness metrics; without this separation the reported positive trend risks circularity rather than confirmation.
[Measurement studies] Measurement studies section: no explicit criteria are given for choosing the 8 sub-categories out of the 29, nor are the exact test implementations, prompt templates, or statistical controls described. These omissions leave the support for the directional claims only moderately strong and make replication or extension difficult.
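
The first major comment can be made concrete with a small check: fix the alignment ordering from external training documentation first, then measure how each category's scores track that ordering. The sketch below is a minimal illustration under stated assumptions, not the paper's procedure. The model names, tier labels, category names, and score values are all hypothetical placeholders, and Spearman rank correlation is one possible choice of trend statistic rather than a method the authors report using.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


def trend_by_category(tiers, scores_by_category):
    """Correlate the fixed alignment tiers with each category's scores."""
    models = sorted(tiers)
    tier_values = [tiers[m] for m in models]
    return {
        category: spearman(tier_values, [scores[m] for m in models])
        for category, scores in scores_by_category.items()
    }


# Hypothetical tiers, assigned before looking at any trustworthiness score:
# 0 = base model, 1 = RLHF-tuned, 2 = further safety-tuned.
alignment_tier = {"model_a": 0, "model_b": 1, "model_c": 1, "model_d": 2}

# Made-up scores for illustration only; these are not results from the paper.
scores = {
    "category_x": {"model_a": 0.40, "model_b": 0.55, "model_c": 0.60, "model_d": 0.90},
    "category_y": {"model_a": 0.70, "model_b": 0.50, "model_c": 0.72, "model_d": 0.71},
}

trend = trend_by_category(alignment_tier, scores)
for category, rho in trend.items():
    print(f"{category}: rho = {rho:.2f}")
```

Because the tiers are set independently of the scores, a high correlation in one category and a low one in another would reflect the category-dependent effect the paper describes rather than an artifact of how the ordering was built. With only a handful of models, such a statistic is descriptive at best.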

minor comments (2)

[Abstract] The abstract refers to 'several widely-used LLMs' without naming them; listing the specific models (and their versions) would improve immediate clarity.
[Results figures/tables] Table or figure captions for the measurement results could more explicitly note the source of the alignment ordering used for the 'more aligned' vs. 'less aligned' comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and suggestions. We address each of the major comments below and indicate how we plan to revise the manuscript accordingly.

Referee: [Results / Empirical evaluation] Results section (and abstract): the claim that 'more aligned models tend to perform better in terms of overall trustworthiness' rests on an implicit ordering of the tested LLMs by alignment strength. The manuscript does not state an a-priori, externally validated ranking (e.g., base models vs. RLHF-tuned vs. further safety-tuned) constructed independently of the eight trustworthiness metrics; without this separation the reported positive trend risks circularity rather than confirmation.

Authors: We agree with the referee that an explicit, a-priori ordering of the models based on their alignment efforts, independent of our evaluation metrics, would strengthen the claim and avoid any appearance of circularity. In the revised manuscript, we will add a dedicated paragraph in the Results section (and update the abstract if necessary) that describes the alignment levels of the tested LLMs based on external information, such as their training procedures documented in official papers and announcements (e.g., distinguishing base models from those fine-tuned with RLHF or additional safety measures). This ordering will be presented prior to reporting the trustworthiness scores.
revision: yes
Referee: [Measurement studies] Measurement studies section: no explicit criteria are given for choosing the 8 sub-categories out of the 29, nor are the exact test implementations, prompt templates, or statistical controls described. These omissions leave the support for the directional claims only moderately strong and make replication or extension difficult.

Authors: We acknowledge that the selection criteria for the 8 sub-categories and the detailed experimental setups were not sufficiently elaborated. We will revise the Measurement studies section to include explicit criteria for selection, such as coverage of different major categories, feasibility of automated evaluation, and importance for real-world applications. Furthermore, we will provide the exact prompt templates, evaluation protocols, and any statistical methods used in an appendix to enable full replication and extension by other researchers.
revision: yes

Circularity Check

0 steps flagged

No circularity in survey review or empirical measurements

The paper is a literature survey that organizes LLM trustworthiness into seven categories and 29 sub-categories drawn from prior work, then performs new measurements on a selected subset of eight sub-categories across several LLMs. The central observation that more aligned models tend to perform better is an empirical comparison between models whose alignment status is established by external training history (e.g., base models versus those that received RLHF or safety tuning) and the independently collected trustworthiness scores. No equations, fitted parameters, or self-referential definitions are present; the ordering of models by alignment degree is not derived from the paper's own metrics. Self-citations exist as part of normal survey practice but are not load-bearing for any uniqueness claim or ansatz. The work is therefore self-contained against external benchmarks and contains no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that trustworthiness can be decomposed into the listed categories and that human intentions provide a stable reference for alignment; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: Alignment refers to making models behave in accordance with human intentions.
Explicitly stated in the opening sentence of the abstract as the definition of the central task.

pith-pipeline@v0.9.0 · 5620 in / 1157 out tokens · 36983 ms · 2026-05-17T22:26:47.271196+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

LawOfExistence law_of_existence echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem
cs.CR 2025-09 unverdicted novelty 8.0

This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers r...
Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs
cs.AI 2026-04 unverdicted novelty 7.0

MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
cs.LG 2026-04 conditional novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
VoiceBench: Benchmarking LLM-Based Voice Assistants
cs.CL 2024-10 unverdicted novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
Domain Restriction via Multi SAE Layer Transitions
cs.AI 2026-05 unverdicted novelty 6.0

Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
cs.AI 2026-05 unverdicted novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
Common-agency Games for Multi-Objective Test-Time Alignment
cs.GT 2026-05 unverdicted novelty 6.0

CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
cs.AI 2026-05 unverdicted novelty 6.0

Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
AlignCultura: Towards Culturally Aligned Large Language Models?
cs.CL 2026-04 unverdicted novelty 6.0

Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
cs.CR 2026-04 unverdicted novelty 6.0

ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
Mapping how LLMs debate societal issues when shadowing human personality traits, sociodemographics and social media behavior
cs.CL 2026-04 unverdicted novelty 5.0

CDS is a new synthetic corpus of LLM-generated texts on vaccines, disinformation, gender gaps, and STEM stereotypes, linked to persona attributes to enable bias and alignment audits.
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
TrustLLM: Trustworthiness in Large Language Models
cs.CL 2024-01 unverdicted novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...
A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
cs.CL 2023-11 unverdicted novelty 5.0

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
Large Language Model-Based Agents for Software Engineering: A Survey
cs.SE 2024-09 unverdicted novelty 4.0

A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
cs.AI 2025-10 unverdicted novelty 3.0

A scoping review of AIES and FAccT literature concludes that AI trustworthiness research prioritizes technical precision over social, ethical, and institutional factors, leaving the sociotechnical nature of AI systems...
A Survey on the Memory Mechanism of Large Language Model based Agents
cs.AI 2024-04 accept novelty 3.0

A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
A Survey on Knowledge Distillation of Large Language Models
cs.CL 2024-02 accept novelty 3.0

A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 21 Pith papers · 31 internal anchors

[1]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

work page 2022
[2]

Alignment of language agents

Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. arXiv preprint arXiv:2103.14659, 2021

work page arXiv 2021
[3]

OpenAI. Gpt-4. https://openai.com/research/gpt-4, 2023

work page 2023
[4]

On the dangers of stochastic parrots: Can language models be too big?

Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021

work page 2021
[5]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

work page 2019
[6]

Gpt-4 system card, https://cdn.openai.com/papers/gpt-4-system-card.pdf

OpenAI. Gpt-4 system card, https://cdn.openai.com/papers/gpt-4-system-card.pdf . 2023

work page 2023
[7]

Andrew R. Chow. How chatgpt managed to grow faster than tiktok or instagram. https://time.com/6253615/chatgpt-fastest-growing

work page arXiv
[8]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

work page 2020
[9]

A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown

Amanda Marchant, Keith Hawton, Ann Stewart, Paul Montgomery, Vinod Singaravelu, Keith Lloyd, Nicola Purdy, Kate Daine, and Ann John. A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown. PloS one, 12(8):e0181722, 2017

work page 2017
[10]

The regulation of pornography and child pornography on the internet

Yaman Akdeniz. The regulation of pornography and child pornography on the internet. Available at SSRN 41684, 1997

work page 1997
[11]

Dynamics of hate based internet user networks

Pawel Sobkowicz and Antoni Sobkowicz. Dynamics of hate based internet user networks. The European Physical Journal B, 73(4):633–643, 2010

work page 2010
[12]

Zikun Liu, Chen Luo, and Jia Lu. Hate speech in the internet context: Unpacking the roles of internet penetration, online legal regulation, and online opinion polarization from a transnational perspective. Information Development, page 02666669221148487, 2023

work page 2023
[13]

Is the internet causing political polarization? evidence from demographics

Levi Boxell, Matthew Gentzkow, and Jesse M Shapiro. Is the internet causing political polarization? evidence from demographics. Technical report, National Bureau of Economic Research, 2017

work page 2017
[14]

Regulating the internet of things: first steps toward managing discrimination, privacy, security and consent

Scott R Peppet. Regulating the internet of things: first steps toward managing discrimination, privacy, security and consent. Tex. L. Rev., 93:85, 2014

work page 2014
[15]

Normative challenges of identification in the internet of things: Privacy, profiling, discrimination, and the gdpr

Sandra Wachter. Normative challenges of identification in the internet of things: Privacy, profiling, discrimination, and the gdpr. Computer law & security review, 34(3):436–449, 2018

work page 2018
[16]

Misuse of the internet by pedophiles: Implications for law enforcement and probation practice

Keith F Durkin. Misuse of the internet by pedophiles: Implications for law enforcement and probation practice. Fed. Probation, 61:14, 1997

work page 1997
[17]

Controversies and legal issues of prescribing and dispensing medications using the internet

Constance H Fung, Hawkin E Woo, and Steven M Asch. Controversies and legal issues of prescribing and dispensing medications using the internet. In Mayo Clinic Proceedings, volume 79, pages 188–194. Elsevier, 2004

work page 2004
[18]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

work page 2017
[20]

A General Language Assistant as a Laboratory for Alignment

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[22]

Evaluating the social impact of generative ai systems in systems and society

Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, et al. Evaluating the social impact of generative ai systems in systems and society. arXiv preprint arXiv:2306.05949, 2023

work page arXiv 2023
[23]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[24]

Eight things to know about large language models

Samuel R Bowman. Eight things to know about large language models. arXiv preprint arXiv:2304.00612, 2023

work page arXiv 2023
[25]

Deep learning, 2016

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, 2016. http://www.deeplearningbook.org

work page 2016
[26]

The curious case of neural text degeneration

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020

work page 2020
[27]

Six Challenges for Neural Machine Translation

Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

work page 2017
[32]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

Universal Language Model Fine-tuning for Text Classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[34]

Improving language understanding by generative pre-training

Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018

work page 2018
[35]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[36]

GLM-130B: An Open Bilingual Pre-trained Model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[37]

Dialogpt: Large-scale generative pre-training for conversational response generation

Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019

work page arXiv 2019
[38]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[39]

Rrhf: Rank responses to align language models with human feedback without tears, 2023

Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023

[40]

RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment

Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023

[41]

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023

[42]

Training socially aligned language models in simulated human society

Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960, 2023

[43]

Large language models and software as a medical device

Johan Ordish. Large language models and software as a medical device. https://medregs.blog.gov.uk/2023/03/03/large-language-models-and-software-as-a-medical-device/

[44]

Are large language models ready for healthcare? a comparative study on clinical language understanding, 2023

Yuqing Wang, Yun Zhao, and Linda Petzold. Are large language models ready for healthcare? a comparative study on clinical language understanding, 2023

[45]

How well do large language models support clinician information needs?

Dev Dash, Eric Horvitz, and Nigam Shah. How well do large language models support clinician information needs? https://hai.stanford.edu/news/how-well-do-large-language-models-support-clinician-information-needs

[46]

Bloomberggpt: A large language model for finance, 2023

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance, 2023

[47]

Fingpt: Open-source financial large language models, 2023

Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. Fingpt: Open-source financial large language models, 2023

[48]

Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation

Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023

[49]

A categorical archive of chatgpt failures

Ali Borji. A categorical archive of chatgpt failures. arXiv preprint arXiv:2302.03494, 2023

[50]

Chatgpt and software testing education: Promises & perils

Sajed Jalil, Suzzana Rafi, Thomas D LaToza, Kevin Moran, and Wing Lam. Chatgpt and software testing education: Promises & perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pages 4130–4137. IEEE, 2023

[51]

Fake news detection on social media: A data mining perspective

Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD explorations newsletter, 19(1):22–36, 2017

[52]

Some Like it Hoax: Automated Fake News Detection in Social Networks

Eugenio Tacchini, Gabriele Ballarin, Marco L Della Vedova, Stefano Moret, and Luca De Alfaro. Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506, 2017

[53]

Quantifying Memorization Across Neural Language Models

Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022

[54]

A closer look at memorization in deep networks

Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017

[55]

Measuring causal effects of data statistics on language model's 'factual' predictions. arXiv preprint arXiv:2207.14251, 2022

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model's 'factual' predictions. arXiv preprint arXiv:2207.14251, 2022

[56]

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022

[57]

Unsupervised dense information retrieval with contrastive learning

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. 2022

[58]

Prompting gpt-3 to be reliable

Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022

[59]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020

[60]

Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

[61]

Artificial hallucinations in chatgpt: implications in scientific writing

Hussam Alkaissi and Samy I McFarlane. Artificial hallucinations in chatgpt: implications in scientific writing. Cureus, 15(2), 2023

[62]

A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity

Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023

[63]

False memories and confabulation

Marcia K Johnson and Carol L Raye. False memories and confabulation. Trends in cognitive sciences, 2(4):137–145, 1998

[64]

Calibrated language model fine-tuning for in-and out-of-distribution data

Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie Lyu, Tuo Zhao, and Chao Zhang. Calibrated language model fine-tuning for in-and out-of-distribution data. arXiv preprint arXiv:2010.11506, 2020

[65]

Increasing faithfulness in knowledge-grounded dialogue with controllable features

Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. Increasing faithfulness in knowledge-grounded dialogue with controllable features. arXiv preprint arXiv:2107.06963, 2021

[66]

Why does chatgpt fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513, 2023

Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does chatgpt fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513, 2023

[67]

Modeling fluency and faithfulness for diverse neural machine translation

Yang Feng, Wanying Xie, Shuhao Gu, Chenze Shao, Wen Zhang, Zhengxin Yang, and Dong Yu. Modeling fluency and faithfulness for diverse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 59–66, 2020

[68]

Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization

Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441, 2018

[69]

Neural path hunter: Reducing hallucination in dialogue systems via path grounding

Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. arXiv preprint arXiv:2104.08455, 2021

[70]

Entity-based knowledge conflicts in question answering

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. arXiv preprint arXiv:2109.05052, 2021

[71]

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models

Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023

[72]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

[73]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002

[74]

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021

[75]

Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation

Sashank Santhanam, Behnam Hedayatnia, Spandana Gella, Aishwarya Padmakumar, Seokhwan Kim, Yang Liu, and Dilek Hakkani-Tur. Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation. arXiv preprint arXiv:2110.05456, 2021

[76]

Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering

Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. arXiv preprint arXiv:2104.08202, 2021

[77]

Improving faithfulness in abstractive summarization with contrast candidate generation and selection

Sihao Chen, Fan Zhang, Kazoo Sone, and Dan Roth. Improving faithfulness in abstractive summarization with contrast candidate generation and selection. arXiv preprint arXiv:2104.09061, 2021

[78]

A simple recipe towards reducing hallucination in neural surface realisation

Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2673–2679, 2019

[79]

Faithful to the original: Fact aware neural abstractive summarization

Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018

[80]

Totto: A controlled table-to-text generation dataset

Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373, 2020

