Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Pith reviewed 2026-05-17 22:26 UTC · model grok-4.3
The pith
A survey finds that more aligned LLMs generally achieve higher trustworthiness, though the gains differ across categories.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors organize trustworthiness into seven categories (reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness) and run measurement studies on eight of the twenty-nine sub-categories. They report that more aligned models tend to score better on overall trustworthiness, but that the effect of alignment varies by category, which motivates finer-grained analysis and continued alignment refinement.
What carries the argument
The seven-category taxonomy with twenty-nine sub-categories that structures the survey and directs the selection of measurement studies.
If this is right
- More aligned models can be expected to deliver higher overall trustworthiness in practice.
- Alignment efforts must address variation by targeting specific categories separately.
- Evaluation should include fine-grained tests rather than relying on general alignment metrics.
- Deployment decisions benefit from checking performance across multiple trustworthiness dimensions.
- Practitioners gain a structured guideline for iterating on LLM alignment.
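The point about fine-grained tests can be illustrated with a minimal sketch. It assumes an unweighted mean as the overall score, which the paper does not specify, and every model name and number below is a hypothetical placeholder rather than a result from the paper. It shows how a model can improve overall while regressing in one category.

```python
from statistics import mean

# Hypothetical placeholder scores in [0, 1]; these are NOT results from the paper.
scores = {
    "base_model": {"reliability": 0.50, "safety": 0.25, "fairness": 0.75},
    "aligned_model": {"reliability": 0.75, "safety": 0.75, "fairness": 0.50},
}


def overall(model_scores):
    """Unweighted mean across categories (an assumed aggregation rule)."""
    return mean(list(model_scores.values()))


def per_category_gain(less_aligned, more_aligned):
    """Score difference per category: more aligned minus less aligned."""
    return {c: more_aligned[c] - less_aligned[c] for c in less_aligned}


gains = per_category_gain(scores["base_model"], scores["aligned_model"])

# Categories where the more aligned model scores lower than the baseline.
regressions = sorted(c for c, g in gains.items() if g < 0)

print("overall:", {m: round(overall(s), 3) for m, s in scores.items()})
print("per-category gain:", gains)
print("regressions:", regressions)
```

In this toy setup the aggregate rises even though one category gets worse, which is the kind of pattern a single overall number would hide.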
Where Pith is reading between the lines
- Extending measurements to additional sub-categories could confirm or refine the observed patterns.
- The framework might apply to assessing trustworthiness in multimodal or other AI models beyond text-based LLMs.
- Prioritizing categories where alignment shows weaker effects could improve overall system reliability.
- Real-world deployment might reveal gaps not captured by the current sub-category selections.
Load-bearing premise
That the chosen seven categories, twenty-nine sub-categories, and the eight selected for measurement accurately represent the full scope of trustworthiness in real-world LLM use.
What would settle it
A replication study that applies alternative trustworthiness categories or different evaluation methods to the same models and finds no general advantage for aligned models, or uniform effects across categories, would challenge the main findings.
Original abstract
Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys seven major categories of LLM trustworthiness (reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness), subdivided into 29 sub-categories in total. It then selects a subset of 8 sub-categories, designs corresponding measurements, and applies them to several widely-used LLMs. The central empirical finding is that more aligned models tend to perform better overall in trustworthiness, although the effectiveness of alignment varies across categories. The work positions itself as providing guidance for systematic evaluation and improvement of LLM alignment.
Significance. If the directional findings hold after methodological clarification, the survey offers a structured taxonomy that consolidates key trustworthiness dimensions and supplies concrete measurement examples. The explicit enumeration of 29 sub-categories and the cross-model comparisons add practical value for practitioners seeking to iterate on alignment. The observation that alignment success is uneven across categories is a useful falsifiable pointer for future targeted work.
major comments (2)
- [Results / Empirical evaluation] Results section (and abstract): the claim that 'more aligned models tend to perform better in terms of overall trustworthiness' rests on an implicit ordering of the tested LLMs by alignment strength. The manuscript does not state an a-priori, externally validated ranking (e.g., base models vs. RLHF-tuned vs. further safety-tuned) constructed independently of the eight trustworthiness metrics; without this separation the reported positive trend risks circularity rather than confirmation.
- [Measurement studies] Measurement studies section: no explicit criteria are given for choosing the 8 sub-categories out of the 29, nor are the exact test implementations, prompt templates, or statistical controls described. These omissions leave the support for the directional claims only moderately strong and make replication or extension difficult.
minor comments (2)
- [Abstract] The abstract refers to 'several widely-used LLMs' without naming them; listing the specific models (and their versions) would improve immediate clarity.
- [Results figures/tables] Table or figure captions for the measurement results could more explicitly note the source of the alignment ordering used for the 'more aligned' vs. 'less aligned' comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and suggestions. We address each of the major comments below and indicate how we plan to revise the manuscript accordingly.
- Referee: [Results / Empirical evaluation] Results section (and abstract): the claim that 'more aligned models tend to perform better in terms of overall trustworthiness' rests on an implicit ordering of the tested LLMs by alignment strength. The manuscript does not state an a-priori, externally validated ranking (e.g., base models vs. RLHF-tuned vs. further safety-tuned) constructed independently of the eight trustworthiness metrics; without this separation the reported positive trend risks circularity rather than confirmation.
  Authors: We agree with the referee that an explicit, a-priori ordering of the models based on their alignment efforts, independent of our evaluation metrics, would strengthen the claim and avoid any appearance of circularity. In the revised manuscript, we will add a dedicated paragraph in the Results section (and update the abstract if necessary) that describes the alignment levels of the tested LLMs based on external information, such as their training procedures documented in official papers and announcements (e.g., distinguishing base models from those fine-tuned with RLHF or additional safety measures). This ordering will be presented prior to reporting the trustworthiness scores.
  Revision: yes
- Referee: [Measurement studies] Measurement studies section: no explicit criteria are given for choosing the 8 sub-categories out of the 29, nor are the exact test implementations, prompt templates, or statistical controls described. These omissions leave the support for the directional claims only moderately strong and make replication or extension difficult.
  Authors: We acknowledge that the selection criteria for the 8 sub-categories and the detailed experimental setups were not sufficiently elaborated. We will revise the Measurement studies section to include explicit criteria for selection, such as coverage of different major categories, feasibility of automated evaluation, and importance for real-world applications. Furthermore, we will provide the exact prompt templates, evaluation protocols, and any statistical methods used in an appendix to enable full replication and extension by other researchers.
  Revision: yes
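One way to make the a-priori ordering concrete is sketched below. It is an illustration under stated assumptions, not the authors' procedure: the alignment tiers, category names, and scores are hypothetical placeholders, and Spearman rank correlation is chosen here only as one reasonable way to test a monotone trend. The key property is that the tiers are fixed from documented training procedures before any trustworthiness score is examined.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical tiers fixed BEFORE scoring, one entry per model:
# 0 = base, 1 = instruction/RLHF-tuned, 2 = further safety-tuned.
alignment_tier = [0, 1, 1, 2]

# Hypothetical placeholder scores per category, same model order as above.
category_scores = {
    "safety": [0.2, 0.5, 0.6, 0.9],
    "fairness": [0.7, 0.4, 0.6, 0.5],
}

rho = {c: spearman(alignment_tier, s) for c, s in category_scores.items()}
for category, value in rho.items():
    print(f"{category}: rho = {value:.3f}")
```

Reporting the correlation per category, rather than once on an aggregate, also speaks to the finding that alignment effects vary across categories. With only a handful of models such a coefficient is weak evidence, so it would complement the raw scores rather than replace them.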
Circularity Check
No circularity in survey review or empirical measurements
The paper is a literature survey that organizes LLM trustworthiness into seven categories and 29 sub-categories drawn from prior work, then performs new measurements on a selected subset of eight sub-categories across several LLMs. The central observation that more aligned models tend to perform better is an empirical comparison between models whose alignment status is established by external training history (e.g., base models versus those that received RLHF or safety tuning) and the independently collected trustworthiness scores. No equations, fitted parameters, or self-referential definitions are present; the ordering of models by alignment degree is not derived from the paper's own metrics. Self-citations exist as part of normal survey practice but are not load-bearing for any uniqueness claim or ansatz. The work is therefore self-contained against external benchmarks and contains no reduction of results to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Alignment refers to making models behave in accordance with human intentions
Lean theorems connected to this paper
- LawOfExistence (law_of_existence): echoes
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- Parasites in the Toolchain: A Large-Scale Analysis of Attacks on the MCP Ecosystem
  This paper defines a new Parasitic Toolchain Attack pattern (MCP-UPD) that assembles legitimate tools into privacy-exfiltrating workflows and reports the first large-scale scan of 12230 MCP tools across 1360 servers r...
- Why Do Multi-Agent LLM Systems Fail?
  The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
- Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs
  MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
  Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- VoiceBench: Benchmarking LLM-Based Voice Assistants
  VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.
- Domain Restriction via Multi SAE Layer Transitions
  Multi-layer SAE transitions capture domain-specific signatures that distinguish OOD texts in Gemma-2 models.
- Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks
  Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
- Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
  LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
- Common-agency Games for Multi-Objective Test-Time Alignment
  CAGE uses common-agency games and an EPEC algorithm to compute equilibrium policies that balance multiple conflicting objectives for test-time LLM alignment.
- Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
  Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
- AlignCultura: Towards Culturally Aligned Large Language Models?
  Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
- The Art of (Mis)alignment: How Fine-Tuning Methods Effectively Misalign and Realign LLMs in Post-Training
  ORPO is most effective at misaligning LLMs while DPO excels at realigning them, though it reduces utility, revealing an asymmetry between attack and defense methods.
- OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
  OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
- Mapping how LLMs debate societal issues when shadowing human personality traits, sociodemographics and social media behavior
  CDS is a new synthetic corpus of LLM-generated texts on vaccines, disinformation, gender gaps, and STEM stereotypes, linked to persona attributes to enable bias and alignment audits.
- Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
  Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
- TrustLLM: Trustworthiness in Large Language Models
  TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...
- A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions
  The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.
- Large Language Model-Based Agents for Software Engineering: A Survey
  A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
- Understanding AI Trustworthiness: A Scoping Review of AIES & FAccT Articles
  A scoping review of AIES and FAccT literature concludes that AI trustworthiness research prioritizes technical precision over social, ethical, and institutional factors, leaving the sociotechnical nature of AI systems...
- A Survey on the Memory Mechanism of Large Language Model based Agents
  A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
- A Survey on Knowledge Distillation of Large Language Models
  A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
Reference graph
Works this paper leans on
- [1] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022
- [2] Zachary Kenton, Tom Everitt, Laura Weidinger, Iason Gabriel, Vladimir Mikulik, and Geoffrey Irving. Alignment of language agents. arXiv preprint arXiv:2103.14659, 2021
- [3] OpenAI. Gpt-4. https://openai.com/research/gpt-4, 2023
- [4] Emily M Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 610–623, 2021
- [5] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019
- [6] OpenAI. Gpt-4 system card. https://cdn.openai.com/papers/gpt-4-system-card.pdf, 2023
- [7]
- [8] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
- [9] Amanda Marchant, Keith Hawton, Ann Stewart, Paul Montgomery, Vinod Singaravelu, Keith Lloyd, Nicola Purdy, Kate Daine, and Ann John. A systematic review of the relationship between internet use, self-harm and suicidal behaviour in young people: The good, the bad and the unknown. PloS one, 12(8):e0181722, 2017
- [10] Yaman Akdeniz. The regulation of pornography and child pornography on the internet. Available at SSRN 41684, 1997
- [11] Pawel Sobkowicz and Antoni Sobkowicz. Dynamics of hate based internet user networks. The European Physical Journal B, 73(4):633–643, 2010
- [12] Zikun Liu, Chen Luo, and Jia Lu. Hate speech in the internet context: Unpacking the roles of internet penetration, online legal regulation, and online opinion polarization from a transnational perspective. Information Development, page 02666669221148487, 2023
- [13] Levi Boxell, Matthew Gentzkow, and Jesse M Shapiro. Is the internet causing political polarization? evidence from demographics. Technical report, National Bureau of Economic Research, 2017
- [14] Scott R Peppet. Regulating the internet of things: first steps toward managing discrimination, privacy, security and consent. Tex. L. Rev., 93:85, 2014
- [15] Sandra Wachter. Normative challenges of identification in the internet of things: Privacy, profiling, discrimination, and the gdpr. Computer law & security review, 34(3):436–449, 2018
- [16] Keith F Durkin. Misuse of the internet by pedophiles: Implications for law enforcement and probation practice. Fed. Probation, 61:14, 1997
- [17] Constance H Fung, Hawkin E Woo, and Steven M Asch. Controversies and legal issues of prescribing and dispensing medications using the internet. In Mayo Clinic Proceedings, volume 79, pages 188–194. Elsevier, 2004
- [18] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022
- [19] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017
- [20] Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021
- [21] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359, 2021
- [22] Irene Solaiman, Zeerak Talat, William Agnew, Lama Ahmad, Dylan Baker, Su Lin Blodgett, Hal Daumé III, Jesse Dodge, Ellie Evans, Sara Hooker, et al. Evaluating the social impact of generative ai systems in systems and society. arXiv preprint arXiv:2306.05949, 2023
- [23] Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022
- [24] Samuel R Bowman. Eight things to know about large language models. arXiv preprint arXiv:2304.00612, 2023
- [25] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning, 2016. http://www.deeplearningbook.org
- [26] Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020
- [27] Philipp Koehn and Rebecca Knowles. Six challenges for neural machine translation. arXiv preprint arXiv:1706.03872, 2017
- [28] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022
- [29] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
- [30] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022
- [31] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
- [32] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018
- [33] Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018
- [34] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. 2018
- [35] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022
- [36] Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022
- [37] Yizhe Zhang, Siqi Sun, Michel Galley, Yen-Chun Chen, Chris Brockett, Xiang Gao, Jianfeng Gao, Jingjing Liu, and Bill Dolan. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536, 2019
- [38] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017
- [39] Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. Rrhf: Rank responses to align language models with human feedback without tears. arXiv preprint arXiv:2304.05302, 2023
- [40] Hanze Dong, Wei Xiong, Deepanshu Goyal, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. Raft: Reward ranked finetuning for generative foundation model alignment. arXiv preprint arXiv:2304.06767, 2023
- [41] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023
- [42] Ruibo Liu, Ruixin Yang, Chenyan Jia, Ge Zhang, Denny Zhou, Andrew M Dai, Diyi Yang, and Soroush Vosoughi. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960, 2023
- [43] Johan Ordish. Large language models and software as a medical device. https://medregs.blog.gov.uk/2023/03/03/large-language-models-and-software-as-a-medical-device/
- [44] Yuqing Wang, Yun Zhao, and Linda Petzold. Are large language models ready for healthcare? a comparative study on clinical language understanding, 2023
- [45] Dev Dash, Eric Horvitz, and Nigam Shah. How well do large language models support clinician information needs? https://hai.stanford.edu/news/how-well-do-large-language-models-support-clinician-information-needs
- [46] Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance, 2023
- [47] Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. Fingpt: Open-source financial large language models, 2023
-
[48]
Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
A categorical archive of chatgpt failures
Ali Borji. A categorical archive of chatgpt failures. arXiv preprint arXiv:2302.03494, 2023
-
[50]
Chatgpt and software testing education: Promises & perils
Sajed Jalil, Suzzana Rafi, Thomas D LaToza, Kevin Moran, and Wing Lam. Chatgpt and software testing education: Promises & perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW), pages 4130–4137. IEEE, 2023
work page 2023
-
[51]
Fake news detection on social media: A data mining perspective
Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu. Fake news detection on social media: A data mining perspective. ACM SIGKDD explorations newsletter, 19(1):22–36, 2017
work page 2017
-
[52]
Some Like it Hoax: Automated Fake News Detection in Social Networks
Eugenio Tacchini, Gabriele Ballarin, Marco L Della Vedova, Stefano Moret, and Luca De Alfaro. Some like it hoax: Automated fake news detection in social networks. arXiv preprint arXiv:1704.07506, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[53]
Quantifying Memorization Across Neural Language Models
Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. Quantifying memorization across neural language models. arXiv preprint arXiv:2202.07646, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[54]
A closer look at memorization in deep networks
Devansh Arpit, Stanisław Jastrz˛ ebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In International conference on machine learning, pages 233–242. PMLR, 2017. 43 Trustworthy LLMs
work page 2017
-
[55]
Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, and Yoav Goldberg. Measuring causal effects of data statistics on language model’sfactual’predictions.arXiv preprint arXiv:2207.14251, 2022
-
[56]
Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Hannaneh Hajishirzi, and Daniel Khashabi. When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories. arXiv preprint arXiv:2212.10511, 2022
-
[57]
Unsupervised dense information retrieval with contrastive learning
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. 2022
-
[58]
Prompting gpt-3 to be reliable
Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jordan Boyd-Graber, and Lijuan Wang. Prompting gpt-3 to be reliable. arXiv preprint arXiv:2210.09150, 2022
-
[59]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020
-
[60]
Survey of hallucination in natural language generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023
-
[61]
Artificial hallucinations in chatgpt: implications in scientific writing
Hussam Alkaissi and Samy I McFarlane. Artificial hallucinations in chatgpt: implications in scientific writing. Cureus, 15(2), 2023
-
[62]
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023, 2023
-
[63]
False memories and confabulation
Marcia K Johnson and Carol L Raye. False memories and confabulation. Trends in cognitive sciences, 2(4):137–145, 1998
-
[64]
Calibrated language model fine-tuning for in-and out-of-distribution data
Lingkai Kong, Haoming Jiang, Yuchen Zhuang, Jie Lyu, Tuo Zhao, and Chao Zhang. Calibrated language model fine-tuning for in-and out-of-distribution data. arXiv preprint arXiv:2010.11506, 2020
-
[65]
Increasing faithfulness in knowledge-grounded dialogue with controllable features
Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. Increasing faithfulness in knowledge-grounded dialogue with controllable features. arXiv preprint arXiv:2107.06963, 2021
-
[66]
Why does chatgpt fall short in answering questions faithfully?
Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does chatgpt fall short in answering questions faithfully? arXiv preprint arXiv:2304.10513, 2023
-
[67]
Modeling fluency and faithfulness for diverse neural machine translation
Yang Feng, Wanying Xie, Shuhao Gu, Chenze Shao, Wen Zhang, Zhengxin Yang, and Dong Yu. Modeling fluency and faithfulness for diverse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 59–66, 2020
-
[68]
Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441, 2018
-
[69]
Neural path hunter: Reducing hallucination in dialogue systems via path grounding
Nouha Dziri, Andrea Madotto, Osmar Zaiane, and Avishek Joey Bose. Neural path hunter: Reducing hallucination in dialogue systems via path grounding. arXiv preprint arXiv:2104.08455, 2021
-
[70]
Entity-based knowledge conflicts in question answering
Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering. arXiv preprint arXiv:2109.05052, 2021
-
[71]
SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models
Potsawee Manakul, Adian Liusie, and Mark JF Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896, 2023
-
[72]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004
-
[73]
Bleu: a method for automatic evaluation of machine translation
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002
-
[74]
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958, 2021
-
[75]
Sashank Santhanam, Behnam Hedayatnia, Spandana Gella, Aishwarya Padmakumar, Seokhwan Kim, Yang Liu, and Dilek Hakkani-Tur. Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation. arXiv preprint arXiv:2110.05456, 2021
-
[76]
Or Honovich, Leshem Choshen, Roee Aharoni, Ella Neeman, Idan Szpektor, and Omri Abend. Q2: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. arXiv preprint arXiv:2104.08202, 2021
-
[77]
Improving faithfulness in abstractive summarization with contrast candidate generation and selection
Sihao Chen, Fan Zhang, Kazoo Sone, and Dan Roth. Improving faithfulness in abstractive summarization with contrast candidate generation and selection. arXiv preprint arXiv:2104.09061, 2021
-
[78]
A simple recipe towards reducing hallucination in neural surface realisation
Feng Nie, Jin-Ge Yao, Jinpeng Wang, Rong Pan, and Chin-Yew Lin. A simple recipe towards reducing hallucination in neural surface realisation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2673–2679, 2019
-
[79]
Faithful to the original: Fact aware neural abstractive summarization
Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. Faithful to the original: Fact aware neural abstractive summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018
-
[80]
Totto: A controlled table-to-text generation dataset
Ankur P Parikh, Xuezhi Wang, Sebastian Gehrmann, Manaal Faruqui, Bhuwan Dhingra, Diyi Yang, and Dipanjan Das. Totto: A controlled table-to-text generation dataset. arXiv preprint arXiv:2004.14373, 2020