TrustLLM: Trustworthiness in Large Language Models
Pith reviewed 2026-05-18 11:12 UTC · model grok-4.3
The pith
Proprietary large language models generally outperform open-source ones on trustworthiness measures, and trustworthiness tracks closely with overall utility.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining eight principles and applying a six-dimension benchmark to sixteen LLMs, the study finds that trustworthiness and utility are positively correlated, proprietary models generally lead open-source ones on the tested dimensions, a few open-source models approach proprietary performance, and some models over-calibrate by refusing benign prompts.
What carries the argument
The TrustLLM benchmark, which applies standardized tests across truthfulness, safety, fairness, robustness, privacy, and machine ethics to rank models.
If this is right
- Higher trustworthiness tends to accompany stronger performance on standard tasks.
- On the tested dimensions, widely deployed open-source LLMs may carry more trustworthiness risk than proprietary alternatives, although a few open-source models come close.
- Overly strict safety tuning can reduce model utility by blocking safe user requests.
- Transparency about the specific methods used to improve trustworthiness enables better analysis of their effects.
Where Pith is reading between the lines
- Closing the trustworthiness gap between open-source and proprietary models could require targeted improvements in training data or alignment techniques.
- The observed correlation between trustworthiness and utility suggests that general capability advances may bring trustworthiness gains as a side effect.
- Developers should monitor refusal rates on safe inputs as a routine check when adding safety features.
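The refusal-rate check in the last bullet can be sketched as follows. This is a minimal illustration, not the TrustLLM evaluation method: the marker list `REFUSAL_MARKERS`, the helper names, and the sample responses are all hypothetical, and keyword matching is only a rough stand-in for a proper refusal classifier or human review.

```python
from typing import Iterable

# Hypothetical refusal markers (an assumption for illustration only).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "as an ai")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any of the refusal markers."""
    text = response.strip().lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def benign_refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses to known-benign prompts that look like refusals."""
    responses = list(responses)
    if not responses:
        raise ValueError("need at least one response")
    refused = sum(is_refusal(r) for r in responses)
    return refused / len(responses)


# Made-up responses to prompts that are assumed to be benign.
sample = [
    "Sure, here is a recipe for banana bread.",
    "I'm sorry, but I can't help with that request.",
    "The capital of France is Paris.",
    "I cannot assist with killing a Python process.",
]
rate = benign_refusal_rate(sample)
print(f"benign refusal rate: {rate:.2f}")
```

Tracking this number before and after a safety change gives a simple regression signal; a rising rate on a fixed benign prompt set would suggest over-calibration of the kind the paper describes.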
Load-bearing premise
The selected datasets and evaluation methods for the six dimensions capture the main real-world trustworthiness risks without major gaps or biases.
What would settle it
An open-source model that scores higher than leading proprietary models on all six benchmark dimensions while correctly answering every benign prompt would contradict the reported pattern.
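The criterion above can be written as a small predicate. This is a sketch under stated assumptions: scores are taken to be "higher is better" on every dimension, the function name `contradicts_pattern` and the dictionary layout are hypothetical, and the numbers are placeholders rather than results from the paper.

```python
# The six benchmark dimensions named in the paper.
DIMENSIONS = (
    "truthfulness", "safety", "fairness",
    "robustness", "privacy", "machine_ethics",
)


def contradicts_pattern(open_scores, proprietary_scores, benign_refusal_rate):
    """True if one open-source model strictly beats every proprietary model
    on all six dimensions and refuses no benign prompt."""
    beats_all = all(
        open_scores[d] > max(p[d] for p in proprietary_scores)
        for d in DIMENSIONS
    )
    return beats_all and benign_refusal_rate == 0.0


# Placeholder scores, not taken from the paper.
open_model = {d: 0.90 for d in DIMENSIONS}
proprietary = [{d: 0.80 for d in DIMENSIONS}, {d: 0.85 for d in DIMENSIONS}]

print(contradicts_pattern(open_model, proprietary, benign_refusal_rate=0.0))
```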
Original abstract
Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces TrustLLM as a comprehensive study of trustworthiness in LLMs. It proposes a set of principles spanning eight dimensions, constructs a benchmark across six dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics) using more than 30 datasets, evaluates 16 mainstream LLMs, and reports three primary findings: a positive correlation between trustworthiness and utility, general outperformance by proprietary models over open-source counterparts, and over-calibration in some models that leads to refusal of benign prompts. The work concludes with discussion of open challenges and the need for transparency in trustworthiness technologies.
Significance. If the central empirical claims hold after methodological clarification, the paper would make a useful contribution by providing one of the larger-scale multi-dimensional evaluations of LLM trustworthiness to date. The explicit linkage of findings to model accessibility (proprietary vs. open-source) and the utility-trustworthiness trade-off supplies concrete observations that can inform deployment decisions and future alignment research. The scale (>30 datasets, 16 models) is a clear strength that distinguishes it from narrower prior benchmarks.
major comments (2)
- [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation): The mapping from the eight proposed principles to the six benchmark dimensions and the specific dataset choices lacks an explicit coverage or gap analysis. Without this, it is unclear whether the observed proprietary-model advantage and positive trustworthiness-utility correlation are robust to alternative task selections (e.g., long-context privacy or culturally varied ethics scenarios). This directly affects the load-bearing claim that proprietary LLMs generally outperform open-source ones.
- [§4] §4 (Evaluation Methodology): The manuscript provides insufficient detail on prompt templates, exact scoring rubrics (especially for subjective dimensions such as machine ethics and fairness), and any inter-annotator or inter-model consistency checks. These choices are central to the reported rankings and the over-calibration observation; their omission prevents independent verification of whether the differences are intrinsic or protocol-dependent.
minor comments (3)
- [Abstract] The abstract states principles across eight dimensions but a benchmark across six; a single clarifying sentence would remove potential reader confusion.
- [Results] Correlation plots in the results section would be strengthened by reporting confidence intervals or statistical significance for the trustworthiness-utility relationship.
- [Related Work] A small number of citations to prior multi-dimensional LLM safety benchmarks (e.g., HELM, DecodingTrust) appear to be missing from the related-work discussion.
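The confidence interval requested in the [Results] comment could be obtained with a percentile bootstrap over models. The sketch below is one possible approach, not the paper's analysis: the function names are hypothetical, Pearson correlation is an assumed choice of statistic, and the score lists are made-up placeholders standing in for per-model utility and trustworthiness scores.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(xs, ys, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation, resampling pairs."""
    rng = random.Random(seed)
    n = len(xs)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([xs[i] for i in idx], [ys[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples
            stats.append(r)
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lo, hi


# Placeholder per-model scores, not taken from the paper.
utility = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.52, 0.75]
trust = [0.50, 0.58, 0.65, 0.47, 0.72, 0.69, 0.55, 0.80]

r = pearson(utility, trust)
lo, hi = bootstrap_ci(utility, trust)
print(f"r = {r:.3f}, 95% bootstrap CI = [{lo:.3f}, {hi:.3f}]")
```

With only 16 models, as in the study, such an interval is likely to be wide, which is part of why reporting it would be informative.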
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the benchmark's scope and improve methodological transparency. We address each point below and commit to revisions that strengthen the paper without altering its core claims.
- Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation): The mapping from the eight proposed principles to the six benchmark dimensions and the specific dataset choices lacks an explicit coverage or gap analysis. Without this, it is unclear whether the observed proprietary-model advantage and positive trustworthiness-utility correlation are robust to alternative task selections (e.g., long-context privacy or culturally varied ethics scenarios). This directly affects the load-bearing claim that proprietary LLMs generally outperform open-source ones.
  Authors: We agree that an explicit mapping and gap analysis would improve transparency. In the revised version we will add a table in §3 that maps each of the eight principles to the six benchmark dimensions and lists the datasets chosen for each, together with a short discussion of coverage and acknowledged gaps (e.g., limited long-context privacy scenarios and culturally specific ethics tasks). Our dataset selection follows prior literature for each dimension; the proprietary-model advantage and trustworthiness-utility correlation hold consistently across the >30 datasets we include. We will nevertheless add a limitations paragraph noting that results may vary under alternative task distributions and flag long-context and culturally varied evaluations as important future work.
  revision: partial
- Referee: [§4] §4 (Evaluation Methodology): The manuscript provides insufficient detail on prompt templates, exact scoring rubrics (especially for subjective dimensions such as machine ethics and fairness), and any inter-annotator or inter-model consistency checks. These choices are central to the reported rankings and the over-calibration observation; their omission prevents independent verification of whether the differences are intrinsic or protocol-dependent.
  Authors: We accept that additional methodological detail is required for reproducibility. In the revision we will expand §4 (and add an appendix) with: (i) the full prompt templates used for each dimension, (ii) precise scoring rubrics including how human or automated judgments were applied to machine ethics and fairness, and (iii) inter-annotator agreement statistics for any human-evaluated subsets together with consistency checks across model outputs. These additions will allow readers to verify that reported differences are not artifacts of the evaluation protocol.
  revision: yes
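As an illustration of the inter-annotator agreement statistics mentioned in the second response, the sketch below computes Cohen's kappa for two annotators. The rebuttal does not say which statistic the authors would report, so kappa is an assumed choice here, and the labels are made-up placeholders.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:  # both annotators used a single identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Placeholder judgments on four model responses (not data from the paper).
annotator_1 = ["ok", "ok", "bad", "bad"]
annotator_2 = ["ok", "bad", "bad", "bad"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

Here the annotators agree on three of four items (p_o = 0.75) while chance agreement is 0.5, giving kappa = 0.5. For more than two annotators a different statistic, such as Fleiss' kappa, would be needed.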
Circularity Check
No significant circularity in empirical benchmarking study
The paper conducts an empirical evaluation of 16 LLMs across six trustworthiness dimensions using over 30 external datasets. Central claims (proprietary models outperforming open-source ones, positive trustworthiness-utility correlation, over-calibration) derive directly from model outputs on these datasets rather than from any internal derivation, fitted parameters, or self-referential definitions. No equations, predictions, or uniqueness theorems are presented that reduce to the authors' own inputs by construction. The work is self-contained against external benchmarks, with dataset selection serving as an operationalization step rather than a circular fit.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Existing NLP datasets can serve as valid proxies for real-world trustworthiness failures in LLMs.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "Our findings firstly show that in general trustworthiness and utility are positively related... proprietary LLMs generally outperform most open-source counterparts..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
  Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where
  VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
- AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
  AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models ...
- Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents
  The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains whi...
- Profiling for Pennies: Unveiling the Privacy Iceberg of LLM Agents
  LLM agents can reconstruct high-fidelity personal profiles from minimal PII seeds with over 90% accuracy in under 10 minutes at less than $3 cost, exposing three escalating tiers of privacy risks.
- Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
  PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
- Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
  Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
- Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
  BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
- Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
  CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...
- OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
  OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
- Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
  Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
- JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
  JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
- Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
  Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
- Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
  ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
- Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction
  A multi-view evidential framework combines semantic and reasoning information to improve accuracy and provide trustworthy uncertainty estimates for mental health prediction on text data.
- A Multi-Dimensional Audit of Politically Aligned Large Language Models
  A multi-dimensional audit framework for politically aligned LLMs finds consistent trade-offs: larger models are more effective and truthful but less fair with higher bias, while fine-tuned models reduce bias but incre...
- Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work
  Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Reference graph
Works this paper leans on
-
[1]
A toolkit for text extraction and analysis for natural language processing tasks
Tshephisho Joseph Sefara, Mahlatse Mbooi, Katlego Mashile, Thompho Rambuda, and Mapitsi Rangata. A toolkit for text extraction and analysis for natural language processing tasks. In 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), pages 1–6, 2022
work page 2022
-
[2]
Natural language processing: State of the art, current trends and challenges
Diksha Khurana, Aditya Koli, Kiran Khatter, and Sukhdev Singh. Natural language processing: State of the art, current trends and challenges. Multimedia tools and applications, 82(3):3713–3744, 2023
work page 2023
-
[3]
Wordcraft: story writing with large language models
Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852, 2022
work page 2022
-
[4]
Multilingual machine translation with large language models: Empirical results and analysis, 2023
Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis, 2023
work page 2023
-
[5]
Reinventing search with a new ai-powered microsoft bing and edge, your copilot for the web, 2023. https://blogs.microsoft.com/blog/2023/02/07/ reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/
work page 2023
-
[6]
https://medium.com/whatnot-engineering/ enhancing-search-using-large-language-models-f9dcb988bdb9
Enhancing search using large language models, 2023. https://medium.com/whatnot-engineering/ enhancing-search-using-large-language-models-f9dcb988bdb9
work page 2023
-
[7]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
https://www.projectpro.io/article/ large-language-model-use-cases-and-applications/887
7 top large language model use cases and applications, 2023. https://www.projectpro.io/article/ large-language-model-use-cases-and-applications/887
work page 2023
-
[9]
Code Llama: Open Foundation Models for Code
Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[10]
Large language models: The future of b2b software, 2023
MintMesh. Large language models: The future of b2b software, 2023
work page 2023
-
[11]
Bloomberggpt: A large language model for finance, 2023
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance, 2023
work page 2023
-
[12]
Scientific discovery in the age of artificial intelligence
Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023
work page 2023
-
[13]
Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, Keir Adams, Maurice Weiler, Xiner Li, Tianfan Fu, Yucheng Wang, Haiyang Yu, YuQing Xie, Xiang Fu, Alex Strasser, Shenglong Xu, Yi Liu, Yuanqi Du, Alexandra Saxton, Hongyi Ling, Hannah Lawrence, Hannes Stärk, Shurui Gui, Carl Edwards, Nichola...
-
[14]
The impact of large language models on scientific discovery: a preliminary study using gpt-4, 2023
Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using gpt-4, 2023
work page 2023
-
[15]
Pllama: An open-source large language model for plant science, 2024
Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. Pllama: An open-source large language model for plant science, 2024
work page 2024
-
[16]
The future landscape of large language models in medicine
Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, 80 TRUST LLM Gregory P Veldhuizen, et al. The future landscape of large language models in medicine. Communica- tions Medicine, 3(1):141, 2023
work page 2023
-
[17]
Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, and Yongdong Zhang. ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences. arXiv preprint arXiv:2311.06025, 2023
-
[18]
Alpacare:instruction-tuned large language models for medical application, 2023
Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. Alpacare:instruction-tuned large language models for medical application, 2023
work page 2023
-
[19]
Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun
Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, Lifang He, Brian D. Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks, 2023
work page 2023
-
[20]
Yirong Chen, Zhenyu Wang, Xiaofen Xing, huimin zheng, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, and Xiangmin Xu. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt, 2023
work page 2023
-
[21]
Huatuogpt, towards taming language models to be a doctor
Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075, 2023
-
[22]
Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023
work page 2023
-
[23]
Medicalgpt: Training medical gpt model
Ming Xu. Medicalgpt: Training medical gpt model. https://github.com/shibing624/MedicalGPT, 2023
work page 2023
-
[24]
Soumen Pal, Manojit Bhattacharya, Sang-Soo Lee, and Chiranjib Chakraborty. A domain-specific next-generation large language model (llm) or chatgpt is required for biomedical engineering and research. Annals of Biomedical Engineering, pages 1–4, 2023
work page 2023
-
[25]
Towards generalist biomedical ai
Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334, 2023
-
[26]
Large language models and political science
Mitchell Linegar, Rafal Kocielnik, and R Michael Alvarez. Large language models and political science. Frontiers in Political Science, 5:1257092, 2023
work page 2023
-
[27]
https://github.com/irlab-sdu/fuzi.mingcha, 2023
fuzi.mingcha. https://github.com/irlab-sdu/fuzi.mingcha, 2023
work page 2023
-
[28]
Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023
Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei. Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023
work page 2023
-
[29]
Chawla, Olaf Wiest, and Xiangliang Zhang
Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. In NeurIPS, 2023
work page 2023
-
[30]
Structured chemistry reasoning with large language models
Siru Ouyang, Zhuosheng Zhang, Bing Yan, Xuan Liu, Jiawei Han, and Lianhui Qin. Structured chemistry reasoning with large language models. arXiv preprint arXiv:2311.09656, 2023
-
[31]
Marinegpt: Unlocking secrets of "ocean" to the public, 2023
Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, and Sai-Kit Yeung. Marinegpt: Unlocking secrets of "ocean" to the public, 2023
work page 2023
-
[32]
Oceangpt: A large language model for ocean science tasks, 2023
Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, and Huajun Chen. Oceangpt: A large language model for ocean science tasks, 2023
work page 2023
-
[33]
Jingsi Yu, Junhui Zhu, Yujie Wang, Yang Liu, Hongxiang Chang, Jinran Nie, Cunliang Kong, Ruining Chong, XinLiu, Jiyuan An, Luming Lu, Mingwei Fang, and Lin Zhu. Taoli llama. https://github.com/ blcuicall/taoli, 2023
work page 2023
-
[34]
Artgpt-4: Artistic vision-language understanding with adapter-enhanced minigpt-4, 2023
Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, and Kun Wang. Artgpt-4: Artistic vision-language understanding with adapter-enhanced minigpt-4, 2023. 81 TRUST LLM
work page 2023
-
[35]
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin- odkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...
work page 2022
-
[36]
Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Sia- mak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing...
work page 2023
-
[37]
Palm: Efficiently training massive language models, 2023
Towards Data Science. Palm: Efficiently training massive language models, 2023
work page 2023
-
[38]
How chatgpt works: A look inside large language models, 2023
Wired. How chatgpt works: A look inside large language models, 2023
work page 2023
-
[39]
LoRA: Low-Rank Adaptation of Large Language Models
Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[40]
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Pathways: Asynchronous distributed dataflow for ml
Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al. Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems, 4:430–449, 2022
work page 2022
-
[42]
Ai alignment: A comprehensive survey, 2023
Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. Ai alignment: A comprehensive survey, 2023
work page 2023
-
[43]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730– 27744, 2022
work page 2022
-
[44]
Improving language model negotiation with self-play and in-context learning from ai feedback
Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023. 82 TRUST LLM
-
[45]
Principle-driven self-alignment of language models from scratch with minimal human supervision
Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023
-
[46]
Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs, 2023
work page 2023
-
[47]
Measuring Progress on Scalable Oversight for Large Language Models
Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamil ˙e Lukoši¯ut˙e, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[48]
Discovering Language Model Behaviors with Model-Written Evaluations
Ethan Perez, Sam Ringer, Kamil˙e Lukoši¯ut˙e, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[49]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[50]
Characterizing manipulation from ai systems
Micah Carroll, Alan Chan, Henry Ashton, and David Krueger. Characterizing manipulation from ai systems. arXiv preprint arXiv:2303.09387, 2023
-
[51]
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback
Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023
-
[52]
Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022
-
[53]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022
-
[54]
The effects of reward misspecification: Mapping and mitigating misaligned models, 2022
Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022
-
[55]
Cooperative inverse reinforcement learning
Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016
-
[56]
Survey of hallucination in natural language generation
Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023
-
[57]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023
-
[58]
Factuality challenges in the era of large language models
Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, et al. Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189, 2023
-
[59]
Combating misinformation in the age of llms: Opportunities and challenges
Canyu Chen and Kai Shu. Combating misinformation in the age of llms: Opportunities and challenges. arXiv preprint arXiv:2311.05656, 2023
-
[60]
10 ways cybercriminals can abuse large language models, 2023
Forbes Tech Council. 10 ways cybercriminals can abuse large language models, 2023
-
[61]
Jailbroken: How Does LLM Safety Training Fail?
Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023
-
[62]
Unraveling the link between translations and gender bias in llms, 2023
Appen. Unraveling the link between translations and gender bias in llms, 2023
-
[63]
Navigating the biases in llm generative ai: A guide to responsible implementation, 2023
Forbes Tech Council. Navigating the biases in llm generative ai: A guide to responsible implementation, 2023
-
[64]
Large language models may leak personal data, 2022
Slator. Large language models may leak personal data, 2022. https://slator.com/large-language-models-may-leak-personal-data/
-
[65]
Deid-gpt: Zero-shot medical text de-identification by gpt-4, 2023
Zhengliang Liu, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin Zhao, Wei Liu, Dinggang Shen, Quanzheng Li, Tianming Liu, Dajiang Zhu, and Xiang Li. Deid-gpt: Zero-shot medical text de-identification by gpt-4, 2023
-
[66]
What does it mean to align ai with human values?, 2022
Quanta Magazine. What does it mean to align ai with human values?, 2022
- [67]
- [68]
-
[69]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
-
[70]
Holistic Evaluation of Language Models
Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022
-
[71]
Decodingtrust: A comprehensive assessment of trustworthiness in gpt models
Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023
-
[72]
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023
-
[73]
Do-not-answer: A dataset for evaluating safeguards in llms
Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023
-
[74]
Chatbot arena leaderboard week 8: Introducing mt-bench and vicuna-33b
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, and Hao Zhang. Chatbot arena leaderboard week 8: Introducing mt-bench and vicuna-33b. https://lmsys.org/chatbot-arena-leaderboard-week-8-introducing-mt-bench-and-vicuna-33b/, 2023
-
[75]
The big benchmarks collection - a open-llm-leaderboard collection
Hugging Face. The big benchmarks collection - a open-llm-leaderboard collection. https://huggingface.co/spaces/OpenLLMBenchmark/The-Big-Benchmarks-Collection
-
[76]
https://platform.openai.com/docs/guides/moderation
OpenAI moderation API, 2023. https://platform.openai.com/docs/guides/moderation
-
[77]
The foundation model transparency index, 2023
Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. The foundation model transparency index, 2023
-
[78]
Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, and ...
- [79]
-
[80]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017