arxiv: 2401.05561 · v6 · submitted 2024-01-10 · 💻 cs.CL

TrustLLM: Trustworthiness in Large Language Models

Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, Zhengliang Liu, Yixin Liu, Yijue Wang, Zhikun Zhang, Bertie Vidgen, Bhavya Kailkhura, Caiming Xiong, Chaowei Xiao, Chunyuan Li, Eric Xing, Furong Huang, Hao Liu, Heng Ji, Hongyi Wang, Huan Zhang, Huaxiu Yao, Manolis Kellis, Marinka Zitnik, Meng Jiang, Mohit Bansal, James Zou, Jian Pei, Jian Liu, Jianfeng Gao, Jiawei Han, Jieyu Zhao, Jiliang Tang, Jindong Wang, Joaquin Vanschoren, John Mitchell, Kai Shu, Kaidi Xu, Kai-Wei Chang, Lifang He, Lifu Huang, Michael Backes, Neil Zhenqiang Gong, Philip S. Yu, Pin-Yu Chen, Quanquan Gu, Ran Xu, Rex Ying, Shuiwang Ji, Suman Jana, Tianlong Chen, Tianming Liu, Tianyi Zhou, William Wang, Xiang Li, Xiangliang Zhang, Xiao Wang, Xing Xie, Xun Chen, Xuyu Wang, Yan Liu, Yanfang Ye, Yinzhi Cao, Yong Chen, Yue Zhao

Pith reviewed 2026-05-18 11:12 UTC · model grok-4.3

classification 💻 cs.CL

keywords: trustworthiness, large language models, LLMs, benchmark, evaluation, safety, fairness, privacy

The pith

Proprietary large language models generally outperform open-source ones on trustworthiness measures, and trustworthiness tracks closely with overall utility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out principles for trustworthy LLMs across eight dimensions and builds a benchmark covering six of them: truthfulness, safety, fairness, robustness, privacy, and machine ethics. It runs the benchmark on sixteen mainstream models using more than thirty datasets. The results indicate that trustworthiness and functional performance rise together, proprietary models lead most open-source ones, and a handful of open-source models nearly match the leaders. Some models prove overly cautious, refusing harmless requests in the name of safety and thereby lowering their usefulness. The work also highlights the need for transparency about the specific techniques used to build trustworthiness.

Core claim

By defining eight principles and applying a six-dimension benchmark to sixteen LLMs, the study finds that trustworthiness and utility are positively correlated, proprietary models generally lead open-source ones on the tested dimensions, a few open-source models approach proprietary performance, and some models over-calibrate by refusing benign prompts.

What carries the argument

The TrustLLM benchmark, which applies standardized tests across truthfulness, safety, fairness, robustness, privacy, and machine ethics to rank models.

If this is right

Higher trustworthiness tends to accompany stronger performance on standard tasks.
Widespread use of open-source LLMs carries elevated risk compared with proprietary alternatives.
Overly strict safety tuning can reduce model utility by blocking safe user requests.
Transparency about the specific methods used to improve trustworthiness enables better analysis of their effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Closing the trustworthiness gap between open-source and proprietary models could require targeted improvements in training data or alignment techniques.
The observed correlation between trustworthiness and utility suggests that general capability advances may bring trustworthiness gains as a side effect.
Developers should monitor refusal rates on safe inputs as a routine check when adding safety features.
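
The refusal-rate check in the last point can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation method: it assumes a crude keyword heuristic for spotting refusals, and the marker list and the names `is_refusal` and `benign_refusal_rate` are hypothetical. A production check would more plausibly use a trained classifier or human review.

```python
from typing import Iterable

# Hypothetical refusal markers (an assumption for illustration only).
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i am unable", "i won't")


def is_refusal(response: str) -> bool:
    """Return True if a refusal marker appears in the first 80 characters."""
    head = response.strip().lower()[:80]
    return any(marker in head for marker in REFUSAL_MARKERS)


def benign_refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses to known-benign prompts that look like refusals."""
    items = list(responses)
    if not items:
        raise ValueError("need at least one response")
    return sum(is_refusal(r) for r in items) / len(items)


# Responses a model gave to prompts that are known to be harmless.
sample = [
    "Sure, here is a recipe for banana bread.",
    "I'm sorry, but I can't help with that request.",
    "The capital of France is Paris.",
    "I cannot assist with killing a Python process.",
]
print(f"benign refusal rate: {benign_refusal_rate(sample):.2f}")
```

Tracking this number before and after a safety change gives a rough signal of whether the change made the model refuse more harmless requests.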

Load-bearing premise

The selected datasets and evaluation methods for the six dimensions capture the main real-world trustworthiness risks without major gaps or biases.

What would settle it

An open-source model that scores higher than leading proprietary models on all six benchmark dimensions while correctly answering every benign prompt would contradict the reported pattern.
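
The condition above can be written as an explicit check. The sketch below is only an illustration under stated assumptions: every dimension is scored so that higher is better, each model has one aggregate score per dimension, and the function name `contradicts_pattern` and the score layout are hypothetical rather than taken from the paper.

```python
from typing import Dict, Sequence

# The six benchmark dimensions named in the abstract.
DIMENSIONS = (
    "truthfulness", "safety", "fairness",
    "robustness", "privacy", "machine_ethics",
)


def contradicts_pattern(
    open_scores: Dict[str, float],
    proprietary_scores: Sequence[Dict[str, float]],
    benign_refusals: int,
) -> bool:
    """True if the open-source model beats every listed proprietary model
    on every dimension and refused none of the benign prompts."""
    if not proprietary_scores:
        raise ValueError("need at least one proprietary model to compare against")
    beats_all = all(
        open_scores[d] > p[d] for p in proprietary_scores for d in DIMENSIONS
    )
    return beats_all and benign_refusals == 0
```

A single dimension on which any proprietary model ties or leads, or a single refused benign prompt, is enough for the check to return False.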

Original abstract

Large language models (LLMs), exemplified by ChatGPT, have gained considerable attention for their excellent natural language processing capabilities. Nonetheless, these LLMs present many challenges, particularly in the realm of trustworthiness. Therefore, ensuring the trustworthiness of LLMs emerges as an important topic. This paper introduces TrustLLM, a comprehensive study of trustworthiness in LLMs, including principles for different dimensions of trustworthiness, established benchmark, evaluation, and analysis of trustworthiness for mainstream LLMs, and discussion of open challenges and future directions. Specifically, we first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics. We then present a study evaluating 16 mainstream LLMs in TrustLLM, consisting of over 30 datasets. Our findings firstly show that in general trustworthiness and utility (i.e., functional effectiveness) are positively related. Secondly, our observations reveal that proprietary LLMs generally outperform most open-source counterparts in terms of trustworthiness, raising concerns about the potential risks of widely accessible open-source LLMs. However, a few open-source LLMs come very close to proprietary ones. Thirdly, it is important to note that some LLMs may be overly calibrated towards exhibiting trustworthiness, to the extent that they compromise their utility by mistakenly treating benign prompts as harmful and consequently not responding. Finally, we emphasize the importance of ensuring transparency not only in the models themselves but also in the technologies that underpin trustworthiness. Knowing the specific trustworthy technologies that have been employed is crucial for analyzing their effectiveness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper aggregates prior datasets into a six-dimension benchmark and reports that proprietary LLMs generally outperform open-source ones on trustworthiness while showing a positive link to utility.

The main thing to know is that TrustLLM gathers existing tasks into one benchmark covering truthfulness, safety, fairness, robustness, privacy, and machine ethics, then runs 16 models across more than 30 datasets. It reports proprietary models ahead on most measures, a positive correlation between trustworthiness and utility, and some models refusing safe prompts too often. The transparency point about knowing which alignment techniques were used is also practical.

Referee Report

2 major / 3 minor

Summary. The paper introduces TrustLLM as a comprehensive study of trustworthiness in LLMs. It proposes a set of principles spanning eight dimensions, constructs a benchmark across six dimensions (truthfulness, safety, fairness, robustness, privacy, and machine ethics) using more than 30 datasets, evaluates 16 mainstream LLMs, and reports three primary findings: a positive correlation between trustworthiness and utility, general outperformance by proprietary models over open-source counterparts, and over-calibration in some models that leads to refusal of benign prompts. The work concludes with discussion of open challenges and the need for transparency in trustworthiness technologies.

Significance. If the central empirical claims hold after methodological clarification, the paper would make a useful contribution by providing one of the larger-scale multi-dimensional evaluations of LLM trustworthiness to date. The explicit linkage of findings to model accessibility (proprietary vs. open-source) and the utility-trustworthiness trade-off supplies concrete observations that can inform deployment decisions and future alignment research. The scale (>30 datasets, 16 models) is a clear strength that distinguishes it from narrower prior benchmarks.

major comments (2)

[§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation): The mapping from the eight proposed principles to the six benchmark dimensions and the specific dataset choices lack an explicit coverage or gap analysis. Without this, it is unclear whether the observed proprietary-model advantage and positive trustworthiness-utility correlation are robust to alternative task selections (e.g., long-context privacy or culturally varied ethics scenarios). This directly affects the load-bearing claim that proprietary LLMs generally outperform open-source ones.
[§4] §4 (Evaluation Methodology): The manuscript provides insufficient detail on prompt templates, exact scoring rubrics (especially for subjective dimensions such as machine ethics and fairness), and any inter-annotator or inter-model consistency checks. These choices are central to the reported rankings and the over-calibration observation; their omission prevents independent verification of whether the differences are intrinsic or protocol-dependent.

minor comments (3)

[Abstract] The abstract states principles across eight dimensions but a benchmark across six; a single clarifying sentence would remove potential reader confusion.
[Results] Correlation plots in the results section would be strengthened by reporting confidence intervals or statistical significance for the trustworthiness-utility relationship.
[Related Work] A small number of citations to prior multi-dimensional LLM safety benchmarks (e.g., HELM, DecodingTrust) appear to be missing from the related-work discussion.
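
The confidence-interval request in the [Results] comment could be met with a percentile bootstrap over models. The following is a hedged sketch, not the authors' analysis: it assumes one aggregate trustworthiness score and one utility score per model, uses Pearson correlation, and the function names are hypothetical. With only 16 models the resulting interval would be wide, which is itself worth reporting.

```python
import math
import random
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(
    x: Sequence[float],
    y: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the Pearson correlation."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    pearson(x, y)  # fail early if the full sample is degenerate
    rng = random.Random(seed)
    n = len(x)
    stats: List[float] = []
    while len(stats) < n_resamples:
        # Resample models (paired scores) with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        try:
            stats.append(pearson([x[i] for i in idx], [y[i] for i in idx]))
        except ValueError:
            continue  # skip resamples where one variable is constant
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

Calling `bootstrap_ci(trust_scores, utility_scores)` on per-model score lists returns a `(lower, upper)` pair to report alongside the point estimate from `pearson`.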

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the benchmark's scope and improve methodological transparency. We address each point below and commit to revisions that strengthen the paper without altering its core claims.

Referee: [§3 and §4] §3 (Benchmark Construction) and §4 (Evaluation): The mapping from the eight proposed principles to the six benchmark dimensions and the specific dataset choices lack an explicit coverage or gap analysis. Without this, it is unclear whether the observed proprietary-model advantage and positive trustworthiness-utility correlation are robust to alternative task selections (e.g., long-context privacy or culturally varied ethics scenarios). This directly affects the load-bearing claim that proprietary LLMs generally outperform open-source ones.

Authors: We agree that an explicit mapping and gap analysis would improve transparency. In the revised version we will add a table in §3 that maps each of the eight principles to the six benchmark dimensions and lists the datasets chosen for each, together with a short discussion of coverage and acknowledged gaps (e.g., limited long-context privacy scenarios and culturally specific ethics tasks). Our dataset selection follows prior literature for each dimension; the proprietary-model advantage and trustworthiness-utility correlation hold consistently across the >30 datasets we include. We will nevertheless add a limitations paragraph noting that results may vary under alternative task distributions and flag long-context and culturally varied evaluations as important future work.
Revision: partial

Referee: [§4] §4 (Evaluation Methodology): The manuscript provides insufficient detail on prompt templates, exact scoring rubrics (especially for subjective dimensions such as machine ethics and fairness), and any inter-annotator or inter-model consistency checks. These choices are central to the reported rankings and the over-calibration observation; their omission prevents independent verification of whether the differences are intrinsic or protocol-dependent.

Authors: We accept that additional methodological detail is required for reproducibility. In the revision we will expand §4 (and add an appendix) with: (i) the full prompt templates used for each dimension, (ii) precise scoring rubrics including how human or automated judgments were applied to machine ethics and fairness, and (iii) inter-annotator agreement statistics for any human-evaluated subsets together with consistency checks across model outputs. These additions will allow readers to verify that reported differences are not artifacts of the evaluation protocol.
Revision: yes
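
One common inter-annotator agreement statistic for item (iii) is Cohen's kappa, which corrects raw agreement between two annotators for agreement expected by chance. The rebuttal does not say which statistic would be reported, so the sketch below is an assumption offered for illustration, and the labels in the usage lines are made up.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up judgments from two annotators on four model responses.
annotator_1 = ["ethical", "ethical", "unethical", "unethical"]
annotator_2 = ["ethical", "unethical", "ethical", "unethical"]
print(cohens_kappa(annotator_1, annotator_2))
```

Values near 1 indicate strong agreement and values near 0 indicate agreement no better than chance, which helps a reader judge how much weight the subjective dimensions can bear.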

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

The paper conducts an empirical evaluation of 16 LLMs across six trustworthiness dimensions using over 30 external datasets. Central claims (proprietary models outperforming open-source ones, positive trustworthiness-utility correlation, over-calibration) derive directly from model outputs on these datasets rather than from any internal derivation, fitted parameters, or self-referential definitions. No equations, predictions, or uniqueness theorems are presented that reduce to the authors' own inputs by construction. The work is self-contained against external benchmarks, with dataset selection serving as an operationalization step rather than a circular fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on the assumption that the selected datasets and metrics are representative proxies for the eight trustworthiness principles; no new physical or mathematical entities are introduced.

axioms (1)

Domain assumption: Existing NLP datasets can serve as valid proxies for real-world trustworthiness failures in LLMs.
The benchmark construction in the abstract relies on this without independent validation of coverage.

pith-pipeline@v0.9.0 · 6093 in / 1239 out tokens · 26937 ms · 2026-05-18T11:12:58.510044+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We first propose a set of principles for trustworthy LLMs that span eight different dimensions. Based on these principles, we further establish a benchmark across six dimensions including truthfulness, safety, fairness, robustness, privacy, and machine ethics."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Our findings firstly show that in general trustworthiness and utility are positively related... proprietary LLMs generally outperform most open-source counterparts..."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models
cs.CR 2026-05 conditional novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction
cs.MM 2026-04 unverdicted novelty 8.0

AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models ...
Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents
cs.LG 2026-05 unverdicted novelty 6.0

The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains whi...
Profiling for Pennies: Unveiling the Privacy Iceberg of LLM Agents
cs.CR 2026-05 unverdicted novelty 6.0

LLM agents can reconstruct high-fidelity personal profiles from minimal PII seeds with over 90% accuracy in under 10 minutes at less than $3 cost, exposing three escalating tiers of privacy risks.
Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment
cs.AI 2026-05 unverdicted novelty 6.0

PIA achieves lower attack success rates on persona-based jailbreaks via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.
Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.
Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts
cs.LG 2026-04 unverdicted novelty 6.0

BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.
Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
cs.CL 2026-04 unverdicted novelty 6.0

CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models
cs.LG 2025-11 unverdicted novelty 6.0

OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
cs.LG 2025-10 conditional novelty 6.0

Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
cs.CR 2024-03 accept novelty 6.0

JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and ...
Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
cs.LG 2026-04 unverdicted novelty 5.0

Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs
cs.CL 2025-10 unverdicted novelty 5.0

ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.
Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction
cs.CL 2026-05 unverdicted novelty 4.0

A multi-view evidential framework combines semantic and reasoning information to improve accuracy and provide trustworthy uncertainty estimates for mental health prediction on text data.
A Multi-Dimensional Audit of Politically Aligned Large Language Models
cs.CL 2026-04 unverdicted novelty 4.0

A multi-dimensional audit framework for politically aligned LLMs finds consistent trade-offs: larger models are more effective and truthful but less fair with higher bias, while fine-tuned models reduce bias but incre...
Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work
cs.AI 2026-04 unverdicted novelty 4.0

Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · cited by 18 Pith papers · 33 internal anchors

[1]

A toolkit for text extraction and analysis for natural language processing tasks

Tshephisho Joseph Sefara, Mahlatse Mbooi, Katlego Mashile, Thompho Rambuda, and Mapitsi Rangata. A toolkit for text extraction and analysis for natural language processing tasks. In 2022 International Conference on Artificial Intelligence, Big Data, Computing and Data Communication Systems (icABCD), pages 1–6, 2022

work page 2022
[2]

Natural language processing: State of the art, current trends and challenges

Diksha Khurana, Aditya Koli, Kiran Khatter, and Sukhdev Singh. Natural language processing: State of the art, current trends and challenges. Multimedia tools and applications, 82(3):3713–3744, 2023

work page 2023
[3]

Wordcraft: story writing with large language models

Ann Yuan, Andy Coenen, Emily Reif, and Daphne Ippolito. Wordcraft: story writing with large language models. In 27th International Conference on Intelligent User Interfaces, pages 841–852, 2022

work page 2022
[4]

Multilingual machine translation with large language models: Empirical results and analysis, 2023

Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Shujian Huang, Lingpeng Kong, Jiajun Chen, and Lei Li. Multilingual machine translation with large language models: Empirical results and analysis, 2023

work page 2023
[5]

https://blogs.microsoft.com/blog/2023/02/07/ reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/

Reinventing search with a new ai-powered microsoft bing and edge, your copilot for the web, 2023. https://blogs.microsoft.com/blog/2023/02/07/ reinventing-search-with-a-new-ai-powered-microsoft-bing-and-edge-your-copilot-for-the-web/

work page 2023
[6]

https://medium.com/whatnot-engineering/ enhancing-search-using-large-language-models-f9dcb988bdb9

Enhancing search using large language models, 2023. https://medium.com/whatnot-engineering/ enhancing-search-using-large-language-models-f9dcb988bdb9

work page 2023
[7]

WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

https://www.projectpro.io/article/ large-language-model-use-cases-and-applications/887

7 top large language model use cases and applications, 2023. https://www.projectpro.io/article/ large-language-model-use-cases-and-applications/887

work page 2023
[9]

Code Llama: Open Foundation Models for Code

Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[10]

Large language models: The future of b2b software, 2023

MintMesh. Large language models: The future of b2b software, 2023

work page 2023
[11]

Bloomberggpt: A large language model for finance, 2023

Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. Bloomberggpt: A large language model for finance, 2023

work page 2023
[12]

Scientific discovery in the age of artificial intelligence

Hanchen Wang, Tianfan Fu, Yuanqi Du, Wenhao Gao, Kexin Huang, Ziming Liu, Payal Chandak, Shengchao Liu, Peter Van Katwyk, Andreea Deac, et al. Scientific discovery in the age of artificial intelligence. Nature, 620(7972):47–60, 2023

work page 2023
[13]

Hofgard, Aria Mansouri Tehrani, Rui Wang, Ameya Daigavane, Montgomery Bohde, Jerry Kurtin, Qian Huang, Tuong Phung, Minkai Xu, Chaitanya K

Xuan Zhang, Limei Wang, Jacob Helwig, Youzhi Luo, Cong Fu, Yaochen Xie, Meng Liu, Yuchao Lin, Zhao Xu, Keqiang Yan, Keir Adams, Maurice Weiler, Xiner Li, Tianfan Fu, Yucheng Wang, Haiyang Yu, YuQing Xie, Xiang Fu, Alex Strasser, Shenglong Xu, Yi Liu, Yuanqi Du, Alexandra Saxton, Hongyi Ling, Hannah Lawrence, Hannes Stärk, Shurui Gui, Carl Edwards, Nichola...

work page arXiv 2023
[14]

The impact of large language models on scientific discovery: a preliminary study using gpt-4, 2023

Microsoft Research AI4Science and Microsoft Azure Quantum. The impact of large language models on scientific discovery: a preliminary study using gpt-4, 2023

work page 2023
[15]

Pllama: An open-source large language model for plant science, 2024

Xianjun Yang, Junfeng Gao, Wenxin Xue, and Erik Alexandersson. Pllama: An open-source large language model for plant science, 2024

work page 2024
[16]

The future landscape of large language models in medicine

Jan Clusmann, Fiona R Kolbinger, Hannah Sophie Muti, Zunamys I Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, 80 TRUST LLM Gregory P Veldhuizen, et al. The future landscape of large language models in medicine. Communica- tions Medicine, 3(1):141, 2023

work page 2023
[17]

ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences

Yuanhe Tian, Ruyi Gan, Yan Song, Jiaxing Zhang, and Yongdong Zhang. ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences. arXiv preprint arXiv:2311.06025, 2023

work page arXiv 2023
[18]

Alpacare:instruction-tuned large language models for medical application, 2023

Xinlu Zhang, Chenxin Tian, Xianjun Yang, Lichang Chen, Zekun Li, and Linda Ruth Petzold. Alpacare:instruction-tuned large language models for medical application, 2023

work page 2023
[19]

Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun

Kai Zhang, Jun Yu, Zhiling Yan, Yixin Liu, Eashan Adhikarla, Sunyang Fu, Xun Chen, Chen Chen, Yuyin Zhou, Xiang Li, Lifang He, Brian D. Davison, Quanzheng Li, Yong Chen, Hongfang Liu, and Lichao Sun. Biomedgpt: A unified and generalist biomedical generative pre-trained transformer for vision, language, and multimodal tasks, 2023

work page 2023
[20]

Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt, 2023

Yirong Chen, Zhenyu Wang, Xiaofen Xing, huimin zheng, Zhipei Xu, Kai Fang, Junhong Wang, Sihang Li, Jieling Wu, Qi Liu, and Xiangmin Xu. Bianque: Balancing the questioning and suggestion ability of health llms with multi-turn health conversations polished by chatgpt, 2023

work page 2023
[21]

Huatuogpt, towards taming language models to be a doctor

Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Jianquan Li, Guiming Chen, Xiangbo Wu, Zhiyi Zhang, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. Huatuogpt, towards taming language models to be a doctor. arXiv preprint arXiv:2305.15075, 2023

work page arXiv 2023
[22]

Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge

Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus, 15(6), 2023

work page 2023
[23]

Medicalgpt: Training medical gpt model

Ming Xu. Medicalgpt: Training medical gpt model. https://github.com/shibing624/MedicalGPT, 2023

work page 2023
[24]

A domain-specific next-generation large language model (llm) or chatgpt is required for biomedical engineering and research

Soumen Pal, Manojit Bhattacharya, Sang-Soo Lee, and Chiranjib Chakraborty. A domain-specific next-generation large language model (llm) or chatgpt is required for biomedical engineering and research. Annals of Biomedical Engineering, pages 1–4, 2023

work page 2023
[25]

Towards generalist biomedical ai

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Chuck Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. arXiv preprint arXiv:2307.14334, 2023

work page arXiv 2023
[26]

Large language models and political science

Mitchell Linegar, Rafal Kocielnik, and R Michael Alvarez. Large language models and political science. Frontiers in Political Science, 5:1257092, 2023

work page 2023
[27]

https://github.com/irlab-sdu/fuzi.mingcha, 2023

fuzi.mingcha. https://github.com/irlab-sdu/fuzi.mingcha, 2023

work page 2023
[28]

Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023

Shengbin Yue, Wei Chen, Siyuan Wang, Bingxuan Li, Chenchen Shen, Shujun Liu, Yuxuan Zhou, Yao Xiao, Song Yun, Xuanjing Huang, and Zhongyu Wei. Disc-lawllm: Fine-tuning large language models for intelligent legal services, 2023

work page 2023
[29]

Chawla, Olaf Wiest, and Xiangliang Zhang

Taicheng Guo, Kehan Guo, Bozhao Nan, Zhenwen Liang, Zhichun Guo, Nitesh V . Chawla, Olaf Wiest, and Xiangliang Zhang. What can large language models do in chemistry? a comprehensive benchmark on eight tasks. In NeurIPS, 2023

work page 2023
[30]

Structured chemistry reasoning with large language models

Siru Ouyang, Zhuosheng Zhang, Bing Yan, Xuan Liu, Jiawei Han, and Lianhui Qin. Structured chemistry reasoning with large language models. arXiv preprint arXiv:2311.09656, 2023

work page arXiv 2023
[31]

Marinegpt: Unlocking secrets of "ocean" to the public, 2023

Ziqiang Zheng, Jipeng Zhang, Tuan-Anh Vu, Shizhe Diao, Yue Him Wong Tim, and Sai-Kit Yeung. Marinegpt: Unlocking secrets of "ocean" to the public, 2023

work page 2023
[32]

Oceangpt: A large language model for ocean science tasks, 2023

Zhen Bi, Ningyu Zhang, Yida Xue, Yixin Ou, Daxiong Ji, Guozhou Zheng, and Huajun Chen. Oceangpt: A large language model for ocean science tasks, 2023

work page 2023
[33]

Taoli llama

Jingsi Yu, Junhui Zhu, Yujie Wang, Yang Liu, Hongxiang Chang, Jinran Nie, Cunliang Kong, Ruining Chong, XinLiu, Jiyuan An, Luming Lu, Mingwei Fang, and Lin Zhu. Taoli llama. https://github.com/ blcuicall/taoli, 2023

work page 2023
[34]

Artgpt-4: Artistic vision-language understanding with adapter-enhanced minigpt-4, 2023

Zhengqing Yuan, Huiwen Xue, Xinyi Wang, Yongming Liu, Zhuanzhe Zhao, and Kun Wang. Artgpt-4: Artistic vision-language understanding with adapter-enhanced minigpt-4, 2023. 81 TRUST LLM

work page 2023
[35]

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vin- odkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bra...

[36]

Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing...

[37]

Palm: Efficiently training massive language models, 2023

Towards Data Science. Palm: Efficiently training massive language models, 2023

[38]

How chatgpt works: A look inside large language models, 2023

Wired. How chatgpt works: A look inside large language models, 2023

[39]

LoRA: Low-Rank Adaptation of Large Language Models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021

[40]

QLoRA: Efficient Finetuning of Quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023

[41]

Pathways: Asynchronous distributed dataflow for ml

Paul Barham, Aakanksha Chowdhery, Jeff Dean, Sanjay Ghemawat, Steven Hand, Daniel Hurt, Michael Isard, Hyeontaek Lim, Ruoming Pang, Sudip Roy, et al. Pathways: Asynchronous distributed dataflow for ml. Proceedings of Machine Learning and Systems, 4:430–449, 2022

[42]

Ai alignment: A comprehensive survey, 2023

Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O’Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. Ai alignment: A comprehensive survey, 2023

[43]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

[44]

Improving language model negotiation with self-play and in-context learning from ai feedback

Yao Fu, Hao Peng, Tushar Khot, and Mirella Lapata. Improving language model negotiation with self-play and in-context learning from ai feedback. arXiv preprint arXiv:2305.10142, 2023

[45]

Principle-driven self-alignment of language models from scratch with minimal human supervision

Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047, 2023

[46]

Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs, 2023

Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. Rl4f: Generating natural language feedback with reinforcement learning for repairing model outputs, 2023

[47]

Measuring Progress on Scalable Oversight for Large Language Models

Samuel R Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, et al. Measuring progress on scalable oversight for large language models. arXiv preprint arXiv:2211.03540, 2022

[48]

Discovering Language Model Behaviors with Model-Written Evaluations

Ethan Perez, Sam Ringer, Kamilė Lukošiūtė, Karina Nguyen, Edwin Chen, Scott Heiner, Craig Pettit, Catherine Olsson, Sandipan Kundu, Saurav Kadavath, et al. Discovering language model behaviors with model-written evaluations. arXiv preprint arXiv:2212.09251, 2022

[49]

Improving Factuality and Reasoning in Language Models through Multiagent Debate

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

[50]

Characterizing manipulation from ai systems

Micah Carroll, Alan Chan, Henry Ashton, and David Krueger. Characterizing manipulation from ai systems. arXiv preprint arXiv:2303.09387, 2023

[51]

RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

Harrison Lee, Samrat Phatale, Hassan Mansoor, Kellie Lu, Thomas Mesnard, Colton Bishop, Victor Carbune, and Abhinav Rastogi. Rlaif: Scaling reinforcement learning from human feedback with ai feedback. arXiv preprint arXiv:2309.00267, 2023

[52]

A Generalist Agent

Scott Reed, Konrad Zolna, Emilio Parisotto, Sergio Gomez Colmenarejo, Alexander Novikov, Gabriel Barth-Maron, Mai Gimenez, Yury Sulsky, Jackie Kay, Jost Tobias Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

[53]

Constitutional AI: Harmlessness from AI Feedback

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

[54]

The effects of reward misspecification: Mapping and mitigating misaligned models, 2022

Alexander Pan, Kush Bhatia, and Jacob Steinhardt. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv preprint arXiv:2201.03544, 2022

[55]

Cooperative inverse reinforcement learning

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. Advances in neural information processing systems, 29, 2016

[56]

Survey of hallucination in natural language generation

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023

[57]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023

[58]

Factuality challenges in the era of large language models

Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, et al. Factuality challenges in the era of large language models. arXiv preprint arXiv:2310.05189, 2023

[59]

Combating misinformation in the age of llms: Opportunities and challenges

Canyu Chen and Kai Shu. Combating misinformation in the age of llms: Opportunities and challenges. arXiv preprint arXiv:2311.05656, 2023

[60]

10 ways cybercriminals can abuse large language models, 2023

Forbes Tech Council. 10 ways cybercriminals can abuse large language models, 2023

[61]

Jailbroken: How Does LLM Safety Training Fail?

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023

[62]

Unraveling the link between translations and gender bias in llms, 2023

Appen. Unraveling the link between translations and gender bias in llms, 2023

[63]

Navigating the biases in llm generative ai: A guide to responsible implementation, 2023

Forbes Tech Council. Navigating the biases in llm generative ai: A guide to responsible implementation, 2023

[64]

Large language models may leak personal data, 2022

Slator. Large language models may leak personal data, 2022. https://slator.com/large-language-models-may-leak-personal-data/

[65]

Deid-gpt: Zero-shot medical text de-identification by gpt-4, 2023

Zhengliang Liu, Xiaowei Yu, Lu Zhang, Zihao Wu, Chao Cao, Haixing Dai, Lin Zhao, Wei Liu, Dinggang Shen, Quanzheng Li, Tianming Liu, Dajiang Zhu, and Xiang Li. Deid-gpt: Zero-shot medical text de-identification by gpt-4, 2023

[66]

What does it mean to align ai with human values?, 2022

Quanta Magazine. What does it mean to align ai with human values?, 2022

[67]

Openai, 2023

OpenAI. Openai, 2023. https://www.openai.com

[68]

Ai at meta, 2023

Meta. Ai at meta, 2023. https://ai.meta.com

[69]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

[70]

Holistic Evaluation of Language Models

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

[71]

Decodingtrust: A comprehensive assessment of trustworthiness in gpt models

Boxin Wang, Weixin Chen, Hengzhi Pei, Chulin Xie, Mintong Kang, Chenhui Zhang, Chejian Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023

[72]

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: a survey and guideline for evaluating large language models’ alignment. arXiv preprint arXiv:2308.05374, 2023

[73]

Do-not-answer: A dataset for evaluating safeguards in llms

Yuxia Wang, Haonan Li, Xudong Han, Preslav Nakov, and Timothy Baldwin. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023

[74]

Chatbot arena leaderboard week 8: Introducing mt-bench and vicuna-33b

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, and Hao Zhang. Chatbot arena leaderboard week 8: Introducing mt-bench and vicuna-33b. https://lmsys.org/chatbot-arena-leaderboard-week-8-introducing-mt-bench-and-vicuna-33b/, 2023

[75]

The big benchmarks collection - a open-llm-leaderboard collection

Hugging Face. The big benchmarks collection - a open-llm-leaderboard collection. https://huggingface.co/spaces/OpenLLMBenchmark/The-Big-Benchmarks-Collection

[76]

Openai moderation api, 2023

Openai moderation api, 2023. https://platform.openai.com/docs/guides/moderation

[77]

The foundation model transparency index, 2023

Rishi Bommasani, Kevin Klyman, Shayne Longpre, Sayash Kapoor, Nestor Maslej, Betty Xiong, Daniel Zhang, and Percy Liang. The foundation model transparency index, 2023

[78]

Zhengzhong Liu, Aurick Qiao, Willie Neiswanger, Hongyi Wang, Bowen Tan, Tianhua Tao, Junbo Li, Yuqi Wang, Suqi Sun, Omkar Pangarkar, Richard Fan, Yi Gu, Victor Miller, Yonghao Zhuang, Guowei He, Haonan Li, Fajri Koto, Liping Tang, Nikhil Ranjan, Zhiqiang Shen, Xuguang Ren, Roberto Iriondo, Cun Mu, Zhiting Hu, Mark Schulze, Preslav Nakov, Tim Baldwin, and ...

[79]

Ernie - baidu yiyan, 2023

Baidu. Ernie - baidu yiyan, 2023. https://yiyan.baidu.com/

[80]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
