Should ChatGPT Be Biased? Challenges and Risks of Bias in Large Language Models
6 Pith papers cite this work.
citing papers explorer
- Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
  GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
- Beyond Static Benchmarks: Synthesizing Harmful Content via Persona-based Simulation for Robust Evaluation
  A two-dimensional persona simulation framework generates harmful content that is more challenging to detect and comparably diverse to human-curated datasets for robust evaluation of detection systems.
- Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs
  Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.
- Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
  Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.
- Data-Centric Foundation Models in Computational Healthcare: A Survey
  The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.
- Social and Ethical Risks Posed by General-Purpose LLMs for Settling Newcomers in Canada
  The paper identifies social and ethical risks from unguarded use of general-purpose LLMs in Canadian newcomer settlement and advocates for AI literacy programs plus customized models with human oversight.