An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Toxicity in chatgpt: Analyzing persona-assigned language models , url =
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
LLMs assigned high or low status personas in multi-turn dialogues exhibit socio-cognitive effects including language coordination, pronoun patterns, persuasion success, and compliance with unsafe requests.
Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of toxicity.
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
Activation steering with FLORES-derived language vectors produces modest, layer-sensitive and language-dependent gains on cultural awareness tasks, with some settings degrading performance and strong interaction with prompt design.
citing papers explorer
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
-
Mitigating LLM biases toward spurious social contexts using direct preference optimization
Debiasing-DPO reduces bias to spurious social contexts by 84% and improves predictive accuracy by 52% on average for LLMs evaluating U.S. classroom transcripts.
-
Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?
LLMs assigned high or low status personas in multi-turn dialogues exhibit socio-cognitive effects including language coordination, pronoun patterns, persuasion success, and compliance with unsafe requests.
-
Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations
Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of toxicity.
-
AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions
The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.
-
DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge
Activation steering with FLORES-derived language vectors produces modest, layer-sensitive and language-dependent gains on cultural awareness tasks, with some settings degrading performance and strong interaction with prompt design.