LLMs exhibit high failure rates on alignment tests for conflict contexts, with some models failing 80-100% on balance requests where legal responsibility is established.
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.
citing papers explorer
- Can AI Make Conflicts Worse? An Alignment Failure in LLM Deployment Across Conflict Contexts
LLMs exhibit high failure rates on alignment tests for conflict contexts, with some models failing 80-100% on balance requests where legal responsibility is established.
- BioTool: A Comprehensive Tool-Calling Dataset for Enhancing Biomedical Capabilities of Large Language Models
BioTool dataset enables fine-tuning a 4B-parameter LLM to outperform GPT-5.1 in biomedical tool calling while improving downstream answer quality per human experts.