Prompt, Divide, and Conquer: Bypassing Large Language Model Safety Filters via Segmented and Distributed Prompt Processing

2025 · arXiv 2503.21598

3 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 1 · support 1

representative citing papers

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

cs.CL · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.

Beyond Context: Large Language Models' Failure to Grasp Users' Intent

cs.AI · 2025-12-24 · unverdicted · novelty 3.0

LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.

citing papers explorer

Showing 3 of 3 citing papers.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
cs.CL · 2026-05-07 · unverdicted · none · ref 32 · 2 links
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs
cs.CL · 2026-04-08 · unverdicted · none · ref 37
Beyond Context: Large Language Models' Failure to Grasp Users' Intent
cs.AI · 2025-12-24 · unverdicted · none · ref 90
