TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
From judgment to interference: Early stopping LLM harmful outputs via streaming content monitoring
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
verdicts
UNVERDICTED 5roles
background 1polarities
background 1representative citing papers
TPCs allow term-by-term progressive polynomial evaluation on LLM activations for flexible safety monitoring that supports both stronger guardrails and low-cost adaptive cascades.
The paper supplies a unified definition based on data flow and dynamic interaction plus a systematic taxonomy to organize fragmented work on streaming large language models.
Simple thresholding on an external verifier signal, calibrated by risk control, performs competitively with sequential hypothesis testing monitors on math reasoning and red-teaming datasets.
AERIC uses a 387-parameter head on LLM hidden states for same-pass anticipatory detection of implicit harm, reporting AUROC gains on DiaSafety and Harmful Advice plus low-latency trigger rates on HarmBench and SocialHarmBench.
citing papers explorer
-
TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.