hub

arXiv preprint arXiv:2304.10436 , year =

Sun, H · 2022 · arXiv 2304.10436

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 baseline 1 dataset 1

citation-polarity summary

background 1 baseline 1 use dataset 1

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.

AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian

cs.CL · 2026-05-26 · unverdicted · novelty 7.0

Releases the first public safety evaluation dataset for Albanian LLMs with 2,951 prompts spanning 11 categories including self-harm, violence, and radicalization.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

Culturally-adapted red-teaming prompts raise ASR by a mean of 9.3 pp over direct translations across 16 language-model pairs in four Asian languages, with DT scoring mean cultural depth of 0.17 versus up to 2.51 for CA.

To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.

OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models

cs.LG · 2025-11-13 · unverdicted · novelty 6.0

OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

cs.CL · 2024-06-26 · conditional · novelty 6.0

WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

cs.AI · 2023-09-19 · unverdicted · novelty 6.0

GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data

cs.AI · 2026-05-23 · unverdicted · novelty 4.0

JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.

TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts

cs.CR · 2026-04-17 · unverdicted · novelty 4.0

TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.

Jailbreak Attacks and Defenses Against Large Language Models: A Survey

cs.CR · 2024-07-05 · accept · novelty 4.0

A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

cs.CL · 2024-01-05 · unverdicted · novelty 4.0

DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

citing papers explorer

Showing 16 of 16 citing papers.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 11
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
Beyond English and Evasion: A Human-Annotated Multi-Domain Benchmark for High-Stakes LLM Safety Evaluation in Chinese cs.CL · 2026-05-28 · unverdicted · none · ref 9
Introduces ChiSafe-PAS, a 1,897-prompt human-annotated Chinese adversarial benchmark for LLM safety with 3-class labels, 9-category obfuscation taxonomy, and domain coverage in self-harm, drugs, fraud, and satire.
AlbanianLLMSafety: A Safety Evaluation Dataset for Large Language Models in Albanian cs.CL · 2026-05-26 · unverdicted · none · ref 2
Releases the first public safety evaluation dataset for Albanian LLMs with 2,951 prompts spanning 11 categories including self-harm, violence, and radicalization.
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model cs.CL · 2024-05-07 · unverdicted · none · ref 141
DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.
Culturally-Adapted Red-Teaming Across East and Southeast Asian Contexts: A Methodological and Comparative Analysis cs.CL · 2026-06-08 · unverdicted · none · ref 12
Culturally-adapted red-teaming prompts raise ASR by a mean of 9.3 pp over direct translations across 16 language-model pairs in four Asian languages, with DT scoring mean cultural depth of 0.17 versus up to 2.51 for CA.
To Lie or Not to Lie? Investigating The Biased Spread of Global Lies by LLMs cs.CL · 2026-04-08 · unverdicted · none · ref 32
LLMs propagate misinformation more in lower-resource languages and lower-HDI countries, with input safety classifiers and retrieval-augmented fact-checking showing cross-lingual and regional gaps.
OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models cs.LG · 2025-11-13 · unverdicted · none · ref 50
OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal LLMs.
Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning cs.AI · 2025-07-01 · conditional · none · ref 226
Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.
WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs cs.CL · 2024-06-26 · conditional · none · ref 31
WildGuard is a new open moderation model and dataset for LLM safety that identifies harmful prompts, risky responses, and refusal rates, achieving SOTA open-source performance and sometimes exceeding GPT-4 while cutting jailbreak success from 79.8% to 2.4%.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts cs.AI · 2023-09-19 · unverdicted · none · ref 54
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting cs.CL · 2026-05-21 · unverdicted · none · ref 23
CITA generates Chinese implicit toxicity samples that cause 69.48% average missed detection across seven tested detectors while preserving harmfulness, and the same data improves robustness when used to fine-tune a CITD defense model.
TrustLLM: Trustworthiness in Large Language Models cs.CL · 2024-01-10 · unverdicted · none · ref 187
TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data cs.AI · 2026-05-23 · unverdicted · none · ref 13
JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.
TWGuard: A Case Study of LLM Safety Guardrails for Localized Linguistic Contexts cs.CR · 2026-04-17 · unverdicted · none · ref 17
TWGuard achieves +0.289 F1 improvement and 94.9% false-positive reduction for LLM safety guardrails in the Taiwan linguistic context compared to foundation models and baselines.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey cs.CR · 2024-07-05 · accept · none · ref 86
A survey that creates taxonomies for jailbreak attacks and defenses on LLMs, subdivides them into sub-classes, and compares evaluation approaches.
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism cs.CL · 2024-01-05 · unverdicted · none · ref 146
DeepSeek LLM 67B exceeds LLaMA-2 70B on code, mathematics and reasoning benchmarks after pre-training on 2 trillion tokens and alignment via SFT and DPO.

arXiv preprint arXiv:2304.10436 , year =

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer