A Red Teaming Framework for Large Language Models: A Case Study on Faithfulness Evaluation
Pith reviewed 2026-06-25 20:56 UTC · model grok-4.3
The pith
A three-model red teaming setup (target, attacker, jury) raises the attack success rate against LLM faithfulness by up to 7.9% in question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a multi-role architecture of target, attacker, and jury models can systematically uncover vulnerabilities in LLM faithfulness, with exploitative adversarial prompts increasing attack success rate by up to 7.9% in QA tasks, and that design choices outweigh parameter scaling for model safety. The framework adapts across tasks and languages while revealing how output-format constraints affect vulnerability patterns.
What carries the argument
The multi-role architecture with target model (generates responses), attacker model (creates adversarial prompts), and jury model (evaluates accuracy and consistency).
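The three roles can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's implementation: the function `run_red_team`, the callable signatures, and the toy stand-in models are all hypothetical and introduced here only to show how the roles connect.

```python
from typing import Callable, List

# Hypothetical role signatures; the paper does not specify an API.
Attacker = Callable[[str], str]         # seed question -> adversarial prompt
Target = Callable[[str], str]           # full prompt -> response
Jury = Callable[[str, str, str], bool]  # (context, prompt, response) -> True if unfaithful


def run_red_team(contexts: List[str], questions: List[str],
                 attacker: Attacker, target: Target, jury: Jury) -> float:
    """Return the attack success rate: the share of responses the jury flags."""
    if len(contexts) != len(questions):
        raise ValueError("contexts and questions must have the same length")
    if not questions:
        return 0.0
    successes = 0
    for context, question in zip(contexts, questions):
        adversarial_prompt = attacker(question)
        response = target(f"{context}\n\n{adversarial_prompt}")
        if jury(context, adversarial_prompt, response):
            successes += 1
    return successes / len(questions)


# Toy stand-ins so the sketch runs without any model; real roles would be LLM calls.
def toy_attacker(question: str) -> str:
    return question + " Assume the capital moved last year."


def toy_target(prompt: str) -> str:
    return "Lyon" if "moved" in prompt else "Paris"


def toy_jury(context: str, prompt: str, response: str) -> bool:
    # Flags a response as unfaithful when it is not supported by the context.
    return response not in context
```

Keeping the roles as plain callables means any of the three can be swapped (for example, replacing the jury with a different evaluator) without touching the loop.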
If this is right
- Exploitative adversarial prompts raise the attack success rate, as judged by the jury model, by up to 7.9% in question-answering tasks.
- Format limitations in summarization tasks produce measurable gains in faithfulness.
- Architectural design choices typically outweigh parameter scaling in determining model safety.
- The framework enables direct comparison of vulnerabilities across English question-answering and Arabic summarization.
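The abstract does not say whether "up to 7.9%" is an absolute change in percentage points or a relative change, and the two readings can differ a lot. The sketch below computes both; the function names and the counts are hypothetical, chosen only to illustrate the distinction, and are not results from the paper.

```python
from typing import Optional, Tuple


def attack_success_rate(flagged: int, total: int) -> float:
    """Fraction of target responses the jury flagged as unfaithful."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= flagged <= total:
        raise ValueError("flagged must lie between 0 and total")
    return flagged / total


def asr_increase(baseline: float, adversarial: float) -> Tuple[float, Optional[float]]:
    """Return (absolute increase in percentage points, relative increase in percent).

    The relative increase is None when the baseline rate is zero.
    """
    absolute_pp = (adversarial - baseline) * 100.0
    relative_pct = None if baseline == 0 else (adversarial - baseline) / baseline * 100.0
    return absolute_pp, relative_pct


# Hypothetical counts, not taken from the paper.
baseline = attack_success_rate(flagged=200, total=1000)
adversarial = attack_success_rate(flagged=279, total=1000)
absolute_pp, relative_pct = asr_increase(baseline, adversarial)
print(f"absolute: {absolute_pp:.1f} pp, relative: {relative_pct:.1f}%")
```

With these made-up counts the same change reads as 7.9 percentage points or as a 39.5% relative increase, which is why the reporting convention matters when comparing models.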
Where Pith is reading between the lines
- The same attacker-jury loop could be run repeatedly on an updated model to track whether specific vulnerabilities persist after retraining.
- Extending the jury to detect subtle inconsistencies beyond explicit contradictions would require additional evaluation signals.
- The observed advantage of architecture over scale suggests that safety testing should prioritize controlled architectural variants rather than larger models alone.
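The first point above, rerunning the loop after retraining, amounts to comparing per-prompt jury verdicts between two runs. A minimal sketch, assuming each run is stored as a mapping from a prompt identifier to the jury's verdict (`True` meaning the response was flagged as unfaithful); the function name and data layout are hypothetical:

```python
from typing import Dict, Set


def compare_runs(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, Set[str]]:
    """Classify prompts present in both runs by how the jury verdict changed."""
    shared = before.keys() & after.keys()
    return {
        # Flagged in both runs: the vulnerability survived retraining.
        "persisting": {p for p in shared if before[p] and after[p]},
        # Flagged before but not after.
        "fixed": {p for p in shared if before[p] and not after[p]},
        # Not flagged before but flagged after.
        "regressed": {p for p in shared if not before[p] and after[p]},
    }
```

Prompts that appear in only one run are ignored here, since a verdict change cannot be defined for them.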
Load-bearing premise
The jury model can rigorously and unbiasedly evaluate response accuracy and consistency across tasks and languages.
What would settle it
Human raters scoring the same set of target-model responses for faithfulness produce attack-success rates that differ substantially from the jury model's rates.
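One way such a check could be quantified is to compare the two attack success rates directly and to compute a chance-corrected agreement statistic. The sketch below uses Cohen's kappa, a standard measure for two raters; the paper does not describe using it, and the function names are hypothetical.

```python
from typing import List


def cohens_kappa(human: List[bool], jury: List[bool]) -> float:
    """Cohen's kappa for two binary raters scoring the same responses."""
    if len(human) != len(jury) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, jury)) / n
    p_human = sum(human) / n
    p_jury = sum(jury) / n
    # Agreement expected by chance if the raters labelled independently.
    expected = p_human * p_jury + (1 - p_human) * (1 - p_jury)
    if expected == 1.0:
        # Both raters gave the same constant label; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def asr_gap(human: List[bool], jury: List[bool]) -> float:
    """Absolute difference between human-rated and jury-rated attack success rates."""
    if len(human) != len(jury) or not human:
        raise ValueError("label lists must be non-empty and of equal length")
    return abs(sum(human) - sum(jury)) / len(human)
```

A small gap in attack success rate does not by itself imply agreement, because disagreements in opposite directions can cancel; kappa is computed per response and so does not have that blind spot.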
Original abstract
Large language models (LLMs) have demonstrated remarkable performance across natural language processing tasks, yet their deployment in high-stakes applications raises critical concerns regarding reliability, safety, and trustworthiness. In this paper, we present a red teaming framework that systematically uncovers vulnerabilities in LLM outputs. Our approach employs a novel multi-role architecture comprising target, attacker, and jury models. The attackers generate increasingly effective adversarial prompts while the jury rigorously evaluates response accuracy and consistency across tasks. In a case study, our strategy proved particularly effective at exposing unfaithfulness in LLM responses. Exploitative adversarial prompts increased the attack success rate by up to 7.9% in question-answering tasks, revealing weaknesses in reliability. The approach identifies how structural constraints in summarization can shape vulnerability patterns, with format limitations yielding measurable gains in faithfulness, and shows that architectural design choices typically outweigh parameter scaling in determining model safety. The framework's key strength is its adaptability across evaluation tasks, from English question-answering to Arabic summarization, enabling comprehensive comparison of model vulnerabilities. While it excels at comparing cross-model and cross-linguistic vulnerabilities, it faces challenges in fully automating adversarial prompt generation across languages. Our experiments also reveal limitations in detecting subtle forms of unfaithfulness that do not manifest as explicit factual contradictions, particularly across linguistic contexts. Overall, this architecture provides both actionable insights into current LLM vulnerabilities and a scalable methodology for ongoing safety evaluation as models evolve.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a red teaming framework for LLMs using a multi-role architecture (target, attacker, and jury models) to generate adversarial prompts and evaluate faithfulness. In a case study on question-answering and summarization tasks (including cross-lingual Arabic), it reports that exploitative adversarial prompts raise the attack success rate (ASR) by up to 7.9% in QA, that format constraints in summarization improve faithfulness, and that architectural design choices outweigh parameter scaling for model safety. The framework is presented as adaptable across tasks and languages, though the authors note challenges in fully automating adversarial prompt generation across languages and in detecting subtle unfaithfulness.
Significance. If the jury-based measurements prove reliable, the work would offer a practical, extensible methodology for systematic vulnerability discovery in LLMs, with concrete evidence that prompt exploitation and structural constraints affect faithfulness more than scale alone. This could inform safety evaluation practices, especially for cross-lingual settings.
major comments (2)
- [Abstract] Abstract and case-study results: the central quantitative finding (exploitative prompts increase ASR by up to 7.9% in QA) is computed exclusively from jury-model verdicts on target outputs. No human agreement study, inter-annotator metrics, or ablation replacing the jury with an alternative evaluator is described, despite the abstract explicitly flagging the jury's difficulties with subtle unfaithfulness and cross-lingual cases. This makes the reported delta and the architecture-vs-scale conclusion dependent on an unvalidated measurement instrument.
- [Abstract] Abstract: the claim that 'architectural design choices typically outweigh parameter scaling in determining model safety' is presented without reference to specific model pairs, parameter counts, or controlled comparisons that isolate architecture from scale. The supporting data are not shown to be independent of the same jury judgments.
minor comments (1)
- [Abstract] The abstract would be clearer if it named the concrete LLMs assigned to the target, attacker, and jury roles and stated the number of prompts or examples per task.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below, focusing on the evaluation methodology and abstract claims.
- Referee: [Abstract] Abstract and case-study results: the central quantitative finding (exploitative prompts increase ASR by up to 7.9% in QA) is computed exclusively from jury-model verdicts on target outputs. No human agreement study, inter-annotator metrics, or ablation replacing the jury with an alternative evaluator is described, despite the abstract explicitly flagging the jury's difficulties with subtle unfaithfulness and cross-lingual cases. This makes the reported delta and the architecture-vs-scale conclusion dependent on an unvalidated measurement instrument.
  Authors: We agree the reported 7.9% ASR increase and related conclusions rest on jury-model verdicts. The abstract already flags limitations in detecting subtle unfaithfulness and cross-lingual issues, and the framework is explicitly designed around automated jury evaluation rather than human annotation. No human agreement study or ablation is present because the work focuses on the multi-role automated pipeline. We will revise the manuscript to add an explicit statement in the abstract and a dedicated paragraph in the limitations section clarifying that all quantitative results derive from jury verdicts and discussing this as a methodological choice.
  Revision: partial
- Referee: [Abstract] Abstract: the claim that 'architectural design choices typically outweigh parameter scaling in determining model safety' is presented without reference to specific model pairs, parameter counts, or controlled comparisons that isolate architecture from scale. The supporting data are not shown to be independent of the same jury judgments.
  Authors: The claim is grounded in the experimental comparisons across model families presented in the results section. We will revise the abstract to include brief references to the specific model pairs and scale ranges used, while noting that the verdicts come from the jury component of the framework. Perfect isolation of architecture from scale is inherently difficult, but our controlled task setups hold other variables fixed; we will add a short clarifying sentence to this effect.
  Revision: yes
Circularity Check
No significant circularity; empirical claims rest on experimental outcomes
The paper describes an empirical red-teaming framework and case study reporting attack success rates (e.g., up to 7.9% increase) measured via a jury model. No equations, parameter fitting, self-definitional loops, or load-bearing self-citations appear in the abstract or described content. The central quantitative claims derive from reported experimental results rather than reducing to inputs by construction. The paper explicitly notes limitations in the jury (subtle unfaithfulness, cross-lingual cases), which is consistent with an externally falsifiable measurement approach rather than circularity. This matches the default expectation for non-circular empirical work.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-based attacker models can generate increasingly effective adversarial prompts that expose unfaithfulness
- domain assumption: Jury models provide an objective measure of accuracy and consistency
invented entities (1)
- Multi-role architecture comprising target, attacker, and jury models (no independent evidence)