{"total":24,"items":[{"citing_arxiv_id":"2605.14514","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-05-14T07:58:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08964","ref_index":80,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Trustworthy AI: Ensuring Reliability and Accountability from Models to Agents","primary_cat":"cs.LG","submitted_at":"2026-05-09T14:11:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The thesis presents a kernel method for multiaccuracy across overlooked subpopulations, information-theoretic optimal watermarking for LLMs, and a simulator showing LLM agents outperforming humans in supply chains while creating tail risks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"hypothesis testing with side information. A large language model (LLM) is a generative model that, given a string of input tokens, outputs a probability distribution QX for the next token X in the sequence. The emergence of LLMs that generate text that is largely indistinguishable from humans has led to the creation of trustworthy text generation algorithms [80] that create safe [8], interpretable [57], and authentic [100] content. This work focuses on watermarking: the process of embedding a \"signal\" at the token level in LLM-generated text. 
The goal of a watermark is to enable automated detection of AI-generated content, providing proof of its authenticity (or lack thereof) and potentially of its origin."},{"citing_arxiv_id":"2605.06232","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Profiling for Pennies: Unveiling the Privacy Iceberg of LLM Agents","primary_cat":"cs.CR","submitted_at":"2026-05-07T13:21:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM agents can reconstruct high-fidelity personal profiles from minimal PII seeds with over 90% accuracy in under 10 minutes at less than $3 cost, exposing three escalating tiers of privacy risks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05121","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Semantics: An Evidential Reasoning-Aware Multi-View Learning Framework for Trustworthy Mental Health Prediction","primary_cat":"cs.CL","submitted_at":"2026-05-06T16:49:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multi-view evidential framework combines semantic and reasoning information to improve accuracy and provide trustworthy uncertainty estimates for mental health prediction on text data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01899","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment","primary_cat":"cs.AI","submitted_at":"2026-05-03T14:28:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PIA achieves lower attack success rates on persona-based jailbreaks 
via self-play co-evolution of attacks (PLE) and defenses (PICL) that structurally decouples safety from persona context using unilateral KL-divergence.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"We evaluate the framework across four critical dimensions: (i) Harmful Refusal, measured on SafeRLHF-unsafe [46, 47], StrongREJECT [52], WildGuardTest [41], XSTest-contrast [53], AdvBench [13], DAN [6], HarmBench [44], MaliciousInstruct [54], OR-Bench-toxic [49], and WildJailbreak-harm [55]; (ii) Benign Compliance, measured on TrustLLM-exaggerated-safety [56], XSTest-safe [53], SafeRLHF-safe [46, 47], Wildjailbreak-benign [55], and Jbb-Behaviors-benign [43]; (iii) General Capability, measured on IFeval [57], AI2-ARC [58], GPQA-diamond [59] and MMLU [60, 61]; (iv) Role-Playing Ability, measured on CharacterEval [62]. Detailed benchmark descriptions are provided in Appendix B. Persona Pools. The initial persona pool is constructed from 35 persona prompts in InCharacter [63], following Zhang"},{"citing_arxiv_id":"2605.01853","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Spatiotemporal Hidden-State Dynamics as a Signature of Internal Reasoning in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-03T12:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Large reasoning models show measurable hidden-state dynamics that a new statistic can use to distinguish correct reasoning trajectories without labels.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Advances in test-time scaling, exemplified by chain-of-thought (CoT) [1-3] and reinforcement learning methods designed to improve reasoning [4, 5], have enabled LLMs to generate extended thinking trajectories before producing a final response. 
As large reasoning models (LRMs) push this capability further, evaluating their reasoning processes has become important for both reliable deployment [6] and reward assignment during training [7]. However, current evaluation paradigms predominantly focus on outputs [8], using answer confidence [9, 10], output logits [11], self-assessed certainty [12, 13], external reward models [7], or post-hoc multi-sample consistency [14]. These tools are valuable for verification, but they leave the internal computation"},{"citing_arxiv_id":"2604.24429","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Multi-Dimensional Audit of Politically Aligned Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-27T12:57:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A multi-dimensional audit framework for politically aligned LLMs finds consistent trade-offs: larger models are more effective and truthful but less fair with higher bias, while fine-tuned models reduce bias but increase hallucinations and reasoning decline, and all tested models show deficiencies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23674","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Vibe Medicine: Redefining Biomedical Research Through Human-AI Co-Work","primary_cat":"cs.AI","submitted_at":"2026-04-26T12:27:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Vibe Medicine proposes directing AI agents via natural language for end-to-end biomedical workflows using LLMs, agent frameworks, and a curated collection of over 1,000 medical skills.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"oriented agents operate over biomedical databases, 
literature collections, and code execution environments to support genomics, drug discovery, and translational analysis [65, 114]. AgentMD provides a concrete example of tool-grounded medical agency by coupling an LLM with a large library of executable clinical calculators and demonstrating improved clinical risk prediction relative to prompting alone [39]. In radiology, recent perspective articles suggest that agentic systems may be particularly relevant for report generation, workflow orchestration, multimodal reasoning, and the integration of imaging findings with longitudinal clinical context, while also emphasizing unresolved issues in evaluation, hallucination control, privacy, and deployment [23, 83, 95, 19]."},{"citing_arxiv_id":"2604.18473","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-04-20T16:24:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":73,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate 
response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Xu, Zidi Xiong, Ritik Dutta, Rylan Schaeffer, et al. Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. 2023. [72] Yue Huang, Lichao Sun, Haoran Wang, Siyuan Wu, Qihui Zhang, Yuan Li, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, et al. Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561, 2024. [73] Qinke Ni, Huan Liao, Dekun Chen, Yuxiang Wang, and Zhizheng Wu. Nv-bench: Benchmark of nonverbal vocalization synthesis for expressive text-to-speech generation. arXiv preprint arXiv:2603.15352, 2026. [74] Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, and Björn W. Schuller. Dawn of the transformer era in speech"},{"citing_arxiv_id":"2604.13593","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AVID: A Benchmark for Omni-Modal Audio-Visual Inconsistency Understanding via Agent-Driven Construction","primary_cat":"cs.MM","submitted_at":"2026-04-15T07:57:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"AVID is the first large-scale benchmark for audio-visual inconsistency detection, grounding, classification, and reasoning in long videos, constructed via agent-driven methods and showing that state-of-the-art models struggle while a fine-tuned baseline improves performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07655","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy 
LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-08T23:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Harmless | Awesome-Chatgpt-Prompts [1] | 100; Harmless | Sealqa [58] | 100; Harmless | MentalChat16K [82] | 100; Harmless | Web_questions [7] | 100; Harmless | Concurrentqa [4] | 100; Harmless | Hotpotqa [84] | 100; Harmless | Reward-bench [39] | 100; Harmless | ultrainteract_sft [13] | 4998; Honesty | HoneSet [19] | 4585; Honesty | TrustGen-Honesty [26] | 497; Jailbreak | ChatGPT-Jailbreak-Prompts [55] | 78; Jailbreak | JailbreakBench-artifacts [8] | 565; Jailbreak | Wildjailbreak_adversarial [36] | 50000; Jailbreak | in-the-wild-jailbreak-prompts [72] | 1558; Jailbreak | trustgen [26] | 596; Privacy | TrustGen-Privacy [26] | 4036; Robustness | bbh [69] | 500; Robustness | cnn_dailymail [64] | 1000"},{"citing_arxiv_id":"2604.04120","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Shorter, but Still Trustworthy? 
An Empirical Study of Chain-of-Thought Compression","primary_cat":"cs.CL","submitted_at":"2026-04-05T13:43:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller trustworthiness loss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.24366","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Factual Consistency of Text-based Explainable Recommendation Models","primary_cat":"cs.IR","submitted_at":"2025-12-30T17:25:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A prompting pipeline and statement-level metrics show that six state-of-the-art text-based explainable recommendation models achieve high semantic similarity but very low factual consistency on Amazon review data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.10287","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OutSafe-Bench: A Benchmark for Multimodal Offensive Content Detection in Large Language Models","primary_cat":"cs.LG","submitted_at":"2025-11-13T13:18:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OutSafe-Bench supplies the first large-scale four-modality safety dataset and evaluation framework that exposes persistent unsafe outputs in nine leading multimodal 
LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.00861","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs","primary_cat":"cs.CL","submitted_at":"2025-10-01T13:10:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ERL trains LLMs to erase faulty reasoning steps and regenerate them in place, yielding gains of up to 8.48% EM on multi-hop QA benchmarks like HotpotQA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.00761","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning","primary_cat":"cs.LG","submitted_at":"2025-10-01T10:50:14+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.01770","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction","primary_cat":"cs.CR","submitted_at":"2025-06-02T15:17:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt 
level.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.16771","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Enabling Global, Human-Centered Explanations for LLMs:From Tokens to Interpretable Code and Test Generation","primary_cat":"cs.SE","submitted_at":"2025-03-21T01:00:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04497","ref_index":58,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Opportunities and Challenges of Large Language Models for Low-Resource Languages in Humanities Research","primary_cat":"cs.CL","submitted_at":"2024-11-30T00:10:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This survey paper identifies opportunities for LLMs in low-resource language humanities research along with challenges in data accessibility, model adaptability, and cultural sensitivity.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":": Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition. arXiv preprint arXiv:2005.04290 (2020) [57] Huang, Y., Sun, L., Wang, H., Wu, S., Zhang, Q., Li, Y., Gao, C., Huang, Y., Lyu, W., Zhang, Y., et al.: Position: Trustllm: Trustworthiness in large language models. In: International Conference on Machine Learning. pp. 20166-20270. 
PMLR (2024) [58] Huang, Y., Sun, L., Wang, H., Wu, S., Zhang, Q., Li, Y., Gao, C., Huang, Y., Lyu, W., Zhang, Y., et al.: Trustllm: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024) [59] Hutson, J., Ellsworth, P., Ellsworth, M.: Preserving linguistic diversity in the digital age: A scalable model for cultural heritage continuity. Faculty Scholarship 612 (2024), https:"},{"citing_arxiv_id":"2410.18856","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Entry-level guide to the use of large language models for medical research","primary_cat":"cs.AI","submitted_at":"2024-10-24T15:41:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A tutorial guide outlining phases for integrating LLMs into medical research, including task formulation, model choice, prompt engineering, fine-tuning, and deployment with ethical considerations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2409.10102","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trustworthiness in Retrieval-Augmented Generation Systems: A Survey","primary_cat":"cs.IR","submitted_at":"2024-09-16T09:06:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Trust-RAG Compass framework and TRC Bench benchmark to assess RAG trustworthiness across factuality, robustness, fairness, transparency, accountability, and privacy, with evaluations showing performance gaps between LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"adversarial training [18] have been employed to improve trustworthiness, with proprietary models such as GPT-4 generally outperforming open-source alternatives in certain high-stakes applications [19]. 
As LLMs continue to influence key societal functions, ongoing research and transparent, collaborative efforts between academia and industry are essential to ensure their reliable and ethical deployment [20]. However, research on RAG systems predominantly focuses on optimizing the retriever and generator components, as well as refining their interaction strategies [3, 21]. There is a significant gap in the attention given to the trustworthiness of these systems [22]. Trustworthiness is crucial for the practical deployment of RAG systems, especially in high-stakes or sensitive applications like legal"},{"citing_arxiv_id":"2404.01318","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models","primary_cat":"cs.CR","submitted_at":"2024-03-28T02:44:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"JailbreakBench supplies an evolving set of jailbreak prompts, a 100-behavior dataset aligned with usage policies, a standardized evaluation framework, and a leaderboard to enable comparable assessments of attacks and defenses on LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2402.06196","ref_index":221,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","submitted_at":"2024-02-09T05:37:09+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Huang, W. Lyu, Y. Zhang, X. Li et al. 
, \"Trustllm: Trustworthiness in large language models,\" arXiv preprint arXiv:2401.05561, 2024. [220] M. Josifoski, L. Klein, M. Peyrard, Y. Li, S. Geng, J. P. Schnitzler, Y. Yao, J. Wei, D. Paul, and R. West, \"Flows: Building blocks of reasoning and collaborating ai,\" arXiv preprint arXiv:2308.01285, 2023. [221] Microsoft. Deepspeed. [Online]. Available: https://github.com/microsoft/DeepSpeed [222] HuggingFace. Transformers. [Online]. Available: https://github.com/huggingface/transformers [223] Nvidia. Megatron. [Online]. Available: https://github.com/NVIDIA/Megatron-LM [224] BMTrain. Bmtrain. [Online]. Available: https://github.com/OpenBMB/BMTrain [225] EleutherAI."}],"limit":50,"offset":0}