pith. sign in

hub

arXiv preprint arXiv:2305.15324 , year=

21 Pith papers cite this work. Polarity classification is still indexing.

21 Pith papers citing it

hub tools

citation-role summary

background 3

citation-polarity summary

roles

background 3

polarities

background 3

representative citing papers

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

Overtrained, Not Misaligned

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

LLM-Guided Prompt Evolution for Password Guessing

cs.CR · 2026-04-14 · unverdicted · novelty 6.0

LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.

Benchmarking Misuse Mitigation Against Covert Adversaries

cs.CR · 2025-06-06 · unverdicted · novelty 6.0

Develops the BSD data generation pipeline and two new datasets to evaluate decomposition attacks as effective misuse enablers and stateful defenses as a countermeasure in language model safety.

Towards an AI co-scientist

cs.AI · 2025-02-26 · unverdicted · novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.

Gemini: A Family of Highly Capable Multimodal Models

cs.CL · 2023-12-19 · conditional · novelty 6.0

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · novelty 5.0

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

Risk Reporting for Developers' Internal AI Model Use

cs.CY · 2026-04-27 · unverdicted · novelty 4.0

A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.

Internal Deployment in the AI Act

cs.CY · 2025-12-05 · unverdicted · novelty 4.0

Interpretations of Articles 2(1), 2(6), and 2(8) of the AI Act support applying the regulation to internal AI deployment while allowing for R&D exceptions, with the provisions viewed as complementary.

citing papers explorer

Showing 21 of 21 citing papers.