Evaluating the paperclip maximizer: Are RL-based language models more likely to pursue instrumental goals?

Available online: http://ieeexplore · 2025 · arXiv 2502.12206

3 Pith papers cite this work.

representative citing papers

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

cs.CR · 2025-04-29 · unverdicted · novelty 6.0

The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.

Understanding Large Language Models

cs.CL · 2026-07-01 · unverdicted · novelty 2.0

The paper reviews Transformer architecture, emergent LLM capabilities resembling cognition, explainable AI methods, and argues against both anthropomorphism and overly reductive views of LLM behavior as mere memorization.
