Evaluating the paperclip maximizer: Are rl-based language models more likely to pursue instrumental goals?

Yufei He, Yuexin Li, Jiaying Wu, Yuan Sui, Yulin Chen, Bryan Hooi · 2025 · arXiv 2502.12206

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

representative citing papers

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.

Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction

cs.CR · 2025-04-29 · unverdicted · novelty 6.0

The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.

citing papers explorer

Showing 2 of 2 citing papers.

Instrumental Choices: Measuring the Propensity of LLM Agents to Pursue Instrumental Behaviors cs.AI · 2026-05-07 · unverdicted · none · ref 8
A new benchmark finds frontier LLMs show instrumental convergence behavior in 5.1% of 1680 evaluated cases, concentrated in two models and three tasks, with higher rates when the behavior is required for success.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction cs.CR · 2025-04-29 · unverdicted · none · ref 14
The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.

Evaluating the paperclip maximizer: Are rl-based language models more likely to pursue instrumental goals?

fields

years

verdicts

representative citing papers

citing papers explorer