Title resolution pending

Mantas Mazeika, Xuwang Yin, Rishub Tamirisa, Jaehyuk Lim, Bruce W · 2025 · arXiv 2502.08640

7 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

support 1

representative citing papers

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

Can Revealed Preferences Clarify LLM Alignment and Steering?

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

LLMs show partial internal coherence in medical decisions but frequently fail to accurately report their preferences or adopt user-directed ones via prompting.

Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories

cs.CR · 2026-04-19 · unverdicted · novelty 6.0

Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.

Reducing Political Manipulation with Consistency Training

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

Introduces Political Consistency Training (PCT) with sentiment and helpfulness consistency objectives to reduce covert political bias in LLMs while preserving helpfulness.

Some[Body] Must Receive That Pain for Agent Accountability

cs.CY · 2026-05-16 · unverdicted · novelty 5.0

AI agents lack the persistent identity and feedback mechanisms needed for consequence reception, requiring new architectures or continued human accountability.

FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism

cs.CY · 2026-04-23 · unverdicted · novelty 4.0

AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms redistribute authority.

Inertia in Moral and Value Judgments of Large Language Models

cs.CL · 2024-08-16 · unverdicted · novelty 4.0

LLMs exhibit persistent inertia in value orientations, with harm avoidance and fairness remaining skewed across persona prompts.

citing papers explorer

Showing 7 of 7 citing papers.

Probing Persona-Dependent Preferences in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 25
Can Revealed Preferences Clarify LLM Alignment and Steering? · cs.LG · 2026-05-08 · unverdicted · none · ref 9
Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories · cs.CR · 2026-04-19 · unverdicted · none · ref 16
Reducing Political Manipulation with Consistency Training · cs.CL · 2026-05-21 · unverdicted · none · ref 18
Some[Body] Must Receive That Pain for Agent Accountability · cs.CY · 2026-05-16 · unverdicted · none · ref 92
FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism · cs.CY · 2026-04-23 · unverdicted · none · ref 128
Inertia in Moral and Value Judgments of Large Language Models · cs.CL · 2024-08-16 · unverdicted · none · ref 32
