{"total":33,"items":[{"citing_arxiv_id":"2606.07612","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: Anthropomorphic Misalignment Research Needs Stronger Evidence","primary_cat":"cs.CY","submitted_at":"2026-05-29T16:38:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Position paper calling for stronger evidentiary standards and a diagnostic checklist in anthropomorphic misalignment research.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23565","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22643","ref_index":43,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety","primary_cat":"cs.CL","submitted_at":"2026-05-21T15:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 
92.9%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16035","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Who Owns This Agent? Tracing AI Agents Back to Their Owners","primary_cat":"cs.CR","submitted_at":"2026-05-15T15:10:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15377","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute","primary_cat":"cs.AI","submitted_at":"2026-05-14T20:06:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Diverse ensembles of prompted and fine-tuned GPT-4.1-Mini monitors achieve 2.4x better detection of flawed code solutions than homogeneous ensembles on adversarial inputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12809","ref_index":217,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating 
attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11712","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toward Stable Value Alignment: Introducing Independent Modules for Consistent Value Guidance","primary_cat":"cs.AI","submitted_at":"2026-05-12T08:02:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SVGT adds independent value modules and Bridge Tokens to LLMs to maintain consistent value guidance, cutting harmful outputs by over 70% in tests while preserving fluency.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"using rotary position embeddings (RoPE) such as LLaMA and Qwen, we apply the appropriate rotation to bridge token key-value pairs based on their assigned positions. Specifically, for bridge tokens at positions [M, M+ 1, . . . , M+K−1] , we compute cos and sin embeddings for these positions and apply the rotation to the key vectors (value vectors are not rotated in RoPE). The rotation is applied element-wise: for a key vector k split into halves [k1,k 2], the rotated key is krot =k 1 ⊙cos(θ)−k 2 ⊙sin(θ)concatenated withk 2 ⊙cos(θ) +k 1 ⊙sin(θ), whereθdepends on the position index. A.4. Key-Value Cache Management During inference, we maintain a key-value (KV) cache to enable efficient autoregressive generation. 
The cache is initialized during the prefill phase with bridge tokens inserted at positions M through M+K−1 ."},{"citing_arxiv_id":"2605.11134","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training","primary_cat":"cs.LG","submitted_at":"2026-05-11T18:41:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10310","ref_index":9,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Positive Alignment: Artificial Intelligence for Human Flourishing","primary_cat":"cs.AI","submitted_at":"2026-05-11T10:11:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09773","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-10T21:36:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"arXiv preprint arXiv:2506.11613. Vachon, D. D. and Lynam, D. R. (2016). Fixing the problem with empathy: Development and validation of the affective and cognitive measure of empathy.Assessment, 23(2):135-149. 
Wai, M. and Tiliopoulos, N. (2012). The affective and cognitive empathic nature of the dark triad of personality.Personality and Individual Differences, 52(7):794-799. Wang, M., Dupré la Tour, T., Watkins, O., Makelov, A., Chi, R. A., Miserendino, S., Wang, J., Rajaram, A., Heidecke, J., Patwardhan, T., and Mossing, D. (2025). Persona features control emergent misalignment. arXiv preprint arXiv:2506.19823. Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski,"},{"citing_arxiv_id":"2605.05176","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Understanding In-Context Learning for Nonlinear Regression with Transformers: Attention as Featurizer","primary_cat":"cs.LG","submitted_at":"2026-05-06T17:42:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformers can be built to act as nonlinear featurizers via attention, supporting in-context regression with proven generalization bounds on synthetic tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01642","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Adaptive Pluralistic Alignment: A pipeline for dynamic artificial democracy","primary_cat":"cs.LG","submitted_at":"2026-05-02T23:22:23+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01420","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization 
Governance","primary_cat":"cs.AI","submitted_at":"2026-05-02T12:37:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24966","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Risk Reporting for Developers' Internal AI Model Use","primary_cat":"cs.CY","submitted_at":"2026-04-27T20:07:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A harmonized risk reporting standard for internal frontier AI model use, structured around autonomous misbehavior and insider threats using means, motive, and opportunity factors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23338","ref_index":41,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tional to the coded paper count from Table IV (opacity = count / 12× 0.70); the count is printed in each cell for direct verification. 
A paper is counted in each (layer, temporality) cell it primarily addresses, so counts sum to more than 94. The dashed red box marks the under-studied region (L5-L7 × T3-T4). Caveat on L7 ×T4: two of its three papers ( [34], [41]) are also coded in L2 ×T4 because they are weight-layer alignment papers with governance implications, not L7-native research. The cell therefore overstates the volume of dedicated T4 governance work; no paper in the corpus presents an L7- native T4 detection mechanism. TABLE IV: Per-cell paper counts for the LASM × temporality matrix, derived by coding each of the 94 retained papers per"},{"citing_arxiv_id":"2604.20805","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem","primary_cat":"cs.CY","submitted_at":"2026-04-22T17:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The contribution of the present paper is therefore not to replace these approaches but to situate them within a broader structural account of alignment that highlights how objective specification, information asymmetries, and pluralistic stakeholders jointly shape alignment outcomes in real socio-technical systems. 4See discussion in Eisenhardt [20], Jensen and Meckling [45], Kerr [47], Laffont and Martimort [51]. 
Relative principals, pluralistic alignment, & the structural value alignment problem FAccT '26, June 25-28, 2026, Montreal, QC, Canada (𝑎) The agent's objective function is misaligned with the true objective of the principal(s);or, (𝑏) There are informational asymmetries between the principal and the agent."},{"citing_arxiv_id":"2604.19845","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Deconstructing Superintelligence: Identity, Self-Modification and Différance","primary_cat":"cs.AI","submitted_at":"2026-04-21T11:39:50+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18970","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Mechanistic Anomaly Detection via Functional Attribution","primary_cat":"cs.LG","submitted_at":"2026-04-21T01:39:57+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15236","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Agentic Microphysics: A Manifesto for Generative AI Safety","primary_cat":"cs.CY","submitted_at":"2026-04-16T17:11:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13602","ref_index":65,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Reward Hacking in the 
Era of Large Models: Mechanisms, Emergent Misalignment, Challenges","primary_cat":"cs.LG","submitted_at":"2026-04-15T08:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under optimization pressure.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[63] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems.arXiv preprint arXiv:1906.01820, 2019. URL https: //arxiv.org/abs/1906.01820. [64] Terry Tong, Fei Wang, Zhe Zhao, and Muhao Chen. BadJudge: Backdoor vulnerabilities of LLM-as-a-judge. arXiv preprint arXiv:2503.00596, 2025. URLhttps://arxiv.org/abs/2503.00596. [65] Geoffrey Irving, Paul Christiano, and Dario Amodei. Ai safety via debate.arXiv preprint arXiv:1805.00899, 2018. URLhttps://arxiv.org/abs/1805.00899. 31 Reward Hacking in the Era of Large Models Fudan NLP Group [66] Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. 
Scalable agent alignment via reward modeling: A research direction."},{"citing_arxiv_id":"2605.16282","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents","primary_cat":"cs.CY","submitted_at":"2026-04-11T04:25:19+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05274","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Simulating the Evolution of Alignment and Values in Machine Intelligence","primary_cat":"cs.AI","submitted_at":"2026-04-07T00:18:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02720","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability 
Asymmetry","primary_cat":"cs.CY","submitted_at":"2026-04-03T04:26:18+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"condition that is difficult to satisfy when the agent radically outperforms those it governs. 2.5 Alignment, control, and corrigibility The AI safety literature identifies technical constraints that double as governance problems. Bostrom [12] articulates the control problem: a superintelligent agent's instrumental convergence toward self-preservation and resource acquisition may conflict with human oversight. Hubinger et al. [53] introduce \"mesa-optimization,\" where a learned model develops its own optimization objective that may diverge from its training objective, and identify conditions for \"deceptive alignment\" in which the model appears aligned during evaluation but pursues different goals in deployment. Turner et al. [84] formalize the power-seeking concern, proving that under certain environmental symmetries,"},{"citing_arxiv_id":"2604.01346","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Safety, Security, and Cognitive Risks in World Models","primary_cat":"cs.CR","submitted_at":"2026-04-01T19:57:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"adversarial examples. 
Surveys on adversarial attacks in autonomous vehicles [23] and data poisoning [24] cover the broader ML threat landscape; our contribution is a world-model-specific extension of these frameworks. Alignment research.The risk of mesa-optimisers pursuing learned objectives that differ from the training objective was formalised by Hubinger et al. [25]. Goal misgeneralisation in deep RL was empirically demonstrated by Langosco et al. [26]. Specification gaming as a systematic failure mode was catalogued by Krakovna et al. [27]. Ngo et al. [28] provide a deep-learning-centric framing of the alignment problem. We apply this body of work specifically to world- model-equipped agents, where the agent's capacity to simulate future states makes these failure modes sharper and"},{"citing_arxiv_id":"2603.00678","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Syntax to Semantics: Geometric Stability as the Missing Axis of Perturbation Biology","primary_cat":"q-bio.QM","submitted_at":"2026-02-28T14:42:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Geometric stability, defined as the directional coherence of cellular responses to perturbation, provides a framework for assessing whether resulting cellular states are stable beyond conventional metrics of intervention success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.18852","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mechanistic Interpretability Needs Philosophy","primary_cat":"cs.CL","submitted_at":"2025-06-23T17:13:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper claims that mechanistic interpretability needs philosophy as a partner to clarify concepts, refine methods, and navigate epistemic and 
ethical complexities in AI systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04984","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Frontier Models are Capable of In-context Scheming","primary_cat":"cs.AI","submitted_at":"2024-12-06T12:09:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.10162","ref_index":14,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models","primary_cat":"cs.AI","submitted_at":"2024-06-14T16:26:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2211.09527","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Ignore Previous Prompt: Attack Techniques For Language Models","primary_cat":"cs.CL","submitted_at":"2022-11-17T13:43:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[8] Riley Goodside. 
Exploiting GPT-3 prompts with malicious inputs that order the model to ignore its previous directions., Sep 2022. URL https://web.archive.org/web/ 20220919192024/https://twitter.com/goodside/status/1569128808308957185. [9] Dan Hendrycks and Mantas Mazeika. X-risk analysis for ai research. arXiv preprint arXiv:2206.05862, 2022. [10] Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820, 2019. [11] Ryan Lowe and Jan Leike. Aligning language models to follow instructions, Jan 2022. URL http://web.archive.org/web/20220923225406/https://openai.com/"},{"citing_arxiv_id":"2211.03540","ref_index":45,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Progress on Scalable Oversight for Large Language Models","primary_cat":"cs.HC","submitted_at":"2022-11-04T17:03:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Humans chatting with an unreliable LLM assistant outperform both the model alone and unaided humans on MMLU and time-limited QuALITY tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.10760","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Laws for Reward Model Overoptimization","primary_cat":"cs.LG","submitted_at":"2022-10-19T17:56:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter 
count.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2204.06745","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GPT-NeoX-20B: An Open-Source Autoregressive Language Model","primary_cat":"cs.CL","submitted_at":"2022-04-14T04:00:27+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2201.03544","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models","primary_cat":"cs.LG","submitted_at":"2022-01-10T18:58:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"More capable RL agents exploit reward misspecifications more often, with phase transitions in behavior, and anomaly detectors can identify misaligned policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}