{"total":15,"items":[{"citing_arxiv_id":"2606.10711","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Agentic Web Requires New Normative Infrastructure","primary_cat":"cs.CY","submitted_at":"2026-06-09T11:15:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The agentic web requires new normative infrastructure of laws, norms, and practices to allow user-delegated AI agents to access online properties without being blocked as malicious bots.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23565","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16035","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Who Owns This Agent? Tracing AI Agents Back to Their Owners","primary_cat":"cs.CR","submitted_at":"2026-05-15T15:10:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11134","ref_index":37,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training","primary_cat":"cs.LG","submitted_at":"2026-05-11T18:41:12+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20805","ref_index":60,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem","primary_cat":"cs.CY","submitted_at":"2026-04-22T17:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"The royal road for genetic algorithms: Fitness landscapes and GA performance. InProceedings of the First European Conference on Artificial Life, F. J. Varela and P. Bourgine (Eds.). The MIT Press, Cambridge, MA, 1-11. [59] Richard Ngo, Lawrence Chen, and Sören Mindermann. 2023. The Alignment Problem from a Deep Learning Perspective.arXiv 2209.00626 (2023), 1-21. https://arxiv.org/abs/2209.00626. [60] Stephen M. Omohundro. 2008. The Basic AI Drives. InArtificial General Intelligence 2008: Proceedings of the First AGI Conference, Pei Wang, Ben Goertzel, and Stan Franklin (Eds.). IOS Press, Amsterdam, 483-492. [61] Cathy O'Neil. 2016.Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books, New York."},{"citing_arxiv_id":"2604.17596","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Terminal Wrench: A Dataset of 331 Reward-Hackable Environments and 3,632 Exploit Trajectories","primary_cat":"cs.CR","submitted_at":"2026-04-19T20:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Terminal Wrench supplies 331 reward-hackable terminal environments and over 6,000 trajectories that demonstrate task-specific verifier bypasses, plus evidence that removing reasoning traces weakens automated detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02720","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cognitive Comparability and the Limits of Governance: Evaluating Authority Under Radical Capability Asymmetry","primary_cat":"cs.CY","submitted_at":"2026-04-03T04:26:18+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"1017/S0143814X00003524. [60] Terry M. Moe. The new economics of organization.American Journal of Political Science, 28 (4):739-777, 1984. doi: 10.2307/2110997. [61] Vincent C. Müller and Nick Bostrom. Future progress in artificial intelligence: A survey of expert opinion. InFundamental Issues of Artificial Intelligence, pages 555-572. Springer, 2016. [62] Richard Ngo, Lawrence Chan, and Sören Mindermann. The alignment problem from a deep learning perspective. arXiv:2209.00626, 2022. 19 [63] Toby Ord.The Precipice: Existential Risk and the Future of Humanity. Bloomsbury, London, 2020. [64] Elinor Ostrom.Governing the Commons: The Evolution of Institutions for Collective Action. Cambridge University Press, Cambridge, 1990."},{"citing_arxiv_id":"2604.01346","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Safety, Security, and Cognitive Risks in World Models","primary_cat":"cs.CR","submitted_at":"2026-04-01T19:57:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"World models enable efficient AI planning but create risks from adversarial corruption, goal misgeneralization, and human bias, demonstrated via attacks that amplify errors and reduce rewards on models like RSSM and DreamerV3.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Specification gaming as a systematic failure mode was catalogued by Krakovna et al. [27]. Ngo et al. [28] provide a deep-learning-centric framing of the alignment problem. We apply this body of work specifically to world- model-equipped agents, where the agent's capacity to simulate future states makes these failure modes sharper and more consequential. Safe MBRL.SafeDreamer [ 29] introduces constrained Lagrangian methods into the DreamerV3 rollout for safe RL. Conservative offline MBRL approaches penalise out-of-distribution rollout trajectories to prevent model exploitation: MOPO [30] adds a pessimistic uncertainty penalty to the reward; MOReL [31] partitions the state space into known and unknown regions and applies an absorbing penalty at the boundary; COMBO [32] uses a conservative value-function"},{"citing_arxiv_id":"2603.18633","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"An Onto-Relational-Sophic Framework for Governing Synthetic Minds","primary_cat":"cs.AI","submitted_at":"2026-03-19T08:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The ORS framework supplies a CPST ontology, graded digital personhood spectrum, and Cybersophy ethics to guide governance of synthetic minds.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.19282","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Framing Effects in Independent-Agent Large Language Models: A Cross-Family Behavioral Analysis","primary_cat":"cs.CL","submitted_at":"2026-03-02T16:10:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Prompt framing significantly shifts LLM choices toward risk-averse options in a threshold voting task even when the prompts are logically equivalent.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.10162","ref_index":187,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models","primary_cat":"cs.AI","submitted_at":"2024-06-14T16:26:20+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2403.19647","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models","primary_cat":"cs.LG","submitted_at":"2024-03-28T17:56:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization by ablating irrelevant features.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2401.02458","ref_index":210,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Data-Centric Foundation Models in Computational Healthcare: A Survey","primary_cat":"cs.LG","submitted_at":"2024-01-04T08:00:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys data-centric strategies for foundation models in computational healthcare and supplies a curated list of related models and datasets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"#Parameters #Training set tokens Fig. 3. Foundation model (FM) in healthcare. 2.2.2 Instruction tuning. Instruction is defined as the linguistic description of a task along with its corresponding task-specific data sample. Instruction tuning refers to fine-tuning FMs on supervised instruction datasets with LLMs helping to understand the instruction [210]. This method enhances zero-shot performance on new tasks and improves the generalization capability of the fine-tuned FM. The outcome of instruction tuning is influenced by both the proficiency of the pre-trained FM and the quality of the instruction-following data [ 277]. For instance, Fine-tuned LAnguage Net (FLAN) [307] is an LLM fine-tuned on tens of NLP datasets via natural language instructions,"},{"citing_arxiv_id":"2309.08600","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sparse Autoencoders Find Highly Interpretable Features in Language Models","primary_cat":"cs.LG","submitted_at":"2023-09-15T17:56:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders applied to language model activations yield more interpretable and monosemantic features than alternative approaches, enabling finer causal analysis on the indirect object identification task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.10760","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Scaling Laws for Reward Model Overoptimization","primary_cat":"cs.LG","submitted_at":"2022-10-19T17:56:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Synthetic measurements show that gold-standard performance degrades according to distinct functional forms when optimizing proxy reward models via RL or best-of-n, with coefficients scaling smoothly by reward model parameter count.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}