{"total":121,"items":[{"citing_arxiv_id":"2606.26396","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization","primary_cat":"cs.LG","submitted_at":"2026-06-24T21:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10106","ref_index":53,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What makes a harness a harness: necessary and sufficient conditions for an agent harness","primary_cat":"cs.SE","submitted_at":"2026-06-08T19:35:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09508","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs","primary_cat":"cs.AI","submitted_at":"2026-06-08T14:02:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long 
contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06840","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Characterize Then Distill: Mechanistic Reasoning in Large Output Spaces","primary_cat":"cs.CL","submitted_at":"2026-06-05T02:32:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reasoning in large output spaces proceeds via shortlisting then fine-grained reasoning; this characterization enables a mechanistic distillation strategy that outperforms standard distillation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30323","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"In-Context Reward Adaptation for Robust Preference Modeling","primary_cat":"cs.LG","submitted_at":"2026-05-28T17:56:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Transformer model with response-time auxiliary input adapts reward models to unseen human preference domains via in-context learning from demonstrations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27914","ref_index":124,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? 
A Self-Evolving, Trust-by-Construction Evaluation Paradigm","primary_cat":"cs.CL","submitted_at":"2026-05-27T03:41:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23660","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Using Large Language Models in Physics Education","primary_cat":"physics.ed-ph","submitted_at":"2026-05-22T14:11:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Frontier LLMs from late 2025 reach near-perfect scores on text-based physics problem solving and show improved human-grading alignment, yet still struggle to assign partial credit for flawed reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21622","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-20T18:32:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving 60% success rate across replicates in cantilever and phone-stand 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20730","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distributional Alignment as a Criterion for Designing Task Vectors in In-Context Learning","primary_cat":"cs.CL","submitted_at":"2026-05-20T05:26:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A distributional alignment metric d_NTP and a linear regression method LTV for task vectors that improves accuracy by 9.2% over baselines on classification and regression tasks across multiple LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18535","ref_index":60,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Scaling: Agents Are Heading to the Edge","primary_cat":"cs.LG","submitted_at":"2026-05-18T15:18:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Personal agents require edge deployment to preserve high-fidelity local context and zero-latency loops, as claimed through three structural shifts away from cloud-centric designs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18022","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise","primary_cat":"cs.LG","submitted_at":"2026-05-18T08:12:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Experiments on modular arithmetic with heavy label noise show that over-parameterized networks form a distributed internal generalization structure that can be extracted via 
frequency methods to achieve high accuracy despite 80% noise.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15104","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12809","ref_index":238,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces","primary_cat":"cs.LG","submitted_at":"2026-05-12T23:01:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10574","ref_index":23,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM Jaggedness Unlocks Scientific Creativity","primary_cat":"cs.AI","submitted_at":"2026-05-11T13:47:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Jagged capabilities in LLMs for scientific idea generation can be leveraged through 
inference-time ensembles to outperform individual models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"4.2 Exploiting Jaggedness: Meta-Model Ensembles. The jaggedness observed in our analyses suggests that contemporary LLMs are not merely scaled versions of one another but are, in meaningful ways, different. Prior work attributes such variation to factors including model scale, training data composition, architectural choices, and alignment strategies [23, 24, 25, 26]; however, regardless of its origin, this divergence manifests as differences in what models know, how they reason, and the kinds of creative moves they tend to make. [Table columns: Model, Inference-Time Compute, Pooling, Brainstorming. Individual Models: Claude-3.5-Sonnet ✗ ✗ ✗; Claude-3.7-Sonnet-Thinking ✓ ✗ ✗. Meta-Models: Router ✗ ✓ ✗; Top-5 ✗ ✓ ✓; Top-5-Parallel ✓ ✓ ✓]"},{"citing_arxiv_id":"2605.10405","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization","primary_cat":"cs.LG","submitted_at":"2026-05-11T11:43:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Doubly robust estimators that incorporate low-rank predictions enable valid finite-sample confidence intervals for best-model identification under adaptive sampling and without-replacement example selection in LLM evaluation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. [2] Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. A survey on large language models for code generation. ACM Transactions on Software Engineering and Methodology, 35(2):1-72, 2026. 
[3] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022. [4] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al."},{"citing_arxiv_id":"2605.09271","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding","primary_cat":"cs.AI","submitted_at":"2026-05-10T02:42:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"multi-step reasoning, their outputs remain unstable when expressed purely in natural language (Level 0) [51, 52], which inherently lacks explicit logical constraints and is riddled with semantic ambiguity [53, 54]. These limitations suggest that the performance bottleneck often arises not from a lack of latent capability, but from the inadequacy of natural language as a stable interface [55, 56]. Levels 1-2 directly address these two deficiencies: Ambiguity Elimination (Level 1) sharpens token-to-entity precision, while Logical Constraints (Level 2) enforces structural rigor on the inference trajectory, together shaping the model's internal schema to compensate for the deficiencies of natural language. 
[Figure labels: Strengthening Current Abilities; Extending the Ability Frontier]"},{"citing_arxiv_id":"2605.09106","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain","primary_cat":"cs.CL","submitted_at":"2026-05-09T18:28:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08904","ref_index":90,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces","primary_cat":"cs.AI","submitted_at":"2026-05-09T11:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08529","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Propagation Field: A Geometric Substrate Theory of Deep Learning","primary_cat":"cs.LG","submitted_at":"2026-05-08T22:26:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and 
reduced forgetting in continual learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07076","ref_index":60,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-Consolidating Language Models: Continual Knowledge Incorporation from Context","primary_cat":"cs.CL","submitted_at":"2026-05-08T00:50:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"collapse in the fine-tuning of large language models. In ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2024. [59] Ziniu Li, Congliang Chen, Tian Xu, Zeyu Qin, Jiancong Xiao, Zhi-Quan Luo, and Ruoyu Sun. Preserving diversity in supervised fine-tuning of large language models. arXiv preprint arXiv:2408.16673, 2024. [60] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022. 
[61] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu,"},{"citing_arxiv_id":"2605.06522","ref_index":62,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T16:29:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06154","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Graphlets as Building Blocks for Structural Vocabulary in Knowledge Graph Foundation Models","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:47:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Graphlets mined as structural tokens improve zero-shot inductive and transductive link prediction in knowledge graph foundation models across 51 diverse graphs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06040","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Novelty-based Tree-of-Thought Search for LLM Reasoning and Planning","primary_cat":"cs.AI","submitted_at":"2026-05-07T11:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Novelty estimation via LLM prompts enables pruning in Tree-of-Thought search, reducing overall token usage on language planning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04595","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints","primary_cat":"cs.LG","submitted_at":"2026-05-06T07:42:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A queueing model derives stability conditions for LLM inference services under combined compute and KV cache memory limits, with experimental validation showing typical deviations under 10%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.04243","ref_index":88,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA","primary_cat":"cs.AI","submitted_at":"2026-05-05T19:30:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks when representations are correct.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(PIS) successfully triggers targeted repairs to recover from ambiguous relational extractions [3]. However, on the unstructured, narrative-heavy TRACIE stress test, performance precipitously drops to ∼50% [57]. Rather than indicating a reasoning breakdown, this consistent cross-dataset trend provides robust empirical evidence for our central claim: temporal reasoning is not the primary bottleneck; structural representation is [88, 91]. 
Telemetry of the PIS Engine and Error Profiling. The precise etiology of these failures is laid bare by analyzing the step-level behavior of the PIS alongside the diagnostic error profiles [1]. Under structured inputs, the PIS remains uniformly stable and low, mathematically confirming the correct execution of interval constraints [64]. In the noisy TimeX-NLI setting, the PIS exhibits"},{"citing_arxiv_id":"2605.02300","ref_index":272,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Meta Reinforcement Learning Approach to Goals-Based Wealth Management","primary_cat":"cs.LG","submitted_at":"2026-05-04T07:48:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01420","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Artificial Jagged Intelligence as Uneven Optimization Energy Allocation Capability Concentration, Redistribution, and Optimization Governance","primary_cat":"cs.AI","submitted_at":"2026-05-02T12:37:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"AJI frames jagged AI capabilities as lower bounds on performance dispersion arising from concentrated optimization energy allocation under anisotropic objectives, with theorems on tradeoffs and redistribution interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00817","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language 
Models","primary_cat":"cs.CL","submitted_at":"2026-05-01T17:55:47+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27723","ref_index":118,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimized Deferral for Imbalanced Settings","primary_cat":"cs.LG","submitted_at":"2026-04-30T11:15:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MILD reformulates two-stage learning to defer as cost-sensitive learning over the input-expert domain and derives new margin-based losses with guarantees, yielding better performance than baselines on image classification and LLM routing tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24544","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator","primary_cat":"cs.AI","submitted_at":"2026-04-27T14:39:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"STELLAR-E modifies the TGRT Self-Instruct framework to produce tailored synthetic LLM evaluation datasets that score an average 5.7% higher on LLM-as-a-judge metrics than existing language-specific benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"The conclusions on all other metrics are the same as on the English 11 weak model experiments. 
The translated Mintaka dataset is slightly harder for the weak model compared to the real one, as this LLM is tied to the specific statistical patterns that it was trained on and the unnatural artifacts introduced by the translation may represent an out-of-distribution input [36, 37]. 6 Discussion and limitations. The implemented pipeline is simple to configure, offers the expected customizability and the DVE/DFE mechanisms prove effective according to the experiments. Overall, by averaging the G-Eval scores of strong and weak LLMs evaluations on both English and Italian datasets, the distance between the DVE & DFE datasets and the original Mintaka is +5."},{"citing_arxiv_id":"2604.23338","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"(web browsers, code interpreters, file systems, APIs) [1]-[4], and increasingly coordinates with other autonomous agents to accomplish long-horizon tasks [5]-[9]. This shift is driven partly by the emergent capabilities that arise at scale [10] and partly by new infrastructure for tool and memory integration [11]. [Footnote: K. Chu is with the Department of Computer Science and Engineering, University of Connecticut, Storrs, CT 06269 USA. E-mail: kexin.chu@uconn.edu.] This architectural complexity generates vulnerabilities that are emergent, compositional, and temporally extended. 
Security models designed for stateless systems cannot capture these properties. Consider three examples that motivate this survey: • An adversarially crafted document retrieved during a routine"},{"citing_arxiv_id":"2604.23108","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture of Heterogeneous Grouped Experts for Language Modeling","primary_cat":"cs.CL","submitted_at":"2026-04-25T02:05:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MoHGE achieves standard MoE performance with 20% fewer parameters and balanced GPU utilization via grouped heterogeneous experts, two-level routing, and specialized auxiliary losses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22207","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations","primary_cat":"cs.SE","submitted_at":"2026-04-24T04:22:17+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM pipeline with generation-critic feedback reaches 61% accuracy on low-level goal extraction from requirements documents and outperforms standalone few-shot prompting, yet remains best suited as an accelerator for manual work.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.21536","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Pre-trained LLMs Meet Sequential Recommenders: Efficient User-Centric Knowledge Distillation","primary_cat":"cs.IR","submitted_at":"2026-04-23T10:59:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A distillation 
technique embeds LLM-generated textual user profiles into efficient sequential recommenders without runtime LLM inference, architectural changes, or fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"have advanced the field significantly, these systems continue to face fundamental challenges: data sparsity leading to poor generalization and limited ability to capture user semantics beyond interaction patterns [15,16]. The rise of Large Language Models (LLMs) offers promising opportunities to enhance recommendation systems through their sophisticated semantic understanding capabilities [1,20,36,39]. This has led to methods ranging from zero-shot prompting [1,7,8,20,36,39] and feature augmentation [2,22,23,30,37,38] to full LLM fine-tuning for recommendation tasks [4,27,31]. However, these approaches [arXiv:2604.21536v1 [cs.IR] 23 Apr 2026; N. Severin et al.; figure labels: Transformer layer N, Transformer layer K, Transformer layer 1, Pre-trained LLM]"},{"citing_arxiv_id":"2604.20658","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows","primary_cat":"cs.CL","submitted_at":"2026-04-22T15:07:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20531","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Effects of Cross-lingual Evidence in Multilingual Medical Question 
Answering","primary_cat":"cs.CL","submitted_at":"2026-04-22T13:09:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Combining English and target-language web retrieval boosts medical QA for low-resource languages to match high-resource performance, while English web data benefits high-resource languages most and specialized sources like PubMed lack multilingual coverage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20915","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Absorber LLM: Harnessing Causal Synchronization for Test-Time Training","primary_cat":"cs.LG","submitted_at":"2026-04-22T02:58:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Absorber LLM introduces causal synchronization to absorb context into parameters for memory-efficient long-context LLM inference while preserving causal effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19193","ref_index":74,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Far Are Video Models from True Multimodal Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"cally quantifying these limitations, the proposed method provides actionable feedbacks and a clear roadmap toward truly robust, general-purpose video models. 
CLVG-Bench and code are released here. Keywords: Video generation · Multimodal reasoning · Video evaluation. 1 Introduction. Driven by large-scale training on web-scale data with generative objectives [6, 31, 74], video models have demonstrated groundbreaking zero-shot capabilities, evolving from mere instruction following to complex understanding and reasoning [57, 62, 71, 76]. Specifically, traditional reference-based video generation has [arXiv:2604.19193v1 [cs.CV] 21 Apr 2026; X. Zhang et al.; figure labels: Multiple Video Task Categories, Multimodal]"},{"citing_arxiv_id":"2604.18827","ref_index":113,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OmniMouse: Scaling properties of multi-modal, multi-task Brain Models on 150B Neural Tokens","primary_cat":"q-bio.NC","submitted_at":"2026-04-20T20:46:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OmniMouse demonstrates data-driven scaling in multi-task brain models on a 150B-token neural dataset, achieving SOTA across prediction, decoding, and forecasting while model size gains saturate.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17857","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"On the Emergence of Syntax by Means of Local Interaction","primary_cat":"cs.CL","submitted_at":"2026-04-20T06:10:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A 2D neural cellular automaton spontaneously self-organizes into a Proto-CKY representation that exhibits syntactic processing capabilities for context-free grammars when trained on membership 
problems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17770","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-04-20T03:49:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-AUG applies LLM in-context learning for embedding-space data augmentation in wireless ML, outperforming baselines and reaching near-oracle accuracy with only 15% labeled data on RadioML and IC datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17419","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ARMove: Learning to Predict Human Mobility through Agentic Reasoning","primary_cat":"cs.MA","submitted_at":"2026-04-19T12:59:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interpretability and robustness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17220","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Dynamics of Cognitive Heterogeneity: Investigating Behavioral Biases in Multi-Stage Supply Chains with LLM-Based 
Simulation","primary_cat":"cs.MA","submitted_at":"2026-04-19T03:03:25+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15529","ref_index":38,"ref_count":3,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LACE: Lattice Attention for Cross-thread Exploration","primary_cat":"cs.AI","submitted_at":"2026-04-16T21:19:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LACE enables concurrent reasoning paths in LLMs to interact via lattice attention and a synthetic training pipeline, raising accuracy more than 7 points over independent parallel search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12816","ref_index":77,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The role of System 1 and System 2 semantic memory structure in human and LLM biases","primary_cat":"cs.CL","submitted_at":"2026-04-14T14:43:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Human semantic memory networks for System 1 and System 2 are structurally distinct and consistently relate to implicit gender bias levels, but LLM networks do not exhibit these properties.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12426","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Do Transformers Use their Depth Adaptively? 
Evidence from a Relational Reasoning Task","primary_cat":"cs.LG","submitted_at":"2026-04-14T08:16:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transformers show limited adaptive depth use on relational reasoning, with clearer evidence after finetuning on the task.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12525","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PERCEIVE: A Benchmark for Personalized Emotion and Communication Behavior Understanding on Social Media","primary_cat":"cs.SI","submitted_at":"2026-04-10T09:35:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PERCEIVE is the first bilingual benchmark integrating author content, reader emotions from comments, communication behavior, user attributes, and social graphs for personalized social media emotion understanding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08044","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"A Full-Stack Performance Evaluation Infrastructure for 3D-DRAM-based LLM Accelerators","primary_cat":"cs.AR","submitted_at":"2026-04-09T09:48:43+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ATLAS is the first silicon-validated simulation framework for 3D-DRAM LLM accelerators, achieving under 8.57% error and over 97% correlation with real hardware while supporting design exploration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07745","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Cartesian Cut in Agentic 
AI","primary_cat":"cs.AI","submitted_at":"2026-04-09T03:03:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"9 (2025), pp. 783-786.doi:10.1016/j.tics.2025.07.004. url:https://doi.org/10.1016/j.tics.2025.07.004. [63] Miles Turpin et al. \"Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting\". In:arXiv(2023). NeurIPS 2023.doi: 10.48550/arXiv.2305.04388. arXiv:2305.04388 [cs.CL].url:https://arxiv. org/abs/2305.04388. [64] Jason Wei et al. \"Emergent abilities of large language models\". In:arXiv preprint arXiv:2206.07682(2022). [65] Daniel M. Wolpert, R. Chris Miall, and Mitsuo Kawato. \"Internal models in the cerebel- lum\". In:Trends in Cognitive Sciences2.9 (1998), pp. 338-347.doi:10.1016/S1364- 6613(98)01221-2.url:https://doi.org/10.1016/S1364-6613(98)01221-2. 23"},{"citing_arxiv_id":"2604.07530","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Shrinking Lifespan of LLMs in Science","primary_cat":"cs.DL","submitted_at":"2026-04-08T19:12:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM adoption in science follows a compressing inverted-U trajectory where release year predicts time-to-peak and lifespan better than model attributes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}