{"total":650,"items":[{"citing_arxiv_id":"2606.26396","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization","primary_cat":"cs.LG","submitted_at":"2026-06-24T21:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.19638","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MiqraBERT: Regression-Based Sentence-BERT Finetuning for Biblical Hebrew Parallel Detection","primary_cat":"cs.CL","submitted_at":"2026-06-17T22:31:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiqraBERT, a finetuned Sentence-BERT model, achieves 2.7-fold better distributional separation of parallel versus non-parallel Biblical Hebrew verses and reduces ambiguous overlap from 24% to 6%, with strong performance on narrative but weak on poetic parallels.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17289","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Nothing from Something: Can a Language Model Discover 0?","primary_cat":"cs.AI","submitted_at":"2026-06-15T20:54:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Language models require explicit examples to learn zero in arithmetic but language pretraining halves the examples needed.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.17179","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why dimensional analysis works: general classification of self-similarity based on scale-invariance","primary_cat":"cond-mat.stat-mech","submitted_at":"2026-06-15T18:18:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A scale-invariance formulation explains why dimensional analysis succeeds and partitions self-similar solutions into three categories based on whether unit-induced and parameter-induced scale functions coincide.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11499","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Hubs or Fringes: Pretraining Data Selection via Web Graph Centrality","primary_cat":"cs.CL","submitted_at":"2026-06-09T22:44:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Web graph centrality from Common Crawl supplies an orthogonal signal for pretraining data selection that improves language model performance when central and peripheral hosts are balanced.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11409","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models","primary_cat":"cs.LG","submitted_at":"2026-06-09T19:59:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A compute-aware framework using cumulative FLOPs shows alignment training has non-monotonic effects on robustness and attack costs vary up to 5x across harm categories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11382","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction","primary_cat":"cs.LG","submitted_at":"2026-06-09T19:05:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"GLACIER combines graph, SMILES, and descriptor encoders with Finsler fusion and contrastive distillation to produce an efficient multimodal model for molecular property prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08783","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality","primary_cat":"math.OC","submitted_at":"2026-06-07T18:59:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05143","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HORIZON: Recoverability-Governed Curriculum for Physical-Domain Scaling","primary_cat":"cs.RO","submitted_at":"2026-06-03T17:50:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HORIZON is a recoverability-governed checkpointed frontier curriculum for on-policy physical-domain scaling on quadruped locomotion that identifies three regularities: uneven widening, non-monotonic composition, and the necessity of joint on-policy interaction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02211","ref_index":132,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Consistency Training while Mitigating Obfuscation via Rate Matching","primary_cat":"cs.CL","submitted_at":"2026-06-01T13:10:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02008","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Provable Data Scaling Law for Meta Learning via Complexity Minimization","primary_cat":"stat.ML","submitted_at":"2026-06-01T10:02:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A novel complexity minimization meta-learning framework provably demonstrates that few-shot adaptation error decreases as meta-training data volume increases.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01207","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning","primary_cat":"cs.CV","submitted_at":"2026-05-31T12:55:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Feature alignment quality determines whether concatenation or cross-attention excels for multimodal fusion, with concatenation winning on pre-aligned features due to lower sample complexity O(dv+dt) versus O(dv*dt).","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01155","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Data Is Scarce: Scaling Sparse Language Models with Repeated Training","primary_cat":"cs.LG","submitted_at":"2026-05-31T10:51:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse LLMs in data-scarce multi-epoch regimes follow a scaling law based on active parameters, unique tokens, repetition count, and sparsity level that predicts performance and delays data saturation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01080","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks","primary_cat":"cs.LG","submitted_at":"2026-05-31T07:57:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ThinkSwitch uses iterative self-distillation with QLoRA and spherical weight interpolation to raise both instruct and thinking checkpoint accuracy on small AIME and PubMedQA sets using only 15 human prompts per domain.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00813","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs","primary_cat":"cs.CR","submitted_at":"2026-05-30T17:07:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Non-monotonic safety alignment appears in Gemma models, with Gemma 3 at 68.7% ASR versus 45.5% in Gemma 2 and 33.9% in Gemma 4 via MAP-Elites red-teaming and cross-generational attack transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00771","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Logit Distillation on Manifolds: Mapping by Learning","primary_cat":"cs.LG","submitted_at":"2026-05-30T15:22:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Presents a layer- and point-wise projection mapping for manifold-based logit distillation combined with LoRA to enable low-parameter student training with reported WER gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02632","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: Prioritize Identifying Structure, Not Complex Models, for Scientific Discovery","primary_cat":"stat.ML","submitted_at":"2026-05-30T15:21:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Mechanistic learning from ML is generically underdetermined in high-dimensional proxy regimes, with LLMs worsening the problem by collapsing many possible explanations into one fluent narrative.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07623","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-30T14:07:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proves row-space criterion for finite determinacy in linear finite-field tasks, NP-completeness of minimal forcing subcontext, and anti-mirage theorem separating threshold metrics from semantic confidence via Keisler measures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00729","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AI Sovereignty as National Learning Capacity: A Human-Centered Learning Mechanics Viewpoint on France, the United States, and China","primary_cat":"cs.AI","submitted_at":"2026-05-30T13:49:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"AI sovereignty is presented as a nation's effectiveness at converting distributed information into absorbed, coordinated, and legitimate capability via a learning mechanics analogy applied to France, the US, and China.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00674","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-30T11:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Outcome optimization induces reward-induced manifold collapse in LLMs by favoring low-complexity spurious correlations over high-complexity causal reasoning, with process reward models acting as topological filters to block shortcuts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00499","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"OptiWorld: Optimal Control for Video World Generation under Physical Constraints","primary_cat":"cs.CV","submitted_at":"2026-05-30T03:13:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OptiWorld inserts a classical optimal-control layer that extracts a world state, plans an optimal trajectory on a geometric manifold under physical constraints, and renders the video conditioned on that trajectory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00324","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs Need Encoders for Semantic IDs Too","primary_cat":"cs.IR","submitted_at":"2026-05-29T20:01:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PrefixMem encoder for Semantic IDs improves deepest-level accuracy by up to 46% relative and full-SID retrieval recall by up to 22% relative on Pinterest data across LLM families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31535","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video","primary_cat":"cs.CV","submitted_at":"2026-05-29T16:50:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RayDer is a unified transformer backbone for self-supervised static-scene novel view synthesis that absorbs dynamic content as a nuisance factor and shows power-law scaling with data and compute while matching supervised methods in zero-shot settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.02624","ref_index":134,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TadA-Bench: A Million-Variant Benchmark for Future-Round Discovery Toward Agentic Protein Engineering","primary_cat":"q-bio.QM","submitted_at":"2026-05-29T12:12:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TadA-Bench supplies a chronological million-variant wet-lab replay benchmark from 31 TadA directed-evolution rounds that evaluates models on future-round variant ranking given only earlier data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31164","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D$^3$: Dynamic Directional Graph-Constrained Data Scheduling for LLM Training","primary_cat":"cs.CL","submitted_at":"2026-05-29T11:13:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"D³ introduces a dynamic directional graph-constrained framework that models sample interactions via loss dependencies to derive an optimized training sequence for LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30911","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness","primary_cat":"cs.CV","submitted_at":"2026-05-29T06:47:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The study links three LVLM architectural dimensions to three hallucination types via a new benchmark, finding that language foundation quality reduces co-occurrence errors, visual encoder strength reduces similarity errors, alignment reduces uncertainty errors, and joint visual-alignment improvement","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07597","ref_index":71,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them","primary_cat":"cs.LG","submitted_at":"2026-05-29T06:08:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Repetition rate mismatch between small-scale proxies and target budgets is the main reason data mixture experiments do not scale; a subsampling procedure that equalizes repetition rates recovers optimal mixtures from 1/16-scale experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30346","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"YoCausal: How Far is Video Generation from World Model? A Causality Perspective","primary_cat":"cs.CV","submitted_at":"2026-05-28T17:59:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"YoCausal benchmark shows video diffusion models detect the arrow of time but lack genuine causal understanding relative to humans.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30018","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Performance Profiling of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-28T14:41:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29548","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention","primary_cat":"cs.LG","submitted_at":"2026-05-28T08:02:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Larger models succeed on rare and complex tasks by reducing gradient interference from common tasks, allowing rare-task features to accumulate, as shown via synthetic task mixtures and OLMo pretraining from 4M to 4B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29448","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions","primary_cat":"cs.LG","submitted_at":"2026-05-28T06:40:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Vendi Score and scaling-law objectives belong to the class of matrix spectral functions, which are submodular, enabling efficient greedy selection of training data that outperforms random subsets in predicting held-out performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29358","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet","primary_cat":"cs.AI","submitted_at":"2026-05-28T04:57:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders scaled to 34 million features on Claude 3 Sonnet yield interpretable, steerable representations of concrete and abstract concepts that generalize across languages and modalities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29223","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Inferring the Size of Large Language Models From Popular Text Memorization","primary_cat":"cs.LG","submitted_at":"2026-05-28T01:20:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A method infers conservative lower bounds on LLM parameter counts from next-token accuracy profiles on popular texts using pairwise tests and PCA-based scaling-law estimation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27963","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Throughput-Optimized Networks at Scale","primary_cat":"cs.NI","submitted_at":"2026-05-27T04:56:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TONS uses linear optimization and heuristics to synthesize deadlock-free network topologies and routing for datacenter AI training, reporting 2.1x and 1.6x geometric mean speedups over best TPU torus variants for uniform random and all-to-all traffic in simulation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00112","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evolving to the Aesthetics of a Vision-Language Model","primary_cat":"cs.NE","submitted_at":"2026-05-27T04:38:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Vision-language models are used to score or rank evolved designs, compared to human artist preferences in a case study.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27918","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Addressing Variable Heterogeneity in Distributed Multimodal Training with Entrain","primary_cat":"cs.DC","submitted_at":"2026-05-27T03:44:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Entrain reduces microbatch workload variability by up to 10.6x and improves multimodal LLM training throughput by 1.4x via static model parallelism and deferred hierarchical microbatch assignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23901","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws","primary_cat":"cs.LG","submitted_at":"2026-05-22T17:59:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The Shannon Scaling Law treats LLM training as noisy-channel transmission and predicts U-shaped performance degradation when signal-to-noise ratio falls below a threshold, outperforming monotonic scaling laws on Pythia and OLMo2 data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23591","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Asymmetric Scaling Laws from Sparse Features","primary_cat":"stat.ML","submitted_at":"2026-05-22T13:00:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A sparse-activation model predicts double-descent loss with distinct under- and over-parameterized scaling exponents set by sparsity, plus a compute-optimal frontier favoring dataset growth.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23417","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"An Open-Source Training Dataset for Foundation Models for Black-box Optimization","primary_cat":"cs.LG","submitted_at":"2026-05-22T09:27:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"BBO-Pile is the first large-scale open dataset of real optimization trajectories used to train and scale foundation models that imitate black-box optimization methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23294","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference","primary_cat":"cs.AR","submitted_at":"2026-05-22T07:10:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23191","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation","primary_cat":"cs.LG","submitted_at":"2026-05-22T03:17:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RankElastor mitigates embedding collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23169","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRAXIS: Case-distilled and code-verified AI agents for biological research","primary_cat":"q-bio.QM","submitted_at":"2026-05-22T02:41:41+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23051","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"General-Purpose Photonic Computing Primitive for Contemporary Artificial Intelligence","primary_cat":"physics.optics","submitted_at":"2026-05-21T21:33:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DUET is a photonic tensor core paradigm that uses structural symmetry in VODICs to support arbitrary signed operands directly, experimentally tested on image classification, segmentation, and Transformer tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22940","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning","primary_cat":"cs.LG","submitted_at":"2026-05-21T18:16:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes HCLM framework formalizing entropy regularization via effective information force and geometric surrogates like log-determinant covariance, with experiments claiming stronger stable forces than softmax entropy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22821","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Tokenisation via Convex Relaxations","primary_cat":"cs.CL","submitted_at":"2026-05-21T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ConvexTok uses convex relaxation of tokenization to a linear program, improving intrinsic metrics, bits-per-byte, and some downstream tasks while certifying near-optimality within 1% at typical vocabulary sizes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22711","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Abstraction for Offline Goal-Conditioned Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-21T16:50:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces relativised options and hierarchical abstraction to reuse experience across similar contexts in offline GCRL, with two algorithms demonstrating performance gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22681","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Forecasting Scientific Progress with Artificial Intelligence","primary_cat":"cs.AI","submitted_at":"2026-05-21T16:23:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22672","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most","primary_cat":"cs.AI","submitted_at":"2026-05-21T16:14:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"More capable LLMs produce worse distributional forecasts on superlinear growth time series with tail risks of regime change, with the error concentrated in the upper tail; this reverses on conventional threshold metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22502","ref_index":78,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Compiling Agentic Workflows into LLM Weights: Near-Frontier Quality at Two Orders of Magnitude Less Cost","primary_cat":"cs.AI","submitted_at":"2026-05-21T13:54:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Compiling agentic workflows into LLM weights creates subterranean agents with near-frontier quality at two orders of magnitude less cost, validated empirically on travel booking, Zoom support, and insurance claims tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22341","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Boundary-Layer Mechanism for One-Third Scaling in Online Softmax Classification","primary_cat":"cs.LG","submitted_at":"2026-05-21T11:26:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Derives α^{-1/3} scaling for generalization error in online softmax classification from boundary layers in a teacher-student model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}