{"total":258,"items":[{"citing_arxiv_id":"2606.24506","ref_index":25,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation","primary_cat":"cs.DC","submitted_at":"2026-06-23T12:34:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CrossPool separates weights and KV-cache into distinct GPU pools plus a planner, virtualizer, and layer-wise scheduler to cut P99 time-between-tokens by up to 10.4x versus prior kvcached multi-LLM systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07995","ref_index":85,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR","primary_cat":"cs.CL","submitted_at":"2026-06-06T06:22:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ShopTrajQA long-context benchmark and an RLVR-trained tool-augmented agent that bypasses LLM context limits by external file storage and code-based retrieval for shopping trajectories.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00771","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Logit Distillation on Manifolds: Mapping by Learning","primary_cat":"cs.LG","submitted_at":"2026-05-30T15:22:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Presents a layer- and point-wise projection mapping for manifold-based logit distillation combined with LoRA to enable low-parameter student training with reported WER 
gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00686","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing","primary_cat":"cs.LG","submitted_at":"2026-05-30T11:49:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SafeMoE isolates unsafe knowledge in domain-specific LoRA experts and routes them via a lightweight gate trained on safe responses to produce safer and more informative LLM outputs with zero-shot generalization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00609","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts","primary_cat":"cs.LG","submitted_at":"2026-05-30T08:18:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CARE-RL combines PA-GRM for task-adaptive rewards on open-ended tasks and DACSP for modulating RL updates using historical capability directions, reporting higher total average scores than baselines on Qwen models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00467","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance","primary_cat":"cs.CL","submitted_at":"2026-05-30T01:21:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs correct only 34.8% of zero-shot annotation errors via prompting, and Definition-Specific 
Familiarity correlates positively with performance (partial r = +0.41) while memorization metrics do not.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00437","ref_index":92,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing","primary_cat":"cs.LG","submitted_at":"2026-05-30T00:05:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00359","ref_index":102,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Next-Billion AI Index: The compass for AI utility and adoption in the global majority","primary_cat":"cs.CY","submitted_at":"2026-05-29T21:01:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces nexbax, a diagnostic framework with three themes and 10 dimensions for evaluating AI economic viability, operational practicality, and societal integrity in next-billion-user contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11232","ref_index":105,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-29T02:36:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Moral Trolley Arena shows frontier LLMs produce composite moral 
preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30486","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting","primary_cat":"cs.LG","submitted_at":"2026-05-28T19:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GC-MoE improves MAE on four traffic forecasting benchmarks by routing nodes to combinations of frozen spatio-temporal GNN experts via a graph-conditioned lightweight router, training only ~17K parameters atop 1.5M frozen weights.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30018","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Latent Performance Profiling of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-28T14:41:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Introduces Latent Performance Profiling (LPP) as a task-agnostic framework deriving scalar metrics from LLM latent representations and dynamics to complement benchmark evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07585","ref_index":110,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach","primary_cat":"cs.CV","submitted_at":"2026-05-27T16:36:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes cross-attention 
audio-video fusion and VE-MD latent-space models for group emotion recognition that avoid individual cues and report competitive performance via ablation studies on synthetic and real data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.27849","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FPMoE: A Sparse Mixture-of-Experts Approach to Functional Code Generation","primary_cat":"cs.PL","submitted_at":"2026-05-27T02:06:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FPMoE applies a sparse MoE architecture with per-language routed experts and a shared expert to improve LLM code generation on functional languages, outperforming fine-tuned baselines while matching larger models with 3B active parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23764","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HyperParallel-MoE: Multi-Core Interleaved Scheduling for Fast MoE Training on Ascend NPUs","primary_cat":"cs.DC","submitted_at":"2026-05-22T15:35:23+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22602","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Think Thrice Before You Speak: Dual knowledge-enhanced Theory-of-Mind Reasoning for Persuasive Agents","primary_cat":"cs.AI","submitted_at":"2026-05-21T15:15:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces ToM-PD task and ToM-BPD dataset plus TTBYS dual-knowledge framework, with Qwen3-8B outperforming GPT-5 on 
desire, belief, and strategy prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22403","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Translating Signals to Languages for sEMG-Based Activity Recognition","primary_cat":"cs.CV","submitted_at":"2026-05-21T12:31:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM-sEMG maps sEMG signals to language via a dedicated mechanism to enable LLMs to perform accurate activity recognition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21427","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PALS: Power-Aware LLM Serving for Mixture-of-Experts Models","primary_cat":"cs.AI","submitted_at":"2026-05-20T17:19:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PALS adds dynamic GPU power capping to LLM serving frameworks like vLLM, jointly tuning it with batch size via offline models and feedback control to improve energy efficiency up to 26.3% and cut QoS violations 4-7x on dense and MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21272","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MONET: A Massive, Open, Non-redundant and Enriched Text-to-image dataset","primary_cat":"cs.CV","submitted_at":"2026-05-20T15:04:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MONET is an open 104.9M image-text pair dataset created via safety filtering, deduplication, and multi-VLM recaptioning from 2.9B raw pairs, validated by training a competitive 4B-parameter latent 
diffusion model.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"based filtering for source governance. Each surviving image is re-captioned by multiple VLMs, ranging from short concept-level to long fine-grained descriptions, and the corpus is augmented with synthetic samples generated by Apache 2.0 T2I models. All samples are shipped with standard image embeddings (DINOv2 [64], CLIP [70], SSCD [66]), classifiers and detectors (YOLO [41], Mediapipe [61]), and pre-encoded with SANA VAE [102]. We also provide a comprehensive analysis of the dataset, including statistics, content and topic analyses, and human quality assessment, and validate its usefulness by training a 4B-parameter T2I model exclusively on MONET, which achieves competitive evaluation scores. 2 Related work Text-to-image models Although early GAN-based approaches [27, 104, 79, 44] laid the ground-"},{"citing_arxiv_id":"2605.21100","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"NanoCP: Request-Level Dynamic Context Parallelism for Data-Expert Parallel Decoding","primary_cat":"cs.DC","submitted_at":"2026-05-20T12:28:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"NanoCP introduces request-level dynamic context parallelism to decouple MoE communication from KV cache placement in hybrid data-expert parallel serving, reporting up to 3.27x higher request rates and 2.12x lower P99 latency under TPOT SLOs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21071","ref_index":20,"ref_count":6,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fine-grained Claim-level RAG Benchmark for 
Law","primary_cat":"cs.CL","submitted_at":"2026-05-20T11:56:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20179","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TIDE: Efficient and Lossless MoE Diffusion LLM Inference with I/O-aware Expert Offload","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TIDE schedules I/O-aware expert offloading for MoE diffusion LLMs by solving for an optimal refresh interval that exploits temporal stability of activations, yielding up to 1.5x throughput gain losslessly.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20165","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-05-19T17:50:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19957","ref_index":73,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"World-Ego Modeling for Long-Horizon Evolution in Hybrid Embodied 
Tasks","primary_cat":"cs.CV","submitted_at":"2026-05-19T15:10:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Proposes World-Ego Modeling with WEM using CP-MoE diffusion and a new HTEWorld benchmark, claiming SOTA on hybrid navigation-manipulation tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19762","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code","primary_cat":"cs.AI","submitted_at":"2026-05-19T12:37:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Controlled experiments show structured reasoning traces and higher-density math-domain samples improve mathematical reasoning more than pure executable code, with internal routing patterns reflecting these data effects.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19481","ref_index":27,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"C2CServe: Leveraging NVLink-C2C for Elastic Serverless LLM Serving on MIG","primary_cat":"cs.OS","submitted_at":"2026-05-19T07:34:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"C2CServe is a request-granularity serverless LLM serving system that keeps weights in host memory and streams them via C2C to MIG instances, cutting cold-start latency up to 7.1x while preserving TTFT/TPOT under contention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18643","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Post-Trained MoE Can Skip Half Experts via 
Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-18T16:50:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18163","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction","primary_cat":"cs.AI","submitted_at":"2026-05-18T10:08:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRACE uses cross-layer candidate trajectories inside frozen LLMs to dynamically select and apply one of three correction operators, delivering mean gains of +12.26 MC1 and +8.65 MC2 points across 15 models and 3 benchmarks with no regressions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18106","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers","primary_cat":"math.OC","submitted_at":"2026-05-18T09:17:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"still improves markedly over the coordinate-wise AdamW baseline. We also provide additional experimental results for a base learning rate sweep and two extra random seeds in Section G.2. 4.3 OLMoE-1B-7B-Style Pre-Training In addition to dense language models, we also pre-train a sparse Mixture-of-Experts (MoE) model, a widely used architecture in recent open-weight language models [72, 34, 120, 128, 53, 142, 58, 33]. 
We use AllenAI's OLMoE-1B-7B [115], which provides a comprehensive training recipe together with open-source data, code, and training logs. The model has vocabulary size 50,304 and hidden dimension 2048, making the embedding and LM head matrices considerably large. Relative to the original pre-training setup, we remove the auxiliary load-balancing loss [136] and the router"},{"citing_arxiv_id":"2605.17889","ref_index":11,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CoX-MoE: Coalesced Expert Execution for High-Throughput MoE Inference with AMX-Enabled CPU-GPU Co-Execution","primary_cat":"cs.LG","submitted_at":"2026-05-18T05:54:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CoX-MoE achieves up to 7.1x higher throughput than FlexGen for MoE inference via coalesced expert execution and AMX-enabled CPU-GPU orchestration with static expert stratification.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17598","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mixture of Experts for Low-Resource LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-17T18:50:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Pre-trained MoE models exhibit deep-layer routing collapse for low-resource languages like Hebrew, largely corrected by continual pre-training on balanced bilingual data, with consistent patterns observed in Japanese.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17228","ref_index":103,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model 
Decision-Making","primary_cat":"cs.CL","submitted_at":"2026-05-17T02:28:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17106","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools","primary_cat":"cs.CL","submitted_at":"2026-05-16T18:19:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up to 72.5 percent cost savings on coding benchmarks while remaining decoupled from specific","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16849","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"SpaceMoE: Towards Orbital General Intelligence with Distributed Mixture-of-Experts Inference","primary_cat":"cs.NI","submitted_at":"2026-05-16T07:15:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SpaceMoE is presented as a new paradigm for distributed MoE inference in satellite networks, with satellite-specific constraints reshaping expert placement, selection, and hidden-state 
routing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16690","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"UB-SMoE: Universally Balanced Sparse Mixture-of-Experts for Resource-adaptive Federated Fine-tuning of Foundation Models","primary_cat":"cs.LG","submitted_at":"2026-05-15T23:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UB-SMoE balances expert utilization in heterogeneous federated SMoE fine-tuning via Dynamic Modulated Routing and Universal Pseudo-Gradient, delivering up to 45% compute reduction and 8.7x performance gains for low-resource clients over prior LoRA-rank methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15484","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Does Sparse MoE Help in Vision? 
The Role of Backbone Compute Leverage in Sparse Routing","primary_cat":"cs.CV","submitted_at":"2026-05-15T00:01:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Sparse MoE vision models show positive accuracy gaps only when routing a substantial compute fraction ρ and using k≥2 experts at large scale; batch-axis dispatch is identified as a key failure mode.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15104","ref_index":153,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents","primary_cat":"cs.CL","submitted_at":"2026-05-14T17:22:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A dataset-agnostic framework converts text tool-calling benchmarks to paired audio evaluations via TTS, speaker variation and noise, then evaluates seven omni-modal models showing model- and task-dependent performance with small text-to-voice gaps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14438","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BEAM: Binary Expert Activation Masking for Dynamic Routing in MoE","primary_cat":"cs.AI","submitted_at":"2026-05-14T06:33:41+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BEAM uses binary expert activation masks trained end-to-end to achieve dynamic sparsity in MoE models, cutting FLOPs by 85% with over 98% performance 
retention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13997","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HodgeCover: Higher-Order Topological Coverage Drives Compression of Sparse Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:07:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13769","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dense vs Sparse Pretraining at Tiny Scale: Active-Parameter vs Total-Parameter Matching","primary_cat":"cs.CL","submitted_at":"2026-05-13T16:48:24+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"At tiny scale, MoE transformers lower validation loss versus dense models when active parameters match but raise it when total stored parameters match.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13247","ref_index":6,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EMO: Frustratingly Easy Progressive Training of Extendable MoE","primary_cat":"cs.LG","submitted_at":"2026-05-13T09:31:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EMO progressively expands the expert pool in MoE models during training to match fixed-expert performance with improved wall-clock 
efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13190","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation","primary_cat":"cs.LG","submitted_at":"2026-05-13T08:46:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12922","ref_index":28,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Attention Closes: How LLMs Lose the Thread in Multi-Turn Interaction","primary_cat":"cs.AI","submitted_at":"2026-05-13T02:58:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16401","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification","primary_cat":"cs.CV","submitted_at":"2026-05-12T19:38:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"CADS is a conformal-prediction-driven model cascade that routes images to scout or oracle models based on estimated complexity to reduce inference cost while preserving 
accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12476","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Routers Learn the Geometry of Their Experts: Geometric Coupling in Sparse Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:55:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Routers in SMoE models form geometric alignments with their experts through shared gradient directions, enabling effective specialization that auxiliary load-balancing losses tend to disrupt.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"the paper, we refer to $z_i$ and $p_i$ as the router's score and the routing weight of expert $i$, respectively, and denote the set of top-K selected experts by $T_K$. The SMoE layer combines only the selected expert outputs: $y = \\sum_{i \\in T_K} p_i E_i(x)$. (2) Each expert $E_i$ maps the hidden state through an intermediate expert dimension. In the gated SwiGLU experts used by recent SMoEs [12, 5, 4], this computation can be written as $E_i(x) = W_i^{\\mathrm{down}} \\left( \\sigma(W_i^{\\mathrm{gate}} x) \\odot W_i^{\\mathrm{up}} x \\right)$, (3) where $\\sigma$ is the SiLU activation and $W_i^{\\mathrm{gate}}, W_i^{\\mathrm{up}} \\in \\mathbb{R}^{d_i \\times d}$ are input-side expert matrices with an intermediate dimension $d_i$. 
Throughout the paper, we refer to the coordinates of $\\sigma(W_i^{\\mathrm{gate}} x)$ as the expert's gate-neuron activations; these are the activations measured in our empirical analysis."},{"citing_arxiv_id":"2605.12258","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Instruction Lens Score: Your Instruction Contributes a Powerful Object Hallucination Detector for Multimodal Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-12T15:27:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Instruction token embeddings encode visual information that can be leveraged to detect object hallucinations in MLLMs via a new combined score outperforming prior detectors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12197","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Unified Graph Language Model for Multi-Domain Multi-Task Graph Alignment Instruction Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:37:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniGraphLM uses a multi-domain multi-task GNN encoder and adaptive alignment to create unified graph tokens for LLMs across diverse domains and tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bamford, D. S. Chaplot, D. d. l. Casas, E. B. Hanna, F. Bressand et al., \"Mixtral of experts,\" arXiv preprint arXiv:2401.04088, 2024. [43] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale et al., \"Llama 2: Open foundation and fine-tuned chat models,\" arXiv preprint arXiv:2307.09288, 2023. [44] J. Liu, C. Yang, Z. Lu, J. Chen, Y. Li, M. Zhang, T. Bai, Y. Fang, L. Sun, P. S. 
Yu et al., \"Towards graph foundation models: A survey and beyond,\" arXiv preprint arXiv:2310.11829, 2023. [45] X. Sun, H. Cheng, J. Li, B. Liu, and J. Guan, \"All in one: Multi-task prompting for graph neural networks,\" in Proceedings of the 29th ACM SIGKDD conference on knowledge discovery and data mining, 2023, pp."},{"citing_arxiv_id":"2605.11800","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ROMER: Expert Replacement and Router Calibration for Robust MoE LLMs on Analog Compute-in-Memory Systems","primary_cat":"cs.LG","submitted_at":"2026-05-12T08:57:59+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ROMER cuts perplexity by up to 59% in noisy analog CIM environments for MoE LLMs via expert replacement and router recalibration calibrated on real-chip measurements.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11333","ref_index":70,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces","primary_cat":"cs.DC","submitted_at":"2026-05-11T23:38:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11277","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts 
Models","primary_cat":"cs.AR","submitted_at":"2026-05-11T22:00:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of Experts. arXiv:2401.04088 [cs.LG] https://arxiv.org/abs/2401.04088 [25] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, et al. 2017. In-datacenter performance analysis of a tensor processing unit. In Proceedings of the 44th annual international symposium on computer architecture. 1-12. [26] Jin Hyun Kim, Shin-Haeng Kang, Sukhan Lee, Hyeonsu Kim, Yuhwan Ro, Seung-"},{"citing_arxiv_id":"2605.11255","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HEBATRON: A Hebrew-Specialized Open-Weight Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2026-05-11T21:27:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hebatron is the first open-weight Hebrew MoE LLM adapted from Nemotron-3, reaching 73.8% on Hebrew reasoning benchmarks while activating only 3B parameters per pass and supporting 65k-token context.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}