UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
Mixed citation behavior. Most common role is background (69%).
abstract
We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.
representative citing papers
Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.
WildTableBench is the first QA benchmark for naturally occurring table images; of the 21 multimodal models evaluated, only one exceeded 50% accuracy.
User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.
Proposes the Monotonic Inference Policy Improvement (MIPI) objective and the MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.
Proposes the COM-as-Action paradigm for deterministic software manipulation and introduces the ComCADBench benchmark and the ComActor agent, which achieves SOTA performance over GUI baselines.
LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.
Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.
OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
CopT reverses CoT by eliciting a draft answer first, then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
The PRISM benchmark of over 10k pairs shows that LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
Introduces the BacktestBench benchmark, with 18k QA pairs across four backtesting tasks, and evaluates 23 LLMs via the AutoBacktest multi-agent system.
A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts: demand is concentrated, supply and use are mismatched, and task-specific routing yields major cost and throughput gains.
AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.
Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.
SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.