super hub Mixed citations

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Bin Chen, Chengxing Xie, GLM-4.5 Team: Aohan Zeng, Qinkai Zheng, Xin Lv, Zhenyu Hou · 2025 · cs.CL · arXiv 2508.06471

Mixed citation behavior. Most common role is background (69%).

130 Pith papers citing it

Background 69% of classified citations

open full Pith review browse 130 citing papers more from Bin Chen arXiv PDF

abstract

We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd overall among all evaluated models and 2nd on agentic benchmarks. We release both GLM-4.5 (355B parameters) and a compact version, GLM-4.5-Air (106B parameters), to advance research in reasoning and agentic AI systems. Code, models, and more information are available at https://github.com/zai-org/GLM-4.5.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 19 baseline 3 method 3 other 1

citation-polarity summary

background 18 baseline 3 use method 3 unclear 2

claims ledger

abstract We present GLM-4.5, an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters, featuring a hybrid reasoning method that supports both thinking and direct response modes. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding (ARC) tasks, scoring 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified. With much fewer parameters than several competitors, GLM-4.5 ranks 3rd ove

authors

Bin Chen Chengxing Xie GLM-4.5 Team: Aohan Zeng Qinkai Zheng Xin Lv Zhenyu Hou

co-cited works

representative citing papers

UltraEP: Unleash MoE Training and Inference on Rack-Scale Nodes with Near-Optimal Load Balancing

cs.DC · 2026-06-02 · unverdicted · novelty 8.0

UltraEP is the first exact-load real-time expert balancer for large-EP MoE training and serving on rack-scale nodes, reaching 94.3% of ideal throughput and 1.49x over no-balancing.

Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models

cs.AR · 2026-05-11 · conditional · novelty 8.0

Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.

ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning

cs.LG · 2026-05-09 · conditional · novelty 8.0

ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

cs.CV · 2026-05-01 · conditional · novelty 8.0 · 2 refs

WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.

Beyond the Assistant Turn: User Turn Generation as a Probe of Interaction Awareness in Language Models

cs.AI · 2026-04-02 · unverdicted · novelty 8.0

User-turn generation reveals that LLMs' interaction awareness is largely decoupled from task accuracy, remaining near zero in deterministic settings even as accuracy scales to 96.8% on GSM8K.

Can Agents Generalize to the Open World? Unveiling the Fragility of Static Training in Tool Use

cs.AI · 2026-07-01 · unverdicted · novelty 7.0

Static SFT and RL training for tool-use agents leads to performance drops under open-world distributional shifts across perception, interaction, reasoning and internalization; perturbation-augmented fine-tuning is proposed as mitigation.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

MacroLens: A Multi-Task Benchmark for Contextual Financial Reasoning under Macroeconomic Scenarios

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

MacroLens is a point-in-time multi-signal benchmark dataset and seven tasks for evaluating contextual financial reasoning models under macroeconomic scenarios.

Is Agent Code Less Maintainable Than Human Code?

cs.SE · 2026-06-19 · unverdicted · novelty 7.0

Agent-generated code produces up to 13.1% lower follow-on task resolution rates than human code in chained repository-level experiments, with differences linked to behavioral patterns rather than conventional metrics.

When Do Intrinsic Rewards Work for Code Reasoning? A Comprehensive Study

cs.AI · 2026-06-18 · unverdicted · novelty 7.0

Empirical evaluation on LiveCodeBench shows certainty-based RLIF yields early gains followed by output shortening and reasoning collapse, providing no advantage for RLVR initialization on code tasks.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents

cs.CL · 2026-06-17 · unverdicted · novelty 7.0

LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.

daVinci-kernel: Co-Evolving Skill Selection, Summarization, and Utilization via RL for GPU Kernel Optimization

cs.LG · 2026-06-15 · unverdicted · novelty 7.0

daVinci-kernel is a multi-agent RL system that co-evolves skill selection, policy generation, and summarization via shared LLM and REINFORCE to optimize GPU kernels, reporting higher KernelBench scores than prior RL models.

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

cs.SE · 2026-06-11 · unverdicted · novelty 7.0 · 2 refs

Proposes COM-as-Action paradigm for deterministic software manipulation, introduces ComCADBench benchmark and ComActor agent that achieves SOTA performance over GUI baselines.

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

Stateful Visual Encoders for Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 7.0

Stateful visual encoders condition each visual representation on prior features, yielding consistent gains on multi-image tasks under supervised finetuning across model sizes and domains.

OVO-S-Bench: A Hierarchical Benchmark for Streaming Spatial Intelligence in Multimodal LLMs

cs.CV · 2026-06-02 · accept · novelty 7.0

OVO-S-Bench provides 1680 human-annotated questions on 348 videos to measure streaming spatial intelligence in MLLMs across instantaneous perception, spatiotemporal tracking, spatial simulation, and allocentric mapping.

Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs

cs.CL · 2026-05-29 · unverdicted · novelty 7.0

Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

VitaBench 2.0 introduces a benchmark for long-term personalized and proactive agent behavior, with results indicating substantial gaps in current frontier LLMs.

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.

CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning

cs.CL · 2026-05-19 · unverdicted · novelty 7.0

CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.

BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting

cs.CL · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.

GGBound: A Genome-Grounded Agent for Microbial Life-Boundary Prediction

cs.CY · 2026-05-14 · unverdicted · novelty 7.0

A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.

citing papers explorer

Showing 25 of 25 citing papers after filters.

LegalWorld: A Life-Cycle Interactive Environment for Legal Agents cs.CL · 2026-06-17 · unverdicted · none · ref 49 · internal anchor
LegalWorld is a life-cycle interactive environment modeling Chinese civil litigation as five causally connected stages grounded in 75,309 judgments, paired with LongJud-Bench for cross-stage agent evaluation.
Every Act Has Its Price: Compressed Moral Composition in Frontier LLMs cs.CL · 2026-05-29 · unverdicted · none · ref 98 · internal anchor
Moral Trolley Arena shows frontier LLMs produce composite moral preferences that are compressed rather than additive functions of calibrated component act strengths across Moral Foundations Theory.
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning cs.CL · 2026-05-21 · unverdicted · none · ref 50 · internal anchor
LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.
CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning cs.CL · 2026-05-19 · unverdicted · none · ref 42 · internal anchor
CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.
A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\Delta$ Integration into Upcycled MoE cs.CL · 2026-05-18 · unverdicted · none · ref 4 · internal anchor
PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.
BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy Backtesting cs.CL · 2026-05-18 · unverdicted · none · ref 37 · 2 links · internal anchor
Introduces BacktestBench benchmark with 18k QA pairs across four backtesting tasks and evaluates 23 LLMs via the AutoBacktest multi-agent system.
StoryAlign: Evaluating and Training Reward Models for Story Generation cs.CL · 2026-05-06 · unverdicted · none · ref 25 · internal anchor
StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset cs.CL · 2026-03-14 · unverdicted · none · ref 16 · internal anchor
OmniCompliance-100K supplies 12,985 distinct rules and 106,009 associated real-world cases from 74 multi-domain regulations to benchmark LLM safety and compliance.
MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning cs.CL · 2025-11-04 · unverdicted · none · ref 39 · internal anchor
MemSearcher trains LLMs to manage compact memory in multi-turn searches via multi-context GRPO for end-to-end RL, outperforming ReAct-style baselines with stable token counts.
MSQA: A Natively Sourced Multilingual and Multicultural SimpleQA Benchmark cs.CL · 2026-07-01 · unverdicted · none · ref 24 · internal anchor
MSQA benchmark shows LLMs exhibit cultural degradation where competence tracks pre-training data exposure more than reasoning ability, and inference fixes like sampling or retrieval do not close the gap.
Toward Generalist Autonomous Research via Hypothesis-Tree Refinement cs.CL · 2026-06-10 · unverdicted · none · ref 37 · internal anchor
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
Cross-Source Reasoning-based Correction for Author Name Disambiguation cs.CL · 2026-06-07 · unverdicted · none · ref 41 · internal anchor
CrossND is a full-stack framework using data refinement, cross-source reasoning via probabilistic soft logic, and test-time scaling to correct author assignments from inconsistent sources, outperforming 17 baselines on real-world datasets.
Self-Evolving Deep Research via Joint Generation and Evaluation cs.CL · 2026-06-03 · unverdicted · none · ref 44 · internal anchor
SCORE is a shared-parameter co-evolutionary framework coupling generation and evaluation of deep research reports with a meta-harness to adapt evaluation standards as performance improves.
Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory cs.CL · 2026-05-20 · unverdicted · none · ref 55 · internal anchor
Memory Grafting improves language-model benchmarks by grafting offline hidden-state memory from a larger model into a recipient model using n-gram lookups and lightweight adapters, outperforming MoE and vanilla Engram baselines at 0.92B and 2.8B scales.
Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models cs.CL · 2026-05-17 · unverdicted · none · ref 90 · internal anchor
PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.
Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making cs.CL · 2026-05-17 · unverdicted · none · ref 96 · internal anchor
Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.
Edit-Based Refinement for Parallel Masked Diffusion Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 30 · internal anchor
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
MAIC-UI: Making Interactive Courseware with Generative UI cs.CL · 2026-04-28 · unverdicted · none · ref 59 · internal anchor
MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluations showing usability gains and learning improvements.
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores cs.CL · 2026-04-21 · unverdicted · none · ref 14 · internal anchor
Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.
FinReasoning: A Hierarchical Benchmark for Reliable Financial Research Reporting cs.CL · 2026-02-25 · unverdicted · none · ref 38 · internal anchor
FinReasoning is a hierarchical benchmark that decomposes LLM financial research capabilities into semantic consistency, data alignment, and deep insight, revealing model-type differences in auditing versus insight generation.
BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding cs.CL · 2025-12-12 · unverdicted · none · ref 27 · internal anchor
BLASST dynamically sparsifies attention by thresholding softmax scores to skip blocks, delivering 1.5x speedups at 70%+ sparsity while preserving benchmark accuracy.
MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling cs.CL · 2025-11-14 · unverdicted · none · ref 4 · internal anchor
MiroThinker shows that scaling agent-environment interactions via reinforcement learning lets a 72B open-source model reach up to 81.9% on GAIA and approach commercial performance on research benchmarks.
Value-and-Structure Alignment for Routing-Consistent Quantization of Mixture-of-Experts Models cs.CL · 2026-06-04 · unverdicted · none · ref 26 · internal anchor
VSRAQ is a MoE-specific quantization objective that combines value and structure alignment to preserve expert-selection behavior and reduce quality loss without inference overhead.
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models cs.CL · 2025-12-02 · unverdicted · none · ref 16 · internal anchor
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer