CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed models and 22.5 for open-source ones.
super hub Mixed citations
GLM-5: from Vibe Coding to Agentic Engineering
Mixed citation behavior. Most common role is background (68%).
abstract
We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that further improve RL quality, enabling the model to learn from complex, long-horizon interactions more effectively. Through these innovations, GLM-5 achieves state-of-the-art performance on major open benchmarks. Most critically, GLM-5 demonstrates unprecedented capability in real-world coding tasks, surpassing previous baselines in handling end-to-end software engineering challenges. Code, models, and more information are available at https://github.com/zai-org/GLM-5.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract We present GLM-5, a next-generation foundation model designed to transition the paradigm of vibe coding to agentic engineering. Building upon the agentic, reasoning, and coding (ARC) capabilities of its predecessor, GLM-5 adopts DSA to significantly reduce training and inference costs while maintaining long-context fidelity. To advance model alignment and autonomy, we implement a new asynchronous reinforcement learning infrastructure that drastically improves post-training efficiency by decoupling generation from training. Furthermore, we propose novel asynchronous agent RL algorithms that fur
authors
co-cited works
years
2026 125representative citing papers
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.
Introduces VidPair-Halluc benchmark of 1K background-controlled adversarial video pairs and 11K QA pairs generated via PairFlow pipeline to evaluate hallucination in LVMs.
SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.
Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.
Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.
EntSQL is a new benchmark with 1,066 examples across five domains where top systems reach only 15.9% accuracy on English inputs when long-form enterprise documents are provided.
Moment-Video benchmark shows top video MLLM achieves only 39.6% accuracy on momentary visual event tasks, with most open-source models below 25%.
AutoMedBench evaluates AI agents on long-horizon medical workflows across five stages and finds validation and submission as dominant failure points based on thousands of runs.
CRAB-Bench and RUSE create a new evaluation framework for LLM agents on constraint-graph tasks with realistic human-like user behaviors, reporting 61% pass@1 for the best model and up to 57% further drops under RUSE.
Introduces LongJudgeBench benchmark showing LLM judges remain unstable for long-form output evaluation even with rubrics or references.
On a real multi-node H100 cluster the authors show that for MLA, routing the ~1 KB compressed query row is cheaper than moving cache chunks and supply a topology-aware cost model accurate to ~7% on IBGDA fabrics.
DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.
PassNet provides a dataset of 18K graphs and PassBench for LLM-generated compiler passes, with fine-tuned models achieving 2.67x gains on long-tail tasks where TorchInductor underperforms.
SCDBench introduces a 600-contract dataset and cumulative four-stage evaluation showing frontier LLMs achieve perfect semantic decompilation on only 42 contracts.
LiveBrowseComp shows search agents rely on intrinsic knowledge on standard benchmarks, with scores dropping 25-40 points and closed-book accuracy below 2% on questions about facts from the prior 90 days.
LiveK12Bench is a growing multi-disciplinary benchmark showing LMMs like GPT-5 drop from 79 to 53 under realistic exam constraints including process rigor and efficiency.
LLM-Wiki structures external knowledge as compilable wiki pages with links and persistent self-correction, achieving SOTA results on HotpotQA, MuSiQue, and 2WikiMultiHopQA by 2.0-8.1 F1 points over prior RAG systems.
AuthTrace is a diagnostic benchmark that annotates fan-in gradients in single-author corpora to measure evidence recall, precision, and answer correctness across eight systems in retrieval, memory, graph, and structured-evidence paradigms.
MemMark enables snapshot-only attribution for agent long-term memory by embedding signals via keyed distribution-preserving sampling at memory-write decisions, recovering 40-bit payloads with near-baseline utility.
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
citing papers explorer
No citing papers match the current filters.