Kimi K2.5: Visual Agentic Intelligence

Kimi Team: Tongtong Bai, S.H. Cai, Y. Charles, Yifan Bai, Yiping Bao, Yuan Cao · 2026 · cs.CL · arXiv 2602.02276

Mixed citation behavior. The most common role is background (68% of classified citations).

Cited by 185 Pith papers.

abstract

We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.

citation-role summary

background 42 · baseline 13 · method 3 · other 2

citation-polarity summary

background 41 · baseline 13 · unclear 3 · use method 3
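
Both summaries cover 60 classified citations. As a minimal sketch, assuming each share is taken over that classified total and rounded to the nearest percent (the page does not state its rounding rule), the counts can be turned into percentages as follows. The helper name `shares` is hypothetical. Under these assumptions, the 68% background figure appears to correspond to the polarity count (41 of 60) rather than the role count (42 of 60).

```python
# Counts copied from the two summaries above.
# Assumption: shares are computed over the classified total and
# rounded to the nearest whole percent.
role_counts = {"background": 42, "baseline": 13, "method": 3, "other": 2}
polarity_counts = {"background": 41, "baseline": 13, "unclear": 3, "use method": 3}


def shares(counts):
    """Return each label's share of the total as a rounded percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

print(role_shares["background"], polarity_shares["background"])  # prints: 70 68
```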

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

cs.CL · 2026-05-09 · unverdicted · novelty 8.0 · 2 refs

Soohak is a 439-problem mathematician-curated benchmark where frontier LLMs reach at most 30.4% on research math challenges and no model exceeds 50% on refusal for ill-posed problems.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild

cs.CV · 2026-05-01 · conditional · novelty 8.0 · 2 refs

WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.

Can Coding Agents Reproduce Findings in Computational Materials Science?

cs.SE · 2026-05-01 · conditional · novelty 8.0

AutoMat benchmark shows current LLM coding agents achieve at most 54.1% success when reproducing computational materials science claims from papers.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

HWE-Bench: Benchmarking LLM Agents on Real-World Hardware Bug Repair Tasks

cs.AI · 2026-04-16 · unverdicted · novelty 8.0

HWE-Bench is the first repository-level benchmark for LLM agents on real hardware bug repair, where the best agent fixes 70.7% of 417 tasks but drops below 65% on complex SoC projects.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Narrative over Numbers: The Identifiable Victim Effect and its Amplification Under Alignment and Reasoning in Large Language Models

cs.CL · 2026-04-13 · conditional · novelty 8.0

Large language models display the identifiable victim effect at roughly twice the human baseline, strongly amplified by instruction tuning and chain-of-thought prompting but inverted by reasoning-specialized models.

OccuBench: Evaluating AI Agents on Real-World Professional Tasks via Language Environment Simulation

cs.CL · 2026-04-13 · unverdicted · novelty 8.0

OccuBench is a new benchmark for AI agents on real-world occupational tasks via LLM-driven simulators, showing no model dominates all industries, implicit faults are hardest, and larger models with more reasoning perform better.

FashionMV: Product-Level Composed Image Retrieval with Multi-View Fashion Data

cs.CV · 2026-04-11 · unverdicted · novelty 8.0

FashionMV introduces product-level multi-view CIR, a 127K-product dataset built via automated LMM pipeline, and a 0.8B ProCIR model that beats larger baselines on three fashion benchmarks.

Towards Cross-lingual Values Judgment: A Consensus-Pluralism Perspective

cs.CL · 2026-02-19 · unverdicted · novelty 8.0

X-Value is the first cross-lingual values judgment benchmark that reveals limitations and performance gaps in LLMs across languages and issue categories.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.

MuseBench: Benchmarking Intent-Level Audiovisual Arts Understanding in MLLMs

cs.CV · 2026-06-29 · unverdicted · novelty 7.0

MuseBench shows state-of-the-art MLLMs achieve only 48.29% accuracy on intent-level audiovisual arts understanding versus 87.18% for human experts.

SpreadsheetBench 2: Evaluating Agents on End-to-End Business Spreadsheet Workflows

cs.SE · 2026-06-29 · unverdicted · novelty 7.0

SpreadsheetBench 2 provides 321 expert-validated tasks from authentic business data showing frontier LLMs reach only 34.89% overall accuracy on end-to-end spreadsheet workflows.

The Mirage of Optimizing Training Policies: Monotonic Inference Policies as the Real Objective for LLM Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

Proposes Monotonic Inference Policy Improvement (MIPI) objective and MIPU two-step update framework to address objective misalignment between training and inference policies in LLM reinforcement learning.

Dockerless: Environment-Free Program Verifier for Coding Agents

cs.SE · 2026-06-26 · unverdicted · novelty 7.0

Dockerless uses agentic repository exploration to verify patches without execution, enabling SFT and RL training of coding agents that reach 62.0/50.0/35.2% resolve rates on SWE-bench Verified/Multilingual/Pro while matching environment-based results.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

DragOn: A Benchmark and Dataset for Drag-Based GUI Interactions

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

DragOn provides a new drag-grounding benchmark and training dataset for GUI agents, with evaluations suggesting potential improvements on computer-use tasks.

Spectral Scaling Laws of Muon

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Muon momentum matrices show layer-dependent power-law scaling of stabilized singular value quantiles with model size from 77M to 2.8B parameters.

Towards Characterizing Scientific Image Utility and Upgradability

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

The SIU²A framework evaluates scientific images for error detection, repair feasibility, and correction quality, showing current multimodal systems have major limitations in preserving scientific validity.

WorldCoder-Bench: Benchmarking Physically Grounded 3D World Synthesis

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

Introduces WorldCoder-Bench and StateProbe for evaluating LLM-generated physically grounded 3D browser worlds, with frontier models reaching at most 27.8% verification coverage.

ChartArena: Benchmarking Chart Parsing across Languages, Scenarios, and Formats

cs.CV · 2026-05-31 · unverdicted · novelty 7.0

ChartArena is a new benchmark dataset and evaluation protocol for chart parsing by MLLMs that covers numeric and diagrammatic charts in multiple languages and real-world visual conditions.

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

cs.CV · 2026-05-29 · unverdicted · novelty 7.0

StemBind benchmark diagnoses MLLM failures in abstract visual reasoning by separating perception, rule induction, and answer selection on shared stems, finding a persistent rule-to-instance binding gap even when perception and rule are correct.

citing papers explorer

Showing 9 of 9 citing papers after filters.

Benchmarking and Improving GUI Agents in High-Dynamic Environments

cs.CV · 2026-04-28 · unverdicted · none · ref 34 · 2 links · internal anchor

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new DynamicGUIBench spanning ten applications.

SketchVLM: Vision language models can annotate images to explain thoughts and guide users

cs.CV · 2026-04-23 · unverdicted · none · ref 44 · internal anchor

SketchVLM lets VLMs generate non-destructive SVG annotations on input images to visually explain answers, raising visual reasoning accuracy by up to 28.5 points and annotation quality by 1.48x over baselines.

DetailVerifyBench: A Benchmark for Dense Hallucination Localization in Long Image Captions

cs.CV · 2026-04-07 · unverdicted · none · ref 42 · internal anchor

DetailVerifyBench supplies 1,000 images and densely annotated long captions to evaluate precise hallucination localization in multimodal large language models.

The Cartesian Shortcut: Re-evaluate Vision Reasoning in Polar Coordinate Space

cs.CV · 2026-05-11 · unverdicted · none · ref 30 · 2 links · internal anchor

Reformulating 53 visual reasoning tasks in polar coordinates causes frontier MLLMs to drop from 70-83% to 31-39% accuracy while preserving logical equivalence, revealing a Cartesian shortcut in current benchmarks.

LLaVA-UHD v4: What Makes Efficient Visual Encoding in MLLMs?

cs.CV · 2026-05-09 · unverdicted · none · ref 38 · internal anchor

LLaVA-UHD v4 reduces visual-encoding FLOPs by 55.8% for high-resolution images in MLLMs via slice-based encoding plus intra-ViT early compression while matching or exceeding baseline performance on document, OCR, and VQA benchmarks.

Let ViT Speak: Generative Language-Image Pre-training

cs.CV · 2026-05-01 · unverdicted · none · ref 63 · 2 links · internal anchor

GenLIP pretrains ViTs to generate language tokens from images via LM objective without contrastive batches or extra decoders, matching baselines on less data and improving on OCR after multi-resolution continued pretraining.

Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning

cs.CV · 2026-05-04 · unverdicted · none · ref 37 · internal anchor

A framework with similarity-based visual token compression, dynamic attention rebalancing, and explicit inductive-deductive chain-of-thought improves multimodal ICL performance across eight benchmarks for open-source VLMs.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · none · ref 34 · internal anchor

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · none · ref 21 · internal anchor

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
