pith. sign in

hub

SWE-bench Multimodal: Do AI systems generalize to visual software do- mains?

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

hub tools

citation-role summary

background 3 baseline 1

citation-polarity summary

years

2026 7 2025 7

representative citing papers

KernelBench: Can LLMs Write Efficient GPU Kernels?

cs.LG · 2025-02-14 · accept · novelty 7.0

KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.

Open-World Evaluations for Measuring Frontier AI Capabilities

cs.AI · 2026-05-19 · conditional · novelty 6.0

Open-world evaluations using qualitative review of real-world tasks can give earlier warnings of frontier AI capabilities than automated benchmarks, as demonstrated by an AI agent publishing a simple iOS app with one minor human fix.

VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments

cs.AI · 2025-06-03 · unverdicted · novelty 6.0

VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of

AlphaEval: Evaluating Agents in Production

cs.CL · 2026-04-14 · unverdicted · novelty 5.0

AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

citing papers explorer

Showing 14 of 14 citing papers.