KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.
Can language models solve olympiad programming?
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.
CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
AgentCrypt introduces a deterministic three-tier privacy framework for AI agent collaboration that uses masking and homomorphic encryption to protect data independently of model accuracy.
citing papers explorer
-
KernelBench: Can LLMs Write Efficient GPU Kernels?
KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.
-
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.
-
Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation
CoNL lets LLMs self-improve on non-verifiable tasks by rewarding critiques that produce better solutions in multi-agent conversations, jointly optimizing generation and judging without external feedback.
-
DeepSeek-Coder-V2: Breaking the Barrier of Closed-Source Models in Code Intelligence
An open-source MoE code model matches GPT-4 Turbo on coding and math benchmarks while expanding to 338 languages and 128K context length.
-
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
CodeGolf Bench is a dynamic benchmark for LLM concise code generation in 60 languages, showing reasoning models reach 70.97% average human percentile on Python and C++ tasks while non-reasoning models lag.
-
AgentCrypt: Advancing Privacy and (Secure) Computation in AI Agent Collaboration
AgentCrypt introduces a deterministic three-tier privacy framework for AI agent collaboration that uses masking and homomorphic encryption to protect data independently of model accuracy.