Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, Lingming Zhang · 2023

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines and nearly matching expert averages on up to 20-task vision and 5-task language Merg

RewardBench 2: Advancing Reward Model Evaluation

cs.CL · 2025-06-02 · unverdicted · novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.

ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models

cs.CL · 2025-05-30 · conditional · novelty 6.0

Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

cs.CL · 2025-05-20 · unverdicted · novelty 5.0

InfiGFusion introduces graph-on-logits distillation with an O(n log n) Gromov-Wasserstein approximation to fuse LLMs by modeling token co-activations, reporting gains over baselines on 11 benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

Bayesian Model Merging cs.LG · 2026-05-13 · unverdicted · none · ref 57
Bayesian Model Merging introduces a bi-level optimization framework that merges task-specific models via closed-form Bayesian regression with an anchor prior and global hyperparameter search, outperforming baselines and nearly matching expert averages on up to 20-task vision and 5-task language Merg
RewardBench 2: Advancing Reward Model Evaluation cs.CL · 2025-06-02 · unverdicted · none · ref 49
RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training performance.
ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models cs.CL · 2025-05-30 · conditional · none · ref 32
Prolonged RL training with KL control and reference policy resetting enables LLMs to develop novel reasoning strategies inaccessible to base models even under extensive sampling.
InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion cs.CL · 2025-05-20 · unverdicted · none · ref 60
InfiGFusion introduces graph-on-logits distillation with an O(n log n) Gromov-Wasserstein approximation to fuse LLMs by modeling token co-activations, reporting gains over baselines on 11 benchmarks.

Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation

fields

years

verdicts

representative citing papers

citing papers explorer