arXiv preprint arXiv:2311.08516 , year=

· 2023 · arXiv 2311.08516

5 Pith papers cite this work. Polarity classification is still indexing.

5 Pith papers citing it

read on arXiv browse 5 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

cs.SE · 2024-03-12 · unverdicted · novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection

cs.CL · 2026-04-07 · unverdicted · novelty 5.0

Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and revealing an alignment tax.

Gemma 3 Technical Report

cs.CL · 2025-03-25 · accept · novelty 4.0

Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.

LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

cs.CL · 2024-12-07 · accept · novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

cs.CL · 2026-05-19

citing papers explorer

Showing 5 of 5 citing papers.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code cs.SE · 2024-03-12 · unverdicted · none · ref 101
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
From Hallucination to Structure Snowballing: The Alignment Tax of Constrained Decoding in LLM Reflection cs.CL · 2026-04-07 · unverdicted · none · ref 8
Enforcing structured reflection via Outlines-based constrained decoding on an 8B LLM triggers structure snowballing instead of better self-correction, producing near-perfect syntax but persistent semantic errors and revealing an alignment tax.
Gemma 3 Technical Report cs.CL · 2025-03-25 · accept · none · ref 42
Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 231
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution cs.CL · 2026-05-19 · unreviewed · ref 52

arXiv preprint arXiv:2311.08516 , year=

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer