arXiv preprint arXiv:2405.16281
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
fields: cs.CL 3
roles: background 1
polarities: unclear 1
representative citing papers
A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
citing papers explorer
- Data Contamination in Neural Hieroglyphic Translation: A Reproducibility Study
  Reproduction of hieroglyphic translation finds test-set contamination inflating BLEU from 37.0 to 61.5, with corrected clean baselines of 30.9-39.2.
- Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
  A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.
- MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
  MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.