Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
Advances in Neural Information Processing Systems , volume=
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.
Injecting 1% synthetic data targeting specific constructions during pre-training of GPT-2 Small boosts performance on 8 of 9 weakest BLiMP paradigms (e.g., only_npi_scope from 20.9% to 69.4%), while aggregate performance holds or improves, with one resistant case.
citing papers explorer
-
Base Models Look Human To AI Detectors
Base model text evades AI detectors better than instruction-tuned text, and the HIP method strengthens this trade-off across model sizes.
-
Prescriptive Scaling Laws for Data Constrained Training
A one-parameter scaling law models excess loss from data repetition as an additive overfitting penalty, recommending model capacity increases over excessive repetition and showing that strong weight decay reduces the penalty coefficient by ~70%.
-
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.
-
When and Why Grouping Attention Heads Accelerates Muon Optimization
Grouping attention heads in Muon creates a trade-off between whitening gains and norm costs that, when tuned, improves training loss over full or per-head Muon on GPT-2.
-
Heterogeneity in Formal Linguistic Competence of Language Models: Is Data the Real Bottleneck?
Injecting 1% synthetic data targeting specific constructions during pre-training of GPT-2 Small boosts performance on 8 of 9 weakest BLiMP paradigms (e.g., only_npi_scope from 20.9% to 69.4%), while aggregate performance holds or improves, with one resistant case.