John D Lafferty, Andrew McCallum, and Fernando CN Pereira

Jean Kaddour · 2023 · arXiv 2304.08442

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

unclear 1

representative citing papers

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

LLMForge is a NAS framework with Infinite-Head Attention, a Forge-Former surrogate, and Forge-DSE engine that discovers hardware-specific architectures for edge language models, yielding variants with improved accuracy, energy, or latency on different substrates.

Faster Superword Tokenization

cs.CL · 2026-04-06 · accept · novelty 7.0

Frequency aggregation of supermerge candidates and a two-phase formulation make BoundlessBPE and SuperBPE training over 600x faster on 1GB data while preserving identical results, with open-source Python and Rust code.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

Applying muP allows Probabilistic Transformers to scale to 0.4B parameters with transferred hyperparameters and outperform standard transformers on MLM tasks under equal parameter budgets.

citing papers explorer

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

cs.LG · 2026-05-17 · unverdicted · none · ref 16

Faster Superword Tokenization

cs.CL · 2026-04-06 · accept · none · ref 7

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · none · ref 95

Scaling Probabilistic Transformer via Efficient Cross-Scale Hyperparameter Transfer

cs.CL · 2026-04-28 · unverdicted · none · ref 4
