hub

Cerebras-gpt: Open compute- optimal language models trained on the cerebras wafer- scale cluster

Dey, N · 2023 · arXiv 2304.03208

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

GQA-{\mu}P: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.

Spectral Condition for $\mu$P under Width-Depth Scaling

cs.LG · 2026-02-28 · unverdicted · novelty 6.0

A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

cs.CL · 2024-04-09 · conditional · novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

cs.CL · 2023-06-01 · unverdicted · novelty 6.0

Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

cs.CV · 2026-04-06 · unverdicted · novelty 5.0

A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolLM models.

SpaDA: A Spatial Dataflow Architecture Programming Language

cs.DC · 2025-11-12 · unverdicted · novelty 5.0

SpaDA provides a concise language and multi-level compiler for spatial dataflow hardware that integrates with stencil DSLs and delivers substantial code reduction and high performance on wafer-scale engines.

cs.AI · 2025-09-21 · unverdicted · novelty 5.0

Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.

Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models

cs.LG · 2026-05-14 · unverdicted · novelty 3.0

Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

cs.CL · 2025-01-03 · unverdicted · novelty 2.0

A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

citing papers explorer

Showing 12 of 12 citing papers.

Learning the Signature of Memorization in Autoregressive Language Models cs.CL · 2026-04-03 · accept · none · ref 6
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
GQA-{\mu}P: The maximal parameterization update for grouped query attention cs.LG · 2026-05-14 · unverdicted · none · ref 4
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size cs.CL · 2026-04-14 · unverdicted · none · ref 4
Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
Spectral Condition for $\mu$P under Width-Depth Scaling cs.LG · 2026-02-28 · unverdicted · none · ref 8
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 12
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 273
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only cs.CL · 2023-06-01 · unverdicted · none · ref 20
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs cs.CV · 2026-04-06 · unverdicted · none · ref 7
A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolLM models.
SpaDA: A Spatial Dataflow Architecture Programming Language cs.DC · 2025-11-12 · unverdicted · none · ref 7
SpaDA provides a concise language and multi-level compiler for spatial dataflow hardware that integrates with stencil DSLs and delivers substantial code reduction and high performance on wafer-scale engines.
Similarity Field Theory: A Mathematical Framework for Intelligence cs.AI · 2025-09-21 · unverdicted · none · ref 14
Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.
Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models cs.LG · 2026-05-14 · unverdicted · none · ref 30
Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026) cs.CL · 2025-01-03 · unverdicted · none · ref 27
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

Cerebras-gpt: Open compute- optimal language models trained on the cerebras wafer- scale cluster

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer