Title resolution pending

Dey, Nolan, Gosal, Gurpreet, Chen, Zhiming, Khachane, Hemant, Marshall, William, Pathria, Ribhu · 2023 · arXiv 2304.03208

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

read on arXiv browse 17 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Learning the Signature of Memorization in Autoregressive Language Models

cs.CL · 2026-04-03 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Vendi Score and scaling-law objectives belong to the class of matrix spectral functions, which are submodular, enabling efficient greedy selection of training data that outperforms random subsets in predicting held-out performance.

GQA-{\mu}P: The maximal parameterization update for grouped query attention

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.

Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.

Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.

Validity Threats for Foundation Model Research

cs.LG · 2026-06-03 · accept · novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

Spectral Condition for $\mu$P under Width-Depth Scaling

cs.LG · 2026-02-28 · unverdicted · novelty 6.0

A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies

cs.CL · 2024-04-09 · conditional · novelty 6.0

MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

cs.CL · 2023-06-01 · unverdicted · novelty 6.0

Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

cs.CV · 2026-06-22 · unverdicted · novelty 5.0

SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

cs.CV · 2026-04-06 · unverdicted · novelty 5.0

A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolLM models.

SpaDA: A Spatial Dataflow Architecture Programming Language

cs.DC · 2025-11-12 · unverdicted · novelty 5.0

SpaDA provides a concise language and multi-level compiler for spatial dataflow hardware that integrates with stencil DSLs and delivers substantial code reduction and high performance on wafer-scale engines.

cs.AI · 2025-09-21 · unverdicted · novelty 5.0

Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.

Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models

cs.LG · 2026-05-14 · unverdicted · novelty 3.0

Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.

Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026)

cs.CL · 2025-01-03 · unverdicted · novelty 2.0

A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.

Statistical Properties of Training & Generalization

stat.ML · 2026-06-18 · unverdicted · novelty 1.0 · 2 refs

Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.

citing papers explorer

Showing 17 of 17 citing papers.

Learning the Signature of Memorization in Autoregressive Language Models cs.CL · 2026-04-03 · accept · none · ref 6
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions cs.LG · 2026-05-28 · unverdicted · none · ref 20
Vendi Score and scaling-law objectives belong to the class of matrix spectral functions, which are submodular, enabling efficient greedy selection of training data that outperforms random subsets in predicting held-out performance.
GQA-{\mu}P: The maximal parameterization update for grouped query attention cs.LG · 2026-05-14 · unverdicted · none · ref 4
Derives μP scalings for GQA via promoted spectral-norm definition of feature learning and a modified norm preserving scaling laws for non-full-rank matrices, with experiments showing learning-rate transfer.
Better and Worse with Scale: How Contextual Entrainment Diverges with Model Size cs.CL · 2026-04-14 · unverdicted · none · ref 4
Contextual entrainment decreases for semantic contexts but increases for non-semantic ones as LLMs scale, following power-law trends with 4x better resistance to misinformation but 2x more copying of arbitrary tokens.
Predictable Scaling Laws of Optimal Hyperparameters for LLM Continued Pre-training cs.CL · 2026-06-04 · unverdicted · none · ref 30
Optimal hyperparameters for LLM continued pre-training follow predictable scaling laws derived from proxy models, enabling a two-stage framework that predicts settings from compute budget and checkpoint state to reduce search overhead by 90%.
Validity Threats for Foundation Model Research cs.LG · 2026-06-03 · accept · none · ref 22
Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.
Spectral Condition for $\mu$P under Width-Depth Scaling cs.LG · 2026-02-28 · unverdicted · none · ref 8
A unified spectral condition for μP under width-depth scaling reveals a transition at k=1 vs k≥2 transformations per residual block and enables stable feature learning for practical architectures like Transformers.
MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies cs.CL · 2024-04-09 · conditional · none · ref 12
MiniCPM 1.2B and 2.4B models reach parity with 7B-13B LLMs via model wind-tunnel scaling and a WSD scheduler that yields a higher optimal data-to-model ratio than Chinchilla scaling.
The Falcon Series of Open Language Models cs.CL · 2023-11-28 · conditional · none · ref 273
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only cs.CL · 2023-06-01 · unverdicted · none · ref 20
Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.
SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning cs.CV · 2026-06-22 · unverdicted · none · ref 5
SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.
Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs cs.CV · 2026-04-06 · unverdicted · none · ref 7
A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolLM models.
SpaDA: A Spatial Dataflow Architecture Programming Language cs.DC · 2025-11-12 · unverdicted · none · ref 7
SpaDA provides a concise language and multi-level compiler for spatial dataflow hardware that integrates with stencil DSLs and delivers substantial code reduction and high performance on wafer-scale engines.
Similarity Field Theory: A Mathematical Framework for Intelligence cs.AI · 2025-09-21 · unverdicted · none · ref 14
Similarity Field Theory defines a similarity field over entities, concepts as superlevel-set fibers, and intelligence as a generative operator that preserves fiber membership under evolution.
Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models cs.LG · 2026-05-14 · unverdicted · none · ref 30
Empirical tests on 118 transformers show success falling from 88.1% at 512 tokens to 0% at 2048 tokens, with compressed models achieving 649.2 tokens/sec/M parameters versus 12.5 for large generative ones.
Small Language Models (SLMs) Can Still Pack a Punch: A survey (updated 2026) cs.CL · 2025-01-03 · unverdicted · none · ref 27
A literature survey of Small Language Models (1-8B parameters) that can perform comparably or better than larger models, covering general-purpose and task-specific approaches plus creation techniques.
Statistical Properties of Training & Generalization stat.ML · 2026-06-18 · unverdicted · none · ref 45 · 2 links
Review of neural scaling laws and their relation to constraints and inductive biases when applying machine learning to physics problems.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer