Johannes Welbl, Nelson F. Liu, and Matt Gardner · 2017 · DOI 10.18653/v1/w17-4413

22 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

MultiHashFormer: Hash-based Generative Language Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.

Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

AuthorityBench shows that the presence of a citation (real or fabricated) increases LLM hallucination rates relative to a no-citation baseline, most strongly for fabricated citations on true claims, with variation across domains but negligible venue or author effects.

Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

SemGrad measures LLM uncertainty via gradients in semantic space using a Semantic Preservation Score to select embeddings, with HybridGrad combining it with parameter gradients to outperform sampling-based baselines especially when multiple responses are valid.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning

cs.CL · 2026-04-21 · unverdicted · novelty 7.0

Answer tokens show forward drift and key-anchor focus when reading correct reasoning traces; a geometric-plus-semantic SRQ steering method boosts quantitative reasoning accuracy without training.

RWKV: Reinventing RNNs for the Transformer Era

cs.CL · 2023-05-22 · unverdicted · novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

The State-Prediction Separation Hypothesis

cs.CL · 2026-07-01 · unverdicted · novelty 6.0

A two-stream Transformer variant that separates state storage from next-token prediction improves validation loss and downstream task performance by 2-3 points over standard Transformers.

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

cs.LG · 2026-06-10 · unverdicted · novelty 6.0

Manifold Power Iteration aligns MoE router rows with principal singular directions of experts via a power-then-retract process, with theory showing convergence and experiments on 1B-11B models showing gains.

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.

Scaling Laws for Mixture Pretraining Under Data Constraints

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Empirical study shows mixture pretraining tolerates higher target data repetition than single-source training, with a new repetition-aware scaling law enabling principled mixture selection based on data size, compute, and model scale.

UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models

cs.CL · 2025-02-20 · unverdicted · novelty 6.0

Adapts multi-layer token-level Mahalanobis distance with supervised linear regression to yield improved uncertainty scores for LLM truthfulness tasks.

Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models

cs.CL · 2024-08-20 · unverdicted · novelty 6.0

A regression model using attention features and recurrent uncertainty scores improves selective generation in LLMs over unsupervised and supervised baselines on ten datasets and three models.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

cs.CL · 2022-04-14 · accept · novelty 6.0

GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

From Signals to Transfer: A Factorised Study of Probe-Based Uncertainty Estimation in Large Language Models

cs.CL · 2026-06-26 · conditional · novelty 5.0

A factorized study finds raw hidden states and attention features hard to beat in-domain for LLM uncertainty probes, but structured compressed features are more robust under distribution shift, with pretrained probes transferring to open-ended generation.

GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution

cs.LG · 2026-06-05 · unverdicted · novelty 5.0

GRASP is a scalable method for subset-level data attribution in pretraining that models interactions via a geometry-aware quadratic penalty and claims to double rank correlation while cutting costs.

q0: Primitives for Hyper-Epoch Pretraining

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

q0 turns multi-epoch budgets into diverse model populations using three primitives that outperform single-model training and strong ensembles with fewer epochs on a 1.8B model.

Learning Uncertainty from Sequential Internal Dispersion in Large Language Models

cs.CL · 2026-04-17 · unverdicted · novelty 5.0

SIVR detects LLM hallucinations by learning from token-wise and layer-wise variance patterns in internal hidden states, outperforming baselines with better generalization and less training data.

(How) Learning Rates Regulate Catastrophic Overtraining

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.

Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked?

cs.CL · 2025-07-21 · unverdicted · novelty 4.0

LLM accuracy on reasoning tasks differs significantly by question type, and step-by-step reasoning accuracy is often uncorrelated with final answer selection.

citing papers explorer

Showing 19 of 19 citing papers after filters.

MultiHashFormer: Hash-based Generative Language Models · cs.CL · 2026-06-26 · unverdicted · none · ref 51
Authority, Truth, and Citation Bias: A Large-Scale Multi-Domain Benchmark for Studying Epistemic Susceptibility in Large Language Models · cs.LG · 2026-06-11 · unverdicted · none · ref 18
Gradients with Respect to Semantics Preserving Embeddings Tell the Uncertainty of Large Language Models · cs.CL · 2026-05-06 · unverdicted · none · ref 20
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences · cs.LG · 2026-04-22 · unverdicted · none · ref 81
How Do Answer Tokens Read Reasoning Traces? Self-Reading Patterns in Thinking LLMs for Quantitative Reasoning · cs.CL · 2026-04-21 · unverdicted · none · ref 22
RWKV: Reinventing RNNs for the Transformer Era · cs.CL · 2023-05-22 · unverdicted · none · ref 8
The State-Prediction Separation Hypothesis · cs.CL · 2026-07-01 · unverdicted · none · ref 42
Redesign Mixture-of-Experts Routers with Manifold Power Iteration · cs.LG · 2026-06-10 · unverdicted · none · ref 49
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions · cs.CL · 2026-05-22 · unverdicted · none · ref 38
Scaling Laws for Mixture Pretraining Under Data Constraints · cs.LG · 2026-05-12 · unverdicted · none · ref 50
UniPool: A Globally Shared Expert Pool for Mixture-of-Experts · cs.LG · 2026-05-07 · unverdicted · none · ref 53
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models · cs.AI · 2026-04-21 · unverdicted · none · ref 65
Token-Level Density-Based Uncertainty Quantification Methods for Eliciting Truthfulness of Large Language Models · cs.CL · 2025-02-20 · unverdicted · none · ref 60
Unconditional Truthfulness: Learning Unconditional Uncertainty of Large Language Models · cs.CL · 2024-08-20 · unverdicted · none · ref 42
GRASP: Geometry-aware Residual Alignment for Scalable Pretraining Data Attribution · cs.LG · 2026-06-05 · unverdicted · none · ref 29
q0: Primitives for Hyper-Epoch Pretraining · cs.LG · 2026-06-02 · unverdicted · none · ref 30
Learning Uncertainty from Sequential Internal Dispersion in Large Language Models · cs.CL · 2026-04-17 · unverdicted · none · ref 55
(How) Learning Rates Regulate Catastrophic Overtraining · cs.LG · 2026-04-15 · unverdicted · none · ref 21
Is Large Language Model Performance on Reasoning Tasks Impacted by Different Ways Questions Are Asked? · cs.CL · 2025-07-21 · unverdicted · none · ref 36
