Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
Pointer Sentinel Mixture Models
Mixed citation behavior. Most common role is background (56%).
abstract
Recent neural network sequence models with softmax classifiers have achieved their best language modeling performance only with very large hidden states and large vocabularies. Even then they struggle to predict rare or unseen words even if the context makes the prediction unambiguous. We introduce the pointer sentinel mixture architecture for neural sequence models which has the ability to either reproduce a word from the recent context or produce a word from a standard softmax classifier. Our pointer sentinel-LSTM model achieves state of the art language modeling performance on the Penn Treebank (70.9 perplexity) while using far fewer parameters than a standard softmax LSTM. In order to evaluate how well language models can exploit longer contexts and deal with more realistic vocabularies and larger corpora we also introduce the freely available WikiText corpus.
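The mixture described in the abstract can be illustrated with a small sketch. It assumes the formulation in which a sentinel score is appended to the pointer scores before a single softmax, so that the sentinel's probability mass acts as the gate on the vocabulary softmax. The function name, argument names, and toy numbers below are illustrative assumptions. In the actual model the scores come from an LSTM, which is omitted here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pointer_sentinel_mixture(vocab_logits, context_ids, pointer_scores, sentinel_score):
    """Mix a vocabulary softmax with a pointer over recent context words.

    vocab_logits:   one logit per vocabulary word (hypothetical model output)
    context_ids:    vocabulary ids of the recent context words
    pointer_scores: one attention score per context position
    sentinel_score: score of the sentinel, competing with the context positions
    """
    if len(context_ids) != len(pointer_scores):
        raise ValueError("need one pointer score per context position")

    p_vocab = softmax(vocab_logits)

    # One softmax over context positions plus the sentinel.
    joint = softmax(list(pointer_scores) + [sentinel_score])
    gate = joint[-1]  # mass assigned to the sentinel = weight on the vocabulary softmax

    # Vocabulary part, scaled by the gate.
    mixed = [gate * p for p in p_vocab]

    # Pointer part: add each position's mass to the word found at that position.
    for word_id, weight in zip(context_ids, joint[:-1]):
        mixed[word_id] += weight

    return mixed, gate


# Toy example: word 3 is unlikely under the softmax but appears twice in the context.
probs, gate = pointer_sentinel_mixture(
    vocab_logits=[1.0, 0.2, 0.3, -2.0, 0.0],
    context_ids=[3, 1, 3],
    pointer_scores=[2.0, 0.1, 1.5],
    sentinel_score=0.5,
)
print(max(range(len(probs)), key=probs.__getitem__))  # index of the most probable word
```

Because the gate and the pointer weights come from the same softmax, the mixed distribution sums to one without any extra normalization, and a word that recurs in the context accumulates mass from every position where it appears.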
representative citing papers
A safety direction estimated in a source LLM is transported to a target generator through lightweight alignment on benign data alone, matching native safety performance without any target-side unsafe data.
GPTQ-intrinsic LoRA augments GPTQ with intrinsic low-rank compensation via Hessian modification to achieve layer-wise reconstruction bounds that match information-theoretic lower bounds under structural assumptions.
Presents a solver-verifiable framework for Transformer circuits, with exhaustive checks on small symbolic tasks and surrogate methods for larger models.
HodgeCover isolates the harmonic kernel of a simplicial Laplacian on an expert 2-complex to identify irreducible merge cycles and selects experts for aggressive compression, matching or exceeding baselines on open-weight MoE models.
Cayley unitary adapters executed on real quantum hardware improve LLM perplexity by 1.4% on Llama 3.1 8B with 6000 parameters and recover 83% of compression-induced degradation on SmolLM2.
A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
TallyTrain is a hard-label distillation protocol for federated learning that uses argmax transmission and optional sparse merges to match soft-label performance at up to 1000x lower communication cost.
CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.
Tapered Language Models monotonically decrease MLP width across depth with a cosine schedule, yielding better perplexity and downstream performance than uniform-width baselines across multiple architectures and scales at no extra cost.
Introduces the first community-governed unified JSON schema and crowdsourced repository for AI evaluation results, with converters and a database spanning 22,235 models and 2,273 benchmarks.
LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.
STCC introduces a Semantic Token Codec that learns geometrically structured constellations aligning channel topology with semantic embedding spaces so noise produces topological rather than random errors.
The paper introduces Uni-E, a unified energy for DLMs that accounts for model capacity, dependency and invariance, can be computed exactly, and corrects distribution shifts from dependency and invariance.
A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.
APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.
STAR-KV applies differentiable soft thresholding for per-head and per-block adaptive low-rank KV cache compression, combined with hybrid decomposition and low-rank-aware quantization, achieving up to 75% compression and 3.1x throughput gains.
A geometric decomposition framework shows that affine transformations best recover prompt-induced task geometry and behavior in language and vision models across multiple datasets.
SubFit enables better LLM compression by fitting residual bypasses to non-contiguously selected submodules, outperforming layer-granularity baselines in accuracy-perplexity trade-offs at 12.5-37.5% sparsity.
A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
Successor representation training on natural language causes part-of-speech categories to emerge spontaneously in the learned embeddings, with structure varying by predictive horizon.
Uniform diffusion models rely on a leave-one-out denoiser rather than the usual denoising posterior, with exact conversions derived; an absorbing-state reformulation is introduced that matches or exceeds masked diffusion on language modeling while preserving the original joint distribution.
Probabilistic circuits have an output bottleneck with convex probability combinations and a context bottleneck limited to fixed vtree-aligned partitions, making them less expressive than transformers for language data with heterogeneous dependencies, though decomposable PCs are strictly more capable
citing papers explorer
- FAAST: Forward-Only Associative Learning via Closed-Form Fast Weights for Test-Time Supervised Adaptation
FAAST performs test-time supervised adaptation by analytically deriving fast weights from examples in one forward pass, matching backprop performance with over 90% less adaptation time and up to 95% memory savings versus memory-based methods.