hub Mixed citations

PIQA: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi · 2020 · DOI 10.1609/aaai.v34i05.6239

Mixed citation behavior. Most common role is background (60%).

27 Pith papers citing it

Background 60% of classified citations

open at publisher browse 27 citing papers

hub tools

JSON dossier citing papers JSON publisher DOI

citation-role summary

background 3 dataset 1 other 1

citation-polarity summary

background 3 unclear 1 use dataset 1

representative citing papers

MultiHashFormer: Hash-based Generative Language Models

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

MultiHashFormer enables hash-based autoregression in LMs by encoding tokens as multi-hash signatures, outperforming standard Transformers at 100M-3B scales while keeping parameter count constant for multilingual expansion.

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

ConQuR: Corner Aligned Activation Quantization via Optimized Rotations for LLMs

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

ConQuR is a post-training rotation calibration technique that aligns activations to hypercube corners via Procrustes optimization and online updates, delivering competitive LLM quantization performance without end-to-end training or offline activation storage.

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

ReasonXL: Shifting LLM Reasoning Language Without Sacrificing Performance

cs.CL · 2026-04-14 · unverdicted · novelty 7.0

A new parallel reasoning dataset enables LLMs to shift reasoning to non-English languages via SFT and RLVR while matching or exceeding baseline performance.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Aurora: A Leverage-Aware Spectral Optimizer

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Aurora is a leverage-aware spectral optimizer that enforces uniform row norms in matrix updates while preserving Muon's polar geometry, outperforming Muon and achieving SOTA among spectral methods on modded-nanoGPT.

ConSA: Controllable Sparsity in Hybrid Attention via Learnable Allocation

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

ConSA learns FA/SWA allocation via L0 masks and augmented Lagrangian constraints, outperforming rule-based baselines on 0.6B and 1.7B models with consistent layer patterns.

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

Models benchmarking as principal-agent game, derives welfare loss from welfare alignment, improvability and variance, and applies an audit framework to OLMES items.

Forecasting Downstream Performance of LLMs With Proxy Metrics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

Training Transformers for KV Cache Compressibility

cs.LG · 2026-05-07 · unverdicted · novelty 6.0 · 2 refs

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

TalkLoRA equips MoE-LoRA experts with a communication module that smooths routing dynamics and improves performance on language tasks under similar parameter budgets.

mHC: Manifold-Constrained Hyper-Connections

cs.CL · 2025-12-31 · unverdicted · novelty 6.0

mHC projects hyper-connection residual spaces onto a manifold to restore identity mapping, enabling stable large-scale training with performance gains over standard HC.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model

cs.CL · 2022-04-14 · accept · novelty 6.0

GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

q0: Primitives for Hyper-Epoch Pretraining

cs.LG · 2026-06-02 · unverdicted · novelty 5.0

q0 turns multi-epoch budgets into diverse model populations using three primitives that outperform single-model training and strong ensembles with fewer epochs on a 1.8B model.

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

MARR uses per-module adaptive residual scaling updated by PID feedback to balance error correction against Hessian-approximation bias in low-bit PTQ.

IO-SVD: Input-Output Whitened SVD for Adaptive-Rank LLM Compression

cs.LG · 2026-05-15 · unverdicted · novelty 5.0

IO-SVD performs SVD-based LLM compression by constructing a KL-aware double-sided whitening space and using first-order loss estimates for heterogeneous rank allocation.

InternLM2 Technical Report

cs.CL · 2024-03-26 · unverdicted · novelty 5.0

InternLM2 is a new open-source LLM that outperforms prior versions on 30 benchmarks and long-context tasks through scaled pre-training to 32k tokens and a conditional online RLHF alignment strategy.

citing papers explorer

Showing 1 of 1 citing paper after filters.

GPT-NeoX-20B: An Open-Source Autoregressive Language Model cs.CL · 2022-04-14 · accept · none · ref 13
GPT-NeoX-20B is a publicly released 20B parameter autoregressive language model trained on the Pile that shows strong gains in five-shot reasoning over similarly sized prior models.

PIQA: Reasoning about physical commonsense in natural language

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer