Title resolution pending

Lintang Sutawika, Hailey Schoelkopf, Leo Gao, Baber Abbasi, Stella Biderman, Jonathan Tow + 2 more · 2024 · Zenodo (CERN European Organization for Nuclear Research) · DOI 10.5281/zenodo.12608602

79 Pith papers cite this work, alongside 23 external citations. Polarity classification is still indexing.

23 external citations · Crossref

Title metadata for this work has not finished resolving. The hub is built from the citation graph, and the title resolver will retry the DOI and OpenAlex lookups on its next pass.

citation-role summary

background: 2 · dataset: 1 · other: 1

citation-polarity summary

background: 2 · unclear: 1 · use dataset: 1

representative citing papers

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Convergence of Gradient Descent for General Neural Network Architectures Beyond the NTK Regime

cs.LG · 2026-06-22 · unverdicted · novelty 7.0

Proves GD convergence to stationary point neighborhoods for general NN architectures beyond NTK via block-level analysis, analyticity, and local smoothness conditions.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · novelty 7.0

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · novelty 7.0 · 2 refs

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

cs.CL · 2026-06-05 · unverdicted · novelty 7.0

UrduMMLU is a new native-source MCQ benchmark for Urdu that reveals top LLMs reach only ~90% accuracy with large gaps on region-specific humanities content.

Learning What to Forget: Improving LLM Unlearning via Learned Token-Level Importance

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

ATWU jointly optimizes model parameters and token weights via a linear scorer on hidden states, recovering oracle forget-specific tokens under a separation condition and achieving SOTA forget-retain trade-offs on TOFU and RWKU.

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

cs.LG · 2026-06-02 · unverdicted · novelty 7.0

Analysis of 15 calibration sources shows opposite-sign Spearman correlations between perplexity and retention across General vs. Math/Code dimensions in LLM pruning, and multi-source mixing via IGSP raises total retention from 40-50% to 58.8%.

Multilingual Idioms in Sentences and Conversations Across High-, Medium-, and Low-Resource Languages

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

MIDI is a new multilingual idiom dataset with sentence and conversational contexts; benchmarking reveals worse performance in low-resource languages and on literal vs. figurative uses.

PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

cs.CR · 2026-05-22 · accept · novelty 7.0

PoisonForge benchmark shows that 1% poisoned examples achieve over 70% attack success rate on targeted tasks across 11 of 12 tested LLMs with under 0.5% leakage to non-target tasks.

Provable Joint Decontamination for Benchmarking Multiple Large Language Models

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

JECS aggregates per-model conformal p-values via their maximum and reconstructs a conservative envelope of the max-p null distribution to select benchmarks with global contamination rate control.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

cs.DC · 2026-05-13 · conditional · novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

AAAC learns two 64-byte codebooks per layer for 4-bit LLM weights and lets each group pick the one minimizing activation-weighted reconstruction error, storing the choice at zero extra cost.

RAG over Thinking Traces Can Improve Reasoning Tasks

cs.IR · 2026-05-05 · unverdicted · novelty 7.0

Retrieving structured thinking traces as a corpus improves reasoning performance on AIME, LiveCodeBench, and GPQA over standard RAG or no retrieval.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

cs.AR · 2026-03-30 · unverdicted · novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

cs.AI · 2026-02-19 · unverdicted · novelty 7.0

Conv-FinRe is a new benchmark built from real market data and human trajectories that tests LLMs on generating utility-grounded stock rankings over fixed horizons while distinguishing rational analysis from behavioral mimicry or momentum.

FinAuditing: A Financial Taxonomy-Structured Multi-Document Benchmark for Evaluating LLMs

cs.CL · 2025-10-10 · unverdicted · novelty 7.0

FinAuditing is a taxonomy-structured multi-document benchmark with 1,102 instances averaging over 33k tokens from XBRL filings, defining three tasks to evaluate LLMs on financial auditing capabilities.

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

cs.CL · 2025-05-27 · unverdicted · novelty 7.0

FinTagging decomposes XBRL tagging into FinNI extraction and FinCL full-taxonomy linking, showing LLMs handle extraction but struggle with fine-grained concept alignment in zero-shot settings.

SharQ: Bridging Activation Sparsity and FP4 Quantization for LLM Inference

cs.LG · 2026-06-25 · unverdicted · novelty 6.0

SharQ combines input-adaptive N:M sparsity and FP4 quantization via sparse backbone plus dense residual, recovering 43-63% of the NVFP4-to-FP16 accuracy gap on Llama and Qwen models without calibration or retraining.

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

Safety Reflection Pretraining adds regular safety reflections to pretraining data to integrate self-monitoring and reduce unsafe generalization from safe data in LLMs.

Demystifying Training-Time Augmentation for Data-Constrained Language Model Pretraining

cs.LG · 2026-06-15 · unverdicted · novelty 6.0

Training-time augmentations in token noise, permutation, and offset categories reduce overfitting and improve minimum validation loss in multi-epoch autoregressive pretraining on fixed corpora.

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

cs.CL · 2026-06-09 · conditional · novelty 6.0

CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

cs.LG · 2026-06-05 · unverdicted · novelty 6.0

MIR improves validation loss in repeated-data pretraining and SoftQ fits data-constrained scaling experiments better than additive laws, equating MIR gains to roughly 1.3 times more unique data.

citing papers explorer

Showing 7 of 7 citing papers after filters.

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

cs.LG · 2026-06-08 · conditional · none · ref 59

A GEMM-centric taxonomy and unified benchmark show static depth pruning as the strongest Pareto-optimal baseline for LLM inference acceleration, with the frontier shifting to dynamic depth then static width pruning as quality loss rises.

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

cs.DC · 2026-06-07 · conditional · none · ref 16 · 2 links

APEX4 co-designs pure INT4 GEMM kernels with ρ-aware granularity adaptation to deliver up to 2.09× end-to-end speedup on GPUs with low ρ while keeping LLaMA-2-70B perplexity within 0.63 of FP16.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

cs.DC · 2026-05-13 · conditional · none · ref 14

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.

Attention Amnesia in Hybrid LLMs: When CoT Fine-Tuning Breaks Long-Range Recall, and How to Fix It

cs.CL · 2026-06-09 · conditional · none · ref 110

CoT SFT disrupts long-range routing in hybrid models via changes to W_Q and W_K; QK-Restore restores pre-SFT projections to recover NIAH performance.

Rollout-Level Advantage-Prioritized Experience Replay for GRPO

cs.LG · 2026-06-03 · conditional · none · ref 16

Rollout-level advantage-prioritized experience replay for GRPO recycles high-advantage individual rollouts with age eviction and fresh-anchored batches to outperform standard GRPO on math benchmarks, with gains increasing with model size.

Do Transformers Need Three Projections? Systematic Study of QKV Variants

cs.LG · 2026-06-01 · conditional · none · ref 96

Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

cs.CL · 2026-05-13 · conditional · none · ref 4

OP-Mix is an on-policy data mixing method that uses low-rank adapter interpolation to find near-optimal data mixtures throughout language model training with reduced compute.
