
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova · 2019 · DOI 10.18653/v1/n19-1300

Baseline reference. 57% of citing Pith papers use this work as a benchmark or comparison.

42 Pith papers citing it

Baseline 57% of classified citations


citation-role summary

dataset: 4 · background: 3

citation-polarity summary

use dataset: 4 · background: 2 · unclear: 1

representative citing papers

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

EdgeFlowerTune: Evaluating Federated LLM Fine-Tuning Under Realistic Edge System Constraints

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

EdgeFlowerTune is a real-device benchmark that jointly assesses model quality and system costs for federated LLM fine-tuning on edge hardware using three protocols: Quality-under-Budget, Cost-to-Target, and Robustness.

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

SimDiff: Depth Pruning via Similarity and Difference

cs.AI · 2026-04-21 · unverdicted · novelty 7.0

SimDiff uses similarity and difference metrics to prune LLM layers more effectively than cosine similarity alone, retaining over 91% performance at 25% pruning on LLaMA2-7B.

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

cs.CL · 2024-05-07 · unverdicted · novelty 7.0

DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

cs.CL · 2021-10-04 · unverdicted · novelty 7.0

Proposes a textbook-based true/false QA task where PTLMs score ~50% closed-book even after pre-training on the text and ~60% open-book with retrieval.

The Power of Scale for Parameter-Efficient Prompt Tuning

cs.CL · 2021-04-18 · unverdicted · novelty 7.0

Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.

Truth or Sophistry? LoFa: A Benchmark for LLM Robustness Against Logical Fallacies

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

LoFa is a new benchmark and LFR@k metric for measuring LLM resistance to sustained logical fallacy attacks via generated question-argument pairs and debate simulations.

Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration

cs.LG · 2026-06-23 · unverdicted · novelty 6.0

Pruning attention layers in five LLMs across eight datasets maintains accuracy but degrades faithfulness and calibration.

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

TN-gram replaces per-order hash tables in n-gram memory modules with a CP tensor factorization that shares token-position factors and uses order-absorption vectors, achieving comparable or better performance with fewer parameters.

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Post-hoc model-based compression of reasoning traces cuts training tokens to 12-30% and speeds training 2-7.6x while retaining up to 96% of raw-trace accuracy, though raw traces remain superior at every scale.

Activation-Based Active Learning for In-Context Learning: Challenges and Insights

cs.CL · 2026-06-03 · unverdicted · novelty 6.0

MLP activations measured as massive activations or first four moments correlate weakly (max |Spearman| = 0.33) with in-context example quality across Llama-3.2-3B, Qwen2.5-3B, and multiple classification/generative tasks, so activation-based active learning should not be used for ICL.

Eigenvectors of Experts are Training-free Non-collapsing Routers

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

SSMoE uses eigenvectors of expert weights via SVD to build training-free non-collapsing routers for SMoE models in language and vision tasks.

Forecasting Downstream Performance of LLMs With Proxy Metrics

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.

GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.

Representation-Guided Parameter-Efficient LLM Unlearning

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

cs.LG · 2026-04-07 · unverdicted · novelty 6.0

TalkLoRA equips MoE-LoRA experts with a communication module that smooths routing dynamics and improves performance on language tasks under similar parameter budgets.

PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

cs.CL · 2025-11-26 · unverdicted · novelty 6.0

PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and training memory.

ScaLoRA: Optimally Scaled Low-Rank Adaptation for Efficient High-Rank Fine-Tuning

cs.LG · 2025-10-27 · unverdicted · novelty 6.0

ScaLoRA analytically derives per-update column scalings that let low-rank increments accumulate into high-rank weight updates, yielding faster convergence and higher accuracy than prior LoRA variants on LLMs up to 12B parameters.

Does Math Reasoning Improve General LLM Capabilities? Understanding Transferability of LLM Reasoning

cs.AI · 2025-07-01 · conditional · novelty 6.0

Math reasoning gains in LLMs rarely transfer to general domains; RL tuning generalizes while SFT causes forgetting and representation drift.

Titans: Learning to Memorize at Test Time

cs.LG · 2024-12-31 · unverdicted · novelty 6.0

Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

citing papers explorer

Showing 12 of 12 citing papers after filters.

Layer Collapse in Diffusion Language Models

cs.LG · 2026-05-07 · unverdicted · none · ref 5 · 2 links

Diffusion language models develop early-layer collapse around an indispensable super-outlier due to overtraining, resulting in higher compressibility and reversed optimal sparsity patterns versus autoregressive models.

Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences

cs.LG · 2026-04-22 · unverdicted · none · ref 70

Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.

Don't Go Breaking My LLM: The Impact of Pruning Attention Layers on Explanation Faithfulness and Confidence Calibration

cs.LG · 2026-06-23 · unverdicted · none · ref 8

Pruning attention layers in five LLMs across eight datasets maintains accuracy but degrades faithfulness and calibration.

Compress-Distill: Reasoning Trace Compression for Efficient Knowledge Distillation

cs.LG · 2026-06-04 · unverdicted · none · ref 6

Post-hoc model-based compression of reasoning traces cuts training tokens to 12-30% and speeds training 2-7.6x while retaining up to 96% of raw-trace accuracy, though raw traces remain superior at every scale.

Eigenvectors of Experts are Training-free Non-collapsing Routers

cs.LG · 2026-05-29 · unverdicted · none · ref 1

SSMoE uses eigenvectors of expert weights via SVD to build training-free non-collapsing routers for SMoE models in language and vision tasks.

TalkLoRA: Communication-Aware Mixture of Low-Rank Adaptation for Large Language Models

cs.LG · 2026-04-07 · unverdicted · none · ref 5

TalkLoRA equips MoE-LoRA experts with a communication module that smooths routing dynamics and improves performance on language tasks under similar parameter budgets.

SMoA: Spectrum Modulation Adapter for Parameter-Efficient Fine-Tuning

cs.LG · 2026-05-20 · unverdicted · none · ref 5

SMoA is a new PEFT adapter that uses block-wise Hadamard-modulated low-rank branches on spectral partitions to cover more pretrained spectral directions than standard LoRA under a smaller parameter budget.

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

cs.LG · 2026-05-18 · unverdicted · none · ref 7

MARR uses per-module adaptive residual scaling updated by PID feedback to balance error correction against Hessian-approximation bias in low-bit PTQ.

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

cs.LG · 2026-05-14 · unverdicted · none · ref 60

Task-aware pruning improves OOD model performance by realigning distorted OOD layerwise norm and pairwise-distance profiles with the task-adapted geometry observed on ID inputs.

Reproducibility Study of "AlphaEdit: Null-Space Constrained Knowledge Editing for Language Models"

cs.LG · 2026-06-25 · accept · none · ref 3

Reproducibility study confirms AlphaEdit on original setups but finds performance degrades at high edit counts, fails to generalize to newer models, and harms downstream tasks.

When Good Enough Is Optimal: Multiplication-Only Matrix Inversion Approximation for Quantized Gated DeltaNet

cs.LG · 2026-06-04 · unverdicted · none · ref 39

A multiplication-only truncated Neumann approximation for matrix inversion in quantized Gated DeltaNet linear attention delivers up to 5x kernel speedup and 20% decode overhead reduction while preserving accuracy on Qwen3.5 models.

Learning in the Fisher Subspace: A Guided Initialization for LoRA Fine-Tuning

cs.LG · 2026-05-01 · unverdicted · none · ref 22

Fisher information from the target data distribution supplies a task-dependent criterion for selecting LoRA directions that outperforms weight-magnitude heuristics.
