arxiv: 1804.07461 · v3 · submitted 2018-04-20 · 💻 cs.CL

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Pith reviewed 2026-05-12 21:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords GLUE · natural language understanding · benchmark · multi-task learning · transfer learning · NLU evaluation · diagnostic analysis

The pith

GLUE supplies a benchmark of nine NLU tasks plus diagnostics to test models for general rather than task-specific language understanding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GLUE as a single evaluation platform that aggregates results across existing natural language understanding tasks to measure how well models handle language in a general way. Current models often excel only when trained separately on one task at a time, so the benchmark includes several tasks with very small training sets to reward approaches that share knowledge across problems. It also supplies a separate hand-crafted diagnostic suite that breaks down model errors by specific linguistic features such as coreference or negation. Baseline experiments with multi-task and transfer methods show they produce little gain over training one model per task, which points to the need for new techniques that truly generalize.

Core claim

GLUE is a model-agnostic collection of nine NLU tasks plus a diagnostic test suite that jointly measure whether a system exhibits broad language understanding. Current multi-task baselines fail to improve substantially on the aggregate score obtained by training a separate model per task.

What carries the argument

The GLUE benchmark itself, which combines performance scores from nine tasks, some of which have very limited training data, with a hand-crafted diagnostic test suite for linguistic analysis.
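As a rough illustration of how per-task results can be folded into one number, the sketch below takes an unweighted macro-average over tasks, first averaging within any task that reports more than one metric. The function name `aggregate_score`, the averaging rule, and the example numbers are assumptions made for illustration; this page does not state the exact aggregation formula.

```python
from statistics import mean


def aggregate_score(task_metrics):
    """Macro-average: mean over tasks of the mean of each task's metrics.

    task_metrics maps a task name to a dict of {metric_name: score}.
    The equal weighting of tasks is an assumption of this sketch.
    """
    if not task_metrics:
        raise ValueError("need at least one task")
    per_task = [mean(list(metrics.values())) for metrics in task_metrics.values()]
    return mean(per_task)


# Made-up scores on a 0-100 scale, for illustration only.
example = {
    "CoLA": {"matthews_corr": 30.0},
    "MRPC": {"accuracy": 80.0, "f1": 84.0},
    "RTE": {"accuracy": 58.0},
}
print(aggregate_score(example))
```

Averaging within a task before averaging across tasks keeps a task with two metrics from counting twice in the total.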

If this is right

  • A single aggregate score can rank models on their overall language understanding ability.
  • Training regimes that move knowledge between tasks become directly measurable and rewarded.
  • The diagnostic suite can isolate which linguistic phenomena still cause models to fail.
  • Further progress requires methods that go beyond simple multi-task fine-tuning.
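The idea of isolating failures by linguistic phenomenon can be sketched as a per-tag breakdown of predictions. This is a minimal sketch under stated assumptions: the tuple layout, the tag names, and the use of plain accuracy are hypothetical choices for illustration, not the diagnostic suite's actual data format or metric.

```python
from collections import defaultdict


def accuracy_by_phenomenon(examples):
    """Return {tag: accuracy} for examples given as (tags, gold, predicted).

    An example carrying several tags counts once toward each of them.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for tags, gold, predicted in examples:
        for tag in tags:
            total[tag] += 1
            correct[tag] += int(gold == predicted)
    return {tag: correct[tag] / total[tag] for tag in total}


# Hypothetical diagnostic items: (phenomenon tags, gold label, model prediction).
diagnostic_items = [
    (["negation"], "contradiction", "entailment"),
    (["negation", "coreference"], "entailment", "entailment"),
    (["coreference"], "neutral", "neutral"),
]
print(accuracy_by_phenomenon(diagnostic_items))
```

A low score on one tag next to a high overall score is the kind of signal such a breakdown is meant to surface.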

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • GLUE could serve as a stable reference point for comparing new NLU systems over time.
  • Adding tasks that probe longer-range reasoning or world knowledge would test whether current high scores reflect deeper understanding.
  • If GLUE scores predict success on downstream applications, the benchmark could guide practical model selection.

Load-bearing premise

The nine chosen tasks are diverse enough to stand in for general language understanding rather than measuring narrow skills.

What would settle it

A model that scores high on the full GLUE suite but collapses on new tasks that require the same linguistic skills in fresh combinations would show the benchmark does not capture generality.

Original abstract

For natural language understanding (NLU) technology to be maximally useful, both practically and as a scientific object of study, it must be general: it must be able to process language in a way that is not exclusively tailored to any one specific task or dataset. In pursuit of this objective, we introduce the General Language Understanding Evaluation benchmark (GLUE), a tool for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks. GLUE is model-agnostic, but it incentivizes sharing knowledge across tasks because certain tasks have very limited training data. We further provide a hand-crafted diagnostic test suite that enables detailed linguistic analysis of NLU models. We evaluate baselines based on current methods for multi-task and transfer learning and find that they do not immediately give substantial improvements over the aggregate performance of training a separate model per task, indicating room for improvement in developing general and robust NLU systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the GLUE benchmark as a collection of nine existing NLU tasks (MNLI, QQP, SST-2, CoLA, STS-B, MRPC, RTE, WNLI, QNLI) chosen for diversity in type and data size, along with a hand-crafted diagnostic test suite for linguistic analysis. It evaluates single-task, multi-task, and transfer-learning baselines and reports that the latter approaches do not yield substantial aggregate improvements over per-task training, suggesting room for better general NLU methods.

Significance. If the task collection is representative and the baseline comparisons are reproducible, the work supplies a standardized, model-agnostic platform that directly incentivizes cross-task knowledge sharing and has already become a de-facto evaluation standard. The explicit release of the benchmark, code, and diagnostic suite constitutes a concrete reproducibility strength that supports community-wide adoption and iterative improvement.

major comments (3)
  1. [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.
  2. [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.
  3. [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'
minor comments (2)
  1. [§5] §5 (diagnostic suite): a few concrete example items for each linguistic phenomenon would improve clarity and allow readers to assess the suite's coverage without consulting external resources.
  2. [Table 1] Table 1 and §3: the WNLI task description should explicitly note its known label-distribution artifacts, as these affect interpretation of model performance on that sub-task.
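One way to read the first major comment's request for inter-task correlations is to ask whether tasks rank a set of models similarly. The sketch below computes pairwise Pearson correlations between per-task score vectors, one entry per model. The interpretation, the helper names `pearson` and `task_correlations`, and all numbers are assumptions for illustration; the referee report does not prescribe a specific procedure.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def task_correlations(scores_by_task):
    """Correlate every pair of tasks; each list holds one score per model."""
    tasks = sorted(scores_by_task)
    return {(a, b): pearson(scores_by_task[a], scores_by_task[b])
            for a, b in combinations(tasks, 2)}


# Made-up scores of four hypothetical models on three tasks.
scores = {
    "MNLI": [60.0, 65.0, 70.0, 72.0],
    "QNLI": [70.0, 75.0, 80.0, 82.0],
    "WNLI": [56.0, 65.0, 52.0, 60.0],
}
for pair, r in task_correlations(scores).items():
    print(pair, round(r, 3))
```

Pairs with correlations near 1 would suggest two tasks reward much the same capability, while weak or negative correlations would support the claim that they measure different things.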

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading of our manuscript and their recommendation for minor revision. We address each major comment below, indicating where we will make revisions to the paper.

Point-by-point responses
  1. Referee: [§3] §3 (task selection): the claim that the nine tasks measure 'general' rather than task-specific capabilities rests on qualitative assertions of diversity; no quantitative analysis (e.g., inter-task error correlations, shared artifact statistics, or phenomenon-coverage matrix) is provided to demonstrate independence, which is load-bearing for the central motivation of the benchmark.

    Authors: We agree that the task selection in §3 relies on qualitative arguments regarding the diversity of the tasks in terms of format, size, and the phenomena they test. While this diversity is detailed in the paper and supported by the diagnostic suite, we acknowledge the benefit of quantitative evidence. In the revised manuscript, we will add an analysis of inter-task error correlations computed from our baseline models to provide quantitative support for the tasks measuring somewhat independent capabilities. revision: yes

  2. Referee: [§4] §4 (baselines): the multi-task and transfer-learning setups omit precise specifications of task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds. Without these details the reported finding that multi-task training yields only marginal GLUE-score gains cannot be independently verified or reproduced.

    Authors: We thank the referee for pointing this out. The original manuscript and accompanying code release aimed to provide sufficient details, but we agree that explicit specifications are needed for full reproducibility. We will revise §4 to include the precise task-sampling ratios, loss-weighting scheme, hyper-parameter search protocol, and random seeds used in our experiments. revision: yes

  3. Referee: [§6] §6 (experiments): no statistical significance tests or variance estimates across runs are reported for the single-task versus multi-task comparisons; this weakens the conclusion that current methods 'do not immediately give substantial improvements.'

    Authors: We agree that including variance estimates and significance tests would strengthen the experimental claims. At the time of the original submission, we reported results from single runs due to computational constraints. For the revised version, we will re-run the main single-task and multi-task experiments with multiple random seeds to report means and standard deviations, and include statistical comparisons where appropriate. revision: yes
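The variance reporting and statistical comparison discussed in the third response could look roughly like the sketch below: a mean and sample standard deviation per setup, plus an exact paired sign-flip (permutation) test on per-seed differences. The choice of test, the pairing of runs by seed, and all numbers are assumptions made for illustration; the rebuttal does not say which test would be used.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores):
    """Mean and sample standard deviation across random seeds."""
    return mean(scores), stdev(scores)


def paired_sign_flip_p_value(a, b):
    """Exact two-sided sign-flip test on the summed paired differences.

    Enumerates all 2**n sign assignments, so it only suits small n.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Made-up aggregate scores for five seeds, paired by seed.
single_task = [68.0, 68.5, 67.9, 68.2, 68.4]
multi_task = [68.9, 68.4, 68.6, 68.8, 68.3]
print(summarize(single_task), summarize(multi_task))
print(paired_sign_flip_p_value(multi_task, single_task))
```

With only a handful of seeds the smallest attainable p-value is 2 / 2**n, so a test like this mainly guards against reading a small gap as a real improvement.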

Circularity Check

0 steps flagged

No circularity: GLUE is a definitional benchmark without derivations or self-referential reductions

Full rationale

The paper introduces GLUE by selecting and aggregating nine existing NLU datasets (MNLI, QQP, etc.) and adding a diagnostic suite. No equations, fitted parameters, predictions, or uniqueness theorems appear. The claim that the collection measures 'general' NLU rests on an explicit assumption of task diversity rather than any derivation that reduces to its own inputs or prior self-citations. This is a resource paper whose central contribution is definitional and externally evaluable; no load-bearing step collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that the selected tasks collectively probe general NLU without introducing new free parameters, axioms beyond standard task definitions, or invented entities.

axioms (1)
  • Domain assumption: The selected NLU tasks are representative of general language understanding capabilities.
    Invoked in the motivation for combining tasks to incentivize sharing knowledge across limited-data settings.

pith-pipeline@v0.9.0 · 5467 in / 1077 out tokens · 41467 ms · 2026-05-12T21:18:30.925461+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. BEHAVIOR-1K: A Human-Centered, Embodied AI Benchmark with 1,000 Everyday Activities and Realistic Simulation

    cs.RO 2024-03 accept novelty 8.0

    BEHAVIOR-1K introduces a benchmark of 1,000 human everyday activities in realistic simulated scenes together with the OMNIGIBSON physics simulator to evaluate embodied AI.

  2. LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning

    cs.AI 2023-06 conditional novelty 8.0

    LIBERO is a new benchmark for lifelong robot learning that evaluates transfer of declarative, procedural, and mixed knowledge across 130 manipulation tasks with provided demonstration data.

  3. Editing Models with Task Arithmetic

    cs.LG 2022-12 accept novelty 8.0

    Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.

  4. Discovering Latent Knowledge in Language Models Without Supervision

    cs.CL 2022-12 conditional novelty 8.0

    An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...

  5. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  6. EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

    cs.LG 2026-05 unverdicted novelty 7.0

    EpiCastBench supplies 40 curated multivariate epidemic datasets and evaluates 15 forecasting models under unified preprocessing, horizons, metrics, and significance tests.

  7. Intrinsic Muon: Spectral Optimization on Riemannian Matrix Manifolds

    cs.LG 2026-05 unverdicted novelty 7.0

    Intrinsic Muon provides closed-form linear maximization oracles on multiple Riemannian matrix manifolds for unitarily invariant norms, with convergence rates depending only on manifold dimension or rank.

  8. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  9. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  10. MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...

  11. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  12. SpectralLoRA: Is Low-Frequency Structure Sufficient for LoRA Adaptation? A Spectral Analysis of Weight Updates

    cs.LG 2026-04 unverdicted novelty 7.0

    LoRA weight updates are spectrally sparse, with 33% of DCT coefficients capturing 90% of energy on average, enabling 10x storage reduction and occasional gains by masking high frequencies.

  13. Norm Anchors Make Model Edits Last

    cs.LG 2026-01 conditional novelty 7.0

    Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.

  14. Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

    cs.CL 2025-11 unverdicted novelty 7.0

    UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-...

  15. Power-Softmax: Towards Secure LLM Inference over Encrypted Data

    cs.LG 2024-10 unverdicted novelty 7.0

    Power-Softmax is a new HE-compatible attention variant that permits training and inference of billion-parameter polynomial LLMs with performance matching standard transformers.

  16. Scaling and evaluating sparse autoencoders

    cs.LG 2024-06 unverdicted novelty 7.0

    K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

  17. QLoRA: Efficient Finetuning of Quantized LLMs

    cs.LG 2023-05 conditional novelty 7.0

    QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.

  18. LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention

    cs.CV 2023-03 conditional novelty 7.0

    LLaMA-Adapter turns frozen LLaMA 7B into a capable instruction follower using only 1.2M new parameters and zero-init attention, matching Alpaca while extending to image-conditioned reasoning on ScienceQA and COCO.

  19. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  20. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  21. Unsupervised Cross-lingual Representation Learning at Scale

    cs.CL 2019-11 conditional novelty 7.0

    XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.

  22. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  23. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  24. UniSpike: Accelerating Spiking Neural Networks on Neuromorphic Systems via Eliminating Address Redundancy

    cs.NE 2026-05 unverdicted novelty 6.0

    UniSpike eliminates address redundancy in spike packets via co-design of scheduling, runtime assembly hardware, and SNN partitioning, reporting 1.93x average traffic reduction, 1.77x speedup, and 1.50x energy improvement.

  25. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  26. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE introduces a dual-path gradient compensator and adaptive scaler to improve surrogate gradient estimation in binarized neural network training.

  27. SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

    cs.LG 2026-05 unverdicted novelty 6.0

    SURGE proposes a dual-path gradient compensator and adaptive scaler to learn better surrogate gradients for binary neural network training, outperforming prior methods on classification, detection, and language tasks.

  28. AdaPreLoRA: Adafactor Preconditioned Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 6.0

    AdaPreLoRA pairs the Adafactor diagonal Kronecker preconditioner on the full weight matrix with a closed-form factor-space solve that selects the update minimizing an H_t-weighted imbalance, yielding competitive resul...

  29. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 6.0

    PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.

  30. Finding Meaning in Embeddings: Concept Separation Curves

    cs.CL 2026-04 unverdicted novelty 6.0

    Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.

  31. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  32. Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.

  33. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  34. PEFT-Bench: A Parameter-Efficient Fine-Tuning Methods Benchmark

    cs.CL 2025-11 unverdicted novelty 6.0

    PEFT-Bench is a standardized end-to-end benchmark for 7 PEFT methods across 27 NLP datasets on autoregressive LLMs, accompanied by the PSCP metric that penalizes based on trainable parameters, inference speed, and tra...

  35. Scaling Laws Meet Model Architecture: Toward Inference-Efficient LLMs

    cs.LG 2025-10 unverdicted novelty 6.0

    A conditional scaling law fitted on over 200 models from 80M to 3B parameters identifies architectures that deliver up to 2.1% higher accuracy and 42% higher inference throughput than LLaMA-3.2 under the same training budget.

  36. Little by Little: Continual Learning via Incremental Mixture of Rank-1 Associative Memory Experts

    cs.LG 2025-06 unverdicted novelty 6.0

    MoRAM frames continual learning as incremental addition of rank-1 adapters viewed as self-activating key-value associative memory units in a mixture-of-experts setup.

  37. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  38. LLaVA-Video: Video Instruction Tuning With Synthetic Data

    cs.CV 2024-10 unverdicted novelty 6.0

    LLaVA-Video-178K is a new synthetic video instruction dataset that, when combined with existing data to train LLaVA-Video, produces strong results on video understanding benchmarks.

  39. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  40. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  41. Simple synthetic data reduces sycophancy in large language models

    cs.CL 2023-08 unverdicted novelty 6.0

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  42. H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

    cs.LG 2023-06 unverdicted novelty 6.0

    H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

  43. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  44. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  45. CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

    cs.SE 2021-02 unverdicted novelty 6.0

    CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.

  46. Linformer: Self-Attention with Linear Complexity

    cs.LG 2020-06 conditional novelty 6.0

    Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.

  47. CTRL: A Conditional Transformer Language Model for Controllable Generation

    cs.CL 2019-09 unverdicted novelty 6.0

    CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

  48. Generic Intent Representation in Web Search

    cs.IR 2019-07 unverdicted novelty 6.0

    GEN Encoder learns query intent embeddings from click logs as weak supervision and multi-task paraphrase training, outperforming prior methods on intent similarity and using nearest-neighbor search to cover half of un...

  49. Kernel-Based ReLU Approximation for Homomorphic Encryption-Compatible Privacy-preserving Deep Learning Models

    cs.CR 2026-05 unverdicted novelty 5.0

    Kernel-based ReLU is approximated by a quadratic polynomial for low-depth homomorphic encryption compatibility, trained on LLM token embeddings and evaluated across DL and transformer settings.

  50. Interactive Evaluation Requires a Design Science

    cs.AI 2026-05 unverdicted novelty 5.0

    Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...

  51. ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    ZeroUnlearn is a few-shot unlearning method that overwrites sensitive inputs with neutral targets via closed-form multiplicative parameter updates enforcing representational orthogonality in LLMs.

  52. ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

    cs.LG 2026-05 unverdicted novelty 5.0

    ZeroUnlearn is a few-shot unlearning method that maps sensitive inputs to neutral states and enforces representational orthogonality through a closed-form multiplicative update, outperforming baselines while preservin...

  53. Convex Dataset Valuation for Post-Training

    cs.LG 2026-05 unverdicted novelty 5.0

    A convex KMM-based valuation method that accounts for both target-task alignment and inter-dataset redundancy in gradient space outperforms standard gradient-alignment baselines for LLM post-training data selection.

  54. Strategic Over-Parameterization for Generalizable Low-Rank Adaptation

    cs.LG 2026-05 unverdicted novelty 5.0

    LoRA-Over injects auxiliary parameters into low-rank adapters during training and decomposes them back into standard LoRA at inference, with static or dynamic scheduling to allocate extra capacity where needed, yieldi...

  55. Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

    cs.AR 2026-04 conditional novelty 5.0

    Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.

  56. A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

    cs.LG 2026-04 unverdicted novelty 5.0

    KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

  57. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  58. BWTA: Accurate and Efficient Binarized Transformer by Algorithm-Hardware Co-design

    cs.LG 2026-04 unverdicted novelty 5.0

    BWTA achieves near full-precision accuracy on BERT and LLMs using binary weights and ternary activations, with 16-24x kernel speedups via specialized CUDA kernels.

  59. PEFT-Factory: Unified Parameter-Efficient Fine-Tuning of Autoregressive Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    PEFT-Factory supplies a ready-to-use, extensible codebase that unifies 19 PEFT methods and evaluation pipelines for fine-tuning large autoregressive language models.

  60. TrustLLM: Trustworthiness in Large Language Models

    cs.CL 2024-01 unverdicted novelty 5.0

    TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt...

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 67 Pith papers

  1. [1]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of the International Conference on Learning Representations, 2015

  2. [2]

    The second PASCAL recognising textual entailment challenge

    Roy Bar Haim, Ido Dagan, Bill Dolan, Lisa Ferro, Danilo Giampiccolo, Bernardo Magnini, and Idan Szpektor. The second PASCAL recognising textual entailment challenge. 2006

  3. [3]

    The fifth PASCAL recognizing textual entailment challenge

    Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. The fifth PASCAL recognizing textual entailment challenge. 2009

  4. [4]

    A large annotated corpus for learning natural language inference

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 632--642. Association for Computational Linguistics, 2015

  5. [5]

    Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. In Eleventh International Workshop on Semantic Evaluations, 2017

  6. [6]

    One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

    Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint 1312.3005, 2013

  7. [7]

    Natural language processing (almost) from scratch

    Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug): 2493--2537, 2011

  8. [8]

    SentEval: An evaluation toolkit for universal sentence representations

    Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 2018

  9. [9]

    Supervised learning of universal sentence representations from natural language inference data

    Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, September 9-11, 2017, pp. 681--691, 2017

  10. [10]

    Using the framework

    Robin Cooper, Dick Crouch, Jan Van Eijck, Chris Fox, Josef Van Genabith, Jan Jaspars, Hans Kamp, David Milward, Manfred Pinkal, Massimo Poesio, Steve Pulman, Ted Briscoe, Holger Maier, and Karsten Konrad. Using the framework. Technical report, The FraCaS Consortium, 1996

  11. [11]

    The PASCAL recognising textual entailment challenge

    Ido Dagan, Oren Glickman, and Bernardo Magnini. The PASCAL recognising textual entailment challenge. In Machine learning challenges. evaluating predictive uncertainty, visual object classification, and recognising tectual entailment, pp. 177--190. Springer, 2006

  12. [12]

    Transforming Question Answering Datasets Into Natural Language Inference Datasets

    Dorottya Demszky, Kelvin Guu, and Percy Liang. Transforming question answering datasets into natural language inference datasets. arXiv preprint 1809.02922, 2018

  13. [13]

    Automatically constructing a corpus of sentential paraphrases

    William B Dolan and Chris Brockett. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the International Workshop on Paraphrasing, 2005

  14. [14]

    Towards linguistically generalizable NLP systems: A workshop and shared task

    Allyson Ettinger, Sudha Rao, Hal Daumé III, and Emily M Bender. Towards linguistically generalizable NLP systems: A workshop and shared task. In First Workshop on Building Linguistically Generalizable NLP Systems, 2017

  15. [15]

    AllenNLP: A deep semantic natural language processing platform

    Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Peters, Michael Schmitz, and Luke S. Zettlemoyer. AllenNLP: A deep semantic natural language processing platform. 2017

  16. [16]

    The third PASCAL recognizing textual entailment challenge

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pp. 1--9. Association for Computational Linguistics, 2007

  17. [17]

    Comparing two k-category assignments by a k-category correlation coefficient

    Jan Gorodkin. Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem., 28(5-6): 367--374, December 2004. ISSN 1476-9271

  18. [18]

    Annotation artifacts in natural language inference data

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. Annotation artifacts in natural language inference data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  19. [19]

    A joint many-task model: Growing a neural network for multiple nlp tasks

    Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. A joint many-task model: Growing a neural network for multiple nlp tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2017

  20. [20]

    Learning distributed representations of sentences from unlabelled data

    Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016

  21. [21]

    Mining and summarizing customer reviews

    Minqing Hu and Bing Liu. Mining and summarizing customer reviews. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 168--177. ACM, 2004

  22. [22]

    Bag of Tricks for Efficient Text Classification

    Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. arXiv preprint 1607.01759, 2016

  23. [23]

    Adam: A method for stochastic optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations, 2015

  24. [24]

    Skip-Thought vectors

    Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Skip-Thought vectors. In Advances in Neural Information Processing Systems, pp. 3294--3302, 2015

  25. [25]

    Distributed representations of sentences and documents

    Quoc Le and Tomas Mikolov. Distributed representations of sentences and documents. In Eric P. Xing and Tony Jebara (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1188--1196, Beijing, China, 22--24 Jun 2014. PMLR

  26. [26]

    The Winograd schema challenge

    Hector J Levesque, Ernest Davis, and Leora Morgenstern. The Winograd schema challenge. In AAAI Spring Symposium: Logical Formalizations of Commonsense Reasoning, volume 46, pp. 47, 2011

  27. [27]

    Comparison of the predicted and observed secondary structure of t4 phage lysozyme

    Brian W Matthews. Comparison of the predicted and observed secondary structure of t4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, 405(2): 442--451, 1975

  28. [28]

    Learned in translation: Contextualized word vectors

    Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pp. 6297--6308, 2017

  29. [29]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint 1806.08730, 2018

  30. [30]

    Non-entailed subsequences as a challenge for natural language inference

    R. Thomas McCoy and Tal Linzen. Non-entailed subsequences as a challenge for natural language inference. In Proceedings of the Society for Computation in Linguistics, volume 2, pp. 357--360, 2019

  31. [31]

    DisSent: Sentence Representation Learning from Explicit Discourse Relations

    Allen Nie, Erin D Bennett, and Noah D Goodman. Dissent: Sentence representation learning from explicit discourse relations. arXiv preprint 1710.04334, 2017

  32. [32]

    A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts

    Bo Pang and Lillian Lee. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, pp. 271. Association for Computational Linguistics, 2004

  33. [33]

    Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales

    Bo Pang and Lillian Lee. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pp. 115--124. Association for Computational Linguistics, 2005

  34. [34]

    GloVe: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1532--1543, 2014

  35. [35]

    Deep contextualized word representations

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  36. [36]

    Hypothesis only baselines in natural language inference

    Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, and Benjamin Van Durme. Hypothesis only baselines in natural language inference. In *SEM@NAACL-HLT, 2018

  37. [37]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 2383--2392. Association for Computational Linguistics, 2016

  38. [38]

    Reasoning about entailment with neural attention

    Tim Rocktäschel, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, and Phil Blunsom. Reasoning about entailment with neural attention. In Proceedings of the International Conference on Learning Representations, 2016

  39. [39]

    Latent Multi-task Architecture Learning

    Sebastian Ruder, Joachim Bingel, Isabelle Augenstein, and Anders Søgaard. Sluice networks: Learning what to share between loosely related tasks. arXiv preprint 1705.08142, 2017

  40. [40]

    The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task

    Roy Schwartz, Maarten Sap, Ioannis Konstas, Li Zilles, Yejin Choi, and Noah A. Smith. The effect of different writing tasks on linguistic style: A case study of the ROC story cloze task. In Proceedings of CoNLL, 2017

  41. [41]

    Bidirectional attention flow for machine comprehension

    Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference of Learning Representations, 2017

  42. [42]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1631--1642, 2013

  43. [43]

    Deep multi-task learning with low level tasks supervised at lower layers

    Anders Søgaard and Yoav Goldberg. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pp. 231--235, 2016

  44. [44]

    Learning general purpose distributed sentence representations via large scale multi-task learning

    Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of the International Conference on Learning Representations, 2018

  45. [45]

    Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment

    Masatoshi Tsuchiya. Performance Impact Caused by Hidden Bias of Training Data for Recognizing Textual Entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, May 7-12, 2018. European Language Resources Association (ELRA)

  46. [46]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 6000--6010, 2017

  47. [47]

    The TREC-8 question answering track report

    Ellen M Voorhees et al. The TREC-8 question answering track report. In TREC, volume 99, pp. 77--82, 1999

  48. [48]

    Neural Network Acceptability Judgments

    Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint 1805.12471, 2018

  49. [49]

    Inference is everything: Recasting semantic resources into a unified evaluation framework

    Aaron Steven White, Pushpendre Rastogi, Kevin Duh, and Benjamin Van Durme. Inference is everything: Recasting semantic resources into a unified evaluation framework. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 996--1005, 2017

  50. [50]

    Annotating expressions of opinions and emotions in language

    Janyce Wiebe, Theresa Wilson, and Claire Cardie. Annotating expressions of opinions and emotions in language. In Proceedings of the International Conference on Language Resources and Evaluation, volume 39, pp. 165--210. Springer, 2005

  51. [51]

    A broad-coverage challenge corpus for sentence understanding through inference

    Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018

  52. [52]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the International Conference on Computer Vision, pp. 19--27, 2015