pith. sign in

Enhancing the transformer with explicit relational encoding for math problem solving, 2019, 1910.06611 http://arxiv.org/abs/1910.06611

7 Pith papers cite this work. Polarity classification is still indexing.

7 Pith papers citing it

fields

cs.CL 4 cs.LG 3

representative citing papers

Scaling Laws for Autoregressive Generative Modeling

cs.LG · 2020-10-28 · accept · novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

cs.CL · 2020-06-05 · unverdicted · novelty 7.0

DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model superhuman score on SuperGLUE.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

citing papers explorer

Showing 7 of 7 citing papers.

  • Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions cs.LG · 2026-05-11 · unverdicted · none · ref 20

    Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

  • Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention cs.CL · 2024-04-10 · conditional · none · ref 24

    Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

  • Scaling Laws for Autoregressive Generative Modeling cs.LG · 2020-10-28 · accept · none · ref 23

    Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

  • DeBERTa: Decoding-enhanced BERT with Disentangled Attention cs.CL · 2020-06-05 · unverdicted · none · ref 25

    DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model superhuman score on SuperGLUE.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 119

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 61

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  • Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 33

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.