Enhancing the transformer with explicit relational encoding for math problem solving, 2019, 1910.06611 http://arxiv.org/abs/1910.06611

Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, Jianfeng Gao · 2019 · arXiv 1910.06611

7 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Linear probes for Othello board states factor into tensor-product structure with square and color embeddings composed by a binding matrix, from which the linear probes can be directly recovered.

Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

cs.CL · 2024-04-10 · conditional · novelty 7.0

Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.

Scaling Laws for Autoregressive Generative Modeling

cs.LG · 2020-10-28 · accept · novelty 7.0

Autoregressive transformers follow power-law scaling laws for cross-entropy loss with nearly universal exponents relating optimal model size to compute budget across four domains.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

cs.CL · 2020-06-05 · unverdicted · novelty 7.0

DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model superhuman score on SuperGLUE.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

citing papers explorer

Showing 7 of 7 citing papers.

Tensor Product Representation Probes Reveal Shared Structure Across Linear Directions cs.LG · 2026-05-11 · unverdicted · none · ref 20
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention cs.CL · 2024-04-10 · conditional · none · ref 24
Scaling Laws for Autoregressive Generative Modeling cs.LG · 2020-10-28 · accept · none · ref 23
DeBERTa: Decoding-enhanced BERT with Disentangled Attention cs.CL · 2020-06-05 · unverdicted · none · ref 25
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 119
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 61
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 33
