pith. machine review for the scientific record.

arxiv: 2012.14913 · v2 · submitted 2020-12-29 · 💻 cs.CL

Transformer Feed-Forward Layers Are Key-Value Memories

Pith reviewed 2026-05-13 23:30 UTC · model grok-4.3

classification 💻 cs.CL
keywords transformers · feed-forward layers · key-value memories · language models · model interpretability · neural network analysis

The pith

Transformer feed-forward layers function as key-value memories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that feed-forward layers, which hold two-thirds of a transformer's parameters, act as key-value memories. Each key matches particular textual patterns seen during training, while each value produces a distribution over likely next tokens. Lower layers focus on simple surface patterns and upper layers on semantic ones, with the layer output combining multiple such memories. Residual connections then refine the combined result into the final prediction.

Core claim

Feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. The learned patterns are human-interpretable, with lower layers capturing shallow patterns and upper layers learning more semantic ones. The values complement the keys by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern. The output of a feed-forward layer is a composition of its memories, refined throughout the model via residual connections.

What carries the argument

Key-value memory pairs inside each feed-forward layer, where a key detects an input pattern and a value supplies a next-token distribution.
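
In symbols, one way to write this reading is sketched below. It assumes the W1, W2 and f notation that appears later in the referee report, a column-vector convention, an element-wise non-linearity f, and no bias terms; the symbol d_m for the number of hidden units is introduced here for illustration only.

```latex
\mathrm{FF}(x) \;=\; W_2\, f(W_1 x) \;=\; \sum_{i=1}^{d_m} f(k_i \cdot x)\, v_i,
\qquad k_i = \text{row } i \text{ of } W_1, \quad v_i = \text{column } i \text{ of } W_2 .
```

Each scalar f(k_i · x) acts as a memory coefficient: it is large when the input matches key k_i, and it scales how much value v_i contributes to the output.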

Load-bearing premise

The correlations between learned keys and input patterns, and between values and output distributions, reflect the actual computation the model performs at inference time.

What would settle it

Alter the weights of one specific key-value pair and measure whether the model's next-token predictions shift only for inputs that match the corresponding pattern.
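
A minimal sketch of such an intervention on a toy layer is given below. It assumes random weights, a ReLU non-linearity, and no biases, and it measures the shift in the layer output rather than in a real model's next-token predictions; the names `ff`, `ablate`, `x_match`, and `x_off` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_m = 8, 32                      # hidden size and number of memories (toy values)
W1 = rng.normal(size=(d_m, d))      # row i plays the role of key k_i
W2 = rng.normal(size=(d, d_m))      # column i plays the role of value v_i

def ff(x, W1, W2):
    """Toy feed-forward layer without biases: W2 @ relu(W1 @ x)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def ablate(W2, i):
    """Return a copy of W2 with value i zeroed, which removes memory i."""
    W2_abl = W2.copy()
    W2_abl[:, i] = 0.0
    return W2_abl

i = 5
x_match = W1[i] / np.linalg.norm(W1[i])  # aligned with key i, so k_i . x > 0
x_off = -x_match                         # anti-aligned, so ReLU zeroes memory i

W2_abl = ablate(W2, i)
shift_match = np.linalg.norm(ff(x_match, W1, W2) - ff(x_match, W1, W2_abl))
shift_off = np.linalg.norm(ff(x_off, W1, W2) - ff(x_off, W1, W2_abl))
print(shift_match, shift_off)
```

In this toy setting the output moves only for the input that activates the ablated key. In a trained model the same comparison would be run on inputs that do and do not contain the pattern associated with the key, and the shift would be read off the final next-token distribution.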

Original abstract

Feed-forward layers constitute two-thirds of a transformer model's parameters, yet their role in the network remains under-explored. We show that feed-forward layers in transformer-based language models operate as key-value memories, where each key correlates with textual patterns in the training examples, and each value induces a distribution over the output vocabulary. Our experiments show that the learned patterns are human-interpretable, and that lower layers tend to capture shallow patterns, while upper layers learn more semantic ones. The values complement the keys' input patterns by inducing output distributions that concentrate probability mass on tokens likely to appear immediately after each pattern, particularly in the upper layers. Finally, we demonstrate that the output of a feed-forward layer is a composition of its memories, which is subsequently refined throughout the model's layers via residual connections to produce the final output distribution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that feed-forward layers in transformer language models function as key-value memories: keys correlate with human-interpretable textual patterns from the training data (shallow in lower layers, semantic in upper layers), values induce complementary next-token distributions, and the layer output is a composition of activated memories that is refined via residual connections to produce the final distribution.

Significance. If the interpretation holds, it supplies a concrete mechanistic account of two-thirds of transformer parameters, grounded in empirical probing that reveals interpretable patterns and input-output complementarity. This could support targeted model editing and deeper understanding of how transformers store and retrieve information.

major comments (2)
  1. §4 (pattern extraction and activation analysis): the reported correlations between keys and n-gram patterns are statistical matches only; without causal interventions such as key ablation, activation patching, or counterfactual input edits, it remains possible that the observed associations are side effects rather than the operative mechanism in the forward pass W2 · f(W1 x).
  2. §3.2 (memory composition claim): the assertion that the FF output is exactly a composition of memories is not fully reconciled with the non-linearity f; the paper should show (via expansion or controlled experiments) that multiple simultaneously activated keys combine linearly in the effective computation rather than through non-linear interactions.
minor comments (2)
  1. Figure 3 and Table 1: the value-distribution visualizations would be clearer with an explicit random-key baseline to quantify how much the reported concentration exceeds chance.
  2. Notation: the mapping from matrix rows/columns to keys and values is introduced without a compact equation; adding a single-line definition (e.g., key_i = row i of W1) would aid readability.
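
One possible form of the baseline requested in minor comment 1 is sketched below. It assumes that a value vector is turned into a vocabulary distribution by a softmax over its product with an output embedding matrix, here called `E`, and it compares the top-token probability of a value against random directions of the same norm. The toy data and the helpers `top_prob` and `random_baseline` are hypothetical, and this is only one reading of what a "random-key baseline" could mean.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def top_prob(vec, E):
    """Largest probability in the vocabulary distribution softmax(E @ vec)."""
    return softmax(E @ vec).max()

def random_baseline(v, E, n_samples=200, seed=0):
    """Mean top probability over random directions rescaled to the norm of v."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=(n_samples, v.shape[0]))
    r *= np.linalg.norm(v) / np.linalg.norm(r, axis=1, keepdims=True)
    return float(np.mean([top_prob(row, E) for row in r]))

# Toy stand-ins: in practice E and v would be read from a trained model.
rng = np.random.default_rng(1)
vocab, d = 100, 16
E = rng.normal(size=(vocab, d))     # hypothetical output embedding matrix
v = rng.normal(size=d)              # hypothetical value vector

print(top_prob(v, E), random_baseline(v, E))
```

With random toy weights the two numbers carry no meaning. On a trained model, a value whose top probability clearly exceeds the baseline would support the claim that the concentration is above chance.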

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the insightful comments and the recommendation for major revision. We provide detailed responses to each major comment below, indicating where we will revise the manuscript to address the concerns.

Point-by-point responses
  1. Referee: §4 (pattern extraction and activation analysis): the reported correlations between keys and n-gram patterns are statistical matches only; without causal interventions such as key ablation, activation patching, or counterfactual input edits, it remains possible that the observed associations are side effects rather than the operative mechanism in the forward pass W2 · f(W1 x).

    Authors: We acknowledge that the primary evidence consists of strong statistical correlations between the keys and specific textual patterns, identified by finding inputs that highly activate each key. These correlations are not merely side effects, as they directly correspond to the computation in the forward pass, where high key activation leads to the associated value contributing to the output. Nevertheless, to provide stronger causal evidence, we will add experiments that ablate specific keys and measure the impact on the model's predictions for inputs containing the corresponding patterns.
    Revision: yes

  2. Referee: §3.2 (memory composition claim): the assertion that the FF output is exactly a composition of memories is not fully reconciled with the non-linearity f; the paper should show (via expansion or controlled experiments) that multiple simultaneously activated keys combine linearly in the effective computation rather than through non-linear interactions.

    Authors: The non-linearity f is applied element-wise to the pre-activations, meaning each key's activation scalar is computed independently as f(key_i · x). The layer output is then the linear combination sum_i activation_i * value_i. Therefore, the memories combine linearly once activated, with the non-linearity affecting only the activation strength of each memory individually. We will revise §3.2 to include this explicit mathematical expansion and present controlled experiments where we compare the actual FF output to the linear combination of individually computed memory contributions.
    Revision: yes
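
The expansion described in this response can be checked numerically on toy weights. The sketch below assumes a ReLU non-linearity and no bias terms, and it uses random matrices in place of trained ones; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_m = 8, 32                      # hidden size and number of memories (toy values)
W1 = rng.normal(size=(d_m, d))      # row i is treated as key_i
W2 = rng.normal(size=(d, d_m))      # column i is treated as value_i
x = rng.normal(size=d)

def f(z):
    """Element-wise non-linearity (ReLU here, as an assumption)."""
    return np.maximum(z, 0.0)

# Actual layer output computed in one shot.
full = W2 @ f(W1 @ x)

# Same output rebuilt memory by memory: each activation is computed
# from its own key only, then scales its own value.
activations = np.array([f(W1[i] @ x) for i in range(d_m)])
composed = sum(activations[i] * W2[:, i] for i in range(d_m))

print(np.allclose(full, composed))
```

The equality holds for any element-wise f, since it is just the column expansion of a matrix-vector product. It does not by itself address interactions introduced outside the layer, such as layer normalization or later layers.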

Circularity Check

0 steps flagged

No circularity: empirical correlations from trained models do not reduce to self-definition or fitted inputs

Full rationale

The paper's central claim rests on post-training analysis of existing transformer weights: identifying input patterns that strongly activate specific rows of the first FF matrix (treated as keys) and observing that the corresponding columns of the second matrix induce next-token distributions (treated as values). These are measured correlations on held-out data and activation statistics, not quantities defined in terms of each other or obtained by fitting a parameter whose value is then relabeled as a prediction. No equations are shown to be equivalent by construction, no uniqueness theorem is imported from the authors' prior work to force the interpretation, and the residual composition argument is demonstrated via direct layer-wise ablation rather than assumed. The derivation chain is therefore self-contained against external benchmarks (the trained models themselves).

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on empirical observations from trained transformers rather than new free parameters or invented physical entities; standard language-modeling assumptions are used.

axioms (1)
  • Domain assumption: transformers are trained via next-token prediction on large corpora.
    Invoked implicitly when linking keys to training patterns and values to output distributions.
invented entities (1)
  • key-value memory structure inside feed-forward layers (no independent evidence)
    purpose: Interpretive lens to explain layer behavior
    This is a conceptual reframing of existing weights, not a new postulated object with independent falsifiable predictions.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    cs.LG 2022-11 conditional novelty 8.0

    GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

  2. Uncovering Entity Identity Confusion in Multimodal Knowledge Editing

    cs.CL 2026-05 unverdicted novelty 7.0

    Multimodal knowledge editing causes models to confuse original and edited entity identities in text queries by failing to update image-entity bindings and instead overfitting entity-entity shortcuts.

  3. Sharp Capacity Thresholds in Linear Associative Memory: From Winner-Take-All to Listwise Retrieval

    stat.ML 2026-05 unverdicted novelty 7.0

    Winner-take-all linear memory capacity scales as d² ~ n log n due to extreme values; listwise retrieval via Tail-Average Margin yields d² ~ n with exact asymptotic theory.

  4. How Language Models Process Negation

    cs.CL 2026-05 unverdicted novelty 7.0

    LLMs implement both attention-based suppression and constructive representations for negation, with construction dominant, despite poor accuracy from late-layer attention shortcuts.

  5. A framework for analyzing concept representations in neural models

    cs.CL 2026-05 unverdicted novelty 7.0

    A new framework shows concept subspaces are not unique, estimator choice affects containment and disentanglement, LEACE works well but generalizes poorly, and HuBERT encodes phone info as contained and disentangled fr...

  6. A Parametric Memory Head for Continual Generative Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.

  7. One Model to Translate Them All? A Journey to Mount Doom for Multilingual Model Merging

    cs.CL 2026-04 unverdicted novelty 7.0

    Merging fine-tuned models for multilingual translation fails because fine-tuning redistributes language-specific neurons rather than sharpening them, increasing representational divergence in output-generating layers.

  8. Eliciting Latent Predictions from Transformers with the Tuned Lens

    cs.LG 2023-03 accept novelty 7.0

    Training per-layer affine probes on frozen transformers yields more reliable latent predictions than the logit lens and enables detection of malicious inputs from prediction trajectories.

  9. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  10. A Geometric Perspective on Next-Token Prediction in Large Language Models: Three Emerging Phases

    cs.LG 2026-05 unverdicted novelty 6.0

    LLMs exhibit three geometric phases in next-token prediction—seeding multiplexing, hoisting overriding, and focal convergence—where predictive subspaces rise, stabilize, and converge across layers.

  11. UniPool: A Globally Shared Expert Pool for Mixture-of-Experts

    cs.LG 2026-05 unverdicted novelty 6.0

    A shared global expert pool in MoE improves validation loss over per-layer experts and allows sublinear expert-parameter growth with depth.

  12. Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

    cs.LG 2026-05 unverdicted novelty 6.0

    Symmetric spectral diagnostics on attention are structurally blind to flow direction, with asymmetry G as the sole control parameter, yielding a two-axis test that distinguishes bottleneck versus diffuse hallucination...

  13. Logical Consistency as a Bridge: Improving LLM Hallucination Detection via Label Constraint Modeling between Responses and Self-Judgments

    cs.CL 2026-05 unverdicted novelty 6.0

    LaaB improves LLM hallucination detection by mapping self-judgment labels back into neural feature space and using mutual learning under logical consistency constraints between responses and meta-judgments.

  14. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  15. From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization

    cs.CL 2026-04 unverdicted novelty 6.0

    LLM 2-bit quantization fails via either cumulative signal degradation or early computation collapse in key components.

  16. Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks

    cs.AI 2026-04 conditional novelty 6.0

    Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.

  17. Representation-Guided Parameter-Efficient LLM Unlearning

    cs.CL 2026-04 unverdicted novelty 6.0

    REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

  18. BID-LoRA: A Parameter-Efficient Framework for Continual Learning and Unlearning

    cs.LG 2026-04 unverdicted novelty 6.0

    BID-LoRA uses bi-directional low-rank adapters with retain/new/unlearn pathways and escape unlearning to enable continual learning and unlearning while minimizing knowledge leakage and parameter updates.

  19. In-Place Test-Time Training

    cs.LG 2026-04 conditional novelty 6.0

    In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

  20. Automated Attention Pattern Discovery at Scale in Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    AP-MAE reconstructs masked attention patterns in LLMs with high accuracy, generalizes across models, predicts generation correctness at 55-70%, and enables 13.6% accuracy gains via targeted interventions.

  21. The Illusion of Latent Generalization: Bi-directionality and the Reversal Curse

    cs.CL 2026-03 unverdicted novelty 6.0

    Bidirectional objectives mitigate reversal by requiring explicit source-as-target signals and storing directions as distinct representations instead of inducing latent generalization.

  22. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.

  23. From Heads to Neurons: Causal Attribution and Steering in Multi-Task Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HONES ranks feed-forward neurons by their causal contributions from task-relevant attention heads and uses lightweight scaling to steer performance on multiple vision-language tasks.
