pith. machine review for the scientific record.

arxiv: 1906.05714 · v1 · submitted 2019-06-12 · 💻 cs.HC · cs.CL · cs.LG

Recognition: unknown

A Multiscale Visualization of Attention in the Transformer Model

Authors on Pith: no claims yet
classification 💻 cs.HC · cs.CL · cs.LG
keywords model · attention · transformer · mechanism · tool · accessible · advantage · approach
Original abstract

The Transformer is a sequence model that forgoes traditional recurrent architectures in favor of a fully attention-based approach. Besides improving performance, an advantage of using attention is that it can also help to interpret a model by showing how the model assigns weight to different input elements. However, the multi-layer, multi-head attention mechanism in the Transformer model can be difficult to decipher. To make the model more accessible, we introduce an open-source tool that visualizes attention at multiple scales, each of which provides a unique perspective on the attention mechanism. We demonstrate the tool on BERT and OpenAI GPT-2 and present three example use cases: detecting model bias, locating relevant attention heads, and linking neurons to model behavior.
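The multiscale views described here operate on the model's raw attention tensors: one (heads × tokens × tokens) matrix per layer. A minimal sketch of pulling those tensors from a pretrained BERT using the Hugging Face transformers library (the library choice is an assumption for illustration; this is not the paper's own tool, which was released as BertViz):

```python
# Minimal sketch: extract the multi-layer, multi-head attention weights
# that a multiscale visualization renders. Assumes the Hugging Face
# `transformers` library; not the paper's own code.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The doctor asked the nurse a question", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print(tokens)

# outputs.attentions: one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len); each row is one query token's
# attention distribution over all key tokens.
for layer, attn in enumerate(outputs.attentions):
    print(f"layer {layer:2d}: {tuple(attn.shape)}")
```

Inspecting how individual heads distribute these weights on sentences like the one above is exactly the kind of probing the paper's bias-detection use case relies on.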

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. In-context Learning and Induction Heads

    cs.LG · 2022-09 · unverdicted · novelty 7.0

    Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i... (A toy prefix-matching sketch follows this list.)

  2. Rethinking Attention with Performers

    cs.LG · 2020-09 · unverdicted · novelty 7.0

    Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote... (A minimal FAVOR+-style sketch follows this list.)

  3. TIDE: Every Layer Knows the Token Beneath the Context

    cs.CL · 2026-05 · unverdicted · novelty 5.0

    TIDE augments standard transformers with per-layer token embedding injection via an ensemble of memory blocks and a depth-conditioned router to mitigate rare-token undertraining and contextual collapse.
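Entry 1's "pattern completion in attention" can be made concrete with a toy diagnostic: an induction head, when processing a token, attends to the position just after an earlier occurrence of that same token, so it can copy what followed last time. A hedged sketch, where the scoring function and its name are my own simplification of the prefix-matching idea, not that paper's code:

```python
import numpy as np

def prefix_matching_score(attn, tokens):
    """Toy induction-head diagnostic.

    attn:   (seq, seq) attention matrix for one head (rows = query positions).
    tokens: list of token ids, len == seq.

    Returns the average attention each query places on positions
    immediately AFTER earlier occurrences of its own token, i.e. on
    the token an induction head would copy to continue the pattern.
    """
    seq = len(tokens)
    scores = []
    for i in range(1, seq):
        targets = [j + 1 for j in range(i - 1) if tokens[j] == tokens[i]]
        if targets:
            scores.append(attn[i, targets].sum())
    return float(np.mean(scores)) if scores else 0.0
```

On a repeated random sequence (e.g. the same pattern concatenated twice), heads scoring high under this diagnostic behave like induction heads.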
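Entry 2's FAVOR+ idea can also be condensed: replace the exp(q·k/√d) softmax kernel with an unbiased positive random-feature estimate, so attention factorizes and the cost scales linearly in sequence length rather than quadratically. A minimal NumPy sketch under stated assumptions: plain Gaussian (not orthogonal) projections, a single head, no causal masking; function names are mine, not the Performers reference implementation.

```python
import numpy as np

def favor_features(x, W):
    # Positive random features for the softmax kernel:
    # phi(x) = exp(W x - ||x||^2 / 2) / sqrt(m), W rows ~ N(0, I),
    # so that E[phi(q) . phi(k)] = exp(q . k) (unbiased).
    m = W.shape[0]
    proj = x @ W.T                                   # (n, m)
    sq = 0.5 * np.sum(x**2, axis=-1, keepdims=True)  # (n, 1)
    return np.exp(proj - sq) / np.sqrt(m)

def linear_attention(Q, K, V, n_features=256, seed=0):
    # Approximates softmax attention in O(n * m * d) instead of O(n^2 * d).
    d = Q.shape[-1]
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((n_features, d))
    # Fold the usual 1/sqrt(d) temperature into the inputs.
    Qp = favor_features(Q / d**0.25, W)   # (n, m)
    Kp = favor_features(K / d**0.25, W)   # (n, m)
    KV = Kp.T @ V                         # (m, d_v), computed once
    norm = Qp @ Kp.sum(axis=0)            # (n,), the softmax denominator
    return (Qp @ KV) / norm[:, None]
```

The key design point is that Kp.T @ V is independent of the query, so the quadratic (n × n) attention matrix is never materialized.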