{"work":{"id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","openalex_id":null,"doi":null,"arxiv_id":"1904.10509","raw_key":null,"title":"Generating Long Sequences with Sparse Transformers","authors":null,"authors_text":"Rewon Child, Scott Gray, Alec Radford, Ilya Sutskever","year":2019,"venue":"cs.LG","abstract":"Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \\sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same architecture to model images, audio, and text from raw bytes, setting a new state of the art for density modeling of Enwik8, CIFAR-10, and ImageNet-64. We generate unconditional samples that demonstrate global coherence and great diversity, and show it is possible in principle to use self-attention to model sequences of length one million or more.","external_url":"https://arxiv.org/abs/1904.10509","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-06-29T14:53:31.417208+00:00","pith_arxiv_id":"1904.10509","created_at":"2026-05-09T04:51:47.665165+00:00","updated_at":"2026-06-29T14:53:31.417208+00:00","title_quality_ok":true,"display_title":"Generating Long Sequences with Sparse Transformers","render_title":"Generating Long Sequences with Sparse Transformers"},"hub":{"state":{"work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":144,"external_cited_by_count":null,"distinct_field_count":16,"first_pith_cited_at":"2019-07-02T15:56:20+00:00","last_pith_cited_at":"2026-05-31T12:55:17+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-29T15:19:00.237837+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":26},{"context_role":"method","n":6},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":27},{"context_polarity":"use_method","n":5},{"context_polarity":"baseline","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Generating Long Sequences with Sparse Transformers","claims":[{"claim_text":"Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \\sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same a","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"(12) and (13), is clearly differentiable with respect toθ and is ready to be employed for 4 Table 1: CIFAR10 results. NLL measured in bits/dim. Model IS FID NLL Test (Train) Conditional EBM [11] 8.30 37 .9 JEM [17] 8.76 38 .4 BigGAN [3] 9.22 14 .73 StyleGAN2 + ADA (v1) [29] 10.06 2 .67 Unconditional Diffusion (original) [53] ≤ 5.40 Gated PixelCNN [59] 4.60 65 .93 3 .03 (2.90) Sparse Transformer [7] 2.80 PixelIQN [43] 5.29 49 .46 EBM [11] 6.78 38 .2 NCSNv2 [56] 31.75 NCSN [55] 8.87±0.12 25 .32 SN","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. \"Language Models are Few-shot Learners\". In: Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), pp. 1877-1901. [13] Aydar Bulatov, Yuri Kuratov, and Mikhail S Burtsev. \"Scaling Transformer to 1M tokens and Beyond with RMT\". In: arXiv preprint arXiv:2304.11062 (2023). [14] Rewon Child, Scott Gray, Alec Radford, an","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Transformer architecture (see Example 6) with 12 layers; (b) GPT-2 model [ 97] is stacked by revised Transformer architecture (see Example 7) with 48 layers; (c) GPT-3 model [5] use the same model and architecture as GPT-2, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [98]. fW ,D,Adenote the network architecture, training dataset and learning algorithm of the foundation models, re","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"LG] 9 May 2026 1 Introduction The Transformer [43] underlies modern large language models [2, 6, 9, 42], vision systems [35], and scientific applications [21]. Its core operation is Attn(Q,K,V) = softmax ( QK⊤ √dk ) V, Q,K∈R n×dk, V∈Rn×dv,(1) whosequadraticcostinsequencelength nhasdrivenalargebodyofefficientapproximations including sparse attention [7], local-window methods [3, 49], low-rank factorisation [44], and kernel-feature approximations [8, 22]. A statistical mismatch underlies these com","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"image tokens autoregressively, which are later decoded into pixels using the ViT-VQGAN decoder. We use a maximum length of text tokens of 128, and the length of image tokens are ﬁxed to 1024 (i.e., 32×32 latent codes from a 256× 256 input image). As an example, the 67-word description of the Starry Night prompt given in Figure 1 has a total length of 92 text tokens. All models use conv-shaped masked sparse attention [34]. We train four size variants ranging from 350 million to 20 billion paramet","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Let 𝑁 be the sequence length, 𝑑 be the head dimension, and 𝑀 be size of SRAM with 𝑑\u0014 𝑀\u0014 𝑁 𝑑. Block-sparse FlashAttention (Algorithm 5) requiresΘ¹𝑁 𝑑¸ 𝑁 2𝑑2 𝑀1𝑠º HBM accesses where 𝑠 is the fraction of nonzero blocks in the block-sparsity mask. We see that applying block-sparsity yields a direct improvement by the sparsity to the larger term in the IO complexity. For large sequence lengths𝑁, 𝑠 is often set to𝑁12 [11] or 𝑁1 log 𝑁 [3, 17, 92], resulting in Θ¹𝑁 p 𝑁º or Θ¹𝑁 log 𝑁º IO complexity. For","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Generating Long Sequences with Sparse Transformers because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (9 contexts).","role_counts":[{"n":9,"context_role":"background"},{"n":2,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-16T12:18:50.123943+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"66e72c4e-d38b-4fd4-905a-8e63558820e2","orcid":null,"display_name":"Rewon Child"},{"id":"20d78a2f-5e57-4e70-9dbd-7e5ef3cf221b","orcid":null,"display_name":"Scott Gray"},{"id":"9783c850-24f9-4444-91ee-29b7660c744b","orcid":null,"display_name":"Alec Radford"},{"id":"f6e38310-dfd1-45dc-8d1a-7bec5c38944f","orcid":null,"display_name":"Ilya Sutskever"}]},"error":null,"updated_at":"2026-05-16T12:18:50.119297+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T06:57:40.958621+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":37},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":15},{"title":"Linformer: Self-Attention with Linear Complexity","work_id":"4b717b51-6098-45d0-8e9e-b69bef651bc3","shared_citers":14},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":14},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":13},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":11},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":10},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":10},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":9},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":9},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":9},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":8},{"title":"Big bird: Transformers for longer sequences","work_id":"605bd800-a1a3-4bcc-b188-604145af1773","shared_citers":8},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":8},{"title":"Reformer: The Efficient Transformer","work_id":"eb3dbae4-931f-40ab-a37b-507a35f42712","shared_citers":8},{"title":"Rethinking Attention with Performers","work_id":"4c26d308-8b72-4a98-8e73-950617a75f50","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":7},{"title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","work_id":"fff3953b-5efb-4753-bee4-002f59995810","shared_citers":7},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":7},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":7},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":7}],"time_series":[{"n":1,"year":2019},{"n":8,"year":2020},{"n":6,"year":2021},{"n":7,"year":2022},{"n":5,"year":2023},{"n":3,"year":2024},{"n":2,"year":2025},{"n":49,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T06:57:36.880726+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T06:57:32.699747+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Generating Long Sequences with Sparse Transformers","claims":[{"claim_text":"Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \\sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same a","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"(12) and (13), is clearly differentiable with respect toθ and is ready to be employed for 4 Table 1: CIFAR10 results. NLL measured in bits/dim. Model IS FID NLL Test (Train) Conditional EBM [11] 8.30 37 .9 JEM [17] 8.76 38 .4 BigGAN [3] 9.22 14 .73 StyleGAN2 + ADA (v1) [29] 10.06 2 .67 Unconditional Diffusion (original) [53] ≤ 5.40 Gated PixelCNN [59] 4.60 65 .93 3 .03 (2.90) Sparse Transformer [7] 2.80 PixelIQN [43] 5.29 49 .46 EBM [11] 6.78 38 .2 NCSNv2 [56] 31.75 NCSN [55] 8.87±0.12 25 .32 SN","claim_type":"baseline","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[12] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. \"Language Models are Few-shot Learners\". In: Advances in Neural Information Processing Systems (NeurIPS) 33 (2020), pp. 1877-1901. [13] Aydar Bulatov, Yuri Kuratov, and Mikhail S Burtsev. \"Scaling Transformer to 1M tokens and Beyond with RMT\". In: arXiv preprint arXiv:2304.11062 (2023). [14] Rewon Child, Scott Gray, Alec Radford, an","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Transformer architecture (see Example 6) with 12 layers; (b) GPT-2 model [ 97] is stacked by revised Transformer architecture (see Example 7) with 48 layers; (c) GPT-3 model [5] use the same model and architecture as GPT-2, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [98]. fW ,D,Adenote the network architecture, training dataset and learning algorithm of the foundation models, re","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"LG] 9 May 2026 1 Introduction The Transformer [43] underlies modern large language models [2, 6, 9, 42], vision systems [35], and scientific applications [21]. Its core operation is Attn(Q,K,V) = softmax ( QK⊤ √dk ) V, Q,K∈R n×dk, V∈Rn×dv,(1) whosequadraticcostinsequencelength nhasdrivenalargebodyofefficientapproximations including sparse attention [7], local-window methods [3, 49], low-rank factorisation [44], and kernel-feature approximations [8, 22]. A statistical mismatch underlies these com","claim_type":"background","confidence":0.85,"evidence_strength":"citation_context"},{"claim_text":"image tokens autoregressively, which are later decoded into pixels using the ViT-VQGAN decoder. We use a maximum length of text tokens of 128, and the length of image tokens are ﬁxed to 1024 (i.e., 32×32 latent codes from a 256× 256 input image). As an example, the 67-word description of the Starry Night prompt given in Figure 1 has a total length of 92 text tokens. All models use conv-shaped masked sparse attention [34]. We train four size variants ranging from 350 million to 20 billion paramet","claim_type":"method","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"Let 𝑁 be the sequence length, 𝑑 be the head dimension, and 𝑀 be size of SRAM with 𝑑\u0014 𝑀\u0014 𝑁 𝑑. Block-sparse FlashAttention (Algorithm 5) requiresΘ¹𝑁 𝑑¸ 𝑁 2𝑑2 𝑀1𝑠º HBM accesses where 𝑠 is the fraction of nonzero blocks in the block-sparsity mask. We see that applying block-sparsity yields a direct improvement by the sparsity to the larger term in the IO complexity. For large sequence lengths𝑁, 𝑠 is often set to𝑁12 [11] or 𝑁1 log 𝑁 [3, 17, 92], resulting in Θ¹𝑁 p 𝑁º or Θ¹𝑁 log 𝑁º IO complexity. For","claim_type":"background","confidence":0.8,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Generating Long Sequences with Sparse Transformers because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (9 contexts).","role_counts":[{"n":9,"context_role":"background"},{"n":2,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-16T12:18:54.974428+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Generating Long Sequences with Sparse Transformers","claims":[{"claim_text":"Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \\sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same a","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Generating Long Sequences with Sparse Transformers because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T06:57:28.638101+00:00"}},"summary":{"title":"Generating Long Sequences with Sparse Transformers","claims":[{"claim_text":"Transformers are powerful sequence models, but require time and memory that grows quadratically with the sequence length. In this paper we introduce sparse factorizations of the attention matrix which reduce this to $O(n \\sqrt{n})$. We also introduce a) a variation on architecture and initialization to train deeper networks, b) the recomputation of attention matrices to save memory, and c) fast attention kernels for training. We call networks with these changes Sparse Transformers, and show they can model sequences tens of thousands of timesteps long using hundreds of layers. We use the same a","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Generating Long Sequences with Sparse Transformers because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":37},{"title":"Scaling Laws for Neural Language Models","work_id":"b7dd8749-9c45-4977-ab9b-64478dce1ae8","shared_citers":15},{"title":"Linformer: Self-Attention with Linear Complexity","work_id":"4b717b51-6098-45d0-8e9e-b69bef651bc3","shared_citers":14},{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":14},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":13},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":12},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":11},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":10},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":10},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":9},{"title":"Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer","work_id":"50e3b368-0243-4726-8186-233869802ad1","shared_citers":9},{"title":"Language Models are Few-Shot Learners","work_id":"214732c0-2edd-44a0-af9e-28184a2b8279","shared_citers":9},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":9},{"title":"RoBERTa: A Robustly Optimized BERT Pretraining Approach","work_id":"41fe12c4-e538-4890-a244-480650ed3078","shared_citers":9},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":8},{"title":"Big bird: Transformers for longer sequences","work_id":"605bd800-a1a3-4bcc-b188-604145af1773","shared_citers":8},{"title":"Layer Normalization","work_id":"20a2d720-0046-4c7c-bcd6-327ec8143f69","shared_citers":8},{"title":"Reformer: The Efficient Transformer","work_id":"eb3dbae4-931f-40ab-a37b-507a35f42712","shared_citers":8},{"title":"Rethinking Attention with Performers","work_id":"4c26d308-8b72-4a98-8e73-950617a75f50","shared_citers":8},{"title":"DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning","work_id":"e6b75ad5-2877-4168-97c8-710407094d20","shared_citers":7},{"title":"FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning","work_id":"fff3953b-5efb-4753-bee4-002f59995810","shared_citers":7},{"title":"Gaussian Error Linear Units (GELUs)","work_id":"0466fd22-03a1-4a61-af0a-a900e77bb023","shared_citers":7},{"title":"On the Opportunities and Risks of Foundation Models","work_id":"a18039e9-928d-47c9-a836-32656a71bf71","shared_citers":7},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":7}],"time_series":[{"n":1,"year":2019},{"n":8,"year":2020},{"n":6,"year":2021},{"n":7,"year":2022},{"n":5,"year":2023},{"n":3,"year":2024},{"n":2,"year":2025},{"n":49,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"9783c850-24f9-4444-91ee-29b7660c744b","orcid":null,"display_name":"Alec Radford","source":"manual","import_confidence":0.72},{"id":"f6e38310-dfd1-45dc-8d1a-7bec5c38944f","orcid":null,"display_name":"Ilya Sutskever","source":"manual","import_confidence":0.72},{"id":"66e72c4e-d38b-4fd4-905a-8e63558820e2","orcid":null,"display_name":"Rewon Child","source":"manual","import_confidence":0.72},{"id":"20d78a2f-5e57-4e70-9dbd-7e5ef3cf221b","orcid":null,"display_name":"Scott Gray","source":"manual","import_confidence":0.72}]}}