Linformer: Self-Attention with Linear Complexity
Pith reviewed 2026-05-12 00:29 UTC · model grok-4.3
The pith
Self-attention in transformers can be approximated by a low-rank matrix to reduce complexity to linear in sequence length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
What carries the argument
Low-rank projection matrices applied to the key and value vectors before attention, which replace the full n-by-n matrix with a much smaller n-by-k matrix where k is fixed and far smaller than n.
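As a rough illustration of the shapes involved, the sketch below contrasts standard attention with a projected variant for a single head, using only the Python standard library. The function names, the tiny sizes, the name `k_proj` for the rank, and the random matrices standing in for the projections E and F are all assumptions made for this example; it is not the authors' implementation, and it omits the query/key/value weight matrices, multiple heads, and any training.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax_rows(a):
    """Row-wise softmax, subtracting the row maximum for numerical stability."""
    out = []
    for row in a:
        m = max(row)
        exps = [math.exp(x - m) for x in row]
        total = sum(exps)
        out.append([x / total for x in exps])
    return out


def scaled(scores, d):
    return [[s / math.sqrt(d) for s in row] for row in scores]


def full_attention(queries, keys, values):
    """Standard attention: the score matrix is n-by-n."""
    d = len(queries[0])
    scores = matmul(queries, transpose(keys))            # (n, n)
    return matmul(softmax_rows(scaled(scores, d)), values)


def projected_attention(queries, keys, values, e, f):
    """Projected attention: e and f (each k_proj-by-n) compress keys and
    values along the sequence axis, so the score matrix is n-by-k_proj."""
    d = len(queries[0])
    keys_small = matmul(e, keys)                         # (k_proj, d)
    values_small = matmul(f, values)                     # (k_proj, d)
    scores = matmul(queries, transpose(keys_small))      # (n, k_proj)
    return matmul(softmax_rows(scaled(scores, d)), values_small)  # (n, d)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes chosen only to keep the example small.
rng = random.Random(0)
n, d, k_proj = 8, 4, 3
queries, keys, values = (random_matrix(n, d, rng) for _ in range(3))
e = random_matrix(k_proj, n, rng)   # random stand-in for a learned projection
f = random_matrix(k_proj, n, rng)   # random stand-in for a learned projection

out = projected_attention(queries, keys, values, e, f)
print(len(out), len(out[0]))  # 8 4
```

The output keeps the n-by-d shape of standard attention while the intermediate score matrix shrinks from n-by-n to n-by-k_proj. As a sanity check on the algebra, setting both projections to the n-by-n identity recovers `full_attention` exactly.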
Load-bearing premise
The low-rank projections, whether learned or fixed, retain enough information from the original attention scores for the model to succeed on the tasks and sequence lengths it will see.
What would settle it
If the Linformer shows a clear accuracy gap compared with the standard transformer on a task that uses sequences several times longer than those seen during training, the low-rank approximation would be shown insufficient.
Original abstract
Large transformer models have shown extraordinary success in achieving state-of-the-art results in many natural language processing applications. However, training and deploying these models can be prohibitively costly for long sequences, as the standard self-attention mechanism of the Transformer uses O(n²) time and space with respect to sequence length. In this paper, we demonstrate that the self-attention mechanism can be approximated by a low-rank matrix. We further exploit this finding to propose a new self-attention mechanism, which reduces the overall self-attention complexity from O(n²) to O(n) in both time and space. The resulting linear transformer, the Linformer, performs on par with standard Transformer models, while being much more memory- and time-efficient.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that self-attention in Transformers can be approximated via low-rank projections on the key and value matrices (using fixed or learned E, F matrices of size k x n with k << n), reducing attention complexity from O(n²) to O(n) in time and space. The resulting Linformer model is shown to achieve competitive performance with standard Transformers on GLUE, WikiText-103, and machine translation benchmarks while being more memory- and time-efficient.
Significance. If the empirical claims hold under broader validation, this is a significant contribution to efficient sequence modeling. It offers a practical architectural change that preserves the core attention mechanism while delivering linear scaling, which is valuable for long-context applications. The algebraic correctness of the low-rank rewriting and the competitive numbers on public NLP benchmarks are strengths; the work provides a clear efficiency gain without requiring entirely new attention formulations.
major comments (3)
- [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.
- [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.
- [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.
minor comments (2)
- [Figure 1] Figure 1 and the surrounding text could include a small diagram explicitly showing the shapes of E and F and how they are applied to K and V.
- [§3.2] The complexity analysis in §3.2 would benefit from an explicit step-by-step derivation of the O(n) claim including the cost of the projections themselves.
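One way the step-by-step derivation requested in the last minor comment could look is sketched below for a single head. It assumes dense matrix multiplication, Q, K, V of size n × d, and E, F of size k × n; the symbol d for the per-head dimension and the symbol P for the softmax output are notation introduced here, not taken from the paper.

```latex
\begin{align*}
\text{project keys and values:}\quad
  & EK,\; FV \in \mathbb{R}^{k \times d}
  && O(nkd) \text{ time} \\
\text{scores:}\quad
  & S = \frac{Q\,(EK)^{\top}}{\sqrt{d}} \in \mathbb{R}^{n \times k}
  && O(nkd) \text{ time},\; O(nk) \text{ space} \\
\text{row-wise softmax:}\quad
  & P = \operatorname{softmax}(S) \in \mathbb{R}^{n \times k}
  && O(nk) \text{ time} \\
\text{output:}\quad
  & P\,(FV) \in \mathbb{R}^{n \times d}
  && O(nkd) \text{ time} \\[4pt]
\text{total:}\quad
  & O(nkd) \text{ time},\; O(nk + nd + kd) \text{ space}
  && \text{versus } O(n^{2}d) \text{ time},\; O(n^{2} + nd) \text{ space}
\end{align*}
```

Under these assumptions the projections cost the same order as the remaining steps, so the total is linear in n only while k and d are held fixed as n grows. If k had to grow in proportion to n to preserve accuracy, the quadratic cost would return.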
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We provide point-by-point responses to the major comments below, indicating the revisions we intend to make.
-
Referee: [§3] §3 (Method), around the definition of the projected attention: the low-rank approximation is presented without any error bound or analysis showing how the approximation error depends on sequence length n, rank k, or the effective rank of the attention matrix. This is load-bearing for the central claim of retained performance, as the paper's own skeptic note and experiments are confined to fixed training lengths.
Authors: We appreciate this observation. While the manuscript does not include a formal error bound, we provide empirical analysis demonstrating that attention matrices exhibit low effective rank, justifying the projection (see the singular value plots in the paper). The performance retention is validated across multiple tasks. In revision, we will add further discussion on how the approximation error scales with k and n based on these observations, though a complete theoretical bound remains an open question for future work. revision: partial
-
Referee: [§4] §4 (Experiments), Tables 1-3 and associated text: no standard deviations or results across multiple random seeds are reported for the GLUE or MT scores, and there are no ablations on the choice of projection dimension k as a function of n or task. This makes the 'on par' claim difficult to assess rigorously and directly tests the weakest assumption about projection sufficiency.
Authors: We agree that multiple random seeds and ablations would enhance the rigor. Our reported results follow the single-run convention common for such large-scale experiments due to resource constraints. We will rerun key experiments with multiple seeds to report means and standard deviations, and include ablations on the projection dimension k for different n and tasks in the revised manuscript. revision: yes
-
Referee: [§4.2] §4.2 and §5: all reported experiments use fixed sequence lengths matching the training regime; no results are provided for substantially longer sequences or domain shifts. This leaves untested whether the learned low-rank projections preserve the necessary subspace when the effective rank of attention grows with n.
Authors: This point highlights an important aspect of generalization. The current experiments adhere to the standard fixed-length settings of the benchmarks. We will extend the evaluation in the revision to include tests with longer sequences and some domain shifts to verify that the learned projections maintain effectiveness when the attention rank increases with n. revision: yes
- Providing a formal error bound or complete theoretical analysis of the approximation error's dependence on n, k, and effective rank
Circularity Check
No significant circularity; architectural proposal with independent empirical validation
The Linformer derivation proposes an explicit architectural change—projecting the key and value matrices via learned low-rank matrices E and F of size k x n (k << n)—to approximate the O(n^2) attention matrix with O(n) complexity. This is not obtained by fitting parameters to a target quantity and then renaming the fit as a prediction, nor by self-referential definitions or load-bearing self-citations. The low-rank property is motivated by empirical observation of attention matrices but the method itself is a constructive proposal whose performance is measured on held-out public benchmarks (e.g., GLUE, SQuAD) with standard training protocols. No equation reduces to its own input by construction, and the central claim retains independent content beyond any cited prior work.
Axiom & Free-Parameter Ledger
free parameters (1)
- projection dimension k
axioms (1)
- domain assumption: The attention matrix admits a useful low-rank approximation for the tasks considered.
Forward citations
Cited by 49 Pith papers
-
Convergent Stochastic Training of Attention and Understanding LoRA
Attention and LoRA regression losses induce Poincaré inequalities under mild regularization, so SGD-mimicking SDEs converge to minimizers with no assumptions on data or model size.
-
Nearly Optimal Attention Coresets
ε-coresets for attention exist of size O(√d e^{ρ+o(ρ)}/ε) for unit-norm keys/values and queries of norm ≤ρ, nearly matching the Ω(√d e^ρ/ε) lower bound.
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
ASAP: Amortized Doubly-Stochastic Attention via Sliced Dual Projection
ASAP amortizes Sinkhorn-based doubly-stochastic attention by learning a parametric map from 1D potentials to the Sinkhorn dual and reconstructing the plan via two-sided entropic c-transform, delivering 5.3x faster inf...
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
Projection-Free Transformers via Gaussian Kernel Attention
Gaussian Kernel Attention replaces learned QKV projections with a Gaussian RBF kernel on per-head token features, using 0.42x parameters and 0.49x FLOPs while showing competitive language modeling performance at depth 20.
-
Preconditioned DeltaNet: Curvature-aware Sequence Modeling for Linear Recurrences
Preconditioned delta-rule models with a diagonal curvature approximation improve upon standard DeltaNet, GDN, and KDA by better approximating the test-time regression objective.
-
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
-
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
-
Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis
HKT is a multi-scale attention architecture that bounds computation at 1.31x standard attention, proves kernel and decomposition properties, and reports accuracy gains on ListOps, sequential CIFAR-10, and character-le...
-
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
-
Collapse-Free Prototype Readout Layer for Transformer Encoders
DDCL-Attention introduces a collapse-free prototype readout for transformers that decomposes the training loss exactly into reconstruction and diversity terms while providing stability guarantees via singular perturba...
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
Vim is a bidirectional Mamba vision backbone that outperforms DeiT in accuracy on standard tasks while being substantially faster and more memory-efficient for high-resolution images.
-
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
FlashAttention reduces GPU high-bandwidth memory accesses in self-attention via tiling, delivering exact attention with lower IO complexity, 2-3x wall-clock speedups on models like GPT-2, and the ability to train on s...
-
Rethinking Attention with Performers
Performers approximate full-rank softmax attention in Transformers via FAVOR+ random features for linear complexity, with theoretical guarantees of unbiased estimation and competitive results on pixel, text, and prote...
-
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Nectar: Neural Estimation of Cached-Token Attention via Regression
Nectar fits small per-layer per-head neural networks via regression to predict attention outputs and normalizers, enabling constant-time inference independent of context length while preserving semantic generation quality.
-
Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns
A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
-
Gated Subspace Inference for Transformer Acceleration
Gated Subspace Inference accelerates transformer linear layers 3-10x via low-rank cached subspace computation and per-token gating to skip residuals while preserving output distribution to high accuracy.
-
Stochastic Sparse Attention for Memory-Bound Inference
SANTA sparsifies post-softmax value aggregation via stratified sampling of S << n_k indices to produce an unbiased estimator, delivering 1.5x decode attention speedup on RTX 6000 Ada at 32k contexts while matching bas...
-
Linear-Time Global Visual Modeling without Explicit Attention
Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
-
GateMOT: Q-Gated Attention for Dense Object Tracking
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
-
ELSA: Exact Linear-Scan Attention for Fast and Memory-Light Vision Transformers
ELSA casts online softmax attention as a prefix scan over monoid (m,S,W) to deliver exact FP32 semantics, O(n) memory, O(log n) depth, and Tensor-Core independence as a drop-in kernel.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...
-
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems
RankUp raises effective rank of representations in deep MetaFormer recommenders via randomized splitting and multi-embeddings, delivering 2-5% GMV gains in production deployments at Weixin.
-
On the Effectiveness of Context Compression for Repository-Level Tasks: An Empirical Investigation
Continuous latent-vector compression improves BLEU scores on repository-level code tasks by up to 28.3% at 4x compression while cutting inference latency.
-
Tracing the Chain: Deep Learning for Stepping-Stone Intrusion Detection
ESPRESSO achieves over 0.99 true positive rate at 10^{-3} false positive rate for stepping-stone intrusion detection on synthetic data for SSH, SOCAT, ICMP, DNS and mixed protocols, outperforming DeepCoFFEA while also...
-
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
-
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers
Sana-0.6B produces high-resolution images with strong text alignment at 20x smaller size and 100x higher throughput than Flux-12B by combining 32x image compression, linear DiT blocks, and a decoder-only LLM text encoder.
-
MemGPT: Towards LLMs as Operating Systems
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
-
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
FlashAttention-2 achieves roughly 2x speedup over FlashAttention by parallelizing attention across thread blocks and distributing work within blocks, reaching 50-73% of theoretical peak FLOPs/s on A100 GPUs.
-
Token Merging: Your ViT But Faster
Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.
-
USEMA: a Scalable Efficient Mamba Like Attention for Medical Image Segmentation
USEMA is a hybrid UNet architecture merging CNNs with scalable Mamba-like attention (SEMA) that achieves better efficiency than transformers and superior segmentation accuracy than pure CNN or Mamba models across medi...
-
PhysEDA: Physics-Aware Learning Framework for Efficient EDA With Manhattan Distance Decay
PhysEDA folds separable Manhattan-distance exponential decay into linear attention and potential-based rewards, cutting complexity to linear while improving zero-shot transfer and sparse-reward performance on decoupli...
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
Convexity in Disguise: A Theoretical Framework for Nonconvex Low-Rank Matrix Estimation
Nonconvex low-rank matrix estimation procedures are shown to be equivalent to locally strongly convex formulations via a benign regularizer that does not change the algorithm's update rule.
-
Cascade Token Selection for Transformer Attention Acceleration
Cascade token selection inherits and updates a small set of representative tokens across layers using cross-Gram validation, reducing selection cost from O(T²d) to O(Trd) per layer with observed Gram savings of 22-63%...
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.
-
RankUp: Towards High-rank Representations for Large Scale Advertising Recommender Systems
RankUp enhances representation capacity in deep MetaFormer recommenders via permutation splitting and multi-embeddings, achieving GMV improvements of 2-5% in Weixin production systems.
-
Sinkhorn doubly stochastic attention rank decay analysis
Sinkhorn-normalized doubly stochastic attention preserves rank more effectively than Softmax row-stochastic attention, with both showing doubly exponential rank decay to one with network depth.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.