Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

Junyu Ren; Victor Veitch; Ziwei Su

arXiv: 2606.30625 · v1 · pith:CSGKCNKSnew · submitted 2026-06-29 · 📊 stat.ML · cs.AI · cs.LG · math.OC


Pith reviewed 2026-06-30 03:23 UTC · model grok-4.3

keywords: contrastive embeddings · embedding norms · optimization dynamics · scale-invariant losses · semantic specificity · gradient flow · token frequency

The pith

Embedding norms encode semantic specificity as a byproduct of contrastive optimization dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Contrastive embedding models trained with scale-invariant losses ignore embedding magnitudes when paired with metrics like cosine similarity. Despite this design choice, the norms correlate with semantic properties such as concept specificity, token frequency, and human uncertainty. The paper supplies a formal framework by analyzing the optimization dynamics to derive an analytic formula showing how embedding length encodes this information naturally during training. This account explains observed empirical patterns and indicates how the norms can function as calibration signals in models and retrieval tasks.

Core claim

By analyzing the optimization dynamics, we derive an analytic formula demonstrating that embedding length naturally encodes semantic information such as concept specificity as a byproduct of the training process in contrastive models with scale-invariant losses.

What carries the argument

Analytic formula from the continuous-time limit of gradient flow on scale-invariant losses, where norm evolution accumulates semantic signals independently of direction.

If this is right

Embedding norms can serve as free calibration tools in specific models.
Norms provide usable signals in retrieval tasks.
The dynamics supply a grounded explanation for previously heuristic observations that norms correlate with semantic properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same dynamics may appear in other embedding architectures that rely on scale-invariant objectives.
The formula could be used to predict how changes in training data frequency would alter norm distributions.
It suggests experiments that isolate the contribution of individual semantic signals to norm growth.

Load-bearing premise

The training uses scale-invariant losses whose gradients produce dynamics in which norm evolution is independent of direction and directly accumulates semantic signals.

What would settle it

A controlled simulation of gradient flow on a scale-invariant contrastive loss in which the observed norm evolution deviates from the derived analytic formula.

Figures

Figures reproduced from arXiv: 2606.30625 by Junyu Ren, Victor Veitch, Ziwei Su.

**Figure 1.** Sphere geometry of the radial drift term. At a training sample $X$, the transported radial direction $K(x, X)^{\top} \hat{z}(x)$ has tangent projection $a_K$ in $T_X S^{d-1}$. The source update $-\bar{s}(X)$ also lies in this tangent space. Its angle $\psi_K$ with $a_K$ sets the sign of the radial vote: $\cos\psi_K > 0$ pushes $x$ outward, while $\cos\psi_K < 0$ pushes it inward. These NTK expressions are an interpretive decomposition of the task moments, no…

**Figure 2.** Left: average $R$ across tokens against $\lambda$ on log scale. Right: log-log fitting of $R^4$ against $\tilde{V}^2/\lambda$, supporting the linearized theoretical scaling. At inference, it is possible to compute $V^\star$ at some cost but $R$ is already present. Theorem 1 makes $R$ a possible proxy for $V^\star$ when variance dominates. This then motivates a norm-discounted similarity score $\mathrm{score}(q, d) = \cos\theta(q, d) \cdot \|q\|^{-\gamma} \cdot \|d\|^{-\gamma}$ (4.20), where $\gamma$ …

**Figure 3.** (a) Log-log scatter of predicted vs. observed embedding norm from the full balance equation (Eq. (4.15)). Each point is a vocabulary position; color encodes token frequency. The fitted gradient is 0.88 ($r = 0.97$), showing that $\tilde{V}$ and $\bar{\lambda}_{\mathrm{rad}}$ capture the dominant mechanism governing equilibrium norms. The systematic deviation from $y = x$ reflects residual non-stationarity in the measurement window. (b) $\tilde{V}^2$ vs…

**Figure 4.** Norm-weighted BEIR retrieval: $\mathrm{score}(q, d) = \cos(q, d) \cdot \|d\|^{\alpha}$ ($\alpha = -\gamma$). NDCG@10 normalized to cosine baseline ($\alpha = 0 \to 1.0$). Left: CLIP ViT-B/32 gains 37–156% at negative $\alpha$ across three datasets. Right: MiniLM-L12 shows no benefit at any $\alpha$. 5.3 Norm-aware scoring improves retrieval in specific regimes. Section 4.3 proposed the norm-discounted similarity score $\mathrm{score}(q, d) = \cos\theta(q, d) \cdot \|q\|^{-\gamma} \cdot \|d\|^{-\gamma}$. For a…

**Figure 5.** $\tilde{V}^2$ and $\tilde{D}$ measurements. Top row: per-token $\tilde{V}^2$ and $\tilde{D}$ values sorted by observed embedding norm for CLIP ViT-B/32 (left) and MiniLM-L12 (right). Bottom row: log-log relationship between $\tilde{V}^2$ and norm, and $\tilde{D}$ vs. $\tilde{V}^2$ scatter. C Depth-Stratified TREC CAR Retrieval. To test whether norm discounting preferentially benefits queries at deeper specificity levels, we evaluate on TREC CAR [Die+17] (Complex Answer Ret…

**Figure 6.** TREC-CAR depth-stratified norm-weighted retrieval. Each color represents a query specificity level (depth 2 = section, depth 3 = subsection, depth 4+ = subsubsection). CLIP benefits most from negative $\alpha$ at deeper heading levels; MiniLM-L12 shows no depth gradient.
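
The norm-discounted score quoted in the Figure 2 and Figure 4 captions can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code: the function name `norm_discounted_scores`, the value `gamma=0.5`, and the toy vectors are assumptions made for the example.

```python
import numpy as np

def norm_discounted_scores(q, docs, gamma=0.5):
    """score(q, d) = cos(q, d) * ||q||^-gamma * ||d||^-gamma for each row d of docs."""
    q = np.asarray(q, dtype=float)
    docs = np.asarray(docs, dtype=float)
    q_norm = np.linalg.norm(q)
    d_norms = np.linalg.norm(docs, axis=1)
    cos = (docs @ q) / (d_norms * q_norm)
    # gamma = 0 recovers plain cosine similarity; gamma > 0 penalizes long vectors.
    return cos * q_norm ** (-gamma) * d_norms ** (-gamma)

# Hypothetical toy data: two documents share the query's direction but differ in norm.
q = np.array([1.0, 0.0])
docs = np.array([[2.0, 0.0], [8.0, 0.0], [0.0, 1.0]])

scores = norm_discounted_scores(q, docs, gamma=0.5)
ranking = np.argsort(-scores)  # document indices, best first
print(scores, ranking)
```

For a fixed query, the factor $\|q\|^{-\gamma}$ is the same positive constant for every document, so it does not change the ranking. That is consistent with the Figure 4 caption using the document-only form $\cos(q, d) \cdot \|d\|^{\alpha}$ with $\alpha = -\gamma$.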

Original abstract

Contrastive embedding models trained with scale-invariant losses are typically paired with distance metrics like cosine similarity, effectively ignoring embedding magnitudes. However, surprisingly, empirical studies reveal that despite this, these "discarded" norms seem to correlate with semantic properties such as concept specificity, token frequency, and human uncertainty. In this work, we provide a formal theoretical framework explaining this phenomenon. By analyzing the optimization dynamics, we derive an analytic formula demonstrating that embedding length naturally encodes this information as a byproduct of the training process. We also show how this gives rise to signals that can serve as "free" calibration tools in specific models and retrieval tasks, providing a grounded explanation for a previously heuristic observation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper derives an analytic formula tying embedding norms to semantic signals via optimization dynamics on scale-invariant losses, but the continuous-time separation step looks fragile for real discrete training.

The core claim is that contrastive models with scale-invariant losses produce embedding norms that encode concept specificity and frequency as a direct byproduct of the training trajectory. They turn an observed correlation into something derived rather than just noted.

The work does a clean job of stating the empirical regularity and framing why it matters for calibration signals. The abstract presents the analytic link as new, which matches the literature pattern where this was treated as unexplained.

The soft spot is the passage to continuous-time gradient flow. Once the loss depends only on normalized embeddings, the ODE is said to let norm evolution accumulate semantic information independently of direction. Discrete steps with Adam or momentum can re-couple the radial component to the current angle, and the abstract gives no sign they checked whether the closed form survives that. That assumption carries the derivation; if it fails, the formula becomes an approximation whose error is unknown.
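
The discretization concern can be illustrated with a toy case. The sketch below is not the paper's model or its analytic formula: it applies plain gradient descent directly to a single embedding vector under an assumed loss $L(x) = -\cos(x, t)/\tau$, and the names, vectors, and step sizes are all hypothetical. For any scale-invariant loss the gradient with respect to $x$ is orthogonal to $x$, so each discrete step gives $\|x - \eta g\|^2 = \|x\|^2 + \eta^2 \|g\|^2$. The norm therefore drifts outward by an amount that depends on the step size and vanishes in the continuous-time limit.

```python
import numpy as np

def grad_cosine_loss(x, target, tau=0.5):
    """Gradient in x of the toy scale-invariant loss L(x) = -cos(x, target) / tau."""
    r = np.linalg.norm(x)
    u = x / r
    t = target / np.linalg.norm(target)
    # Only the component of t tangent to the sphere at u survives.
    return -(t - (u @ t) * u) / (r * tau)

def final_norm(x0, target, lr, steps):
    """Run plain gradient descent and return the final embedding norm."""
    x = x0.copy()
    for _ in range(steps):
        x = x - lr * grad_cosine_loss(x, target)
    return float(np.linalg.norm(x))

x0 = np.array([2.0, 1.0, 0.0])      # hypothetical starting embedding
target = np.array([0.0, 1.0, 1.0])  # hypothetical positive direction

radial = float(grad_cosine_loss(x0, target) @ x0)  # zero up to rounding
r0 = float(np.linalg.norm(x0))
# Same total "time" lr * steps = 50, at two different step sizes.
r_coarse = final_norm(x0, target, lr=0.5, steps=100)
r_fine = final_norm(x0, target, lr=0.05, steps=1000)

print(f"radial component of gradient: {radial:.2e}")
print(f"initial norm {r0:.4f}, coarse steps {r_coarse:.4f}, fine steps {r_fine:.4f}")
```

In this toy setting the coarse run ends with a larger norm than the fine run, and both exceed the initial norm. It says nothing about Adam or weight decay, which are the cases the letter asks referees to probe.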

The paper is aimed at representation-learning researchers who already use or study these norm signals in retrieval or uncertainty work. It is worth a serious referee because the claim is specific enough to be falsified and the potential payoff is practical. Send it out, but ask the referees to focus on whether the continuous limit holds under the actual optimizers used in the experiments.

Referee Report

1 major / 1 minor

Summary. The paper claims that contrastive embedding models trained with scale-invariant losses exhibit embedding norms that encode semantic properties such as concept specificity, token frequency, and human uncertainty as a byproduct of optimization dynamics. It derives an analytic formula for this norm evolution and positions the resulting signals as free calibration tools in models and retrieval tasks.

Significance. If the derivation is rigorous, the result would supply a grounded theoretical account for previously heuristic empirical correlations between norms and semantics, strengthening the case for using magnitude information even when cosine similarity is the training metric.

major comments (1)

[Derivation of the analytic formula (continuous-time limit)] The central analytic formula rests on the claim that, once the loss is written in scale-invariant form, the continuous-time gradient-flow ODE separates norm dynamics from the angular component and directly integrates semantic signals. The manuscript must show that this separation survives the passage to discrete steps with Adam or SGD+weight-decay, whose momentum and adaptive scaling can inject direction-dependent radial updates.

minor comments (1)

The abstract refers to 'specific models and retrieval tasks' without naming them; concrete examples and a short empirical illustration would clarify the scope of the claimed calibration utility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive report. The central concern regarding the robustness of the continuous-time separation under discrete optimizers is well-taken. We address it point-by-point below and outline the revisions we will make.

Referee: [Derivation of the analytic formula (continuous-time limit)] The central analytic formula rests on the claim that, once the loss is written in scale-invariant form, the continuous-time gradient-flow ODE separates norm dynamics from the angular component and directly integrates semantic signals. The manuscript must show that this separation survives the passage to discrete steps with Adam or SGD+weight-decay, whose momentum and adaptive scaling can inject direction-dependent radial updates.

Authors: The analytic derivation is performed strictly in the continuous-time gradient-flow limit, where the scale-invariant loss indeed decouples radial and angular dynamics, allowing the norm to integrate semantic signals independently. We agree that momentum terms in Adam and the weight-decay component in SGD can, in principle, introduce direction-dependent radial perturbations that are absent from pure gradient flow. The manuscript currently relies on the continuous approximation as the source of the closed-form expression and supports its relevance through experiments that already employ Adam. In the revision we will (i) add an explicit remark in Section 3 clarifying the continuous-time assumption and the small-step-size regime in which the separation approximately carries over, (ii) include a short empirical subsection comparing norm evolution under Adam versus plain SGD (with and without weight decay) on the same contrastive objectives, and (iii) state the conditions (sufficiently small learning rate, moderate momentum) under which the analytic formula remains predictive. A fully rigorous discrete-time analysis for arbitrary adaptive optimizers lies outside the present scope but is noted as an interesting direction for follow-up work.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from gradient-flow ODEs

The paper derives an analytic formula for embedding norms directly from the continuous-time gradient flow on scale-invariant contrastive losses, where the loss depends only on normalized embeddings and the ODE separates radial and angular dynamics by construction of the scale-invariance. This separation and the resulting integral for the norm are mathematical consequences of the stated loss form and the continuous limit, not a renaming or refitting of inputs. No load-bearing self-citations, fitted parameters presented as predictions, or ansatzes smuggled via prior work are indicated in the provided text. The result is externally falsifiable against observed norm-semantic correlations and stands as an independent derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information; only abstract provided so free parameters, axioms, and invented entities cannot be audited.

pith-pipeline@v0.9.1-grok · 5645 in / 919 out tokens · 37180 ms · 2026-06-30T03:23:46.687160+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 9 canonical work pages · 7 internal anchors

[1]

Cross modal retrieval with querybank normalisation

[Bog+22] S.-V . Bogolin, I. Croitoru, H. Jin, Y. Liu, and S. Albanie. “Cross modal retrieval with querybank normalisation”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022 (cit. on p. 2). [Che+24] J. Chen, S. Xiao, P . Zhang, K. Luo, D. Lian, and Z. Liu. “M3-embedding: multi-linguality , multi- functionality , mu...

2022
[2]

A simple framework for contrastive learning of visual representations

2024 (cit. on p. 14). [Che+20] T . Chen, S. Kornblith, M. Norouzi, and G. Hinton. “A simple framework for contrastive learning of visual representations”. In:Proceedings of the 37th International Conference on Machine Learning. 2020 (cit. on p. 1). [Dee26] DeepSeek-AI.DeepSeek-V4: towards highly efficient million-token context intelligence. 2026 (cit. on ...

2024
[3]

On the Importance of Embedding Norms in Self-Supervised Learning

arXiv:2502.09252(cit. on pp. 1–3). [He+20] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. “Momentum contrast for unsupervised visual representation learning”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020 (cit. on p. 1). [Ilh+21] G. Ilharco, M. Wortsman, R. Wightman, C. Gordon, N. Carlini, R. Taori, A. Dave, V ...

2020
[4]

ALBERT: a lite BERT for self-supervised learning of language representations

2022 (cit. on pp. 1, 2). [Lan+20] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P . Sharma, and R. Soricut. “ALBERT: a lite BERT for self-supervised learning of language representations”. In:Proceedings of the 8th International Conference on Learning Representations

2022
[5]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

arXiv:1909.11942(cit. on p. 11). 20 [LG25] M. Y. Levi and G. Gilboa. “The double-ellipsoid geometry of clip”. In:Proceedings of the 42nd International Conference on Machine Learning

1909
[6]

Towards general text embeddings with multi-stage contrastive learning

arXiv:2411.14517(cit. on p. 2). [Li+23] Z. Li, X. Zhang, Y . Zhang, D. Long, P . Xie, and M. Zhang. “Towards general text embeddings with multi-stage contrastive learning”. In:arXiv preprint arXiv:2308.03281(2023) (cit. on p. 14). [Lia+22] W . Liang, Y. Zhang, Y . Kwon, S. Yeung, and J. Y . Zou. “Mind the gap: understanding the modality gap in multi-modal...

2023
[7]

Magface: a universal representation for face recogni- tion and quality assessment

arXiv:2203.02053(cit. on p. 2). [Men+21] Q. Meng, S. Zhao, Z. Huang, and F . Zhou. “Magface: a universal representation for face recogni- tion and quality assessment”. In:Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021 (cit. on pp. 1, 2). [Men+24] R. Meng, Y. Liu, S. R. Joty, C. Xiong, Y . Zhou, and S. Yavuz.SFR-Emb...

2021
[8]

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

arXiv:2104.08663 (cit. on p. 9). [Tys+23] K. Tyshchuk, P . Karpikova, A. Spiridonov, A. Prutianova, A. Razzhigaev, and A. Panchenko. “On isotropy of multimodal embeddings”. In:Information14.7 (2023), p. 392 (cit. on p. 2). [Wan+17] F . Wang, X. Xiang, J. Cheng, and A. L. Yuille. “Normface:L2 hypersphere embedding for face verification”. In:Proceedings of ...

2023
[9]

NormFace: L2 Hypersphere Embedding for Face Verification

arXiv:1704.06369(cit. on pp. 2, 3). [Wan+22] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, and F . Wei. “Text embeddings by weakly-supervised contrastive pre-training”. In:arXiv preprint arXiv:2212.03533(2022) (cit. on pp. 1, 11, 14). [Wan+24] L. Wang, N. Yang, X. Huang, L. Yang, R. Majumder, and F . Wei. “Improving text embeddings ...

2022
[10]

A broad-coverage challenge corpus for sentence understanding through inference

2023 (cit. on pp. 1, 2). [WNB18] A. Williams, N. Nangia, and S. R. Bowman. “A broad-coverage challenge corpus for sentence understanding through inference”. In:Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2018 (cit. on p. 7). [Xia+24] S. Xiao, Z. Liu, P . Zh...

2023
[11]

arXiv:2309.16671(cit. on p. 13). [Yan+25] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al.Qwen3 technical report

[12]

Qwen3 Technical Report

arXiv:2505.09388(cit. on p. 14). [Zha+23] X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. “Sigmoid loss for language image pre-training”. In:Proceedings of the IEEE/CVF International Conference on Computer Vision

[13]

Deep metric learning with spherical embedding

arXiv:2303. 15343(cit. on p. 13). [ZLZ20] D. Zhang, Y. Li, and Z. Zhang. “Deep metric learning with spherical embedding”. In:Advances in Neural Information Processing Systems. 2020 (cit. on p. 2). [Zha+18] X. Zhang, F . X. Yu, S. Karaman, W . Zhang, and S.-F . Chang. “Heated-up softmax embedding”. In:Proceedings of the 7th International Conference on Lear...

2020
[14]

Heated-Up Softmax Embedding

arXiv: 1809.04157(cit. on pp. 2, 3). [Zho+22] K. Zhou, K. Ethayarajh, D. Card, and D. Jurafsky. “Problems with cosine as a measure of embedding similarity for high frequency words”. In:Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 2022 (cit. on pp. 1, 2). 22

2022
