pith. machine review for the scientific record.

arXiv: 2604.05866 · v1 · submitted 2026-04-07 · cs.IR · cs.CL · cs.DL

Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching

Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3

classification cs.IR · cs.CL · cs.DL
keywords reviewer matching · structured profiles · LLM-based matching · paper-reviewer recommendation · information retrieval · conference peer review · rubric scoring

The pith

A training-free system builds explicit structured profiles of topics, methods, and applications to match papers with reviewers more accurately than paper-similarity baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that traditional reviewer matching relies too heavily on implicit similarity between a submission and a reviewer's past papers, which fails to capture multi-dimensional expertise. It introduces P2R, which prompts general-purpose LLMs to create separate profiles for submissions and reviewers broken into Topics, Methodologies, and Applications. These profiles feed a two-stage process: hybrid retrieval to gather candidates, followed by an LLM committee that scores them against detailed rubrics from both expert and area-chair viewpoints. Experiments on NeurIPS, SIGIR, and SciRepEval benchmarks show consistent gains over prior methods, with ablations confirming the value of the structured breakdown and rubric step.

Core claim

P2R shifts reviewer matching from paper-to-paper similarity to explicit profile-based comparison by using general LLMs to disentangle expertise into Topics, Methodologies, and Applications for both submissions and reviewers; it then applies hybrid semantic and aspect retrieval to form candidate pools and an LLM committee with strict multi-perspective rubrics to rank matches, delivering higher performance than state-of-the-art baselines on three evaluation sets.
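
To make the hybrid retrieval step concrete, the sketch below fuses a semantic ranking and an aspect-level ranking into one candidate pool. It assumes Reciprocal Rank Fusion as the combination rule; RRF appears in the paper's reference list ([36]), but this summary does not confirm it is the mechanism P2R uses. The function name `rrf_fuse`, the constant `k=60`, the `top_m` parameter, and the reviewer IDs are all illustrative.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60, top_m=3):
    """Fuse ranked lists of reviewer IDs with Reciprocal Rank Fusion.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant (60 is a commonly used default for RRF).
    top_m: size of the candidate pool to return (hypothetical parameter).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, reviewer_id in enumerate(ranking, start=1):
            scores[reviewer_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    ordered = sorted(scores, key=lambda rid: (-scores[rid], rid))
    return ordered[:top_m]

# Assumed inputs: one ranking from embedding similarity, one from aspect overlap.
semantic_ranking = ["r1", "r2", "r3", "r4"]
aspect_ranking = ["r3", "r1", "r5", "r2"]
candidate_pool = rrf_fuse([semantic_ranking, aspect_ranking], top_m=3)
print(candidate_pool)  # ['r1', 'r3', 'r2']
```

A rank-based fusion like this needs no score calibration between the two signals, which is one reason it would suit a training-free design; a weighted sum of normalized scores would be an equally plausible reading of "hybrid".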

What carries the argument

Structured profiles that separate expertise into Topics, Methodologies, and Applications, processed through a coarse-to-fine pipeline of hybrid retrieval and rubric-guided LLM committee scoring.
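
The fine stage of that pipeline can be sketched as an aggregation over judge outputs. The example assumes each expert judge returns an integer rubric score per aspect and the area-chair judge returns one holistic score, all on a 1 to 5 scale; the scale, the equal weighting in `ac_weight`, and the function names are assumptions, and the LLM calls themselves are replaced by hard-coded scores.

```python
from statistics import mean

def committee_score(expert_scores, area_chair_score, ac_weight=0.5):
    """Blend per-aspect expert scores with a holistic area-chair score.

    expert_scores: dict mapping aspect name -> rubric score (assumed 1-5).
    area_chair_score: single holistic rubric score (assumed 1-5).
    ac_weight: hypothetical weight on the area-chair view.
    """
    expert_avg = mean(expert_scores.values())
    return (1 - ac_weight) * expert_avg + ac_weight * area_chair_score

def rank_candidates(judgments, ac_weight=0.5):
    """Order reviewer IDs by committee score, best first."""
    scored = {
        rid: committee_score(experts, ac, ac_weight)
        for rid, (experts, ac) in judgments.items()
    }
    return sorted(scored, key=lambda rid: (-scored[rid], rid))

# Stand-in judge outputs for three candidates from the retrieval stage.
judgments = {
    "r1": ({"topics": 5, "methodologies": 3, "applications": 4}, 4),
    "r2": ({"topics": 4, "methodologies": 4, "applications": 2}, 3),
    "r3": ({"topics": 2, "methodologies": 5, "applications": 5}, 5),
}
ranking = rank_candidates(judgments)
print(ranking)  # ['r3', 'r1', 'r2']
```

Keeping the expert and area-chair scores separate until the final blend makes it possible to inspect which perspective drove a given ranking.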

If this is right

  • Reviewer assignments improve when expertise is represented explicitly along separate dimensions rather than through overall textual overlap with past publications.
  • The same profile construction and rubric evaluation steps can be reused across different conferences without retraining models.
  • Ablation results indicate that removing either the hybrid retrieval stage or the rubric committee reduces performance, confirming both are required for the observed gains.
  • The framework provides a concrete template for applying general LLMs to other ranking or recommendation tasks that need multi-aspect expertise modeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same disentangled profiling approach could be tested for matching grants to reviewers or papers to program committees in other fields.
  • If LLMs systematically under-represent emerging or interdisciplinary work in the profiles, the method might reinforce existing topic clusters rather than discover novel matches.
  • Real-world use would likely need an additional calibration step where area chairs can adjust rubric weights for their specific conference.
  • The training-free design lowers the barrier for smaller venues but may require periodic updates as LLM capabilities change.

Load-bearing premise

General-purpose LLMs can reliably extract accurate, disentangled structured profiles of topics, methodologies, and applications from raw paper text without domain-specific training or fine-tuning.

What would settle it

A side-by-side human annotation study showing that LLM-generated profiles frequently misclassify or conflate the three aspects, or a live conference deployment where P2R recommendations receive lower acceptance rates or area-chair satisfaction scores than current methods.
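
One way such an annotation study could be scored is sketched below: per-aspect set overlap between an LLM profile and an expert profile, plus a list of terms the LLM filed under a different aspect than the expert did. The `Profile` class, the use of Jaccard overlap, and the example terms are assumptions for illustration; the paper's profile schema is known here only at the level of its three aspect names.

```python
from dataclasses import dataclass, field

ASPECTS = ("topics", "methodologies", "applications")

@dataclass
class Profile:
    """Structured profile with the three aspects named in the paper."""
    topics: set = field(default_factory=set)
    methodologies: set = field(default_factory=set)
    applications: set = field(default_factory=set)

def jaccard(a, b):
    """Set overlap in [0, 1]; two empty sets count as full agreement."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def aspect_agreement(llm, expert):
    """Jaccard agreement between LLM and expert profiles, one value per aspect."""
    return {name: jaccard(getattr(llm, name), getattr(expert, name)) for name in ASPECTS}

def conflated_terms(llm, expert):
    """Terms the LLM placed under one aspect while the expert placed them under another."""
    found = set()
    for llm_aspect in ASPECTS:
        for term in getattr(llm, llm_aspect):
            if term in getattr(expert, llm_aspect):
                continue  # both annotators agree on this placement
            for expert_aspect in ASPECTS:
                if expert_aspect != llm_aspect and term in getattr(expert, expert_aspect):
                    found.add((term, llm_aspect, expert_aspect))
    return found

llm = Profile({"reviewer matching", "dense retrieval"}, {"rubric scoring"}, {"peer review"})
expert = Profile({"reviewer matching"}, {"rubric scoring", "dense retrieval"}, {"peer review"})
agreement = aspect_agreement(llm, expert)
conflated = conflated_terms(llm, expert)
print(agreement)
print(conflated)
```

Exact string matching is a simplification; a real study would need to normalize or semantically match terms before counting overlap.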

Figures

Figures reproduced from arXiv: 2604.05866 by Ludi Wang, Yicheng Pan, Yi Du, Zhiyuan Ning.

Figure 1. Paradigm comparison. The left panel illustrates traditional Paper-to-Paper matching, which relies on similarities between a submission and historical publications. In contrast, the right panel depicts our Paper-to-Reviewer framework (P2R). P2R leverages LLMs to synthesize structured profiles, enabling direct and interpretable expertise alignment.
Figure 2. Overview of P2R. The pipeline has three stages: (1) Profile and embedding generation, where an LLM summarizes structured aspect profiles for …
Figure 3. (a) Top-M sensitivity on SciRepEval. (b) LLM-backbone robustness on NeurIPS. Values represent the average of soft and hard P@N metrics.
Original abstract

As conference submission volumes continue to grow, accurately recommending suitable reviewers has become a challenge. Most existing methods follow a "Paper-to-Paper" matching paradigm, implicitly representing a reviewer by their publication history. However, effective reviewer matching requires capturing multi-dimensional expertise, and textual similarity to past papers alone is often insufficient. To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching. P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications. Building on these profiles, P2R adopts a coarse-to-fine pipeline to balance efficiency and depth. It first performs hybrid retrieval that combines semantic and aspect-level signals to form a high-recall candidate pool, and then applies an LLM-based committee to evaluate candidates under strict rubrics, integrating both multi-dimensional expert views and a holistic Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines. Ablation studies further verify the necessity of each component. Overall, P2R highlights the value of explicit, structured expertise modeling and offers practical guidance for applying LLMs to reviewer matching.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes P2R, a training-free framework for paper-reviewer matching that uses general-purpose LLMs to construct explicit structured profiles for both submissions and reviewers, disentangling expertise into Topics, Methodologies, and Applications. It employs a coarse-to-fine pipeline consisting of hybrid retrieval (semantic plus aspect-level signals) to build a candidate pool, followed by rubric-based scoring from an LLM committee that incorporates multi-dimensional expert views and an Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval datasets report consistent outperformance over state-of-the-art baselines, with ablation studies confirming the contribution of each component.

Significance. If the central results hold after validation, this work offers a practical shift from implicit paper-to-paper similarity matching to explicit multi-dimensional expertise modeling in reviewer assignment. The training-free design leveraging off-the-shelf LLMs is a notable strength for deployability, and the rubric-driven committee approach provides a structured way to integrate diverse perspectives, which could inform future systems in scholarly information retrieval.

major comments (2)
  1. [Profile generation pipeline and Experiments section] The outperformance claims on NeurIPS, SIGIR, and SciRepEval rest on the premise that LLM-generated profiles are accurate and disentangled across Topics, Methodologies, and Applications. The manuscript provides no human validation, inter-annotator agreement scores, or error analysis measuring profile fidelity against ground truth (e.g., expert annotations of the same papers/reviewers). Without this, gains cannot be confidently attributed to the structured profiling rather than retrieval heuristics or LLM committee biases.
  2. [Abstract and Experiments section] The abstract states that P2R 'consistently outperforms state-of-the-art baselines' and that 'ablation studies further verify the necessity of each component,' yet reports no concrete metrics (e.g., NDCG, MAP, precision@K), statistical tests, baseline implementation details, or error bars. This absence prevents assessment of effect sizes and reproducibility of the central empirical claim.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., average improvement over the strongest baseline) to give readers an immediate sense of the gains.
  2. [Methods] Notation for the three profile dimensions (Topics, Methodologies, Applications) is introduced clearly but could be reinforced with a small illustrative example table showing a sample paper profile.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that strengthen the empirical grounding of our claims without altering the core contributions.

point-by-point responses (2)
  1. Referee: [Profile generation pipeline and Experiments section] The outperformance claims on NeurIPS, SIGIR, and SciRepEval rest on the premise that LLM-generated profiles are accurate and disentangled across Topics, Methodologies, and Applications. The manuscript provides no human validation, inter-annotator agreement scores, or error analysis measuring profile fidelity against ground truth (e.g., expert annotations of the same papers/reviewers). Without this, gains cannot be confidently attributed to the structured profiling rather than retrieval heuristics or LLM committee biases.

    Authors: We acknowledge that the manuscript currently lacks explicit human validation or inter-annotator agreement metrics for the generated profiles. The ablation studies provide indirect evidence by quantifying performance degradation when structured profiles are removed, and the consistent gains over implicit paper-to-paper baselines support the value of explicit disentanglement. Nevertheless, we agree this is a valid concern. In the revision we will add a dedicated qualitative analysis subsection with representative profile examples across the three dimensions, plus a small-scale human evaluation (on a held-out subset of 50 papers/reviewers) reporting agreement rates between LLM outputs and expert annotations.
    Revision: yes

  2. Referee: [Abstract and Experiments section] The abstract states that P2R 'consistently outperforms state-of-the-art baselines' and that 'ablation studies further verify the necessity of each component,' yet reports no concrete metrics (e.g., NDCG, MAP, precision@K), statistical tests, baseline implementation details, or error bars. This absence prevents assessment of effect sizes and reproducibility of the central empirical claim.

    Authors: The Experiments section contains the full quantitative results (NDCG@10, MAP, Precision@5/10, with error bars and statistical significance tests via paired t-tests against baselines). The abstract was deliberately kept high-level for brevity. To address the concern directly, we will revise the abstract to report the key performance deltas (e.g., +X% NDCG over strongest baseline) and explicitly mention the use of statistical testing. We will also expand the Experiments section with additional baseline implementation details (hyperparameters, LLM versions, retrieval settings) to improve reproducibility.
    Revision: yes
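
For readers who want the metric and the test named in this exchange spelled out, here is a minimal sketch of precision at K over graded relevance labels and of a paired t statistic over per-query scores. The `threshold` parameter is one assumed way to model the "soft" versus "hard" P@N distinction mentioned in the Figure 3 caption; the grading scale, the thresholds, and all numbers are illustrative and are not taken from the paper.

```python
from math import sqrt
from statistics import mean, stdev

def precision_at_k(ranked_ids, relevance, k, threshold=1):
    """Fraction of the top-k reviewers whose graded relevance meets the threshold.

    relevance: dict mapping reviewer ID -> graded label (assumed 0-3 scale).
    threshold: a lower value gives a "soft" metric, a higher one a "hard" metric.
    """
    top = ranked_ids[:k]
    hits = sum(1 for rid in top if relevance.get(rid, 0) >= threshold)
    return hits / k

def paired_t_statistic(system_a, system_b):
    """t statistic for paired per-query scores (degrees of freedom = n - 1).

    Raises ZeroDivisionError if every paired difference is identical.
    """
    if len(system_a) != len(system_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(system_a, system_b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

ranked = ["r3", "r1", "r2", "r5"]
labels = {"r1": 3, "r2": 1, "r3": 2, "r5": 0}
soft_p2 = precision_at_k(ranked, labels, k=2, threshold=2)
hard_p2 = precision_at_k(ranked, labels, k=2, threshold=3)
print(soft_p2, hard_p2)  # 1.0 0.5

# Made-up per-query scores for two systems, used only to exercise the function.
t_stat = paired_t_statistic([0.6, 0.7, 0.5, 0.8], [0.5, 0.5, 0.4, 0.6])
print(round(t_stat, 2))
```

Converting the t statistic to a p-value requires the Student t distribution with n - 1 degrees of freedom, which is left out here to keep the sketch dependency-free.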

Circularity Check

0 steps flagged

No circularity: procedural pipeline with external evaluation

The paper describes a training-free framework consisting of LLM-based structured profiling into Topics/Methodologies/Applications, followed by hybrid retrieval and rubric-based LLM committee scoring. No equations, fitted parameters, or derivation steps are present that reduce outputs to inputs by construction. Claims of outperformance rest on empirical results from external datasets (NeurIPS, SIGIR, SciRepEval) rather than self-referential fits or self-citation chains. The central premise is a proposed procedural pipeline whose validity can be assessed independently via the reported ablations and baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the untested premise that off-the-shelf LLMs produce faithful, disentangled expertise profiles and that rubric scoring by another LLM committee yields reliable rankings; no free parameters or invented entities are declared.

axioms (2)
  • [domain assumption] General-purpose LLMs can extract and structure reviewer and paper expertise into Topics, Methodologies, and Applications without fine-tuning.
    Invoked in the profile-construction step of the proposed framework.
  • [domain assumption] Rubric-based scoring by an LLM committee produces rankings that correlate with actual reviewer suitability.
    Central to the fine-grained evaluation stage.
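
The second assumption is testable in principle by correlating committee scores with human suitability ratings for the same candidates. The sketch below computes Kendall's tau-a, in which tied pairs contribute nothing to the numerator; the choice of this statistic and the example scores are assumptions, not something the paper reports.

```python
from itertools import combinations

def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties add zero."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        direction = (x1 - x2) * (y1 - y2)
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    total_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / total_pairs

# Illustrative values: committee scores and human ratings for four candidates.
committee_scores = [4.5, 4.0, 3.2, 2.0]
human_ratings = [3, 2, 3, 1]
tau = kendall_tau_a(committee_scores, human_ratings)
print(tau)  # 0.5
```

A value near 1 would support the assumption and a value near 0 would undercut it; with heavily tied human ratings, a tie-corrected variant such as tau-b would be the more appropriate choice.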

pith-pipeline@v0.9.0 · 5538 in / 1385 out tokens · 44110 ms · 2026-05-10T18:46:33.475999+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 1 internal anchor

  [1] W. Weijun, N. Zhiyuan, Y. Du, and Z. Yuanchun, “Identifying interdisciplinary sci-tech literature based on multi-label classification,” Data Analysis and Knowledge Discovery, vol. 7, no. 1, pp. 102–112, 2023.
  [2] M. Xiao, M. Wu, Z. Qiao, Y. Fu, Z. Ning, Y. Du, and Y. Zhou, “Interdisciplinary fairness in imbalanced research proposal topic inference: A hierarchical transformer-based method with selective interpolation,” ACM Transactions on Knowledge Discovery from Data, vol. 19, no. 2, pp. 1–21, 2025.
  [3] Z. Qiao, Y. Fu, P. Wang, M. Xiao, Z. Ning, D. Zhang, Y. Du, and Y. Zhou, “Rpt: toward transferable model on heterogeneous researcher data via pre-training,” IEEE Transactions on Big Data, vol. 9, no. 1, pp. 186–199, 2022.
  [4] X. Cai, M. Xiao, Z. Ning, and Y. Zhou, “Resolving the imbalance issue in hierarchical disciplinary topic inference via llm-based data augmentation,” in 2023 IEEE international conference on data mining workshops (ICDMW). IEEE, 2023, pp. 1424–1429.
  [5] L. Ma, R. Zhang, Y. Han, S. Yu, Z. Wang, Z. Ning, J. Zhang, P. Xu, P. Li, W. Ju et al., “A comprehensive survey on vector database: Storage and retrieval technique, challenge,” arXiv preprint arXiv:2310.11703, 2023.
  [6] S. Price and P. A. Flach, “Computational support for academic peer review: a perspective from artificial intelligence,” Communications of the ACM, vol. 60, no. 3, pp. 70–79, 2017.
  [7] J. Wang, Y. Liu, H. Xu, K. Hu, S. Di, W. Ni, L. Yue, M.-L. Zhang, K. Ren, and L. Chen, “When AI reviews science: Can we trust the referee?” The Innovation Informatics, vol. 2, no. 1, p. 100030, 2026.
  [8] X. Liu, T. Suel, and N. Memon, “A robust model for paper reviewer assignment,” in Proceedings of the 8th ACM Conference on Recommender systems, 2014, pp. 25–32.
  [9] Z. Ning, P. Wang, Z. Qiao, P. Wang, and Y. Zhou, “Rethinking graph contrastive learning through relative similarity preservation,” arXiv preprint arXiv:2505.05533, 2025.
  [10] Z. Qiao, Z. Ning, Y. Du, and Y. Zhou, “Context-enhanced entity and relation embedding for knowledge graph completion,” arXiv preprint arXiv:2012.07011, 2020.
  [11] Z. Ning, Z. Wang, R. Zhang, P. Xu, K. Liu, P. Wang, W. Ju, P. Wang, Y. Zhou, E. Cambria et al., “Deep cut-informed graph embedding and clustering,” Information Fusion, p. 103603, 2025.
  [12] H. Dong, Z. Ning, P. Wang, Z. Qiao, P. Wang, Y. Zhou, and Y. Fu, “Adaptive path-memory network for temporal knowledge graph reasoning,” arXiv preprint arXiv:2304.12604, 2023.
  [13] Z. Ning, Z. Qiao, H. Dong, Y. Du, and Y. Zhou, “Lightcake: A lightweight framework for context-aware knowledge graph embedding,” in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2021, pp. 181–193.
  [14] Z. Ning, P. Wang, P. Wang, Z. Qiao, W. Fan, D. Zhang, Y. Du, and Y. Zhou, “Graph soft-contrastive learning via neighborhood ranking,” arXiv preprint arXiv:2209.13964, 2022.
  [15] H. Tong, C. Faloutsos, and J.-Y. Pan, “Fast random walk with restart and its applications,” in Sixth international conference on data mining (ICDM’06). IEEE, 2006, pp. 613–622.
  [16] Z. Ning, C. Tian, M. Xiao, W. Fan, P. Wang, L. Li, P. Wang, and Y. Zhou, “Fedgcs: A generative framework for efficient client selection in federated learning via gradient-based optimization,” arXiv preprint arXiv:2405.06312, 2024.
  [17] I. Beltagy, K. Lo, and A. Cohan, “Scibert: A pretrained language model for scientific text,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3615–3620.
  [18] A. Cohan, S. Feldman, I. Beltagy, D. Downey, and D. Weld, “Specter: Document-level representation learning using citation-informed transformers,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 2270–2282.
  [19] A. Singh, M. D’Arcy, A. Cohan, D. Downey, and S. Feldman, “Scirepeval: A multi-format benchmark for scientific document representations,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 5548–5566.
  [20] Y. Zhang, Y. Shen, S. Kang, X. Chen, B. Jin, and J. Han, “Chain-of-factors paper-reviewer matching,” in Proceedings of the ACM on Web Conference 2025, 2025, pp. 1901–1910.
  [21] L. Charlin, R. S. Zemel, and C. Boutilier, “A framework for optimizing paper matching.” in UAI, vol. 11, 2011, pp. 86–95.
  [22] A. Schrijver et al., Combinatorial optimization: polyhedra and efficiency. Springer, 2003, vol. 24, no. 2.
  [23] M. Saveski, S. Jecmen, N. Shah, and J. Ugander, “Counterfactual evaluation of peer-review assignment policies,” Advances in Neural Information Processing Systems, vol. 36, pp. 58765–58786, 2023.
  [24] L. Charlin and R. S. Zemel, “The toronto paper matching system: An automated paper-reviewer assignment system,” in ICML 2013 Workshop on Peer Reviewing and Publishing Models, 2013.
  [25] D. Mimno and A. McCallum, “Expertise modeling for matching papers with reviewers,” in Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 500–509.
  [26] W. Tang, J. Tang, and C. Tan, “Expertise matching via constraint-based optimization,” in 2010 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology, vol. 1. IEEE, 2010, pp. 34–41.
  [27] M. Ostendorff, N. Rethmeier, I. Augenstein, B. Gipp, and G. Rehm, “Neighborhood contrastive learning for scientific document representations with citation embeddings,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 11670–11688.
  [28] Y. Hou, J. Zhang, Z. Lin, H. Lu, R. Xie, J. McAuley, and W. X. Zhao, “Large language models are zero-shot rankers for recommender systems,” in European Conference on Information Retrieval. Springer, 2024, pp. 364–381.
  [29] V. M. Karan, S. McQuistin, R. Yanagida, C. Perkins, G. Tyson, I. Castro, P. Healey, and M. Purver, “A dataset for expert reviewer recommendation with large language models as zero-shot rankers,” in Proceedings of the 31st International Conference on Computational Linguistics, 2025, pp. 11422–11427.
  [30] Q. Peng, C. Wang, Y. Wang, H. Liu, X. Guo, and W. Wang, “Frontier-revrec: A large-scale dataset for reviewer recommendation,” arXiv preprint arXiv:2510.16597, 2025.
  [31] W. Sun, L. Yan, X. Ma, S. Wang, P. Ren, Z. Chen, D. Yin, and Z. Ren, “Is chatgpt good at search? investigating large language models as re-ranking agents,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 14918–14937.
  [32] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu et al., “A survey on llm-as-a-judge,” The Innovation, p. 101253, 2026.
  [33] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in neural information processing systems, vol. 36, pp. 46595–46623, 2023.
  [34] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025.
  [35] C. Lee, R. Roy, M. Xu, J. Raiman, M. Shoeybi, B. Catanzaro, and W. Ping, “Nv-embed: Improved techniques for training llms as generalist embedding models,” in International Conference on Learning Representations, 2025.
  [36] G. V. Cormack, C. L. Clarke, and S. Buettcher, “Reciprocal rank fusion outperforms condorcet and individual rank learning methods,” in Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval, 2009, pp. 758–759.
  [37] M. Karimzadehgan, C. Zhai, and G. Belford, “Multi-aspect expertise matching for review assignment,” in Proceedings of the 17th ACM conference on Information and knowledge management, 2008, pp. 1113–1122.
  [38] Y. Yu, C. Xiong, S. Sun, C. Zhang, and A. Overwijk, “Coco-dr: Combating the distribution shift in zero-shot dense retrieval with contrastive and distributionally robust learning,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 1462–1479.