Beyond Paper-to-Paper: Structured Profiling and Rubric Scoring for Paper-Reviewer Matching
Pith reviewed 2026-05-10 18:46 UTC · model grok-4.3
The pith
A training-free system builds explicit structured profiles of topics, methods, and applications to match papers with reviewers more accurately than paper-similarity baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
P2R shifts reviewer matching from paper-to-paper similarity to explicit profile-based comparison. It uses general LLMs to disentangle expertise into Topics, Methodologies, and Applications for both submissions and reviewers, applies hybrid semantic and aspect retrieval to form candidate pools, and then has an LLM committee rank matches under strict multi-perspective rubrics. The paper reports higher performance than state-of-the-art baselines on three evaluation sets.
What carries the argument
Structured profiles that separate expertise into Topics, Methodologies, and Applications, processed through a coarse-to-fine pipeline of hybrid retrieval and rubric-guided LLM committee scoring.
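To make the idea of a disentangled profile concrete, here is a minimal sketch in Python. The `Profile` class, its field names, and the exact-match Jaccard overlap are illustrative assumptions and are not taken from the paper; P2R builds its profiles with an LLM and compares them with retrieval and rubric scoring rather than string matching.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Profile:
    """Hypothetical structured profile; fields mirror the three aspects."""
    topics: frozenset = field(default_factory=frozenset)
    methodologies: frozenset = field(default_factory=frozenset)
    applications: frozenset = field(default_factory=frozenset)


def jaccard(a, b):
    """Set overlap in [0, 1]; defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def aspect_overlap(paper, reviewer):
    """Compare paper and reviewer aspect by aspect, not as one blob of text."""
    return {
        "topics": jaccard(paper.topics, reviewer.topics),
        "methodologies": jaccard(paper.methodologies, reviewer.methodologies),
        "applications": jaccard(paper.applications, reviewer.applications),
    }


# Made-up example labels, for illustration only.
paper = Profile(
    topics=frozenset({"reviewer assignment", "expertise modeling"}),
    methodologies=frozenset({"llm prompting", "dense retrieval"}),
    applications=frozenset({"peer review"}),
)
reviewer = Profile(
    topics=frozenset({"expertise modeling", "recommender systems"}),
    methodologies=frozenset({"dense retrieval"}),
    applications=frozenset({"peer review", "e-commerce"}),
)
scores = aspect_overlap(paper, reviewer)
```

Keeping the three aspects in separate fields is what lets a matcher say that a reviewer fits the methodology but not the topic, which a single similarity score over past papers cannot express.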
If this is right
- Reviewer assignments improve when expertise is represented explicitly along separate dimensions rather than through overall textual overlap with past publications.
- The same profile construction and rubric evaluation steps can be reused across different conferences without retraining models.
- Ablation results indicate that removing either the hybrid retrieval stage or the rubric committee reduces performance, indicating that both contribute to the observed gains.
- The framework provides a concrete template for applying general LLMs to other ranking or recommendation tasks that need multi-aspect expertise modeling.
Where Pith is reading between the lines
- The same disentangled profiling approach could be tested for matching grants to reviewers or papers to program committees in other fields.
- If LLMs systematically under-represent emerging or interdisciplinary work in the profiles, the method might reinforce existing topic clusters rather than discover novel matches.
- Real-world use would likely need an additional calibration step where area chairs can adjust rubric weights for their specific conference.
- The training-free design lowers the barrier for smaller venues but may require periodic updates as LLM capabilities change.
Load-bearing premise
General-purpose LLMs can reliably extract accurate, disentangled structured profiles of topics, methodologies, and applications from raw paper text without domain-specific training or fine-tuning.
What would settle it
A side-by-side human annotation study showing that LLM-generated profiles frequently misclassify or conflate the three aspects, or a live conference deployment where P2R recommendations receive lower acceptance rates or area-chair satisfaction scores than current methods.
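A side-by-side annotation study of this kind could be scored roughly as follows. This is a sketch under stated assumptions: profiles are plain dictionaries of label lists, labels are compared by exact string match, and each expert label belongs to exactly one aspect. The function names and example labels are hypothetical and do not come from the paper.

```python
ASPECTS = ("topics", "methodologies", "applications")


def aspect_agreement(llm_profile, expert_profile):
    """Per-aspect Jaccard agreement between LLM labels and expert labels."""
    result = {}
    for aspect in ASPECTS:
        llm = set(llm_profile[aspect])
        expert = set(expert_profile[aspect])
        union = llm | expert
        # Two empty label sets count as full agreement.
        result[aspect] = len(llm & expert) / len(union) if union else 1.0
    return result


def conflated_labels(llm_profile, expert_profile):
    """Labels the LLM filed under a different aspect than the expert did.

    Returns sorted (label, expert_aspect, llm_aspect) tuples.
    """
    expert_aspect = {
        label: aspect for aspect in ASPECTS for label in expert_profile[aspect]
    }
    return sorted(
        (label, expert_aspect[label], aspect)
        for aspect in ASPECTS
        for label in llm_profile[aspect]
        if label in expert_aspect and expert_aspect[label] != aspect
    )


# Made-up annotations, for illustration only.
llm = {
    "topics": ["peer review", "contrastive learning"],
    "methodologies": ["rank fusion"],
    "applications": ["conference management"],
}
expert = {
    "topics": ["peer review"],
    "methodologies": ["rank fusion", "contrastive learning"],
    "applications": ["conference management"],
}
agreement = aspect_agreement(llm, expert)
conflated = conflated_labels(llm, expert)
```

Low per-aspect agreement would point to misclassification, while a long `conflated` list would point to the aspects not being cleanly separated, which are the two failure modes named above.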
Original abstract
As conference submission volumes continue to grow, accurately recommending suitable reviewers has become a challenge. Most existing methods follow a "Paper-to-Paper" matching paradigm, implicitly representing a reviewer by their publication history. However, effective reviewer matching requires capturing multi-dimensional expertise, and textual similarity to past papers alone is often insufficient. To address this gap, we propose P2R, a training-free framework that shifts from implicit paper-to-paper matching to explicit profile-based matching. P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications. Building on these profiles, P2R adopts a coarse-to-fine pipeline to balance efficiency and depth. It first performs hybrid retrieval that combines semantic and aspect-level signals to form a high-recall candidate pool, and then applies an LLM-based committee to evaluate candidates under strict rubrics, integrating both multi-dimensional expert views and a holistic Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines. Ablation studies further verify the necessity of each component. Overall, P2R highlights the value of explicit, structured expertise modeling and offers practical guidance for applying LLMs to reviewer matching.
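The coarse stage of the pipeline, combining a semantic ranking with an aspect-level ranking into one candidate pool, can be sketched as below. The abstract does not say how the two signals are fused; reciprocal rank fusion is swapped in here as one common choice, with the customary constant k = 60. The function names, reviewer ids, and pool size are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of reviewer ids into one score per reviewer.

    Each list contributes 1 / (k + rank) for every reviewer it contains.
    """
    fused = {}
    for ranking in rankings:
        for rank, reviewer_id in enumerate(ranking, start=1):
            fused[reviewer_id] = fused.get(reviewer_id, 0.0) + 1.0 / (k + rank)
    return fused


def candidate_pool(semantic_ranking, aspect_ranking, pool_size):
    """Return the top reviewers after fusing the two rankings."""
    fused = reciprocal_rank_fusion([semantic_ranking, aspect_ranking])
    # Sort by fused score, descending; break ties by id for determinism.
    ordered = sorted(fused, key=lambda rid: (-fused[rid], rid))
    return ordered[:pool_size]


# Made-up rankings: one from embedding similarity, one from aspect matching.
semantic_ranking = ["r1", "r2", "r3", "r4"]
aspect_ranking = ["r3", "r1", "r5", "r2"]
pool = candidate_pool(semantic_ranking, aspect_ranking, pool_size=3)
```

Rank-based fusion avoids having to calibrate scores from two different retrievers against each other, which suits a training-free setting. The fine stage would then pass only `pool` to the more expensive LLM committee.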
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes P2R, a training-free framework for paper-reviewer matching that uses general-purpose LLMs to construct explicit structured profiles for both submissions and reviewers, disentangling expertise into Topics, Methodologies, and Applications. It employs a coarse-to-fine pipeline consisting of hybrid retrieval (semantic plus aspect-level signals) to build a candidate pool, followed by rubric-based scoring from an LLM committee that incorporates multi-dimensional expert views and an Area Chair perspective. Experiments on NeurIPS, SIGIR, and SciRepEval datasets report consistent outperformance over state-of-the-art baselines, with ablation studies confirming the contribution of each component.
Significance. If the central results hold after validation, this work offers a practical shift from implicit paper-to-paper similarity matching to explicit multi-dimensional expertise modeling in reviewer assignment. The training-free design leveraging off-the-shelf LLMs is a notable strength for deployability, and the rubric-driven committee approach provides a structured way to integrate diverse perspectives, which could inform future systems in scholarly information retrieval.
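One way to picture the rubric-driven committee is as an aggregation over per-aspect expert scores and a holistic Area Chair score. The sketch below assumes a 1 to 5 rubric scale, an unweighted mean over the three expert views, and an equal split between experts and chair. None of these choices are stated in the report; they are placeholders, and in the real system the scores would come from LLM calls rather than hard-coded numbers.

```python
from statistics import mean

RUBRIC_MIN, RUBRIC_MAX = 1, 5  # assumed rubric scale


def committee_score(expert_scores, area_chair_score, chair_weight=0.5):
    """Blend per-aspect expert scores with a holistic Area Chair score."""
    all_scores = list(expert_scores.values()) + [area_chair_score]
    if any(not RUBRIC_MIN <= s <= RUBRIC_MAX for s in all_scores):
        raise ValueError("rubric scores must lie in the assumed 1..5 range")
    expert_avg = mean(expert_scores.values())
    return (1 - chair_weight) * expert_avg + chair_weight * area_chair_score


def rank_candidates(judgments, chair_weight=0.5):
    """judgments maps reviewer id -> (expert_scores, area_chair_score)."""
    totals = {
        rid: committee_score(experts, chair, chair_weight)
        for rid, (experts, chair) in judgments.items()
    }
    return sorted(totals, key=lambda rid: (-totals[rid], rid))


# Made-up committee judgments for three candidate reviewers.
judgments = {
    "r1": ({"topics": 5, "methodologies": 3, "applications": 4}, 4),
    "r2": ({"topics": 3, "methodologies": 3, "applications": 3}, 4),
    "r3": ({"topics": 2, "methodologies": 4, "applications": 3}, 2),
}
ranking = rank_candidates(judgments)
```

Exposing `chair_weight` as a parameter reflects the kind of knob a venue might want to adjust; a different aggregation, such as a minimum over aspects, would be equally easy to substitute.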
major comments (2)
- [Profile generation pipeline and Experiments section] The outperformance claims on NeurIPS, SIGIR, and SciRepEval rest on the premise that LLM-generated profiles are accurate and disentangled across Topics, Methodologies, and Applications. The manuscript provides no human validation, inter-annotator agreement scores, or error analysis measuring profile fidelity against ground truth (e.g., expert annotations of the same papers/reviewers). Without this, gains cannot be confidently attributed to the structured profiling rather than retrieval heuristics or LLM committee biases.
- [Abstract and Experiments section] The abstract states that P2R 'consistently outperforms state-of-the-art baselines' and that 'ablation studies further verify the necessity of each component,' yet reports no concrete metrics (e.g., NDCG, MAP, precision@K), statistical tests, baseline implementation details, or error bars. This absence prevents assessment of effect sizes and reproducibility of the central empirical claim.
minor comments (2)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., average improvement over the strongest baseline) to give readers an immediate sense of the gains.
- [Methods] Notation for the three profile dimensions (Topics, Methodologies, Applications) is introduced clearly but could be reinforced with a small illustrative example table showing a sample paper profile.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and commit to revisions that strengthen the empirical grounding of our claims without altering the core contributions.
- Referee: [Profile generation pipeline and Experiments section] The outperformance claims on NeurIPS, SIGIR, and SciRepEval rest on the premise that LLM-generated profiles are accurate and disentangled across Topics, Methodologies, and Applications. The manuscript provides no human validation, inter-annotator agreement scores, or error analysis measuring profile fidelity against ground truth (e.g., expert annotations of the same papers/reviewers). Without this, gains cannot be confidently attributed to the structured profiling rather than retrieval heuristics or LLM committee biases.
  Authors: We acknowledge that the manuscript currently lacks explicit human validation or inter-annotator agreement metrics for the generated profiles. The ablation studies provide indirect evidence by quantifying performance degradation when structured profiles are removed, and the consistent gains over implicit paper-to-paper baselines support the value of explicit disentanglement. Nevertheless, we agree this is a valid concern. In the revision we will add a dedicated qualitative analysis subsection with representative profile examples across the three dimensions, plus a small-scale human evaluation (on a held-out subset of 50 papers/reviewers) reporting agreement rates between LLM outputs and expert annotations.
  Revision: yes
- Referee: [Abstract and Experiments section] The abstract states that P2R 'consistently outperforms state-of-the-art baselines' and that 'ablation studies further verify the necessity of each component,' yet reports no concrete metrics (e.g., NDCG, MAP, precision@K), statistical tests, baseline implementation details, or error bars. This absence prevents assessment of effect sizes and reproducibility of the central empirical claim.
  Authors: The Experiments section contains the full quantitative results (NDCG@10, MAP, Precision@5/10, with error bars and statistical significance tests via paired t-tests against baselines). The abstract was deliberately kept high-level for brevity. To address the concern directly, we will revise the abstract to report the key performance deltas (e.g., +X% NDCG over strongest baseline) and explicitly mention the use of statistical testing. We will also expand the Experiments section with additional baseline implementation details (hyperparameters, LLM versions, retrieval settings) to improve reproducibility.
  Revision: yes
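For readers unfamiliar with the metrics named in this exchange, the following sketch shows how NDCG@k, Precision@k, and a paired t statistic are typically computed. It uses the linear-gain form of DCG (an exponential-gain variant also exists), computes the ideal ranking from the same list it is given, and returns only the t statistic, not a p-value. Which variants the paper uses is not stated here, and the example scores are made up.

```python
import math
from statistics import mean, stdev


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with linear gain and log2 discount."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """relevances: graded relevance of ranked reviewers, best-ranked first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def precision_at_k(relevances, k):
    """Fraction of the top k positions holding a relevant reviewer."""
    return sum(1 for rel in relevances[:k] if rel > 0) / k


def paired_t_statistic(system_scores, baseline_scores):
    """t statistic over per-query score differences (same queries, same order)."""
    diffs = [s - b for s, b in zip(system_scores, baseline_scores)]
    if len(diffs) < 2:
        raise ValueError("need at least two paired observations")
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Made-up per-paper NDCG scores for a system and a baseline.
t_stat = paired_t_statistic([0.6, 0.7, 0.8], [0.5, 0.5, 0.5])
```

The t statistic would be compared against a t distribution with n - 1 degrees of freedom to obtain the significance level the authors refer to.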
Circularity Check
No circularity: procedural pipeline with external evaluation
The paper describes a training-free framework consisting of LLM-based structured profiling into Topics/Methodologies/Applications, followed by hybrid retrieval and rubric-based LLM committee scoring. No equations, fitted parameters, or derivation steps are present that reduce outputs to inputs by construction. Claims of outperformance rest on empirical results from external datasets (NeurIPS, SIGIR, SciRepEval) rather than self-referential fits or self-citation chains. The central premise is a proposed procedural pipeline whose validity can be assessed independently via the reported ablations and baselines.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: General-purpose LLMs can extract and structure reviewer and paper expertise into Topics, Methodologies, and Applications without fine-tuning.
- domain assumption: Rubric-based scoring by an LLM committee produces rankings that correlate with actual reviewer suitability.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: P2R uses general-purpose LLMs to construct structured profiles for both submissions and reviewers, disentangling them into Topics, Methodologies, and Applications.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: Experiments on NeurIPS, SIGIR, and SciRepEval show that P2R consistently outperforms state-of-the-art baselines.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.