arXiv: 2606.06744 · v2 · pith:64MYQG6Xnew · submitted 2026-06-04 · cs.LG · cs.GT · cs.MA · econ.TH

Learn to Match: Two-Sided Matching with Temporally Extended Feedback

Pith reviewed 2026-06-28 02:00 UTC · model grok-4.3

classification cs.LG · cs.GT · cs.MA · econ.TH
keywords two-sided matching · temporally extended feedback · multi-agent reinforcement learning · partially observable Markov game · social welfare · regret · dynamic matching markets · information friction

The pith

Reinforcement learning agents achieve higher social welfare and lower regret than bandit methods when matching feedback arrives gradually over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models two-sided matching as a process where information about preferences emerges gradually through costly screening, noisy observations after matches, evolving profiles, and decisions about whether to continue or dissolve pairings. It formulates this setting as a partially observable Markov game and builds the Learn2Match benchmark to let agents decide on interviews, matches, and dissolutions while tracking regret, welfare, and the welfare loss from unrevealed preferences. Experiments in this environment show that independent PPO agents produce better cumulative welfare and lower regret than a bandit-style CA-ETC baseline. The results indicate that multi-agent reinforcement learning can handle dynamic markets better than static bandit approaches, yet it still leaves gaps in coordinated information gathering.

Core claim

Casting two-sided matching as a partially observable Markov game that incorporates costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution decisions yields the Learn2Match benchmark. In this benchmark, independent PPO policies attain higher cumulative social welfare and lower cumulative regret than the CA-ETC bandit baseline under temporally extended feedback, while incurring higher information-friction loss that measures the welfare gap from incomplete preference revelation.
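To make the three evaluation quantities concrete, here is a minimal sketch; it is not the benchmark's code. It assumes per-period welfare is the sum of realized match utilities, regret is measured against a reference utility per agent (for example, its stable-match utility), and information-friction loss is the gap between welfare under fully revealed preferences and welfare actually achieved. All function names, the exact definitions, and the numbers are hypothetical; the paper's formal definitions may differ.

```python
from itertools import accumulate

def cumulative(values):
    """Running sum of a per-period series."""
    return list(accumulate(values))

def social_welfare(match_utils):
    """Per-period welfare: sum of realized utilities over matched pairs (assumed definition)."""
    return sum(match_utils)

def regret(reference_util, realized_util):
    """Per-period regret of one agent against a reference utility (assumed definition)."""
    return reference_util - realized_util

def friction_loss(full_info_welfare, achieved_welfare):
    """Assumed definition: welfare under fully revealed preferences minus achieved welfare."""
    return full_info_welfare - achieved_welfare

# Toy three-period trace with made-up numbers, for illustration only.
achieved = [social_welfare(u) for u in ([1.0, 0.5], [1.5, 1.0], [2.0, 1.5])]
full_info = [4.0, 4.0, 4.0]
cum_welfare = cumulative(achieved)
cum_friction = cumulative(friction_loss(f, a) for f, a in zip(full_info, achieved))
```

Under these assumed definitions a policy can have high cumulative welfare and still accumulate friction loss, which is the pattern the core claim reports for PPO.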

What carries the argument

The partially observable Markov game formulation of two-sided matching with temporally extended feedback, implemented as the Learn2Match multi-agent reinforcement learning benchmark.

If this is right

  • Decentralized RL policies can improve outcomes in markets where agents must choose whom to interview and when to dissolve matches based on gradually arriving information.
  • Bandit algorithms that assume immediate sub-Gaussian feedback may leave welfare on the table once matching decisions affect future observations and continuation values.
  • Effective matching algorithms will need to combine the adaptivity of reinforcement learning with the coordinated exploration structure of bandit methods.
  • Learn2Match provides a testbed for methods that are adaptive like RL agents, statistically disciplined like bandits, and aware of stability constraints like classical matching theory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed information-friction gap suggests that independent learning may miss opportunities for agents to coordinate on which latent attributes to probe first.
  • The framework could be used to test whether adding explicit stability constraints to the RL objective reduces regret without sacrificing welfare gains.
  • Scaling the benchmark to larger numbers of agents would reveal whether the current performance advantage persists when market thickness increases.

Load-bearing premise

Real two-sided matching markets can be faithfully represented as a partially observable Markov game whose state tracks evolving latent profiles, costly screening, noisy observations, and endogenous match continuation decisions.
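The ingredients of this premise can be illustrated with a toy single-pair transition, offered as a sketch under stated assumptions rather than the Learn2Match environment. The class `PairState`, the function `step`, the action names, the Gaussian random-walk drift, and the choice of reward (latent quality while matched, a fixed cost per interview) are all hypothetical.

```python
import random
from dataclasses import dataclass

@dataclass
class PairState:
    latent_quality: float  # hidden from the agents (assumed scalar profile)
    matched: bool = False
    tenure: int = 0

def step(state, action, rng, drift_sd=0.05, obs_sd=0.5, interview_cost=0.1):
    """Advance one market period for a single worker-firm pair.

    action is one of "interview", "match", "dissolve", "wait" (hypothetical set).
    Returns (noisy observation or None, reward).
    """
    # The latent profile evolves whether or not anyone observes it.
    state.latent_quality += rng.gauss(0.0, drift_sd)
    if action == "interview":
        # Costly pre-match screening yields a noisy signal.
        return state.latent_quality + rng.gauss(0.0, obs_sd), -interview_cost
    if action == "match":
        state.matched = True
    elif action == "dissolve":
        state.matched, state.tenure = False, 0
    if state.matched:
        state.tenure += 1
        # Post-match feedback is a noisy view of the current latent quality.
        return state.latent_quality + rng.gauss(0.0, obs_sd), state.latent_quality
    return None, 0.0

rng = random.Random(0)
s = PairState(latent_quality=1.0)
obs, reward = step(s, "interview", rng)
step(s, "match", rng)
step(s, "wait", rng)
tenure_before = s.tenure
step(s, "dissolve", rng)
```

Because observations arrive only through interviews or tenure, an agent in this toy model never sees `latent_quality` directly, which is what makes the game partially observable.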

What would settle it

A run of the Learn2Match benchmark in which independent PPO produces neither higher cumulative social welfare nor lower cumulative regret than the CA-ETC baseline under the same temporally extended feedback conditions.
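The falsification condition can be written as a small check over paired runs. This is a sketch under assumptions: it compares mean per-seed totals and also computes a paired t statistic as one possible way to gauge the size of the gap. The function names and the numbers are hypothetical and are not results from the paper.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for per-seed differences a[i] - b[i] (needs at least 2 seeds)."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def falsified(ppo_welfare, etc_welfare, ppo_regret, etc_regret):
    """True if PPO shows neither higher mean welfare nor lower mean regret."""
    higher_welfare = mean(ppo_welfare) > mean(etc_welfare)
    lower_regret = mean(ppo_regret) < mean(etc_regret)
    return (not higher_welfare) and (not lower_regret)

# Made-up per-seed cumulative totals, for illustration only.
ppo_w, etc_w = [10.0, 12.0, 11.0], [8.0, 9.0, 9.0]
ppo_r, etc_r = [3.0, 2.0, 4.0], [5.0, 5.0, 6.0]
t_welfare = paired_t(ppo_w, etc_w)
is_falsified = falsified(ppo_w, etc_w, ppo_r, etc_r)
```

Pairing by seed matters only if both methods are evaluated on the same environment instances; whether the paper does so is not stated in the text above.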

Figures

Figures reproduced from arXiv: 2606.06744 by Boyang Zhou, Haijing Zong, Natasha Jaques, Yancheng Liang.

Figure 1: Overview of LEARN2MATCH, a dynamic two-sided matching framework with temporally extended feedback. Agents interview, match, learn gradually during tenure, and decide whether to retain or dissolve relationships, in contrast to traditional bandit matching with immediate one-step feedback.
Figure 2: Comparison of LEARN2MATCH (PPO) against CA-ETC in Low-noise / near-static setting in the small market. CA-ETC has near-zero cumulative friction loss. However, PPO still outperforms CA-ETC in both regret and social welfare.
[Plot residue following this caption: panel (a) Worker regret; x-axis: market period; y-axis: cumulative worker regret; series: PPO, CA-ETC.]
Figure 3: Comparison of LEARN2MATCH (PPO) against CA-ETC in the temporally extended feedback setting in the small market. PPO outperforms CA-ETC in both regret and social welfare, but CA-ETC has lower friction loss.
[Plot residue following this caption: panel (a) Worker regret; x-axis: env steps; y-axis: cumulative worker regret over 600 periods; series: PPO.]
Figure 4: PPO learning curves in the large market, temporally extended feedback setting. Worker…
Figure 5: Comparison of LEARN2MATCH (PPO) against CA-ETC in the temporally extended feedback setting in the large market. The result is consistent with the small market. PPO outperforms CA-ETC in both regret and social welfare, but CA-ETC has lower friction loss.
Figure 6: Left: interview coverage, the fraction of evaluation environments in which each pair (i, j) was interviewed at least once by the end of the episode. Right: mean cumulative tenure of each pair at the final period. Both figures are from the large market setting.
Figure 7: Cumulative per-worker regret of CA-ETC inside…
Figure 9: Left: interview coverage. Right: mean cumulative tenure of each pair at the final period.
Figure 8: Cumulative per-firm regret of CA-ETC inside…
Original abstract

Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback, that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms. Please refer to https://sites.google.com/view/learn-to-match/home for the official website and the code link.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces a framework modeling two-sided matching markets with temporally extended feedback as a partially observable Markov game (POMG) that incorporates costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. It instantiates this in the Learn2Match benchmark for decentralized MARL decisions on interviewing, matching, and dissolving, and reports that independent PPO achieves higher cumulative social welfare and lower cumulative regret than a bandit-style CA-ETC baseline under this setting, while incurring higher information-friction loss; the work positions Learn2Match as a benchmark for algorithms combining RL adaptability, bandit statistical discipline, and stable-matching structure.

Significance. If the empirical comparison is shown to use equivalent information structures and rigorous protocols, the work would establish a useful new benchmark at the intersection of MARL and dynamic matching markets, highlighting both the promise of decentralized RL and the remaining gap in coordinated exploration relative to bandit methods. The explicit information-friction loss metric and support for endogenous decisions are strengths that could drive follow-on research.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim that independent PPO outperforms CA-ETC on welfare and regret supplies no experimental details on run count, hyperparameter search, statistical tests, or benchmark construction; without these the claim cannot be evaluated and is load-bearing for the paper's contribution.
  2. [framework description] Framework description (and abstract): the adaptation of the CA-ETC baseline to the POMG with latent-profile evolution, costly screening, noisy observations, and endogenous dissolution is not specified. If the baseline does not receive the same information structure and action space as the RL agents, any performance gap could be an artifact of an under-powered baseline rather than evidence for MARL.
minor comments (1)
  1. [Abstract] Abstract: the website and code link are mentioned but the manuscript should include a permanent reference or DOI for the benchmark to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer experimental details and baseline specification. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that independent PPO outperforms CA-ETC on welfare and regret supplies no experimental details on run count, hyperparameter search, statistical tests, or benchmark construction; without these the claim cannot be evaluated and is load-bearing for the paper's contribution.

    Authors: We agree the abstract is too high-level on this point. Full details appear in Section 5 (20 independent runs, grid-search hyperparameter tuning, paired t-tests at p<0.05) and Section 4 (benchmark construction). In revision we will append a concise experimental clause to the abstract. revision: yes

  2. Referee: [framework description] Framework description (and abstract): the adaptation of the CA-ETC baseline to the POMG with latent-profile evolution, costly screening, noisy observations, and endogenous dissolution is not specified. If the baseline does not receive the same information structure and action space as the RL agents, any performance gap could be an artifact of an under-powered baseline rather than evidence for MARL.

    Authors: Section 3.3 and Appendix C already describe the extension: CA-ETC maintains belief distributions over evolving latent profiles, uses identical screening and dissolution actions, and receives the same noisy observations. To eliminate ambiguity we will insert an explicit equivalence statement in the main text. revision: yes
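For readers unfamiliar with the baseline under discussion, the commit step of an explore-then-commit matching scheme can be sketched as deferred acceptance run on empirical preference estimates. This is a generic illustration under assumptions, not the CA-ETC algorithm as specified in the paper or in its cited source: it omits the exploration schedule, belief updates over evolving profiles, and any decentralized communication. The helper names and the estimates are hypothetical.

```python
def gale_shapley(worker_prefs, firm_prefs):
    """Worker-proposing deferred acceptance.

    worker_prefs[w] is w's ranking of firms (best first); firm_prefs[f] likewise.
    Returns a dict mapping each matched firm to its worker.
    """
    rank = {f: {w: i for i, w in enumerate(p)} for f, p in firm_prefs.items()}
    next_choice = {w: 0 for w in worker_prefs}
    held = {}                  # firm -> worker currently held
    free = list(worker_prefs)
    while free:
        w = free.pop()
        if next_choice[w] >= len(worker_prefs[w]):
            continue           # w has exhausted its list and stays unmatched
        f = worker_prefs[w][next_choice[w]]
        next_choice[w] += 1
        if f not in held:
            held[f] = w
        elif rank[f][w] < rank[f][held[f]]:
            free.append(held[f])   # firm trades up; previous worker is free again
            held[f] = w
        else:
            free.append(w)         # proposal rejected; w tries its next firm
    return held

def rankings_from_means(means):
    """Turn empirical mean rewards {agent: {partner: mean}} into best-first rankings."""
    return {a: sorted(m, key=m.get, reverse=True) for a, m in means.items()}

# Made-up exploration-phase estimates, for illustration only.
worker_means = {"w1": {"f1": 0.9, "f2": 0.4}, "w2": {"f1": 0.8, "f2": 0.7}}
firm_means = {"f1": {"w1": 0.3, "w2": 0.6}, "f2": {"w1": 0.5, "w2": 0.2}}
matching = gale_shapley(rankings_from_means(worker_means),
                        rankings_from_means(firm_means))
```

Once such a scheme commits, it stops gathering information, so if latent profiles keep drifting the committed matching can become stale. That is one reason the referee's question about how the baseline was adapted is material.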

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark results

Full rationale

The paper defines a POMG framework for matching with temporally extended feedback, instantiates it as the Learn2Match benchmark, and reports simulation outcomes comparing independent PPO against a CA-ETC baseline on welfare, regret, and information-friction loss. These metrics are computed directly from environment rollouts rather than being algebraically equivalent to any fitted parameters, self-cited uniqueness theorems, or ansatzes inside the paper's own equations. The central empirical claim is therefore an observed simulation result, not a quantity forced by construction or by a self-citation chain; the provided code link further allows external reproduction outside the manuscript's fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on standard POMG and RL assumptions plus domain-specific modeling choices for matching; no numerical free parameters are reported in the abstract.

axioms (1)
  • domain assumption: Two-sided matching markets can be represented as partially observable Markov games with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution.
    Invoked when the framework is introduced in the abstract.
invented entities (1)
  • Learn2Match benchmark (no independent evidence)
    purpose: Test environment for decentralized policies over interview, matching, and dissolution decisions under temporally extended feedback
    Newly defined in the paper; no independent evidence supplied beyond the abstract description.

Reference graph

Works this paper leans on

62 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    School choice: A mechanism design approach

    Atila Abdulkadiroğlu and Tayfun Sönmez. School choice: A mechanism design approach. American economic review, 93(3):729–747, 2003

  2. [2]

    From signaling to interviews in random matching markets

    Maxwell Allman, Itai Ashlagi, Amin Saberi, and Sophie H Yu. From signaling to interviews in random matching markets. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1556–1567, 2025

  3. [3]

    Employer learning and statistical discrimination

    Joseph G Altonji and Charles R Pierret. Employer learning and statistical discrimination. The quarterly journal of economics, 116(1):313–350, 2001

  4. [4]

    Stable matching with interviews

    Itai Ashlagi, Jiale Chen, Mohammad Roghani, and Amin Saberi. Stable matching with interviews. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025), pages 12–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2025

  5. [5]

    Probably correct optimal stable matching for two-sided markets under uncertainty

    Andreas Athanasopoulos, Anne-Marie George, and Christos Dimitrakakis. Probably correct optimal stable matching for two-sided markets under uncertainty. arXiv preprint arXiv:2501.03018, 2025

  6. [6]

    A better match for drivers and riders: Reinforcement learning at lyft

    Xabi Azagirre, Akshay Balwally, Guillaume Candeli, Nicholas Chamandy, Benjamin Han, Alona King, Hyungjun Lee, Martin Loncaric, Sébastien Martin, Vijay Narasiman, et al. A better match for drivers and riders: Reinforcement learning at lyft. INFORMS Journal on Applied Analytics, 54(1):71–83, 2024

  7. [7]

    Efficient interview scheduling for stable matching

    Moshe Babaioff, Rotem Gil, and Assaf Romm. Efficient interview scheduling for stable matching. arXiv preprint arXiv:2602.20358, 2026

  8. [8]

    Employer search, training, and vacancy duration

    John M Barron, Mark C Berger, and Dan A Black. Employer search, training, and vacancy duration. Economic inquiry, 35(1):167–192, 1997

  9. [9]

    Beyond log2(t) regret for decentralized bandits in matching markets

    Soumya Basu, Karthik Abinav Sankararaman, and Abishek Sankararaman. Beyond log2(t) regret for decentralized bandits in matching markets. In International Conference on Machine Learning, pages 705–715. PMLR, 2021

  10. [10]

    The costs of hiring skilled workers

    Marc Blatter, Samuel Muehlemann, and Samuel Schenker. The costs of hiring skilled workers. European Economic Review, 56(1):20–35, 2012

  11. [11]

    Recruitment policies, job-filling rates, and matching efficiency

    Carlos Carrillo-Tudela, Hermann Gartner, and Leo Kaas. Recruitment policies, job-filling rates, and matching efficiency. Journal of the European Economic Association, 21(6):2413–2459, 2023

  12. [12]

    Common learning

    Martin W Cripps, Jeffrey C Ely, George J Mailath, and Larry Samuelson. Common learning. Econometrica, 76(4):909–933, 2008

  13. [13]

    Aggregate demand management in search equilibrium

    Peter A Diamond. Aggregate demand management in search equilibrium. Journal of political Economy, 90(5):881–894, 1982

  14. [14]

    Learning and wage dynamics

    Henry S Farber and Robert Gibbons. Learning and wage dynamics. The Quarterly Journal of Economics, 111(4):1007–1047, 1996

  15. [15]

    College admissions and the stability of marriage

    David Gale and Lloyd S Shapley. College admissions and the stability of marriage. The American mathematical monthly, 69(1):9–15, 1962

  16. [16]

    The u-shapes of occupational mobility

    Fane Groes, Philipp Kircher, and Iourii Manovskii. The u-shapes of occupational mobility. The Review of Economic Studies, 82(2):659–692, 2015

  17. [17]

    We know what you want: An advertising strategy recommender system for online advertising

    Liyi Guo, Junqi Jin, Haoqi Zhang, Zhenzhe Zheng, Zhiye Yang, Zhizhuang Xing, Fei Pan, Lvyin Niu, Fan Wu, Haiyang Xu, et al. We know what you want: An advertising strategy recommender system for online advertising. In Proceedings of the 27th ACM SIGKDD conference on knowledge discovery & data mining, pages 2919–2927, 2021

  18. [18]

    Matching and sorting in online dating

    Günter J. Hitsch, Ali Hortaçsu, and Dan Ariely. Matching and sorting in online dating. American Economic Review, 100(1):130–163, 2010

  19. [19]

    Putting gale & shapley to work: Guaranteeing stability through learning

    Hadi Hosseini, Sanjukta Roy, and Duohan Zhang. Putting gale & shapley to work: Guaranteeing stability through learning. Advances in Neural Information Processing Systems, 37:69043–69068, 2024

  20. [20]

    Employee screening: theory and evidence

    Fali Huang and Peter Cappelli. Employee screening: theory and evidence, 2006

  21. [21]

    Designing approximately optimal search on matching platforms

    Nicole Immorlica, Brendan Lucier, Vahideh Manshadi, and Alexander Wei. Designing approximately optimal search on matching platforms. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 632–633, 2021

  22. [22]

    Learning equilibria in matching markets from bandit feedback

    Meena Jagadeesan, Alexander Wei, Yixin Wang, Michael Jordan, and Jacob Steinhardt. Learning equilibria in matching markets from bandit feedback. Advances in Neural Information Processing Systems, 34:3323–3335, 2021

  23. [23]

    Occupational mobility and wage inequality

    Gueorgui Kambourov and Iourii Manovskii. Occupational mobility and wage inequality. The Review of Economic Studies, 76(2):731–759, 2009

  24. [24]

    Llm economist: Large population models and mechanism design in multi-agent generative simulacra

    Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, and Chi Jin. Llm economist: Large population models and mechanism design in multi-agent generative simulacra. arXiv preprint arXiv:2507.15815, 2025

  25. [25]

    Player-optimal stable regret for bandit learning in matching markets

    Fang Kong and Shuai Li. Player-optimal stable regret for bandit learning in matching markets. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1512–1522. SIAM, 2023

  26. [26]

    Bandit learning in matching markets with indifference

    Fang Kong, Jingqi Tang, Mingzhu Li, Pinyan Lu, John CS Lui, and Shuai Li. Bandit learning in matching markets with indifference. In The Thirteenth International Conference on Learning Representations, 2025

  27. [27]

    The speed of employer learning

    Fabian Lange. The speed of employer learning. Journal of Labor Economics, 25(1):1–35, 2007

  28. [28]

    A survey on bandit learning in matching markets

    Shuai Li, Zilong Wang, and Fang Kong. A survey on bandit learning in matching markets. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 10546–10554, 2025

  29. [29]

    Tight regret bounds for infinite-armed linear contextual bandits

    Yingkai Li, Yining Wang, Xi Chen, and Yuan Zhou. Tight regret bounds for infinite-armed linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 370–378. PMLR, 2021

  30. [30]

    Dynamic matching bandit for two-sided online markets

    Yuantong Li, Chi-hua Wang, Guang Cheng, and Will Wei Sun. Dynamic matching bandit for two-sided online markets. arXiv preprint arXiv:2205.03699, 2022

  31. [31]

    Bandit learning in decentralized matching markets

    Lydia T Liu, Feng Ruan, Horia Mania, and Michael I Jordan. Bandit learning in decentralized matching markets. Journal of Machine Learning Research, 22(211):1–34, 2021

  32. [32]

    Welfare maximization in competitive equilibrium: Reinforcement learning for markov exchange economy

    Zhihan Liu, Miao Lu, Zhaoran Wang, Michael Jordan, and Zhuoran Yang. Welfare maximization in competitive equilibrium: Reinforcement learning for markov exchange economy. In International Conference on Machine Learning, pages 13870–13911. PMLR, 2022

  33. [33]

    Multi-agent actor-critic for mixed cooperative-competitive environments

    Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30, 2017

  34. [34]

    Economics of information and job search

    John Joseph McCall. Economics of information and job search. The Quarterly Journal of Economics, 84(1):113–126, 1970

  35. [35]

    Job matching and occupational choice

    Robert A Miller. Job matching and occupational choice. Journal of Political economy, 92(6):1086–1120, 1984

  36. [36]

    Learn to match with no regret: Reinforcement learning in markov matching markets

    Yifei Min, Tianhao Wang, Ruitu Xu, Zhaoran Wang, Michael Jordan, and Zhuoran Yang. Learn to match with no regret: Reinforcement learning in markov matching markets. Advances in Neural Information Processing Systems, 35:19956–19970, 2022

  37. [37]

    Schooling and earnings

    Jacob A Mincer. Schooling and earnings. In Schooling, experience, and earnings, pages 41–63. NBER, 1974

  38. [38]

    Two-Sided Time-Independent Regret for Matching Markets with Limited Interviews

    Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, and Mohammad Hajiesmaili. Bandit learning in matching markets with interviews. arXiv preprint arXiv:2602.12224, 2026

  39. [39]

    Job creation and job destruction in the theory of unemployment

    Dale T Mortensen and Christopher A Pissarides. Job creation and job destruction in the theory of unemployment. The review of economic studies, 61(3):397–415, 1994

  40. [40]

    Wage growth and the theory of turnover

    Lalith Munasinghe. Wage growth and the theory of turnover. Journal of Labor Economics, 18(2):204–220, 2000

  41. [41]

    Two-sided bandit learning in fully-decentralized matching markets

    Tejas Pagare and Avishek Ghosh. Two-sided bandit learning in fully-decentralized matching markets. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023

  42. [42]

    Explore-then-commit algorithms for decentralized two-sided matching markets

    Tejas Pagare and Avishek Ghosh. Explore-then-commit algorithms for decentralized two-sided matching markets. In 2024 IEEE International Symposium on Information Theory (ISIT), pages 2092–2097. IEEE, 2024

  43. [43]

    Competing bandits in decentralized contextual matching markets

    Satush Parikh, Soumya Basu, Avishek Ghosh, and Abishek Sankararaman. Competing bandits in decentralized contextual matching markets. arXiv preprint arXiv:2411.11794, 2024

  44. [44]

    Equilibrium unemployment theory

    Christopher A Pissarides. Equilibrium unemployment theory. MIT press, 2000

  45. [45]

    Converging to stability in two-sided bandits: The case of unknown preferences on both sides of a matching market

    Gaurab Pokharel and Sanmay Das. Converging to stability in two-sided bandits: The case of unknown preferences on both sides of a matching market. arXiv preprint arXiv:2302.06176, 2023

  46. [46]

    Monotonic value function factorisation for deep multi-agent reinforcement learning

    Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020

  47. [47]

    The national residency matching program as a labor market

    Alvin E Roth. The national residency matching program as a labor market. JAMA, 275(13):1054–1056, 1996

  48. [48]

    Two-sided matching

    Alvin E Roth and Marilda Sotomayor. Two-sided matching. Handbook of game theory with economic applications, 1:485–541, 1992

  49. [49]

    Testing for asymmetric employer learning

    Uta Schönberg. Testing for asymmetric employer learning. Journal of Labor Economics, 25(4):651–691, 2007

  50. [50]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  51. [51]

    A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets

    Chengchun Shi, Runzhe Wan, Ge Song, Shikai Luo, Hongtu Zhu, and Rui Song. A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets. The Annals of Applied Statistics, 17(4):2701–2722, 2023

  52. [52]

    Optimal match recommendations in two-sided marketplaces with endogenous prices

    Peng Shi. Optimal match recommendations in two-sided marketplaces with endogenous prices. Management Science, 71(9):7431–7448, 2025

  53. [53]

    The cyclical behavior of equilibrium unemployment and vacancies

    Robert Shimer. The cyclical behavior of equilibrium unemployment and vacancies. American economic review, 95(1):25–49, 2005

  54. [54]

    Labor turnover costs and the cyclical behavior of vacancies and unemployment

    José Ignacio Silva and Manuel Toledo. Labor turnover costs and the cyclical behavior of vacancies and unemployment. Macroeconomic Dynamics, 13(S1):76–96, 2009

  55. [55]

    Job mobility and the careers of young men

    Robert H Topel and Michael P Ward. Job mobility and the careers of young men. The Quarterly Journal of Economics, 107(2):439–479, 1992

  56. [56]

    Online dating recommendations: matching markets and learning preferences

    Kun Tu, Bruno Ribeiro, David Jensen, Don Towsley, Benyuan Liu, Hua Jiang, and Xiaodong Wang. Online dating recommendations: matching markets and learning preferences. In Proceedings of the 23rd international conference on world wide web, pages 787–792, 2014

  57. [57]

    Interview choice reveals your preference on the market: To improve job-resume matching through profiling memories

    Rui Yan, Ran Le, Yang Song, Tao Zhang, Xiangliang Zhang, and Dongyan Zhao. Interview choice reveals your preference on the market: To improve job-resume matching through profiling memories. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 914–922, 2019

  58. [58]

    The surprising effectiveness of ppo in cooperative multi-agent games

    Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in neural information processing systems, 35:24611–24624, 2022

  59. [59]

    Multi-agent reinforcement learning: A selective overview

    Kaiqing Zhang et al. Multi-agent reinforcement learning: A selective overview. Foundations and Trends in Machine Learning, 2021

  60. [60]

    Decentralized two-sided bandit learning in matching market

    YiRui Zhang and Zhixuan Fang. Decentralized two-sided bandit learning in matching market. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024

  61. [61]

    The ai economist: Improving equality and productivity with ai-driven tax policies

    Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C Parkes, and Richard Socher. The ai economist: Improving equality and productivity with ai-driven tax policies. arXiv preprint arXiv:2004.13332, 2020

  62. [62]

    We prove that we are able to show that we can cover exactly the same CA-ETC results under some parameter setting

    Figures 7 and 8 show that cumulative per-worker and per-firm regret rise during the initial exploration block and then flatten as the algorithm commits to its empirical Gale–Shapley matching, reproducing the structure reported in [42]. We prove that we are able to show that we can cover exactly the same CA-ETC results under some parameter setting. Figure...