Learn to Match: Two-Sided Matching with Temporally Extended Feedback

Boyang Zhou; Haijing Zong; Natasha Jaques; Yancheng Liang

arxiv: 2606.06744 · v2 · pith:64MYQG6Xnew · submitted 2026-06-04 · 💻 cs.LG · cs.GT · cs.MA · econ.TH


Pith reviewed 2026-06-28 02:00 UTC · model grok-4.3

classification 💻 cs.LG · cs.GT · cs.MA · econ.TH

keywords two-sided matching · temporally extended feedback · multi-agent reinforcement learning · partially observable Markov game · social welfare · regret · dynamic matching markets · information friction


The pith

Reinforcement learning agents achieve higher social welfare and lower regret than bandit methods when matching feedback arrives gradually over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper models two-sided matching as a process where information about preferences emerges gradually through costly screening, noisy observations after matches, evolving profiles, and decisions about whether to continue or dissolve pairings. It formulates this setting as a partially observable Markov game and builds the Learn2Match benchmark to let agents decide on interviews, matches, and dissolutions while tracking regret, welfare, and the welfare loss from unrevealed preferences. Experiments in this environment show that independent PPO agents produce better cumulative welfare and lower regret than a bandit-style CA-ETC baseline. The results indicate that multi-agent reinforcement learning can handle dynamic markets better than static bandit approaches, yet still leave gaps in coordinated information gathering.

Core claim

Casting two-sided matching as a partially observable Markov game that incorporates costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution decisions yields the Learn2Match benchmark. In this benchmark, independent PPO policies attain higher cumulative social welfare and lower cumulative regret than the CA-ETC bandit baseline under temporally extended feedback, while incurring higher information-friction loss that measures the welfare gap from incomplete preference revelation.
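The three evaluation quantities named above can be made concrete with a toy calculation. The sketch below is not the paper's definition: it assumes that social welfare is the sum of both sides' true utilities over matched pairs, and that information-friction loss is the welfare of the best full-information matching minus the true welfare of the matching chosen from believed (incompletely revealed) utilities. The function names and the brute-force search are hypothetical and only suit tiny markets.

```python
from itertools import permutations


def welfare(matching, worker_utils, firm_utils):
    """Sum of both sides' utilities over matched (worker, firm) pairs."""
    return sum(worker_utils[w][f] + firm_utils[f][w] for w, f in matching.items())


def best_matching(worker_utils, firm_utils):
    """Brute-force welfare-maximizing one-to-one matching (tiny markets only)."""
    n = len(worker_utils)
    best = max(
        permutations(range(n)),
        key=lambda p: welfare(dict(enumerate(p)), worker_utils, firm_utils),
    )
    return dict(enumerate(best))


def friction_loss(true_w, true_f, believed_w, believed_f):
    """Assumed definition: welfare gap caused by matching on beliefs, not truth."""
    full_info = best_matching(true_w, true_f)
    partial_info = best_matching(believed_w, believed_f)
    # Both matchings are scored with the TRUE utilities.
    return welfare(full_info, true_w, true_f) - welfare(partial_info, true_w, true_f)


if __name__ == "__main__":
    # Made-up 2x2 market: worker i and firm i are ideal partners.
    true_w = [[1.0, 0.0], [0.0, 1.0]]
    true_f = [[1.0, 0.0], [0.0, 1.0]]
    # Beliefs that rank partners the wrong way round.
    wrong = [[0.0, 1.0], [1.0, 0.0]]
    print(friction_loss(true_w, true_f, wrong, wrong))
```

Under this reading, a policy can score well on welfare relative to a baseline and still carry a positive friction loss, which is the pattern the core claim reports for PPO.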

What carries the argument

The partially observable Markov game formulation of two-sided matching with temporally extended feedback, implemented as the Learn2Match multi-agent reinforcement learning benchmark.

If this is right

Decentralized RL policies can improve outcomes in markets where agents must choose whom to interview and when to dissolve matches based on gradually arriving information.
Bandit algorithms that assume immediate sub-Gaussian feedback may leave welfare on the table once matching decisions affect future observations and continuation values.
Effective matching algorithms will need to combine the adaptivity of reinforcement learning with the coordinated exploration structure of bandit methods.
Learn2Match provides a testbed for methods that are adaptive like RL agents, statistically disciplined like bandits, and aware of stability constraints like classical matching theory.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The observed information-friction gap suggests that independent learning may miss opportunities for agents to coordinate on which latent attributes to probe first.
The framework could be used to test whether adding explicit stability constraints to the RL objective reduces regret without sacrificing welfare gains.
Scaling the benchmark to larger numbers of agents would reveal whether the current performance advantage persists when market thickness increases.

Load-bearing premise

Real two-sided matching markets can be faithfully represented as a partially observable Markov game whose state tracks evolving latent profiles, costly screening, noisy observations, and endogenous match continuation decisions.
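To illustrate what "evolving latent profiles" observed through "noisy observations" demands of an agent, here is a minimal sketch of belief tracking for a single scalar latent attribute. It assumes a Gaussian random-walk drift and Gaussian observation noise, which makes the belief update a scalar Kalman filter; the paper's actual dynamics and observation model are not given on this page, and every name and parameter below is hypothetical.

```python
import random


def simulate_and_track(periods, drift_std, obs_std, seed=0):
    """Simulate a drifting latent attribute and track a Gaussian belief over it.

    Returns a list of (latent, belief_mean, belief_var) tuples, one per period.
    """
    rng = random.Random(seed)
    latent = rng.gauss(0.0, 1.0)   # true attribute, hidden from the agent
    mean, var = 0.0, 1.0           # agent's prior belief N(0, 1)
    history = []
    for _ in range(periods):
        # The latent profile drifts between periods (random walk).
        latent += rng.gauss(0.0, drift_std)
        var += drift_std ** 2      # predict step: drift widens the belief
        # A noisy post-match observation of the attribute arrives.
        obs = latent + rng.gauss(0.0, obs_std)
        gain = var / (var + obs_std ** 2)
        mean += gain * (obs - mean)  # update step: move toward the observation
        var *= 1.0 - gain            # and shrink the uncertainty
        history.append((latent, mean, var))
    return history


if __name__ == "__main__":
    static = simulate_and_track(periods=50, drift_std=0.0, obs_std=1.0)
    drifting = simulate_and_track(periods=50, drift_std=0.5, obs_std=1.0)
    print("final belief variance, static profile:  ", static[-1][2])
    print("final belief variance, drifting profile:", drifting[-1][2])
```

With no drift the belief variance keeps shrinking as tenure accumulates, whereas with drift it settles at a positive floor. That floor is one way to see why continuation and dissolution decisions stay informative in this setting instead of becoming a one-time commitment.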

What would settle it

A run of the Learn2Match benchmark in which independent PPO produces neither higher cumulative social welfare nor lower cumulative regret than the CA-ETC baseline under the same temporally extended feedback conditions.

Figures

Figures reproduced from arXiv: 2606.06744 by Boyang Zhou, Haijing Zong, Natasha Jaques, Yancheng Liang.

**Figure 1.** Overview of LEARN2MATCH, a dynamic two-sided matching framework with temporally extended feedback. Agents interview, match, learn gradually during tenure, and decide whether to retain or dissolve relationships, in contrast to traditional bandit matching with immediate one-step feedback.

**Figure 2.** Comparison of LEARN2MATCH (PPO) against CA-ETC in the low-noise, near-static setting in the small market. CA-ETC has near-zero cumulative friction loss. However, PPO still outperforms CA-ETC in both regret and social welfare. Plot text extracted alongside this caption: panel (a) Worker regret, showing cumulative worker regret against market period for PPO and CA-ETC.

**Figure 3.** Comparison of LEARN2MATCH (PPO) against CA-ETC in the temporally extended feedback setting in the small market. PPO outperforms CA-ETC in both regret and social welfare, but CA-ETC has lower friction loss. Plot text extracted alongside this caption: panel (a) Worker regret, showing cumulative worker regret over 600 periods against env steps for PPO.

**Figure 4.** PPO learning curves in the large market, temporally extended feedback setting. Worker…

**Figure 5.** Comparison of LEARN2MATCH (PPO) against CA-ETC in the temporally extended feedback setting in the large market. The result is consistent with the small market: PPO outperforms CA-ETC in both regret and social welfare, but CA-ETC has lower friction loss.

**Figure 6.** Left: interview coverage, the fraction of evaluation environments in which each pair (i, j) was interviewed at least once by the end of the episode. Right: mean cumulative tenure of each pair at the final period. Both panels are from the large market setting.

**Figure 7.** Cumulative per-worker regret of CA-ETC inside…

**Figure 8.** Cumulative per-firm regret of CA-ETC inside…

**Figure 9.** Left: interview coverage. Right: mean cumulative tenure of each pair at the final period.

Original abstract

Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms. Please refer to https://sites.google.com/view/learn-to-match/home for the official website and the code link.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

New POMG benchmark for matching with extended feedback and dissolution is the real contribution; the PPO vs CA-ETC comparison is too thinly reported to judge yet.


The paper's useful move is to treat two-sided matching as a POMG where agents face costly screening, noisy post-match signals, drifting latent types, and the choice to dissolve. That setup captures more of the real process than the usual one-shot or immediate-feedback models. They package it as the Learn2Match benchmark with regret, welfare, and an information-friction metric, and they show independent PPO beating a bandit baseline on the first two while losing on the third. The benchmark itself is the part that could stick around.

The empirical headline is harder to assess from what is shown. The abstract states PPO wins on welfare and regret but gives no run counts, no hyperparameter protocol, no statistical tests, and no description of how CA-ETC was altered to handle latent profiles, screening costs, or dissolution. If the bandit baseline was not given the same observation and action structure, the gap is not yet evidence for MARL. That is the main soft spot; everything else is standard for an early benchmark paper.

The work is aimed at people who already care about dynamic matching, multi-agent RL, or learning in markets. A reader who wants a shared testbed for those communities will get value even if the current experiments are preliminary. The formulation and metric are clear enough that the paper is worth sending to referees, mainly so the baseline implementation and experimental details can be checked and tightened.

Referee Report

2 major / 1 minor

Summary. The paper introduces a framework modeling two-sided matching markets with temporally extended feedback as a partially observable Markov game (POMG) that incorporates costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. It instantiates this in the Learn2Match benchmark for decentralized MARL decisions on interviewing, matching, and dissolving, and reports that independent PPO achieves higher cumulative social welfare and lower cumulative regret than a bandit-style CA-ETC baseline under this setting, while incurring higher information-friction loss; the work positions Learn2Match as a benchmark for algorithms combining RL adaptability, bandit statistical discipline, and stable-matching structure.

Significance. If the empirical comparison is shown to use equivalent information structures and rigorous protocols, the work would establish a useful new benchmark at the intersection of MARL and dynamic matching markets, highlighting both the promise of decentralized RL and the remaining gap in coordinated exploration relative to bandit methods. The explicit information-friction loss metric and support for endogenous decisions are strengths that could drive follow-on research.

major comments (2)

[Abstract] Abstract: the central empirical claim that independent PPO outperforms CA-ETC on welfare and regret supplies no experimental details on run count, hyperparameter search, statistical tests, or benchmark construction; without these the claim cannot be evaluated and is load-bearing for the paper's contribution.
[framework description] Framework description (and abstract): the adaptation of the CA-ETC baseline to the POMG with latent-profile evolution, costly screening, noisy observations, and endogenous dissolution is not specified. If the baseline does not receive the same information structure and action space as the RL agents, any performance gap could be an artifact of an under-powered baseline rather than evidence for MARL.

minor comments (1)

[Abstract] Abstract: the website and code link are mentioned but the manuscript should include a permanent reference or DOI for the benchmark to ensure reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for clearer experimental details and baseline specification. We respond to each major comment below.


Referee: [Abstract] Abstract: the central empirical claim that independent PPO outperforms CA-ETC on welfare and regret supplies no experimental details on run count, hyperparameter search, statistical tests, or benchmark construction; without these the claim cannot be evaluated and is load-bearing for the paper's contribution.

Authors: We agree the abstract is too high-level on this point. Full details appear in Section 5 (20 independent runs, grid-search hyperparameter tuning, paired t-tests at p<0.05) and Section 4 (benchmark construction). In revision we will append a concise experimental clause to the abstract.

Revision: yes

Referee: [framework description] Framework description (and abstract): the adaptation of the CA-ETC baseline to the POMG with latent-profile evolution, costly screening, noisy observations, and endogenous dissolution is not specified. If the baseline does not receive the same information structure and action space as the RL agents, any performance gap could be an artifact of an under-powered baseline rather than evidence for MARL.

Authors: Section 3.3 and Appendix C already describe the extension: CA-ETC maintains belief distributions over evolving latent profiles, uses identical screening and dissolution actions, and receives the same noisy observations. To eliminate ambiguity we will insert an explicit equivalence statement in the main text.

Revision: yes
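The first response refers to paired t-tests over independent runs. As a reading aid, the sketch below shows how such a paired comparison is typically computed with the Python standard library. The welfare numbers are made-up placeholders, not results from the paper, and the function name is hypothetical. It also assumes that runs are paired by seed or environment instance, which the page does not confirm.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-run score differences."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least 2 runs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if spread == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    t_stat = statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


if __name__ == "__main__":
    # Placeholder per-seed cumulative welfare for two methods (illustrative only).
    ppo_welfare = [11.0, 12.0, 13.0, 14.0, 15.0]
    ca_etc_welfare = [10.0, 10.0, 10.0, 10.0, 10.0]
    t_stat, dof = paired_t_statistic(ppo_welfare, ca_etc_welfare)
    print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

The statistic is then compared against the Student t critical value for the given degrees of freedom; for 20 runs (19 degrees of freedom) the two-sided 5% threshold is about 2.093. Pairing only helps if both methods face the same market draws, which is part of what the referee's second comment asks the authors to make explicit.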

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark results


The paper defines a POMG framework for matching with temporally extended feedback, instantiates it as the Learn2Match benchmark, and reports simulation outcomes comparing independent PPO against a CA-ETC baseline on welfare, regret, and information-friction loss. These metrics are computed directly from environment rollouts rather than being algebraically equivalent to any fitted parameters, self-cited uniqueness theorems, or ansatzes inside the paper's own equations. The central empirical claim is therefore an observed simulation result, not a quantity forced by construction or by a self-citation chain; the provided code link further allows external reproduction outside the manuscript's fitted values.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on standard POMG and RL assumptions plus domain-specific modeling choices for matching; no numerical free parameters are reported in the abstract.

axioms (1)

Domain assumption: Two-sided matching markets can be represented as partially observable Markov games with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution.
Invoked when the framework is introduced in the abstract.

invented entities (1)

Learn2Match benchmark (no independent evidence)
Purpose: test environment for decentralized policies over interview, matching, and dissolution decisions under temporally extended feedback.
Newly defined in the paper; no independent evidence supplied beyond the abstract description.



Reference graph

Works this paper leans on

62 extracted references · 9 canonical work pages · 2 internal anchors

[1] Atila Abdulkadiroğlu and Tayfun Sönmez. School choice: A mechanism design approach. American Economic Review, 93(3):729–747, 2003.
[2] Maxwell Allman, Itai Ashlagi, Amin Saberi, and Sophie H Yu. From signaling to interviews in random matching markets. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, pages 1556–1567, 2025.
[3] Joseph G Altonji and Charles R Pierret. Employer learning and statistical discrimination. The Quarterly Journal of Economics, 116(1):313–350, 2001.
[4] Itai Ashlagi, Jiale Chen, Mohammad Roghani, and Amin Saberi. Stable matching with interviews. In 16th Innovations in Theoretical Computer Science Conference (ITCS 2025), pages 12–1. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 2025.
[5] Andreas Athanasopoulos, Anne-Marie George, and Christos Dimitrakakis. Probably correct optimal stable matching for two-sided markets under uncertainty. arXiv preprint arXiv:2501.03018, 2025.
[6] Xabi Azagirre, Akshay Balwally, Guillaume Candeli, Nicholas Chamandy, Benjamin Han, Alona King, Hyungjun Lee, Martin Loncaric, Sébastien Martin, Vijay Narasiman, et al. A better match for drivers and riders: Reinforcement learning at Lyft. INFORMS Journal on Applied Analytics, 54(1):71–83, 2024.
[7] Moshe Babaioff, Rotem Gil, and Assaf Romm. Efficient interview scheduling for stable matching. arXiv preprint arXiv:2602.20358, 2026.
[8] John M Barron, Mark C Berger, and Dan A Black. Employer search, training, and vacancy duration. Economic Inquiry, 35(1):167–192, 1997.
[9] Soumya Basu, Karthik Abinav Sankararaman, and Abishek Sankararaman. Beyond log²(T) regret for decentralized bandits in matching markets. In International Conference on Machine Learning, pages 705–715. PMLR, 2021.
[10] Marc Blatter, Samuel Muehlemann, and Samuel Schenker. The costs of hiring skilled workers. European Economic Review, 56(1):20–35, 2012.
[11] Carlos Carrillo-Tudela, Hermann Gartner, and Leo Kaas. Recruitment policies, job-filling rates, and matching efficiency. Journal of the European Economic Association, 21(6):2413–2459, 2023.
[12] Martin W Cripps, Jeffrey C Ely, George J Mailath, and Larry Samuelson. Common learning. Econometrica, 76(4):909–933, 2008.
[13] Peter A Diamond. Aggregate demand management in search equilibrium. Journal of Political Economy, 90(5):881–894, 1982.
[14] Henry S Farber and Robert Gibbons. Learning and wage dynamics. The Quarterly Journal of Economics, 111(4):1007–1047, 1996.
[15] David Gale and Lloyd S Shapley. College admissions and the stability of marriage. The American Mathematical Monthly, 69(1):9–15, 1962.
[16] Fane Groes, Philipp Kircher, and Iourii Manovskii. The U-shapes of occupational mobility. The Review of Economic Studies, 82(2):659–692, 2015.
[17] Liyi Guo, Junqi Jin, Haoqi Zhang, Zhenzhe Zheng, Zhiye Yang, Zhizhuang Xing, Fei Pan, Lvyin Niu, Fan Wu, Haiyang Xu, et al. We know what you want: An advertising strategy recommender system for online advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 2919–2927, 2021.
[18] Günter J. Hitsch, Ali Hortaçsu, and Dan Ariely. Matching and sorting in online dating. American Economic Review, 100(1):130–163, 2010.
[19] Hadi Hosseini, Sanjukta Roy, and Duohan Zhang. Putting Gale & Shapley to work: Guaranteeing stability through learning. Advances in Neural Information Processing Systems, 37:69043–69068, 2024.
[20] Fali Huang and Peter Cappelli. Employee screening: Theory and evidence, 2006.
[21] Nicole Immorlica, Brendan Lucier, Vahideh Manshadi, and Alexander Wei. Designing approximately optimal search on matching platforms. In Proceedings of the 22nd ACM Conference on Economics and Computation, pages 632–633, 2021.
[22] Meena Jagadeesan, Alexander Wei, Yixin Wang, Michael Jordan, and Jacob Steinhardt. Learning equilibria in matching markets from bandit feedback. Advances in Neural Information Processing Systems, 34:3323–3335, 2021.
[23] Gueorgui Kambourov and Iourii Manovskii. Occupational mobility and wage inequality. The Review of Economic Studies, 76(2):731–759, 2009.
[24] Seth Karten, Wenzhe Li, Zihan Ding, Samuel Kleiner, Yu Bai, and Chi Jin. LLM economist: Large population models and mechanism design in multi-agent generative simulacra. arXiv preprint arXiv:2507.15815, 2025.
[25] Fang Kong and Shuai Li. Player-optimal stable regret for bandit learning in matching markets. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 1512–1522. SIAM, 2023.
[26] Fang Kong, Jingqi Tang, Mingzhu Li, Pinyan Lu, John CS Lui, and Shuai Li. Bandit learning in matching markets with indifference. In The Thirteenth International Conference on Learning Representations, 2025.
[27] Fabian Lange. The speed of employer learning. Journal of Labor Economics, 25(1):1–35, 2007.
[28] Shuai Li, Zilong Wang, and Fang Kong. A survey on bandit learning in matching markets. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, pages 10546–10554, 2025.
[29] Yingkai Li, Yining Wang, Xi Chen, and Yuan Zhou. Tight regret bounds for infinite-armed linear contextual bandits. In International Conference on Artificial Intelligence and Statistics, pages 370–378. PMLR, 2021.
[30] Yuantong Li, Chi-hua Wang, Guang Cheng, and Will Wei Sun. Dynamic matching bandit for two-sided online markets. arXiv preprint arXiv:2205.03699, 2022.
[31] Lydia T Liu, Feng Ruan, Horia Mania, and Michael I Jordan. Bandit learning in decentralized matching markets. Journal of Machine Learning Research, 22(211):1–34, 2021.
[32] Zhihan Liu, Miao Lu, Zhaoran Wang, Michael Jordan, and Zhuoran Yang. Welfare maximization in competitive equilibrium: Reinforcement learning for Markov exchange economy. In International Conference on Machine Learning, pages 13870–13911. PMLR, 2022.
[33] Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in Neural Information Processing Systems, 30, 2017.
[34] John Joseph McCall. Economics of information and job search. The Quarterly Journal of Economics, 84(1):113–126, 1970.
[35] Robert A Miller. Job matching and occupational choice. Journal of Political Economy, 92(6):1086–1120, 1984.
[36] Yifei Min, Tianhao Wang, Ruitu Xu, Zhaoran Wang, Michael Jordan, and Zhuoran Yang. Learn to match with no regret: Reinforcement learning in Markov matching markets. Advances in Neural Information Processing Systems, 35:19956–19970, 2022.
[37] Jacob A Mincer. Schooling and earnings. In Schooling, experience, and earnings, pages 41–63. NBER, 1974.
[38] Amirmahdi Mirfakhar, Xuchuang Wang, Mengfan Xu, Hedyeh Beyhaghi, and Mohammad Hajiesmaili. Bandit learning in matching markets with interviews. arXiv preprint arXiv:2602.12224, 2026. (Work page title: Two-Sided Time-Independent Regret for Matching Markets with Limited Interviews.)
[39] Dale T Mortensen and Christopher A Pissarides. Job creation and job destruction in the theory of unemployment. The Review of Economic Studies, 61(3):397–415, 1994.
[40] Lalith Munasinghe. Wage growth and the theory of turnover. Journal of Labor Economics, 18(2):204–220, 2000.
[41] Tejas Pagare and Avishek Ghosh. Two-sided bandit learning in fully-decentralized matching markets. In ICML 2023 Workshop The Many Facets of Preference-Based Learning, 2023.
[42] Tejas Pagare and Avishek Ghosh. Explore-then-commit algorithms for decentralized two-sided matching markets. In 2024 IEEE International Symposium on Information Theory (ISIT), pages 2092–2097. IEEE, 2024.
[43] Satush Parikh, Soumya Basu, Avishek Ghosh, and Abishek Sankararaman. Competing bandits in decentralized contextual matching markets. arXiv preprint arXiv:2411.11794, 2024.
[44] Christopher A Pissarides. Equilibrium unemployment theory. MIT Press, 2000.
[45] Gaurab Pokharel and Sanmay Das. Converging to stability in two-sided bandits: The case of unknown preferences on both sides of a matching market. arXiv preprint arXiv:2302.06176, 2023.
[46] Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020.
[47] Alvin E Roth. The national residency matching program as a labor market. JAMA, 275(13):1054–1056, 1996.
[48] Alvin E Roth and Marilda Sotomayor. Two-sided matching. Handbook of Game Theory with Economic Applications, 1:485–541, 1992.
[49] Uta Schönberg. Testing for asymmetric employer learning. Journal of Labor Economics, 25(4):651–691, 2007.
[50] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[51]

A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets.The Annals of Applied Statistics, 17(4):2701–2722, 2023

Chengchun Shi, Runzhe Wan, Ge Song, Shikai Luo, Hongtu Zhu, and Rui Song. A multiagent reinforcement learning framework for off-policy evaluation in two-sided markets.The Annals of Applied Statistics, 17(4):2701–2722, 2023

2023
[52]

Optimal match recommendations in two-sided marketplaces with endogenous prices

Peng Shi. Optimal match recommendations in two-sided marketplaces with endogenous prices. Management Science, 71(9):7431–7448, 2025

2025
[53]

The cyclical behavior of equilibrium unemployment and vacancies.American economic review, 95(1):25–49, 2005

Robert Shimer. The cyclical behavior of equilibrium unemployment and vacancies.American economic review, 95(1):25–49, 2005

2005
[54]

Labor turnover costs and the cyclical behavior of vacancies and unemployment.Macroeconomic Dynamics, 13(S1):76–96, 2009

José Ignacio Silva and Manuel Toledo. Labor turnover costs and the cyclical behavior of vacancies and unemployment.Macroeconomic Dynamics, 13(S1):76–96, 2009

2009
[55]

Job mobility and the careers of young men.The Quarterly Journal of Economics, 107(2):439–479, 1992

Robert H Topel and Michael P Ward. Job mobility and the careers of young men.The Quarterly Journal of Economics, 107(2):439–479, 1992

1992
[56] Kun Tu, Bruno Ribeiro, David Jensen, Don Towsley, Benyuan Liu, Hua Jiang, and Xiaodong Wang. Online dating recommendations: matching markets and learning preferences. In Proceedings of the 23rd International Conference on World Wide Web, pages 787–792, 2014
[57] Rui Yan, Ran Le, Yang Song, Tao Zhang, Xiangliang Zhang, and Dongyan Zhao. Interview choice reveals your preference on the market: To improve job-resume matching through profiling memories. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 914–922, 2019
[58] Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of PPO in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022
[59] Kaiqing Zhang et al. Multi-agent reinforcement learning: A selective overview. Foundations and Trends in Machine Learning, 2021
[60] YiRui Zhang and Zhixuan Fang. Decentralized two-sided bandit learning in matching market. In The 40th Conference on Uncertainty in Artificial Intelligence, 2024
[61] Stephan Zheng, Alexander Trott, Sunil Srinivasa, Nikhil Naik, Melvin Gruesbeck, David C Parkes, and Richard Socher. The AI economist: Improving equality and productivity with AI-driven tax policies. arXiv preprint arXiv:2004.13332, 2020

A Implementation details

A.1 PPO implementation details

Both small and large markets use an outside-option penalty of...
Figures 7 and 8 show that cumulative per-worker and per-firm regret rise during the initial exploration block and then flatten as the algorithm commits to its empirical Gale–Shapley matching, reproducing the structure reported in [42]. We show that we can recover exactly the same CA-ETC results under some parameter setting. Figure...
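The rise-then-flatten shape of an explore-then-commit regret curve can be illustrated with a small simulation. The sketch below is not the CA-ETC algorithm of [42] and is not taken from the Learn2Match code: it is a simplified, centralized variant that assumes firm preferences are known, explores with a conflict-free round-robin schedule, and then commits to the worker-proposing Gale–Shapley matching computed on empirical means. All names (`gale_shapley`, `prefs_from_means`, `etc_regret`) and the example utilities are hypothetical.

```python
import random


def gale_shapley(worker_prefs, firm_prefs):
    """Worker-proposing deferred acceptance; returns match[worker] = firm."""
    n = len(worker_prefs)
    # rank[f][w] = position of worker w in firm f's list (lower is better)
    rank = [{w: r for r, w in enumerate(prefs)} for prefs in firm_prefs]
    next_choice = [0] * n   # index of the next firm each worker proposes to
    holder = [None] * n     # holder[f] = worker currently held by firm f
    free = list(range(n))
    while free:
        w = free.pop()
        f = worker_prefs[w][next_choice[w]]
        next_choice[w] += 1
        current = holder[f]
        if current is None:
            holder[f] = w
        elif rank[f][w] < rank[f][current]:
            holder[f] = w
            free.append(current)
        else:
            free.append(w)
    match = [None] * n
    for f, w in enumerate(holder):
        match[w] = f
    return match


def prefs_from_means(means):
    """Rank firms for each worker by descending (estimated) mean utility."""
    return [sorted(range(len(row)), key=lambda f: -row[f]) for row in means]


def etc_regret(mu, firm_prefs, explore_rounds, horizon, noise=0.05, seed=0):
    """Cumulative per-worker regret of a simplified explore-then-commit rule.

    Assumption: regret is measured against the stable matching under the
    true means mu, and exploration is a fixed round-robin schedule.
    """
    n = len(mu)
    if explore_rounds < n:
        raise ValueError("need at least n exploration rounds to sample every pair")
    rng = random.Random(seed)
    target = gale_shapley(prefs_from_means(mu), firm_prefs)
    sums = [[0.0] * n for _ in range(n)]
    counts = [[0] * n for _ in range(n)]
    committed = None
    cumulative, total = [], 0.0
    for t in range(horizon):
        if t < explore_rounds:
            # Round-robin: worker w visits firm (w + t) mod n, so no collisions.
            match = [(w + t) % n for w in range(n)]
            for w, f in enumerate(match):
                sums[w][f] += mu[w][f] + rng.gauss(0.0, noise)
                counts[w][f] += 1
        else:
            if committed is None:
                est = [[sums[w][f] / counts[w][f] for f in range(n)]
                       for w in range(n)]
                committed = gale_shapley(prefs_from_means(est), firm_prefs)
            match = committed
        total += sum(mu[w][target[w]] - mu[w][match[w]] for w in range(n)) / n
        cumulative.append(total)
    return cumulative


# Hypothetical 3x3 market with well-separated mean utilities.
mu = [[0.9, 0.5, 0.1],
      [0.2, 0.8, 0.4],
      [0.3, 0.1, 0.7]]
firm_prefs = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
curve = etc_regret(mu, firm_prefs, explore_rounds=30, horizon=60)
print(round(curve[29], 3), round(curve[-1], 3))
```

In this toy market every worker's true stable partner is also its top choice, so regret accumulates in every exploration cycle and stops growing once the committed matching coincides with the true stable matching. With smaller utility gaps or fewer exploration rounds the commit step can pick a different matching, and the curve would then keep growing linearly after the commit.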
