Out-of-sample gravity predictions and trade policy counterfactuals

Dennis Novy; Holger Breinlich; J.M.C. Santos Silva; Nick Green; Nicolas Apfel; Tom Zylkin

arxiv: 2509.11271 · v3 · submitted 2025-09-14 · 💰 econ.GN · q-fin.EC

Out-of-sample gravity predictions and trade policy counterfactuals

Nicolas Apfel , Holger Breinlich , Nick Green , Dennis Novy , J.M.C. Santos Silva , Tom Zylkin This is my paper

Pith reviewed 2026-05-18 16:39 UTC · model grok-4.3

classification 💰 econ.GN q-fin.EC

keywords gravity modeltrade policyout-of-sample predictionmachine learningcounterfactual analysisregional trade agreements

0 comments

The pith

The 3-way gravity model is difficult to beat when evaluating trade policy interventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether gravity equations used to study trade policies such as regional agreements can deliver unbiased forecasts on data the model has never seen. The authors introduce a procedure that compares the out-of-sample performance of standard gravity specifications against machine-learning approaches. They show that a three-way gravity model maintains strong predictive accuracy precisely when the task is to simulate the effects of policy changes. This finding matters because trade policy decisions rely on counterfactuals that would be unreliable if the underlying model only fits historical patterns without generalizing.

Core claim

Gravity equations are often used to evaluate the effects of trade policies, such as regional trade agreements. We argue that their suitability for this purpose critically depends on their ability to produce unbiased out-of-sample predictions. We propose a methodology to evaluate the out-of-sample predictions obtained with gravity equations and with machine learning methods. We find that the 3-way gravity model is difficult to beat when the purpose is to evaluate policy interventions, further cementing its position as the predominant tool for applied trade policy analysis. However, when the goal is to predict individual flows, machine learning methods can be preferable.

What carries the argument

The three-way gravity model that accounts for exporter-time, importer-time, and bilateral effects to generate out-of-sample trade predictions.

If this is right

Applied researchers should retain the three-way gravity model as the default tool for trade-policy counterfactuals.
Machine-learning methods become attractive mainly when the goal is to forecast specific bilateral trade flows rather than policy effects.
Out-of-sample validation should be adopted as a routine check before using any model for policy simulation.
Gravity-based estimates of regional trade agreements gain credibility when they pass out-of-sample tests.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The result suggests that the structural restrictions built into gravity models align better with policy-relevant variation than purely data-driven fits.
Hybrid approaches that use gravity predictions as inputs for machine-learning refinement could be tested in future work.
Policymakers can place higher confidence in gravity-derived simulations of new agreements when those simulations have been validated out of sample.

Load-bearing premise

The suitability of gravity equations for evaluating trade policies critically depends on their ability to produce unbiased out-of-sample predictions.

What would settle it

New data from an actual trade policy change where machine-learning predictions show smaller out-of-sample errors than the three-way gravity model for the same policy counterfactual.

read the original abstract

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that the suitability of gravity equations for trade policy evaluation hinges on their out-of-sample predictive performance. It develops a methodology for comparing out-of-sample predictions from gravity models (particularly the 3-way specification) against machine learning alternatives, using held-out data to assess bias in policy counterfactuals. The central finding is that the 3-way gravity model is difficult to beat for evaluating policy interventions on aggregates, while ML methods can outperform for predicting individual trade flows.

Significance. If the results hold under the proposed evaluation design, the paper would provide concrete empirical support for the continued dominance of gravity models in applied trade policy work, while clarifying when ML approaches add value. This addresses a key gap in validating counterfactual predictions and could influence model selection in empirical international economics.

major comments (2)

[§4.1] §4.1: The out-of-sample split procedure for policy counterfactuals is described at a high level but lacks explicit discussion of how the held-out periods or country pairs are chosen to avoid leakage from multilateral resistance terms; this choice is load-bearing for the claim that gravity predictions remain unbiased.
[Table 4] Table 4, policy-aggregate rows: The reported RMSE advantage of the 3-way gravity model over random forests is on the order of 5-8 percent; without standard errors or a formal test for the difference, it is difficult to judge whether this difference is statistically meaningful for the 'difficult to beat' conclusion.

minor comments (2)

[§2] The notation for the three-way fixed effects (exporter-time, importer-time, pair) is introduced in §2 but not consistently carried through the results tables; adding a short footnote or column label would improve readability.
[Figure 3] Figure 3 caption does not state the exact number of observations in the test set or the number of policy interventions evaluated; this detail would help readers assess the scope of the out-of-sample exercise.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. The suggestions help clarify key aspects of our out-of-sample evaluation design and strengthen the interpretation of the results. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses

Referee: [§4.1] §4.1: The out-of-sample split procedure for policy counterfactuals is described at a high level but lacks explicit discussion of how the held-out periods or country pairs are chosen to avoid leakage from multilateral resistance terms; this choice is load-bearing for the claim that gravity predictions remain unbiased.

Authors: We agree that greater transparency on this point is important. In the revised manuscript we have expanded Section 4.1 to describe the split in detail: held-out periods consist of complete post-sample years, and held-out country pairs are selected so that all multilateral resistance terms are estimated exclusively on the training sample. This construction ensures that no information from the held-out observations enters the fixed effects or the subsequent counterfactual predictions, preserving the ex-ante unbiasedness property of the gravity model. revision: yes
Referee: [Table 4] Table 4, policy-aggregate rows: The reported RMSE advantage of the 3-way gravity model over random forests is on the order of 5-8 percent; without standard errors or a formal test for the difference, it is difficult to judge whether this difference is statistically meaningful for the 'difficult to beat' conclusion.

Authors: We appreciate the request for a formal assessment of precision. In the revision we have added bootstrap standard errors (1,000 replications) to the RMSE entries in Table 4 and included a supplementary table that reports the differences together with their standard errors and p-values. The advantage of the 3-way gravity specification remains statistically significant at conventional levels for the policy-aggregate outcomes, supporting the claim that it is difficult to beat for counterfactual evaluation while remaining consistent with the manuscript’s broader finding that machine-learning methods can be preferable for individual-flow prediction. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper conducts an empirical out-of-sample evaluation of gravity models against machine learning alternatives using held-out trade flow data and policy counterfactuals. The central performance claims rest on direct comparisons of predictive accuracy for aggregates and individual flows, with no load-bearing steps that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The methodology is self-contained against external benchmarks and does not rename or smuggle in prior results as new derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is limited to the abstract; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5636 in / 1002 out tokens · 34331 ms · 2026-05-18T16:39:42.348936+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

[1]

Hansen, and M

Ahrens, A., C. Hansen, and M. Schaffer (2023). pystacked: Stacking generalization and machine learning in Stata . Stata Journal\/ 23\/ (4), 909--931

work page 2023
[2]

Anderson, J. E. and E. van Wincoop (2003). Gravity with gravitas: A solution to the border puzzle. American Economic Review\/ 93\/ (1), 170--192

work page 2003
[3]

Baier, S. and J. H. Bergstrand (2007). Do free trade agreements actually increase members' international trade? Journal of International Economics\/ 71\/ (1), 72--95

work page 2007
[4]

Yotov, and T

Baier, S., Y. Yotov, and T. Zylkin (2019). On the widely differing effects of free trade agreements: Lessons from twenty years of trade integration. Journal of International Economics\/ 116 , 206--226

work page 2019
[5]

Breiman, L. (1996). Bagging predictors. Machine Learning\/ 24 , 123--140

work page 1996
[6]

Guimar\ a es, and T

Correia, S., P. Guimar\ a es, and T. Zylkin (2020). Fast Poisson estimation with high-dimensional fixed effects. Stata Journal\/ 20\/ (1), 95--115

work page 2020
[7]

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics\/ 29 , 1189--1232

work page 2001
[8]

Goldberg, P. and N. Pavcnik (2016). The effects of trade policy. Volume 1A of Handbook of Commercial Policy , Chapter 3, pp.\ 161--206. Elsevier

work page 2016
[9]

Monfort, and A

Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo maximum likelihood methods: Applications to P oisson models. Econometrica\/ 52\/ (3), 701--720

work page 1984
[10]

Harrison, A. and A. Rodr\' i guez-Clare (2010). Trade, foreign investment, and industrial policy for developing countries. Volume 5 of Handbook of Development Economics , Chapter 63, pp.\ 4039--4214. Elsevier

work page 2010
[11]

Tibshirani, and J

Hastie, T., R. Tibshirani, and J. Friedman (2008). The Elements of Statistical Learning . Heidelberg: Springer

work page 2008
[12]

Head, K. and T. Mayer (2014). Gravity equations: Workhorse, toolkit, and cookbook. Volume 4 of Handbook of International Economics , Chapter 3, pp.\ 131--195. Elsevier

work page 2014
[13]

Export potential assessment methodology

ITC (2025). Export potential assessment methodology . Available at: https://umbraco.exportpotential.intracen.org/media/cklh2pi5/epa-methodology_230627.pdf

work page 2025
[14]

Kiyota, K. (2025). Mind the gap: Does the lasso improve the performance of the gravity model of foreign direct investment? Mimeo\/

work page 2025
[15]

LeCun, Y. A., L. Bottou, G. B. Orr, and K.-R. M \"u ller (2012). Efficient BackProp , Chapter 1, pp.\ 10--48. Lecture Notes in Computer Science. Springer

work page 2012
[16]

(2025, May)

Moreau-Kastler, N. (2025, May). Proportional treatment effects in staggered settings: An approach for P oisson pseudo-maximum likelihood. Working Papers 031, EU Tax Observatory

work page 2025
[17]

Varoquaux, A

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research\/ 12 , 2825--2830

work page 2011
[18]

Rose, A. K. (2001). Currency unions and trade: The effect is large. Economic Policy\/ 16 , 8--45

work page 2001
[19]

Rose, A. K. (2004). Do we really know that the WTO increases trade? American Economic Review\/ 94\/ (1), 98--114

work page 2004
[20]

Chariag, O

Ruzicska, G., R. Chariag, O. Kiss, and M. Koren (2024). Can machine learning beat gravity in flow prediction? The Econometrics of Multi-dimensional Panels, Chapter 16, pp.\ 511--545. Springer

work page 2024
[21]

Santos Silva , J. M. C. and S. Tenreyro (2006). The log of gravity. Review of Economics and Statistics\/ 88\/ (4), 641--658

work page 2006
[22]

Wolpert, D. H. (1992). Stacked generalization. Neural networks\/ 5 , 241--259

work page 1992
[23]

Yotov, Y. (2025). Gravity for undergrads. Working Paper 202519, Center for Global Policy Analysis, LeBow College of Business, Drexel University

work page 2025
[24]

Yotov, Y. V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016). An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model . Geneva: World Trade Organization

work page 2016

[1] [1]

Hansen, and M

Ahrens, A., C. Hansen, and M. Schaffer (2023). pystacked: Stacking generalization and machine learning in Stata . Stata Journal\/ 23\/ (4), 909--931

work page 2023

[2] [2]

Anderson, J. E. and E. van Wincoop (2003). Gravity with gravitas: A solution to the border puzzle. American Economic Review\/ 93\/ (1), 170--192

work page 2003

[3] [3]

Baier, S. and J. H. Bergstrand (2007). Do free trade agreements actually increase members' international trade? Journal of International Economics\/ 71\/ (1), 72--95

work page 2007

[4] [4]

Yotov, and T

Baier, S., Y. Yotov, and T. Zylkin (2019). On the widely differing effects of free trade agreements: Lessons from twenty years of trade integration. Journal of International Economics\/ 116 , 206--226

work page 2019

[5] [5]

Breiman, L. (1996). Bagging predictors. Machine Learning\/ 24 , 123--140

work page 1996

[6] [6]

Guimar\ a es, and T

Correia, S., P. Guimar\ a es, and T. Zylkin (2020). Fast Poisson estimation with high-dimensional fixed effects. Stata Journal\/ 20\/ (1), 95--115

work page 2020

[7] [7]

Friedman, J. (2001). Greedy function approximation: A gradient boosting machine. Annals of Statistics\/ 29 , 1189--1232

work page 2001

[8] [8]

Goldberg, P. and N. Pavcnik (2016). The effects of trade policy. Volume 1A of Handbook of Commercial Policy , Chapter 3, pp.\ 161--206. Elsevier

work page 2016

[9] [9]

Monfort, and A

Gourieroux, C., A. Monfort, and A. Trognon (1984). Pseudo maximum likelihood methods: Applications to P oisson models. Econometrica\/ 52\/ (3), 701--720

work page 1984

[10] [10]

Harrison, A. and A. Rodr\' i guez-Clare (2010). Trade, foreign investment, and industrial policy for developing countries. Volume 5 of Handbook of Development Economics , Chapter 63, pp.\ 4039--4214. Elsevier

work page 2010

[11] [11]

Tibshirani, and J

Hastie, T., R. Tibshirani, and J. Friedman (2008). The Elements of Statistical Learning . Heidelberg: Springer

work page 2008

[12] [12]

Head, K. and T. Mayer (2014). Gravity equations: Workhorse, toolkit, and cookbook. Volume 4 of Handbook of International Economics , Chapter 3, pp.\ 131--195. Elsevier

work page 2014

[13] [13]

Export potential assessment methodology

ITC (2025). Export potential assessment methodology . Available at: https://umbraco.exportpotential.intracen.org/media/cklh2pi5/epa-methodology_230627.pdf

work page 2025

[14] [14]

Kiyota, K. (2025). Mind the gap: Does the lasso improve the performance of the gravity model of foreign direct investment? Mimeo\/

work page 2025

[15] [15]

LeCun, Y. A., L. Bottou, G. B. Orr, and K.-R. M \"u ller (2012). Efficient BackProp , Chapter 1, pp.\ 10--48. Lecture Notes in Computer Science. Springer

work page 2012

[16] [16]

(2025, May)

Moreau-Kastler, N. (2025, May). Proportional treatment effects in staggered settings: An approach for P oisson pseudo-maximum likelihood. Working Papers 031, EU Tax Observatory

work page 2025

[17] [17]

Varoquaux, A

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in P ython. Journal of Machine Learning Research\/ 12 , 2825--2830

work page 2011

[18] [18]

Rose, A. K. (2001). Currency unions and trade: The effect is large. Economic Policy\/ 16 , 8--45

work page 2001

[19] [19]

Rose, A. K. (2004). Do we really know that the WTO increases trade? American Economic Review\/ 94\/ (1), 98--114

work page 2004

[20] [20]

Chariag, O

Ruzicska, G., R. Chariag, O. Kiss, and M. Koren (2024). Can machine learning beat gravity in flow prediction? The Econometrics of Multi-dimensional Panels, Chapter 16, pp.\ 511--545. Springer

work page 2024

[21] [21]

Santos Silva , J. M. C. and S. Tenreyro (2006). The log of gravity. Review of Economics and Statistics\/ 88\/ (4), 641--658

work page 2006

[22] [22]

Wolpert, D. H. (1992). Stacked generalization. Neural networks\/ 5 , 241--259

work page 1992

[23] [23]

Yotov, Y. (2025). Gravity for undergrads. Working Paper 202519, Center for Global Policy Analysis, LeBow College of Business, Drexel University

work page 2025

[24] [24]

Yotov, Y. V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016). An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model . Geneva: World Trade Organization

work page 2016