Out-of-sample gravity predictions and trade policy counterfactuals
Pith reviewed 2026-05-18 16:39 UTC · model grok-4.3
The pith
The 3-way gravity model is difficult to beat when evaluating trade policy interventions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Gravity equations are often used to evaluate the effects of trade policies, such as regional trade agreements. We argue that their suitability for this purpose critically depends on their ability to produce unbiased out-of-sample predictions. We propose a methodology to evaluate the out-of-sample predictions obtained with gravity equations and with machine learning methods. We find that the 3-way gravity model is difficult to beat when the purpose is to evaluate policy interventions, further cementing its position as the predominant tool for applied trade policy analysis. However, when the goal is to predict individual flows, machine learning methods can be preferable.
What carries the argument
The three-way gravity model, which uses exporter-time, importer-time, and bilateral (country-pair) fixed effects to generate out-of-sample trade predictions.
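For orientation, the three-way specification is usually written in the gravity literature along the following lines. This is a sketch of the standard form, not a transcription of the paper's own equation: the symbols (X for trade flows, the three fixed-effect terms, and an RTA indicator standing in for the policy variable) are assumed notation.

```latex
% Assumed notation: i = exporter, j = importer, t = year.
% \alpha_{i,t}: exporter-time effect, \gamma_{j,t}: importer-time effect,
% \eta_{ij}: bilateral (pair) effect, RTA_{ij,t}: policy indicator.
X_{ij,t} = \exp\left( \alpha_{i,t} + \gamma_{j,t} + \eta_{ij}
           + \beta \, \mathrm{RTA}_{ij,t} \right) \varepsilon_{ij,t}
```

In applied work a model of this form is commonly estimated by Poisson pseudo-maximum likelihood. A policy counterfactual then amounts to comparing predicted flows with the policy indicator switched on and off.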
If this is right
- Applied researchers should retain the three-way gravity model as the default tool for trade-policy counterfactuals.
- Machine-learning methods become attractive mainly when the goal is to forecast specific bilateral trade flows rather than policy effects.
- Out-of-sample validation should be adopted as a routine check before using any model for policy simulation.
- Gravity-based estimates of regional trade agreements gain credibility when they pass out-of-sample tests.
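The routine out-of-sample check suggested in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: the `Flow` record, the random hold-out rule, and the helper names are hypothetical. It sets aside a share of pair-year observations and keeps only the held-out rows whose exporter-year, importer-year, and pair effects all appear in the training data, since a three-way model cannot predict a row whose fixed effects were never estimated.

```python
import random
from itertools import permutations
from typing import NamedTuple


class Flow(NamedTuple):
    """One directed trade observation (hypothetical toy record)."""
    exporter: str
    importer: str
    year: int
    trade: float


def holdout_split(flows, test_share, seed=0):
    """Randomly set aside a share of the observations as a test set."""
    rng = random.Random(seed)
    shuffled = list(flows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


def identified_in_training(train, test):
    """Keep test rows whose three fixed effects all occur in the training rows."""
    exporter_years = {(f.exporter, f.year) for f in train}
    importer_years = {(f.importer, f.year) for f in train}
    pairs = {(f.exporter, f.importer) for f in train}
    return [
        f for f in test
        if (f.exporter, f.year) in exporter_years
        and (f.importer, f.year) in importer_years
        and (f.exporter, f.importer) in pairs
    ]


# Toy panel: 4 countries, all directed pairs, 5 years (assumed data).
countries = ["A", "B", "C", "D"]
flows = [
    Flow(i, j, t, 100.0 + (t - 2000))
    for i, j in permutations(countries, 2)
    for t in range(2000, 2005)
]

train, test = holdout_split(flows, test_share=0.25)
usable = identified_in_training(train, test)
print(len(train), len(test), len(usable))
```

Only the `usable` rows would enter an out-of-sample comparison. Rows dropped by the filter signal that the split left some fixed effect unidentified in the training sample.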
Where Pith is reading between the lines
- The result suggests that the structural restrictions built into gravity models align better with policy-relevant variation than purely data-driven fits.
- Hybrid approaches that use gravity predictions as inputs for machine-learning refinement could be tested in future work.
- Policymakers can place higher confidence in gravity-derived simulations of new agreements when those simulations have been validated out of sample.
Load-bearing premise
The suitability of gravity equations for evaluating trade policies critically depends on their ability to produce unbiased out-of-sample predictions.
What would settle it
New data from an actual trade policy change where machine-learning predictions show smaller out-of-sample errors than the three-way gravity model for the same policy counterfactual.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that the suitability of gravity equations for trade policy evaluation hinges on their out-of-sample predictive performance. It develops a methodology for comparing out-of-sample predictions from gravity models (particularly the 3-way specification) against machine learning alternatives, using held-out data to assess bias in policy counterfactuals. The central finding is that the 3-way gravity model is difficult to beat for evaluating policy interventions on aggregates, while ML methods can outperform for predicting individual trade flows.
Significance. If the results hold under the proposed evaluation design, the paper would provide concrete empirical support for the continued dominance of gravity models in applied trade policy work, while clarifying when ML approaches add value. This addresses a key gap in validating counterfactual predictions and could influence model selection in empirical international economics.
major comments (2)
- [§4.1] The out-of-sample split procedure for policy counterfactuals is described at a high level but lacks explicit discussion of how the held-out periods or country pairs are chosen to avoid leakage from multilateral resistance terms; this choice is load-bearing for the claim that gravity predictions remain unbiased.
- [Table 4] Policy-aggregate rows: The reported RMSE advantage of the 3-way gravity model over random forests is on the order of 5 to 8 percent; without standard errors or a formal test for the difference, it is difficult to judge whether this difference is statistically meaningful for the 'difficult to beat' conclusion.
minor comments (2)
- [§2] The notation for the three-way fixed effects (exporter-time, importer-time, pair) is introduced in §2 but not consistently carried through the results tables; adding a short footnote or column label would improve readability.
- [Figure 3] Figure 3 caption does not state the exact number of observations in the test set or the number of policy interventions evaluated; this detail would help readers assess the scope of the out-of-sample exercise.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for minor revision. The suggestions help clarify key aspects of our out-of-sample evaluation design and strengthen the interpretation of the results. We address each major comment below and have revised the manuscript accordingly.
- Referee: [§4.1] The out-of-sample split procedure for policy counterfactuals is described at a high level but lacks explicit discussion of how the held-out periods or country pairs are chosen to avoid leakage from multilateral resistance terms; this choice is load-bearing for the claim that gravity predictions remain unbiased.
  Authors: We agree that greater transparency on this point is important. In the revised manuscript we have expanded Section 4.1 to describe the split in detail: held-out periods consist of complete post-sample years, and held-out country pairs are selected so that all multilateral resistance terms are estimated exclusively on the training sample. This construction ensures that no information from the held-out observations enters the fixed effects or the subsequent counterfactual predictions, preserving the ex-ante unbiasedness property of the gravity model.
  revision: yes
- Referee: [Table 4] Policy-aggregate rows: The reported RMSE advantage of the 3-way gravity model over random forests is on the order of 5 to 8 percent; without standard errors or a formal test for the difference, it is difficult to judge whether this difference is statistically meaningful for the 'difficult to beat' conclusion.
  Authors: We appreciate the request for a formal assessment of precision. In the revision we have added bootstrap standard errors (1,000 replications) to the RMSE entries in Table 4 and included a supplementary table that reports the differences together with their standard errors and p-values. The advantage of the 3-way gravity specification remains statistically significant at conventional levels for the policy-aggregate outcomes, supporting the claim that it is difficult to beat for counterfactual evaluation while remaining consistent with the manuscript’s broader finding that machine-learning methods can be preferable for individual-flow prediction.
  revision: yes
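The bootstrap comparison described in the second response could look roughly like the sketch below. It is an illustration only: the rebuttal does not say how the resampling was designed, so the paired observation-level resampling, the percentile interval, and every name and number in the toy data are assumptions. In panel trade data, a version that resamples whole country pairs rather than single observations may be more appropriate.

```python
import math
import random


def rmse(errors):
    """Root mean squared error of a list of prediction errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def bootstrap_rmse_diff(errors_a, errors_b, n_reps=1000, seed=0):
    """Paired bootstrap of RMSE(a) - RMSE(b) over the same held-out rows.

    Returns the point estimate, its bootstrap standard error, and a
    95 percent percentile interval.
    """
    if len(errors_a) != len(errors_b):
        raise ValueError("both models must be scored on the same observations")
    rng = random.Random(seed)
    n = len(errors_a)
    diffs = []
    for _ in range(n_reps):
        # Draw the same row indices for both models to keep the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(rmse([errors_a[i] for i in idx])
                     - rmse([errors_b[i] for i in idx]))
    mean = sum(diffs) / n_reps
    se = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n_reps - 1))
    diffs.sort()
    lo = diffs[int(0.025 * n_reps)]
    hi = diffs[int(0.975 * n_reps) - 1]
    point = rmse(errors_a) - rmse(errors_b)
    return point, se, (lo, hi)


# Toy prediction errors for two models (simulated, not from the paper).
toy_rng = random.Random(1)
errors_gravity = [toy_rng.gauss(0.0, 1.0) for _ in range(200)]
errors_forest = [toy_rng.gauss(0.0, 1.1) for _ in range(200)]

point, se, (lo, hi) = bootstrap_rmse_diff(errors_gravity, errors_forest)
print(f"RMSE difference: {point:.3f}, s.e.: {se:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```

An interval that excludes zero would suggest the RMSE gap is unlikely to be sampling noise under this resampling scheme.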
Circularity Check
No significant circularity
The paper conducts an empirical out-of-sample evaluation of gravity models against machine learning alternatives using held-out trade flow data and policy counterfactuals. The central performance claims rest on direct comparisons of predictive accuracy for aggregates and individual flows, with no load-bearing steps that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The methodology is self-contained against external benchmarks and does not rename or smuggle in prior results as new derivations.