The roots of inequality: estimating inequality of opportunity from regression trees and forests*
Published date | 01 October 2023 |
Author | Paolo Brunori, Paul Hufe, Daniel Mahler |
DOI | http://doi.org/10.1111/sjoe.12530 |
Scand. J. of Economics 125(4), 900–932, 2023
Paolo Brunori†
London School of Economics, London, WC2A 2AE, UK
paolo.brunori@unifi.it
Paul Hufe‡
University of Bristol, Bristol, BS8 1TU, UK
paul.hufe@bristol.ac.uk
Daniel Mahler
World Bank, Washington, DC 20433, USA
dmahler@worldbank.org
Abstract
We propose the use of machine learning methods to estimate inequality of opportunity, and
we illustrate that regression trees and forests represent a substantial improvement over existing
approaches: they reduce the risk of ad hoc model selection and trade off upward and downward
bias in inequality of opportunity estimates. The advantages of regression trees and forests are
illustrated by an empirical application for a cross-section of 31 European countries. We show that
arbitrary model selection might lead to significant biases in inequality of opportunity estimates
relative to our preferred method. These biases are reflected in both point estimates and country
rankings.
Keywords: Equality of opportunity; machine learning; random forests
JEL classification: C38; D31; D63
*We thank Chiara Binelli, Marc Fleurbaey, Torsten Hothorn, Niels Johannesen, Andreas Peichl,
Giuseppe Pignataro, Dominik Sachs, Jan Stuhler, Dirk Van de gaer, and Achim Zeileis for
useful comments and suggestions. Furthermore, we are grateful for the comments received from
seminar audiences at Princeton University, the University of Perugia, the University of Essex,
the World Bank, ifo Munich, the University of Copenhagen, Canazei Winter School 2018, the
European Commission JRC at Ispra, the EBE Meeting 2018, IIPF 2018, and the Equal Chances
Conference in Bari. Any errors remain our own.
†Also affiliated with the University of Florence, Italy.
‡Also affiliated with IZA and CESifo.
© 2023 The Authors. The Scandinavian Journal of Economics published by John Wiley & Sons Ltd on behalf of Föreningen för utgivande av the SJE.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution
and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Equality of opportunity is an important ideal of distributive justice. It
has widespread support among the general public and its realization has
been identified as an important goal of public policy intervention (Cappelen
et al., 2007; Corak, 2013; Chetty et al., 2016; Alesina et al., 2018). In spite
of its popularity, it is notoriously difficult to provide empirical estimates of
equality of opportunity. Beyond normative disagreement about the precise factors that
should be viewed as contributing to unequal opportunities, current estimation
approaches are encumbered by ad hoc model selection that leads researchers
to overestimate or underestimate inequality of opportunity.
In this paper, we propose the use of machine learning methods to overcome
the issue of ad hoc model selection. Machine learning methods allow for
flexible models of how unequal opportunities come about while imposing
statistical discipline through criteria of out-of-sample replicability. These
features serve to establish estimates of inequality of opportunity that are less
prone to upward or downward bias.
The empirical literature on the measurement of unequal opportunities has
been flourishing since the ground-breaking contribution by Roemer (1998),
Equality of Opportunity. At the heart of Roemer’s formulation is the idea that
individual outcomes are determined by two sorts of factors: those factors over
which individuals have control, which he calls “effort”, and those factors for
which individuals cannot be held responsible, which he calls “circumstances”.
While outcome differences due to effort exertion are morally permissible,
differences due to circumstances are inequitable and call for compensation.1
Grounded on this distinction, measures of inequality of opportunity quantify
the extent to which individual outcomes are predicted by circumstance
characteristics. They are usually calculated in a two-step procedure. First,
researchers predict an outcome of interest from observable circumstances.
Second, they calculate inequality in the distribution of predicted outcomes:
the more predicted outcomes diverge, the more strongly circumstances are associated
with outcomes, and the greater is inequality of opportunity.
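The two-step procedure can be sketched in a few lines of code. The following is a minimal illustration under stated assumptions, not the authors' implementation: step one predicts each outcome by the mean outcome of all individuals who share the same circumstances, and step two applies the Gini coefficient to the predictions. The choice of the Gini coefficient, the function names, and the toy data are assumptions made purely for illustration.

```python
from collections import defaultdict


def gini(values):
    """Gini coefficient of a list of non-negative values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Formula on sorted data: G = 2 * sum(rank_i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def predict_from_circumstances(circumstances, outcomes):
    """Step 1: predict each outcome by the mean outcome of its circumstance type."""
    groups = defaultdict(list)
    for c, y in zip(circumstances, outcomes):
        groups[c].append(y)
    type_means = {c: sum(ys) / len(ys) for c, ys in groups.items()}
    return [type_means[c] for c in circumstances]


def inequality_of_opportunity(circumstances, outcomes):
    """Step 2: inequality in the distribution of predicted outcomes."""
    return gini(predict_from_circumstances(circumstances, outcomes))


# Hypothetical toy data: one binary circumstance and four incomes.
background = ["low", "low", "high", "high"]
income = [10, 30, 40, 80]

print(inequality_of_opportunity(background, income))  # 0.25
print(gini(income))                                   # 0.34375
```

In this toy example, the predicted distribution (20, 20, 60, 60) is less unequal than the observed one, because within-type differences are attributed to factors other than the observed circumstance.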
Current approaches to estimate inequality of opportunity suffer from
biases that are the consequence of critical choices in model selection. First,
researchers have to decide which circumstance variables to consider for
estimation.2 The challenge of this task grows with the increasing availability
1 The distinction between circumstances and efforts underpins many prominent branches of the
economics literature, such as the ones on intergenerational mobility (Chetty et al., 2014a,b), the
gender pay gap (Blau and Kahn, 2017), and racial differences (Kreisman and Rangel, 2015). For
different notions of equality of opportunity, see Arneson (2018).
2 Roemer does not provide a fixed list of circumstance variables. Instead, he suggests that
the set of circumstances should evolve from a political process (Roemer and Trannoy, 2015).
of high-quality datasets that provide very detailed information with respect
to individual circumstances (Björklund et al., 2012; Hufe et al., 2017). On
the one hand, discarding relevant circumstances from the estimation model
limits the explanatory scope of circumstances and leads to downward-biased
estimates of inequality of opportunity (Ferreira and Gignoux, 2011). On the
other hand, including too many circumstances overfits the data and leads to
upward-biased estimates of inequality of opportunity (Brunori et al., 2019).
Second, researchers must choose a functional form according to which
circumstances co-produce the outcome of interest. For example, it is a
well-established finding that the influence of socio-economic disadvantages
during childhood on life outcomes varies by biological sex (Dahl and
Lochner, 2012; Chetty et al., 2016). In contrast to such evidence, many
empirical applications presume that the effect of circumstances on individual
outcomes is log–linear and additive while abstracting from possible interaction
effects (Bourguignon et al., 2007; Ferreira and Gignoux, 2011). On the one
hand, restrictive functional form assumptions limit the ability of circumstances
to explain variation in the outcome of interest and thus force a downward bias
on inequality of opportunity estimates. On the other hand, limitations in the
available degrees of freedom might render a statistically meaningful estimation
of complex models with many parameters infeasible.
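The two opposing biases can be made concrete with a small numerical sketch. It is an illustration under stated assumptions rather than the estimator used in this paper: predictions are type means, inequality is measured by the Gini coefficient, and the data, variable names, and the "saturated" specification (so many circumstances that every individual forms a type of their own) are hypothetical.

```python
from collections import defaultdict


def gini(values):
    """Gini coefficient of a list of non-negative values."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


def iop_estimate(types, outcomes):
    """Gini coefficient of type-mean predictions (illustrative estimator)."""
    groups = defaultdict(list)
    for t, y in zip(types, outcomes):
        groups[t].append(y)
    means = {t: sum(ys) / len(ys) for t, ys in groups.items()}
    return gini([means[t] for t in types])


# Hypothetical toy data: six incomes and one observed circumstance.
income = [10, 30, 20, 40, 80, 60]
parental_education = ["low", "low", "low", "high", "high", "high"]

# Discarding all circumstances: a single type, so predictions do not vary.
no_circumstances = ["all"] * len(income)

# Saturated specification: every individual is a separate type, so the
# predictions reproduce the observed incomes exactly (in-sample overfitting).
saturated = [(edu, i) for i, edu in enumerate(parental_education)]

omitted = iop_estimate(no_circumstances, income)
coarse = iop_estimate(parental_education, income)
overfit = iop_estimate(saturated, income)
total = gini(income)

print(omitted, coarse, overfit, total)
```

In this toy data, the estimate rises from zero with no circumstances to the level of total outcome inequality under the saturated specification, regardless of whether the additional partitioning variables carry any real information. The example is in-sample only; it does not implement the out-of-sample criteria that discipline model selection in tree-based methods.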
This discussion highlights the non-trivial challenge of selecting the
appropriate model for estimating inequality of opportunity. Researchers
must balance different sources of bias while avoiding ad hoc solutions.
While this task is daunting for the individual researcher, it is a standard
application for machine learning algorithms that are designed to make
out-of-sample predictions of a dependent variable based on a number of
observable predictors. In this paper, we use conditional inference regression
trees and forests to estimate inequality of opportunity (Hothorn et al., 2006).
Introduced by Morgan and Sonquist (1963) and later popularized by Breiman
et al. (1984) and Breiman (2001), they belong to a set of machine learning
methods that is increasingly integrated into the statistical toolkit of economists
(Varian, 2014; Mullainathan and Spiess, 2017; Athey, 2018). Trees and
forests obtain predictions by drawing on a clear-cut algorithm that imposes
only minimal assumptions about which circumstances interact in shaping
individual opportunities, and how. Thereby, they restrict judgment calls of the
researcher and inform model specification by data analysis. As a consequence,
they cushion downward bias by flexibly accommodating different ways in which
circumstance characteristics shape the distribution of outcomes. Moreover,
the conditional inference algorithm branches trees (and constructs forests) by a
In empirical implementations, typical circumstances include biological sex, socio-economic
background, and race.