By James Thomas and Dylan Kneale
‘Scientific’ prediction is viewed as a necessary precursor for, and often a determinant of, public policy-making (Sarewitz & Pielke Jr, 1999); without some form of prediction or forecasting of the likely impacts of policy decisions, decision-makers may find themselves paralysed into inaction or shooting in the dark. Rather than gazing into crystal balls, decision-makers can draw on an increasingly sophisticated array of data and methods for generating predictions. There is a lot of enthusiasm for these predictions, and some have claimed to have the skills to forecast future events precisely; that is, to be ‘super-forecasters’ (Kirkegaard, Taji, & Gerritsen, 2020). However, while new possibilities are indeed opening up, we need to be mindful of the assumptions that underpin statistical models, and of the fact that large datasets and sophisticated statistics cannot perform miracles simply because they are large and sophisticated. ‘Super-predictors’ are still subject to the basic evidential laws that cover all research.
This is a wide-ranging field, although the issues we focus on here mainly reflect the epistemic impossibilities of using model-based predictions indiscriminately to allocate individuals into discrete categories of real-world significance, without acknowledgement of the uncertainty in doing so.
The implication is that, where uncertainty cannot be incorporated in a real-world use scenario, the use of the model in that scenario is inadvisable and manifestly unethical. This blog is prompted by recent examples in UK educational policy where there have been attempts to extrapolate model-based predictions to allocate individuals into categories, as well as debates in public health suggesting that there is widespread potential for predictions to be used in this way. These are not isolated cases: across public policy, model-based predictions have been used to make decisions about individual outcomes. For example, model-based predictions of recidivism in the US have been used by judges in decisions about whether to release or hold a defendant before trial (Dressel & Farid, 2021)*. The (mis)use of predictive models and algorithms to ‘objectively’ and deterministically categorise individuals into discrete categories of real-world significance leads us to regard such exercises as attempts to escape epistemic gravity with optimism.
Extrapolating predictions to individuals
Models used for prediction purposes are of huge value across several areas of public policy and practice. Some predictions may have reasonable predictive power for groups of people with particular social profiles (e.g. socioeconomically disadvantaged vs advantaged people), but don’t travel well when applied to individuals. This can be problematic for at least six reasons:
The first, most fundamental issue is that predictions are made on the basis of probabilistic/stochastic models, but the classification of individuals into real-world categories is deterministic and crisp. A recent example of the flaws in this reasoning involved the use of predicted grades in the UK during the COVID-19 pandemic in 2020 to estimate children’s academic achievement. When an exam grade is predicted from a regression model, there is uncertainty around the prediction itself, and the model includes an error term denoting the unexplained variance. This ‘fuzziness’ is ignored when the prediction is extrapolated from the model to allocate an individual grade to a real-world person. This is despite the fact that the range of uncertainty when predicting a (new) individual value from a regression model is wider than the range of uncertainty around a model parameter, because predicting a single response carries more uncertainty than predicting a mean response. In the case of A-level and GCSE exams, a more accurate way of presenting results might have been to put intervals around them reflecting the uncertainty in prediction. However, this would have resulted in the slightly bizarre situation where someone might, for example, have been predicted and awarded a grade ‘between A and C’. This is clearly not workable in practice, especially as some students could have been allocated grades spanning the entire range from A to E, but it would have exposed the inherent uncertainty in the model, something that was concealed by the crisp prediction.
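To make the difference between the two kinds of uncertainty concrete, here is a minimal sketch in Python. It is not the model used to award grades in 2020: the data are simulated, the variable names (prior, exam) are hypothetical, and the multiplier t_crit=2.0 is only a rough stand-in for a 95% t quantile. It fits a simple least-squares line and compares the half-width of the interval for the mean response with the half-width of the interval for a single new individual at the same predictor value.

```python
import math
import random


def interval_half_widths(x, y, x0, t_crit=2.0):
    """Return (mean_response_half_width, new_individual_half_width) at x0.

    Fits y = intercept + slope * x by ordinary least squares.
    t_crit is an illustrative multiplier, not an exact t quantile.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Residual variance: the part of the outcome the model leaves unexplained.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)

    # Uncertainty about the fitted line at x0 (shrinks as n grows).
    leverage = 1.0 / n + (x0 - x_bar) ** 2 / sxx
    se_mean = math.sqrt(s2 * leverage)
    # A new individual adds the full residual variance on top of that.
    se_new = math.sqrt(s2 * (1.0 + leverage))
    return t_crit * se_mean, t_crit * se_new


# Simulated data only: hypothetical prior attainment and exam marks.
random.seed(0)
prior = [random.uniform(0, 100) for _ in range(200)]
exam = [10 + 0.8 * p + random.gauss(0, 12) for p in prior]

mean_hw, new_hw = interval_half_widths(prior, exam, x0=60.0)
print(f"mean response at x0:  +/- {mean_hw:.1f} marks")
print(f"new individual at x0: +/- {new_hw:.1f} marks")
```

Under these assumptions the interval for an individual is many times wider than the interval for the mean, and collecting more data narrows only the latter: the residual variance term does not shrink with sample size.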
Secondly, the predictions, and certainly the approaches underlying their creation, may not have been developed with the intention of being applied indiscriminately to wide swathes of the population in order to assign individuals to categories of real-world significance. Instead, the logic underlying these models may hold that their predictions are useful as diagnostic aids, but not as the sole determinant of individual classification. If a prediction is to serve as a diagnostic aid, and not the sole basis of individual classification, it may need to be based on, or combined with, additional information about the individual.
The third issue is that even when we have reliable prior information about an individual, using this in a prediction does not reduce error to trivial levels. In the case of predicting grades based on previous attainment, for example, we are only able to do this with limited accuracy (Anders, Dilnot, Macmillan, & Wyness, 2020). Nevertheless, in the case of last summer’s school grades debacle, models incorporating past achievement at school level were treated as being entirely deterministic, with zero error, which is not a defensible position. Furthermore, the accumulating evidence demonstrating how poor we are at predicting individual grades undermines the entire idea that predicting individual academic achievement has analytic value, and suggests that the widespread use of individual predictions in the education system is outdated (see the Twitter thread by Gill Wyness and the paper by Anders et al. (2020)).
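A small sketch can show what treating a prediction as having zero error throws away. Everything in it is an assumption made for illustration: the grade boundaries, the predicted mark of 62, the predictive standard deviation of 12 marks, and the choice of a normal distribution for the prediction error. None of these values comes from the 2020 grading process or from Anders et al. (2020).

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def grade_probabilities(predicted_mark, predictive_sd, lower_bounds):
    """Probability of each grade, assuming normally distributed error.

    lower_bounds is a list of (grade, minimum_mark) pairs ordered from the
    highest grade down; the last entry should use -math.inf as its minimum.
    """
    probs = {}
    upper = math.inf
    for grade, lower in lower_bounds:
        if upper == math.inf:
            p_below_upper = 1.0
        else:
            p_below_upper = normal_cdf((upper - predicted_mark) / predictive_sd)
        if lower == -math.inf:
            p_below_lower = 0.0
        else:
            p_below_lower = normal_cdf((lower - predicted_mark) / predictive_sd)
        probs[grade] = p_below_upper - p_below_lower
        upper = lower
    return probs


# Hypothetical grade boundaries (minimum mark for each grade).
BOUNDARIES = [
    ("A", 70.0),
    ("B", 60.0),
    ("C", 50.0),
    ("D", 40.0),
    ("E", 30.0),
    ("U", -math.inf),
]

probs = grade_probabilities(predicted_mark=62.0, predictive_sd=12.0,
                            lower_bounds=BOUNDARIES)
for grade, p in probs.items():
    print(f"{grade}: {p:.2f}")
```

With these illustrative numbers the single most likely grade is a B, yet the student is more likely than not to end up with some other grade. A crisp allocation reports only the most likely category and hides the rest of the distribution.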
The fourth issue is that increasingly sophisticated modelling and technologies can engender false confidence in our ability to predict. A recent example involves the concept of Precision Public Health (PPH), the topic of a new report by the EPPI-Centre, which refers to the use of novel data sources and/or computer science-driven methods of data analysis to predict risk or outcomes, in order to improve how interventions are targeted or tailored. Our work on PPH suggests that where evidence is generated using new methods of analysis (specifically Artificial Intelligence) and/or new forms of data, the findings tend to be interpreted by others as supporting claims about improvements in accuracy in ways that overreach the findings of the original study. Alongside this, we also observed that many of the new ways of analysing data and creating predictions about who or where may benefit most from a given Public Health intervention had not been evaluated against the performance of more established methods. Certainly in the case of some analytical techniques, there is conflicting evidence around whether new approaches, such as those used in machine learning, can and do outperform traditional analytical methods such as logistic regression in prediction models (Christodoulou et al., 2019).
A fifth issue revolves around measurement error in the variables and other inaccuracies in the composition of the data used to generate the prediction, including missing data. Although these issues represent caveats to most if not all quantitative social science, the implications for model-based predictions may be unexpected.
A sixth issue is the potential for predictions to reinforce existing systemic biases and disadvantages. Returning to the example of school grades, there is a good deal of evidence demonstrating that the final grades that young people achieve in schools reflect systematic disadvantages faced by poorer and minoritised groups (Banerjee, 2016). A ‘perfect’ prediction model would be one able to replicate the exact results of ‘average’ (non-COVID-19) school years, which would include replicating these systemic inequalities and biases. The subsequent use of these models to predict future outcomes for individuals means that inequalities persist and are actually reinforced by prediction models. For instance, with respect to the earlier example of recidivism, a widely used tool to generate model-based predictions of recidivism in the US systematically overestimated the risk of black defendants re-offending and underestimated the risk of re-offending for white defendants (Dressel & Farid, 2021).
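One way such a disparity can be made visible is to compare error rates across groups rather than looking only at overall accuracy. The sketch below does this for a risk tool that flags people as ‘high risk’. The records are invented purely for illustration, the group labels are placeholders, and the function is a generic check, not the analysis carried out by Dressel & Farid (2021).

```python
from collections import defaultdict


def error_rates_by_group(records):
    """Compute false positive and false negative rates for each group.

    records is an iterable of (group, flagged_high_risk, reoffended) tuples.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for group, flagged, reoffended in records:
        if flagged and reoffended:
            counts[group]["tp"] += 1
        elif flagged and not reoffended:
            counts[group]["fp"] += 1
        elif not flagged and not reoffended:
            counts[group]["tn"] += 1
        else:
            counts[group]["fn"] += 1

    rates = {}
    for group, c in counts.items():
        did_not_reoffend = c["fp"] + c["tn"]
        did_reoffend = c["tp"] + c["fn"]
        rates[group] = {
            # Share of people who did not reoffend but were flagged anyway.
            "false_positive_rate": (
                c["fp"] / did_not_reoffend if did_not_reoffend else float("nan")
            ),
            # Share of people who did reoffend but were not flagged.
            "false_negative_rate": (
                c["fn"] / did_reoffend if did_reoffend else float("nan")
            ),
        }
    return rates


# Invented records: (group, flagged as high risk, reoffended).
records = (
    [("group_a", True, False)] * 40 + [("group_a", False, False)] * 60
    + [("group_a", True, True)] * 70 + [("group_a", False, True)] * 30
    + [("group_b", True, False)] * 20 + [("group_b", False, False)] * 80
    + [("group_b", True, True)] * 50 + [("group_b", False, True)] * 50
)

rates = error_rates_by_group(records)
for group, r in sorted(rates.items()):
    print(group, r)
```

In this made-up dataset the tool wrongly flags people in group_a who do not go on to reoffend twice as often as people in group_b, while missing more of those in group_b who do. A single accuracy figure would not reveal that the errors fall unevenly.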
Allocation of individuals into discrete, ‘neat’ categories, on the basis of an ‘objective’ model, belies both the fuzziness of models and the structural disadvantages and structural racism that are underlying features of the models. This was one of the concerns also expressed about the direction of travel of Precision Public Health, where predicted differences in health were at risk of being interpreted almost as ‘innate’ and static features of individuals and attributable to individual factors, thereby overlooking structural and systemic disadvantages that give rise to these differences. Using predictions from models that reflect an unequal society, to predict outcomes for individuals, risks perpetuating further inequalities and undermines claims of objectivity, if the underlying model unquestioningly reproduces and perpetuates the subjectivities and biases of society at large, when categorising individuals.
Predicting outcomes for individuals based on population-level models and treating the source models (and the prediction) as having zero error is an epistemic impossibility. Similarly, treating error from a model as constant and fixed over time, contexts, and populations is also clearly problematic and not supported by evidence. However, the issue is often less to do with the robustness of the underlying model and more a reflection of the mismatch between the underlying model assumptions and the ambition for real-world individual-level applications of the prediction.
Regardless of their ostensible superiority, we should approach even the most sophisticated predictions for individuals with a critical eye with regard to the issues outlined above. AI-based predictions can sometimes be viewed as authoritative because of the putative improvement in the precision of the prediction, based on the novelty of the data and the complexity and perceived objectivity of the algorithms used to generate them. Our own work around PPH emphasised that there is merit in both (i) questioning the rationale for prediction, and (ii) questioning the evidence base supporting the prediction for individuals (Kneale et al., 2020).
Over the past year we have seen the phrase ‘following the science’ bandied about by policy-makers, often in reference to predictions of how the COVID-19 pandemic might spread. The narrative of politicians during the pandemic has been one in which model-based predictions, albeit ones directed at the population level, have been viewed as a form of instrumental evidence on which to base policy decisions. However, because of the degree of uncertainty involved in generating these predictions, they may instead be at their most helpful in further illuminating the landscape within which policy decisions are made, rather than for instrumental use.
In the case of predicted grades, the literature had been clear for some time beforehand that predictions of future grades for individuals are far from accurate. The aftermath provides an opportunity to think again about how we use individual predictions in the education system, and perhaps more broadly across policy domains. Fundamentally, we should accept that it is simply impossible to use these models deterministically to allocate individuals to crisp categories with real-world consequences: we cannot overcome epistemic gravity with heroic optimism.
About the authors
James Thomas is professor of social research & policy and deputy director of the EPPI-Centre. His interests include the use of research to inform decision-making and the development of methods, technology and tools to support this.
Dylan Kneale is a Principal Research Fellow at the EPPI-Centre. He is interested in developing methods to enhance the use of evidence in decision-making, focusing on public health, ageing and social exclusion.
Notes and references
*The types of model-based prediction of most concern here are those that involve prediction of a future state or event which may be observed under normal conditions or with the passing of time, and less so those that involve ‘prediction’ of a latent (unobserved) contemporaneous state (which reflect more a form of categorisation than prediction).
Anders, Jake; Dilnot, Catherine; Macmillan, Lindsey & Wyness, Gill. (2020). Grade Expectations: How well can we predict future grades based on past performance? Centre for Education Policy and Equalising Opportunities, UCL Institute of Education, University College London.
Banerjee, Pallavi Amitava. (2016). A systematic review of factors linked to poor academic performance of disadvantaged students in science and maths in schools. Cogent Education, 3(1), 1178441.
Christodoulou, Evangelia; Ma, Jie; Collins, Gary S; Steyerberg, Ewout W; Verbakel, Jan Y & Van Calster, Ben. (2019). A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. Journal of Clinical Epidemiology, 110, 12-22.
Dressel, Julia & Farid, Hany. (2021). The Dangers of Risk Prediction in the Criminal Justice System. MIT Case Studies in Social and Ethical Responsibilities of Computing.
Kirkegaard, Emil Ole William; Taji, Wael & Gerritsen, Arjen. (2020). Predicting a Pandemic: testing crowd wisdom and expert forecasting amidst the novel COVID-19 outbreak.
Kneale, Dylan; Lorenc, Theo; O'Mara-Eves, Alison; Hong, Quan Nha; Sutcliffe, Katy; Sowden, Amanda & Thomas, James. (2020). Precision public health – A critical review of the opportunities and obstacles. London: EPPI-Centre, Social Research Institute, University College London.
Sarewitz, Daniel & Pielke Jr, Roger. (1999). Prediction in science and policy. Technology in Society, 21(2), 121-133.
Image source, by Emir ÖZDEMİR, pixabay license.