The search for significant others: p-values rarely engage

Sergio Graziosi posted on December 09, 2016 16:53

It is conventional in the social sciences to report p-values when communicating the results of statistical analyses. There are, however, increasing criticisms of the p-value for being open to misinterpretation and – worse – at risk of falsely indicating the presence of an effect. Alison O’Mara-Eves considers a further problem: failing to engage readers with the meaning behind the numbers. Some alternative ways of reporting the results of analyses are considered.

In the social sciences, statistical analyses are regularly used to test hypotheses and interrogate the collected data. The typical output of such analyses is a mean, correlation, or other statistical value that represents some trend in the data: causal relations, similarities, or differences. This output is either a summary of what we have observed in the collected data, or a value that we infer will also represent other samples from the same population. Attached to that summary or inferential statistic is usually a p-value.

Statistical p-values are often represented in published reports as asterisks, the number of which tells the reader something about the p-value. Generally, a p-value of less than or equal to .05 is represented by *, whilst ≤ .01 is usually **, and ≤ .001 is usually ***. Whilst most readers of research might not reflect too much on what the numbers mean, the reader will typically get more excited by ‘more asterisks’ (assuming that they are hoping for a statistically significant outcome).
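As a minimal sketch of that convention, assuming the common thresholds of .05, .01 and .001 (individual journals may use different cut-offs), the mapping from p-value to asterisks could be written as follows. The function name `significance_stars` is hypothetical and used only for illustration.

```python
def significance_stars(p: float) -> str:
    """Map a p-value to the conventional asterisk notation.

    Assumed thresholds: <= .05 is *, <= .01 is **, <= .001 is ***.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""  # conventionally reported as not significant


for p in (0.2, 0.03, 0.004, 0.0002):
    print(f"p = {p}: '{significance_stars(p)}'")
```

Note how little the function needs to know: it sees only the p-value, not the size of the effect or what was measured.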

You might have noticed that I did not define the p-value but instead launched into the description of the asterisks. This is because many readers (and many study authors) process p-values in just this way, i.e., rather superficially. Whilst the audience generally knows the rule of thumb that a p-value less than .05 is ‘significant’, study authors often fail to explain what question the significance test is actually asking.

Such ‘black box’ approaches to communicating statistics do not allow the audience to really engage with the research findings. When we wave our hands and say “trust me that it’s important”, the reader is left without a good understanding of how or why the numbers are important, which makes it harder for them to determine the relevance of the findings to their own informational needs. Indeed, “p-values characterize only statistical significance, which bears no necessary relationship to practical significance or even to the statistical magnitude of the effect” (Lipsey et al., 2012, p. 3).

Most commonly, the significance value relates to a test of the null hypothesis that there is no effect or relationship beyond chance, so a significant result typically means that, statistically speaking, we can reject that null hypothesis. But this is not the same as saying that the observed effect is meaningful, and it does not tell us about any variation (e.g., does the observed effect apply to all cases?).

I hasten to add that there are other reasons why we might wish to abandon the p-value (or at least complement it with additional information). Lipsey et al. (2012) argue: “Statistical significance is a function of the magnitude of the difference between the means, to be sure, but it is also heavily influenced by the sample size, the within samples variance on the outcome variable, the covariates included in the analysis, and the type of statistical test applied” (p. 3). Several papers have discussed other statistical reasons why a p-value can be misinterpreted or lead to a false positive result (i.e., the analyses detect an effect that is not actually present). Particularly insightful and/or impactful papers on this issue include Colquhoun (2014) and Ioannidis (2005). At least one journal has made the bold move to ban the p-value significance test because of statistical concerns; see the Royal Statistical Society item discussing this ban.
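To illustrate the point about sample size, here is a rough sketch in which the standardised difference between two groups is held fixed at 0.2 and only the number of participants changes. It assumes a simple two-sample z-test with equal group sizes and a known standard deviation, which is a simplification rather than the test any particular study would use; the function name `two_sided_p` and the chosen values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value for a standardised mean difference.

    Assumes two equally sized groups and a known standard deviation
    (a z-test), so the standard error of the difference is sqrt(2 / n).
    """
    z = effect_size / sqrt(2.0 / n_per_group)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# The same small effect (0.2 standard deviations) at three sample sizes.
for n in (20, 200, 2000):
    print(f"n per group = {n:>4}: p = {two_sided_p(0.2, n):.4f}")
```

Under these assumptions the effect is identical in all three cases, yet it moves from ‘not significant’ to ‘significant’ purely because more people were sampled.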

Significant others

So what are other ways of engaging the reader in interpreting your statistical results? Here are a few starting suggestions, but there are certainly others.

Effect sizes and confidence intervals. Effect sizes focus on the magnitude and direction of the effect, while confidence intervals encourage correct interpretation (e.g., see Cumming, 2013), perhaps because they require the reader to think about the range of possible values that an observed effect can take. It should be noted, however, that there are also ways to make effect sizes more interpretable for different audiences (e.g., see Thomas, Harden, & Newman, 2012).
Converting back to the original metric. This involves presenting the findings in terms of what one would actually observe ‘in the real world’. For example, an intervention aimed at increasing vegetable intake could present the findings in terms of how many additional portions of vegetables the average participant would consume after the intervention. This approach emphasises practical significance over statistical significance.
Exploring variation. Whilst a mean effect or a correlation representing the strength of a relation is interesting, there is perhaps not enough attention paid to variation. Variation is the extent to which different data points (e.g., the responses from individuals) differ from the ‘average’ or ‘typical’ respondent. Some analyses might explore outliers and exclude or truncate them so that they do not unduly influence the analyses, but perhaps there is more that we could be doing with this information. The ‘variants’ could be particularly interesting to practitioners and decision-makers, rather than just being statistical nuisances. For instance, they could help us understand how the finding might apply to different people in our sample (and by inference, our population). Focusing on variation could be as simple as plotting the data points so that the reader can see how the individual data points differ from the mean or predicted values, or it could be more complex, involving subgroup and other statistical analyses to try to explain the variation. (Note, though, that this should not be seen as an endorsement of practices that lead to data dredging or p-hacking; see Simmons et al. (2011) for a definition. Explorations of variation should be purposive, well-justified, and, ideally, pre-specified.)
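The three suggestions above can be sketched together on a small example. The data below are entirely made up for illustration (daily vegetable portions for eight hypothetical participants per group), and the confidence interval uses a normal approximation, which will be somewhat too narrow for samples this small. The function name `summarise_difference` is likewise hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def summarise_difference(treated, control, level=0.95):
    """Return the raw mean difference, Cohen's d, and an approximate CI."""
    n_t, n_c = len(treated), len(control)
    sd_t, sd_c = stdev(treated), stdev(control)

    # Original metric: difference in means, here in portions per day.
    diff = mean(treated) - mean(control)

    # Effect size: Cohen's d, using the pooled standard deviation.
    pooled_sd = sqrt(((n_t - 1) * sd_t**2 + (n_c - 1) * sd_c**2)
                     / (n_t + n_c - 2))
    d = diff / pooled_sd

    # Confidence interval for the raw difference (normal approximation).
    se = sqrt(sd_t**2 / n_t + sd_c**2 / n_c)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return {"difference": diff, "cohens_d": d,
            "ci": (diff - z * se, diff + z * se)}


# Made-up data: vegetable portions per day (illustrative only).
control = [2.0, 3.0, 2.5, 1.5, 3.5, 2.0, 2.5, 3.0]
intervention = [3.0, 3.5, 2.5, 4.0, 3.0, 4.5, 3.5, 2.0]

result = summarise_difference(intervention, control)
low, high = result["ci"]
print(f"Extra portions per day: {result['difference']:.2f} "
      f"(approx. 95% CI {low:.2f} to {high:.2f})")
print(f"Cohen's d: {result['cohens_d']:.2f}")

# Variation: how far each intervention participant sits from the group mean.
group_mean = mean(intervention)
deviations = [x - group_mean for x in intervention]
print("Deviations from the intervention mean:", deviations)
```

Reporting ‘about three-quarters of a portion more per day, give or take’ alongside the individual deviations arguably gives a practitioner more to work with than a row of asterisks.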

In conclusion, the “seductive but illusory certainty of a p-value cutoff” (Cumming, 2013, p. 12) is problematic for more than just statistical reasons. It discourages researchers and their audiences from truly thinking about what the significance test is testing. Moreover, beyond the initial excitement of discovering “yay – it’s statistically significant!”, audiences are not likely to be fully engaged by these values because the practical implications of the results are not always clear. Interpreting the results in terms of the likely ‘real-world’ implications or the variation in the dataset will help practitioners and decision-makers decide how the finding might apply to their context.

About the author:

Alison O’Mara-Eves is a Senior Researcher at the EPPI-Centre, Social Science Research Unit, UCL Institute of Education. She specialises in methods for systematic reviews and meta-analysis, and has been conducting systematic reviews for over 13 years. In this capacity, she has reviewed many thousands of primary studies, as well as conducting statistical analyses of her own, which has made her acutely aware of the challenges of communicating findings from statistical analyses. Her profile and publications can be found here.

Bibliography

Colquhoun D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216.

Cumming G. (2013). The new statistics: why and how. Psychological Science, 25, 7-29.

Ioannidis JP. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

Lipsey, M.W., Puzio, K., Yun, C., Hebert, M.A., Steinka-Fry, K., Cole, M.W., Roberts, M., Anthony, K.S., Busick, M.D. (2012). Translating the statistical representation of the effects of education interventions into more readily interpretable forms. (NCSER 2013-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, U.S. Department of Education.

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366.

Thomas J, Harden A, and Newman M. (2012). Synthesis: Combining results systematically and appropriately. In Gough, Oliver, and Thomas (eds.), An introduction to systematic reviews. London: Sage.
