posted on February 10, 2017 17:15
The replication crisis, publication bias, p-hacking, HARKing, bad incentives, undesirable pressures and probably other factors all contribute to diminishing the trustworthiness of published research, with obvious implications for research synthesis. Sergio Graziosi asks whether demanding simple theoretical clarity might be part of the solution.
Within all the talk about publication bias, p-hacking, the replication crisis and so forth, I am finding it harder and harder to keep track of all the proposed solutions. While trying to organise my thoughts, I have realised that the absence of theoretical clarity underlies many of the problems that are currently being discussed. Perhaps this realisation is enough to justify a slight change in focus. For systematic reviewing, as Mark has reminded us, figuring out what results should be trusted, and perhaps more importantly, finding auditable and reasonably objective ways to do so is, naturally, of paramount importance. I do not think I need to convince anyone about this latter point, and will take it for granted in what follows.
More than ten years after warning us that most published research findings are false, Ioannidis (with colleagues) has produced a manifesto for reproducible science (Munafò et al. 2017). It is well worth a read, but it did not soothe my disquiet and confusion. On one hand, the manifesto comes with a range of concrete, actionable and agreeable suggestions. On the other, the same suggestions are, to my eyes, already worrying: the value of each remedial measure is likely to depend on how robust its implementation can be. Let’s consider pre-registration: it is a very attractive strategy and I am sure it is already helping to reduce practices such as HARKing and p-hacking. However, on closer examination, one finds the contribution from Wicherts et al. (2016), who list all the degrees of freedom that researchers may exploit (consciously or not, legitimately or not) in their search for “significant” results. Their list includes 34 potential problems, and they frame their discussion around the pitfalls that should be avoided when relying on preregistration. Ouch.
Checking for signs of 34 separate questionable practices when reviewing a single study in conjunction with its preregistration already looks daunting and close to utopian – especially when one remembers that it is in the authors’ interest to paint their own research in the most positive light. How many researchers are likely to critically consider each relevant pitfall of each step of their own workflow, and do so at the right time?
On the other side of the fence, to compile systematic reviews, one would need to go through the same checklist for all studies considered, and perhaps check the consistency of decisions across multiple reviewers. If I extrapolate, and assume that each of the twenty-plus strategies proposed in Munafò’s manifesto comes with a similar number of ways to fail to fully deliver its own potential (even if this doesn’t entail a combinatorial explosion, as there are many overlaps), my mind vacillates and immediately starts looking for strategies that come with lower cognitive costs.
What I will propose is indeed a shortcut: a (hopefully handy) heuristic that revolves around the role of theory in primary research. My starting point is a concise list of typical research phases (up to and excluding research synthesis as such), being mindful that many alternatives exist. The table below may be read as a simplified version of the list produced by Wicherts et al., compiled with two underlying objectives: keeping it manageable, and highlighting the role of theory. My main hunch is that when one clarifies the role played by theory in a given research phase, pitfalls, dangers and malpractice may become easier to isolate. You may decide to read what follows as an argument advocating for epistemological clarity in scientific reporting.
|Research phase & role of theory
|Theory building: this is typically done to try to accommodate the evidence that isn’t satisfactorily accounted-for by existing theories.
||Identify a need: as anomalies are accumulating, people start asking “do we need an entirely new theory?”
Historically, theories such as electromagnetism. More recently, the creation of countless classifications of psychological ‘types’.
|1. Fail to account for all/enough available evidence.
2. Fail to realise how current evidence may fit in existing frameworks.
3. Give new names/labels to existing concepts; fail to appreciate how existing theories use different labels to point at similar concepts or mechanisms.
4. Fail to capture regularities, which directly depend on non-contingent causal chains.
1: No new theory can expect to account for all evidence from day zero.
2: That’s how theories degenerate: what if a new theory can accommodate more evidence with fewer ad-hoc extensions?
3: Existing theories are confusing, imprecise, too broad or too narrow.
4: This can be established only post-hoc. One needs to first theorise and then check that predictions do apply. Only then can one focus on causal explanations (secondary hypotheses).
|Draft a new theory.
|Formulate new hypotheses: within a theoretical framework, many separate hypotheses can be identified.
||Data exploration: find patterns in existing data.
||Analysis and re-analyses of longitudinal studies
1. Spurious correlations.
2. Pattern-finding bias (we tend to see patterns in noise).
3. Mistaking homogeneity for random noise (the opposite of pattern-finding).
4. Survivorship bias.
|These pitfalls are irrelevant, because hypotheses need to be tested anyway.
||Deductively explore the consequences of a given theory, e.g. recalculation of the expected light-bending effect of gravity as a result of general relativity.
1. Logic failures and/or lack of imagination.
2. Overconfidence, producing hypotheses that are too far removed from tested-theory.
3. Lack of ambition: producing ever more detailed hypotheses, just to get publishable positive results.
1-2: as above.
3: this is how “normal science” is done!
||Test a hypothesis.
||Measure the effect of a drug.
1. Bad/insufficient clarity on what is tested.
2. Bad experimental design.
3. Low power.
4. Measure proliferation (encourages p-hacking).
5. Unpublished negative results / publication bias.
1: that’s science, people can’t know it all already.
2-4: budget/capacity. Science happens in the real world, we can do what we can do.
5: ditto, can’t spend ages trying to publish results that no-one wants to read.
|Make predictions – applied science.
||Build bridges, design new microprocessors
1. Overconfidence: stretching a theory beyond its known/expected limits of applicability.
2. Failure to account for theoretical boundaries (not knowing when/why a theory stops applying).
3. Failure to measure outcomes.
1: But, but, science told us it will work!
2: Can’t anticipate unknown unknowns.
3: We don’t need to, because our theory is solid.
The interesting part of this exercise is how many of the known problems are not captured, or are only marginally captured, by the table above – I would argue that a good number fall into the cracks between its cells. Thus, my point is that clarifying what one is doing (Am I producing a new hypothesis? Am I testing a given one? Am I trying to see if we should start looking for new possible theories?) should be second nature for all scientists (but alas, I don’t think it is). This may make it easier to double-check for well-known pitfalls, but also to avoid stumbling on the boundaries between separate tasks. For example, p-hacking and HARKing can be deliberate malpractice, or could result from “Bad/insufficient clarity on what is tested”. However, it seems to me that they may also be caused by a lack of separation between hypothesis testing and data exploration.
To illustrate, we may imagine our typical scientist: in this imaginary scenario, her normal work starts by testing a hypothesis – to keep it simple, we’ll assume she is testing the effectiveness of drug D. Let’s imagine she finds a tiny effect size, but with a considerable number of outliers which seem to be clustered together. The original experiment was testing a hypothesis: the first result is that drug D doesn’t appear to work. However, we now have new data to explore (a different task), and perhaps we can find that the outliers all have trait T in common. The same experiment therefore yielded a second (separate!) result: we now have a new hypothesis – perhaps drug D only works on subjects with T.
One study yielded two “results”: the first is negative or inconclusive; the second is a new hypothesis. Being a new hypothesis, it needs to be tested on fresh data. In our oversimplified example, the data suggested the new hypothesis, and therefore the same data can’t also confirm it.
In other words, perhaps we can agree that clarifying and segregating tasks based on how they relate to theory has helped to identify a well-known problem with unreliable science, and has concurrently made it clear how best to use the data collected. Most scientific work actually happens across multiple phases; nevertheless, having a clear conceptual picture of the boundaries could be a useful way to avoid repeating well-known mistakes. It goes without saying that such conceptual clarity, if carried over into published research articles, also has the potential to make the task of systematic reviewers less prone to error and less dependent on hard-to-audit personal judgements.
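The drug D scenario can be mimicked with a small simulation. The sketch below is an illustration under stated assumptions, not an analysis from this post: the outcomes are pure noise, and the 20 binary traits, the sample size of 200 and the seed are all hypothetical choices, as are the helper names `simulate` and `subgroup_gap`. It picks the trait that looks most promising in one dataset (exploration) and then re-measures that single trait in a fresh dataset (testing).

```python
import random
from statistics import mean


def simulate(n_subjects, n_traits, rng):
    """Hypothetical trial data: outcomes are pure noise, traits are coin flips."""
    outcomes = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
    traits = [[rng.random() < 0.5 for _ in range(n_traits)]
              for _ in range(n_subjects)]
    return outcomes, traits


def subgroup_gap(outcomes, traits, t):
    """Mean outcome of subjects with trait t minus mean of those without it."""
    with_t = [y for y, row in zip(outcomes, traits) if row[t]]
    without_t = [y for y, row in zip(outcomes, traits) if not row[t]]
    if not with_t or not without_t:
        return 0.0  # no comparison possible for this trait
    return mean(with_t) - mean(without_t)


rng = random.Random(2017)  # arbitrary seed, only for repeatability
n_traits = 20

# Task 1, exploration: scan every trait and keep the most striking one.
explore_y, explore_x = simulate(200, n_traits, rng)
gaps = [subgroup_gap(explore_y, explore_x, t) for t in range(n_traits)]
best_trait = max(range(n_traits), key=lambda t: abs(gaps[t]))

# Task 2, testing: measure only that trait, in data it did not come from.
confirm_y, confirm_x = simulate(200, n_traits, rng)
confirm_gap = subgroup_gap(confirm_y, confirm_x, best_trait)

print("trait suggested by exploration:", best_trait)
print("gap in the exploratory data:", round(gaps[best_trait], 3))
print("gap in the fresh data:", round(confirm_gap, 3))
```

Because the suggested trait is selected for having the largest gap, its exploratory estimate is pushed away from zero by construction; only the second, independent sample gives a fair test of the hypothesis that the exploration produced.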
Is this simplistic proposal enough to overcome all the problems mentioned above? Of course not. It is a mere heuristic; a simple rule of thumb, which I hope might be useful to our readers. If time permits, I hope to explore this same theme in the context of conducting systematic reviews in a follow-up post. In my studies (molecular biology and neuroscience), no one ever helped me realise how the role of theory relates to the different mistakes that may plague scientific results. In fact, no one ever discussed the epistemological foundations of science; I guess they were mostly taken for granted. Thus, perhaps my suggestion is directed at educators in particular: discussing and clarifying the distinctions I’ve mentioned here might be a low-cost strategy to help the next generation of scientists not to repeat our own mistakes.
About the author
Sergio Graziosi is the EPPI-Centre IT manager, and one of the developers of EPPI-Reviewer. Designing tools to conduct systematic reviews implicitly requires exploring what can and cannot count as good/reliable evidence. As a consequence, he’s been exploring science epistemology on his own (non-academic) blog.
Kerr NL (1998). HARKing: hypothesizing after the results are known. Personality and Social Psychology Review, 2 (3), 196-217. PMID: 15647155
Head ML, Holman L, Lanfear R, Kahn AT, & Jennions MD (2015). The extent and consequences of p-hacking in science. PLoS Biology, 13 (3). PMID: 25768323
Ioannidis JP (2005). Why most published research findings are false. PLoS Medicine, 2 (8). PMID: 16060722
Munafò M, Nosek B, Bishop D, Button K, Chambers C, Percie du Sert N, Simonsohn U, Wagenmakers E, Ware J, & Ioannidis J (2017). A manifesto for reproducible science. Nature Human Behaviour, 1 (1). DOI: 10.1038/s41562-016-0021
Wicherts JM, Veldkamp CL, Augusteijn HE, Bakker M, van Aert RC, & van Assen MA (2016). Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking. Frontiers in Psychology, 7. PMID: 27933012
Image Credits: © Munafò et al., Nature Publishing Group (CC-BY).
posted on December 23, 2016 09:53
[Warning: do not read this with small kids around!] Mark Newman poses some questions in theme with the seasonal festivities: what does it mean to believe in Father Christmas? Does it really differ that much from belief in the role of evidence? We at the EPPI-Centre are happy to rise to the occasion and wish all of our readers a very Merry Christmas and a happy and prosperous New Year.
This festive time of year provides ample cause to reflect on the nature of ‘belief’. After all, there is a lot of ‘believing’ going on at this time of year. Believing that Father Christmas brings your presents down the chimney, for example. Now that my kids are old enough, I can go public and say I don’t believe in Father Christmas. I don’t ‘believe’ most of the foundation myths of Christmas actually, yet I will still be celebrating Christmas with family and friends; happily participating in Christmas rituals like singing carols, putting stockings by the fire on Christmas Eve and so on. This got me thinking about what we mean when we say ‘I believe’ or ‘I don’t believe’. Is ‘belief’ (or not) about Christmas the same thing as ‘belief’ (or not) about the meaning of evidence?
In one way I think you could say that yes, we can and do use the term ‘believe’ in different ways in different contexts. Talking about ‘believing’ in the context of faith, myth, tradition and shared communal social norms is meaningfully different to talking about ‘believing’ in the context of a discussion about the interpretation of research evidence. But of course, as an advocate of the greater use of research evidence to inform decision making, I would say that, wouldn’t I? So I think it is important to recognise that actually there are quite specific ways in which I might be using the term ‘believe’ in the same way in both contexts.
The claim that Father Christmas came down the chimney to bring your presents is a ‘knowledge claim’. Therefore I can ask what is the warrant for that knowledge claim. A warrant is provided by some combination of theory, empirical research evidence and personal experience(1). When I say that I do not believe that Santa came down the chimney to bring your Christmas presents, I am saying that the theory, empirical evidence and personal experience do not provide a warrant for that knowledge claim. This is what we are saying when we talk about believing or not believing the evidence. Does the warrant provided by the combination of theory, empirical research evidence and personal experience support the knowledge claim made by the researchers?
So no, I don’t believe in Father Christmas but I am still looking forward to seeing what he brings on Christmas day and enjoying all the festivities of the season. I hope you all do too.
About the Author
Mark Newman is a Reader in Evidence informed Policy and Practice at UCL Institute of Education and an Associate Director of the EPPI-Centre. He has a particular interest in evidence use in the context of the Education and Training of healthcare professionals. He will be celebrating Christmas in London where he lives with his two children.
1. James, M., Pollard, A., Rees, G., & Taylor, C. (2005). Researching learning outcomes: building confidence in our conclusions. Curriculum Journal, 16 (1), 109-122. DOI: 10.1080/0958517042000336863
posted on December 09, 2016 16:53
It is conventional in the social sciences to report p-values when communicating the results of statistical analyses. There are, however, increasing criticisms of the p-value for being open to misinterpretation and – worse – at risk of falsely indicating the presence of an effect. Alison O’Mara-Eves considers a further problem: failing to engage readers with the meaning behind the numbers. Some alternative ways of reporting the results of analyses are considered.
In the social sciences, statistical analyses are regularly used to test hypotheses and interrogate the collected data. The typical output of such analyses is a mean, correlation, or other statistical value that represents some trend in the data – causal relations, similarities, or differences. This output is a summary or representation of what we have observed in the collected data, or a value that we can infer will also represent other samples from the same population. Attached to that summary statistic or inferential statistic is usually a p-value.
Statistical p-values are often represented in published reports as asterisks, the number of which tells the reader something about the p-value. Generally, a p-value of less than or equal to .05 is represented by *, whilst ≤ .01 is usually **, and ≤ .001 is usually ***. Whilst most readers of research might not reflect too much on what the numbers mean, the reader will typically get more excited by ‘more asterisks’ (assuming that they are hoping for a statistically significant outcome).
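The asterisk convention amounts to a simple lookup, which the following minimal sketch makes explicit. The function name `significance_stars` is hypothetical, and the cut-offs are the commonly used ones (.05, .01 and .001, each read as "less than or equal to"); individual journals may apply different rules.

```python
def significance_stars(p: float) -> str:
    """Map a p-value to the asterisks commonly printed in results tables."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p <= 0.001:
        return "***"
    if p <= 0.01:
        return "**"
    if p <= 0.05:
        return "*"
    return ""  # conventionally reported as not significant


for p in (0.2, 0.04, 0.008, 0.0004):
    print(p, significance_stars(p))
```

Note how much information the mapping discards: every p-value in a given band gets the same label, and nothing about the size or direction of the effect survives.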
You might have noticed that I did not define the p-value but instead launched into the description of the asterisks. This is because this is how many readers (and many study authors) process p-values — i.e., rather superficially. Whilst the audience generally knows the rule of thumb that a p-value less than .05 is ‘significant’, study authors often fail to explain what the actual question underlying the significance test means.
Such ‘black box’ approaches to communicating statistics do not allow the audience to really engage with the research findings: when we wave our hands and say “trust me that it’s important”, the reader is left without a good understanding of how or why the numbers are important, which makes it harder for the reader to determine the relevance of the findings to their own informational needs. Indeed, “p-values characterize only statistical significance, which bears no necessary relationship to practical significance or even to the statistical magnitude of the effect” (Lipsey et al., 2012, p. 3).
Most commonly, the significance value relates to a test of whether there is support for the null hypothesis that there is no observed effect or relationship beyond chance, so a significant result typically means that—statistically speaking—we can reject that null hypothesis. But this is not the same as saying that the observed effect is meaningful and it does not tell us about any variation (e.g., does the observed effect apply to all cases?).
I hasten to add that there are other reasons why we might wish to abandon the p-value (or at least complement it with additional information). Lipsey et al. (2012) argue: “Statistical significance is a function of the magnitude of the difference between the means, to be sure, but it is also heavily influenced by the sample size, the within samples variance on the outcome variable, the covariates included in the analysis, and the type of statistical test applied” (p. 3). Several papers have discussed other statistical reasons why a p-value can be misinterpreted or lead to a false positive result (i.e., the analyses detect an effect that is not actually present). Particularly insightful and/or impactful papers on this issue include Colquhoun (2014) and Ioannidis (2005). At least one journal has made the bold move to ban the p-value significance test because of statistical concerns; see the Royal Statistical Society item discussing this ban.
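The dependence on sample size noted by Lipsey et al. can be seen in a small worked example. This is a sketch under simplifying assumptions that are mine, not the cited authors': a two-group z-test with equal group sizes and a known, common standard deviation, so that the test statistic is the standardised effect size multiplied by the square root of half the group size. The function name `two_sided_p` is hypothetical.

```python
from math import erfc, sqrt


def two_sided_p(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value for a two-group z-test with equal group sizes.

    effect_size is the standardised mean difference (difference / SD),
    with the SD assumed known and equal in both groups.
    """
    z = effect_size * sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| >= |z|) equals erfc(|z| / sqrt(2)).
    return erfc(abs(z) / sqrt(2))


# The same small effect (0.2 SD) tested with ever larger groups.
for n in (20, 100, 500):
    print(n, round(two_sided_p(0.2, n), 4))
```

The effect is identical in every row; only the number of participants changes, yet the p-value moves from clearly non-significant with 20 per group to well below .01 with 500 per group.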
So what are other ways of engaging the reader in interpreting your statistical results? Here are a few starting suggestions, but there are certainly others.
- Effect sizes and confidence intervals. Effect sizes focus on the magnitude and direction of the effect, while confidence intervals encourage correct interpretation (e.g., see Cumming 2013), perhaps because they require the reader to think about the range of possible values that an observed effect can take. It should be noted, however, that there are also ways to make effect sizes more interpretable for different audiences (e.g., see Thomas, Harden, & Newman, 2012).
- Converting back to the original metric. This involves presenting the findings in terms of what one would actually observe ‘in the real world’. For example, an intervention aimed at increasing vegetable intake could present the findings in terms of how many additional pieces of vegetables the average participant would consume after the intervention. This approach emphasises practical significance over statistical significance.
- Exploring variation. Whilst a mean effect or a correlation representing the strength of a relation is interesting, there is perhaps not enough attention paid to variation. Variation is the extent to which different data points (e.g., the responses from individuals) differ from the ‘average’ or ‘typical’ respondent. Some analyses might explore outliers and exclude or truncate them so that they do not unduly influence the analyses, but perhaps there is more that we could be doing with this information. The ‘variants’ could be particularly interesting to practitioners and decision-makers, rather than just being statistical nuisances. For instance, they could help us understand how the finding might apply to different people in our sample (and by inference, our population). Focusing on variation could be as simple as plotting the data points so that the reader can see how the individual data points differ from the mean or predicted values, or it could be more complex, involving subgroup and other statistical analyses to try to explain the variation. (Although note that this should not be seen as an endorsement of practices that lead to data dredging or p-hacking; see Simmons et al. (2011) for a definition. Explorations of variation should be purposive, well-justified, and, ideally, pre-specified).
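The three suggestions above can be combined in a few lines. The sketch below uses made-up data for the vegetable-intake example (daily portions for five participants per group) and a hypothetical helper, `summarise_difference`. It reports the difference in the original metric, a standardised effect size (Cohen's d with a pooled standard deviation) and an approximate 95% confidence interval. The interval uses the normal multiplier 1.96, which is an assumption of convenience: with samples this small, a t-based interval would be wider.

```python
from math import sqrt
from statistics import mean, stdev


def summarise_difference(treated, control, z=1.96):
    """Difference in means with Cohen's d and an approximate 95% interval."""
    n1, n2 = len(treated), len(control)
    diff = mean(treated) - mean(control)  # original metric
    s1, s2 = stdev(treated), stdev(control)
    pooled_sd = sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    se = sqrt(s1**2 / n1 + s2**2 / n2)  # standard error of the difference
    return {
        "difference": diff,
        "cohens_d": diff / pooled_sd,
        "ci_low": diff - z * se,
        "ci_high": diff + z * se,
    }


# Hypothetical data: vegetable portions per day after the intervention.
treated = [3, 4, 5, 4, 4]
control = [2, 3, 3, 2, 3]

summary = summarise_difference(treated, control)
print("extra portions per day:", round(summary["difference"], 2))
print("Cohen's d:", round(summary["cohens_d"], 2))
print("approx. 95% CI:", round(summary["ci_low"], 2), "to", round(summary["ci_high"], 2))

# Exploring variation: show each participant relative to their group mean.
for label, group in (("treated", treated), ("control", control)):
    print(label, [round(x - mean(group), 2) for x in group])
```

Reporting "about this many extra portions per day, give or take this much" keeps the practical meaning in view, and the per-participant deviations make it visible when an average hides very different individual responses.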
In conclusion, the “seductive but illusory certainty of a p-value cutoff” (Cumming, 2013, p. 12) is problematic for more than just statistical reasons. It discourages researchers and their audiences from truly thinking about what the significance test is testing. Moreover, beyond the initial excitement of discovering “yay – it’s statistically significant!”, audiences are not likely to be fully engaged by these values because the practical implications of the results are not always clear. Interpreting the results in terms of the likely ‘real-world’ implications or the variation in the dataset will help practitioners and decision-makers decide how the finding might apply to their context.
About the author:
Alison O’Mara-Eves is a Senior Researcher at the EPPI-Centre, Social Science Research Unit, UCL Institute of Education. She specialises in methods for systematic reviews and meta-analysis, and has been conducting systematic reviews for over 13 years. In this capacity, she has reviewed many thousands of primary studies, as well as conducting statistical analyses of her own, which has made her acutely aware of the challenges of communicating findings from statistical analyses. Her profile and publications can be found here.
Colquhoun D. (2014) An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216.
Cumming G. (2013). The new statistics: why and how. Psychological Science, 25, 7-29.
Ioannidis JP. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.
Lipsey, M.W., Puzio, K., Yun, C., Hebert, M.A., Steinka-Fry, K., Cole, M.W., Roberts, M., Anthony, K.S., Busick, M.D. (2012). Translating the statistical representation of the effects of education interventions into more readily interpretable forms. (NCSER 2013-3000). Washington, DC: National Center for Special Education Research, Institute of Education Sciences, U.S. Department of Education.
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366
Thomas J, Harden A, and Newman M. (2012). Synthesis: Combining results systematically and appropriately. In Gough, Oliver, and Thomas (eds.), An introduction to systematic reviews. London: Sage.
Image Credits: © Hilda Bastian (CC-BY-NC-ND).
posted on December 05, 2016 15:20
Gillian Stokes and Sergio Graziosi - Blog Editors.
This is the launch post for the new EPPI-Centre blog: we provide a brief introduction of the topics we are planning to cover and the general aims of the blog.
Welcome to the EPPI-Centre blog
Those of you who have worked with the EPPI-Centre, or have read one of our many publications over the last 23 years of operation, will recognise the EPPI-Centre as an organisation that provides internationally impactful evidence on matters of health, social care, international development, education and policy. For those of you who are unfamiliar with who we are, the Evidence for Policy and Practice Information and Co-ordinating Centre, or EPPI-Centre, is part of the Social Science Research Unit at the Department of Social Science, UCL Institute of Education, University College London.
We are committed to informing policy and professional practice with rigorous evidence. Our two main areas of work are systematic reviews and research use. Our systematic review work comprises a variety of research endeavours that include: developing research methods for systematic reviews and research syntheses, conducting reviews, supporting others to undertake reviews, and providing guidance and training in this area. With regard to research use, we study the use/non-use of research evidence in personal, practice and political decision-making, and support those who wish to find and use research to help solve problems in a wide variety of disciplines; to do this, we also provide guidance and training.
Why are we opening a new blog?
One of our defining interests revolves around public engagement, and we are keen to open new channels of communication, especially if they cut across the boundaries of the academic “ivory tower”. We have been conducting open seminars since the beginning of 2015, which have been excellently received. Speakers include researchers from the EPPI-Centre, as well as researchers from a wide range of world-class institutions. (Click here for forthcoming events, or see an overview of our past seminars and associated resources). Furthermore, since April 2011, we have provided headlines about our work via our Twitter feed. Twitter has proven useful for engaging a large audience; however, we felt it was time to provide a platform to discuss our research in more detail. The blog is intended as a platform to allow us to expand on our research findings and methodological ideas, and to open a dialogue between researchers and readers, in order to engage readers with our research and explore it further. It will also provide a channel to test and refine our current thinking in a less formal, more inclusive medium than the traditional outlets of conferences and peer-reviewed publications.
The EPPI-Centre benefits from having research staff from a variety of backgrounds: medicine, education, statistics, media, and economics to name but a few. This multidisciplinary expertise has benefitted our research work greatly, for example by providing insight and understanding of working practices and policy. Here on the EPPI-Centre blog our researchers will be able to share their work and expand on their thoughts and ideas with interested readers. We want to provide you with thought-provoking and informative posts that will encourage debate, not just disseminate reports and journal articles. Most of all, we want to blog in order to challenge our approach to our work and explore the issues that we may encounter within our research. Thus, the blog offers us a way to explore new lines of thinking and engage with our audiences in new and productive ways.
What can you look forward to reading about on the EPPI-Centre blog?
- News and reviews – we will keep you informed of new review work, articles and books that we have published, as soon as they have been released.
- Exploratory essays - thoughts about ideas and lines of research that we find worth pursuing, thinking about and discussing with interested parties.
- Training and workshop session updates – we will let you know about training days or workshops that we are running and write about key points emerging from the sessions for those unable to attend.
- Conference news – we will let you know about upcoming conferences that we are running or speaking at, by providing you with key dates for your diaries and links to enable you to register.
- Projects and trials of interest - we will also post links to works by our own researchers and others to inform you of trials, projects, or other developments that you might find useful that relate to our work.
We think that you will find the blog a great way to interact with us here at the EPPI-Centre and hope that you join us regularly for updates of our published work as well as information about our plans for the remainder of 2016 and beyond.
Please also follow us on Twitter – we look forward to hearing from you in the coming months and engaging in online discussions!
Gillian Stokes is a Research Officer at the EPPI-Centre, UCL Institute of Education, University College London. Her main research interests include developing research methods and public and patient involvement in research, particularly children’s involvement in translational medicine. She has been working on systematic reviews focused on health and medicine since May 2013.
Sergio Graziosi is the Information Systems Manager at the EPPI-Centre, UCL Institute of Education, University College London. His main research interests revolve around the use of technology in systematic reviews as well as more generally the challenges and limitations of research synthesis.