Within all the talk about publication bias, p-hacking, the replication crisis and so forth, I am finding it harder and harder to keep track of all the proposed solutions. While trying to organise my thoughts, I have realised that the absence of theoretical clarity underlies many of the problems that are currently being discussed. Perhaps this realisation is enough to justify a slight change in focus. For systematic reviewing, as Mark has reminded us, figuring out what results should be trusted, and perhaps more importantly, finding auditable and reasonably objective ways to do so is, naturally, of paramount importance. I do not think I need to convince anyone about this latter point, and will take it for granted in what follows.
More than ten years after warning us that most of the published research is false, Ioannidis (with colleagues) has produced a manifesto for reproducible science (Munafò et al. 2017). It is well worth a read, but it did not soothe my disquiet and confusion. On one hand, the manifesto comes with a range of concrete, actionable and agreeable suggestions. On the other, the same suggestions are, to my eyes, already worrying: the value of each remedial measure is likely to depend on how robust its implementation can be. Let’s consider pre-registration: it is a very attractive strategy and I am sure it is already contributing to diminish practices such as harking and p-hacking. However, on closer examination, one can find the contribution from Wicherts et al., where they list all degrees of freedom that researchers may exploit (consciously or not, legitimately or not) in their search for “significant” results. The list includes 34 potential problems, framing their discussion around the pitfalls that should be avoided when relying on preregistration. Ouch.
Checking for signs of 34 separate questionable practices when reviewing a single study in conjunction with its preregistration looks already daunting and close to utopian – especially when one remembers that the authors’ interest is to paint their own research in the most positive light. How many researchers are likely to critically consider each relevant pitfall of each step of their own workflow, and do so at the right time?
On the other side of the fence, to compile systematic reviews, one would need to go through the same checklist for all studies considered, and perhaps check the consistency of decisions across multiple reviewers. If I extrapolate, and assume that each of the twenty-plus strategies proposed in Munafò’s manifesto comes with a similar number of ways to fail to fully deliver its own potential (even if this doesn’t entail a combinatorial explosion, as there are many overlaps), my mind vacillates and immediately starts looking for strategies that come with lower cognitive costs.
What I will propose is indeed a shortcut. A (hopefully handy) heuristic that revolves around the role of theory in primary research. My starting point is a concise list of typical research phases (up-to and excluding research synthesis as such), being mindful that many alternatives exist. The table below may be read as a simplified version of the list produced by Wicherts et al., compiled with two underlying objectives: keeping it manageable, and highlight the role of theory. My main hunch is that when one clarifies the role played by theory in a given research phase, pitfalls, dangers and malpractice may become easier to isolate. You may decide to read what follows as an argument advocating for epistemological clarity in scientific reporting.
|Research phase & role of theory
|Theory building: this is typically done to try to accommodate the evidence that isn’t satisfactorily accounted-for by existing theories.
||Identify a need: as anomalies are accumulating, people start asking “do we need an entirely new theory?”
Historically, theories such as electromagnetism. More recently, the creation of countless classifications of psychological ‘types’.
|1. Fail to account for all/enough available evidence.
2. Fail to realise how current evidence may fit in existing frameworks.
3. Give new names/labels to existing concepts; fail to appreciate how existing theories use different labels to point at similar concepts or mechanisms.
4. Fail to capture regularities, which directly depend on non-contingent causal chains.
1: No new theory can expect to account for all evidence from day zero.
2: That’s how theories degenerate: what if a new theory can accommodate more evidence with less ad-hoc extensions?
3: Existing theories are confusing, imprecise, too broad or too narrow.
4: This can be established only post-hoc. One needs to first theorise and then check that predictions do apply. Only then can one focus on causal explanations (secondary hypotheses).
|Draft a new theory.
|Formulate new hypotheses: within a theoretical framework, many separate hypotheses can be identified.
||Data exploration: find patterns in existing data.
||Analysis and re-analyses of longitudinal studies
1. Spurious correlations.
2. Pattern-finding bias (we tend to see patterns in noise).
3. Mistaking homogeneity for random noise (the opposite of pattern-finding).
4. Survivorship bias.
|These pitfalls are irrelevant, because hypotheses need to be tested anyway.
||Deductively explore the consequences of a given theory, i.e. recalculation of expected light-bending effect of gravity as a result of general relativity.
1. Logic failures and/or lack of imagination.
2. Overconfidence, producing hypotheses that are too far removed from tested-theory.
3. Lack of ambition: producing ever more detailed hypotheses, just to get publishable positive results.
1-2: as above.
3: this is how “normal science” is done!
||Test an hypothesis.
||Measure the effect of a drug.
1. Bad/insufficient clarity on what is tested.
2. Bad experimental design.
3. Low power.
4. Measure proliferation (encourages p-hacking).
5. Unpublished negative results / publication bias.
1: that’s science, people can’t know it all already.
2-4: budget/capacity. Science happens in the real world, we can do what we can do.
5: ditto, can’t spend ages trying to publish results that no-one wants to read.
|Make predictions – applied science.
||Build bridges, design new microprocessors
1. Overconfidence: stretching a theory beyond its known/expected limits of applicability.
2. Failure to account for theoretical boundaries (not knowing when/why a theory stops to apply).
3. Failure to measure outcomes.
1: But, but, science told us it will work!
2: Can’t anticipate unknown unknowns.
3: We don’t need to, because our theory is solid.
The interesting part of this exercise is how many of the known problems are not, or are only marginally captured by the table above – I would argue that a good number fall in the cracks between the cells above. Thus, my point is that clarifying what one is doing (am I producing a new hypothesis? Am I testing a given one? Am I trying to see if we should start looking for new possible theories?) should be second-nature for all scientists (but alas, I don’t think it is). This may make it easier to double check for well-known pitfalls, but also to avoid stumbling on the boundaries between separate tasks. For example, P-hacking and HARKing can be deliberate malpractice, or could result from “Bad/insufficient clarity on what is tested”. However, it seems to me that it may also be caused by a lack of separation between hypothesis testing and data-exploration.
For example, we may imagine our typical scientist: in this imaginary scenario, her normal work starts by testing a hypothesis – to keep it simple, we’ll assume she is testing the effectiveness of drug D. Let’s imagine she finds a tiny effect size, but with a considerable number of outliers which seem to be clustered together. The original experiment was testing a hypothesis: first result is that drug D doesn’t appear to work. However, we now have new data to explore (a different task), and perhaps we can find that the outliers all have trait T in common. The same experiment therefore yielded a second (separate!) result: we now have a new hypothesis - perhaps drug D only works on subjects with T.
One study yielded two “results”, one is negative or inconclusive; the second is a new hypothesis. Being a new hypothesis, it needs to be tested. In our oversimplified example, the data suggests a new hypothesis, and therefore it can’t also confirm it.
In other words, perhaps we can agree that clarifying and segregating tasks based on how they relate to theory has helped identifying a well-known problem with unreliable science, and has concurrently made it clear how to best use the data collected. Most scientific work actually happens across multiple phases, but nevertheless, having a clear conceptual picture of the boundaries could be a useful approach to avoid repeating well-known mistakes. It goes without saying that such conceptual clarity, if transferred in published research articles, also has the potential of making the task of systematic reviewers less prone to error and less dependent on hard to audit personal judgements.
Is this simplistic proposal enough to overcome all the problems mentioned above? Of course not. It is a mere heuristic; a simple rule of thumb, which I hope might be useful to our readers. If time permits, I hope to explore this same theme in the context of conducting systematic reviews in a follow-up post. In my studies (molecular biology and neuroscience), no one ever helped me realise how the role of theory relates with the different mistakes that may plague scientific results. In fact, no one ever discussed the epistemological foundations of science; I guess they were mostly taken for granted. Thus, perhaps my suggestion is directed to educators in particular: discussing and clarifying the distinctions I’ve mentioned here might be a low-cost strategy to help the next generation of scientists not to repeat our own mistakes.
About the author
Sergio Graziosi is the EPPI-Centre IT manager, and one of the developers of EPPI-Reviewer. Designing tools to conduct systematic reviews implicitly requires exploring what can and cannot count as good/reliable evidence. As a consequence, he’s been exploring science epistemology on his own (non-academic) blog.
Kerr NL (1998). HARKing: hypothesizing after the results are known. Personality and social psychology review : an official journal of the Society for Personality and Social Psychology, Inc, 2 (3), 196-217 PMID: 15647155
Head ML, Holman L, Lanfear R, Kahn AT, & Jennions MD (2015). The extent and consequences of p-hacking in science. PLoS biology, 13 (3) PMID: 25768323
Ioannidis JP (2005). Why most published research findings are false. PLoS medicine, 2 (8) PMID: 16060722
Munafò, M., Nosek, B., Bishop, D., Button, K., Chambers, C., Percie du Sert, N., Simonsohn, U., Wagenmakers, E., Ware, J., & Ioannidis, J. (2017). A manifesto for reproducible science Nature Human Behaviour, 1 (1) DOI: 10.1038/s41562-016-0021
Wicherts JM, Veldkamp CL, Augusteijn HE, Bakker M, van Aert RC, & van Assen MA (2016). Degrees of Freedom in Planning, Running, Analyzing, and Reporting Psychological Studies: A Checklist to Avoid p-Hacking. Frontiers in psychology, 7 PMID: 27933012
Image Credits: © Munafò et al., Nature Publishing Group (CC-BY).