HARK, I hear a misperception!

Commentary

An explanation of HARKing, multiplicity, p-value interpretation, and why HARKing is problematic

Photo: Petchladda/Adobe Stock

Hypothesizing After Results are Known (HARKing) is a questionable research practice and a significant problem for science.1 What exactly is HARKing, what does it look like, and why is it problematic?

Let’s start by looking at how we typically perform scientific studies. For example, as inquisitive minds, we might want to know if living with aardvarks in the home makes people happier. Now we have a goal. The next step is to create alternative and null hypotheses. We need both because science cannot prove anything and instead follows a methodology of falsification.

The famous black swan example from Austrian-British philosopher and academic Sir Karl Popper describes this best: if you hypothesize that all swans are white, finding a thousand or even a million white swans does not prove this hypothesis. Finding one black swan, on the other hand, will disprove the hypothesis.2

The alternative and null hypotheses exist as mirror opposites of each other. We then design studies in an effort to disprove the null hypothesis (since we can only disprove in science). As a result, the alternative hypothesis is typically framed to mirror what we hope to find. Then when the study disproves the null hypothesis, we can accept, but not prove, the alternative hypothesis.

A null hypothesis might go something like: “The happiness of people cohabiting with an aardvark for a period of 30 days is the same as that of those who do not.” Our goal is to disprove the null hypothesis by seeing a statistical difference between these two groups. Then we can say that people who live with aardvarks are happier.3

We also want to know the probability that our findings are simply the product of random chance; we do not want to draw the wrong conclusion from a fluke. The amount of risk we are willing to accept from chance alone is arbitrary but conventionally set at 5%. The probability value, or p-value, estimates how likely it is that we would see a difference at least as large as ours if chance alone were at work. Given this, we want a very low number. With 5% represented as a p-value of 0.05, we hope for a number at or below 0.05. Anything above this threshold means we cannot rule out random chance as the explanation for our findings.4

If we compare our aardvark and non-aardvark-cohabiting people, we could find a way to measure their happiness levels and compare. And, if a difference exists, we could run statistics to determine the probability that the difference was due to chance alone. Thus, if our results showed that people living with aardvarks were happier with a p-value of 0.05, we could say, “We are 95% confident that the happiness level of people living with aardvarks for 30 days is higher than that of those who do not.” We would have disproved our null hypothesis, could publish our article, and could start promoting aardvarks as pets.

But let’s say that we did not find that living with aardvarks makes you happier. Unfortunately, well-designed scientific studies that do not show an association are less likely to be published. On the surface, this may seem reasonable, but it is the basis for what is called “publication bias.” Great data showing that no likely association exists between two things are never published and cannot be referenced. As a result, the available medical literature may become biased in favor of positive findings even when there is truly no real difference. Researchers must not only look for associations; to be published, they often need to find some.

So, we have a group of people who lived with aardvarks for a month and a control group, and we saw no difference in their happiness levels. We want to be published, though, so what do we do? Luckily, beyond using the ACME Happy-O-Meter, we also measured many other variables within the study. We can go back and look for statistical differences between our aardvark-cohabiting and non-aardvark-cohabiting people. When we find some associations, we can create hypotheses after the results are known; in other words, we can perform HARKing.

Why is this bad? Isn’t this what research is about, looking to find new associations? It is and it isn’t. The problem is that random chance does occur, and we can find false associations. How often? Well, we have already set our own limit for this. We will accept up to 5% random chance.

When we use this set level for a hypothesis prior to testing, we can honestly say that there is a 95% chance that the conclusion regarding the hypothesis is not due to chance. We cannot say this with harked data. Imagine a bucket of 100 marbles: 95 white ones and 5 blue ones. The probability of picking a blue marble on one try is 5%. If we pick 5 times (replacing the marble each time), our chance of picking at least one blue marble is now 22.6%. If we pick 20 times, it rises to 64.2%; at 30 picks, to 78.5%. These additional chances to grab blue marbles are referred to as the problem of multiplicity. If we run enough comparisons, we are almost sure to find a blue marble. See the problem?
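The marble arithmetic above can be checked with a few lines of code (an illustrative sketch; the 5% per-draw probability and the draw counts come from the example, and the function name is ours):

```python
# Chance of drawing at least one "blue marble" (a 5% event) in n
# independent draws with replacement: 1 - 0.95**n
def p_at_least_one_blue(n: int, p_blue: float = 0.05) -> float:
    return 1 - (1 - p_blue) ** n

for n in (1, 5, 20, 30):
    print(f"{n} draws: {p_at_least_one_blue(n):.1%}")
# → 1 draws: 5.0%, 5 draws: 22.6%, 20 draws: 64.2%, 30 draws: 78.5%
```

The same formula underlies the 78.5% figure used later for a study that examines 30 variables.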

The problems of multiplicity and HARKing intertwine. Multiplicity within the study is the statistical problem that allows HARKing to result in the publication of false positive (Type I error) results. Studies that look at many variables in many ways may look intriguing but are often at high risk for multiplicity, HARKing, and Type I error results.

Now, let’s assume for this article that there is absolutely no difference between living with aardvarks for a month and living without them; everything is exactly the same. In our aardvark study, we looked at many variables: how long people slept, use of flossing, the frequency of quizzical looks, wearing blue, wearing green, eating cereal, etc. If we looked at 30 different variables, there is a 78.5% chance that we would find a statistical difference between the two groups based solely on random chance (multiplicity). Hence, if we found that people who lived with aardvarks statistically drank more orange juice, it might be possible to say that “living with aardvarks has been identified with increased vitamin C consumption and is deemed good for your health,” which would be HARKing. We could also report a p-value of 0.05 and write the study as though this association was what we had set out to test. After all, consuming vitamin C and avoiding scurvy is a great step towards happiness, right?
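To watch multiplicity emerge rather than just compute it, one can simulate this null scenario directly (a sketch under the article’s assumptions: 30 variables, a 5% threshold, and each comparison modeled as independently “significant” with probability 0.05 when no true effect exists):

```python
import random

random.seed(0)  # fixed seed so the illustration is reproducible

def study_finds_something(n_vars: int = 30, alpha: float = 0.05) -> bool:
    # Under the null hypothesis, every comparison is a coin flip that
    # comes up "significant" with probability alpha
    return any(random.random() < alpha for _ in range(n_vars))

trials = 100_000
rate = sum(study_finds_something() for _ in range(trials)) / trials
print(f"{rate:.1%} of simulated null studies report at least one 'finding'")
```

Across many simulated studies with no true effects at all, roughly 78.5% still turn up at least one “significant” variable, matching the closed-form figure in the text.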

Thus, as readers of studies, it is easy for us to take a quick look at the abstract and conclusions and see that there is a reason to live with aardvarks. When challenged, we can say, “No, it is true. Look at the p-value!” The false positive (Type I) error we are willing to accept for the ACME Happy-O-Meter and for orange juice is the same (5%), but the actual risk for error for each of these variables is very different because of multiplicity.

When we set up the design, the possibility of a false positive (Type I) error, or pulling a blue marble, with regard to the original null hypothesis remains at 5%. While this threshold is the same for each of the subsequent variables, variables are being pulled and evaluated until significance is found. With multiplicity in play, 30 grabs at marbles mean that the possibility of pulling a blue marble amidst our results is 78.5%. Due to multiplicity, there is concern that increased orange juice consumption while living with aardvarks may be a blue marble; that is, a statistically different value based solely on chance, despite the fact that the p-value is also 0.05.

When we look at associations found through HARKing, we don’t know if they are white marbles or blue marbles; that is, are they associations based on truth or associations based on chance? The p-value can no longer provide us with the same level of assurance when HARKing is involved.

How do we know if there was HARKing? Sometimes it can be difficult. If the study only provided data on the relationship between living with aardvarks and orange juice consumption, we may never know. On the other hand, if we see the original null hypothesis, the data from the ACME Happy-O-Meter, along with a long list detailing other evaluated variables such as how long people slept, use of flossing, the frequency of quizzical looks, wearing blue, etc., we can more easily surmise that the increased orange juice consumption may be the result of multiplicity and HARKing. This doesn’t necessarily mean that the finding is false, but it certainly doesn’t mean that it is true either. It may make us suspicious that it is a blue marble, but we don’t know. It does warrant another study of orange juice consumption and living with aardvarks. This would be an appropriate use of HARKing.

We could run another study whose null hypothesis was, “The orange juice consumption of people cohabiting with an aardvark in the home for a period of 30 days is the same as that of those who do not.” Then, if we saw a statistical difference with a p-value of 0.05, we could say, “We are 95% confident that living with aardvarks for 30 days increases orange juice consumption compared with living without one.” Same p-value, same association, but completely different in how we as readers should interpret the data.

The first study does not allow us to be confident in the result regarding orange juice consumption; the second does. But we know from the original discussion that we assumed no difference between living with or without aardvarks. Thus, the second study would likely not find a relationship with increased orange juice consumption. Those researchers, also wanting their study published, could start HARKing through multiple other variables looking for an association. Then perhaps there would be a random association with increased flossing. The cycle could continue, and through HARKing, incorrect conclusions may work their way into our scientific beliefs. Soon people could be adopting aardvarks because of misperceptions regarding positive benefits for health and happiness.

So, if you don’t want to end up living with aardvarks, beware of HARKing! HARKing is a questionable research practice that can introduce unnecessary risk of additional Type I errors into the published literature. These errors can distract us from the truth and diminish our scientific knowledge.

References

  1. Kerr NL. HARKing: hypothesizing after the results are known. Pers Soc Psychol Rev. 1998;2(3):196–217. doi:10.1207/s15327957pspr0203_4. PMID: 15647155.
  2. FZE BBC. The Karl Popper concept of falsifiability philosophy essay. UKEssays. Published September 20, 2024. Accessed May 20, 2025. https://www.ukessays.com/essays/philosophy/the-karl-popper-concept-of-falsifiability-philosophy-essay.php
  3. Null hypothesis: definition, symbol, formula, types and examples. BYJU’S. April 25, 2022. Accessed May 20, 2025. https://byjus.com/maths/null-hypothesis/
  4. P-value: comprehensive guide to understand, apply, and interpret. GeeksforGeeks. January 31, 2024. Accessed May 20, 2025. https://www.geeksforgeeks.org/p-value/
© 2025 MJH Life Sciences. All rights reserved.