Your Guide to P Values and Confidence Intervals – Shoe Size, Penile Length, Football and Heart Attacks!
In work and life, we are always attempting to make links between the hundreds and thousands of variables around us. Some of these links may be true, while others may just be false alarms. How do we know which links are significant enough for us to change our behaviour? This is the whole basis of statistics.
It is important to recognise that the only certainty in life is the uncertainty and complexity that surround us. To make sense of this, we make assumptions and adjust for uncertainty as best we can.
I’ll take the example of medical studies. In most studies, we want to know if an intervention delivers a positive result or if a particular exposure causes a disease. The first thing we do when we want to investigate two variables is to state a hypothesis. This hypothesis is usually framed as a null hypothesis. So let’s take an example of a fun study.
SHOE SIZE AND PENILE LENGTH
Does shoe size predict penile length? I’m not kidding, this is a real study.
Step 1: State the null hypothesis: there is no relationship between shoe size and penile length, or, put another way, shoe size does not predict penile length. The aim of the study is to see whether the data allow us to reject this hypothesis in favour of the alternative hypothesis: there is a relationship between shoe size and penile length, i.e. shoe size predicts penile length (there are three possibilities: no correlation, a positive correlation or a negative correlation).
Step 2: Two urologists measured the stretched penile length of 104 men in a prospective study and related this to their shoe size.
Step 3: The results were assessed statistically using a least-squares regression model, with the level of significance chosen as P<0.05.
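Here, as a minimal sketch, is what such an analysis might look like in Python. The data below are simulated stand-ins (not the study's actual measurements), and scipy's linregress plays the role of the authors' least-squares model:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 104                                    # same sample size as the study
shoe_size = rng.normal(9.5, 1.5, n)        # hypothetical shoe sizes
penile_length = rng.normal(13.0, 1.6, n)   # hypothetical lengths in cm, unrelated by construction

# Least-squares regression; linregress returns the slope, intercept,
# correlation coefficient r and a two-sided p-value in one call
result = stats.linregress(shoe_size, penile_length)
print(f"r^2 = {result.rvalue ** 2:.3f}, p = {result.pvalue:.2f}")

alpha = 0.05  # significance level chosen before looking at the data
if result.pvalue < alpha:
    print("Reject the null hypothesis")
else:
    print("Fail to reject the null hypothesis")
```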
P value
Now say you got a positive result, i.e. penile length increases with shoe size. How can you be sure that this correlation is statistically significant? In other words, how compatible are the sample data with the null hypothesis? This is where the p value comes in. The p value accounts for uncertainty by telling you how likely the effect observed in your data (or a more extreme one) would be if the null hypothesis were true.
Here is the American Statistical Association (ASA) definition:
A p-value is the probability under a specified statistical model that a statistical summary of the data (for example, the sample mean difference between two compared groups) would be equal to or more extreme than its observed value.
If you have set the significance level at 0.05, then a p value of <0.05 tells you that, assuming the null hypothesis is true, there is a small probability (less than 1 in 20, or 5%) of obtaining a result equal to or more extreme than the one observed. Thus, there is evidence to reject the null hypothesis.
On the other hand, if the p value were 0.65, then assuming the null hypothesis is true, you would expect to obtain the observed result or a more extreme one 65% of the time. That's not too flash, is it? You would then fail to reject the null hypothesis (which is not the same as proving it true). I hope that helps you understand the p value.
Well, for those who are curious, the real results were the following:
The linear regression statistic between the stretched penile length and shoe size gave an r² of 0.012 (P=0.28), suggesting no statistically significant relationship between stretched penile length and shoe size.
The p value interpretation is:
Assuming the null hypothesis is true (shoe size does not predict penile length), the observed effect or a more extreme one would occur 28% of the time.
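You can make this definition concrete with a quick simulation (my own illustration, not from the study). Generate many datasets in which the null hypothesis is true by construction, and count how often the correlation is at least as extreme as the one observed; with |r| ≈ 0.11, roughly what the study's r² of 0.012 implies, the fraction should land near the reported 0.28:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 104, 10_000
observed_r = 0.11  # roughly the |r| implied by the study's r^2 of 0.012

count = 0
for _ in range(trials):
    # The null hypothesis is true by construction: x and y are independent
    x = rng.normal(size=n)
    y = rng.normal(size=n)
    r = np.corrcoef(x, y)[0, 1]
    if abs(r) >= abs(observed_r):  # "equal to or more extreme", two-sided
        count += 1

print(f"simulated p value = {count / trials:.2f}")  # should come out near 0.28
```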
CONFIDENCE INTERVALS
The other concept in precision is the confidence interval (CI). In the above study, there is no way one could sample all the men in the world and measure their shoe sizes and penile lengths. If one could, one would get the exact correlation coefficient, or the exact mean shoe size and mean penile length. Since we only ever have a sample, we need a range within which the true population measure is likely to lie. This is the confidence interval.
Usually, the confidence interval is set at 95%, which tells you that if you repeated this study 100 times, in about 95 of those 100 repetitions the interval computed would contain the true measure.
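A quick simulation (again a sketch of my own, with made-up population parameters) makes this interpretation concrete: draw 100 samples from a population whose mean we know, build a 95% CI from each, and count how many intervals capture the truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
true_mean, sd, n = 13.0, 1.6, 104  # hypothetical population parameters

hits = 0
for _ in range(100):  # "if you did this study 100 times"
    sample = rng.normal(true_mean, sd, n)
    # 95% CI for the mean, based on the t distribution
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    hits += lo <= true_mean <= hi

print(f"{hits} of 100 intervals contain the true mean")  # typically ~95
```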
Let’s look at another interesting study.
FOOTBALL AND HEART ATTACKS
Try to answer the questions below after reading the following result from the study by Wilbert-Lampen et al., which examined the association between cardiovascular events and World Cup football.
Cardiovascular events (read: heart problems) occurring in patients in the greater Munich area were prospectively assessed by emergency physicians during the World Cup. We compared those events with events that occurred during the control period: May 1 to June 8 and July 10 to July 31, 2006, and May 1 to July 31 in 2003 and 2005.
Acute cardiovascular events were assessed in 4279 patients. On days of matches involving the German team, the incidence of cardiac emergencies was 2.66 times that during the control period (95% confidence interval [CI], 2.33 to 3.04; P<0.001); for men, the incidence was 3.26 times that during the control period (95% CI, 2.78 to 3.84; P<0.001), and for women, it was 1.82 times that during the control period (95% CI, 1.44 to 2.31; P<0.001).
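Before tackling the questions, it may help to see roughly where a ratio like 2.66 (95% CI, 2.33 to 3.04) comes from. The counts below are hypothetical, and the log-scale normal approximation is a standard textbook method, not necessarily the authors' exact model:

```python
import math

# Hypothetical counts: events and days of observation in each period
events_match, days_match = 600, 7     # days with German World Cup matches
events_ctrl, days_ctrl = 1600, 50     # control period

rate_ratio = (events_match / days_match) / (events_ctrl / days_ctrl)

# 95% CI on the log scale: log(RR) +/- 1.96*SE, with SE = sqrt(1/a + 1/b)
se = math.sqrt(1 / events_match + 1 / events_ctrl)
lo = math.exp(math.log(rate_ratio) - 1.96 * se)
hi = math.exp(math.log(rate_ratio) + 1.96 * se)

print(f"incidence ratio = {rate_ratio:.2f} (95% CI, {lo:.2f} to {hi:.2f})")
```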
1. Was the incidence of cardiac emergencies statistically significant and why?
2. Do men have an increased risk of cardiovascular events during world cup matches? Is this greater than the risk for women? Is the result statistically significant?
3. Finally, should greater emergency procedures be in place during World Cup events based on this result? (Hint: This requires subjective and analytical thinking and depends on many variables). Answers at the end.
KEY TAKEAWAY POINTS
The p-value on its own means nothing. It has to be put into the context of the study's methodology and the measure of effect. Results can be made significant by weakening the outcome measure (e.g., if the benchmark improvement is 8 points and you get a non-significant result, lowering the benchmark to 4 points may yield a statistically significant one; the sketch after this paragraph shows why that practice is dangerous).
But a 4 point improvement is not as good as an 8 point improvement. Interpretation is always partly subjective, and this is where analytical skills are important: they ensure you don't take things at face value.
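To see why this kind of benchmark-shopping is dangerous, here is a small simulation of my own (not from any of the studies above). Both arms are drawn from the same distribution, so the null hypothesis is true by construction; yet reporting the best p value across several candidate benchmarks pushes the false-positive rate above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, trials = 100, 2000
benchmarks = (4, 6, 8)  # hypothetical "response" cut-offs to shop between

false_positives = 0
for _ in range(trials):
    # No true effect: both arms come from the same distribution
    treat = rng.normal(5, 6, n)
    control = rng.normal(5, 6, n)
    p_values = []
    for b in benchmarks:
        a, c = (treat >= b).sum(), (control >= b).sum()
        table = [[a, n - a], [c, n - c]]  # responders vs non-responders
        # correction=False keeps each single test at the nominal 5% level
        _, p, _, _ = stats.chi2_contingency(table, correction=False)
        p_values.append(p)
    false_positives += min(p_values) < 0.05  # keep only the "best" result

print(f"false-positive rate: {false_positives / trials:.1%}")  # above 5%
```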
According to Ioannidis’ striking article, Why Most Published Research Findings Are False:
Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values.
According to the recently released statement by the ASA:
The p-value was never intended to be a substitute for scientific reasoning.
Over time it appears the p-value has become a gatekeeper for whether work is publishable, at least in some fields… This apparent editorial bias leads to the ‘file-drawer effect,’ in which research with statistically significant outcomes are much more likely to get published, while other work that might well be just as important scientifically is never seen in print. It also leads to practices called by such names as ‘p-hacking’ and ‘data dredging’ that emphasize the search for small p-values over other statistical and scientific reasoning.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
A CI is a range of values that, at a given confidence level, is expected to contain the true population measure.
Confidence intervals can be narrowed by increasing the sample size, as a larger sample brings your estimate closer to the true population measure (see the sketch below).
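As a final sketch (simulated data again, with made-up parameters), note how the width of a 95% CI for a mean roughly halves every time the sample size quadruples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_mean, sd = 13.0, 1.6  # hypothetical population parameters

for n in (25, 100, 400):
    sample = rng.normal(true_mean, sd, n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    print(f"n = {n:3d}: 95% CI width = {hi - lo:.2f}")
```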
ANSWERS
1. Yes, the incidence was statistically significant: P<0.001 is well below the chosen 0.05 threshold, and the 95% CI (2.33 to 3.04) excludes 1.
2. Yes, men have an increased risk of cardiovascular events during World Cup matches, and this risk is greater than that for women: the 95% CIs for men (2.78 to 3.84) and women (1.44 to 2.31) do not overlap, and both P values are <0.001, so the results are statistically significant.
3. There is no fixed answer, and other studies are required. Different people may have different ways of analysing these data.
Note: For the RANZCP and MRCPsych exams, some questions still frame the p-value as the probability of committing a type 1 error (false positive) AND the probability of the observed effect being due to chance. See video below.