When Is There a Statistically Significant Difference Between Two Treatment Groups?

To determine whether the difference between two means is statistically significant, analysts often compare the confidence intervals for those groups. If those intervals overlap, they conclude that the difference between groups is not statistically significant. If there is no overlap, the difference is significant.

While this visual method of assessing overlap is easy to perform, it comes at the cost of reducing your ability to detect differences. Fortunately, there is a simple solution that preserves the ease of a visual assessment without diminishing the power of your analysis.

In this post, I’ll start by showing you the problem in action and explain why it happens. Then, we’ll proceed to an easy alternative method that avoids this problem.

Comparing Groups Using Confidence Intervals of Each Group Estimate

For all hypothesis tests and confidence intervals, you are using sample data to make inferences about the properties of population parameters. These parameters can be population means, standard deviations, proportions, and rates. For these examples, I’ll use means, but the same principles apply to the other types of parameters.

Related posts: Populations, Parameters, and Samples in Inferential Statistics and Difference between Inferential and Descriptive Statistics

Determining whether confidence intervals overlap is an overly conservative approach for identifying significant differences between groups. It’s true that when confidence intervals don’t overlap, the difference between groups is statistically significant. However, when there is some overlap, the difference might still be significant.

Suppose you’re comparing the mean strength of products from two groups and graph the 95% confidence intervals for the group means, as shown below. Download the CSV dataset that I use throughout this post: DifferenceMeans.

[Graph: 95% confidence intervals for the two group means. The intervals overlap.]

Related post: Understanding Confidence Intervals
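If you'd like to reproduce these per-group intervals, here is a minimal sketch in Python using scipy. The column names "Group A" and "Group B" are hypothetical; substitute whatever names your copy of DifferenceMeans.csv actually uses.

```python
# Minimal sketch: t-based 95% CIs for each group's mean.
# Assumes DifferenceMeans.csv has columns "Group A" and "Group B"
# (hypothetical names) containing the strength measurements.
import numpy as np
import pandas as pd
from scipy import stats

frame = pd.read_csv("DifferenceMeans.csv")

def mean_ci(sample, confidence=0.95):
    """Confidence interval for a mean, based on the t-distribution."""
    sample = np.asarray(sample)
    return stats.t.interval(confidence, len(sample) - 1,  # degrees of freedom
                            loc=sample.mean(), scale=stats.sem(sample))

for group in ["Group A", "Group B"]:
    low, high = mean_ci(frame[group].dropna())
    print(f"{group}: 95% CI for the mean = [{low:.2f}, {high:.2f}]")
```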

Jumping to Conclusions

Upon seeing how these intervals overlap, you conclude that the difference between the group means is not statistically significant. After all, if they’re overlapping, they’re not different, right? This conclusion sounds logical, but it’s not necessarily true. In fact, for these data, the 2-sample t-test results are statistically significant with a p-value of 0.044. Despite the overlapping confidence intervals, the difference between these two means is statistically significant.
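You can verify the test result with a few lines of scipy. This sketch assumes the same hypothetical column names and pooled (equal) variances, a choice you'd want to check against your own output.

```python
# Sketch of the corresponding 2-sample t-test.
import pandas as pd
from scipy import stats

frame = pd.read_csv("DifferenceMeans.csv")
result = stats.ttest_ind(frame["Group B"].dropna(),
                         frame["Group A"].dropna(),
                         equal_var=True)  # pooled variances; an assumption, not a given
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```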

This example shows how the CI overlapping method fails to reject the null hypothesis more frequently than the corresponding hypothesis test. Using this method decreases the statistical power of your assessment (higher type II error rate), potentially causing you to miss essential findings.

This apparent discrepancy between confidence intervals and hypothesis test results might surprise you. Analysts expect that confidence intervals with a confidence level of (100 − X) percent will always agree with a hypothesis test that uses a significance level of X percent. For example, analysts often pair 95% confidence intervals with tests that use a 5% significance level. And it's true: confidence intervals and hypothesis tests should always agree. So, what is happening in the example above?

Related posts: How Hypothesis Tests Work and Two Types of Error in Hypothesis Testing

Using the Wrong Types of Confidence Intervals

The problem occurs because we are not comparing the correct confidence intervals to the hypothesis test result. The test results apply to the difference between the means while the CIs apply to the estimate of each group’s mean—not the difference between the means. We’re comparing apples to oranges, so it’s not surprising that the results differ.

To obtain consistent results, we must use confidence intervals for differences between group means—we’ll get to those CIs shortly.

However, if you’re determined to use CIs of each group to make this determination, there are several possible methods.

Goldstein and Healy (1995) show that if you want barely non-overlapping intervals to correspond to a significant difference between two means at the 5% level, you should graph an 83% confidence interval for each group's mean. The graph below uses this confidence level for the same dataset as above, and the intervals don't overlap.

[Graph: 83% confidence intervals for the two group means. The intervals do not overlap.]
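Reproducing the 83% intervals only requires changing the confidence level. A sketch under the same column-name assumption:

```python
# Same t-based interval calculation as before, at the 83% level.
import pandas as pd
from scipy import stats

frame = pd.read_csv("DifferenceMeans.csv")
for group in ["Group A", "Group B"]:  # hypothetical column names
    data = frame[group].dropna()
    low, high = stats.t.interval(0.83, len(data) - 1,  # degrees of freedom
                                 loc=data.mean(), scale=stats.sem(data))
    print(f"{group}: 83% CI for the mean = [{low:.2f}, {high:.2f}]")
```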

Cumming & Finch (2005) find that the degree of overlap for two 95% confidence intervals for independent means allows you to estimate the p-value for a 2-sample t-test when sample sizes are greater than 10. When the confidence limit of each CI reaches approximately the midpoint between the point estimate and the limit of the other CI, the p-value is near 0.05. The first graph in this post, with the 95% CIs, approximates this condition, and the p-value is near 0.05. Lower amounts of overlap correspond to lower p-values. For example, 95% CIs where the end of one CI just reaches the end of the other CI corresponds to a p-value of about 0.01.
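If you'd rather apply this rule of eye numerically than squint at a graph, here is a rough sketch of the overlap proportion it describes. The function is my own construction for illustration, not code from the paper.

```python
def overlap_proportion(ci_a, ci_b):
    """Overlap of two CIs as a proportion of their average margin of error.

    Cumming & Finch's rule of eye for 95% CIs of independent means (n > 10):
    a proportion near 0.5 suggests p near 0.05; a proportion of 0
    (intervals just touching) suggests p near 0.01.
    """
    (low_1, high_1), (low_2, high_2) = sorted([ci_a, ci_b])  # order by lower limit
    overlap = high_1 - low_2               # negative when a gap separates the CIs
    avg_margin = ((high_1 - low_1) + (high_2 - low_2)) / 4   # mean half-width
    return overlap / avg_margin

print(overlap_proportion((10.0, 14.0), (13.0, 17.0)))  # 0.5 -> p roughly 0.05
```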

To me, these approaches seem kludgy. Using a confidence interval of the difference is an easier solution that even provides additional useful information.

Assessing Confidence Intervals of the Differences between Groups

Previously, we saw how the apparent disagreement between the group CIs and the 2-sample test results occurs because we used the wrong confidence intervals. Instead, we need a CI for the difference between group means. This type of CI will always agree with the 2-sample t-test—just be sure to use the equivalent combination of confidence level and significance level (e.g., 95% and 5%). We’re now comparing apples to apples!

Using the same dataset as above, the confidence interval below presents a range of values that likely contains the difference between the means for the entire population. The interpretation remains a simple visual assessment: zero represents no difference between the means, so ask whether the interval contains zero. If it does not, the difference is statistically significant because the range excludes the value representing no difference. At a glance, we can tell that the difference is statistically significant.

[Graph: 95% confidence interval for the difference between the group means. The interval excludes zero.]

This graph corresponds with the 2-sample t-test results below. Both test the difference between the two means. This output also provides a numerical representation of the CI of the difference [0.06, 4.23].

[Output: 2-sample t-test results for the difference between means, including the 95% CI of the difference.]
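One way to compute this interval directly is statsmodels' CompareMeans. This sketch again assumes the hypothetical "Group A"/"Group B" columns and pooled variances.

```python
# Sketch: 95% CI for the difference between two independent means.
import pandas as pd
import statsmodels.stats.api as sms

frame = pd.read_csv("DifferenceMeans.csv")
compare = sms.CompareMeans(sms.DescrStatsW(frame["Group B"].dropna()),
                           sms.DescrStatsW(frame["Group A"].dropna()))
low, high = compare.tconfint_diff(alpha=0.05, usevar="pooled")  # pooled = equal variances
print(f"95% CI for the difference between means: [{low:.2f}, {high:.2f}]")
```

This interval excludes zero exactly when the matching 2-sample t-test is significant at the 5% level, which is the agreement this post is after.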

In addition to providing a simple visual assessment, the confidence interval of the difference presents crucial information that neither the group CIs nor the p-value provides. It answers the question: based on our sample, how large is the difference between the two population means likely to be? Like any estimate, there is a margin of error around the point estimate of the difference. It's important to factor in this margin of error before acting on the findings.

For our example, the point estimate of the mean difference is 2.15, and we can be 95% confident that the population difference falls within the range of 0.06 to 4.23.

Related posts: How T-tests Work and How Confidence Intervals Work

Interpreting Confidence Intervals of the Mean Difference

Statisticians consider differences between group means to be an unstandardized effect size because these values indicate the strength of the effect using values that retain the natural data units. Effect sizes help you understand how important the findings are in a practical sense. To learn more about unstandardized and standardized effect sizes, read my post about Effect Sizes in Statistics.

As with all CIs, the width of the interval for the mean difference reveals the precision of the estimated effect size. Narrower intervals suggest a more precise estimate. And, you can assess whether the full range of values is practically significant. Remember, statistical significance doesn’t necessarily indicate that the results are meaningful in the real world. For more information about this issue, see Practical vs. Statistical Significance.

When the interval is too wide (imprecise) to be helpful and/or the range includes differences that are not practically significant, you have reason to hesitate before making decisions based on the results. These types of CI results indicate that you might not obtain meaningful benefits even though the difference is statistically significant.

There’s no statistical method for answering questions about how precise an estimate must be or how large an effect must be to be practically useful. To use the confidence interval of the difference to answer these questions, you’ll need to apply your subject-area knowledge.

For the example in this post, it’s important to note that the low end of the CI is very close to zero. It will not be surprising if the actual population difference falls close to zero, which might not be practically significant despite the statistically significant result. If you are considering switching to Group B for a stronger product, the mean improvement might be too small to be meaningful.

When you’re comparing groups, assess confidence intervals of those differences rather than comparing confidence intervals for each group. This method is simple, and it even provides you with additional valuable information.

References

Goldstein, Harvey; Healy, Michael J. R. The Graphical Presentation of a Collection of Means. Journal of the Royal Statistical Society, Series A, Vol. 158, No. 1 (1995), pp. 175–177.

Cumming, Geoff; Finch, Sue. Inference by Eye: Confidence Intervals and How to Read Pictures of Data. American Psychologist, Vol. 60, No. 2 (Feb–Mar 2005), pp. 170–180.

What does it mean when there is a statistically significant difference between groups?

A “statistically significant difference” simply means there is statistical evidence of a difference; it does not mean the difference is necessarily large, important, or practically meaningful.

What is the best statistical test when comparing more than two treatments?

When comparing more than two sets of numerical data, a multiple-group comparison test such as one-way analysis of variance (ANOVA) or the Kruskal-Wallis test should be used first, as sketched below.
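As a minimal sketch with made-up samples, both tests are one-liners in scipy:

```python
# Comparing more than two groups (hypothetical data for illustration).
from scipy import stats

a = [5.1, 5.3, 4.9, 5.2]
b = [5.6, 5.8, 5.7, 5.9]
c = [5.0, 5.2, 5.1, 5.3]

f_stat, p_anova = stats.f_oneway(a, b, c)  # one-way ANOVA (parametric)
h_stat, p_kw = stats.kruskal(a, b, c)      # Kruskal-Wallis (nonparametric)
print(f"ANOVA p = {p_anova:.4f}, Kruskal-Wallis p = {p_kw:.4f}")
```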