# Journal Club on “Cluster Failure: Why fMRI Inferences for Spatial Extent Have Inflated False-Positive Rates”

### By David Mehler

In a recent blog post, I summarized some important findings of the recent study *“Cluster failure: Why fMRI inferences for spatial extent have inflated false-positive rates”* by Anders Eklund and colleagues, and also debunked some myths that have been circulating around the paper. Here, I would like to discuss the insights of the paper in more statistical terms and condense some insights from journal clubs I have held on it. My aim is to provide readers who would like to understand the paper’s implications in more detail with a summary of the background, as well as practical recommendations and some take-home messages based on the paper’s main results.

fMRI researchers are usually interested either in the activity of certain brain areas (so-called regions of interest, ROIs) or in whole-brain analyses (and sometimes in both). In an ROI analysis, we test a hypothesis only about a particular brain region or between regions (e.g. “Is the motor cortex more active during arm movements?”). In contrast, in whole-brain analyses we test which brain areas show task-related activity (e.g. “Where in the brain do we find differences in activity during arm movements?”). Different techniques exist (e.g. univariate and multivariate), but most analyses use the “mass univariate approach”, in which a separate test is conducted for every voxel. The insights by Eklund and colleagues are only relevant for univariate whole-brain analyses, in which one can end up with 100,000 tests or more.

Univariate whole-brain analyses thus require a reliable method of correction for multiple testing, so that the type I error rate does not inflate massively beyond 5%: by definition, we falsely reject the null hypothesis in 5% of null tests, e.g. in 5,000 of 100,000 tests, so the errors mount up into a large number of false positives. We could simply correct for the number of tests (e.g. with Bonferroni), but the exact number of independent tests is not clear, because the activity of a given voxel is not independent of its neighbors (e.g. due to common vascular supply, similar computations performed and, importantly, similar noise).
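To see why uncorrected whole-brain testing is untenable, here is a minimal sketch in Python/NumPy (the numbers are illustrative, not from the paper): with 100,000 independent null tests at alpha = 0.05, thousands of “significant” voxels appear by chance alone, while a Bonferroni correction removes essentially all of them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tests = 100_000  # roughly the number of voxels in a whole-brain analysis
alpha = 0.05

# p-values simulated under the null hypothesis (no true activation anywhere)
p = rng.uniform(size=n_tests)

# Uncorrected: about 5% of all tests come out "significant" by chance alone
n_uncorrected = int((p < alpha).sum())

# Bonferroni: divide alpha by the number of tests
n_bonferroni = int((p < alpha / n_tests).sum())

print(n_uncorrected, n_bonferroni)  # ~5,000 vs. (almost always) 0
```

Note that Bonferroni is overly strict here precisely because it assumes independent tests, which, as described above, real voxels are not.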

Moreover, in cognitive neuroscience we are usually not interested in the activity of single voxels but in the activity of brain regions, and we thus seek clusters of task-related activity. Therefore, instead of correcting at the voxel level (which can be too conservative), we can correct for the minimum number of adjacent voxels we would not expect to find by chance alone. These methods are called cluster-wise inference methods, and they are widely used to correct for multiple testing. These routines work out the minimum size a cluster must have so that we still control the type I error rate at 5%. Eklund and colleagues compared, in particular, the reliability in controlling type I error rates of two cluster-wise inference methods: *parametric cluster-extent inference* and the *non-parametric permutation test*.

How do these two approaches differ? The most important difference is that parametric approaches *assume* the distribution of the statistic of interest under the null hypothesis. Depending on the type of data we work with (e.g. the level and shape of noise, which can turn out a bit “strange” for fMRI), our assumptions might not be quite right. Non-parametric permutation-based procedures, in contrast, are set up so that they *find* the null distribution empirically, by permuting the labels and repeating the analysis many thousands of times.
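As a concrete illustration, here is a minimal one-sample permutation test via sign flipping in Python/NumPy, the kind of scheme used for a group-level contrast; the data are toy numbers, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy group data: one contrast value per subject (drawn under the null here)
data = rng.normal(loc=0.0, scale=1.0, size=20)
observed = data.mean()

# Build the null distribution by randomly flipping each subject's sign --
# valid when the null hypothesis implies a distribution symmetric around 0
n_perm = 5000
null_dist = np.empty(n_perm)
for i in range(n_perm):
    signs = rng.choice([-1, 1], size=data.size)
    null_dist[i] = (signs * data).mean()

# Two-sided p-value: how often a permuted mean is at least as extreme
p_value = np.mean(np.abs(null_dist) >= abs(observed))
```

No distributional shape is assumed anywhere; the price is that the analysis is rerun thousands of times, which matters for the runtime argument discussed further below.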

Importantly, parametric cluster-extent procedures consist of two steps. First, a cluster-defining threshold (CDT; e.g. p=0.01 or p=0.001) is applied to the statistical map to retain only supra-threshold voxels (software packages usually set a default value, but also allow users to change it). Second, based on the retained supra-threshold voxels, a cluster-level extent threshold is estimated from the data, which is supposed to give the minimum cluster size that is considered significant.
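The two steps can be sketched with NumPy and SciPy on a toy 3D map of pure noise; the extent threshold below is a fixed hypothetical number, whereas real packages estimate it from the data (e.g. via random field theory):

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)
z_map = rng.normal(size=(20, 20, 20))  # toy "z-map"; real ones come from a GLM

# Step 1: apply the cluster-defining threshold (CDT), e.g. z > 3.1 for p ~ 0.001
supra = z_map > 3.1

# Step 2: label connected supra-threshold voxels and keep only clusters at
# least as large as the extent threshold (fixed here for illustration only)
labels, n_clusters = ndimage.label(supra)
sizes = np.bincount(labels.ravel())[1:]   # voxel count per cluster
extent_threshold = 10                     # hypothetical value
significant = [c + 1 for c, s in enumerate(sizes) if s >= extent_threshold]
```

On pure noise with a strict CDT, `significant` is (almost surely) empty; the entire question studied by Eklund and colleagues is whether the estimated extent threshold keeps it that way at the advertised 5% rate.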

So what exactly did Eklund and colleagues look at? They used resting-state data (from the Connectome project) to compare the reliability of (parametric and non-parametric) multiple testing procedures as implemented in the main software packages, across different testing parameters (importantly CDT p=0.01 and CDT p=0.001). The rationale for using resting-state data was to “fake” experimental differences using data that should show no experimental differences but has real fMRI noise properties. Thus, finding clusters in more than 5% of all analyses (for a given combination of parameters) indicates that the multiple testing correction in question is problematic and produces more false positives than we deem acceptable.

In total, they conducted about 3,000,000 group analyses, an incredible data volume they were able to handle using a compute cluster and in-house software (__BROCCOLI__) that provides parallel processing on it. For each combination of parameters tested, 1,000 group analyses were carried out. The benchmark for reliability was the number of these analyses in which any significant cluster was found, divided by 1,000, i.e. the empirical false positive rate.
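This benchmark can be mimicked in miniature: run many “null analyses”, record how often anything survives correction, and compare that empirical rate with the nominal 5%. A toy Python sketch (using independent voxels and a Bonferroni correction, unlike the real study, which used resting-state fMRI data and cluster-wise correction):

```python
import numpy as np

rng = np.random.default_rng(7)
alpha, n_analyses, n_voxels = 0.05, 1000, 500

# Count the null analyses in which at least one "voxel" survives correction
n_false = 0
for _ in range(n_analyses):
    p = rng.uniform(size=n_voxels)      # null p-values for one analysis
    if (p < alpha / n_voxels).any():    # Bonferroni-corrected threshold
        n_false += 1

empirical_fwer = n_false / n_analyses   # should sit near (at most) 5%
```

A valid correction keeps `empirical_fwer` at or below roughly 0.05; rates far above that, such as the inflated rates Eklund and colleagues observed for some settings, signal a broken procedure.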

The authors made several main observations:

- As suggested already in another recent __paper__ by a different group, a too liberal threshold (i.e. CDT p=0.01) leads to unacceptably high false positive rates (i.e. inflated type I error rates). In fact, this was the 70% figure cited on some science blogs.
- Even the more conservative threshold (i.e. CDT p=0.001) is not sufficient if the cluster extent is not estimated from the data but an arbitrary ad-hoc cluster extent is chosen instead (as already demonstrated by the salmon study, which showed that with the wrong statistics one can even “find” a cluster in a dead fish).
- Importantly, the same threshold (i.e. CDT p=0.001) in combination with cluster-extent estimation gives much better results, though still slightly inflated error rates.
- Non-parametric permutation tests were the most reliable and usually controlled the type I error rate at the expected 5%.
- Voxel-wise inference was too conservative (as expected).

These findings were less surprising and largely in line with previous work. However, another interesting finding was that the assumed null distribution for fMRI data is partly wrong: the spatial autocorrelation of the noise does not take the assumed Gaussian shape. This issue has certainly contributed to the error rates found in the study. Essentially, the activity of neighboring voxels is even more similar than previously thought, simply because of the way MR physics works. The software packages that were tested have since been corrected, so if you are using one of these packages, make sure you have the latest update. Further, neighboring voxels vary in their degree of similarity depending on the region of the brain (e.g. because of tissue differences and head shape), and false positives indeed clustered more in certain areas. This finding has also stimulated __some exciting developments__.

So, given these results, should we from now on use only permutation testing (e.g. with SnPM or FSL Randomise/PALM)? This depends on a few aspects, and I would argue that, depending on the context, parametric cluster-wise inference methods are often at least equally good and sometimes even a better option. Here, I list five arguments in support of parametric cluster-based inference (when used with an appropriate CDT):

- One pragmatic aspect can be the computational resources/time available, because permutation testing requires far more time.
- The effect size one is interested in and the available sample size play a role, because for small effects sensitivity is often better with parametric tests (using CDT p=0.001 in combination with an estimated cluster extent), though potentially at the cost of slightly more false positives. Importantly, the probability of finding any false positive (i.e. the family-wise type I error rate) is not the same as the percentage of false positives in a given study! For instance, in experiments with large effects and many true positives, a multiple testing correction associated with a 70% chance of finding at least one false positive could still result in a relatively small percentage of false positives overall.
- The slightly inflated false positive rates reported by Eklund and colleagues for CDT p=0.001 seem tolerable; one might be willing to risk not correcting exactly at the nominal 5% (i.e. a slightly biased result, but greater sensitivity).
- This bias would probably be smaller than the results reported in the paper for CDT p=0.001 suggest if the analyses were now replicated with the corrected software packages, which no longer assume a Gaussian shape of the null distribution and thus provide more stringent control.
- Lastly, a compelling __recent analysis__ suggests that parametric cluster inference with CDT p=0.001 does correct at the nominal 5% when one takes the proportion of false positives (i.e. the false discovery rate, FDR) rather than the chance of any false positive (the family-wise error rate, FWER, as employed by Eklund and colleagues) as the benchmark.
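The distinction between the FWER and the proportion of false positives made in the points above can be illustrated with hypothetical numbers (my own illustration, not figures from the paper):

```python
# Hypothetical numbers: a study with large effects reports 20 clusters.
# A ~70% chance of at least one false cluster corresponds here to roughly
# 0.7 false clusters expected per analysis. The expected false discovery
# proportion among the reported clusters is then small:
n_reported_clusters = 20        # assumed: mostly true positives
expected_false_clusters = 0.7   # assumed: consistent with ~70% FWER
fdp = expected_false_clusters / n_reported_clusters
print(f"{fdp:.1%}")  # expected false discovery proportion: 3.5%
```

In other words, an alarming FWER can coexist with a modest share of false clusters in a well-powered study, which is exactly why the choice of benchmark matters.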

Of interest, a __recent white paper__ by one of the co-authors (Tom Nichols) and other leading scientists in the field discusses good practices to assure quality control in data analysis and handling. As a more general point, __another recent paper__ has highlighted the importance of effect sizes for fMRI research, which have so far largely been neglected in favor of reporting mainly statistical values.

The study by Eklund and colleagues has clearly shown that certain practices in fMRI research are flawed, while others provide reliable FWER control. The paper has stimulated vivid discussions and new developments and has made many fMRI researchers more aware of the assumptions made when conducting fMRI data analysis (for more information, see the OHBM statement). In conclusion, the findings of Eklund and colleagues are important, but they should not be overstated; rather, they should be seen as part of a larger self-correcting (learning) process currently happening in the field, including better reporting standards for neuroimaging.

—

David Mehler is an MD-PhD candidate in medicine and neuroscience at Cardiff University and University of Münster. He uses neuroimaging techniques (fMRI, EEG) to investigate neurofeedback training in healthy participants and patients with a focus on motor rehabilitation.