Devil’s Advocate: Uncorrected Stats and the Trouble With fMRI
This rather technical post is a response to a recent Observer article by Vaughan Bell, in which he follows up on a recent internet discussion about the validity, or otherwise, of fMRI methods; a blog post by StokesBlog responded. I must admit I hesitate to step into the fray. I think “flaws” in neuroimaging methods are sometimes overstated, but I don’t want people to think I am flawed, so perhaps I should keep my head down. Oh well, here goes…
Some of the criticism of fMRI methods seems to be founded on the idea that some imaging researchers are unaware of the issues surrounding multiple comparisons. To the best of my knowledge this is a misconception. Methods for correcting for multiple comparisons have been built into analysis software since the late 90s. Researchers who report uncorrected statistics always qualify them as such, and they always adopt a more stringent voxel-level threshold than the conventional p<0.05 (albeit one that is often rather liberal compared with corrected thresholds, typically p<0.001). These researchers appreciate that uncorrected statistics do not control the probability of a Type I error in the dataset as a whole, although they accurately quantify the probability of a Type I error at the voxel level.
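The difference between voxel-level and dataset-level error control is easy to show numerically. Here is a toy simulation (the voxel count, seed, and thresholds are arbitrary illustrations, not any package's defaults) counting how many pure-noise voxels survive an uncorrected p<0.001 threshold versus a Bonferroni-style family-wise threshold:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_voxels = 50_000
z = rng.standard_normal(n_voxels)       # a pure-noise "brain": no true signal anywhere
p = stats.norm.sf(z)                    # one-sided voxelwise p-values

# Uncorrected p<0.001: each voxel's Type I risk is controlled, but across
# 50,000 voxels we expect ~0.001 * 50,000 = 50 spurious "activations".
uncorrected = int(np.sum(p < 0.001))

# Bonferroni p<0.05/n: family-wise control, expect ~0.05 voxels (usually none).
bonferroni = int(np.sum(p < 0.05 / n_voxels))
```

The uncorrected map shows dozens of isolated false positives; the corrected map is essentially empty, which is exactly the trade-off at issue here.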
It is sometimes argued that the reporting of uncorrected statistics is indefensible and flawed given the availability of appropriate corrections. I don’t want to defend the indefensible (e.g., passing off uncorrected statistics as corrected), but I think the situation is more nuanced than this. If I speak frankly about it, I hope I will not be thrown out of the neuroimaging union as a heretic. I am not arguing that people using corrections are wrong, or that people using uncorrected data are right, but rather that each approach has its strengths and weaknesses, and that current methods for correcting for multiple comparisons have yet to reach the point where they should be mandatory for all fMRI experiments. I have a great deal of sympathy for those who favour a very conservative interpretation of fMRI data, but I am going to play Devil’s advocate here, because I think there is another side to the story which is still worth serious consideration.
The key measures in fMRI are BOLD signal changes. These are inferred by fitting a general linear model (GLM) to appropriately preprocessed data. Typically, regressors for each distinct condition are constructed so that the parameters of the fitted model reflect the change in BOLD signal associated with that condition relative to some implicit baseline (the mean signal during unmodelled periods). These parameters are then analysed statistically: parameter estimates are obtained for each voxel in each participant, and the within- and between-subject variation is used to determine T- or Z-statistics at each location in a “standardized” brain. There is some flexibility in how the model is constructed, and the GLM and preprocessing steps also embody a large number of assumptions about the nature of BOLD changes following neural activity and the way these combine in time and space. One might raise caveats about these in a few cases, but my impression is that there is little dispute in the field that these steps provide a valid measure of signal change, a valid statistical characterisation of signal change, and a valid estimate of the voxel-level probability of a Type I error.
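A stripped-down sketch of this voxelwise GLM step, for a single voxel's time series (all numbers here are invented for illustration; real pipelines convolve the regressors with a haemodynamic response function and model temporal autocorrelation):

```python
import numpy as np

rng = np.random.default_rng(1)
n_scans = 200
# Alternating 20-scan off/on blocks as the condition regressor, plus a baseline.
boxcar = ((np.arange(n_scans) // 20) % 2).astype(float)
X = np.column_stack([boxcar, np.ones(n_scans)])

# Simulated voxel: true signal change 1.5 units over a baseline of 100, plus noise.
y = X @ np.array([1.5, 100.0]) + rng.standard_normal(n_scans) * 2.0

# Ordinary least squares: beta_hat = (X'X)^-1 X'y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
dof = n_scans - X.shape[1]
sigma2 = resid @ resid / dof                          # residual variance
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
t_stat = beta_hat[0] / se                             # t-statistic for the condition effect
```

The recovered parameter estimate (`beta_hat[0]`) is the "signal change" that enters the group-level analysis; `t_stat` is the voxel-level statistic that later gets thresholded.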
In traditional analyses the Z- or T-statistics are thresholded at some level, and the thresholded maps are used both for display purposes and for statistical inference. It is important to note that there is very little concern about the validity of the underlying statistics; the concern relates only to the choice of threshold. Some researchers (the majority) feel that a very conservative threshold is appropriate, because they are principally concerned with the possibility that results might prove to be false positives (i.e., unreplicable). Other researchers are concerned to ensure that positive findings are not missed, since they could lead to important discoveries and insights. Because statistical thresholds are also used in the visualisation of the data, a conservative threshold has the effect of hiding the underlying data from scrutiny. In my view, it is the responsibility of each scientist to avoid being wrong (i.e., avoid false positives) while at the same time being right (i.e., contributing to scientific knowledge by making new discoveries and reporting their data in full). Present methods tend to heighten the tension between these two perfectly reasonable objectives, but the objectives are always in tension to some degree. Good scientists might in good faith reach different conclusions about where the balance should lie.
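This tension can be made concrete with a toy simulation (voxel counts, effect size, and seed are all invented for illustration): a liberal threshold admits noise voxels, while a conservative one misses genuine effects.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_null, n_active = 10_000, 100
# Most voxels are pure noise; 100 carry a real effect of 3.5 standard deviations.
z = np.concatenate([rng.standard_normal(n_null),
                    rng.standard_normal(n_active) + 3.5])
truth = np.concatenate([np.zeros(n_null, bool), np.ones(n_active, bool)])

results = {}
for label, p_thresh in [("liberal", 0.001), ("bonferroni", 0.05 / z.size)]:
    hits = z > stats.norm.isf(p_thresh)            # voxels surviving this threshold
    false_pos = int(np.sum(hits & ~truth))
    missed = int(np.sum(truth & ~hits))
    results[label] = (false_pos, missed)
```

Under these (made-up) numbers, the liberal threshold produces a handful of false positives but catches most of the real effects; the family-wise correction all but eliminates false positives at the cost of missing the large majority of true activations.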
Crucially, good scientists will always understand and accurately report their own methods so that others can form their own conclusions. Of course the standard criterion of p<0.05 corrected for multiple comparisons is completely arbitrary, and no sensible researcher should accept something as undeniable fact because it passes the criterion, or dismiss it as meaningless because it falls just short. In my view it is not sensible for scientists to base their understanding of the brain (or anything else) on an uncritical acceptance of someone else’s interpretation, the output of a program, or an arbitrarily chosen statistical threshold. So I tend to think all data is good data, and I want to see as much of it as possible – including the noisy stuff that might undermine my impression of a particular study.
In neuroimaging, it is very easy to construct a poorly designed experiment that will produce activity all over the brain. This type of flaky study is easily spotted, because the uncorrected, low-threshold, unmasked statistics show “activations” scattered randomly throughout the brain. This is also a good way to spot underpowered experiments where the activation is due solely to noise (it is surprising such activations were not seen in the famed dead salmon “study”, for example, which makes me suspicious about the validity of the methods used). How much easier it is to restrict the analysis to a prespecified region of interest, or to raise the statistical criterion to obscure this inconvenient evidence. Then one can claim that the effect is highly selective. In my view this problem is at least as serious as the over-interpretation of uncorrected stats.
Neuroimaging appears to be almost unique in that using more conservative statistical thresholds seems to make your data look better. In most other branches of science, one is obliged to show the data that don’t reach statistical significance, however inconvenient it may be. In my view we ought to be presenting unthresholded effect size data much more often (e.g., parameter estimate maps). It has always been unnecessarily tricky to get this data out of imaging analysis packages, whether in the form of images or voxel level statistics – in my ideal world, it would be the default.
Effect sizes will never tell the whole story, however; we also need some way to evaluate different findings and gain an objective measure of the signal relative to the noise. T- and Z-statistics already fulfil this function at the level of the individual voxel, but interpretation is typically focussed on larger regions which are not very clearly delineated either functionally or anatomically. Indeed, there is a nagging sense that there may not be any scale at which we can understand brain function in terms of discrete modules that either do, or don’t, do something. Rather, it seems that parts of the brain down to the level of individual neurons are more or less active in a specific computation, and that they form more or less continuous (and embedded) functional maps with few clear boundaries. Many of our current methods depend ultimately on the unrealistic assumption that the brain is made up of discrete regions with discrete functions, and that there is little or no important information in the spatial organisation. Indeed, the use of thresholds in visualising data gives the casual reader the impression that brain activity in a particular task is confined to the tiny active chunks, and that nearby regions are uninvolved. In fact the unthresholded responses are distributed, often showing clear symmetry and other signs of systematic organisation.
This organisation plays no part in the corrections we currently perform in traditional voxelwise analyses. To the extent that the spatial structure is not random, it ought to inform our estimate of the probability of a Type I error at scales larger than one voxel (beyond merely affecting the smoothness of the data), but it does not. So the p-values from corrected analyses are likely to overestimate the true probability of spurious findings. In my ideal world, we would eventually develop detailed predictions about patterns of activity (based on specific functional-anatomical models) and compare the fit of alternative models. Alternatively, we need better procedures for formal inference which take into account the information present at different spatial scales.
Given, however, that we want to draw some conclusions from the data using the voxelwise analytical techniques we currently have, we need some idea of where to draw the line: when is an observation completely spurious, and when is it worth reporting? There are advantages and disadvantages to different thresholding approaches, but I think it is arguable that none of the current methods of correction for multiple comparisons is problem-free.
FWE correction seems overly conservative; cluster correction seems overly dependent on the size of the structure and on the sometimes spurious contiguity of voxels. Small volume correction (SVC) can be affected by post-hoc bias in the selection of regions to be corrected. ROI analyses are also prone to these issues, and there is some flexibility in the way the regions are determined and sampled based on functional and anatomical data. Notably, both techniques effectively apply a different, more liberal, criterion to some regions (where activity is predicted) than to others (where it is not). FDR is the pick of the parametric approaches, in my view. However, both FWE and FDR methods, as I understand them, involve peeking at the data to determine the threshold, albeit automatically and on a principled basis. The mathematics of these principled corrections is not easy to follow, and this data-dependent thresholding is troubling.
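To see what "peeking at the data" means in the FDR case, here is a sketch of the textbook Benjamini–Hochberg step-up rule (this is the standard procedure, not the exact code of any imaging package; the example p-values are hypothetical): the cutoff is the largest sorted p-value satisfying p(k) ≤ (k/m)·q, so the threshold itself depends on the observed p-values.

```python
import numpy as np

def bh_threshold(p_values, q=0.05):
    """Benjamini-Hochberg: largest p-value cutoff with p_(k) <= (k/m)*q; 0.0 if none."""
    p_sorted = np.sort(np.asarray(p_values))
    m = p_sorted.size
    below = p_sorted <= (np.arange(1, m + 1) / m) * q
    if not below.any():
        return 0.0
    return float(p_sorted[np.nonzero(below)[0].max()])

# Hypothetical voxel p-values: only the two smallest survive at q = 0.05,
# because the cluster of p-values around 0.04 narrowly fails the step-up rule.
cutoff = bh_threshold([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.2, 0.5, 0.8, 0.9])
```

Note that shifting the larger p-values around can change which of the smaller ones survive, which is precisely the data-dependence described above.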
In addition, many of the degrees of analytic flexibility appear to be associated with the many options for correction, which can all be selectively justified in retrospect. Data too noisy – try an SVC or ROI analysis? An FWE correction will certainly clean things up nicely, and your reviewers will never know. Not enough activity – perhaps an FDR threshold would be better?
In these circumstances I think one can make a case that uncorrected statistics provide a principled, non-flexible approach to presenting the data, the principles being:
i) the threshold for inference is independent of the data;
ii) it is the same across the whole brain;
iii) it is the same across different studies (i.e., not chosen to obscure inconvenient findings);
iv) the method for calculating the threshold is transparent to the authors, reviewers and readers of the paper.
Like StokesBlog, I’d favour a permutation-testing approach for statistical inference, because it meets principles ii), iii) and iv) above and has all the strengths of the current parametric corrections, but without the assumptions and flexibility. However, I also think that in many ways neuroimaging is not yet at the stage where we ought to be testing detailed hypotheses against tight statistical criteria. In many parts of the brain the function is not yet well enough observed to form viable hypotheses. Testing a very detailed hypothesis against a (very weak) null hypothesis is questionable in these circumstances. Yes, there are many questions for which it is already quite reasonable to be very conservative when testing detailed and well-founded hypotheses against alternatives of similar plausibility and complexity. However, there is also an important place for observational and exploratory approaches. Dressing up observational/exploratory research as hypothesis-driven is potentially problematic, and perhaps this is the underlying problem with the flexibility of the methods we have.
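The permutation approach favoured above can be sketched in a few lines. This is the standard sign-flipping, maximum-statistic procedure for nonparametric family-wise error control at the group level (the group size, voxel count, effect size, and seed are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(3)
n_subj, n_voxels = 16, 500
data = rng.standard_normal((n_subj, n_voxels))   # one contrast map per subject
data[:, :10] += 1.5                              # ten voxels carry a real group effect

def t_map(d):
    """One-sample t-statistic at each voxel."""
    return d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(d.shape[0]))

observed = t_map(data)

# Null distribution: under H0 each subject's map is symmetric about zero,
# so randomly flipping subjects' signs and taking the maximum t across
# voxels builds the distribution of the largest statistic expected by chance.
max_null = np.empty(1000)
for i in range(1000):
    signs = rng.choice([-1.0, 1.0], size=(n_subj, 1))
    max_null[i] = t_map(data * signs).max()

fwe_thresh = np.quantile(max_null, 0.95)         # threshold controlling FWE at ~5%
n_sig = int(np.sum(observed > fwe_thresh))       # voxels surviving the correction
```

Because the threshold comes from the empirical maximum-statistic distribution rather than parametric theory, it inherits the smoothness and noise structure of the actual data, with no distributional assumptions and no analyst-chosen tuning.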
For some purposes the fMRI scanner can be viewed as analogous to a telescope. Early astronomers pointed their telescopes at the sky and described what they saw; they could not form hypotheses about nebulae, galaxies, Saturn’s rings and Jupiter’s moons, because they had no idea they were there. It seems at least possible that similarly striking, significant and unanticipated phenomena are present in our brains. In my experience fMRI data is far richer and more reliable than one might think from looking only at the thresholded peaks in a conventional analysis. Indeed, promising results from MVPA/classifier approaches imply that subthreshold activity is often meaningful. Methods for investigating these patterns and their meaning are still in their infancy, but I think it a safe bet that in 10 years’ time we will recognize that, in addition to the many false positives our era has produced, there were a large number of missed opportunities staring us in the face. I think that more advanced nonparametric thresholding methods, and multivariate and spatial analyses, will probably show that our current methods are overly conservative.
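The point about subthreshold pattern information can be illustrated with a toy simulation (not real data; all trial counts and effect magnitudes are invented): voxels whose individual effects are statistically unimpressive can still jointly support reliable decoding of the experimental condition.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_per_class, n_voxels = 150, 100
# A weak but consistent multivariate pattern: each voxel shifts by only +/-0.2.
w = 0.2 * rng.choice([-1.0, 1.0], size=n_voxels)
a = rng.standard_normal((n_per_class, n_voxels))        # condition A trials
b = rng.standard_normal((n_per_class, n_voxels)) + w    # condition B trials

# Voxelwise two-sample t-tests: each voxel's effect is individually weak.
t_vox, p_vox = stats.ttest_ind(a, b, axis=0)

# Nearest-centroid decoding on held-out trials: the pattern is jointly informative.
train, test = slice(0, 100), slice(100, None)
mu_a, mu_b = a[train].mean(axis=0), b[train].mean(axis=0)

def predict(x):
    """Label 0 if a trial is nearer A's centroid, 1 if nearer B's."""
    return (np.linalg.norm(x - mu_a, axis=1) > np.linalg.norm(x - mu_b, axis=1)).astype(int)

accuracy = np.mean(np.concatenate([predict(a[test]) == 0, predict(b[test]) == 1]))
```

In this simulation the typical voxel falls well short of conventional significance, yet the held-out decoding accuracy is comfortably above chance, which is the essence of the MVPA argument that subthreshold activity carries meaning.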
Overall, I tend to think that fellow scientists are acting in good faith, and that in many cases they have given careful consideration to the methods most appropriate to the experimental question at hand. I don’t think the field has yet arrived at a consensus as to how to analyse the data, and I don’t think we yet have ideal methods for quantifying the probability of Type I errors above the voxel scale. It may turn out that different questions require different forms of analysis and statistical approaches.