Devil’s Advocate: Uncorrected Stats and the Trouble With fMRI

This rather technical post is a response to a recent Observer article by Vaughan Bell. In it he follows up on a recent internet discussion around the validity or otherwise of fMRI methods. A blog post by StokesBlog responded.  I must admit I hesitate to step into the fray. I think “flaws” in neuroimaging methods are sometimes overstated, but I don’t want people to think I am flawed, so perhaps I should keep my head down. Oh well, here goes…

Some of the criticism of fMRI methods seems to be founded on the idea that some imaging researchers are unaware of issues surrounding multiple comparisons. To the best of my knowledge this is a misconception. Methods for correction of multiple comparisons have been built into analysis software since the late 90s. Researchers who report uncorrected statistics always qualify them as such and they always adopt a more stringent nominal threshold than the p<0.05 used for corrected stats (typically p<0.001, albeit that this is often rather liberal in practice compared to corrected stats). These researchers appreciate that uncorrected statistics do not control for the probability of Type I error in the dataset as a whole, although they accurately quantify the probability of a Type I error at the voxel level.
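To put rough numbers on the difference between voxel-level and dataset-level error rates, here is a minimal sketch. The voxel count is a hypothetical figure chosen for illustration, and the tests are treated as independent, which is an assumption that smoothed fMRI data do not satisfy, so the family-wise figure is only indicative.

```python
# Illustrative arithmetic only. The voxel count is a hypothetical brain-mask
# size, and tests are treated as independent (not true of smooth fMRI data).

def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of null tests that pass an uncorrected threshold."""
    return alpha * n_tests


def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Probability of at least one false positive among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(fwe: float, n_tests: int) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below `fwe`."""
    return fwe / n_tests


n_voxels = 50_000          # assumed, not taken from any particular study
alpha_uncorrected = 0.001  # the typical uncorrected threshold mentioned above

print("expected false-positive voxels:",
      expected_false_positives(alpha_uncorrected, n_voxels))
print("chance of at least one false positive:",
      familywise_error_rate(alpha_uncorrected, n_voxels))
print("Bonferroni per-voxel threshold for p<0.05 corrected:",
      bonferroni_alpha(0.05, n_voxels))
```

The per-voxel probability stays at 0.001 whatever the size of the brain mask; what changes with the number of voxels is how many such errors to expect across the whole map.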

It is sometimes argued that the reporting of uncorrected statistics is indefensible and flawed given the availability of appropriate corrections. I don’t want to defend the indefensible (e.g., passing off uncorrected statistics as corrected), but I think the situation is more nuanced than this. If I speak frankly about it, I hope I will not be thrown out of the neuroimaging union as a heretic. I am not arguing that people using corrections are wrong, or that people using uncorrected data are right, rather that each approach has its strengths and weaknesses, and that current methods for the correction for multiple comparisons have yet to reach the point where they should be mandatory for all fMRI experiments. I have a great deal of sympathy for those who favour a very conservative interpretation of fMRI data, but I am going to play Devil’s advocate here, because I think there is another side to the story which is still worth serious consideration.

The key measures in fMRI are BOLD signal changes. These are inferred by fitting a general linear model (GLM) to appropriately preprocessed data. Typically regressors for each distinct condition are constructed so that the parameters of the fitted model reflect the change in BOLD signal associated with the condition relative to some implicit baseline (mean signal during unmodelled periods). These parameters are then analysed statistically: parameter estimates are obtained for each voxel in each participant, and the within- and between-subject variation is used to determine T- or Z-statistics at each location in a “standardized” brain. There is some flexibility about how the model is constructed, and the GLM and preprocessing steps also embody a large number of assumptions about the nature of BOLD changes following neural activity and the way these combine in time and space. One might raise some caveats about these in a few cases, but my impression is that there is little dispute in the field that these steps provide a valid measure of signal change, a valid statistical characterisation of signal change and a valid estimate of the voxel-level probability of a Type I error.
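As a rough illustration of the model-fitting step, the sketch below fits an ordinary least squares GLM to a single synthetic voxel time series and computes a T-statistic for one condition. It is a simplification under stated assumptions: the function name `fit_glm`, the block design and the noise level are all made up for the example, and real pipelines also convolve regressors with a haemodynamic response function, model drift and account for temporal autocorrelation, none of which is done here.

```python
import numpy as np


def fit_glm(y, X, contrast):
    """Ordinary least squares fit for one voxel.

    Returns the parameter estimates and the T-statistic for `contrast`.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    c = np.asarray(contrast, dtype=float)

    betas, _, rank, _ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ betas
    dof = len(y) - rank                      # residual degrees of freedom
    sigma2 = residuals @ residuals / dof     # residual variance estimate
    var_contrast = sigma2 * (c @ np.linalg.pinv(X.T @ X) @ c)
    t_stat = (c @ betas) / np.sqrt(var_contrast)
    return betas, t_stat


# Synthetic single-voxel data (all values are assumptions for illustration).
rng = np.random.default_rng(0)
n_scans = 120
boxcar = np.tile(np.r_[np.zeros(10), np.ones(10)], n_scans // 20)  # 10 off, 10 on
X = np.column_stack([boxcar, np.ones(n_scans)])  # condition regressor + constant

true_effect = 2.0
y = 100.0 + true_effect * boxcar + rng.normal(0.0, 1.0, n_scans)

betas, t = fit_glm(y, X, contrast=[1.0, 0.0])
print("estimated signal change:", betas[0])
print("T-statistic:", t)
```

The first parameter estimate plays the role of the effect size for the condition relative to the implicit baseline, and the T-statistic expresses that estimate relative to its noise.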

In traditional analyses the Z- or T-statistics are thresholded at some level and the thresholded maps are used both for display purposes and for statistical inference. It is important to note that there is very little concern about the validity of the underlying statistics, the concern relates only to the choice of threshold. Some researchers, the majority, feel that a very conservative threshold is appropriate, because they are principally concerned with the possibility that results might prove to be false-positives (i.e., unreplicable). Other researchers are concerned to ensure that positive findings are not missed since they could lead to important discoveries and insights. Because statistical thresholds are also used in the visualisation of the data, a conservative threshold has the effect of hiding the underlying data from scrutiny. In my view, it is the responsibility of each scientist to ensure that they avoid being wrong (i.e., avoid false-positives) while at the same time being right (i.e., contributing to scientific knowledge by making new discoveries and reporting their data in full). The present methods tend to put these two perfectly reasonable objectives into tension, but they are always in tension. Good scientists might in good faith reach different conclusions about where the balance should lie.

Crucially, good scientists will always understand and accurately report their own methods so that others can form their own conclusions. Of course the standard criterion of p<0.05 corrected for multiple comparisons is completely arbitrary, and no sensible researcher should believe something as undeniable fact because it passes the criterion or meaningless because it falls just short of it. In my view it is not sensible for scientists to base their understanding of the brain (or anything else) on an uncritical acceptance of someone else’s interpretation, the output of a program, or an arbitrarily chosen statistical threshold. So I tend to think all data is good data, and I want to see as much of it as possible – including the noisy stuff that might undermine my impression of a particular study.

In neuroimaging, it is very easy to construct a poorly-designed experiment that will produce activity all over the brain. This type of flaky study is easily spotted because the uncorrected, low threshold, unmasked statistics show “activations” scattered randomly throughout the brain. This is also a good way to spot underpowered experiments where the activation is due solely to noise (it is surprising they were not seen in the famed dead salmon “study”, for example – which makes me suspicious about the validity of the methods used). How much easier it is to restrict the analysis to a prespecified region of interest, or to raise the statistical criteria to obscure this inconvenient evidence. Then one can claim that the effect is highly selective. In my view this problem is at least as serious as the over-interpretation of uncorrected stats.

This image is from Bennett et al. The dead salmon was reportedly “…shown a series of photographs depicting human individuals in social situations with a specified emotional valence, either socially inclusive or socially exclusive. The salmon was asked to determine which emotion the individual in the photo must have been experiencing.” The red regions are those that show greater signal during the presentation of images, using a threshold of p<0.001 (uncorrected). Presumably this results solely from noise: a salutary lesson in the value of statistical corrections, perhaps. Yet it is remarkable that they are all within the animal’s tiny CNS. What are the chances of that!?

Neuroimaging appears to be almost unique in that using more conservative statistical thresholds seems to make your data look better. In most other branches of science, one is obliged to show the data that don’t reach statistical significance, however inconvenient it may be. In my view we ought to be presenting unthresholded effect size data much more often (e.g., parameter estimate maps). It has always been unnecessarily tricky to get this data out of imaging analysis packages, whether in the form of images or voxel level statistics – in my ideal world, it would be the default.

Effect sizes will never tell the whole story, however; we do need some way to evaluate different findings and gain an objective measure of the signal relative to the noise. T- and Z-statistics already fulfil this function at the level of the individual voxel, but interpretation is typically focussed on larger regions which are not very clearly delineated either functionally or anatomically. Indeed, there is a nagging sense that there may not be any scale at which we can understand brain function in terms of discrete modules that either do, or don’t do something. Rather it seems that parts of the brain down to the level of individual neurons are either more or less active in a specific computation and that they form more or less continuous (and embedded) functional maps with few clear boundaries. Many of our current methods depend ultimately on the unrealistic assumption that the brain is made up of discrete regions with discrete functions and that there is little or no important information in the spatial organisation. Indeed the use of thresholds in visualising data gives the casual reader the impression that brain activity seen in a particular task is only in the tiny active chunks, and that nearby regions are uninvolved. In fact the unthresholded responses are distributed, often showing clear symmetry and other signs of systematic organisation.

Illustrating symmetry and structure beneath the conventional thresholds using a technique developed with my PhD student Chris Racey. Participants viewed blocks of objects and places in a standard localiser task. Colour is determined by the angle formed by a vector representing the (volume normalised) Z-scores of the contrasts for objects and places (relative to a scrambled condition), with the opacity from the length of the vector formed by the two variables. Outlines show conventionally defined place (blue, places>faces) and object (yellow, objects>scrambled) selective regions. Slices are centred on MNI coordinates -40, -67, -14 and are displayed with the left hemisphere on the left of each section.
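The general idea of this kind of display, with colour taken from the angle of a two-contrast vector and opacity from its length, can be sketched as follows. This is not the authors’ implementation: the function name `two_contrast_rgba`, the choice of an HSV colour wheel and the saturation point `z_max` are assumptions made for illustration, and the volume normalisation mentioned in the caption is not reproduced.

```python
import colorsys
import math


def two_contrast_rgba(z_objects, z_places, z_max=5.0):
    """Map a pair of Z-scores at one voxel to an (r, g, b, alpha) tuple.

    Hue encodes the direction of the (objects, places) vector and alpha
    encodes its length, so weak responses fade out rather than vanish
    at a hard threshold. `z_max` (assumed) is the length at which a
    voxel becomes fully opaque.
    """
    angle = math.atan2(z_places, z_objects)             # radians in [-pi, pi]
    hue = (angle % (2.0 * math.pi)) / (2.0 * math.pi)   # wrap onto [0, 1]
    length = math.hypot(z_objects, z_places)
    alpha = min(length / z_max, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return r, g, b, alpha


# A strongly object-driven voxel, a weaker voxel with the same balance of
# responses as a strong mixed voxel, and a voxel with no response at all.
for z_pair in [(5.0, 0.0), (1.5, 2.0), (3.0, 4.0), (0.0, 0.0)]:
    print(z_pair, two_contrast_rgba(*z_pair))
```

With a mapping of this kind, two voxels with the same relative preference share a colour however weak the response, which is what lets structure beneath the conventional threshold remain visible.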

This organisation plays no part in the corrections we currently perform in traditional voxelwise analyses. To the extent that the spatial structure is not random, it ought to inform our estimate of the probability of a Type I error at scales larger than one voxel (beyond merely affecting the smoothness of the data), but it does not. So the p-values from corrected analyses are likely to overestimate the true probability of spurious findings. In my ideal world, we will eventually develop detailed predictions about patterns of activity (based on specific functional-anatomical models) and compare the fit of alternative models. Alternatively we need better procedures for formal inference which take into account the information present at different spatial scales.

Given, however, that we want to draw some conclusions from the data using the voxelwise analytical techniques that we currently have, we need to have some idea where to draw the line: when is an observation completely spurious, and when is it worth reporting? There are advantages and disadvantages to different thresholding approaches, but I think it is arguable that none of the current methods of correction for multiple comparison is problem-free.

FWE correction seems overly conservative, while cluster correction seems overly dependent on the size of the structure and on the sometimes spurious contiguity of voxels. SVC can be affected by post-hoc bias in the selection of regions to be corrected. ROI analyses are also prone to these issues, and there is some flexibility in the way the regions are determined and sampled based on functional and anatomical data. Notably, both techniques effectively apply a different, more liberal, criterion to some regions (where activity is predicted) compared with others (where it is not). FDR is the pick of the parametric approaches, in my view. However both FWE and FDR methods, as I understand them, involve peeking at the data to determine the threshold, albeit automatically and on a principled basis. The mathematics of these principled corrections is not easy to follow, and this data-dependent thresholding is troubling.
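The data-dependence of an FDR threshold can be seen in a small sketch of the Benjamini-Hochberg step-up procedure, one standard way of controlling the false discovery rate (whether a given analysis package uses exactly this variant is not assumed here). The two lists of p-values are invented for illustration.

```python
def bh_threshold(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns the largest p-value declared significant at FDR level `q`,
    or None if nothing survives. Every p-value at or below the returned
    value is declared significant.
    """
    m = len(p_values)
    threshold = None
    for rank, p in enumerate(sorted(p_values), start=1):
        if p <= q * rank / m:
            threshold = p   # keep the largest p that meets its critical value
    return threshold


# Two made-up maps of ten "voxels". Both contain a voxel with p = 0.02.
sparse = [0.001, 0.02, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
dense = [0.001, 0.002, 0.003, 0.004, 0.005, 0.02, 0.6, 0.7, 0.8, 0.9]

print("threshold with little other activity:", bh_threshold(sparse))
print("threshold with lots of other activity:", bh_threshold(dense))
```

In the sparse map the voxel with p = 0.02 falls short, while in the dense map the same p-value survives, because the threshold is set by the rest of the data rather than fixed in advance.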

In addition, many of the degrees of analytic flexibility appear to be associated with the many options for correction, which can all be selectively justified in retrospect. Data too noisy – try an SVC or ROI analysis? An FWE correction will certainly clean things up nicely, and your reviewers will never know. Not enough activity – perhaps an FDR threshold would be better?

In these circumstances I think that one can make a case that uncorrected statistics provide a non-flexible and principled approach to presenting the data, the principles being:

i) the threshold for inference is independent of the data,
ii) it is the same across the whole brain,
iii) it is the same across different studies (i.e., not chosen to obscure inconvenient findings), and
iv) the method for calculating the threshold is transparent to the authors, reviewers and readers of the paper.

Like StokesBlog I’d favour a permutation testing approach for statistical inference because it meets principles ii), iii), iv) above and has all the strengths of the current parametric corrections, but without the assumptions and flexibility. However, I also think in many ways neuroimaging is not yet at the stage where we ought to be testing detailed hypotheses against tight statistical criteria. In many parts of the brain the function is not yet well-enough observed to form viable hypotheses. Testing a very detailed hypothesis against a (very weak) null hypothesis is questionable in these circumstances. Yes, there are many questions for which it is already quite reasonable to be very conservative when testing detailed and well-founded hypotheses against alternatives of similar plausibility and complexity. However, there is also an important place for observational and exploratory approaches. Dressing up observational/exploratory research as hypothesis-driven is potentially problematic, and perhaps this is the underlying problem with the flexibility of the methods we have.
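For readers unfamiliar with permutation testing, here is a minimal sketch of one common variant: a one-sample sign-flipping test that uses the maximum statistic across voxels to set a single whole-brain threshold. It is an illustration under stated assumptions rather than a recommended pipeline: the function name `max_t_threshold`, the synthetic group data and the number of permutations are all made up, the test is one-sided, and it relies on the null distribution being symmetric about zero.

```python
import numpy as np


def max_t_threshold(data, n_perm=1000, alpha=0.05, seed=0):
    """Sign-flip permutation test with a max-statistic threshold.

    data: array of shape (n_subjects, n_voxels) holding each subject's
    contrast estimate at each voxel. Returns the observed T-map and the
    T value that the largest null statistic exceeds in only `alpha` of
    the permutations.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_sub = data.shape[0]

    def t_map(x):
        return x.mean(axis=0) / (x.std(axis=0, ddof=1) / np.sqrt(n_sub))

    observed = t_map(data)
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        # Under the null, each subject's map is equally likely to be sign-flipped.
        signs = rng.choice([-1.0, 1.0], size=(n_sub, 1))
        max_null[i] = t_map(signs * data).max()

    threshold = np.quantile(max_null, 1.0 - alpha)
    return observed, threshold


# Synthetic group data: 16 subjects, 500 voxels, one voxel with a real effect.
rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=(16, 500))
data[:, 0] += 3.0

observed, threshold = max_t_threshold(data)
print("whole-brain T threshold:", threshold)
print("voxels surviving:", np.flatnonzero(observed > threshold))
```

The same threshold applies at every voxel and the procedure is easy to describe in a methods section, although the threshold is still derived from the data themselves.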

This lovely figure is taken from a highly relevant and interesting blogpost from Neuroskeptic. With thresholded data, we only see the tip of the iceberg of brain activity, and the regions which survive thresholding are highly dependent on our assumptions and analytical decisions, and everything else, however meaningful or undermining, is hidden from the reader and reviewer.

For some purposes the fMRI scanner can be viewed as analogous to a telescope. Early astronomers pointed their telescopes at the sky and described what they saw; they could not form hypotheses about nebulae, galaxies, Saturn’s rings and Jupiter’s moons, because they had no idea they were there. It seems at least possible that similarly striking, significant and unanticipated phenomena are present in our brains. In my experience fMRI data is far richer and more reliable than one might think if only looking at the thresholded peaks in a conventional analysis. Indeed promising results from MVPA/classifier approaches imply that subthreshold activity is often meaningful. Methods for investigating these patterns and their meaning are still in their infancy, but I think it a safe bet that in 10 years’ time, we will recognize that in addition to the many false positives our era has produced, there will be a large number of missed opportunities that were staring us in the face. I think that more advanced nonparametric thresholding methods, multivariate and spatial analyses will probably show that our current methods are overly conservative.

Overall, I tend to think that fellow scientists are acting in good faith, and that in many cases they have given careful consideration to the methods most appropriate to the experimental question at hand. I don’t think the field has yet arrived at a consensus as to how to analyse the data, and I don’t think we yet have ideal methods for quantifying the probability of Type I errors above the voxel scale. It may turn out that different questions require different forms of analysis and statistical approaches.


Author: tomhartley

Neuroscientist and University Lecturer in Psychology

9 thoughts on “Devil’s Advocate: Uncorrected Stats and the Trouble With fMRI”

  1. Thank you for your detailed and rational response. I really find the red herring – I mean salmon – study very tiresome. I rarely even use fMRI but I have become rather defensive about it because I find the discourse about it to be troubling in its generalizations. People go as far as calling fMRI pseudoscience! That is a very serious assertion that should not be made lightly.

    There are well-deserved criticisms against fMRI, but the issues are rarely about statistics. No one scans a subject for one run and interprets the data without some attempt at controlling for false positives (salmon study) so in practice all they’re saying is multiple comparisons can be a problem. How do we get from there to pseudoscience? Talk about throwing the baby out with the bathwater.

    Of course, like in any field, there are weak fMRI papers, but usually the reason is not statistical or ‘pseudoscience’ but because people are not designing good experiments and maybe some are overhyping their results to get high impact papers (and the media misrepresents stuff). This can happen in any field. Singling out fMRI has hurt neuroimaging significantly. I also think this is broadly bad for science. At a time of increasing skepticism of science and decreasing public funding, all this infighting is not good for anyone. Most scientists in the field are keenly interested in using fMRI in the most appropriate ways (and methods such as repetition suppression and MVPA/representational similarity analyses are trying to overcome some of the weaknesses of the standard GLM approach). There is no big salmon fraud happening, with researchers trying to fool everyone with their shiny blobs to attain fame and fortune. Just as blind acceptance of fMRI results without critical thinking is not ideal, neither are these sweeping generalizations. I wish people would stop being paranoid haters and assess fMRI studies rationally and scientifically. It’s just a method like any other that can be used appropriately or inappropriately.

    1. Yes, thank you for these comments. I agree wholeheartedly and I really enjoyed your impassioned post on the ‘pseudoscience’ charge.

      Like you I think that work to improve and critique fMRI methods is very valuable (I am thinking of people like Russ Poldrack and Tal Yarkoni, but there are many others). However this message is easily mischaracterised and oversimplified. I think we need a parallel message which isn’t articulated as often as it should be: fMRI data, while noisy, is plainly ordered and accurately reflects known functional anatomy (e.g., maps in retinotopic cortex). fMRI data can even be used to reconstruct movies that people are looking at! The signal is demonstrably reliable and valid. We may encounter poorly designed experiments, and poorly interpreted results, but at the level of the BOLD signal we can really see (through a glass, darkly) what happens in the brain when you think.

  2. Fantastic post Tom – thanks very much. Personally I find cluster correction for multiple-comparisons to be *conceptually* the most useful, as it’s the only one which is actually rooted in how we know the brain works – ‘true’ activations do tend to occur in clusters, reflecting the underlying functional areas. It has the issue of missing small areas though, so not perfect… I’m fully on board with your position of reporting results at multiple thresholds though – seems like the only way past these issues with the appropriate degree of transparency.

  3. Hi Tom, you mentioned “Indeed promising results from MVPA/classifier approaches imply that subthreshold activity is often meaningful”. Could you mention some MVPA/classifier papers in connection with that assertion? Much appreciated.

    1. Hi Rheisa,

      One of the earliest, and clearest papers that made this point was Haxby et al.’s (2001) paper in Science: “Distributed and Overlapping Representations of Faces and Objects in Ventral Temporal Cortex”. This shows that patterns of activity in ventral visual cortex are reliably correlated when the participant views the same category of visual object (and less so when they view different categories). Tellingly for the current argument, when the most category selective regions (i.e., those that normally exceed univariate statistical thresholds) are removed from the analysis (e.g., the FFA is removed when considering faces) the pattern in the remaining voxels is still more strongly correlated for within- rather than between-category comparisons. In general multivariate methods can show systematic, patterned responses in regions where they are not apparent in univariate comparisons.

      I hope this answers your question.
