Posted by: Jeremy Fox | December 16, 2011

Cool new method for detecting associations between variables in large datasets (UPDATEDx2)

Writing in Science this week, Reshef et al. present what looks like a very clever, powerful, and general method for detecting associations between variables in large (many variables) datasets. Statistician Andrew Gelman has a good write-up over at his blog, and the “Perspectives” piece in Science is also good.

Basically, the authors propose a new measure of association between two variables, ranging from 0 (no association) to 1 (perfectly associated). If the two variables are linearly related to one another, the new measure is basically equivalent to the familiar R². But the cool thing is that it works for pretty much any form of association, apparently including not just nonlinear and non-monotonic relationships between variables, but even including associations that can’t be described by a single mathematical function!
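To get a feel for why a measure like this can pick up non-monotonic relationships, here's a toy sketch (in Python, for illustration only — this is emphatically *not* the authors' MIC algorithm, which searches over many grid resolutions and bin placements; it's just mutual information on one fixed equal-width grid, normalized to fall in [0, 1]):

```python
import math
import random

def pearson_r(xs, ys):
    """Ordinary Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (sx * sy)

def grid_mi(xs, ys, b=8):
    """Mutual information on a single equal-width b-by-b grid,
    normalized by log2(b) so the result lies in [0, 1]."""
    n = len(xs)
    def bin_of(v, lo, hi):
        return min(b - 1, int((v - lo) / (hi - lo) * b))
    lox, hix = min(xs), max(xs)
    loy, hiy = min(ys), max(ys)
    joint, px, py = {}, [0] * b, [0] * b
    for x, y in zip(xs, ys):
        i, j = bin_of(x, lox, hix), bin_of(y, loy, hiy)
        joint[(i, j)] = joint.get((i, j), 0) + 1
        px[i] += 1
        py[j] += 1
    mi = 0.0
    for (i, j), c in joint.items():
        p = c / n
        mi += p * math.log2(p / ((px[i] / n) * (py[j] / n)))
    return mi / math.log2(b)

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(2000)]
ys = [x * x for x in xs]  # non-monotonic relationship: a parabola

print(pearson_r(xs, ys))  # near 0: linear correlation misses the parabola
print(grid_mi(xs, ys))    # well above 0: the grid statistic picks it up
print(grid_mi(xs, xs))    # near 1 for a perfect linear relationship
```

Even this crude stand-in scores the noiseless parabola highly while Pearson correlation sees essentially nothing; the real MIC does much better by optimizing the grid.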

As many-variable datasets become increasingly common in ecology, it’s going to become increasingly important to have good tools for data exploration, since rarely will we have strong a priori hypotheses about which variables should be associated with which others, and in what way. This new approach looks like it could be just the ticket.

Some questions and food for thought (none of which are criticisms, and some of which aren’t original to me; see the links above):

  • How data hungry is this approach? How does it perform with relatively small numbers of observations?
  • As Andrew Gelman points out, it seems to provide only a relative measure of association, which I don’t think can be sensibly compared across datasets.
  • This approach doesn’t free you from the problem of multiple comparisons. You’re still going to have to do a Bonferroni correction, or control your false discovery rate, or the like, in order to separate real associations between variables from ones that just reflect random chance.
  • Correlation is still not causality, no matter how much data you have, how clever your measure of correlation, and how well you correct for multiple comparisons. Especially when it’s not even a measure of partial correlation. An important direction for future work will be to see if it’s possible to extend this new measure of association into something analogous to partial correlation, so that you can ask about the association between two variables independent of the other variables in the dataset. That doesn’t prove causality either, of course, but it can be a better hint than a plain ol’ measure of association. Whether or not such an extension proves possible, I think for most applications it will be important to follow up this approach with independent checks meant to reveal the causal underpinnings of putative associations.
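To make the multiple-comparisons point concrete: screen thousands of variable pairs and some will look associated by chance alone. Here's a toy sketch (in Python; the p-values are invented for illustration and this isn't tied to the Reshef et al. software) comparing a Bonferroni cutoff with the Benjamini-Hochberg false-discovery-rate procedure:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject each p-value only if it clears alpha divided by the
    number of tests (controls the familywise error rate)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR at q)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k/m) * q ...
    kmax = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            kmax = rank
    # ... and reject the k smallest p-values.
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= kmax:
            reject[i] = True
    return reject

pvals = [0.001, 0.01, 0.02, 0.03, 0.5]
print(sum(bonferroni(pvals)))          # 2 rejections
print(sum(benjamini_hochberg(pvals)))  # 4 rejections
```

With the same made-up p-values, Bonferroni rejects only the two smallest while Benjamini-Hochberg rejects four — the usual trade of stringency for power when you have many tests.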

I propose a race: let’s see who can be the first to write an R package implementing this new approach. Ethan, Jarrett, Scott, Carl, Ted: on your marks, get set…go!

UPDATE: Whoops, turns out there’s already an R wrapper on the authors’ site (HT Scott Chamberlin). Too bad, I was kind of looking forward to posting odds and taking bets on that race…

UPDATE #2: Hmm, looks like there’s a drawback to this approach: the extreme generality is (probably not surprisingly) purchased at the cost of power. See the comment by Simon and Tibshirani here. So if you do have some reasonable a priori idea of what sort of relationships you’re looking for, or what variables you expect to be related, this may not be the best approach for you.


  1. And here I was just dusting off my spurs and about to get ready…

    • p.s. I wholly support the idea of such a neRd race in the future, though.

      • Also, the code and such is here.

  2. Jeremy, what are your thoughts on the Bonferroni correction? I recently read Moran (2003), and found it quite thought provoking, though I’m still not entirely sure where I stand on the issue.

    Moran, M.D. 2003. Arguments for rejecting the sequential Bonferroni in ecological studies. Oikos 100(2): 403–405.

    • It has its place, as xkcd recently illustrated. But the logic of it gets silly if taken to extremes. And it’s not practical for massively multivariate datasets (e.g., from gene expression chips), where you’re better off doing something like shifting your focus to the false discovery rate. I know Andrew Gelman doesn’t think you should worry about multiple comparisons and says the right way to deal with the underlying issue is to do multilevel modeling, but I’m not super familiar with his work on this.

      • Ah, but what do you make of Hurlbert, S.H. and C.M. Lombardi. 2011. Lopsided reasoning on lopsided tests and multiple comparisons. Australian and New Zealand Journal of Statistics (in press)? Or Hurlbert, S.H. and C.M. Lombardi. 2009. Final collapse of the Neyman-Pearson decision theoretic framework and rise of the neoFisherian. Annales Zoologici Fennici 46: 311–349, for that matter?

        Yes, I realize this is bait.

      • I don’t make anything of either of them, because I haven’t read them.

  3. I offer to be last in the R competition…and not last in the sense of putting your anchor man at the back of the relay, if you know what I mean.

    • Actually, rather than having individuals race I was thinking we should do it in pairs, like a three-legged race.

  4. Too bad it didn’t work out, but I look forward to Jarrett’s neRd race. Also, that paper has the longest supporting methods I’ve ever seen: over 50 pages!

    • Yup, that’s big, but check out the supp mat for Lorenzen et al.’s recent Nature paper (Species-specific responses of Late Quaternary megafauna to climate and humans)… it weighs in at 17.8 MB and 129 pages!

      • That kind of thing is why at least one leading neuroscience journal no longer routinely accepts supplementary material. They take the view (with which I have some sympathy) that, if the material is truly important to the argument of the paper, it belongs in the paper. And if it’s not truly important to the argument of the paper, why should the journal have to host it or (perhaps more importantly) reviewers have to review it?

        I wonder if Nature and Science in particular shouldn’t take the same view. There used to be a time when Nature and Science papers in ecology and evolution were really different beasts from ecology and evolution papers in specialist journals. Nature and Science papers were simple, clear, and incisive (and therefore short), but also deep and important (and therefore worthy of publication in Nature or Science). Ok, that wasn’t always the case, but that was the ideal. And it seems to me that the ideal is met substantially less often these days, as Nature and Science papers increasingly become mere abstracts of regular papers, with the bulk of the paper shunted into online appendices.

  5. Ideally, we could re-write their javascript code in R, or, if it runs too slowly in pure R, we could write it in C++ using Rcpp…just a thought.

  6. @JeremyFox, but there is no ‘native’ R version of the MINE code….mwhahaha

    • So you’re saying you want the neRd race after all? Do you and your fellow neRds need me to act as a “starter”? 😉

      • Sure, why not, but I will lag behind until I can learn Java(script)…

    • Hi Scott,
      In the comment posted by Tibshirani there is the R code to use MIC….
      and on the MIC website there is all the rest….

      command <- 'java -jar MINE.jar "test.csv" -allPairs'
      system(command)  # run the MINE jar on test.csv
      # then read in the results file that MINE writes, e.g.:
      # res <- read.csv("test.csv,B=n^0.6,k=15.0x,Results.csv")

  7. […] a recent post on a new method for detecting associations between variables in many-variable datasets, I jokingly […]

  8. […] while back I posted on a cool new nonparametric method for detecting associations between variables in multivariate […]
