Posted by: Jeremy Fox | June 6, 2012

Garbage in, garbage out: what if your Big Dataset is lousy data? (UPDATED)

I’m all for making the most of the data we already have, but no more than that. A hazard of trying to wring as much as possible from any dataset is that you’ll overstep and try to use the data to address questions, or draw conclusions, that can’t be addressed or drawn. In ecology, this was the motivation behind the excellent NutNet project: existing data weren’t really adequate, so they had to go collect new data.

Over at Cop in the Hood there’s a fun rant by Peter Moskos on just this point, in a social science context. A huge, information-rich Big Dataset was recently used to argue that people in poor neighborhoods have just as easy access to nutritious food as people in rich neighborhoods, so lack of easy access to nutritious food can’t explain the higher incidence of obesity in poor neighborhoods. Which is total bunk, because the data on what constitutes a “grocery store” are, if not total garbage, at least totally inadequate for the purpose for which this study tried to use them. A fact which the study recognized, only to dismiss with the excuse that better data would have been difficult and expensive to obtain. Which amounts to saying “Doing it right would’ve been hard, so we decided to do it badly.” Click through to read the whole thing; it’s a great, short read and not at all technical.

I’m curious to hear from readers who work more with pre-existing data than I do: Have you ever looked into doing some sort of analysis of pre-existing data, only to drop it because you decided that the data weren’t good enough? Or have you ever reviewed a synthetic paper and told the authors, “Sorry, but your whole project is worthless because the data just aren’t good enough”?

And are there any general strategies that can be used to guard against making more of the data than is reasonable? One possibility is to involve the people who collected the data in any synthetic effort using those data. That’s certainly something my CIEE working group on plankton dynamics did, and I think it was a good thing, even if it does have its own risks (e.g., causing the synthesizer to worry about truly minor flaws in the data that don’t actually affect the results).

Note that one strategy that doesn’t guard against poor-quality data is “make sure you have a really big dataset.” Having more fundamentally-flawed numbers, or more non-flawed numbers to go with the fundamentally-flawed ones, doesn’t make the fundamentally-flawed numbers any less flawed. Put another way, flaws in your data don’t just create “noise” from which a “signal” can be extracted if only you have enough data. Flaws in your data can eliminate the signal entirely, or worse, generate false signals (as in the social science study linked above).
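To put that point in statistical terms, here’s a minimal sketch (made-up numbers, not data from the study above) of why sample size washes out noise but not systematic error: the sample mean of flawed measurements converges, with more data, to the wrong answer rather than to no answer.

```python
import random

random.seed(42)
TRUE_MEAN = 10.0

def sample_mean(n, bias=0.0, noise=2.0):
    """Mean of n measurements with random noise and a fixed systematic bias."""
    return sum(TRUE_MEAN + bias + random.gauss(0, noise) for _ in range(n)) / n

# Random noise shrinks as the dataset grows...
small_noisy = sample_mean(100)
big_noisy = sample_mean(100_000)

# ...but a systematic flaw (here, every value inflated by 3) does not.
small_biased = sample_mean(100, bias=3.0)
big_biased = sample_mean(100_000, bias=3.0)

print(f"n = 100,     unbiased: {small_noisy:.2f}")
print(f"n = 100000,  unbiased: {big_noisy:.2f}")   # converges to 10 (the truth)
print(f"n = 100,     biased:   {small_biased:.2f}")
print(f"n = 100000,  biased:   {big_biased:.2f}")  # converges to 13, not 10
```

The biased estimate doesn’t get noisier, it gets confidently wrong, which is exactly the “false signal” problem.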

It’s only natural that someone like me would worry about this sort of thing, as I don’t work with pre-existing data that much. I’d be interested to hear from people who do data synthesis for a living and are really invested in it (the ‘synthesis ecologists’). How often do you run into serious problems with data quality, bad enough to prevent you from answering the question you want to answer? Does the possibility keep you up at night? What do you do about it?

HT Andrew Gelman, who also comments.

p.s. Before anyone points this out in the comments: I freely grant that everyone always tries to push every method or approach as far as it will go, so everyone always runs the risk of overstepping what their chosen method or approach can teach them. But ‘synthesis ecology’ is what’s hot right now, so that’s the context in which I think it’s most important to raise this issue.

UPDATE: Here’s this post in cartoon form.


  1. Yep. This is huge. I run into it all the time. Particularly when you take a big picture view, you’re more aware of the consequences of missing a key covariate. It’s painful to throw out a dataset, or to realize that the data you want for a synthetic project just doesn’t exist, but it’s shockingly common.

    I’d also argue that this is one reason to make sure you know a system (or collaborate with someone who does) before you use its data. It’s not uncommon to see datasets kept when folks who know the system could easily point out a key missing variable that invalidates any conclusions drawn from it.

    • Thanks Jarrett.

      Do you have any impression as to whether people look for or worry about certain sorts of data problems more than others?

      For instance, I think everyone doing a meta-analysis pays lots of attention to study selection criteria. But what about problems that are common to all studies on a particular topic? Those seem like the kinds of problems that would tempt the “synthesis ecologist” to hold his or her nose and just go ahead with the analysis on the basis of “it’s the best we can do with the data we’ve got.” Which may be true, but rather cold comfort (if “the best we can do with the data we’ve got” isn’t actually very good).

      As another example, you bring up the worry about a dataset missing variables that you wish it included. A common worry, I’m sure–everyone always wants more variables rather than fewer. But do people also often worry about the variables that are included? For instance, whether the variables that are included really measure what we want them to measure? As in the social science example where “grocery stores” are categorized too coarsely for the data to be useful. Or in debates about the humped diversity-productivity relationship where people argue about the appropriate measure of productivity.

      It would be nice to have a checklist of “data problems”, but I’m not sure I’d know how to come up with one, as there are so many and they’re so varied.

  2. Almost all my work over the past 10 years has involved analyzing data that others have collected. Some of these data were collected for purposes completely removed from their later ecological use, while others were collected for the purpose they are used for, but not in an optimal and problem-free way. Both types have their own unique challenges. I sometimes come close to concluding that both types need to be trashed and all projects attempting to use them abandoned.

    About the only thing I can conclude is that the user has to know the data inside and out, and the analytical methods applied to them, and that readers of publications in general should have a default suspicious attitude towards use of the data–guilty until proven innocent maybe–and should look extremely closely at the methods section. Methods developed long ago sometimes get perpetuated, assumed by unknowing users to be legitimate when they are not (the “people have been doing it this way forever” argument).

    “How often do you run into serious problems with data quality, bad enough to prevent you from answering the question you want to answer?”

    It’s a constant problem. I sometimes have to alter the question I can address.

    “Does the possibility keep you up at night?”
    Yes, often, mostly in trying to solve a problem.

    “What do you do about it?”
    Become obsessive, get frustrated, make some type of breakthrough, repeat.

    • Re: the “people have always done it this way” argument, I have this poster on the wall in my lab. I think I’ve linked to it before in some old post.

  3. Ah well, I’m a Bayesian, so this problem never occurs.

    More seriously, yes, of course I’ve run into this problem (even worse when it’s a pushy collaborator who’s convinced that we can wring what they want out of the data, and isn’t prepared to take the word of the data analysis expert). It doesn’t keep me awake at night, because it’s usually possible to squeeze something out of the data.

    • “I’m a Bayesian, so this problem never occurs.”

      Don’t even kid about that! 😉

  4. I’m traveling and don’t have time to respond fully, but I think the short answer is that, like Jarrett says, anyone who does this for a living thinks about it a lot. The important thing to keep in mind is that this is always problem specific. I think the biggest mistake made in evaluating the utility of large datasets is to focus on a problem that has been pointed out with the data (all data have problems) without considering whether it will have an impact on the question of interest. It’s not enough to believe there is a weakness with the data; it’s important to be able to justify why it will influence the results.

    You can see more of my thoughts on this in my comment in response to a question about a recent post on one of our papers.

    In general, my philosophy is that a decent answer is better than no answer, if the perfect answer isn’t possible (and perfect answers never are), and I do think that more data can help allay concerns about data quality, if one is combining data that don’t all have the same problems. Many areas outside of ecology have shown that big data, however noisy, can be incredibly powerful. It must certainly be treated carefully, but it also must certainly be used if ecology is to address important questions that cannot be otherwise addressed.
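The caveat about combining data that don’t all share the same problems can be sketched with a toy simulation (made-up biases, not real data): when each source’s systematic error is different, pooling tends to average the errors down rather than compound them.

```python
import random

random.seed(1)
TRUE_MEAN = 10.0

def biased_sample(n, bias):
    """n measurements sharing one systematic bias, plus random noise."""
    return [TRUE_MEAN + bias + random.gauss(0, 2.0) for _ in range(n)]

# Five sources, each flawed, but the flaws differ (biases drawn around zero).
biases = [random.gauss(0, 1.5) for _ in range(5)]
pooled = [x for b in biases for x in biased_sample(2_000, b)]
pooled_mean = sum(pooled) / len(pooled)

# Any single source keeps its full bias; the pooled estimate's bias is
# the average of all five, which tends to be smaller in magnitude.
print(f"individual biases: {[round(b, 2) for b in biases]}")
print(f"pooled estimate:   {pooled_mean:.2f} (truth is {TRUE_MEAN})")
```

Of course this only helps if the flaws really are independent; five datasets that all miscount grocery stores the same way just reproduce the single-source problem at scale.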

    • Jeremy, another point. I think this issue ties in directly to our discussion some months back about efficiency and planning in scientific research. This is a perfect example where people, for whatever reason (often to get a pub, or be seen to be doing something novel, or both) use some data set(s) inappropriately. And it happens over and over again–people trying to make some type of synthetic or large scale statement about x, y or z, when the data are really just marginally suited for it, at best.

      Instead, I argue we should be putting those collective person-hours into evaluating the state of the science in x, y, or z (and this may well involve some extensive analysis on existing data sets, but only to tell us what they can, and cannot, tell us about x, y, or z), and then planning and executing the collection of the needed data (often in combination with model building and evaluation).

      This is why I have argued that we need a tight integration of theoretical/modeling work and empirical data collection, in a very structured and coordinated way. Instead, we have research teams A through Z scrambling to see who can come up with the most “novel” or trend-setting use of existing data sets. In my opinion, this is a big mistake; it’s symptomatic of our collective failure to approach the research enterprise in a well-planned way.

      • So you’d argue for more central coordination and direction of research, rather than relying on various funding agencies with various missions just picking and choosing among whatever investigator-initiated applications are submitted to them? Are you arguing for some kind of central (or collective?) decision as to how to allocate available resources to mining existing data vs. collecting new data? Perhaps analogous to how, say, astronomers all get together and submit an agreed, prioritized list of expensive instruments to NASA?

        Or is it fine to evaluate investigator-initiated proposals on their merits (e.g., if you propose a data-mining exercise when there’s good reason to believe existing data aren’t adequate, or you propose to collect new data on something we already have a lot of data on, you don’t get funded)? So that the problem, if there is one, is that we’re mis-evaluating those proposals? Or that the ratio of data mining to new data proposals has become so skewed towards the former that in practice we end up funding too many of the former and too few of the latter?

        Regarding that last possibility, I’d think that funding agencies have, or could fairly easily compile, the relevant data on frequencies of different proposal types. And I seem to recall a post over on The EEB and Flow a while back which showed that, contrary to what you might think, the proportion of published papers in leading ecology journals based on data the authors collected themselves hasn’t really changed much over time.

      • Yes, basically your first paragraph there captures my view. Spend considerable time evaluating what data exist and what uses they have (potentially for a wide number of topics); also spend similar time evaluating what the existing models tell us about what we need to know, what data are needed, etc., and what improvements are needed in the models. So yes, to be short, I’m arguing for a very highly planned endeavor; I do not like the willy-nilly approach that seems to predominate.

        In regards to your question, which is a good one, I’m honestly not sure what the cause of the existing problem, or the best solution to it, is. It’s likely to be topic specific I imagine. My feeling–and it’s only that–is that the funding agencies are maybe just not set up or structured to undertake the kind of comprehensive “state of the science” evaluation and wholesale planning I’m talking about. One point is that I’m sort of talking about data that may take decades to collect–it has to be a serious, long term commitment. And the model building may be as serious also–since some intensive model development may have to occur before any decision on what data to collect, where, how, etc., can be made in a well informed way.

        I realize too though, that there almost surely IS a lot of such planning already and I need to make myself more aware of exactly what and where it is. Still, I think it tends to be of very short term and not adequate for the most effective long term progress of the science. I realize I’m being sort of general here also.

  5. I have to say that, unfortunately, I’ve mostly seen the opposite of the problem Ethan describes. Rather, I’ve seen plenty of, for lack of a better word, “look the other way” behavior: not acknowledging the drawbacks of existing data sets.

    It’s possible that my view is biased, however, because I’ve been working very intensively on tree ring analysis methods for paleo-climatic estimation. It’s not ecology per se, but it relates closely, and the enormous repository of archived field data (over 3000 sites at the International Tree-Ring Data Bank, ITRDB) was, almost without exception, collected for that purpose. Notwithstanding this, the data have numerous, sometimes debilitating weaknesses for predicting past climates. Even so, there are countless climatic estimates made from them, and the trend has been towards increasingly large scale studies, where the problems only grow, not shrink.

