DATA: The bigger the better? A survey of analytical traps and tricks

by Elizabeth Beam


In the way of gas-guzzling vehicles and the great American gut, data these days is big and getting bigger. And why shouldn’t it? In contrast to the toll that other excesses take on the environment and our bodies, the physical burden of a large-scale dataset is nearly negligible, and decreasing. In June 2013, researchers discovered a new technique for optical recording that will make it possible to store a petabyte of data on a DVD-sized disc (1). This means that you could soon hold in the palm of your hand the genome of every US citizen—with enough room left over for two clones each (2). Alternatively, using a clever method developed by a Harvard bioengineer last year, you can now take DNA as your storage medium and encode 5.5 petabits of binary data per droplet (3).

Big is beautiful when it comes to data because quick and dirty analyses run on a large enough dataset can turn up remarkably subtle effects. Borrowing data from Google Trends, Preis et al. made just such a discovery (4). Interested in the relationship between prospective thought and economic success, they compared future-oriented search histories—how often users searched for the year to come (e.g., “2014”) versus the previous year (e.g., “2012”)—and the GDP of countries where those searches originated. Combining their well-directed hypothesis with the statistical power[1] lent by the size of their dataset, Preis et al. uncovered a strong correlation (r=0.78) between GDP and an interest in the future.

So, does this mean we can supersize our data without worrying we will feel sick to our stomachs later? Despite promising advances in data storage, some argue we have already bitten off more than we can chew. A report by the International Data Corporation shows that in 2007, the estimated data created that year exceeded the capacity of storage devices on the market (5, Figure 1). We must also keep in mind that as our Google searches and emails continue to pile up, they become increasingly difficult to keep to ourselves. Thanks to Edward Snowden, we now know that there may be a pasty-skinned, bespectacled NSA agent combing through our private emails from the comfort of his poorly lit office. If that thought gives you the heebie-jeebies, then you have come down with at least one symptom of the modern-day data glut.

Beam (Fall 2013) Figure 1

Figure 1: Estimates of the data created or captured each year and the storage available on hard drives, tapes, CDs, DVDs, and other memory devices in the market. In 2007, the data created exceeded storage capacity for the first time. Adapted from a report by the IDC (5).

Furthermore, it is critical to understand that analyzing a big dataset is not the same as analyzing any old dataset, scaled up. With more data come more analytical challenges. First, because big data are often curated from observational studies that yield correlative effects[2], we must address the alluring fallacy that correlation implies causation. Second, it is necessary to note that correlative effects may be due to the structure of a dataset rather than the variables under study. When genomic studies consider the population as a homogeneous whole, they may find that a mutation is associated with a disease—only to realize that the mutation is not pathogenic, but rather, is frequent in an ethnic or geographic subgroup that is commonly affected by the disease. Finally, exploratory analyses of big data are prone to the discovery of false effects. With each test run on a dataset to search for an effect, you run the risk of happening upon a false positive—an effect that was observed by chance, and that with continued observation would fade away. Together, these analytical traps prevent the size of a dataset from guaranteeing the validity of results. The formulation of robust hypotheses remains important for guiding analyses towards logically defensible conclusions.

My intention here is not to pop the big data bubble. Rather, by pointing out and patching a few common analytical holes, I hope to keep the data from being deflated by the disappointment of false conclusions. This intellectual check-up is essential now more than ever as scientists amass data, data everywhere with not enough brains to think. Increasingly, the burden of interpreting data is being shifted from experts wielding supercomputers to curious data consumers squinting at hastily written articles on their smartphones. Fortunately, even if you cringe at the sight of a spreadsheet, there are a few basic logical principles that you can quickly and easily apply before you buy into the latest trend reports. 

Correlations Ahead: Proceed with Caution and Hope

Consider the finding that a part of the brain called the amygdala becomes especially active when people lying down in a functional magnetic resonance imaging (fMRI) scanner view images of Mitt Romney (6, 7). An exciting find, no? Well, not necessarily—as pretty as pictures of the brain in action may appear (Figure 2), if you had never heard of the amygdala, then this association of a task condition with a neural activation does not mean much. If you are indeed familiar with the neuroscience literature on the amygdala, then this result means marginally more. In the context of several studies that show heightened amygdala responses when people view stimuli that make them anxious, you may suspect that Romney evokes feelings of anxiety.

Figure 2: Activity of the brain is visualized through the cross-section scans that MRI reveals (shown above), and the activation of specific regions is mapped onto these sections of the brain, resulting in functional MRI, or fMRI. Image courtesy of Wikimedia Commons.

Yet, if you take the time to survey the literature in toto, you will be left scratching your head. Based on other studies linking the amygdala to emotions that are strongly negative as well as strongly positive, you could instead conclude that Romney is an emotionally polarizing candidate—just as plausible, especially if the subjects were of both political parties. Or, pointing to studies that show the amygdala is sensitive to novel stimuli, you may theorize that Romney is the new candidate on the block—another likely theory, considering that the study was conducted well before Romney became a household name during the 2012 presidential election. You may even theorize, based on studies that show the amygdala is more active for faces than inanimate stimuli, that Romney possesses a face—who knew!

Because any one brain region can be involved in many different tasks, to draw a conclusion about the mind from activations observed in the brain is to make an unsubstantiated “reverse” inference. Unfortunately, the authors who conducted this neuroimaging study of politics and the brain made no analytical reservations when presenting their results to the public. In a now infamous New York Times article (7), they explain that the “voter anxiety” indicated by amygdala activation in response to images of Romney was attenuated when subjects viewed and listened to videos of the candidate. They reason, “Perhaps voters will become more comfortable with Mr. Romney as they see more of him.” Huh? Three days after the article hit the stands, a team of neuroscientists co-authored an exacting letter to the editor that criticizes the use of “flawed reasoning to draw unfounded conclusions about topics as important as the presidential election” (8). The truth hurts.

Should neuroimaging researchers, doomed to the logical limbo of correlative results, turn in their fancy fMRI machines? Fortunately, big data offer promise for more meaningful interpretations of correlative findings. The beauty of big data is that, rather than reducing a system to a few parts that are amenable to study, it enables scientists to consider a system as a whole. Neuroscientists should find that a holistic approach to the brain opens the door to remarkable new findings. This is because the brain is a hierarchical system, built up from molecules to neurons to regions that share a common function to our everyday experiences and behaviors. Mysteriously, the properties that emerge on higher levels—our capacities for writing poetry, for feeling moral outrage, for falling in love, for cracking jokes, for being aware of our thoughts—cannot be predicted from the way that the lower levels function. This is why the brain continues to baffle neuroscientists who have heretofore focused their efforts on a few cells or groups of cells at a time.

Thus, a central goal of the government’s $100 million BRAIN initiative is to develop tools for recording simultaneously from many neurons throughout the entire brain (9). The hope is that, when we are able to see how the networks for motion and sensation and language and emotion and logic and memory interact in real time, we may gain a mechanistic view of the way that poetry and morality and love and humor and consciousness and more happen in the brain. Someday, when our understanding of the brain leaves less to the imagination, we may even be able to breach the problem of reverse inference and make predictions about voting results from patterns of neural activity.

Vanishing Acts and Magically Materializing Effects

Common sense is a mighty tool. If wielded properly, it may defeat the biggest of data and the most mystical of mental powers. To put our common sense to the test, let us go back in time to 1940, when two psychologists published the first meta-analysis of all studies within a research domain (10). This comprehensive analysis sought to settle a cut-and-dried controversy in the field—some studies reported the existence of an effect, while others did not.

When the psychologists gathered together all 145 studies published on the effect between 1880 and 1939, including nearly 80,000 subjects and 5 million trials, the results were overwhelmingly positive. Joseph Rhine and his student Joseph Pratt had found solid evidence in favor of extrasensory perception (ESP)—the ability to use something other than logic and the known senses to predict events. No fewer than 106 studies yielded results in support of ESP. When the trials were grouped by the probability of successfully predicting an event by chance, subjects beat the odds at every level (Figure 3A).

Figure 3: The results of ESP experiments from 1880 to 1939 broken down by time period and the probability of success. Probability is the probability of making a successful prediction by chance. Reports is the number of studies, and trials is the number of trials summed across all reports. Deviation is the number of successful predictions minus the number expected by chance (the number of trials times the probability of success). Critical ratio is the deviation divided by the standard deviation, where the standard deviation is the square root of the number of trials times the probability of success times the probability of failure. Adapted from results published in Extra-Sensory Perception After Sixty Years (10).
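
The arithmetic behind the critical ratio can be sketched in a few lines of Python. This is only an illustration: the function name and the tally of 240 hits in 1,000 Zener-card guesses are hypothetical and are not taken from Rhine and Pratt. The deviation is computed here as the number of hits minus the number of hits expected by chance.

```python
import math

def critical_ratio(hits, trials, p_chance):
    """Return (deviation, critical ratio) for a card-guessing tally."""
    expected = trials * p_chance                             # hits expected by chance
    deviation = hits - expected                              # surplus (or deficit) of hits
    std_dev = math.sqrt(trials * p_chance * (1 - p_chance))  # binomial standard deviation
    return deviation, deviation / std_dev

# Hypothetical tally, not from the 1940 report: 1,000 guesses at Zener
# cards (chance of success = 1/5) with 240 correct predictions.
deviation, ratio = critical_ratio(hits=240, trials=1000, p_chance=1 / 5)
print(f"deviation = {deviation:.1f}, critical ratio = {ratio:.2f}")
```

A critical ratio near zero means the subjects did about as well as chance predicts. The larger the ratio, the harder the tally is to attribute to luck alone, though it says nothing about whether bias in the procedure produced the surplus.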

Yet, Rhine and Pratt noticed a curious trend in the data. When they pulled out the most recent studies published between 1934 and 1939, many of their positive results diminished (Figure 3C). Taking a step back, we should take note that most of the studies included in the meta-analysis had invited subjects to guess an attribute of the next card in a series. Researchers in the earlier years tended to use a standard deck of playing cards, asking subjects to predict the color, suit, rank, or exact identity of the card to come. Thus, the most well tested probabilities in the first time period are 1/2, 1/4, 1/13, and 1/52 (Figure 3B). Intriguingly, it is those same probabilities that experienced a decline in their critical ratio—a measure of just how unusual it was for correct predictions to occur—when compared to studies from the later time period. Meanwhile, in the later group, the 1/5 probability spiked in both popularity and the positive value of its critical ratio. This dramatic increase tracks with the rising use of Zener cards to test ESP. There are five Zener card designs—a circle, a plus sign, a square, a star, and a set of three wavy lines—and as before, subjects guessed the design on the next card in the deck.

Why is it that the most popular testing procedure for a given time period produced the most positive results? Unsurprisingly, there is reason to suspect that the data were prone to bias. Even though scientists at the turn of the twentieth century were making conscious efforts to study ESP by empirical methods, that these scientists were interested in the questionable phenomenon at all suggests that they believed it to be true. Deliberately or not, when they found a way to test ESP that produced positive results, they ran with it. What Rhine and Pratt do not report, and what would be most useful to know, are the details of how the card guessing studies were conducted and the data were collected. One possibility is that researchers discontinued testing after a subject hit a lucky streak with the cards, preventing the data from regressing to the mean. However, this fails to account for the switch to Zener cards in 1934. To explain that, we need look no further than the cards themselves, which contain large and simple designs printed on thin, white paper.

This example from the archives of science exemplifies two common themes in big data analysis. The first is the importance of stratification. When Rhine and Pratt considered all the data together, they saw positive effects for every probability tested. When they separated the data by time period, only a select few probabilities retained their strongly positive effects—revealing the difference in testing methods and leading us to our suspicion of bias. Similarly, in modern studies that seek to tie inherited diseases to DNA mutations, genes may be spuriously identified as pathogenic unless the population is divided by the appropriate factors.

Second, the Rhine and Pratt study points to the problem of multiple comparisons. The more cards they asked each subject to guess, the more likely that the subject would hit a lucky streak. Big datasets are especially vulnerable to the problem of multiple comparisons because they tempt researchers to test and test again until an attractive correlation catches their eyes. Likewise, genome-wide and whole-brain analyses that treat each locus or voxel as an independent test can yield false positives unless the p-value[3] threshold for significance is sufficiently low. In the sections to come, we inspect each of these deep yet frequently overlooked statistical pitfalls.

The Sly Effects of Data Structure

When Preis et al. cracked open the Google Trends data containing trillions of searches logged across the globe, they discovered a strong effect in part because they brought the right tools with them to dig (4). Having done their homework in economics and sociology, the authors had good reason to select GDP and an index for future thinking as two promising variables to include in their cross-country analysis. However, in many cutting-edge analyses of big data that are more complex or that lack a foundation in previous studies, it is not possible to formulate hypotheses ahead of the results. In those cases, the inherent structure of the data may compete with the loose analytical structure to determine the observed “effects.”

Before genes that cause an inherited disease have been identified, geneticists often begin their search with a genome-wide association study (GWAS). This type of exploratory analysis takes DNA from people with a disease and compares it to that of people without the disease, seeking point mutations that are correlated with the disease trait. The strength of GWAS is that it does not rely on assumptions about the data—any correlation between a mutation and the disease will turn up as a candidate for the genetic cause, to be investigated further by more targeted approaches.

Yet, the strength of GWAS is also a weakness. The unstructured form of the analysis leaves the demographic structure of the population free to exert undesired effects. If a certain mutation occurs in only a subset of people with a disease—as is more often than not the case for genetic conditions—then any mutation that is common in that subset will be associated with the disease (11). For instance, in an early associational study, Blum et al. reported a link between alcoholism and an allele of a particular dopamine receptor (12). The allele was present in 77% of alcoholics and absent in 72% of non-alcoholics in their unstratified experimental and control samples—rather impressive statistics for such a straightforward study. In the years since, however, attempts to replicate the findings have shown that the relationship is not so simple, pointing to ethnic heterogeneity as a confounding factor (13). While certain ethnic subgroups such as Mayans and Colombians have the dopamine receptor allele in high frequency, others such as Jews and Pygmies possess it rarely (14).
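
To see how population structure alone can manufacture an association, consider a minimal sketch with invented counts. The two subgroups, their sizes, and the allele and disease frequencies below are all hypothetical and do not come from Blum et al. or any other study cited here. Within each subgroup the allele is carried equally often by cases and controls, yet pooling the subgroups makes the allele look like a risk factor.

```python
def odds_ratio(case_carriers, case_total, control_carriers, control_total):
    """Odds of carrying the allele among cases relative to controls."""
    case_non = case_total - case_carriers
    control_non = control_total - control_carriers
    return (case_carriers * control_non) / (case_non * control_carriers)

# Hypothetical counts: (case carriers, cases, control carriers, controls)
subgroups = {
    "subgroup A": (240, 300, 560, 700),  # allele common (80%), disease common (30%)
    "subgroup B": (5, 50, 95, 950),      # allele rare (10%), disease rare (5%)
}

# Within each subgroup the allele is unrelated to the disease.
for name, counts in subgroups.items():
    print(name, odds_ratio(*counts))

# Ignoring the subgroups and pooling everyone creates an apparent effect.
pooled = tuple(sum(column) for column in zip(*subgroups.values()))
print("pooled", round(odds_ratio(*pooled), 2))
```

An odds ratio of 1 means no association. In this toy population the stratified analysis finds none, while the unstratified analysis reports odds several times higher for carriers, purely because one subgroup has more of both the allele and the disease.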

Looking ahead, what will there be for a biologist to hold onto in the vast and rapidly expanding seas of profuse genetic data and profound genetic complexity? Perhaps most important will be the collection of meta-data—that is, data about the data. Divisions in the population can be drawn along not only basic demographic lines like ethnicity and gender, but also along differences in lifestyle, personality, substance abuse, family history, medications, and health conditions other than the disease of interest. Vilhjálmsson and Nordborg recently made the case that typical methods of population stratification are insufficient, as demographics can be traced back one step further to differences in environment and genetic background (15). In order to control for the confounding variables we know of as well as those not yet identified, researchers must keep track of as many potentially relevant factors as possible.

Unfortunately, as Howe et al. note in their report on the future of biocuration, “not much of the research community is rolling up its sleeves to annotate” (16). One solution that would not only encourage thorough surveying, but also ensure the quality and consistency of data, is to create an exclusive data repository. Biologists could be permitted elite access to this repository only if they agree to submit their data along with a common battery of meta-measures. This method is now proving successful for the Brain Genomics Superstruct Project, a large-scale collection of neuroimaging, genetic, and survey data collected under a common protocol and made available to those who contribute (17).

The Promising (Yet False) Positive Effect

For scientists afloat in big data and hungry for results, it can be tempting to embark on fishing expeditions. These unguided analyses undermine one of the strengths of big data—the statistical power that comes with a large number of data points. As more and more tests are run on a dataset, there is an increasing probability that a test will come up positive by chance alone. This is the problem of multiple comparisons. Although not inherent to big data, the problem of multiple comparisons often arises when independent tests are run on sets of many data points. This is a common way of analyzing genome-wide and whole-brain data (e.g., testing every locus or every voxel for an association with a disease or a task condition). Likewise, overeager researchers may run excessive tests on a dataset, expressing a strong sense of determination but a lack of predetermined hypotheses (e.g., if Preis et al. had tested whether GDP is associated with Google searches for various celebrities, ice cream brands, musical instruments, butterfly species, and office furniture stores).
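
How quickly does the risk grow? A rough sketch follows, assuming that the tests are independent and that each uses a significance level of 0.05. Both are simplifying assumptions, and the function name is made up for this illustration.

```python
def familywise_error_rate(num_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** num_tests

ALPHA = 0.05
for num_tests in (1, 10, 100, 1000):
    rate = familywise_error_rate(num_tests, ALPHA)
    expected = ALPHA * num_tests  # average number of false positives
    print(f"{num_tests:>5} tests: P(at least one false positive) = {rate:.3f}, "
          f"expected false positives = {expected:.1f}")
```

Under these assumptions, ten tests already carry about a 40% chance of at least one spurious hit, and by a hundred tests a false positive is all but guaranteed.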

The strange effects that can turn up after running many tests can be rather unbelievable, challenging the crude probabilistic expectations that our brains use to anticipate how the world should work. For this reason, we need to be the most skeptical of the most exciting results. Looking back to the data from Rhine and Pratt (10), we see that the smallest probability they tested was 1/100. While none of the subjects in their meta-analysis succeeded in beating those odds, someone out there most certainly could. Think about it. If tested on cards numbered one through 100, a subject guessing 33 every time would guess correctly exactly once per pass through the deck. If you shuffled the deck and tested subject after subject, it would become increasingly likely to find a “clairvoyant” who guesses correctly on the first try.
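
The hunt for a lucky guesser can be put in numbers. The sketch below assumes each subject makes one independent guess with a 1/100 chance of success, and asks how many subjects must be tested before a first-try hit becomes likely. The function name and the target probabilities are illustrative choices, not figures from Rhine and Pratt.

```python
import math

def subjects_needed(p_hit, target=0.5):
    """Smallest number of independent subjects for which the chance of
    at least one first-try hit reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_hit))

p_hit = 1 / 100  # one card in a deck numbered 1 through 100
print("subjects for an even chance of a hit:", subjects_needed(p_hit))
print("subjects for a 95% chance of a hit:", subjects_needed(p_hit, target=0.95))
```

The calculation solves 1 - (1 - p)^n >= target for n. A few dozen subjects are enough for a coin-flip chance of finding a "clairvoyant," and a few hundred make one nearly certain, with no mental powers required.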

To one group of neuroscientists, the term “fishing expedition” is more than a metaphor. The team of Bennett et al. inserted a post-mortem Atlantic salmon into an fMRI scanner to serve as biological filler material while they tested their protocol, a series of images of social interactions (18). Remarkably, when they analyzed the data for kicks and giggles, the researchers found evidence of activity in several voxels of the dead fish brain. Bennett’s response? “[I]f I were a ridiculous researcher, I’d say, ‘A dead salmon perceiving humans can tell their emotional state’” (19). Of course, Bennett is not ridiculous—despite his dry sense of humor and his proclivity for scanning unusual objects—and so he points to the number of voxels in the human brain as the problem. With 130,000 voxels and independent tests run on all of them, the problem of multiple comparisons is not one that can be overlooked.

Unfortunately, when the subject is a human instead of a dead fish, it is trickier to tell whether an activation corresponds to neural activity evoked by a task or to a statistical fluke. A quick fix for this problem is to run a correction for multiple comparisons. The popular Bonferroni correction is an adjustment to the p-value threshold for significance that takes into account the number of voxels in the brain that are checked for activation: the threshold is divided by the number of tests. Of course, as the threshold is lowered by this correction, an even larger sample may be required to obtain significant effects. For a neuroimaging study, this would mean that the large number of voxels tested in the brain requires a larger number of subjects in a study.
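
A toy simulation shows what the correction buys. It rests on a loud assumption: with no real activity anywhere (a dead salmon, say), each voxel's p-value is treated as an independent random draw between 0 and 1. Real voxels are correlated with their neighbors, so this is a sketch of the logic and not a model of an actual scan. The 130,000 voxels and the 0.05 level are the figures used in this article; the function and variable names are invented.

```python
import random

ALPHA = 0.05
NUM_VOXELS = 130_000  # voxel count quoted in the text

def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold: the significance level divided by the number of tests."""
    return alpha / num_tests

# Assumption: under the null hypothesis, p-values are uniform on [0, 1).
random.seed(0)  # fixed seed so the sketch is repeatable
null_p_values = [random.random() for _ in range(NUM_VOXELS)]

threshold = bonferroni_threshold(ALPHA, NUM_VOXELS)
uncorrected_hits = sum(p < ALPHA for p in null_p_values)
corrected_hits = sum(p < threshold for p in null_p_values)

print(f"corrected threshold: {threshold:.2e}")
print(f"'active' voxels at p < {ALPHA}: {uncorrected_hits}")
print(f"'active' voxels after correction: {corrected_hits}")
```

Without correction, roughly 5% of the voxels, on the order of 6,500, pass the test by chance alone. The corrected threshold is so strict that a chance survivor becomes rare, which is exactly why a genuine effect then needs more data to clear the bar.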

It is also critical to note that the Bonferroni correction does not address the more fundamental problem with running many tests on a big dataset. Exploratory analyses that seek correlative results are not true applications of the scientific method—of manipulating one variable to measure the effect in another. As such, they cannot inform causal models for how the world works. Neuroimaging studies can turn up interesting results—as you may recall, the finding that our amygdalae go wild for Romney (6, 7)—but there is no straightforward way to turn isolated correlations between brain activity and task conditions into reliable models of brain systems.

In order to use big data wisely, researchers must follow the lead of Preis et al. and do their homework before beginning an analysis (4). When it is finally time to dive into the data, it is important to have an idea of what to expect and how to interpret the results. As quoted in a Nature Methods report, Sean Carroll argues, “hypotheses aren’t simply useful tools in some potentially outmoded vision of science; they are the whole point. Theory is understanding, and understanding our world is what science is all about” (20). Amidst the mesmerizing quantities of data, scientists must not lose sight of this mission.

The Next Big Step

Humans are fascinated by size. Indeed, scientists often go to extraordinary lengths to unearth the heaviest, the tallest, the longest, the largest creatures of their kind. Perhaps that is the only explanation for the excitement in the news when the world’s biggest organism was discovered to be a fungus living in the Blue Mountains of eastern Oregon. Yet, after I had processed the image of a continuous network of spindly roots extending through 2,200 acres of topsoil, I found this humungous fungus to be of limited interest. It is the same everywhere, and it is going nowhere.

Like a fungus feeding off the forest, big data is no more than a big burden if it sits unanalyzed on our hard drives. Although businesspeople pitch data as if it were the latest hot commodity—the “new oil,” according to data scientist and company founder Clive Humby—data is not a resource with intrinsic value. Like soil, data acquires value only after it has been cultivated into knowledge. And it is this potential that makes big data so thrilling. With a few statistical filters and a reasonable hypothesis, new insights can be ours for the taking. The possibilities are nearly endless and growing, as big data begets bigger data. As noted earlier, the solution to several big data problems is even more data—specifically, meta-data for annotating genetic information and larger sample sizes for overcoming corrections to the p-value.

Going forward, scientists can maintain a high level of rigor in their big data analyses by investing in data of high quality and of anticipated utility for informing future hypotheses. Witness the Human Genome Project, a massive effort to uncover the raw information and experimental methods that now guide the modern generation of experiments in biology. The ability to decode an individual’s genome is the basis for GWAS, the first step in studying the genetic basis of a disease. After candidate genes are identified, researchers can then begin the classically empirical work of manipulating genes and proteins to deduce the biological mechanisms of disease. As Massachusetts General Hospital now embarks on the Human Connectome Project, a similar endeavor that will map the connections in the human brain using advanced diffusion imaging techniques, it is time to start thinking about how this map can guide future studies of the neurobiology of human experience.

Awe-inspiring though the size of their data may be, scientists must not let their eyes glaze over. As big data continues to grow, now is the time to run with it—to run logically sound analyses, that is. With big data come the big responsibilities to create knowledge that is as close to the truth as can be managed and to wield that knowledge wisely.


  1. Z. Gan, A. Cao, R.A. Evans, M. Gu, Three-dimensional deep sub-diffraction optical beam lithography with 9 nm feature size. Nature Communications 4, 1-7 (2013).
  2. B. McKenna, What does a petabyte look like? Computer Weekly (2013).
  3. G.M. Church, Y. Gao, S. Kosuri, Next-generation digital information storage in DNA. Science 337, 1628 (2012).
  4. T. Preis et al., Quantifying the advantage of looking forward. Scientific Reports 2, 350 (2012).
  5. J.F. Gantz et al., The diverse and exploding digital universe: an updated forecast of worldwide information growth through 2011. IDC White Paper (2008).
  6. J.T. Kaplan, J. Freedman, M. Iacoboni, Us versus them: Political attitudes and party affiliation influence neural response to faces of presidential candidates. Neuropsychologia 45, 55-64 (2007).
  7. M. Iacoboni et al., This is your brain on politics. The New York Times (2007).
  8. A. Aron et al., Politics and the brain. The New York Times (2007).
  9. C. Bargmann et al., Interim report: Brain research through advancing innovative neurotechnologies (BRAIN) working group. Advisory Committee to the NIH Director (2013).
  10. J.B. Rhine, J.G. Pratt, Extra-Sensory Perception After Sixty Years (Bruce Humphries Publishers, Boston, 1940).
  11. J. McClellan, M.C. King, Genetic heterogeneity in human disease. Cell 141, 210-217 (2010).
  12. K. Blum et al., Allelic association of human dopamine D2 receptor gene in alcoholism. Journal of the American Medical Association 263, 2055-2060 (1990).
  13. C.N. Pato et al., Review of the putative association of dopamine D2 receptor and alcoholism: a meta-analysis. American Journal of Medical Genetics 48, 78-82 (1993).
  14. C.L. Barr, K.K. Kidd, Population frequencies of the A1 allele at the dopamine D2 receptor locus. Biological Psychiatry 34, 204-209 (1993).
  15. B.J. Vilhjálmsson, M. Nordborg, The nature of confounding in genome-wide association studies. Nature Reviews Genetics 14, 1-2 (2013).
  16. D. Howe et al., The future of biocuration. Nature 455, 47-50 (2008).
  17. R.L. Buckner et al., The brain genomics superstruct project. Society for Neuroscience 510.15 (2011).
  18. C.M. Bennett et al., Neural correlates of interspecies perspective taking in the post-mortem atlantic salmon: An argument for proper multiple comparisons correction. Journal of Serendipitous and Unexpected Results 1, 1-5 (2010).
  19. A. Madrigal, Scanning dead salmon in fMRI machine highlights risk of red herrings. Wired (2009).
  20. Defining the scientific method. Nature Methods 6, 237 (2009).

[1] In statistics, “power” is the probability of rejecting the null hypothesis (i.e., the hypothesis that an effect does not exist) when an effect does in fact exist. Power depends on the criteria for statistical significance, the size of an effect, and the size of the sample. Big data lend large sample sizes, but as we will see, require more stringent criteria for significance in order to avoid the discovery of false positives.

[2] The type of “effect” I refer to is the relationship between variables (e.g., A and B). This relationship may be due to one variable causing a change in another (e.g., A changing B or B changing A), a third variable causing a change in both, (e.g., C changing A and B), or more complex interactions and chains of causality.

[3] The “p-value” is the probability of obtaining a result at least as extreme as the result observed, assuming that the null hypothesis is true. The null hypothesis can be rejected only if the p-value is smaller than a chosen significance level. For example, for a significance level of 0.05, if the null hypothesis is that Harvard students have average intelligence, then we can reject the null hypothesis in favor of above-average intelligence only if the mean IQ of a sample of students falls within the top 5% of the sample means that would be expected if the null hypothesis were true.
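
The IQ example can be worked through as a one-sided z-test. Everything numerical here is an assumption made for illustration: a population mean of 100 and standard deviation of 15 (the usual IQ scaling), and an invented sample of 25 students with a mean of 106.

```python
import math

def one_sided_p_value(sample_mean, null_mean, population_sd, n):
    """Probability of a sample mean at least this high if the null is true."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - null_mean) / standard_error
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the standard normal

# Hypothetical sample: 25 students with a mean IQ of 106.
p = one_sided_p_value(sample_mean=106, null_mean=100, population_sd=15, n=25)
print(f"p = {p:.4f}; reject the null at the 0.05 level? {p < 0.05}")
```

With these made-up numbers the sample mean sits two standard errors above 100, which puts it inside the top 5% of sample means expected under the null hypothesis, so the null would be rejected.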

Categories: Fall 2013
