Discussion of Article:
Oct. 8, 2016
Clash of Paradigms: Statistics and Big Data
Ronald LaPorte, Ph.D.
Crude but Interesting Analyses
Only a few weeks ago, we thought that statistics and big data analytics were one and the same. A common definition of statistics is:
“the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample”. (Wikipedia)
A Wiki definition of big data is: “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions”
These definitions do seem closely entwined. It seemed to us that the two fields work hand in hand: evaluating hypotheses, handling large quantities of data, and using classical statistics to reveal patterns in internet-scale data. This would suggest a very high degree of overlap between these two concepts, as seen in the figure above.
On a lark we decided to look at the overlap between statistics and big data. We were very surprised. Much of our work over the past few years has been to collect large quantities of PowerPoint presentations and share these for free through the Library of Alexandria. Using Google advanced search, we found 218,000 .ppt sites for statistics and 5,040 for big data. When we combined these, only 815 discussed both: 0.4% of all statistics lectures include big data, and only 16% of big data lectures include statistics. The degree of overlap was very low. Moreover, most of these sites were not directly involved with data analysis, but contained only a brief mention of the other field.
We then did a further set of analyses using Google advanced search on all sites. A search on “statistics” yielded 1,520,000,000 sites, and “big data” yielded 70,000 sites; the overlap was 4,750, which represents a 0.5% overlap with statistics and a 7% overlap with big data, again amazingly low.
To examine this from another direction, we used Google Scholar to look for statistics, big data, and statistics/big data articles. There were 5,590,000 articles for the “statistics” search and 184,000 for big data. The combined search yielded 47,300, a 0.8% overlap with statistics and a 26% overlap with big data. We did notice again that most of the “overlap” articles were not data articles but mentions of the other field.
We further did a rough analysis of the first 50 Google Scholar results. We found that only about 20% of the overlap actually had data; the rest were statistics courses, conferences, review papers, and brief mentions of the other field. Therefore the degree of overlap in terms of data analysis is indeed far less than the projections above. A better representation of the fields is below, where there is only a minuscule commonality.
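The overlap percentages quoted above are simple ratios of search hit counts. A minimal Python sketch, plugging in the Google Scholar counts reported above, reproduces the 0.8% and 26% figures:

```python
def overlap_pct(hits_a, hits_b, hits_both):
    """Percent of each field's search hits that also mention the other field."""
    return 100 * hits_both / hits_a, 100 * hits_both / hits_b

# Google Scholar counts quoted above: "statistics", "big data", and both combined
stats_side, bigdata_side = overlap_pct(5_590_000, 184_000, 47_300)
print(stats_side, bigdata_side)  # ≈ 0.8 and ≈ 26
```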
We recognize that this is a crude set of analyses, but we can at least say that there is little overlap between classical statistics and big data, potentially indicative of separate fields. Our analyses thus reveal little linkage between these two data analysis fields.
There may need to be greater interaction between those in classical statistics and those developing big data. We can learn much from each other.
Alternatively, we should consider what the Beatles had to say: "You say yes, I say no / You say stop and I say go go go, oh no / You say goodbye and I say hello / Hello hello." Perhaps the two paradigms should "go it alone."
We would value your thoughts about how we can bring these together, and what the factors are that keep us apart.
Ron
Comments by:
Gary Clark
Hetan Shah
Kan Yee
Stergios Tzortzios
Rich Davies
Patty Becker
Sam Lanfranco
Feng Liang
John Mather
Bill Goodman
John Peterson
Roger Hoerl
||||||||||||||||||||||||||||||
Ron, I agree that not much attention has been paid to classical statistical methods when big data sets are analyzed and results are presented. There was certainly an awareness of the big data problem (<100 patients with 10,000+ variables) as far back as the early 1990s [Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995; 57:289-300]; however, it required a decade or so before this technique was frequently referenced in big data publications. Similarly, the problem of variable selection from many variables was also addressed in the 1990s [Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med 1997; 16:385-395], with several attempts to generalize the technique over the years [Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Statist Soc 2011; 73 (Part 3):273-292].
Very few of my statistical colleagues (myself included) feel comfortable designing big data studies or analyzing data from these studies. Instead, we tend to refer researchers to individuals/groups that specialize in bioinformatics. There are now programs that offer advanced degrees in data analytics with a focus on big data (see, for example, Penn State’s program at http://www.worldcampus.psu. A conference like you suggest would probably be quite popular. It should include individuals with expertise in several areas, including people who set up and run the assays that produce the multidimensional data, people who pre-process the data (e.g., hybridization normalization, central tendency normalization, calibration), and people who build models (using data reduction / variable selection techniques) and validate their biological and clinical utility. Gary Clark Sep 26, 2016 |
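Gary Clark's reference to Benjamini and Hochberg's false discovery rate control is worth unpacking for readers new to the method. The following is a minimal sketch of the BH step-up procedure (an illustration written for this discussion, not code from the cited paper):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected,
    controlling the false discovery rate at level q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k / m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose rank is <= k_max
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Four p-values tested jointly at FDR level 0.05
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))  # [True, True, True, False]
```

Note that the third p-value (0.03) is rejected even though a Bonferroni cut-off (0.05/4 = 0.0125) would retain it; this is the extra power that makes FDR control attractive for the many-variables settings Clark describes.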
||||||||||||||||||||||||||||||
Dear Ronald, Thanks for copying us into this. As you might expect, discussions about big data are rife amongst statistical associations at the moment, so I’m sure you are onto something. Best, Hetan Shah Sep 26, 2016 |
||||||||||||||||||||||||||||||
Dear Ron, Thank you for soliciting my opinion. I have to confess first that I know little about big data (though I have certainly heard a lot about it), as this is not my day job. We only deal with ‘small data’, in contrast to big data. If your big data problem is related to the statistical books collection for the Library of Alexandria, then the first question is what the objective of the research is and what you want to achieve. The quote at the end of your first email: "Big data analytics is the process of examining large datasets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information" Is this your objective? If yes, then the relatively modern discipline of data mining seems to be the right approach. Do we need classical statistical methods? It depends. Many classical statistical methods revolve around hypothesis testing, mostly suited to a modest amount of data. When you have a lot of hypotheses to test, then control of the false discovery rate, as mentioned by Gary Clark, may be relevant. Are there any hypothesis tests you have in mind? So it is really about finding the right tool to address your question (and about what the question is); it does not matter whether it is classical, Bayesian, or other. Dick De Veaux of Williams College is one of the experts in data mining. I think I have exhausted my knowledge here. I hope you get some practical advice from others. Good luck with your next endeavor and kind regards, Kan Yee Sep 26, 2016 |
||||||||||||||||||||||||||||||
Dear Ron, I am replying to your message (enclosed) about the idea of including a good treatment of Big Data in the “Big Euclid”. My instinctive reaction is very enthusiastic, as I am sure it would remarkably enrich the scientific authority of the Euclid Research Methods Library of Alexandria. My belief in this is based on what I have been saying for many years about the importance of huge amounts of data for the most reliable and holistic approach to biological research. I have always maintained that any single experiment studying the effect of one, two, or three factors on the output of a biological process yields only indicative information about the phenomenon as a whole, for it addresses a very small part of a multifactor matter, as any biological phenomenon is. This conviction probably comes from my experience during my PhD work at the University of London, in collaboration with the statistics department and the Computer Centre of London. There I was given the chance to work on the huge amounts of data of the “British National Beef Cattle Recording Scheme” of the MLC (Meat and Livestock Commission), in their task to replace and reorganize the existing database with a new management system, the SIR/DBMS (Scientific Information Retrieval/Database Management System). That was a very good opportunity to see what the most complete, integrated information in that huge database could be (with a wealth of genetic, productive, and reproductive data covering every breed in the UK). At the same time, we were lucky to witness the arrival of the new personal computers (PCs) with the most respected statistical packages of the day (e.g., SAS, SPSS), which gave us the chance to probe the limits of computers and their statistical packages in solving multifactor research problems.
The result of our whole effort was the satisfaction that we (the University, the MLC, and a special Division of the E.U.) felt when we obtained very interesting and significant results on multifactor effects, together with estimates of certain genetic parameters of practical interest. At the same time, we were able to test the limits of the computers and statistical packages in organizing, manipulating, and analyzing huge amounts of data. Unfortunately, after returning to Greece I was not lucky enough to have similarly rich research material with which to continue that work. However, a year ago at an international conference I met a professor from the University of Florida, USA, who gave a talk on “Data Mining and Optimization Issues in the Food Industry”, in which he discussed the importance of having big data and possible ways of using it properly. When I received your message I remembered him and wrote to ask for his talk and to discuss the matter further; I am awaiting his reply. In conclusion, I agree with your idea, which I consider a very significant contribution to the scientific authority of the “Big Euclid”. I also think very highly of your idea of organizing a special conference on the importance and use of big data as the unique source for the best integrated and holistic approach to biological research, for it can offer various very interesting capabilities, e.g., variable selection (within desired limits), correlations, interactions, specific relations, etc., resulting in many multidimensional (multipurpose) analyses. Based on the experience I described above, I could undertake a presentation of this kind, as a sample of the larger job. I can also invite other specialists (such as the colleague I mentioned) to present their work on this subject.
I think I have given you the thoughts you wanted from me. With best regards, Stergios Tzortzios Sep 30, 2016 |
||||||||||||||||||||||||||||||
I have the same views as Gary, and I agree with you that statisticians have to play a big part going forward in how we tackle big data. I think most (older) statisticians like me were trained primarily in how to model data and make inferences from controlled experiments, and, for statisticians in the pharma industry, from controlled randomized experiments. As a result, many of us are generally uncomfortable with data collected from observational studies due to the unknown confounders that likely exist. And big data is sort of like observational studies on steroids. So I think it comes down to needing more statisticians trained in how to best make sense of such data, and to come up with methods and approaches that instill more discipline in the process such that inferences can be made that are reproducible. I see that newer graduates in statistics are starting to take classes in big data, but it’s still early days on that front, so it will likely take some time before we see appreciable gains in the capabilities of classically trained statisticians in the handling of big data. I think the key is for departmental chairs at the universities to expand their programs to include more and more classes on big data. Rich Davies Oct 1, 2016 |
||||||||||||||||||||||||||||||
There are two meanings to the word statistics: (1) mathematical processes, and (2) a synonym for data. These findings do not surprise me. Statisticians don't use big data very much because (1) the concept is still pretty new, and (2) big data have no real statistical basis: there is no consistent source, no clearly defined denominator (or universe) for the numerator (the data themselves). Big data users don't use statistics, as in mathematical processes, because many don't know how, and because these concepts really don't apply to the amorphous mass referred to as big data. |
||||||||||||||||||||||||||||||
Ron, I am traveling and have not had time to give much thought to this, but I would like to put a simple idea or two on the table suggesting that classical statistics and big data are in fact complementary approaches to knowledge, and that what is missing, at this early point in time, is how the two become linked in a synergistic way. Here are my simple thoughts. They are short, so I am cc:ing this to all. 1. Big data is primarily data mining. It starts with: let's develop algorithms that rake through (mine) massive amounts of data looking for patterns of interest. This can be mining Google searches to find out what user X is interested in (to position ads), or epidemic data to identify patterns that lead to causes or treatment possibilities. 2. Classical statistics has elements of this (what do the mode, mean and midpoint, or the coefficient of variation, tell us), but it is also the cornerstone of hypothesis testing and "scientific knowledge" (i.e., that which survived the hypothesis test). To keep it overly simple, but as a good starting point: Big Data is looking for patterns, and Classical Statistics is testing for patterns. We then, in both cases, use the results to make inferences about further mining, further testing, or policies and implementation based on the knowledge teased out of the data. For me the way forward is how we work with both sets of techniques to broaden and deepen our knowledge, so we have a better idea of what we should be doing when we wake up in the morning and go back to our work, be that research or implementation based on that knowledge. (My 2 cents worth! :-\ ) Sam Lanfranco 7 Oct 2016
Ron, Big data and statistics may well continue to evolve by themselves in part, but science, devoted to testing, is also encountering big data sets. One way to start a dialogue here is as follows, and it would be aided by those with a foot, or knowledge, in both camps. Start with two simple questions: 1. Ask big data people what it is about classical statistics that they see as useless in their work, where they see useful tools, and where they see things that are missing. 2. Ask classical statistics people what it is about big data that they see as useless in their work, where they see useful tools, and where they see things that are missing. That discussion should generate some clarity about who does what, what is useful, what is missing, and some starting points for collaboration and "hybrid vigor" in our uses of data. Sam Lanfranco 7 Oct 2016 |
||||||||||||||||||||||||||||||
Dear Ron, It seems your conclusion is drawn from the frequency of the words "statistics" and "big data," which, in my view, cannot support your conclusion that the two areas have little overlap. Instead, you need to compare a list of keywords used commonly in the two areas, e.g., "prediction", "test", "cross-validation", etc. This is related to the use of topic models in text mining. For example, in news related to "Finance" you probably won't see the word "finance" appearing that much, but rather "equity", "World Bank", "company", "IPO". Such keywords can be produced automatically by topic-modelling code/packages. Ideally, you should use a topic model to analyze the two corpora of text/slides/articles, and then compare their similarity. Best, Feng Liang Oct 7, 2016 |
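Feng Liang's suggestion, comparing shared vocabularies rather than counting co-occurrences of the field names themselves, can be sketched very simply. The toy documents below are made up for illustration; a real analysis would fit a topic model over the two corpora as he suggests:

```python
def keyword_overlap(corpus_a, corpus_b):
    """Jaccard similarity between the vocabularies of two document collections."""
    def vocab(docs):
        return {word for doc in docs for word in doc.lower().split()}
    a, b = vocab(corpus_a), vocab(corpus_b)
    return len(a & b) / len(a | b)

# Hypothetical lecture titles from each field (illustrative only)
stats_docs = ["hypothesis test prediction cross-validation"]
bigdata_docs = ["prediction cross-validation clustering scalability"]
print(keyword_overlap(stats_docs, bigdata_docs))  # 2 shared words out of 6 total
```

Even in this toy case the two corpora share methodological keywords ("prediction", "cross-validation") while neither ever uses the other's field name, which is exactly the confound Liang identifies in the Google counts.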
||||||||||||||||||||||||||||||
Ron, NASA’s use of big data is primarily to observe things and make the archives available. Our NASA astronomy archives are relatively small owing to the difficulty of getting back large data volumes from space observatories. Our Earth observing missions (and the missions we build for NOAA) produce far more data. I’m not surprised at the small overlap of statistics and big data papers. The Big Data issues are more about discovering patterns, and the statistics papers are more about determining how significant they are. John Mather Oct 7, 2016 |
||||||||||||||||||||||||||||||
Dear Dr. LaPorte, I was fascinated by the Google-based analyses that you shared. But something about the method seemed to invite a possible fallacy (or at least, room for confounding). It took me a while to pin down what it was. The problem became clear from this search pair, which behaved in a similar way to what you described: On Google Scholar, “Physics” gets about 5.4 million hits; “auto mechanics” gets about 16,000 hits; but “Physics AND “auto mechanics”” gets only 3,000 hits, implying only a very small overlap. But does it follow that there “is very little linkage” between these two fields? Surely forces, torque, acceleration, and friction are just a few very key areas of overlap between auto mechanics and physics. Indeed, the authors of a paper “Auto mechanics in the physics lab: Science education for all”, John W. Tillotson and Paula Kluth, lament that this overlap is not used explicitly to bridge the stereotypical gap between “academic”- and “trades”-oriented streams in high school curricula. What this suggests is that the absence of Google overlaps tells a lot about the “silos” from which different writers think about and contribute on certain subjects; but it is not evidence of there being no inherent relation between the subjects. It may just need champions to get out the message. As this relates to your letter, may I recommend that you clarify your objective in asking your question. You have certainly shown that, as a historical fact, not many people write about these (related) topics outside the ‘silo’ or framework of their own particular interest. But if you are asking whether there is an inherent relation between the two topics “statistics” and “big data”, which people should be keeping in mind, the Google data neither confirm nor refute a connection. Does this make sense? Bill Goodman 7 Oct 2016 |
||||||||||||||||||||||||||||||
Dear Ron, I believe there are probably several types of differences between “Statistics” and “Big Data”. In many cases, “Big Data” deals with data from an entire population (e.g., all users of Google), while classical statistics typically deals with a representative sample (subset) of the population. This can lead to differences in key areas such as experimental design and statistical inference. With experimental design, we carefully plan how we are going to sample our data (to make inferences about our population). With statistical inference, we often use special models involving random variables and probability concepts so that we can compute the probability of making an incorrect decision. Many practitioners of “Big Data”, however, use clever algorithms to find structure (e.g., trends, clusters, etc.) in the data. Since many such practitioners of “Big Data” are working with the entire population, they seem less concerned about the classical statistical issues of experimental design and statistical inference. However, with changing technology, more and more situations are arising where scientific experiments (e.g., genetic experiments, high-dimensional measurements made with spectroscopy devices, etc.) are themselves generating “Big Data”. It is my belief that these particular “Big Data” situations still require good experimental design and statistical inference.
You might find the review paper “Statistical Modeling: The Two Cultures” by Leo Breiman (with discussions from several leading statisticians) interesting, as it touches on the subtle but important issue of classical statistical modeling vs. the algorithmic approaches now popular in “Big Data” practice. See http://projecteuclid.org/eucli John Peterson Oct 7, 2016 |
||||||||||||||||||||||||||||||
Ron: I agree that this is a critical conversation for our profession at this time, and I do have some thoughts on this issue. Enclosed are three publications about integrating statistical principles into Big Data Analytics/Data Science. The "Missing Link" paper is now published online (http://onlinelibrary.wiley.co In summary, my view is that both data science/Big Data and statistics have important roles to play in the field of analytics, but the current "competition" between the groups, with each claiming universal superiority, is hurting both professions. Chemists and chemical engineers have learned to live together despite significant overlap in technical backgrounds, and I think we can as well. Best wishes, Roger Hoerl 8 Oct 2016 |
||||||||||||||||||||||||||||||
As humans, it is well documented that our thinking embraces the physical world that we perceive through our senses (expanded through technology), as well as the conceptual world associated with language, culture, and our self-awareness of some of the mental paradigms we use (expanded through IT). In exploring Ron LaPorte’s question about possible distinctions between “classic statistics” and “big analytics”, my interpretation is that before the exponential development of IT in the last 50 years, looking at the physical world as constituted by “a collection of parts” made sense with “classic statistics”. Today, with the “Internet of Things” and the “growing interconnections between things and humans”, the resulting exploding complexity gives our environment new attributes linked more to the conceptual world than to the physical world. The study of these new attributes requires “Big Analytics”; therefore I propose the distinctions below:
|
||||||||||||||||||||||||||||||
Dear Ronald, Obviously "Big Data" falls within the scope of statistics. However, the advent of big data is a rather new challenge for statistics. For big data one must distinguish "big n" from "big p". "Big n" arises in the very large studies (cohorts or registries) that are now available. It's not entirely new. The risk here is detecting spuriously significant effects because of the very large power of the tests, which are thus very sensitive to assumptions. "Big p" is even more problematic, and the main issue here is multiplicity, which is clearly a statistical issue. Best wishes, Daniel Commenges Oct 10, 2016 |
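Daniel Commenges' "big n" warning, that with enormous samples even trivially small effects become statistically significant, follows directly from the z-statistic formula z = effect / (sd / sqrt(n)). A minimal sketch (the 0.01 effect size and the sample sizes are illustrative, not drawn from any study):

```python
import math

def z_statistic(effect, sd, n):
    """z for testing a mean difference `effect` against zero,
    with known standard deviation sd and sample size n."""
    return effect / (sd / math.sqrt(n))

# The same tiny effect (1% of a standard deviation):
print(z_statistic(0.01, 1.0, 10_000))       # ≈ 1.0: far from significant
print(z_statistic(0.01, 1.0, 100_000_000))  # ≈ 100: overwhelmingly "significant"
```

The effect is identical in both cases; only the sample size changed. This is why, at registry scale, a p-value alone says little, and attention shifts to effect sizes and to the model assumptions the test is so sensitive to.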
||||||||||||||||||||||||||||||