Discussion of Article:
Oct. 8, 2016
Clash of Paradigms: Statistics and Big Data
Ronald LaPorte, Ph.D.
Crude but Interesting Analyses
Only a few weeks ago, we thought that statistics and big data analytics were one and the same. A common definition of statistics is:
“the practice or science of collecting and analyzing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample”. (Wikipedia)
A Wiki definition of big data is: “extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions”
These definitions do seem closely entwined. It seemed to us that the two fields work hand in hand: evaluating hypotheses, handling large quantities of data, and using classical statistics to reveal patterns in internet-scale data. This would suggest a very high degree of overlap between these two concepts, as seen in the figure above.
On a lark we decided to look at the overlap between statistics and big data. We were very surprised. Much of our work over the past few years has been to collect large quantities of PowerPoint presentations and share these for free through the Library of Alexandria. Using Google advanced search, we found 218,000 .ppt sites for statistics and 5,040 for big data. When we combined these, only 815 discussed both: 0.4% of all statistics lectures include big data, and only 16% of big data lectures include statistics. The degree of overlap was very low. Moreover, most of these sites were not directly involved with data analysis, but contained only a brief mention of the other field.
We then did a further set of analyses using Google advanced search on all sites. A search on “statistics” yielded 1,520,000,000 sites, and “big data” yielded 70,000 sites; the overlap was 4,750, which represents a 0.5% overlap with statistics and a 7% overlap with big data, again amazingly low.
To examine this from another direction, we used Google Scholar to look for statistics, big data, and statistics/big data articles. There were 5,590,000 articles for the “statistics” search and 184,000 for big data. The combined search yielded 47,300, a 0.8% overlap with statistics and a 26% overlap with big data. We did notice again that most of the “overlap” articles were not data articles but mentions of the other field.
We further did a rough analysis of the first 50 Google Scholar results. We found that only about 20% of the overlap actually had data; the rest were statistics courses, conferences, review papers, and brief mentions of the other field. Therefore the degree of overlap in terms of data analysis is indeed far less than the projections above. A better representation of the fields is below, where there is only a minuscule commonality.
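The overlap percentages quoted above are simple ratios of search hit counts. A minimal Python sketch, plugging in the Google Scholar counts reported above, reproduces the 0.8% and 26% figures:

```python
def overlap_pct(hits_a, hits_b, hits_both):
    """Percent of each field's search hits that also mention the other field."""
    return 100 * hits_both / hits_a, 100 * hits_both / hits_b

# Google Scholar counts quoted above: "statistics", "big data", and both combined
stats_side, bigdata_side = overlap_pct(5_590_000, 184_000, 47_300)
print(stats_side, bigdata_side)  # ≈ 0.8 and ≈ 26
```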
We recognize that this is a crude set of analyses, but we can at least say that there is little overlap between classical statistics and big data, potentially indicative of separate fields. Our analyses thus reveal little linkage between these two data analysis fields.
There may need to be greater interaction between those in classical statistics and those developing big data. We can learn much from each other.
Alternatively, we should consider what the Beatles had to say: "You say yes, I say no / You say stop and I say go go go, oh no / You say goodbye and I say hello / Hello hello." Perhaps the two paradigms should "go it alone."
We would value your thoughts about how we can bring these together, and what the factors are that keep us apart.
Ron
Comments by:
Gary Clark
Hetan Shah
Kan Yee
Stergios Tzortzios
Rich Davies
Patty Becker
Sam Lanfranco
Feng Liang
John Mather
Bill Goodman
John Peterson
Roger Hoerl
||||||||||||||||||||||||||||||
Ron, I agree that not much attention has been paid to classical statistical methods when big data sets are analyzed and results are presented. There was certainly an awareness of the big data problem (<100 patients with 10,000+ variables) as far back as the early 1990s [Benjamini Y and Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 1995; 57:289-300]; however, it required a decade or so before this technique was frequently referenced in big data publications. Similarly, the problem of variable selection from many variables was also addressed in the 1990s [Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med 1997; 16:385-395], with several attempts to generalize the technique over the years [Tibshirani R. Regression shrinkage and selection via the lasso: a retrospective. J R Statist Soc 2011; 73 (Part 3):273-292].
Very few of my statistical colleagues (myself included) feel comfortable designing big data studies or analyzing data from these studies. Instead, we tend to refer researchers to individuals/groups that specialize in bioinformatics. There are now programs that offer advanced degrees in data analytics with a focus on big data (see, for example, Penn State’s program at http://www.worldcampus.psu. A conference like you suggest would probably be quite popular. It should include individuals with expertise in several areas, including people who set up and run the assays that produce the multidimensional data, people who pre-process the data (e.g., hybridization normalization, central tendency normalization, calibration), and people who build models (using data reduction / variable selection techniques) and validate their biological and clinical utility. Gary Clark Sep 26, 2016 |
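Gary Clark's reference to Benjamini and Hochberg's false discovery rate control is worth unpacking for readers new to the method. The following is a minimal sketch of the BH step-up procedure (an illustration written for this discussion, not code from the cited paper):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans, True where the hypothesis is rejected,
    controlling the false discovery rate at level q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    # Find the largest rank k with p_(k) <= (k / m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis whose rank is <= k_max
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_max:
            reject[i] = True
    return reject

# Four p-values tested jointly at FDR level 0.05
print(benjamini_hochberg([0.01, 0.02, 0.03, 0.5]))  # [True, True, True, False]
```

Note that the third p-value (0.03) is rejected even though a Bonferroni cut-off (0.05/4 = 0.0125) would retain it; this is the extra power that makes FDR control attractive for the many-variables settings Clark describes.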
||||||||||||||||||||||||||||||
Dear Ronald, Thanks for copying us into this. As you might expect, discussions about big data are rife amongst statistical associations at the moment, so I’m sure you are onto something. Best, Hetan Shah Sep 26, 2016 |
||||||||||||||||||||||||||||||
Dear Ron, Thank you for soliciting my opinion. I have to confess first that I know little about big data (though I have certainly heard a lot about it), as this is not my day job. We only deal with ‘small data’, in contrast to big data. If your big data problem is related to the statistical books collection for the Library of Alexandria, then the first question is what the objective of the research is and what you want to achieve. The quote at the end of your first email: "Big data analytics is the process of examining large datasets to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information" Is this your objective? If yes, then the relatively modern discipline of data mining seems to be the right approach. Do we need classical statistical methods? It depends. Many classical statistical methods revolve around hypothesis testing, mostly suited to a modest amount of data. When you have a lot of hypotheses to test, then control of the false discovery rate, as mentioned by Gary Clark, may be relevant. Are there any hypothesis tests you have in mind? So it is really about finding the right tool to address your question (and about what the question is); it does not matter whether it is classical, Bayesian, or other. Dick De Veaux of Williams College is one of the experts in data mining. I think I have exhausted my knowledge here. I hope you get some practical advice from others. Good luck with your next endeavor and kind regards, Kan Yee Sep 26, 2016 |
||||||||||||||||||||||||||||||
Dear Ron, I am replying to your message (enclosed) about the idea of including a good treatment of Big Data in the “Big Euclid”. My instinctive reaction is very enthusiastic, as I am sure it would remarkably enrich the scientific authority of the Euclid Research Methods Library of Alexandria. My belief in this is based on what I have been saying for many years about the importance of huge amounts of data for the most reliable and holistic approach to biological research. I have always maintained that any single experiment studying the effect of one, two, or three factors on the output of a biological process yields only indicative information about the phenomenon as a whole, for it addresses a very small part of a multifactor matter, as any biological phenomenon is. This conviction probably comes from my experience during my PhD work at the University of London, in collaboration with the statistics department and the Computer Centre of London. There I was given the chance to work on the huge amounts of data of the “British National Beef Cattle Recording Scheme” of the MLC (Meat and Livestock Commission), in their task to replace and reorganize the existing database with a new management system, the SIR/DBMS (Scientific Information Retrieval/Database Management System). That was a very good opportunity to see what the most complete, integrated information in that huge database could be (with a wealth of genetic, productive, and reproductive data covering every breed in the UK). At the same time, we were lucky to witness the arrival of the new personal computers (PCs) with the most respected statistical packages of the day (e.g., SAS, SPSS), which gave us the chance to probe the limits of computers and their statistical packages in solving multifactor research problems.
The result of our whole effort was the satisfaction that we (the University, the MLC, and a special Division of the E.U.) felt when we obtained very interesting and significant results on multifactor effects, together with estimates of certain genetic parameters of practical interest. At the same time, we were able to test the limits of the computers and statistical packages in organizing, manipulating, and analyzing huge amounts of data. Unfortunately, after returning to Greece I was not lucky enough to have similarly rich research material with which to continue that work. However, a year ago at an international conference I met a professor from the University of Florida, USA, who gave a talk on “Data Mining and Optimization Issues in the Food Industry”, in which he discussed the importance of having big data and possible ways of using it properly. When I received your message I remembered him and wrote to ask for his talk and to discuss the matter further; I am awaiting his reply. In conclusion, I agree with your idea, which I consider a very significant contribution to the scientific authority of the “Big Euclid”. I also think very highly of your idea of organizing a special conference on the importance and use of big data as the unique source for the best integrated and holistic approach to biological research, for it can offer various very interesting capabilities, e.g., variable selection (within desired limits), correlations, interactions, specific relations, etc., resulting in many multidimensional (multipurpose) analyses. Based on the experience I described above, I could undertake a presentation of this kind, as a sample of the larger job. I can also invite other specialists (such as the colleague I mentioned) to present their work on this subject.
I think I have given you the thoughts you wanted from me. With best regards, Stergios Tzortzios Sep 30, 2016 |
||||||||||||||||||||||||||||||
I have the same views as Gary, and I agree with you that statisticians have to play a big part going forward in how we tackle big data. I think most (older) statisticians like me were trained primarily in how to model data and make inferences from controlled experiments, and, for statisticians in the pharma industry, from controlled randomized experiments. As a result, many of us are generally uncomfortable with data collected from observational studies due to the unknown confounders that likely exist. And big data is sort of like observational studies on steroids. So I think it comes down to needing more statisticians trained in how to best make sense of such data, and to come up with methods and approaches that instill more discipline in the process such that inferences can be made that are reproducible. I see that newer graduates in statistics are starting to take classes in big data, but it’s still early days on that front, so it will likely take some time before we see appreciable gains in the capabilities of classically trained statisticians in the handling of big data. I think the key is for departmental chairs at the universities to expand their programs to include more and more classes on big data. Rich Davies Oct 1, 2016 |
||||||||||||||||||||||||||||||
There are two meanings to the word statistics: (1) mathematical processes, and (2) a synonym for data. These findings do not surprise me. Statisticians don't use big data very much because (1) the concept is still pretty new, and (2) big data have no real statistical basis: there is no consistent source, no clearly defined denominator (or universe) for the numerator (the data themselves). Big data users don't use statistics, as in mathematical processes, because many don't know how, and because these concepts really don't apply to the amorphous mass referred to as big data. |
||||||||||||||||||||||||||||||
Ron, I am traveling and have not had time to give much thought to this, but I would like to put a simple idea or two on the table suggesting that classical statistics and big data are in fact complementary approaches to knowledge, and that what is missing, at this early point in time, is how the two become linked in a synergistic way. Here are my simple thoughts. They are short, so I am cc:ing this to all. 1. Big data is primarily data mining. It starts with: let's develop algorithms that rake through (mine) massive amounts of data looking for patterns of interest. This can be mining Google searches to find out what user X is interested in (to position ads), or epidemic data to identify patterns that lead to causes or treatment possibilities. 2. Classical statistics has elements of this (what do the mode, mean and midpoint, or the coefficient of variation, tell us), but it is also the cornerstone of hypothesis testing and "scientific knowledge" (i.e., that which survived the hypothesis test). To keep it overly simple, but as a good starting point: Big Data is looking for patterns, and Classical Statistics is testing for patterns. We then, in both cases, use the results to make inferences about further mining, further testing, or policies and implementation based on the knowledge teased out of the data. For me the way forward is how we work with both sets of techniques to broaden and deepen our knowledge, so we have a better idea of what we should be doing when we wake up in the morning and go back to our work, be that research or implementation based on that knowledge. (My 2 cents worth! :-\ ) Sam Lanfranco 7 Oct 2016
Ron, Big data and statistics may well continue to evolve by themselves in part, but science, devoted to testing, is also encountering big data sets. One way to start a dialogue here is as follows, and it would be aided by those with a foot, or knowledge, in both camps. Start with two simple questions: 1. Ask big data people what it is about classical statistics that they see as useless in their work, where they see useful tools, and where they see things that are missing. 2. Ask classical statistics people what it is about big data that they see as useless in their work, where they see useful tools, and where they see things that are missing. That discussion should generate some clarity about who does what, what is useful, what is missing, and some starting points for collaboration and "hybrid vigor" in our uses of data. Sam Lanfranco 7 Oct 2016 |
||||||||||||||||||||||||||||||
Dear Ron, It seems your conclusion is drawn from the frequency of the words "statistics" and "big data," which, in my view, cannot support your conclusion that the two areas have little overlap. Instead, you need to compare a list of keywords used commonly in the two areas, e.g., "prediction", "test", "cross-validation", etc. This is related to the use of topic models in text mining. For example, in news related to "Finance" you probably won't see the word "finance" appearing that much, but rather "equity", "World Bank", "company", "IPO". Such keywords can be produced automatically by topic-modelling code/packages. Ideally, you should use a topic model to analyze the two corpora of text/slides/articles, and then compare their similarity. Best, Feng Liang Oct 7, 2016 |
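Feng Liang's suggestion, comparing shared vocabularies rather than counting co-occurrences of the field names themselves, can be sketched very simply. The toy documents below are made up for illustration; a real analysis would fit a topic model over the two corpora as he suggests:

```python
def keyword_overlap(corpus_a, corpus_b):
    """Jaccard similarity between the vocabularies of two document collections."""
    def vocab(docs):
        return {word for doc in docs for word in doc.lower().split()}
    a, b = vocab(corpus_a), vocab(corpus_b)
    return len(a & b) / len(a | b)

# Hypothetical lecture titles from each field (illustrative only)
stats_docs = ["hypothesis test prediction cross-validation"]
bigdata_docs = ["prediction cross-validation clustering scalability"]
print(keyword_overlap(stats_docs, bigdata_docs))  # 2 shared words out of 6 total
```

Even in this toy case the two corpora share methodological keywords ("prediction", "cross-validation") while neither ever uses the other's field name, which is exactly the confound Liang identifies in the Google counts.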
||||||||||||||||||||||||||||||
Ron, NASA’s use of big data is primarily to observe things and make the archives available. Our NASA astronomy archives are relatively small owing to the difficulty of getting back large data volumes from space observatories. Our Earth observing missions (and the missions we build for NOAA) produce far more data. I’m not surprised at the small overlap of statistics and big data papers. The Big Data issues are more about discovering patterns, and the statistics papers are more about determining how significant they are. John Mather Oct 7, 2016 |
||||||||||||||||||||||||||||||
Dear Dr. LaPorte, I was fascinated by the Google-based analyses that you shared. But something about the method seemed to invite a possible fallacy (or at least, room for confounding). It took me a while to pin down what it was. The problem became clear from this search pair, which behaved in a similar way to what you described: On Google Scholar, “Physics” gets about 5.4 million hits; “auto mechanics” gets about 16,000 hits; but “Physics AND “auto mechanics”” gets only 3,000 hits, implying only a very small overlap. But does it follow that there “is very little linkage” between these two fields? Surely forces, torque, acceleration, and friction are just a few very key areas of overlap between auto mechanics and physics. Indeed, the authors of a paper “Auto mechanics in the physics lab: Science education for all”, John W. Tillotson and Paula Kluth, lament that this overlap is not used explicitly to bridge the stereotypical gap between “academic”- and “trades”-oriented streams in high school curricula. What this suggests is that the absence of Google overlaps tells a lot about the “silos” from which different writers think about and contribute on certain subjects; but it is not evidence of there being no inherent relation between the subjects. It may just need champions to get out the message. As this relates to your letter, may I recommend that you clarify your objective in asking your question. You have certainly shown that, as a historical fact, not many people write about these (related) topics outside the ‘silo’ or framework of their own particular interest. But if you are asking whether there is an inherent relation between the two topics “statistics” and “big data”, which people should be keeping in mind, the Google data neither confirm nor refute a connection. Does this make sense? Bill Goodman 7 Oct 2016 |
||||||||||||||||||||||||||||||
Dear Ron, I believe there are probably several types of differences between “Statistics” and “Big Data”. In many cases, “Big Data” deals with data from an entire population (e.g., all users of Google), while classical statistics typically deals with a representative sample (subset) of the population. This can lead to differences in key areas such as experimental design and statistical inference. With experimental design, we carefully plan how we are going to sample our data (to make inferences about our population). With statistical inference, we often use special models involving random variables and probability concepts so that we can compute the probability of making an incorrect decision. Many practitioners of “Big Data”, however, use clever algorithms to find structure (e.g., trends, clusters, etc.) in the data. Since many such practitioners of “Big Data” are working with the entire population, they seem less concerned about the classical statistical issues of experimental design and statistical inference. However, with changing technology, more and more situations are arising where scientific experiments (e.g., genetic experiments, high-dimensional measurements made with spectroscopy devices, etc.) are themselves generating “Big Data”. It is my belief that these particular “Big Data” situations still require good experimental design and statistical inference.
You might find the review paper “Statistical Modeling: The Two Cultures” by Leo Breiman (with discussions from several leading statisticians) interesting, as it touches on the subtle but important issue of classical statistical modeling vs. the algorithmic approaches now popular in “Big Data” practice. See http://projecteuclid.org/eucli John Peterson Oct 7, 2016 |
||||||||||||||||||||||||||||||
Ron: I agree that this is a critical conversation for our profession at this time, and I do have some thoughts on this issue. Enclosed are three publications about integrating statistical principles into Big Data Analytics/Data Science. The "Missing Link" paper is now published online (http://onlinelibrary.wiley.co In summary, my view is that both data science/Big Data and statistics have important roles to play in the field of analytics, but the current "competition" between the groups, with each claiming universal superiority, is hurting both professions. Chemists and chemical engineers have learned to live together despite significant overlap in technical backgrounds, and I think we can as well. Best wishes, Roger Hoerl 8 Oct 2016 |
||||||||||||||||||||||||||||||
As humans, it is well documented that our thinking embraces the physical world that we perceive through our senses (expanded through technology), as well as the conceptual world associated with language, culture, and our self-awareness of some of the mental paradigms we use (expanded through IT). In exploring Ron LaPorte’s question about possible distinctions between “classic statistics” and “big analytics”, my interpretation is that before the exponential development of IT in the last 50 years, looking at the physical world as constituted by “a collection of parts” made sense with “classic statistics”. Today, with the “Internet of Things” and the “growing interconnections between things and humans”, the resulting exploding complexity gives our environment new attributes linked more to the conceptual world than to the physical world. The study of these new attributes requires “Big Analytics”; therefore I propose the distinctions below:
|
||||||||||||||||||||||||||||||
Dear Ronald, Obviously "Big Data" falls within the scope of statistics. However, the advent of big data is a rather new challenge for statistics. For big data one must distinguish "big n" from "big p". "Big n" arises in the very large studies (cohorts or registries) that are now available. It's not entirely new. The risk here is detecting spuriously significant effects because of the very large power of the tests, which are thus very sensitive to assumptions. "Big p" is even more problematic, and the main issue here is multiplicity, which is clearly a statistical issue. Best wishes, Daniel Commenges Oct 10, 2016 |
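Daniel Commenges' "big n" warning, that with enormous samples even trivially small effects become statistically significant, follows directly from the z-statistic formula z = effect / (sd / sqrt(n)). A minimal sketch (the 0.01 effect size and the sample sizes are illustrative, not drawn from any study):

```python
import math

def z_statistic(effect, sd, n):
    """z for testing a mean difference `effect` against zero,
    with known standard deviation sd and sample size n."""
    return effect / (sd / math.sqrt(n))

# The same tiny effect (1% of a standard deviation):
print(z_statistic(0.01, 1.0, 10_000))       # ≈ 1.0: far from significant
print(z_statistic(0.01, 1.0, 100_000_000))  # ≈ 100: overwhelmingly "significant"
```

The effect is identical in both cases; only the sample size changed. This is why, at registry scale, a p-value alone says little, and attention shifts to effect sizes and to the model assumptions the test is so sensitive to.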
||||||||||||||||||||||||||||||