No, Machine Learning is not just glorified Statistics
Published on: Towards Data Science
by: Joe Davidson
on: June 27, 2018
https://towardsdatascience.com/no-machine-learning-is-not-just-glorified-statistics-26d3952234e3
This
meme has been all over social media lately, producing appreciative
chuckles across the internet as the hype around deep learning begins to
subside. The sentiment that machine learning is really nothing to get
excited about, or that it’s just a redressing of age-old statistical
techniques, is growing increasingly ubiquitous; the trouble is it isn’t
true.
I
get it — it’s not fashionable to be part of the overly enthusiastic,
hype-drunk crowd of deep learning evangelists. ML experts who in 2013
preached deep learning from the rooftops now use the term only with a
hint of chagrin, preferring instead to downplay the power of modern
neural networks lest they be associated with the scores of people that
still seem to think that `import keras` is the leap for every hurdle, and that they, in knowing it, have some tremendous advantage over their competition.
While it’s true that deep learning has outlived its usefulness as a buzzword, as Yann LeCun put it,
this overcorrection of attitudes has yielded an unhealthy skepticism
about the progress, future, and usefulness of artificial intelligence.
This is most clearly seen by the influx of discussion about a looming AI winter, in which AI research is prophesied to stall for many years as it has in decades past.
The
purpose of this post isn’t to argue against an AI winter, however. It
is also not to argue that one academic group deserves the credit for
deep learning over another; rather, it is to make the case that credit is
due; that the developments seen go beyond big computers and nicer
datasets; that machine learning, with the recent success in deep neural
networks and related work, represents the world’s foremost frontier of
technological progress.
Machine Learning != Statistics
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML. When you’re implementing, it’s logistic regression.”
—everyone on Twitter ever
The
main point to address, and the one that provides the title for this
post, is that machine learning is not just glorified statistics—the
same-old stuff, just with bigger computers and a fancier name. This
notion comes from statistical concepts and terms which are prevalent in
machine learning such as regression, weights, biases, models, etc.
Additionally, many models approximate what can generally be considered
statistical functions: the softmax output of a classification model is
computed from logits, making the process of training an image classifier a
form of multinomial logistic regression.
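To make that connection concrete, here's a minimal sketch (plain numpy rather than a deep learning framework, purely for brevity; the shapes and data are made up) of why a linear layer followed by a softmax is exactly multinomial logistic regression:

```python
import numpy as np

def softmax(logits):
    # Subtract the row max before exponentiating, for numerical stability.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# A linear layer followed by softmax IS multinomial logistic regression:
# logits = x @ W + b, probabilities = softmax(logits).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # 4 examples, 3 features
W = rng.normal(size=(3, 2))   # 2 classes
b = np.zeros(2)
probs = softmax(x @ W + b)
print(probs.sum(axis=-1))     # each row sums to 1
```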
Though
this line of thinking is technically correct, reducing machine learning
as a whole to nothing more than a subsidiary of statistics is quite a
stretch. In fact, the comparison doesn’t make much sense. Statistics is
the field of mathematics which deals with the understanding and
interpretation of data. Machine learning is nothing more than a class of
computational algorithms (hence its emergence from computer science).
In many cases, these algorithms are completely useless in aiding with
the understanding of data and assist only in certain types of
uninterpretable predictive modeling. In some cases, such as in
reinforcement learning, the algorithm may not use a pre-existing dataset
at all. Plus, in the case of image processing, referring to images as instances of a dataset with pixels as features was a bit of a stretch to begin with.
The
point, of course, is not that computer scientists should get all the
credit or that statisticians should not; like any field of research, the
contributions that led to today’s success came from a variety of
academic disciplines, statistics and mathematics being first among them.
However, in order to correctly evaluate the powerful impact and
potential of machine learning methods, it is important to first
dismantle the misguided notion that modern developments in artificial
intelligence are nothing more than age-old statistical techniques with
bigger computers and better datasets.
Machine Learning Does Not Require An Advanced Knowledge of Statistics
Hear me out.
When I was learning the ropes of machine learning, I was lucky enough
to take a fantastic class dedicated to deep learning techniques that was
offered as part of my undergraduate computer science program. One of
our assigned projects was to implement and train a Wasserstein GAN in
TensorFlow.
At
this point, I had taken only an introductory statistics class that was a
required general elective, and then promptly forgotten most of it.
Needless to say, my statistical skills were not very strong. Yet, I was
able to read and understand a paper on a state-of-the-art generative
machine learning model, implement it from scratch, and generate quite
convincing fake images of non-existent individuals by training it on the
MS Celebs dataset.
Throughout
the class, my fellow students and I successfully trained models for
cancerous tissue image segmentation, neural machine translation,
character-based text generation, and image style transfer, all of which
employed cutting-edge machine learning techniques invented only in the
past few years.
Yet,
if you had asked me, or most of the students in that class, how to
calculate the variance of a population, or to define marginal
probability, you likely would have gotten blank stares.
That seems a bit inconsistent with the claim that AI is just a rebranding of age-old statistical techniques.
True,
an ML expert probably has a stronger stats foundation than a CS
undergrad in a deep learning class. Information theory, in general,
requires a strong understanding of data and probability, and I would
certainly advise anyone interested in becoming a Data Scientist or
Machine Learning Engineer to develop a deep intuition of statistical
concepts. But the point remains: If
machine learning is a subsidiary of statistics, how could someone with
virtually no background in stats develop a deep understanding of
cutting-edge ML concepts?
It
should also be acknowledged that many machine learning algorithms
require a stronger background in statistics and probability than do most
neural network techniques, but even these approaches are often referred
to as statistical machine learning or statistical learning,
as if to distinguish themselves from the regular, less statistical
kind. Furthermore, most of the hype-fueling innovation in machine
learning in recent years has been in the domain of neural networks, so
the point is irrelevant.
Of
course, machine learning doesn’t live in a world by itself. Again, in
the real world, anyone hoping to do cool machine learning stuff is
probably working on data problems of a variety of types, and therefore
needs to have a strong understanding of statistics as well. None of this
is to say that ML never uses or builds on statistical concepts either,
but that doesn’t mean they’re the same thing.
Machine Learning = Representation + Evaluation + Optimization
To
be fair to myself and my classmates, we all had a strong foundation in
algorithms, computational complexity, optimization approaches, calculus,
linear algebra, and even some probability. All of these, I would argue,
are more relevant to the problems we were tackling than knowledge of
advanced statistics.
Machine
learning is a class of computational algorithms which iteratively
“learn” an approximation to some function. Pedro Domingos, a professor
of computer science at the University of Washington, laid out three components that make up a machine learning algorithm: representation, evaluation, and optimization.
Representation involves
the transformation of inputs from one space to another more useful
space which can be more easily interpreted. Think of this in the context
of a Convolutional Neural Network. Raw pixels are not useful for
distinguishing a dog from a cat, so we transform them to a more useful
representation (e.g., logits from a softmax output) which can be
interpreted and evaluated.
Evaluation
is essentially the loss function. How effectively did your algorithm
transform your data to a more useful space? How closely did your softmax
output resemble your one-hot encoded labels (classification)? Did you
correctly predict the next word in the unrolled text sequence (text
RNN)? How far did your latent distribution diverge from a unit Gaussian
(VAE)? These questions tell you how well your representation function is
working; more importantly, they define what it will learn to do.
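As a sketch of what two of these evaluation functions look like in practice, here they are in plain numpy (the inputs are made-up values for illustration, not output from a real model):

```python
import numpy as np

# Cross-entropy between a softmax output and a one-hot label (classification).
probs = np.array([0.7, 0.2, 0.1])
one_hot = np.array([1.0, 0.0, 0.0])
cross_entropy = -np.sum(one_hot * np.log(probs))
print(f"cross-entropy: {cross_entropy:.3f}")  # -log(0.7) ≈ 0.357

# Closed-form KL divergence of a diagonal Gaussian from a unit Gaussian,
# the regularization term in a VAE's loss.
mu = np.array([0.5, -0.3])
log_var = np.array([0.1, -0.2])
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
print(f"KL from unit Gaussian: {kl:.3f}")
```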
Optimization is the last piece of the puzzle. Once you have the evaluation component, you can optimize the representation function in order to improve your evaluation metric.
In neural networks, this usually means using some variant of stochastic
gradient descent to update the weights and biases of your network
according to some defined loss function. And voila! You have the world’s
best image classifier (at least, if you’re Geoffrey Hinton in 2012, you
do).
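All three components show up in even the simplest trainable model. Here's a minimal sketch in plain numpy — a logistic classifier fit to made-up toy data with gradient descent — with each of Domingos's components labeled:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy binary classification data: two well-separated Gaussian blobs.
x = np.concatenate([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b = np.zeros(2), 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(200):
    # Representation: map raw inputs into a space where classes separate.
    p = sigmoid(x @ w + b)
    # Evaluation: binary cross-entropy measures how good the mapping is.
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # Optimization: gradient descent nudges parameters to reduce the loss.
    grad_w = x.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((sigmoid(x @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```

Swap the representation for a deep convolutional network and the optimizer for a stochastic gradient variant, and the same three-part skeleton describes an image classifier.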
When training an image classifier, it's quite irrelevant that the learned
representation function has logistic outputs, except in defining an
appropriate loss function. Borrowing statistical terms like logistic regression
does give us useful vocabulary to discuss our model space, but it does not
redefine our models from problems of optimization into problems of data
understanding.
Aside: The term artificial intelligence is stupid. An AI problem is just a problem that computers aren’t good at solving yet. In the 19th century, a mechanical calculator was considered intelligent.
Now that the term has been associated so strongly with deep learning,
we’ve started saying artificial general intelligence (AGI) to refer to
anything more intelligent than an advanced pattern matching mechanism.
Yet, we still don’t even have a consistent definition or understanding
of general intelligence. The only thing the term AI does is inspire fear
of a so-called “singularity” or a terminator-like killer robot. I wish
we could stop using such an empty, sensationalized term to refer to real
technological techniques.
Techniques For Deep Learning
Further
defying the purported statistical nature of deep learning is, well,
almost all of the internal workings of deep neural networks. Fully
connected nodes consist of weights and biases, sure, but what about
convolutional layers? Rectifier activations? Batch normalization?
Residual layers? Dropout? Memory and attention mechanisms?
These
innovations have been central to the development of high-performing
deep nets, and yet they don’t remotely line up with traditional
statistical techniques (probably because they are not statistical
techniques at all). If you don’t believe me, try telling a statistician
that your model was overfitting, and ask them if they think it’s a good
idea to randomly drop half of your model’s 100 million parameters.
And let’s not even talk about model interpretability.
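For the curious, dropout really is that blunt an instrument. Here's a sketch of inverted dropout (the variant most frameworks implement) in plain numpy — an illustration, not any particular framework's code:

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    # Inverted dropout: zero out a random fraction of activations during
    # training, then rescale the survivors so their expected value is
    # unchanged. At inference time the layer is a no-op.
    if not training or rate == 0.0:
        return activations
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(1000)
dropped = dropout(h, rate=0.5, rng=rng)
print(f"fraction zeroed: {np.mean(dropped == 0):.2f}")  # ~0.5
print(f"mean activation: {dropped.mean():.2f}")         # ~1.0
```

There is no analogue to "randomly delete half your regressors every step, it helps" in a classical statistics textbook; it's an empirical regularization trick that happens to work spectacularly well.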
Regression Over 100 Million Variables — No Problem?
Let
me also point out the difference between deep nets and traditional
statistical models by their scale. Deep neural networks are huge. The
VGG-16 ConvNet architecture, for example, has approximately 138 million parameters. How do you think your average academic advisor would respond to a student wanting to perform a multiple regression of over 100 million variables? The idea is ludicrous. That’s because training VGG-16 is not multiple regression — it’s machine learning.
New Frontiers
You’ve
probably spent the last several years around endless papers, posts, and
articles preaching the cool things that machine learning can now do, so
I won’t spend too much time on it. I will remind you, however, that not
only is deep learning more than previous techniques, it has enabled us
to address an entirely new class of problems.
Prior
to 2012, problems involving unstructured and semi-structured data were
challenging, at best. Trainable CNNs and LSTMs alone were a huge leap
forward on that front. This has yielded considerable progress in fields
such as computer vision, natural language processing, and speech
transcription, and has enabled huge improvements in technologies like
face recognition, autonomous vehicles, and conversational AI.
It’s
true that most machine learning algorithms ultimately involve fitting a
model to data — from that vantage point, it is a statistical procedure.
It’s also true that the space shuttle was ultimately just a flying
machine with wings, and yet we don’t see memes mocking the excitement
around NASA’s 20th century space exploration as an overhyped rebranding
of the airplane.
As
with space exploration, the advent of deep learning did not solve all
of the world’s problems. There are still significant gaps to overcome in
many fields, especially within “artificial intelligence”. That said, it
has made a significant contribution to our ability to attack problems
with complex unstructured data. Machine learning continues to represent
the world’s frontier of technological progress and innovation. It’s much
more than a crack in the wall with a shiny new frame.
Edit:
Many
have interpreted this article as a diss on the field of statistics, or
as a betrayal of my own superficial understanding of machine learning.
In retrospect, I regret directing so much attention to the differences
between the ML and statistics perspectives rather than to my central point: machine learning is not all hype.
Let
me be clear: statistics and machine learning are not unrelated by any
stretch. Machine learning absolutely utilizes and builds on concepts in
statistics, and statisticians rightly make use of machine learning
techniques in their work. The distinction between the two fields is
unimportant, and something I should not have focused so heavily on.
Recently, I have been focusing on the idea of Bayesian neural networks. BNNs involve
approximating a probability distribution over a neural network’s
parameters given some prior belief. These techniques give a principled
approach to uncertainty quantification and yield better-regularized
predictions.
I would have to be an idiot, working on these problems, to say I’m not
“doing statistics”, and I won’t. The fields are not mutually exclusive,
but that does not make them the same, and it certainly does not make
either without substance or value. A mathematician could point to a
theoretical physicist working on Quantum field theory and rightly say
that she is doing math, but she might take issue if the mathematician
asserted that her field of physics was in fact nothing more than
over-hyped math.
So
it is with the computational sciences: you may point your finger and
say “they’re doing statistics”, and “they” would probably agree.
Statistics is invaluable in machine learning research and many
statisticians are at the forefront of that work. But ML has developed
100-million parameter neural networks with residual connections and
batch normalization, modern activations, dropout and numerous other
techniques which have led to advances in several domains, particularly
in sequential decision making and computational perception. It has found
and made use of incredibly efficient optimization algorithms, taking
advantage of automatic differentiation and running in parallel on
blindingly fast and cheap GPU technology. All of this is accessible to
anyone with even basic programming abilities thanks to high-level,
elegantly simple tensor manipulation software. “Oh, AI is just logistic
regression” is a bit of an under-sell, don’t ya think?