With AI, data preparation, which was already important, is becoming crucial. The potential to get it wrong is multiplying as the options for what we call data taxonomy increase, and, as the article highlights, many of the steps are not yet standardized.
Published in Forbes COGNITIVE WORLD
by Ron Schmelzer
On May 7, 2019
https://www.forbes.com/sites/cognitiveworld/2019/03/07/the-achilles-heel-of-ai/#4cbff5177be7
Garbage in is garbage out. There’s no truer saying in computer science, and it’s especially the case with artificial intelligence. Machine learning algorithms are very dependent on accurate, clean, and well-labeled training data to learn from so that they can produce accurate results. If you train your machine learning models with garbage, it’s no surprise you’ll get garbage results. It’s for this reason that the vast majority of the time spent on AI projects goes to the data collection, cleaning, preparation, and labeling phases.
According to a recent report from AI research and advisory firm Cognilytica, over 80% of the time in AI projects is spent dealing with and wrangling data. Even more important, and perhaps surprising, is how human-intensive much of this data preparation work is. In order for supervised forms of machine learning to work, especially the multi-layered deep learning neural network approaches, they must be fed large volumes of examples of correct data that is appropriately annotated, or “labeled”, with the desired output result. For example, if you’re trying to get your machine learning algorithm to correctly identify cats inside of images, you need to feed that algorithm thousands of images of cats, appropriately labeled as cats, with the images not containing any extraneous or incorrect data that will throw the algorithm off as you build the model. (Disclosure: I’m a principal analyst with Cognilytica.)
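To make the idea of “appropriately labeled” training data concrete, here is a minimal sketch of what a labeled image manifest might look like and how it could be sanity-checked before training. The file name labels.csv, the columns, and the label values are hypothetical examples, not something from the report.

    import pandas as pd

    # Hypothetical manifest: one row per training image with its human-assigned label, e.g.
    #   image_path,label
    #   images/0001.jpg,cat
    #   images/0002.jpg,not_cat
    labels = pd.read_csv("labels.csv")

    # Sanity checks before training: only known labels, and no duplicate examples.
    allowed = {"cat", "not_cat"}
    assert labels["label"].isin(allowed).all(), "unexpected label values"
    assert not labels["image_path"].duplicated().any(), "duplicate training examples"

    print(labels["label"].value_counts())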
Data Preparation: More than Just Data Cleaning
According to Cognilytica’s report, there are many steps required to get data into the right “shape” so that it works for machine learning projects (short, illustrative code sketches for these steps follow the list):
Removing or correcting bad data and duplicates - Data in the enterprise
environment is exceedingly “dirty” with incorrect data, duplicates, and other
information that will easily taint machine learning models if not removed or
replaced.
Standardizing and formatting data - Just how many different ways are there
to represent names, addresses, and other information? Images are many different
sizes, shapes, formats, and color depths. In order to use any of this for
machine learning projects, the data needs to be represented in the exact same
manner or you’ll get unpredictable results.
Updating out-of-date information - The data might be in the right format and accurate, but out of date. You can’t train machine learning systems when you’re mixing current with obsolete (and irrelevant) data.
Enhancing and augmenting data - Sometimes you need extra data to make
the machine learning model work, such as calculated fields or additional
sourced data to get more from existing data sets. If you don’t have enough
image data, you can actually “multiply” it by simply flipping or rotating
images while keeping their data formats consistent.
Reduce noise - Images, text, and data can have “noise”, which is extraneous information or pixels that don’t really help with the machine learning project. Data preparation activities will clear this up.
Anonymize and de-bias data - Remove all unnecessary personally
identifiable information from machine learning data sets and remove all
unnecessary data that can bias algorithms.
Normalization - For many machine learning algorithms, especially
Bayesian Classifiers and other approaches, data needs to be represented in
standard ranges so that one input doesn’t overpower others. Normalization works
to make training more effective and efficient.
Data sampling - If you have very large data sets, you need to sample that data for the training, test, and validation phases, and also extract subsamples to make sure that the data is representative of the real-world scenario.
Feature enhancement - Machine learning algorithms work by training on “features” in the data. Data preparation tools can accentuate and enhance the data so that the features the algorithms should be trained on are more easily separated from less relevant data.
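A minimal sketch of the first step, removing bad data and duplicates, using pandas; the file and column names are hypothetical:

    import pandas as pd

    df = pd.read_csv("customers.csv")  # hypothetical enterprise data set

    # Drop exact duplicate rows and rows missing critical fields.
    df = df.drop_duplicates()
    df = df.dropna(subset=["email", "signup_date"])

    # Remove obviously incorrect values, e.g. impossible ages.
    df = df[df["age"].between(0, 120)]

    df.to_csv("customers_clean.csv", index=False)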
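Standardizing and formatting might look like the following sketch, which normalizes how text fields are written and resizes images to a single shape (the file names, columns, and the 224x224 target size are assumptions):

    import pandas as pd
    from PIL import Image

    # Tabular data: write the same value the same way everywhere.
    df = pd.read_csv("customers_clean.csv")
    df["state"] = df["state"].str.strip().str.upper()    # "ca " -> "CA"
    df["name"] = df["name"].str.strip().str.title()      # "jANE doe" -> "Jane Doe"
    df["signup_date"] = pd.to_datetime(df["signup_date"]).dt.strftime("%Y-%m-%d")
    df.to_csv("customers_standardized.csv", index=False)

    # Image data: one size, one color mode, one format.
    img = Image.open("images/0001.jpg").convert("RGB").resize((224, 224))
    img.save("standardized_0001.jpg")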
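Updating or filtering out-of-date records can be as simple as the sketch below, which keeps only rows touched within a chosen window (the two-year cutoff and the column name are assumptions):

    import pandas as pd

    df = pd.read_csv("customers_standardized.csv", parse_dates=["last_updated"])

    # Keep only records updated within the last two years; stale rows are set aside.
    cutoff = pd.Timestamp.now() - pd.DateOffset(years=2)
    current = df[df["last_updated"] >= cutoff]
    current.to_csv("customers_current.csv", index=False)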
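The image “multiplication” trick mentioned above (flipping and rotating) could look like this sketch using Pillow; the directory layout and rotation angle are hypothetical:

    from pathlib import Path
    from PIL import Image, ImageOps

    src, dst = Path("images"), Path("augmented")
    dst.mkdir(exist_ok=True)

    for path in src.glob("*.jpg"):
        img = Image.open(path).convert("RGB")
        img.save(dst / path.name)                                      # original
        ImageOps.mirror(img).save(dst / f"flip_{path.name}")           # horizontal flip
        img.rotate(15, expand=False).save(dst / f"rot15_{path.name}")  # small rotation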
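For image noise, one simple form of cleanup is a median filter, sketched below with Pillow (the file names are placeholders):

    from PIL import Image, ImageFilter

    img = Image.open("augmented/0001.jpg")
    denoised = img.filter(ImageFilter.MedianFilter(size=3))  # smooths away speckle noise
    denoised.save("denoised_0001.jpg")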
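Anonymizing and de-biasing tabular data often starts with simply dropping columns, as in this sketch (the column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("customers_current.csv")

    # Drop direct identifiers the model does not need.
    pii = ["name", "email", "phone", "street_address"]
    # Drop attributes that could bias the model's decisions.
    sensitive = ["gender", "ethnicity"]

    df = df.drop(columns=[c for c in pii + sensitive if c in df.columns])
    df.to_csv("customers_anonymized.csv", index=False)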
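Normalization can be sketched with scikit-learn’s MinMaxScaler, which rescales each numeric column to the [0, 1] range so that no single input overpowers the others (the column names are assumptions):

    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.read_csv("customers_anonymized.csv")
    numeric_cols = ["age", "income", "visits_per_month"]

    # Rescale each numeric column to [0, 1] before training.
    scaler = MinMaxScaler()
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])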
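Data sampling into training, test, and validation sets is sketched below with scikit-learn; stratifying on the label keeps each subset representative of the whole (the split sizes and column names are assumptions):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("prepared_data.csv")

    # Hold out a test set first, then split the remainder into train and validation.
    train_val, test = train_test_split(df, test_size=0.15, stratify=df["label"], random_state=42)
    train, val = train_test_split(train_val, test_size=0.15, stratify=train_val["label"], random_state=42)

    print(len(train), len(val), len(test))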
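Feature enhancement often means deriving calculated fields that make the useful signal easier to learn, as in this sketch (the column names are hypothetical):

    import pandas as pd

    df = pd.read_csv("prepared_data.csv", parse_dates=["signup_date", "last_purchase"])

    # Derived features that may separate the signal from less relevant raw columns.
    df["tenure_days"] = (df["last_purchase"] - df["signup_date"]).dt.days
    df["spend_per_visit"] = df["total_spend"] / df["visits_per_month"].clip(lower=1)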
You can imagine that performing all these steps on gigabytes, or even terabytes, of data can take significant amounts of time and energy, especially if you have to do it over and over until you get things right. It’s no surprise that these steps take up the vast majority of machine learning project time. Fortunately, the report also details solutions from third-party vendors, including Melissa Data, Paxata, and Trifacta, whose products can perform the above data preparation operations on large volumes of data at scale.
Data Labeling: The Necessary Human in the Loop
In order for machine learning systems to learn, they need to be trained with data that represents the thing the system needs to know. Obviously, as detailed above, that data needs to be not only good quality but also “labeled” with the right information. Simply having a bunch of pictures of cats doesn’t train the system unless you tell the system that those pictures are cats -- or a specific breed of cat, or just an animal, or whatever it is you want the system to know. Computers can’t put those labels on the images themselves, because it would be a chicken-and-egg problem. How can you label an image if you haven’t fed the system labeled images to train it on?
The answer is that you need people to do that. Yes, the secret heart of all AI systems is human intelligence that labels the data the systems later train on. Human-powered data labeling is the necessary component for any machine learning model that needs to be trained on data that hasn’t already been labeled. A growing set of vendors is providing on-demand labor to help with this labeling, so companies don’t have to build up their own staff or expertise to do so. Companies like CloudFactory, Figure Eight, and iMerit have emerged to provide this capacity to organizations that are wise enough not to build up their own labor force for data labeling.
Eventually, there will be a large number of already-trained neural networks that organizations can use for their own purposes, or extend via transfer learning to new applications. But until that time, organizations need to deal with the human-dominated labor involved in data labeling, something Cognilytica has identified as taking up to 25% of total machine learning project time and cost.
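To illustrate what extending an already-trained network via transfer learning might look like, here is a minimal sketch using PyTorch and torchvision; the article does not prescribe any framework, and the two-class head and learning rate are assumptions:

    import torch
    import torch.nn as nn
    from torchvision import models

    # Start from a network pretrained on ImageNet instead of training from scratch.
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

    # Freeze the pretrained layers so only the new head is updated.
    for param in model.parameters():
        param.requires_grad = False

    # Replace the final layer with one sized for the new task (here, 2 classes).
    model.fc = nn.Linear(model.fc.in_features, 2)

    # Train only the new layer's parameters on the newly labeled data set.
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)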
AI Helping Data Preparation
Even with all this activity in data preparation and labeling, Cognilytica sees that AI will have an impact on this process. Increasingly, data preparation firms are using AI to automatically identify data patterns, autonomously clean data, apply normalization and augmentation based on previously learned patterns, and aggregate data where necessary based on previous machine learning projects. Likewise, machine learning is being applied to data labeling to speed up the process by suggesting potential labels, applying bounding boxes, and otherwise accelerating the annotation work. In this way, AI is being applied to help make future AI systems even better.
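One common form of this is model-assisted “pre-labeling”, where an existing model proposes labels and bounding boxes that a human then reviews and corrects. Below is a minimal sketch using a pretrained torchvision detector; the 0.8 confidence threshold and the file path are assumptions:

    import torch
    from torchvision import transforms
    from torchvision.models.detection import fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights
    from PIL import Image

    # A pretrained detector proposes candidate boxes and labels on unlabeled images.
    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = fasterrcnn_resnet50_fpn(weights=weights).eval()

    img = Image.open("unlabeled/0001.jpg").convert("RGB")
    tensor = transforms.ToTensor()(img)

    with torch.no_grad():
        pred = model([tensor])[0]

    # Keep only confident proposals; a human annotator reviews and corrects them.
    categories = weights.meta["categories"]
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if score > 0.8:
            print(categories[int(label)], [round(v, 1) for v in box.tolist()], round(float(score), 2))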
The final conclusion of the report is that the data side of any machine learning project is usually the most labor-intensive part. A market is emerging to help make those labor tasks less onerous and costly, but they can never be eliminated. Successful AI projects will learn how to leverage third-party software and services to minimize the overall cost and impact and get to real-world deployment more quickly.