As Niccolo Machiavelli once said, “There is nothing more difficult, more perilous or more uncertain of success, than to take the lead in introducing a new order of things.”
This is what the data revolution is about. Not adding a few spreadsheets here and there or collecting more data, but rethinking what is available, as well as the data flow within a company, to make it meaningful. It requires us to "think different!" To make this difference more palpable we should call it "smart data", with the understanding that the "smart" which adds value to the data and transforms it into information is not intrinsic to the data; it is knowledge.
Smart data is not AI as it is understood today. It is data which has gone through a process which allows information to be extracted from it. Conversely, it requires stepping back from pure statistical analysis to focus on process and context, and in that respect it incorporates "intelligence".
Most companies generate raw data from their operations. It can be client data such as addresses and names, POS data such as actual sales, or any other type of data. This data is often poorly structured, neither clean nor accurate, and almost always lacks context.
This is where generating data must start. Many companies only give cursory attention to their data believing that analysis will generate the insight. This is a mistake! Analysis is only the very last step in a long process and often not the most important one as we will see.
From data mining in the 1990s to artificial intelligence tools nowadays, we have made great progress in our understanding of "data", although most of the great insights came over the years by accident.
What started as quantitative, brute-force analysis with data mining yielded very few actual results, for the simple reason that information has nothing to do with minerals, and that consequently the chance of finding anything of relevance by accident (or by statistical analysis, however advanced) is negligible. This is tantamount to buying a lottery ticket and expecting to be a winner. Obvious correlations are just that: they were known long before statistics confirmed they were real, and were already embedded in most companies' DNA as "knowledge", business practice or intangible assets. As for non-obvious correlations, they were often little more than that too, and usually represented no causation whatsoever. Pure statistical tools are deterministic and therefore not conducive to insight, contrary to most people's opinion.
And this, from the beginning, has been the real challenge for most companies. It is easy, and getting easier year after year, to generate data, but it is extremely hard to find actionable insight in that data, and conversely very common to get swamped by misleading numbers and wishful confirmation of strongly held pre-existing ideas.
From the early garbage-in, garbage-out meme to the ability to prove anything and its opposite, data scientists have shown that real science can quickly give birth to voodoo practices after the right number of iterations and layers of complexity. The main reason is that data analysis should head in the opposite direction: it should be kept simple, using as little data as possible, but within a smart context which makes the data effective and actionable.
So, step by step, based on our experience, let's try to see how to build such a context, to make sense of the data and actually get insight from what is available, without the complexity which often ends up generating vast amounts of misleading information. The chapters below are only an outline which will be developed further in follow-up posts.
It is also important to note that these techniques only apply to client data, and more generally to "people" data, and are not relevant to other types of big data. Finance and markets, biology and weather modeling all use big data and statistical tools which are specific and mostly very different from the ones described below, which apply to marketing and client data analysis.
Starting point
The first obstacle is to define your goals precisely.
This was the birth defect of data mining! If you do not know what you are looking for, the chance that you will find something is very low. This sounds obvious, but it had to be proved the hard way for everyone to be convinced. The reason is that although goals are usually easy to define in commercial terms, they can be much harder to define in data terms, because in the end it requires the ability to link data input to sales output, and therefore to understand perfectly the data, the process, the context and the results.
So right from the beginning, it is clear that data can only generate information if it is transformed into what I call "smart data" first.
"Smart Data"
As mentioned earlier, smart data is data which has gone through a process which permits information to be extracted from it. This means understanding the data itself, creating a process and a context, and linking all this to actual results.
Let's look at these points one by one.
Data taxonomy (generation and normalization of data)
The very first step, although obvious, is often overlooked because it requires understanding the whole process from the beginning:
What data should be collected and for what purpose?
Is the data static, and can therefore be transferred in batches (a client list or POS transactions for example), or is it variable and updated permanently (online data)? This is important because it will determine the tools which can be used to understand, visualize and analyze the data.
What is the flexibility of the data, its range and its variability?
This is most obvious when you create a graph and everything is crammed at the bottom! Obvious with a graph, but not necessarily obvious with other tools, or when you do not yet understand the characteristics of the data or of its variability. A quick check of the distribution, as sketched below, is often enough to catch this early.
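A minimal sketch of that quick check, assuming a hypothetical CSV of POS transactions with an "amount" column (the file and column names are illustrative, not from a real project):

```python
import pandas as pd
import numpy as np

df = pd.read_csv("pos_transactions.csv")          # hypothetical file

# Inspect range and variability before choosing a visualization
print(df["amount"].describe())                    # min, max, quartiles, std

# If a handful of large values crush everything else to the bottom of a
# graph, a log transform often makes the distribution readable again
df["log_amount"] = np.log10(df["amount"].clip(lower=1))
print(df["log_amount"].describe())
```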
Data quality and cleaning (de-duplication and homogenization)
Data cleaning is the cornerstone of data analysis. Without clean data, further analysis is useless. This is now well understood: almost every company is aware of the necessity of having "clean" data... and actually does very little about it!
And that is simply because it is very hard!
In Japan, this problem can easily be understood through the challenge of clients' names and addresses. Names can be written in Chinese characters (kanji), Japanese characters (hiragana and katakana) or Roman characters. These can be mixed together, and addresses can be arranged in ascending or descending order. The result is that two databases of different origin are usually almost impossible to merge, often because they contain large numbers of duplicates which are difficult to eliminate.
To solve this problem, it is necessary to format the data in a uniform way. Easier said than done! One way to do this is to break the challenge down into smaller ones and create as many fields as there are types of entries.
This is slightly easier in English than in Japanese but the challenge is similar.
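A minimal sketch of this field-by-field approach, assuming hypothetical column names. Real Japanese name and address parsing is much harder and usually needs dedicated reference tables, but the principle is the same: normalize each field separately, then deduplicate on a matching key.

```python
import unicodedata
import pandas as pd

def normalize_text(s: str) -> str:
    # NFKC folds full-width/half-width variants and normalizes kana forms
    s = unicodedata.normalize("NFKC", str(s))
    return s.strip().lower()

df = pd.read_csv("customers.csv")                 # hypothetical file

# Normalize each field separately rather than the whole record at once
for col in ["family_name", "given_name", "prefecture", "city", "street"]:
    df[col] = df[col].map(normalize_text)

# Build a matching key from the normalized fields and drop duplicates
df["match_key"] = df["family_name"] + "|" + df["given_name"] + "|" + df["city"]
deduped = df.drop_duplicates(subset="match_key")
```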
Data maintenance and updating
Another point related to data cleaning is knowing and managing the time frame of the database. Older data may or may not be relevant. The same person may appear under three different addresses at different points in time, with little hint that this is the same individual.
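One simple way to handle this, sketched here under the assumption that each record carries an "updated_at" timestamp and a person-level key built during cleaning (both hypothetical names), is to keep only the most recent record per person:

```python
import pandas as pd

df = pd.read_csv("customers_history.csv", parse_dates=["updated_at"])

# Keep only the latest record per person, so older addresses do not
# show up as three separate individuals
latest = (df.sort_values("updated_at")
            .drop_duplicates(subset="person_key", keep="last"))
```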
To give an actual example: while working with Facebook, we initially succeeded in matching only 40% of addresses. This was too low to be effective. Only after much effort, and reaching a little over 60%, were we able to start sharing anonymized data with them and actually add value to their data analysis tools.
Data visualization
Data visualization is a first step towards data analysis, and when done correctly it often brings more insight than anything else.
Putting data on a map, for example, can very simply highlight gaps or complex (geographic) correlations which may not be obvious on a spreadsheet. Conversely, spreadsheets are more powerful when handling very large volumes of data, which may look random on a map or a graph. (This is often the case with online data.)
In this respect, tools such as Tableau can be useful to visualize the same data in very different ways and give depth and angles to a database.
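For readers who prefer code to a dedicated tool, here is a minimal sketch of the map idea, assuming the records already carry latitude and longitude columns (hypothetical names); gaps in coverage tend to stand out immediately on such a plot:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers_geocoded.csv")        # hypothetical file

plt.scatter(df["longitude"], df["latitude"], s=2, alpha=0.3)
plt.xlabel("Longitude")
plt.ylabel("Latitude")
plt.title("Client locations")
plt.show()
```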
Data clustering
Finally, with data clustering, we leave the realm of raw data and enter data pre-analysis, as we generate clusters, indices and proxy data which will help us understand the data and start more advanced analysis.
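To make the idea concrete, here is a minimal sketch using one common, generic approach (k-means on RFM-style features); it is only an illustration, not the specific tooling described later in this post, and the file and column names are hypothetical:

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("customer_features.csv")             # hypothetical file
features = ["recency_days", "frequency", "monetary"]  # hypothetical columns

X = StandardScaler().fit_transform(df[features])      # scale before clustering
df["cluster"] = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# The cluster label then becomes an index/proxy field for later analysis
print(df.groupby("cluster")[features].mean())
```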
Since we have created many data clustering tools over the years in Japan, I will describe some of these tools, as well as the insight we gained while building them, in a specific post.
What is important to understand is that at this stage, the data is already structured, cleaned and well organized, and therefore much easier to make sense of. However, the most important part of the equation is still missing: context.
Creating context
To some extent, creating context is the most difficult part of data analysis, and consequently the most important. Without context, reasoning is often circular and almost anything can be made to look plausible. What does a 2% growth rate mean without a reference to a market, a goal or past achievements?
Context is necessarily external to the data; otherwise it is self-referential and therefore meaningless. This concept is very important to understand, as it is the reason data mining failed and the reason why the current wave of AI will eventually hit a brick wall too.
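A toy illustration of the 2% question above, with invented numbers: the same figure reads very differently depending on the external reference it is compared against.

```python
company_growth = 0.02
market_growth  = 0.05   # external benchmark (invented)
plan_target    = 0.01   # internal goal (invented)

print(f"vs market: {company_growth - market_growth:+.1%}")  # -3.0%: losing share
print(f"vs plan:   {company_growth - plan_target:+.1%}")    # +1.0%: ahead of plan
```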
For this reason, as with data clustering, we will come back to this subject in detail soon.
Finally data analysis
Conversely, this subject will not be developed here, simply because there is already a lot of literature about it which covers all the available tools. Correlations, random forests or Bayesian analysis are in any case the very last step of data analysis and, as explained, usually not the most crucial one. (At least for the 95% of companies I have worked with which have not reached this level of sophistication.)