Wednesday, January 15, 2020

Smart data (part 1) - An overview of client data analysis




As Niccolo Machiavelli once said, “There is nothing more difficult, more perilous or more uncertain of success, than to take the lead in introducing a new order of things.”

This is what the data revolution is about. Not adding a few spreadsheets here and there or collecting more data, but rethinking what is available, as well as the data flow within a company, to make it meaningful. It requires us to "think different!" To make this difference more palpable we should call it "smart data", with the understanding that the "smart" which adds value to the data and transforms it into information is not intrinsic to the data: it is knowledge.

Smart data is not AI as it is understood today. It is data which has gone through a process which allows information to be extracted from it. This requires stepping back from pure statistical analysis to focus on process and context and, in that respect, it incorporates "intelligence".

Most companies generate raw data from their operations. It can be client data such as addresses and names, POS data such as actual sales, or any other type of data. These data are often poorly structured, neither clean nor accurate, and almost always lack context.

This is where generating smart data must start. Many companies give only cursory attention to their data, believing that analysis will generate the insight. This is a mistake! Analysis is only the very last step in a long process and often not the most important one, as we will see.

From data mining in the 1990s to the artificial intelligence tools of today, we have made great progress in our understanding of "data", although most of the great insights came over the years by accident.

What started as quantitative, brute-force analysis with data mining gave very few actual results, for the simple reason that information has nothing to do with minerals, and consequently the chance of finding anything of relevance by accident (or by statistical analysis, however advanced) is negligible. This is tantamount to buying a lottery ticket and expecting to be a winner. Obvious correlations are just that, and were therefore known long before statistics confirmed that they were real; they were already included in most companies' DNA as "knowledge", business practice or intangible assets. As for non-obvious correlations, they were often little more than that too, and usually represented no causation whatsoever. Pure statistical tools are deterministic and therefore not conducive to insight, contrary to most people's opinion.

And this, from the beginning, has been the real challenge for most companies. It is easy, and getting easier year after year, to generate data, but it is extremely hard to find actionable insight in the data and, conversely, common to get swamped by misleading numbers and wishful confirmation of firmly held pre-existing ideas.

From the early garbage-in, garbage-out meme to the ability to prove anything and its opposite, data scientists have shown that real science can quickly give birth to voodoo practices after the right number of iterations and enough added complexity. The main reason is that data analysis should head in the opposite direction: it should be kept simple, using as little data as possible but within a smart context which makes the data effective and actionable.

So, step by step, based on our experience, let's try to see how to build such a context to make sense of the data and actually get insight from what is available, without the complexity which often ends up generating vast amounts of misleading information. The sections below are only an outline which will be developed further in follow-up posts.

It is also important to note that these techniques only apply to client data, and more generally to "people" data, and are not relevant to other types of big data. Finance and markets, biology and weather modeling all use big data and statistical tools which are specific and mostly very different from the ones described below, which apply to marketing and client data analysis.



Starting point

The first obstacle is to define your goals precisely.

This was the birth defect of data mining! If you do not know what you are looking for, the chance that you will find something is very low. This sounds obvious but it had to be proved the hard way for everyone to be convinced. The reason is that although goals are usually easy to define in commercial terms, they can be much harder to define in data terms because in the end it requires the ability to link data input to sales output and therefore to understand perfectly the data, the process, the context and the results.

So right from the beginning, it is clear that data can only generate information if it is transformed into what I call "smart data" first.

"Smart Data"

As mentioned earlier, smart data is data which has gone through a process which permits information to be extracted from it. This means understanding the data itself, creating a process and a context, and linking all this to actual results.

Let's look at these points one by one.

Data taxonomy (generation and normalization of data)

The very first step, although obvious, is often overlooked because it requires understanding the whole process from the beginning:

What data should be collected and for what purpose?

Is the data static, so that it can be transferred in batches (a client list or POS transactions, for example), or is it variable and continuously updated (online data)? This is important because it will determine the tools which can be used to understand, visualize and analyze the data.

What is the flexibility of the data, its range and its variability?

This is most obvious when you create a graph and everything is crammed at the bottom! Obvious with a graph, but not necessarily obvious with other tools or when you do not yet understand the characteristics of the data or its variability.
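As a minimal sketch of how range and variability could be checked before any graph is drawn, the snippet below summarizes a numeric field and suggests a logarithmic axis when a few large values dwarf the rest. The function names, the threshold of 20 and the sample amounts are all illustrative assumptions, not part of any tool mentioned here, and the values are assumed to be strictly positive.

```python
import statistics


def describe_range(values):
    """Summarize the spread of a strictly positive numeric field."""
    lowest, highest = min(values), max(values)
    median = statistics.median(values)
    return {
        "min": lowest,
        "median": median,
        "max": highest,
        # A maximum far above the median means a few outliers will
        # flatten everything else against the axis on a linear graph.
        "max_to_median": highest / median,
    }


def suggest_scale(values, threshold=20):
    """Suggest 'log' when the largest value dwarfs the typical one.

    The threshold is an arbitrary, hypothetical cut-off.
    """
    ratio = describe_range(values)["max_to_median"]
    return "log" if ratio > threshold else "linear"


# Hypothetical purchase amounts: most are small, one is huge.
purchases = [1200, 800, 1500, 950, 1100, 250000]
print(describe_range(purchases))
print(suggest_scale(purchases))
```

A check like this costs almost nothing and tells you, before choosing a tool, whether a plain linear view will hide most of the data.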

Data quality and cleaning (de-duplication and homogenization)

Data cleaning is the foundation of data analysis. Without clean data, further analysis is useless. This is now well understood, and almost every company is aware of the necessity to have "clean" data... and actually does very little about it!

And that is simply because it is very hard!

In Japan, this problem can easily be understood through the challenge of clients' names and addresses. Names can be written in Chinese characters, Japanese characters (hiragana and katakana) or Roman characters. These can be mixed together, and the addresses can be arranged in ascending or descending order. The result is that two databases of different origins are usually almost impossible to merge, often because they contain large numbers of duplicates which are difficult to eliminate.

To solve this problem, it is necessary to format the data in a uniform way. Easier said than done! One way to do this is to break down the challenge into smaller ones and create as many fields as there are types of entries.

This is slightly easier in English than in Japanese but the challenge is similar.
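To make the idea of uniform formatting concrete, here is a minimal sketch using only Python's standard library. It assumes each record already has a separate field holding the phonetic reading of the name (called "name_kana" here, a hypothetical field name) and builds a comparison key from it. It does not attempt to match kanji or Roman spellings to their readings, which would need a dictionary or an external service.

```python
import unicodedata


def katakana_to_hiragana(text):
    """Shift full-width katakana (U+30A1..U+30F6) down to hiragana."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def normalize_name(raw):
    """Build a comparison key for a name field."""
    # NFKC folds half-width katakana and full-width Latin letters
    # into their standard forms.
    text = unicodedata.normalize("NFKC", raw)
    text = katakana_to_hiragana(text)
    # Remove all whitespace (including the ideographic space) and case.
    return "".join(text.split()).lower()


def deduplicate(records, key_field="name_kana"):
    """Keep the first record seen for each normalized key."""
    seen = {}
    for record in records:
        seen.setdefault(normalize_name(record[key_field]), record)
    return list(seen.values())


# Hypothetical records: the same reading typed three different ways.
records = [
    {"name_kana": "ﾔﾏﾀﾞ ﾀﾛｳ"},       # half-width katakana
    {"name_kana": "ヤマダ　タロウ"},   # full-width katakana, ideographic space
    {"name_kana": "やまだ たろう"},    # hiragana
]
print(len(deduplicate(records)))
```

In practice a name alone is rarely a safe key, so a sketch like this would be combined with other normalized fields (postal code, phone number) before any record is dropped.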

Data maintenance and updating

Another point related to data cleaning is to know and manage the time frame of the database. Older data may or may not be relevant. The same person may appear under three different addresses at different points in time, with little hint that this is the same individual.
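As a rough sketch of what managing the time frame can look like, the snippet below keeps only the most recent record per client and flags records that have not been updated for a long time. It assumes that a stable "client_id" and an "updated" date exist for every record (both hypothetical field names), which is precisely what is often missing in real databases, and the three-year window is an arbitrary choice.

```python
from datetime import date


def latest_records(records):
    """Keep only the most recent record for each client_id."""
    latest = {}
    for record in records:
        current = latest.get(record["client_id"])
        if current is None or record["updated"] > current["updated"]:
            latest[record["client_id"]] = record
    return latest


def is_stale(record, today, max_age_days=3 * 365):
    """Flag a record not updated within the chosen window."""
    return (today - record["updated"]).days > max_age_days


# Hypothetical history: one client seen at three addresses over time.
history = [
    {"client_id": 1, "address": "Osaka", "updated": date(2012, 4, 1)},
    {"client_id": 1, "address": "Tokyo", "updated": date(2019, 9, 1)},
    {"client_id": 1, "address": "Nagoya", "updated": date(2015, 1, 1)},
]
current = latest_records(history)
print(current[1]["address"])
print(is_stale(history[0], today=date(2020, 1, 15)))
```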
 
To give an actual example, while working with Facebook, we initially succeeded in matching only 40% of addresses. This was too low to be effective. Only after much effort, and after reaching a little over 60%, were we able to start sharing anonymized data with them and actually add value to their data analysis tools.

Data visualization

Data visualization is a first step towards data analysis which often brings more insight than anything else if done correctly.

Putting data on a map, for example, can very simply highlight gaps or complex geographic correlations which may not be obvious on a spreadsheet. Conversely, spreadsheets are more powerful with very large volumes of data, which may look random on a map or on a graph. (This is often the case with online data.)

In this respect, tools such as Tableau can be useful to visualize the same data in very different ways and give depth and angles to a database.

Data clustering

Finally, with data clustering, we leave the realm of raw data and enter data pre-analysis, as we generate clusters, indexes and proxy data which will help us understand the data and start more advanced analysis.

Since we have created many data clustering tools over the years in Japan, I will describe some of these tools as well as the insight we gained while building these in a specific post.

What is important to understand is that at this stage, the data is already structured, cleaned and well organized, and therefore much easier to make sense of. However, the most important part of the equation is still missing: context.

Creating context

To some extent, creating context is the most difficult part of data analysis and consequently the most important. Without context, reasoning is often circular and almost anything is possible. What does a 2% growth rate mean without a reference to a market, a goal or past achievements?
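The growth-rate question can be illustrated with a small sketch. The sales figures, the 5% market growth and the 3% target below are invented for illustration only; the point is simply that the same 2% reads very differently once an external reference is placed next to it.

```python
def growth_rate(previous, current):
    """Growth between two periods, in percent."""
    return 100 * (current - previous) / previous


def growth_in_context(own, market, target):
    """Compare our own growth with external references.

    All values are in percent; the results are in percentage points.
    """
    return {
        "vs_market": own - market,
        "vs_target": own - target,
    }


# Hypothetical figures: sales went from 100 to 102 while the market
# grew 5% and the internal target was 3%.
own = growth_rate(100, 102)
print(own)
print(growth_in_context(own, market=5.0, target=3.0))
```

With these assumed numbers, a 2% rise means losing ground to the market and missing the target, which is a conclusion the raw figure alone could never give.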

Context is necessarily external to the data; otherwise it is self-referential and therefore meaningless. This concept is very important to understand, as it is the reason data mining failed and the reason why the current wave of AI will eventually hit a brick wall too.

For this reason, as with data clustering, we will soon come back to this subject in detail.

Finally, data analysis

This subject, conversely, will not be developed, simply because there is already a lot of literature about it which highlights all the tools available. Correlations, random forests or Bayesian analysis are in any case the very last step of data analysis and, as explained, usually not the most crucial one. (At least for the 95% of the companies I have worked with which have not reached this level of sophistication.)


  

Saturday, January 4, 2020

For Softbank's Son, "Conflating Luck And Talent Is Dangerous" (article)





The longest bull market in history has separated the talented from the losers, or so it seems. But to get the big picture, more time is necessary for patterns and cycles to emerge. The success and recent failure of Softbank is a good metaphor; the lesson of hubris and arrogance is timeless.


Authored by Scott Galloway via ProfGalloway.com,
Third Base

The Dunning-Kruger effect posits that dumb people are too stupid to know they are dumb. They are not perplexed by difficult situations but overconfident — not knowing what they don’t know. As few people believe they are stupid, or a bad driver, a more relatable component of Dunning-Kruger is incorrectly believing one area of skill translates to another.

I suffer massively from this. I’m smarter than your average bear when it comes to marketing, so I’ve come to believe that makes me an expert on pretty much anything. I don’t know much about physics but constantly reference Galileo despite knowing little besides the fact that he dared challenge the church.

There is evidence of this all over the marketplace. Great P/E guys believe they would make great VCs and vice versa. Hedge fund managers believe two years of above-market returns means they are also great operators. To disabuse anybody of this notion, take them to a Sears. Billionaires running for president, actors starting skincare lines, and tech CEOs founding media firms. Being rich also naturally makes you a great film producer.

Masayoshi Son created $64 billion in shareholder value, mostly through deft acquisitions. Mr. Son can also boast of perhaps the best venture investment in history, $20 million into Alibaba that became $100 billion. That investment is tantamount to Michael Jordan hitting a grand slam on his first at bat wearing a Birmingham Barons hat.

Mr. Son has mistaken luck in venture investing for the ability to responsibly allocate billions based on a gut feeling. The size of SoftBank investments, relative to the diligence, now looks stupid, if not negligent. A writedown on an investment in a dog-walking app may have been avoided had someone in the SoftBank diligence team taken the time to discover they were investing $300 million in … a dog-walking app.

Conflating luck and talent is dangerous. As I get older, I’m struck by how big a part luck played in my life, and how much I mistook it for skill, well into my forties. The Pareto principle shows that even if competence is evenly distributed, 80% of effects stem from 20% of the causes.

Not recognizing your blessings feeds into the dark side of capitalism and meritocracy: the notion that success is a choice, and that those who haven’t achieved success are not unlucky, but unworthy. This leads to regressive policies that further reward the perceived winners and punish the perceived losers based on income level. The most recent example of our belief that poor people are guilty: The US now has the fourth-lowest tax rate in the world, and billionaires have the lowest tax rate of any cohort.

First Base

I constantly humblebrag that I was raised by a single immigrant mother who lived and died a secretary. But truth is I was born on third base. My parents got me to first base before I was born, immigrating to the US. This took courage, desire, and a dose of selfishness. Both left families that needed them. My mom left London when her two youngest siblings were still in an orphanage.

In Europe I’d make much less money being an entrepreneur and challenging institutions. In China I’d likely be in jail. Having one of my companies fail would have bankrupted me in Europe, as the tolerance for risk or failure is scant. I have no idea what would have happened in China. In the US, a tolerance for failure meant a lifestyle my parents couldn’t have imagined crossing the Atlantic on a steamship in 1961.

Second Base

I have some talent and have worked really hard, but mostly my success is due to being born in the right place at the right time, and being a white heterosexual male. Coming of professional age as a white male in the nineties was the greatest economic arbitrage in history. Today’s 54-to-70-year-olds saw the Dow Jones increase an average of 445% from 25-40, their prime working years. For other ages, it doubles at most.

Economic liberalization (globalization, technology, market deregulation) coupled with social norms that clung to the past meant 31% of America (white males) were given license over a lion’s share of the spoils. In nineties San Francisco, I raised over $100 million for my start-ups. I didn’t know a single woman under 40 who raised more than a million. And it seemed normal. Even today, white men hold 65% of elected offices despite being 31% of the population.

Third Base

Rich, fabulous people are the ideal billboards for luxury brands. Our nation’s best universities have adopted the same strategy. Universities are no longer nonprofits, but the highest-gross-margin luxury brands in the world. Another trait of a luxury brand is the illusion of scarcity. Over the last 30 years, the number of applicants to Stanford has tripled, while the size of the freshman class has remained static. Harvard and Stanford have become finishing school for the global wealthy.

In the class of 2013 in the Ivy League, five of the eight colleges (Dartmouth, Princeton, Yale, Penn, and Brown) had more students from the top 1% of the income scale than the bottom 60%.

Fast and Slow Thinking

According to @thetweetofgod, intelligence looks in the mirror and sees ignorance; ignorance looks in the mirror it sees intelligence. The sectors that have enjoyed the greatest prosperity spread across increasingly few people — technology and finance — have created an unprecedented level of arrogance among people born on third base.

When we feel threatened, we are more prone to see each other as an enemy, rather than someone who has a different opinion. We want to dismiss and fight the whole person, rather than just what they said. From primeval times, our brains have been set up to identify “enemy” or “one of us,” that simple binary distinction. Do I trust them as a person or are they not “one of us.” When we are in our more evolved, slow thinking mode (Daniel Kahneman), we evaluate arguments. When we are in our knee-jerk, threatened fast thinking, we decide the person is our enemy and argue from our amygdala, not our forebrain.

When we are threatened, we are also less empathic. Altruistic behavior decreases in times of greater income inequality. The rich are more generous in times of lesser inequality and less generous when inequality grows more extreme. When the poor need our help more, we are less likely to offer it, because we don’t see the poor as one of us. They become “them.”

Michael Lewis writes, “The problem is caused by the inequality itself: it triggers a chemical reaction in the privileged few. It tilts their brains. It causes them to be less likely to care about anyone but themselves or to experience the moral sentiments needed to be a decent citizen.”

The answer to the Fermi paradox (article)



Since this is a holiday in Japan, I would like to discuss an interesting subject:
"Where is everybody?", as formulated by Enrico Fermi in 1950.

But before giving "my" answer at the end, let's look at a short article summarizing the current views on this subject.


Scientists Say Aliens Should Have Already Visited Earth
Authored by Manuel Garcia Aguilar via TheMindUnleashed.com,

The debate about the existence of alien life has been a topic that has interested humans for a long time and the scientific community has had split opinions regarding our solitude in this amazingly big universe.

Now, new research published in the Astronomical Journal provides further information that invites us to rethink our mindset on this topic.

During the summer of 1950, physicist Enrico Fermi posed a question to his colleagues over lunch:

    “Don’t you ever wonder where everybody is?”

He was referring to alien life.

The Earth is 4.5 billion years old, and we could say that that was roughly the time it took a “kind of life” to be capable of space travel. Our universe is approximately 13.8 billion years old.

Fermi proposed that during this time, the galaxy should have been overrun with intelligent, technologically-advanced aliens. Yet, we have no evidence of this despite decades of searching. This postulate became known as the Fermi Paradox.

Briefly, some of the main points of this paradox, formalized by Michael H. Hart, are:

    There are billions of stars in the Milky Way similar to the Sun.
    With high probability, some of these stars have Earth-like planets, and if the Earth is typical, some may have already developed intelligent life.
    Some of these civilizations may have developed interstellar travel.
    Even at the slow pace of currently envisioned interstellar travel, the Milky Way galaxy could be completely traversed in a few million years.
    And since many of the stars similar to the Sun are billions of years older, this would seem to provide plenty of time.

Now you can see more clearly why this paradox is so interesting to scientists and why further investigation is being done: the odds seem to be really high.

The expectation that the universe should be teeming with intelligent life is linked to models like the Drake equation, which suggests that even if the probability of intelligent life developing at a given site is small, the sheer multitude of possible sites should nonetheless yield a large number of potentially observable civilizations.
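For reference, the Drake equation mentioned above is usually written in the form below. This is the standard textbook formulation, added here as a reminder; it is not quoted from the article.

```latex
N = R_{*} \cdot f_{p} \cdot n_{e} \cdot f_{l} \cdot f_{i} \cdot f_{c} \cdot L
```

Here N is the number of civilizations in the galaxy whose signals could be detected, R_* the average rate of star formation, f_p the fraction of stars with planets, n_e the number of planets per such system that could support life, f_l the fraction of those on which life appears, f_i the fraction of those that develop intelligence, f_c the fraction that release detectable signals, and L the length of time over which they do so. Every factor after the first two remains highly uncertain, which is why the result can range from almost nothing to very large numbers.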

This new study offers a different perspective on the question: maybe aliens are just taking their time and being strategic.

“If you don’t account for the motion of stars when you try to solve this problem, you’re basically left with one of two solutions,” Jonathan Carroll-Nellenback, the study’s lead author, said.

    “Either nobody leaves their planet, or we are in fact the only technological civilization in the galaxy.”

Stars orbit the center of the galaxy on different paths at different speeds. They occasionally pass each other, so aliens could be waiting for their next destination to come closer, Carroll-Nellenback’s study says.

Researchers have formulated different theories trying to answer the Fermi Paradox, including the possibility that all alien life forms in oceans below a planet’s surface. There is even the “zoo hypothesis”, which imagines that societies in our galaxy decided not to contact us in order to “preserve” us, in a way analogous to how we preserve some natural places, or even to prevent them from getting some kind of “disease” from us.

A crucial element of this new study is that, as previously mentioned, the galaxy moves. So, aliens could be waiting for an optimal travel distance to explore new territories.

    “If long enough is a billion years, well then that’s one solution to the Fermi paradox,” Carroll-Nellenback said.

Another important thing to notice is that the research team did not attempt to guess at the aliens’ motivations or politics, something that has usually held back attempts to solve the Fermi Paradox.

We also have to consider that our consciousness and our perception of the concept of “civilization” may play a crucial part in these kinds of studies. So, our predictions may be based on our own behavior.

    “We tried to come up with a model that would involve the fewest assumptions about sociology that we could,” Carroll-Nellenback said.

So far, we’ve detected about 4,000 planets outside of our solar system and none have been shown to host life. But we haven’t looked that hard—there are at least 100 billion stars in the Milky Way and even more planets, so we still have a lot more to explore.

Maybe, merging philosophy and science together for a moment, we could believe that at some point, if there is in fact alien life out there in the universe, we (or our kids, grandkids, or great-grandkids) will get to know them and make really close contact. This assumes some of the ideas set out in Kant’s Critique of Pure Reason, where he says that if something can happen, and there is enough time for that to happen, it will happen.



My answer to the Fermi paradox:

Here's what we know on the subject at this stage:
- Our galaxy contains between 200 and 400 billion stars (nobody knows exactly how many) and most of them probably have planets, including Earth-like ones in the habitable zone.
- The history of Earth is complex, with the stabilizing presence of the Moon, a low tilt of the orbit and plate tectonics, but not improbably strange, so there must be many other similar cases.

Based on this alone, it is highly likely that we will find some clues, such as the presence of oxygen around Earth-like objects, within 10 to 15 years.

But what about advanced civilizations? Indeed, why don't we see them? Why are they not here already?

The answer may in fact be extremely simple: advanced civilizations are like supernovas. They explode technologically, within the blink of an eye, into something we cannot understand, due to exponential growth. We can call it a paradigm shift or, more likely, a phase transition into the unknown.

This may be why we do not see them! They do not travel among the stars and do not colonize entire galaxies. Our universe may be the crib of countless civilizations, but as soon as they mature beyond our level of technology, they simply transform into something which is not yet accessible to our understanding.

That, I believe, is another, more logical answer to the Fermi paradox!

OpenAI o3 Might Just Break the Internet (Video - 8mn)

  A catchy title, but in fact just a translation of the previous video without the jargon. In other words: AGI is here!