Thursday, March 7, 2019

The four waves of AI


A very realistic overview of the current state of AI
by Kai-Fu Lee
Published in Fortune on November 1, 2018

http://fortune.com/2018/10/22/artificial-intelligence-ai-deep-learning-kai-fu-lee/

We are not "there" yet! What we currently call AI is nothing of the sort and is limited to very specific applications. Even so, this is already enough to start a new technology revolution, as Kai-Fu Lee explains.

THE TERM “ARTIFICIAL INTELLIGENCE” was coined in 1956, at a historic conference at Dartmouth, but it has been only in the past 10 years, for the most part, that we’ve seen the first truly substantive glimpses of its power and application. A.I., as it’s now universally called, is the pursuit of performing tasks usually reserved for human cognition: recognizing patterns, predicting outcomes clouded by uncertainty, and making complex decisions. A.I. algorithms can perceive and interpret the world around us—and some even say they’ll soon be capable of emotion, compassion, and creativity—though the original dream of matching overall “human intelligence” is still very far away.
What changed everything a decade or so ago was an approach called “deep learning”—an architecture inspired by the human brain, with neurons and connections. As the name suggests, deep-learning networks can be thousands of layers deep and have up to billions of parameters. Unlike the human brain, however, such networks are “trained” on huge amounts of labeled data; then they use what they’ve “learned” to mathematically pick out and recognize incredibly subtle patterns within other mountains of data. A data input to the network can be anything digital—say, an image, or a sound segment, or a credit card purchase. The output, meanwhile, is a decision or prediction related to whatever question might be asked: Whose face is in the image? What words were spoken in the sound segment? Is the purchase fraudulent?
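To make the "train on labeled data, then predict" loop concrete, here is a minimal sketch in Python. It is not the deep network Lee describes (a simple logistic regression stands in for a many-layer model), and the purchase features and labels are invented, but the shape of the process is the same: labeled examples go in, a prediction about a new input comes out.

# A minimal stand-in for the train-then-predict loop. Features and labels are
# invented; a real system would use a deep network and millions of examples.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row is a credit card purchase: [amount, hour of day, foreign merchant?]
X = np.array([[12.0, 14, 0],
              [950.0, 3, 1],
              [30.0, 19, 0],
              [1200.0, 2, 1]])
y = np.array([0, 1, 0, 1])                  # labels: 0 = legitimate, 1 = fraudulent

model = LogisticRegression().fit(X, y)      # "training" on labeled data
print(model.predict([[800.0, 4, 1]]))       # the output: is this new purchase fraudulent?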
This technological breakthrough was paralleled with an explosion in data—the vast majority of it coming from the Internet—which captured human activities, intentions, and inclinations. While a human brain tends to focus on the most obvious correlations between the input data and the outcomes, a deep-learning algorithm trained on an ocean of information will discover connections between obscure features of the data that are so subtle or complex we humans cannot even describe them logically. When you combine hundreds or thousands of them together, they naturally outstrip the performance of even the most experienced humans. A.I. algorithms now beat humans in speech recognition, face recognition, the games of chess and Go, reading MRIs for certain cancers, and any quantitative field—whether it’s deciding what loans to approve or detecting credit card fraud.
Such algorithms don’t operate in a vacuum. To perform their analyses, they require huge sets of data to train on and vast computational power to process it all. Today’s A.I. also functions only in clearly defined single domains. It’s not capable of generalized intelligence or common sense—AlphaGo, for example, which beat the world’s masters in the ancient game of Go, does not play chess; algorithms trained to determine loan underwriting, likewise, cannot do asset allocation.
With deep learning and the data explosion as catalysts, A.I. has moved from the era of discovery to the era of implementation. For now, at least, the center of gravity has shifted from elite research laboratories to real-world applications. In essence, deep learning and big data have boosted A.I. onto a new plateau. Companies and governments are now exploring that plateau, looking for ways to apply present artificial intelligence capabilities to their activities, to squeeze every last drop of productivity out of this groundbreaking technology (see our next story). This is why China, with its immense market, data, and tenacious entrepreneurs, has suddenly become an A.I. superpower.
What makes the technology more powerful still is that it can be applied to a nearly infinite number of domains. The closest parallel we’ve seen up until now may well be electricity. The current era of A.I. implementation can be compared with the era in which humans learned to apply electricity to all the tasks in their life: lighting a room, cooking food, powering a train, and so on. Likewise, today we’re seeing the application of A.I. in everything from diagnosing cancer to the autonomous robots scurrying about in corporate warehouses.

FROM WEB-LINKED TO AUTONOMOUS

A.I. APPLICATIONS can be categorized into four waves, which are happening simultaneously, but with different starting points and velocity:
The first stage is “Internet A.I.” Powered by the huge amount of data flowing through the web, Internet A.I. leverages the fact that users automatically label data as we browse: buying vs. not buying, clicking vs. not clicking. These cascades of labeled data build a detailed profile of our personalities, habits, demands, and desires: the perfect recipe for more tailored content to keep us on a given platform, or to maximize revenue or profit.
The second wave is “business A.I.” Here, algorithms can be trained on proprietary data sets ranging from customer purchases to machine maintenance records to complex business processes—and ultimately lead managers to improved decision-making. An algorithm, for example, might study many thousands of bank loans and repayment rates, and learn if one type of borrower is a hidden risk for default or, alternatively, a surprisingly good, but overlooked, lending prospect. Medical researchers, similarly, can use deep-learning algorithms to digest enormous quantities of data on patient diagnoses, genomic profiles, resultant therapies, and subsequent health outcomes and perhaps discover a worthy personalized treatment protocol that would have otherwise been missed. By scouting out hidden correlations that escape our linear cause-and-effect logic, business A.I. can outperform even the most veteran of experts.
The third wave of artificial intelligence—call it “perception A.I.”—gets an upgrade with eyes, ears, and myriad other senses, collecting new data that was never before captured, and using it to create new applications. As sensors and smart devices proliferate through our homes and cities, we are on the verge of entering a trillion-sensor economy. This includes speech interfaces (from Alexa and Siri to future supersmart assistants that remember everything for you) as well as computer-vision applications—from face recognition to manufacturing quality inspection.
The fourth wave is the most monumental but also the most difficult: “autonomous A.I.” Integrating all previous waves, autonomous A.I. gives machines the ability to sense and respond to the world around them, to move intuitively, and to manipulate objects as easily as a human can. Included in this wave are autonomous vehicles that can “see” the environment around them: recognizing patterns in the camera’s pixels (red octagons, for instance); figuring out what they correlate to (stop signs); and then using that information to make decisions (applying pressure to the brake in order to slowly stop the vehicle). In the area of robotics, such advanced A.I. algorithms will be applied to industrial applications (automated assembly lines and warehouses), commercial tasks (dishwashing and fruit-harvesting robots), and eventually consumer ones too.

THE CHANGES YET TO COME

BECAUSE A.I. CAN BE PROGRAMMED to maximize profitability or replace human labor, it adds immediate value to the economy. A.I. is fast, accurate, works around-the-clock, doesn’t complain, and can be applied to many tasks, with substantial economic benefit. How substantial? PwC estimates that the technology will contribute about $16 trillion to worldwide GDP by 2030.
But that gift doesn’t come without challenges to humanity. The first and foremost is job displacement: Since A.I. can perform single tasks with superhuman accuracy—and most human jobs are single-task—it follows that many routine jobs will be replaced by this next-generation tech. That includes both white-collar and blue-collar jobs. A.I. also faces questions with security, privacy, data bias, and monopoly maintenance. All are significant issues with no known solution, so governments and corporations should start working on them now.
But one concern we don’t have to face quite yet is the one that may be most common these days, cast in the image of science-fiction movies—that machines will achieve true human-level (or even superhuman-level) intelligence, making them capable presumably of threatening mankind.
We’re nowhere near that. Today’s A.I. isn’t “general artificial intelligence” (the human kind, that is), but rather narrow—limited to a single domain. General A.I. requires advanced capabilities like reasoning, conceptual learning, common sense, planning, cross-domain thinking, creativity, and even self-awareness and emotions, which remain beyond our reach. There are no known engineering paths to evolve toward the general capabilities above.

How far are we from general A.I.? I don’t think we even know enough to estimate. We would need dozens of big breakthroughs to get there, when the field of A.I. has seen only one true breakthrough in 60 years. That said, narrow A.I. will bring about a technology revolution the magnitude of the Industrial Revolution or larger—and one that’s happening much faster. It’s incumbent upon us to understand its monumental impact, widespread benefits, and serious challenges.

This essay is adapted from Lee’s new book, AI Superpowers: China, Silicon Valley, and the New World Order (Houghton Mifflin Harcourt). He is the chairman and CEO of Sinovation Ventures and the former president of Google China.
This article originally appeared in the November 1, 2018 issue of Fortune.


Wednesday, March 6, 2019

Your Data Literacy Depends on Understanding the Types of Data and How They’re Captured

Article by Hugo Bowne-Anderson
published in Harvard Business Review
Oct 23, 2018
https://hbr.org/2018/10/your-data-literacy-depends-on-understanding-the-types-of-data-and-how-theyre-captured




The ability to understand and communicate about data is an increasingly important skill for the 21st-century citizen, for three reasons. First, data science and AI are affecting many industries globally, from healthcare and government to agriculture and finance. Second, much of the news is reported through the lenses of data and predictive models. And third, so much of our personal data is being used to define how we interact with the world.

When so much data is informing decisions across so many industries, you need to have a basic understanding of the data ecosystem in order to be part of the conversation. On top of this, the industry that you work in will more likely than not see the impact of data analytics. Even if you yourself don’t work directly with data, having this form of literacy will allow you to ask the right questions and be part of the conversation at work.

To take just one striking example, imagine if there had been a discussion around how to interpret probabilistic models in the run-up to the 2016 U.S. presidential election. FiveThirtyEight, the data journalism publication, gave Clinton a 71.4% chance of winning and Trump a 28.6% chance. As Allen Downey, Professor of Computer Science at Olin College, points out, fewer people would have been shocked by the result had they been reminded that a Trump win, according to FiveThirtyEight’s model, was a bit more likely than flipping two coins and getting two heads – hardly something that’s impossible to imagine.
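Downey's comparison is easy to check with one line of arithmetic (the 28.6% figure comes from the article; the rest is just the probability of two heads):

p_two_heads = 0.5 ** 2      # probability of two heads on two fair coin flips = 0.25
p_trump_538 = 0.286         # FiveThirtyEight's final forecast for a Trump win
print(p_two_heads, p_trump_538)   # 0.25 vs 0.286: the model gave Trump slightly better odds than two heads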

What we talk about when we talk about data

The data-related concepts non-technical people need to understand fall into five buckets: (i) data generation, collection and storage, (ii) what data looks and feels like to data scientists and analysts, (iii) statistics intuition and common statistical pitfalls, (iv) model building, machine learning and AI, and (v) the ethics of data, big and small.

The first four buckets roughly correspond to key steps in the data science hierarchy of needs, as recently proposed by Monica Rogati. Although it has not yet been formally incorporated into data science workflows, I have added data ethics as the fifth key concept because ethics needs to be part of any conversation about data. So many people’s lives, after all, are increasingly affected by the data they produce and the algorithms that use them. This article will focus on the first two; I’ll leave the other three for a future article.

How data is generated, collected and stored

Every time you engage with the Internet, whether via web browser or mobile app, your activity is detected and most often stored. To get a feel for some of what your basic web browser can detect, check out Clickclickclick.click, a project that opens a window into the extent of passive data collection online. If you are more adventurous, you can install data selfie, which “collect[s] the same information you provide to Facebook, while still respecting your privacy.”
The collection of data isn’t relegated merely to the world of laptop, smartphone and tablet interactions; it extends to the far wider Internet of Things (IoT), a catch-all for traditionally dumb objects, such as radios and lights, that can be smartified by connecting them to the Internet, along with any other data-collecting devices, such as fitness trackers, Amazon Echo and self-driving cars.
All the collected data is stored in what we colloquially refer to as “the cloud,” and it’s important to clarify what’s meant by this term. Firstly, data in cloud storage exists in physical space, just like on a computer or an external hard drive. The difference for the user is that the space it exists in is elsewhere, generally on server farms and data centers owned and operated by multinationals, and you usually access it over the Internet. Cloud storage providers come in two types, public and private. Public cloud services such as Amazon, Microsoft and Google are responsible for data management and maintenance, whereas the responsibility for data in private clouds remains that of the company. Facebook, for example, has its own private cloud.
It is essential to recognize that cloud services store data in physical space, and the data may be subject to the laws of the country where it is located. This year’s General Data Protection Regulation (GDPR) in the EU impacts user data privacy and consent around personal data. Another pressing question is security, and we need to have a more public and comprehensible conversation around data security in the cloud.

The feel of data

Data scientists mostly encounter data in one of three forms: (i) tabular data (that is, data in a table, like a spreadsheet), (ii) image data or (iii) unstructured data, such as natural language text or HTML code, which makes up the majority of the world’s data.
Tabular data. The most common type for a data scientist to use is tabular data, which is analogous to a spreadsheet. In Robert Chang’s article on “Using Machine Learning to Predict Value of Homes On Airbnb,” he shows a sample of the data, which appears in a table in which each row is a particular property and each column a particular feature of properties, such as host city, average nightly price and 1-year revenue. (Note that data rarely arrives from the user already in tabular form; data engineering is an essential step to make it ready for such an analysis.)
Such data is used to train, or teach, machine learning models to predict Lifetime Values (LTV) of properties, that is, how much revenue they will bring in over the course of the relationship.
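As an illustration only (the column names below are invented, not Chang's actual schema), tabular data of this kind is just a small table: one row per property, one column per feature.

import pandas as pd

# Toy rows in the spirit of the Airbnb example; values and columns are made up.
listings = pd.DataFrame({
    "host_city":     ["Paris", "Tokyo", "Austin"],
    "nightly_price": [120.0, 95.0, 150.0],
    "revenue_1yr":   [18200.0, 11400.0, 22100.0],
})
print(listings)   # this is the shape a model is trained on to predict lifetime value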
Image data. Image data is data that consists of, well, images. Many of the successes of deep learning have occurred in the realm of image classification. The ability to diagnose disease from imaging data, such as diagnosing cancerous tissue from combined PET and CT scans, and the ability of self-driving cars to detect and classify objects in their field of vision are two of many use cases of image data. To work with image data, a data scientist will convert an image into a grid (or matrix) of red-green-blue pixel values, or numbers, and use these matrices as inputs to their predictive models.
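A minimal sketch of that conversion, assuming the Pillow and NumPy libraries and a local file called scan.png (a hypothetical file name, not from the article):

import numpy as np
from PIL import Image

img = Image.open("scan.png").convert("RGB")   # hypothetical image file
pixels = np.asarray(img)                      # shape: (height, width, 3) -- one red, green, blue value per pixel
print(pixels.shape, pixels.dtype)             # this grid of numbers is what the model actually receives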
Unstructured data. Unstructured data is, as one might guess, data that isn’t organized in either of the above manners. Part of the data scientist’s job is to structure such unstructured data so it may be analyzed. Natural language, or text, provides the clearest example. One common method of turning textual data into structured data is to represent it as word counts, so that “the cat chased the mouse” becomes “(cat,1),(chased,1),(mouse,1),(the,2)”. This is called a bag-of-words model, and allows us to compare texts, to compute distances between them, and to combine them into clusters. Bag-of-words performs surprisingly well for many practical applications, especially considering that it doesn’t distinguish “build bridges not walls” from “build walls not bridges.” Part of the game here is to turn textual data into numbers that we can feed into predictive models, and the principle is very similar between bag-of-words and more sophisticated methods. Such methods allow for sentiment analysis (“is a text positive, negative or neutral?”) and text classification (“is a given article news, entertainment or sport?”), among many others. For a recent example of text classification, check out Cloudera Fast Forward Labs’ prototype Newsie.
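The bag-of-words idea fits in a few lines of Python, using the article's own examples; note how the two opposite slogans end up with identical representations.

from collections import Counter

print(Counter("the cat chased the mouse".split()))
# Counter({'the': 2, 'cat': 1, 'chased': 1, 'mouse': 1})

bridges = Counter("build bridges not walls".split())
walls   = Counter("build walls not bridges".split())
print(bridges == walls)   # True -- word order is discarded, which is the model's known weakness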
These are just two of the five steps to working with data, but they’re essential starting points for data literacy. When you’re dealing with data, think about how the data was collected and what kind of data it is. That will help you understand its meaning, how much to trust it, and how much work needs to be done to convert it into a useful form.


Hugo Bowne-Anderson, Ph.D., is a data scientist and educator at DataCamp, as well as the host of the podcast “DataFramed.” @hugobowne

What publishers need to know now about creating a better ad experience

Article by Kelsey LeBeau,
published on "Think with Google"
https://www.thinkwithgoogle.com/marketing-resources/better-ad-standards/

When you work in digital advertising, it’s easy to forget what it’s like to be an average internet user. But when you take a step back and experience the web as most people do, you begin to understand why so many people employ ad blockers.
Ad blocking is bad news for everyone in digital advertising, including publishers who depend on ad revenue to fund content and advertisers trying to connect with audiences. But ad blocking is really a symptom of a broken user experience — one that marketers, agencies, publishers, and ad technology providers must work together to help fix.

For years, the user experience has been tarnished by irritating and intrusive ads. Thanks to extensive research by the Coalition for Better Ads, we now know which ad formats and experiences users find the most annoying. Working from this data, the Coalition has developed the Better Ads Standards, offering publishers and advertisers a road map for the formats and ad experiences to avoid.

 Since the Coalition launched the ad standards in January 2018, I’ve been working with publishers to help them improve the ads on their sites. Based on my experience, here are three key things publishers need to know about what this change means for them.

1. Bad ad formats can hurt you

After surveying nearly 66,000 web and mobile users, the Coalition identified four categories of desktop ads and eight types of mobile ads that fall below the threshold for acceptability. For example, more than half of all consumers said they would not revisit or share a page that had a pop-up ad. Similarly, many desktop web users were annoyed by video ads that automatically played audio, and most said they would avoid prestitial ads with countdowns and sites featuring sticky ads that obscured large portions of a site’s content.
On mobile, user preferences are even broader. In addition to the above ad types, consumers also said they disliked mobile pages with ad densities greater than 30%, flashing animated ads, prestitial and poststitial ads, and full-screen rollovers.
Before joining this project, I admit that I was a champion for mobile interstitials. It was easy to celebrate the revenue potential and underestimate the damage these ads caused to user experience. Now, we all know better.

2. Convert ‘bad ads’ into good ones

The good news is that people don’t hate all ads, just the most annoying ones. In fact, the Coalition’s research identified several common ad practices that resonated more positively with people. For example, narrow ads running down the right side of desktop webpages or small sticky ads at the top of mobile screens are viewed more favorably. The research also offers a path for converting more irksome formats (such as pop-ups) into less intrusive ones (like full-screen, inline ads) that are just as effective.
Publishers have told us that they are relieved to finally have data to understand which intrusive ad formats to avoid selling. Saying yes to a potentially lucrative but annoying campaign can be tempting, but it is important to take stock of the potential negative effects.

3. Determine if your ad experiences are compliant

Since the Coalition’s definitions of ad experiences may differ from commonly used ad format names, many publishers may not even be aware if their ads violate the Better Ads Standards.
That’s why we released the Ad Experience Report in the Google Search Console. The tool reviews a sample of the pages on your site, identifies any ads that run afoul of the standards, and gives you the opportunity to remove or replace the ads and have your site reviewed again.
There’s a good chance that if you don’t have pop-up or self-playing ads on your site, you’re already compliant with the Better Ads Standards. In fact, around 98% of sites have no violations, and most sites with violations have already resolved their issues.

If issues are identified, they will be listed in the report and you’ll be notified. When all ads on the site comply with the standards, you can resubmit the site for review. Publishers should make sure that ads across their entire site comply with the Better Ads Standards before submitting for another review. Only a limited number of review requests are allowed, and only a sample of the site’s pages are reviewed each time.

It’s encouraging to see the progress being made over the last year to improve user experiences and to make the user a top priority for publishers everywhere. Putting the user first is a win-win situation. What’s good for them often ends up being good for business.


Kelsey LeBeau

Contributor, Publisher Product Partnerships at Google


The personal data ecosystem

This is the most complete table of the full personal data ecosystem I have found. It has gone from almost nothing to gigantic in less than 10 years and is starting to transform industries and sectors one after the other. Almost everything will be impacted over the coming years. There is no going back. The "data" (quantitative analysis) understanding of each and every activity will keep growing until it becomes our understanding of how everything works and AI redefines everything we do.

Tuesday, March 5, 2019

Sunday, March 3, 2019

Facebook, PII Data and the GDPR!




Now that the Facebook controversy is behind us, it is time to turn back and understand what all this was really about and ask what "privacy" means on an Internet where the product is "us"!

When Facebook launched in February 2004, the motto was: "It's free and always will be!" Not as daunting as Google's "Don't be evil!" but quite bold nevertheless, since it is hard to imagine a less commercial statement for a company listed on a stock exchange.

Since then, of course, we've learned better and everybody now knows the deeper truth of the Internet: "When it's free, you are the product!" The economic model is in fact so overwhelming that the whole Internet has developed on this principle without much thought about the consequences. Whenever we install a new "free" application, we just say "yes" to 10 pages of legal gibberish which state in plain words the actual price of the service, usually on page 7 to make sure that nobody reads that far, and explain what data will be collected and consequently monetized to pay for the application.

It is a model without options (or rather it was, before the implementation of Europe's GDPR, the General Data Protection Regulation, as we will see later). You can say "no", in which case you are left with no application and no service, so in reality there is no choice, which is why nobody reads the terms of use! But it works. A whole ecosystem has developed around this principle, introduced as "applications" by Apple, and it is now here to stay. Most people have now forgotten that computers, as initially designed, were "universal" machines and could therefore be adapted to any use. We now have a system where applications restrict the possibilities in exchange for stability, ease of use and an invisible price.

Conversely, what the applications do for the providers has exploded, and one thing they do better than any other software is collect information. It is no accident that, at the same time, computers migrated from our desktops to our pockets, where they can follow and track our every move, action and thought and digitize everything for quantitative and qualitative analysis.

The initial and obvious use of the technology was marketing, targeting and advertising. These commercial activities have traditionally been more "art" than science, and the sudden explosion of "data" allowed for a completely new and more accurate approach to people, transforming them from clients and consumers into users.

But to go further in understanding the controversy, it is necessary to digress and take a quick look behind the curtain at what is really going on.

Traditional advertising is simple. Put your advertising somewhere: on a wall, on TV, on a train, and try to measure who is seeing it and what impact it has. It is an impossible task! We use panels, interviews and other techniques, but the result is "nobody knows!", as illustrated by the famous statement: "Half of advertising is useless but we cannot know which half!" On the Internet, it's different because everything can be measured, not just in a static way but in a dynamic way, and improvement therefore becomes possible.

Based on this principle, a whole ecosystem has developed to put the right advertising in front of the right eyeballs. It is a complex ecosystem with many actors and two giants at the top: Google and Facebook, which control the system.

Initially, Internet advertising developed around banners and pop-ups based on very simple metrics: location and type of articles. But the limits of the system were quickly reached. Then Internet companies had the great idea to "follow" people around with cookies, and suddenly the same irritating advertising appeared on every site you visited, with the obvious negative consequences on people's perception of being "followed". Then advertisers realized that to improve efficiency (which they decided could be measured in "clicks", as actual sales were difficult to link to on-line behavior), they not only needed to exchange data with other sites, but to actually buy "other" data to make sense of context. Third-party data was born.

To do this, complex systems were developed, where client data was first anonymized, then exchanged as data "clusters" so that valuable information could be traded without any breach of individual information. In this way, you could now "know" that within your list of 10,000 clients, 2,000 had great potential (cluster A), 5,000 not so much (cluster B) and 3,000 none (cluster C): a valuable and flexible insight made even more useful by its instant, automatic application and use.
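A toy sketch of that exchange (the names and clusters below are invented): only a hashed identifier and a cluster label change hands, never the underlying personal record.

import hashlib
from collections import Counter

# Invented client list: identifier -> cluster assigned by the data owner
clients = {"alice@example.com": "A", "bob@example.com": "B",
           "carol@example.com": "C", "dave@example.com": "A"}

# What is actually traded: hashed IDs and cluster labels only
traded = {hashlib.sha256(email.encode()).hexdigest(): cluster
          for email, cluster in clients.items()}

print(Counter(traded.values()))   # Counter({'A': 2, 'B': 1, 'C': 1}) -- aggregate insight, no identities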

The combination of all these technologies, on-line and off-line data, as well as actual feedback, has created a dynamic ecosystem where something, advertising, which was mostly "art", is suddenly becoming measurable science with an infinite scope for improvement. And this is of course where the nature of marketing changes and morphs into "manipulation". If you know exactly how a certain cluster of people will react to a certain type of advertising, you can tailor your approach to have exactly the desired effect on your target.

As long as this is used for commercial applications, it seems to be acceptable. After all, this has always been the stated goal of advertising. But why restrict such powerful tools to marketing? If you can manipulate people into buying a product, you can probably also manipulate them into buying ideas, and from products to politics the gap is small. This, too, is not a very new idea. Already in the early 20th century, sociologists like Edward Bernays realized that much could be achieved in this field provided the right techniques were implemented. Soon the tools were developed and political science became more "technical". But, just as with advertising, the measurement tools were crude and the "science" was likewise more art than science.

Until the advent of the Internet, where large-scale "improvements" became possible and money could suddenly buy elections (which had always been the case) but in a more complex and apparently "democratic" way, undermining the foundations of our social system. And this is where Facebook crossed the Rubicon!

It became apparent that the company was not only "monetizing" data but actually proactively using it to undermine the political and economic system for monetary and ideological gains… with no real limits in sight.

So now what?

It is in fact extremely difficult to answer this question at this stage.

Facebook has done a "mea culpa" and promised to stop using third-party data. Cosmetically they will, but in reality they cannot, because the whole "free" Internet system as we know it is based on this principle. Moreover, there is nothing wrong with third-party data! Third-party data as it is currently used does not breach any privacy law, including the GDPR, since it is anonymous and index-based. As long as no personal data is exchanged, which is the case for most marketing applications, there should be no problem. In many cases, the population census is the basis of third-party data, and no country is planning to cancel its census or give people an opt-out option! So just putting restrictions on some types of applications should do the trick and ensure that the current system can stay in place without inviting further controversies. In this respect, it would be wise to rename third-party data "context data", which would be a better description.

If only!

The problem with the Internet is that nothing is static. Technology is progressing at a breakneck pace and transforming the world in front of our eyes.

Two technologies in particular will completely change marketing on the Internet in the next few years. The first one is the IoT (Internet of Things) and the arrival of smart, always-online objects which exchange information and communicate with other objects, creating the potential for a huge and permanent breach of privacy. Think about it: how much privacy is left when every object knows where you are and broadcasts your presence in real time to anyone who cares to know? Beacon technology, for example, is the extreme case.

The second technology which will change everything is the arrival of AI (artificial intelligence) applications, which will tremendously improve the efficiency of targeting thanks to the ability to "learn" and refine targeting in real time, ensuring optimal efficiency and eventually creating clusters of "one" with optimal results.

Seen from this point of view, it is clear that the question raised by the Facebook controversy is an important one. Can we still protect individual privacy in the 21st century? And more fundamentally, what exactly is “privacy” in the digital age?

The answer to this question is far less obvious than it seems, and it will take many years of trial and error to find solutions which are both commercially workable and socially acceptable. Facebook, because it was so far in front, was the first company to be confronted with these questions.

The implementation of the GDPR (General Data Protection Regulation) by the European Union, which applies to all databases that store data on European citizens wherever they are located, will oblige us to confront these complex problems.

What exactly is “personal” data?

Who can use it and for what purpose?

Who can sell it and with what restrictions?

To illustrate how complex the question can be, let's take the answer of the NSA as an example: "We do not collect telephone data, only meta-data!"
What does this mean? The content of the conversation is not recorded (really?), only who you call, when, for how long, and so on. But although the example works for a telephone call, it does not for the Internet, where there is no difference between a call and an email, and where you can attach links, pictures and documents.

Eventually, as our lives become more and more on-line, data disappears and gives rise to a meta-data-only world. When a person with (a, b, c) characteristics does an X transaction with a person with (d, e, f) characteristics on the net, there will be no "data" left, just meta-data. The transaction is of course recorded in 10 different places for different purposes (identification, authentication, analysis, etc.). What is acceptable and what is not?

To answer this question, we need to take a look at the architecture of the Internet and of the centralized databases which have been built to take advantage of the opportunities offered by this extraordinary network. This will be the object of a follow-up post. But just as a hint of a possible answer, the implementation of structured distributed networks could offer a local solution to a global problem.


Saturday, March 2, 2019

Should we care about our Citizen Score?

 

How trustworthy are you? Is it OK to accept you as a "friend" or read your mail? Offer you some credit? Sell you a gun? A book?

There were some heated discussions on the Internet last year about a novel idea emanating from China, when two different concepts, the Sesame Credit score system developed by Alibaba and Tencent, the country's two main providers of social networks, and the new citizen score concept developed by the government, were presented together as an ominous new social control apparatus, which in the end proved to be ahead of reality.
Or was it? It may well be that the attractiveness of such a system is such that eventually it will be implemented one way or the other. Worse, it will be fun, efficient and effective at offering what people, companies and the government need. So much so that soon enough it will become ubiquitous in China first and the rest of the world soon after.


So without knowing all the details yet, let's have a look at what it could become, why it may be so powerful and fun and why in the end, in retrospect, it may seem quaint that it wasn't invented earlier.
The idea introduced by Alibaba and Tencent under the name of Sesame Credit score is similar to the better-known credit score used in the United States, with the score being calculated based on information about hobbies, lifestyle and expenses, while relying on information from your social network to validate, raise or lower your score based on the scores of other people. As with Amazon, eBay and other Internet sites in Western countries, there are in this system elements of trust and validation (public information) to smooth trade, although this goes one step beyond by merging it with the older idea of the credit score (private information) to create a number (or score) that is well defined and easy to grasp.
Conversely, although the idea introduced by the government remains less clear at the moment, it seems broader and more akin in the end to measuring political compliance than anything else. This "citizen score", tied to every person's national ID number, will be introduced on a voluntary basis at first before becoming compulsory in 2020.
By linking the two ideas, it doesn't take a rocket scientist to understand that you could easily build an extraordinarily powerful social control tool, and although it does not seem to be the case yet, this is how it was presented in news articles last year, to the outrage of the Internet community.
But to understand why the credibility, attractiveness and dangers of such a system are so high, let's forget the political implications for a moment and focus on the technical and social characteristics.
Over the last 10 years, with the introduction of the iPhone, the amount and quality of interactive data at the personal level has exploded beyond anyone's imagination, so much so that creating a personal profile, a complex, expensive and forever incomplete task if there ever was one (provided in the US by specialized companies such as Acxiom and Experian, among others), has become ubiquitous. If you have trillions of data points including location, transactions, relationships, finances, likes and propensities, as is the case for Tencent and Alibaba (or Google, Yahoo, Amazon and the like), you can in fact go extremely far, creating powerful predictive models with amazing capabilities.
But if data is the fuel and statistical models are the engine, we have learned that you also need to introduce some flexibility into the system in order not only to adapt but also to evolve and improve over time. This, in my opinion, is "the" most potent and frightening part of the tool: it doesn't need to be born perfect. That would be impossible; in fact, it might even be counter-productive. Understanding how to harness the power and intelligence of the people to improve your system is probably one of the greatest and most significant discoveries of the early 21st century. It may well be the force which will continue to power the transformation of the Internet as earlier innovations lose their mojo in the coming years. But more than anything, it is what will eventually make such a system so powerful.
Now that the product is technically ready, the next step is the market introduction. Here too, we have made great progress in understanding what works and what does not, and the Chinese schedule is a textbook case. Start by offering the product as a useful tool (or application) to innovators, who will embrace it heartily while debugging the system, then roll out the "mature" concept to the general population with a heavy dose of gamification to accelerate adoption and generate addiction. Tone down all aspects seen as "negative" and, conversely, offer incentives (points and rewards) to facilitate adoption.
The reality today is that if the Chinese authorities manage these steps carefully, and there is little doubt that they will, then there can be no obstacle whatsoever to the introduction of such a tool, however Orwellian it may appear. Worse, once launched, it seems difficult to avoid its wider spread around the world. What if you need a score to apply for a visa to visit China, for example?

What I find so frightening about this concept is not the possible negative applications but the fact that the data, the technology and our knowledge mean that it is an idea whose time has come. The defense of privacy, in this respect, looks more and more like a slender dam behind which a tumultuous tsunami of data and technology is rising at an exponential rate. When it breaks, the potential for social engineering will be far greater than in the 1930s. This, in my opinion, is one of the most potent, non-linear changes coming our way in the coming years. And that, I am afraid, may well be a Chinese innovation.

The future of data visualization

 
Data visualization today varies from basic infographics with little data and almost no functionality to impossible designs with convoluted graphics created by computer scientists to display the behavior of complex systems, MS-DOS-style. Either way, it takes longer to understand what's on your screen than the data is worth.

Recently, the explosive growth of data collection and storage has created the need for better management and understanding of data, and consequently for the development of improved data visualization tools. But to understand the challenge we are facing, it is necessary to go back in time and better understand the history of data visualization.

When we think of early data visualization, we are either confronted with old maps or with the famous chart of Napoleon's Great Army march of 1812, one of the earliest successful infographics ever created.
But the real origin of the field is much older. It dates back to the early 3rd millennium BC, when the Mesopotamians, looking for a more effective way to "sign" contracts, first made clay balls containing rods and spheres representing what was being traded, and soon replaced these representations with markings on clay tablets, inventing writing in the process.

From clay tablet to papyrus, to paper and finally computer screen, the technological progress has been relentless. And likewise, from writing to accounting to Excel spreadsheets, the conceptual innovations have been too numerous to count... until now.

But as we stand on the edge of a new revolution instigated by big data, the need for a new leap has become more obvious.

As the technology stands today, computers allow us to build complex multi-dimensional Excel tables linking hundreds of variables to each other or, similarly, maps with multiple variables represented as layers on top of each other. But in almost every case, the results are too complex to visualize or to make sense of. Simplification has become necessary, but how?

To get a better grasp of the challenge, a practical example is necessary, so let's look at aviation, where, in a little over 100 years, we have learned, mostly the hard way, how to visualize data effectively to keep our planes in the air.

Early on, there was no cockpit and therefore no data. The only feedback the pilots received was through their senses: listening to the engine, keeping their balance and direction and, if possible, altitude and speed. But as the machines became bigger and the engines more complex, dials and gauges made their appearance, multiplying endlessly until the early 1970s, when the cockpit of the new 747-100 ended up completely covered from floor to ceiling. As two layers of instruments were not technically possible, a new approach became necessary and monitor screens appeared. All the data was still accessible, but as long as no "urgent" action was needed, it was not displayed, although the pilot could "dive" as deep as necessary on demand. We now had a new hierarchy of data, with only essential and urgent data being displayed at any time. As the cockpit became "digital", all the earlier one-dimensional feedback disappeared, replaced by a few complex monitor screens combining data as needed for practical use.
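The cockpit logic is easy to caricature in a few lines: every reading stays available, but only the ones crossing an "urgent" threshold are pushed to the screen. The instruments and thresholds below are of course invented, purely to illustrate the display hierarchy.

# Invented readings and thresholds, just to illustrate the display hierarchy.
readings   = {"oil_temp": 118, "fuel_flow": 2300, "cabin_alt": 8000, "hyd_press": 2950}
thresholds = {"oil_temp": 110, "fuel_flow": 4000, "cabin_alt": 10000, "hyd_press": 3100}

alerts = {name: value for name, value in readings.items() if value > thresholds[name]}
print(alerts or "all systems nominal")   # only oil_temp is shown; everything else stays one "dive" away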

It is there, in the plane cockpit, I believe, that we will find the future of data visualization. The reason aviation progressed so far so fast is that we had direct and unforgiving feedback, in the sense that planes poorly designed to fly invariably crashed, and this severe natural selection quickly produced efficient and effective cockpits, so much so that today flying has become one of the safest ways to travel.

Compared to this, in almost every other field, data visualization is still relatively backward. This is sometimes because it is difficult to get feedback from a data spreadsheet, even when there is a real company behind the spreadsheet, but also because many fields, such as mobile phone interaction and intelligent home management, are still at a very early stage of their development.

If this is correct, the new technologies which are just emerging (3D, holographic and haptic) will only be the most recent embodiment of the Sumerian clay tablets: the medium we use to access invisible data. But the actual data visualization will depend on advanced software, which will build a hierarchy of data, deciding what must be displayed at any time, in conjunction with other data and background information, to ensure maximum attention and relevance of the interface.

The systems will benefit from data mining to look at historically relevant events, deep learning to improve the interface, and other add-ons which will completely transform the visualization process while preserving the possibility of direct, proactive access to data.

Let's take the example of a big company with a million on-line clients, with off-line data (addresses, names, age, marital status, etc.) and on-line data such as geo-location, propensity to buy, tastes and interactions with the brand.

Today, it is still extraordinarily difficult to make sense of all the data. There are simply too many data points, trends and events to make sense of in real time and have any predictable, instant effect. So the efforts are still haphazard, based on on-line behavior, geo-marketing, cluster analysis or A/B testing. The effects are real but remarkably similar to gold digging in the old West: wherever you look, there is a good chance that you will find something, but with so many possibilities that it is almost impossible to optimize the system.

With self-learning software looking automatically for events, trends, outliers, correlations and all other statistical characteristics of the data, it will become far easier to analyze huge multi-dimensional databases without ever visualizing the whole, which in any case will become more and more difficult to do. The only data that managers will request will be decisional data: "Tableau" on steroids, where the data is auto-organized for optimum ease of understanding and action, including the equivalent of what the financial industry has developed for its own lucrative use, commonly referred to as "algos", available for all types of non-financial applications!
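As a hint of what "the software looks for it so you don't have to" means in practice, here is the crudest possible version: a z-score scan that flags the one day a manager should look at. The sales figures are invented.

import numpy as np

daily_sales = np.array([102, 98, 105, 99, 101, 240, 97, 103])    # invented figures
z_scores = (daily_sales - daily_sales.mean()) / daily_sales.std()
print(np.where(np.abs(z_scores) > 2)[0])   # [5] -- the outlier surfaces itself; nobody scanned a spreadsheet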

Amazingly, this will probably take place over the next few years, giving a critical advantage to early adopters and generating a new arms race of software efficiency, greatly accelerating in the process the development of artificial intelligence.

It may still be early, but it is not absurd to envision an environment where the marketing department disappears as all types of analysis are generated on the fly and adapt automatically to the policy of the company, enabling a smooth integration of management and sales.

Big Data – What is it? (a new approach)

What is Big Data?
The first step to understanding Big Data is to agree upon what it is not: Big Data is not a lot of ordinary data. As terabytes turn into petabytes, they do not suddenly transform into Big Data. A corollary is that as computers become more powerful, they do not help us solve the problem of Big Data. Quite the opposite, in fact: they generate more diverse and complex data exponentially. Computer power and ubiquity are the cause of Big Data, not its solution.
So what is it, then?
The usual definition is the following: data that is too large, complex and dynamic for any conventional data system to capture, store, manage and analyze. It is also commonly defined by the three Vs: volume, variety and velocity. But experience shows that this definition falls short of really explaining what Big Data is, so here's a better one:
Big Data is an emergent phenomenon created by the convergence and combination of different types of data within a complex system which generates more data and meta-data than the input. It is therefore dependent on a system, real or virtual, which not only collects and processes data but also creates new data.
If the data sources are multiple and the system is very complex, Big Data can emerge from a relatively small amount of data. Conversely, a very large amount of unprocessed data will remain just “data” with none of the characteristics of Big Data.
Where do we find Big Data?
Big Data has always been around us, as the best Big Data processing machine we know is the human brain. It can accept a huge amount of information from inside and outside our body and make it understandable in a simple and accessible way to our consciousness. This is the ultimate Big Data system.
Of course, if we set the bar that high, everything else we create looks simple, and to some extent, it is.
In marketing, Big Data is a relatively new phenomenon, mostly related to the Internet and the very large amount of information that Google, Facebook, Twitter and the like generate, compounded by the iPhone.
But some of our clients, such as "smart home" system providers, will soon create an even larger amount of Big Data thanks to the Internet of Things (IoT). This data will need to be organized and conveyed, from the fridge reading the RFIDs of objects to inform the "home manager", which will send an order automatically over the Internet while asking us for confirmation or authorization. These systems will soon need to replicate most of the simple and complex functions we perform daily without giving them a second thought. Artificial intelligence will develop concomitantly.
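A toy sketch of that cascade, with every name below hypothetical: the fridge only reports what is missing, the "home manager" decides, and we are asked to confirm.

def fridge_scan(rfid_seen, expected_stock):
    # The fridge compares the RFID tags it reads with what should be there.
    return [item for item in expected_stock if item not in rfid_seen]

def home_manager(missing, confirmed_by_user):
    # The home manager places an order over the Internet once we confirm.
    if missing and confirmed_by_user:
        return {"order": missing, "status": "sent"}
    return {"order": [], "status": "nothing to do"}

missing = fridge_scan({"milk", "eggs"}, ["milk", "eggs", "butter"])
print(home_manager(missing, confirmed_by_user=True))   # {'order': ['butter'], 'status': 'sent'}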
But conversely, why is the emergence of Big Data so slow? Since we understand the concept, surely applications should follow in droves.
Visualizing and understanding Big Data
This is a difficult question to answer, but it seems that one of the main obstacles to a more widespread use of Big Data is the lack of visualization tools and therefore our inability to grasp complex answers to simple questions.
To take the example of marketing, we now have access to a huge amount of disparate data but mostly struggle to make sense of it beyond what is already proven and well known. The concepts of clusters, social groups and one-to-one marketing are progressing slowly, but mostly in a haphazard way based on trial and error. The main difference compared to 20 years ago is that the cycles have accelerated tremendously and we now learn faster than ever with instant testing and feedback.
But for most companies, the main tools to display and analyse data remain Excel or related systems such as Tableau and different types of dashboards.
Some companies use our GIS (Geographic Information Systems) to analyze client data, but very few go beyond that, simply because the tools do not exist yet. GIS systems are among the very few which allow a company to visualize different layers of data over a map, but not all client data can be geo-localized, and almost all non-geographic systems are far less advanced at displaying complex data.
We are currently working on this subject and I will come back to it with innovative solutions in future posts.
The problem of privacy and security
Eventually, as most of our lives become digital or, to put it another way, as our digital footprint becomes larger and larger, we will need to improve both security and privacy to ensure that we can trust these systems. We are currently doing far too little concerning privacy, and this should be a major concern, as many innovative applications will lag or may not be developed at all if we do not develop appropriate answers.
Likewise, although the issues concerning security are well known, we are currently far too "cavalier" in protecting Big Data, especially in the form of meta-data. Eventually, most of the data transiting over the Internet will go through applications and will therefore be "meta-data" with little or no data content. Already, a link to a YouTube video on Facebook generates nothing but meta-data and the video itself. The Internet of Things will take this concept to a new level, where a piece of information, the temperature outside for example, will generate a cascade of automatic and inferred consequences far beyond the initial data input.
Privacy will need to be defined far more precisely and the correct level of security must apply. Personal data, for example, needs to be protected by biometrics, but biometrics itself needs a higher level of security, since it is quite difficult to "recover" from a breach of biometric data. Less important data will need less drastic "passwords", down to our public "access" points, which must be easy to interface with.
In this respect, we must once again learn from natural systems, which are more than 2 billion years ahead as far as "technology" and usability are concerned. DNA, the body's way of storing information, shows us that it is not possible for a system to be both flexible (and therefore able to evolve) and closed. Nature could not get rid of viruses or hacking, and neither will we. But DNA shows us how to keep the damage local and confined most of the time. We will need to replicate this knowledge if we want to succeed. Large, centralized databases will be accessed and compromised; it is just a matter of time. We will therefore need to learn how to build distributed systems which communicate just the right amount of information to the right processing level.
This is in fact a key factor in the development of Big Data. Sending all the data over the Internet to be processed in data centers and sending "orders" back for execution is a recipe for failure, or more likely hacking. One of the challenges of Big Data is therefore to understand what data must be processed locally, what data must be sent, to what system, at what level, with what level of privacy and security, to be stored (or not) and where. This complexity alone could be another definition of Big Data.
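A minimal sketch of that local-versus-central split, with invented readings: the raw data stays on the device, and only a small summary travels.

# Invented sensor readings; the point is what leaves the device, not the values.
raw_readings = [21.4, 21.5, 21.6, 29.8, 21.5]        # e.g. room temperature, sampled locally

summary = {
    "avg": round(sum(raw_readings) / len(raw_readings), 1),   # a small aggregate
    "alert": max(raw_readings) > 28.0,                        # just the fact that something is off
}
print(summary)   # only this dictionary would be sent upstream; the raw stream never leaves home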
The future of Big Data
The iPhone is already unleashing a torrent of Big Data which now covers the globe, from the richest to the poorest countries. The Internet of Things, which is just being born as we speak, has the potential to lead us much farther down the road of Big Data. Within a few years, our environment will first become "aware" of our presence, then able to communicate proactively with us.
This can easily become a nightmare of unwelcome marketing intrusion and government surveillance if we do nothing about it. Conversely, we can significantly increase the complexity of our lives in the background while simplifying the interface with everything around us, and therefore improve our well-being. The most likely outcome is of course a mixture of both worlds. But let's hope that we learn quickly from our mistakes and find the right "mix", knowing that there is no correct answer or optimum balance between all the factors.
The Big Data technologies we are currently working on sound and look very much like science fiction, although they are only a few short years away from application. The road ahead is still uncharted, but a whole new continent of data is waiting for adventurous minds to explore and "map".

Segmentation in Japan – 10 years of Chomonicx, 2005-2015


In 2005, we introduced Chomonicx, a segmentation system built by Acxiom, to the Japanese market. Here's what we have learned about Japan over the last 10 years:
In 2005, we were told that Japan was "different", that it was too homogeneous, that we could not segment it and that nobody would buy our product. 10 years later, we are selling the third version of Chomonicx, so clearly we were right not to listen to these negative opinions.
In 2005, Japanese lifestyles were already diverse, if still much less so than in the US and Europe. Compared to the 65 clusters of the UK and the 45 of Holland, we created 32 clusters in Japan, focusing mostly on urban areas. In 10 years, Japan has caught up with most European countries, if not yet with the US and the UK. We now use 36 clusters in Japan and can segment further with other tools.
The most obvious differentiation has been between the big cities and the countryside. In 2006, for the first time, the Tokyo, Osaka and Nagoya areas represented half the population of the country. Since then, growth in these cities, as well as in a few secondary cities such as Fukuoka, Sendai and Sapporo, has accelerated, while the countryside is now contracting.
The evolution of Japan over the last 10 years has followed that of the UK and France, where growth concentrated in London and Paris to the detriment of the countryside. Japan is a complex country with many cities of different sizes, and it could therefore have followed the example of the US and Germany, where growth is more balanced among a large number of cities without one taking a clear lead. It did not. Looking at clusters, the contrast between Tokyo and Osaka is now extreme. The richest clusters, which did not exist 10 years ago, are now almost exclusively concentrated in Tokyo.
Conversely, the emptying of the countryside and of small cities is accelerating, with lower incomes and aging as the two main engines of this decline. This pattern is most visible in the hollowing out of small cities, where city centers have been decimated by suburban shopping malls. The contrast between cities where the center has survived and those where it has not is extreme and is clearly visible through its impact on income.
Likewise, in the countryside, the income decline has been relentless over 10 years, but it is mostly due to aging and the impact has therefore been less severe. Still, the average age in many areas is now over 65 and should therefore be of extreme concern to the government. The merger of villages has dampened the effect of the closure of facilities and ironed out the statistics, but eventually the "trick" will stop working, when there are only 2 or 3 hospitals left in Tottori or Akita and the average age of a whole prefecture approaches retirement age!
The income in the large suburbs of Japan has also declined although less severely than in the countryside. The contrast between the social groups is less severe than in the downtown urban areas but very visible nevertheless.
The Japanese suburbs have grown like onion rings around the old, dense city centers since the Second World War, following the road and rail networks, and were therefore, from the beginning, segmented by age, or year of development. This structure has now crystallized, with many areas where everybody belongs to the same age group. This is especially true for council flats and "my home" housing developments.
The consequence is that after 20 years, schools are empty and non-existent nursing homes are in high demand.
We have also seen the growth of the suburbs stop, with very few localised exceptions, and a trend of retired people going back to live in "mansions" (Japanese-style condominiums) closer to the city centers. This has been one of the trends behind the explosion of high-rise apartments in downtown areas of Tokyo and Osaka.
There will be a new census in 2015, and we should therefore see these patterns confirmed over time. When the results of this census are published by the end of 2017, we will have moved to our new, more precise household segmentation system. Japan will then have a very precise view of its population and the challenges it is facing, but it will still need to find appropriate answers.
