Do You Understand How Big ‘Big Data’ Really Is?
Big Data is far more than an ever-expanding mass of meaningless bits and bytes which provides a playground for geeky data scientists. The truth is that Big Data is the fuel that will power human development in the future.
Only five years ago the term ‘Big Data’ was confined to obscure tech blogs and IBM research papers. But today ‘Big Data’ is part of mainstream journalistic parlance, as a Google search for [“big data” site:nytimes.com] will quickly reveal.
So it seems that everyone understands what ‘Big Data’ means. Here’s how Wikipedia defines the term:
“Big Data is a broad term for data sets so large or complex that they are difficult to process using traditional data processing applications. Challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization, and information privacy.”
While this definition is technically correct, for me it fails to convey why Big Data is important.
In this article I’ll take a shot at explaining why Big Data is not just ‘important’ but is, quite literally, the fuel that will power the next phase of humanity’s development.
First Point: Big Data is not about data, per se, but about useful patterns within data
By now, we all know that Big Data means a lot of data. People first started talking about petabytes (that’s 10^15 bytes). Then the conversation moved on to exabytes (10^18 bytes), and the biggest figures I’ve seen now describe Big Data in terms of zettabytes (10^21 bytes). I haven’t seen yottabytes mentioned yet, but I imagine that someone somewhere has done a calculation involving them.
I think that this is the wrong way to look at Big Data; data alone is not the point. Let’s consider four items of data:
- 2.1% of males living in Maryland have the first name Fred
- Fred has an iPhone
- 82% of Netflix subscribers who have an iPhone have downloaded the Netflix app
- The penetration of Netflix in Maryland to Comcast subscribers is 21%
As it stands we can’t obviously conclude anything useful from these four data points. But what say we add a fifth data point:
- Fred is a Netflix subscriber
As soon as we know this, we can say that there is an 82% chance that Fred has downloaded the Netflix app. Although the amount of data has only increased slightly, the value of the data has jumped from zero to something genuinely useful.
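The logic of the Fred example can be sketched in a few lines of Python. This is a toy illustration, not real code: the statistics, field names and the `p_has_app` helper are all made up for this example, and the point is simply that the isolated statistics yield nothing until the linking fact arrives.

```python
# Hypothetical illustration of the Fred example: four isolated statistics
# plus one linking fact. All names and numbers are invented for this sketch.
population_stats = {
    "p_name_fred_md_males": 0.021,           # 2.1% of Maryland males are named Fred
    "p_app_given_iphone_subscriber": 0.82,   # 82% of iPhone-owning subscribers have the app
    "netflix_comcast_md_penetration": 0.21,  # Netflix penetration among MD Comcast subscribers
}

fred = {"has_iphone": True, "netflix_subscriber": False}

def p_has_app(person, stats):
    """Probability the person has the Netflix app, given what we know so far."""
    if person["has_iphone"] and person["netflix_subscriber"]:
        # Both conditions of the 82% statistic apply, so it now tells us something.
        return stats["p_app_given_iphone_subscriber"]
    return None  # the isolated statistics support no useful inference

print(p_has_app(fred, population_stats))  # None: no pattern connects the data yet

fred["netflix_subscriber"] = True         # the fifth data point arrives
print(p_has_app(fred, population_stats))  # 0.82
```

Nothing about the individual data points changed; only the relationship between them did, and that relationship is where the value appeared.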
The point about this is that big data is not about data per se but about relationships between different pieces of data or, in other words, patterns in the data which convey useful insights and ultimately allow us to make improvements – which is the source of economic growth.
Second Point: Big Data is ultimately a numeric description of everything
It is useful to think about Big Data as a way of describing the world we live in. If this description were sufficiently comprehensive and detailed then, in principle, the answer to any question and the solution to any problem would be locked away somewhere in the data. So how might we describe the world in terms of data?
We could imagine that there are two distinct classes of data:
- Data about things: This class of data would relate to specific people, objects or online services.
- Data about connections between things: For example, data that relates to the how a given individual uses their music streaming service.
If we first look at the number of ‘things’ that can be described with data, then we might imagine three categories of ‘thing’ as shown below:
This suggests that by 2018 there will be a total of about 10 billion 'things'.
But we cannot fully describe the world in data by restricting ourselves to the 10 billion 'things' that are shown in the above diagram. Facebook realised early on that the commercial value of an engaged community of 1 billion users would depend heavily on knowledge of the relationships between users - rather than simply data about users.
And so a ‘complete’ definition of the world in data would have to include relationships between the three types of ‘thing’ shown in the above diagram. If we add this second category of data then we’d get something like this:
Using this approach, each ‘thing’ and thing-to-thing connection could be described by a unique data file. The amount of data needed to properly describe ‘things’ and ‘connections’ is highly variable. For example it is easy to describe the manufacturer of a user’s smartphone in a few bytes but describing that same user’s search history using a given search engine would require a more complex data file of maybe 10MB.
Hence, we are using the term ‘data segment’ for these unique data files. The idea is that these data segments are logically distinct, rather than being distinct in terms of file size, data format etc.
If we add up all of the individual ‘data segments’ then the total comes to 18 trillion.
To be clear, here are 6 examples of these 18 trillion data segments:
- The version of software on a given smartphone
- The blood glucose level of a given person on a particular day
- The number of people who viewed Breaking Bad on Netflix in the last week
- The number of apps that users of a given service have installed on their devices today
- The rate at which API requests for a given parameter are coming from a partner
- The age distribution of registered users of an online banking service
Remember - there are 18 trillion items on this list.
So how much value might be locked up in all this data? Here are a few things that might be deducible from analysis of this data:
- 72% of single males living in apartments in city locations do not know what they are going to eat for dinner at 4pm.
- 1,560 baseball fans are planning to drive their automobiles to a game next Sunday and will all converge at a given road junction between 9am and 1pm - where repair work is scheduled to take place.
- During the delivery of a presidential address, real-time feedback based on analysis of the biometrics on an opt-in sample of voters makes the President aware of a growing state of anxiety and anger in 76% of swing voters who have not yet decided how to vote.
- An unusually high defect level being produced by an injection moulding machine in a factory is because the person who installed the machine chose a location with a specific set of vibration characteristics. This problem has been found in 62 other instances worldwide across a range of industries.
- 32% of those who purchased a particular brand of tinned soup from 16 Sainsbury’s stores in the South West last week have developed an unusually high level of potassium in their blood.
The immediate reaction of most people who read the above list is to start thinking in terms of problems, solutions and opportunities. But these thoughts would never have occurred without being able to identify patterns in the data.
The essential point is that all of this data is theoretically available NOW. The problem is not whether the data exists (it clearly does, or it could) – the point is getting access to the data and then having the ability to identify value-added patterns like these.
The ability to identify patterns like the above would have the potential to release value into the economy: knowledge about some of these points, and others besides might result in the formation of new companies or even whole industries.
We could even view the above list as not simply ‘patterns’ but as ‘market opportunities’.
Third Point: Big Data will eventually unlock unlimited market opportunities
It is interesting to ask how many ‘market opportunities’ there might be - if it were truly possible to access and analyse all 18 trillion data segments?
- 1,000 market opportunities?
- 10,000 market opportunities?
- 1 million market opportunities?
- 10 million market opportunities?
- 100 million market opportunities?
Let’s say that we randomly sampled the 18 trillion data segments using a sample size of 10 million data segments per sample (a decent-size market). This would mean that each random sample would include just 0.00006% of the available 18 trillion data segments.
If we treat every distinct sample of 10 million segments as a distinct potential ‘market opportunity’, then the number of such samples is the number of ways of choosing 10 million segments from 18 trillion - and the mathematically exact answer is 5.8 x 10^66,895,664. Here is a table that summarises the results obtained with different sample sizes:
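That figure can be checked without ever constructing the gigantic number itself. The sketch below - assuming, as above, that a ‘market opportunity’ is any distinct 10-million-segment sample - computes the base-10 logarithm of the binomial coefficient C(18 trillion, 10 million) via the log-gamma function:

```python
import math

def log10_binomial(n: int, k: int) -> float:
    """log10 of the binomial coefficient C(n, k), computed via log-gamma
    so the astronomically large integer is never actually built."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)) / math.log(10)

n = 18 * 10**12   # 18 trillion data segments
k = 10 * 10**6    # sample size: 10 million segments per 'market'

print(f"each sample covers {k / n:.7%} of all segments")   # ~0.0000556%

exponent = log10_binomial(n, k)
# C(n, k) has roughly 66.9 million digits when written out in full
print(f"C(n, k) ~ 10^{int(exponent):,}")
```

The exponent that comes out is about 66.9 million, matching the 10^66,895,664 order of magnitude quoted above.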
The theoretical maximum number of market opportunities that would be unlocked by full access to Big Data is exceptionally large.
Indeed, the number is so large that it is, for all practical purposes, infinite.
To drive the point home: the whole observable universe contains around 1.7 x 10^80 atoms - a number that, written out in full, runs to just 81 digits.
If we were to write down the number of possible market opportunities unlocked by the full exploitation of Big Data at the same font size, it would fill roughly 67,000 pages - 134 reams of paper, a stack standing about 6 m / 20 ft high.
So, returning to the original question: the number of ‘market opportunities’ - and hence the scope for value creation based on analysis of all this data - is effectively unlimited.
Fourth point: Economic development comes from exploiting patterns in data
The rate of human development ultimately comes down to the rate at which we can acquire knowledge. Knowing how to fix a problem, how to make something better, how to do something faster or produce something at a lower cost - this is the fuel that powers the entire economy.
But knowledge comes mainly from recognising useful patterns in data or from combining data in ways that provide valuable insights.
The more data we have available, and the smarter and more efficiently we can analyse it, the faster we progress. Hence, mass-scale exposure of Big Data and the availability of low-cost tools that developers can use to analyse that data will show an increasingly strong correlation with the rate of economic development in the future.
Punchline: The commercial value of Big Data is effectively unlimited
With an effectively infinite number of patterns, insights, problems, solutions and market opportunities locked away within Big Data, the commercial ‘value at stake’ is effectively unlimited. The rate at which value is released will depend on the volume and richness of data exposed and the availability of low-cost AI tools that will allow the data to be analysed.