Big Data is the buzzword of the moment. So much is said about it that it seems to have unlimited scale.
Over the last decade we have grown accustomed to computer processing power doubling every two years (according to Moore's law).
When I got my first digital camera (2 megapixels), I remember reading that it would take about 25 megapixels to match 35mm photographic film (taking only resolution into account). By today's standards that doesn't seem like much, right? At the time it seemed a stretch.
The number of pixels it would take to simulate the human field of view is about 500 megapixels (7 MP for the central core of vision, plus the eyes moving through the scene and our brain assembling the 'big picture'). Once again it seems a stretch, but somewhat achievable.
The term Big Data is said to date back to the early 1990s, but the hype is at its peak these days. In the broad sense, Big Data refers to data sets that are hard to process with common database applications, either because (1) the volume of data is several orders of magnitude higher, (2) the content format can be unstructured, (3) the analysis has to happen in near real time rather than in batch, (4) the data comes from new types of sources: the web, social media, biometric sensors, and so on, or (5) it is processed on farms of commodity hardware instead of state-of-the-art machines.
Business intelligence uses a DBMS (Database Management System) at its core, whereas Big Data tends to use distributed file systems and MapReduce. The quantity of information is so high that no single hard drive has the capacity to store a Big Data file, so it has to be split among several. The MapReduce programming model then allows such a huge data set to be processed in parallel.
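To make the model concrete, here is a minimal, single-machine sketch of the MapReduce idea in Python, using a toy word count (the function names and sample data are illustrative only; real frameworks such as Hadoop split the input across many machines and run the map, shuffle, and reduce phases in parallel):

```python
from collections import defaultdict

def map_phase(split):
    # Map: emit a (key, value) pair for every word in one input split.
    for word in split.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce: combine every value emitted for the same key.
    return key, sum(values)

def map_reduce(splits):
    # Shuffle: group the intermediate pairs by key before reducing.
    grouped = defaultdict(list)
    for split in splits:
        for key, value in map_phase(split):
            grouped[key].append(value)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Each string stands in for one chunk of a file too large for a single disk.
splits = ["big data is big", "data about data"]
print(map_reduce(splits))  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```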
But grid computing is not new. Some may remember the 'SETI@home' project back in 1999. Its purpose was to search for extraterrestrial intelligence by analyzing radio telescope signals. The data was split and distributed over the internet to all participants (5 million of them, myself included!) and processed by a distributed screensaver program, making use of otherwise idle processor time.
So far, Big Data seems to have unlimited scaling potential. We have to go back in time to around 800 A.D. in order to begin to understand its real limitations.
It is said that the game of chess was invented by an ancient Indian Brahmin mathematician. When the game was presented to his ruler, the ruler was so pleased that he allowed the inventor to name a prize. The clever mathematician asked for a 'simple' thing: place 1 grain of rice on the first square of the chess board, double that amount on the next square, and keep doubling on every square after that; the amount of rice on the last square would be his prize. The ruler instantly accepted the mathematician's terms and asked the treasurer to calculate and hand over the deserved prize. (Spoiler: the mathematician was later beheaded, because the actual amount of rice required was staggering: 2^63 grains on the last square alone, and nearly 2^64 across the whole board.)
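A few lines of Python make the scale of that prize obvious (a back-of-the-envelope check; the 20 mg per grain of rice is my own rough assumption):

```python
# Square n (1-based) holds 2**(n - 1) grains, so the 64th square alone holds 2**63.
last_square = 2 ** 63
total_grains = 2 ** 64 - 1                 # sum over all 64 squares

# Rough mass, assuming ~20 mg per grain (an assumption for illustration only).
total_tonnes = total_grains * 20e-6 / 1000
print(f"last square : {last_square:.2e} grains")
print(f"whole board : {total_grains:.2e} grains, about {total_tonnes:.2e} tonnes")
# Roughly 3.7e11 tonnes -- several hundred times a modern annual world rice harvest.
```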
The same exponential blow-up illustrated by the story above is present in many areas of data manipulation even today.
When I played my first computer chess game back in the '80s, everyone anticipated that computers would soon be unbeatable. That eventually happened in 1997, when Deep Blue defeated world champion Garry Kasparov; however, Deep Blue was not unbeatable (it won the six-game match 3½–2½).
The main reason for that is the amount of information required to anticipate every possible outcome of a chess game. Can Big Data tackle that amount?
The data required to store all endgame positions with 6 or fewer pieces takes up to about 1 TB uncompressed. As the number of pieces increases, the amount of data required grows exponentially.
There are on the order of 10^40 possible chess board configurations and 10^120 different move sequences (or games), figures estimated by Claude Shannon in 1950.
Imagine what it would take to store 10^40 board positions (that is a 1 followed by 40 zeros!) and 10^120 different move sequences.
Some hints:
- 1 terabyte is 10^12 bytes
- 1.33 × 10^50 is the estimated number of atoms on planet Earth
Even if each board position and each move sequence took up only 1 byte, there would not be enough atoms on Earth to store the information!
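As a quick sanity check of those magnitudes, here is a short Python snippet using the figures quoted above (the one-byte-per-item encoding is, of course, absurdly optimistic):

```python
positions = 10 ** 40             # estimated chess board configurations
games = 10 ** 120                # estimated distinct move sequences (Shannon, 1950)
atoms_on_earth = 1.33e50         # estimated atoms on planet Earth
terabyte = 10 ** 12              # bytes

# Even at 1 byte per item, the positions alone need ~1e28 TB of storage...
print(f"positions at 1 byte each: {positions / terabyte:.1e} TB")
# ...and the move sequences outnumber Earth's atoms by ~70 orders of magnitude.
print(f"games per atom on Earth : {games / atoms_on_earth:.1e}")
```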
The point is, Big Data is without a doubt a huge achievement of the computer age. The Hadoop ecosystem has countless applications that were unimaginable 10 years ago. However, some areas present a challenge even bigger than Big Data.