In 2015, The Economist wrote an article about big data. Now, if any magazine should be numerate, I would think it would be The Economist (a magazine I subscribe to and normally admire greatly). However, in this case, it seems they failed to do the math and were taken in by the big data hype machine.
Let me explain why.
The article says that Walmart does 1 million customer transactions an hour and that all of this transaction data is kept in a massive data store that is around 2.5 petabytes in size.
A petabyte is 1,000 terabytes or 1 million gigabytes.
Now, a million transactions an hour sounds like a lot. The article is fawningly awestruck by the sheer size of the data that is produced and stored.
But what is that data? Does the number of transactions really add up to that much data?
There are 8,760 hours in a year. At 1 million transactions an hour, that is 8.76 billion transactions a year.
So, let’s go with 10 billion (even though that means a million transactions per hour for 24 hours a day, 365 days of the year, and then a major round up).
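The arithmetic so far is simple enough to check in a few lines of Python:

```python
transactions_per_hour = 1_000_000
hours_per_year = 24 * 365                  # 8,760 hours in a year
transactions_per_year = transactions_per_hour * hours_per_year
print(transactions_per_year)               # 8,760,000,000 -- round up to 10 billion
```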
Now, what kind of data is in a transaction?
In the following, I’m going to assign a number of bytes for certain kinds of information.
For example, a date/time is typically stored in 8 bytes. Numbers like prices can be stored in 4 bytes. A credit card number is usually 16 digits, but some can be up to 19. So, at one byte per digit, we will go with 19 bytes for a credit card number.
In a transaction, someone goes into a Walmart store and buys some stuff. Let’s assume that the average number of items is five, which is high in my opinion.
In a transaction, there is header information that applies to the whole transaction and then there are items (we are assuming five).
Sometimes, this is a cash sale and not a lot is known about the customer. However, let us assume we have a date/time (8 bytes), a credit card (19 bytes), an affinity card (19 bytes), and the store ID — say 8 bytes.
Information on the particular till taking the transaction, say 8 bytes. That is 62 bytes of header. Each item has a product code (12 bytes), a price (4 bytes), and a quantity (4 bytes).
That gives us 20 bytes per item, or 100 bytes for five items. So the whole transaction comes to around 162 bytes.
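The byte counting can be tallied the same way (the field sizes are my own assumptions from the estimates above, not Walmart's actual schema):

```python
# Header fields: date/time, credit card, affinity card, store ID, till ID
header_bytes = 8 + 19 + 19 + 8 + 8         # 62 bytes
# Per-item fields: product code, price, quantity
item_bytes = 12 + 4 + 4                    # 20 bytes
items_per_transaction = 5                  # assumed average basket size
transaction_bytes = header_bytes + items_per_transaction * item_bytes
print(transaction_bytes)                   # 162 bytes per transaction
```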
We may have missed some things here: there could be more than one product code on an item (external and internal) or there could be some discount codes for special offers.
So to be safe, let’s blow up the size of the transaction. Instead of 162 bytes, we will say 1,000 bytes, which is ridiculously large for a transaction and should more than cover any extra information that we haven’t counted.
At our greatly inflated 10 billion transactions a year and our gross overestimation of 1,000 bytes per transaction, we have 10 trillion bytes per year, which is 10 terabytes.
Let’s assume they are keeping 10 years of data. That would be 100 terabytes of data.
According to the article, Walmart was keeping 2.5 petabytes, which is 2,500 terabytes. That is 25 times more data than they could possibly collect over 10 years, even allowing for ridiculously large estimates of data sizes.
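Putting the inflated numbers together makes the gap plain (a sketch using the deliberately generous estimates above, not a measurement):

```python
bytes_per_transaction = 1_000              # our deliberately inflated figure
transactions_per_year = 10_000_000_000     # rounded up from 8.76 billion
years = 10

real_data_bytes = bytes_per_transaction * transactions_per_year * years
stored_bytes = 2.5 * 10**15                # 2.5 petabytes, per the article

print(real_data_bytes / 10**12)            # 100.0 terabytes of real data
print(stored_bytes / real_data_bytes)      # 25.0 times more storage than data
```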
What’s going on?
So what is all that data? If that number is at all real, then the 2.5 petabytes would be mostly overhead.
This would mean that all this data is about 96 percent overhead and 4 percent real data. In actuality, the number is probably close to 99 percent overhead and 1 percent real data.
If you do the math on almost all big data numbers, you will find this giant discrepancy between the actual amount of data and what is being stored.
A lot of this is because the data is being stored with relational databases, which are incredibly inefficient when it comes to data storage.
The article says that Walmart gathers 2.5 petabytes an hour. But at 1 million transactions per hour, it would take 250 years to come close to getting 2.5 petabytes.
So, getting that much in an hour is, well, unbelievable.
Consider that the entire YouTube upload is about 1 petabyte a day. This article is saying that Walmart adds about 60 times as much data a day as the entire YouTube network.
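Checking the "2.5 petabytes an hour" claim the same way (the 1 petabyte per day YouTube figure is the rough estimate quoted above):

```python
petabyte = 10**15
claimed_total = 2.5 * petabyte
collected_per_year = 10 * 10**12           # our generous 10 TB/year estimate
print(claimed_total / collected_per_year)  # 250.0 years to collect 2.5 PB

hourly_claim_per_day = 2.5 * 24            # PB/day, if the hourly claim were true
youtube_per_day = 1.0                      # ~1 PB/day uploaded to all of YouTube
print(hourly_claim_per_day / youtube_per_day)  # 60.0 times YouTube's daily uploads
```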
Does anyone edit this stuff? Does anyone ever do the math?
There is, unfortunately, a deep level of innumeracy in our technical press, in corporate management, and in the investment community. So, numbers like this just go unchecked.
The next time you read about big data or see a startup trying to ride the big data wave, think about whether that is really big data, big overhead, or just bad math.