It's All About the Data

I was recently asked to speak at the 2012 Wolfram Data Summit in Washington, DC. This conference is billed as “the foremost annual gathering of data-oriented entrepreneurs, policy makers, and scientists in virtually every field of human endeavor.”

 

The first topic I discussed in my talk was the prolific growth of data and the strain it is placing on IT storage administrators. Taking an excerpt from my book Evolution of the Storage Brain, I pointed out that the online storage capacity in a typical data center grew 35% per year over the 30-year period of 1981-2010. I postulated that if this historical growth rate holds for the next 30 years (2011-2040), the average data center would be managing 1 Exabyte, or 1 million Terabytes, of online storage by 2040. I also mentioned that since I wrote the book in 2010, storage capacities have been growing not at 35% but at 50% per year, a rate that translates to 18 Exabytes of storage capacity in the average data center by 2040. We are dealing with some very large numbers indeed!
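For readers who want to check how quickly these rates compound, here is a minimal Python sketch. It assumes growth compounds once per year at a constant rate, and the function names are my own illustration. It does not try to reproduce the Exabyte figures above, because the starting capacity is not stated here; it only shows the growth multiplier and the doubling time implied by each rate.

```python
import math


def growth_multiplier(annual_rate: float, years: int) -> float:
    """Factor by which capacity grows after `years` of compound annual growth."""
    return (1.0 + annual_rate) ** years


def years_to_double(annual_rate: float) -> float:
    """Years needed for capacity to double at a constant annual growth rate."""
    return math.log(2.0) / math.log(1.0 + annual_rate)


# Compare the two growth rates mentioned in the talk over a 30-year span.
for rate in (0.35, 0.50):
    multiplier = growth_multiplier(rate, 30)
    doubling = years_to_double(rate)
    print(f"{rate:.0%} per year: x{multiplier:,.0f} over 30 years, "
          f"doubling every {doubling:.1f} years")
```

At 35% per year, capacity multiplies roughly 8,000-fold over 30 years; at 50% per year, the multiplier is closer to 190,000-fold. That gap is why a seemingly modest change in the annual rate matters so much over decades.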

 

The second part of my talk focused on what is happening as a result of all this data being collected and stored. Believe it or not, storage vendors haven’t spent a great deal of time thinking about data over the past three decades. We’ve thought a lot about how to store, manage, and protect ever-increasing amounts of data, but the data itself? That has been the domain of applications, software developers, and the recipients of the data, not the storage vendors.

 

As I spoke to the audience at the Wolfram Data Summit, I told them that this is about to change. Storage vendors that are not thinking about the actual data being stored, and doing things to facilitate the quick conversion of raw data into actionable information, will soon be left behind. We have become a data-driven society, and access to data has become our oxygen. Unfortunately, simply collecting a mountain of data does not provide any value. Value is extracted only as data is harvested, so knowing how to harvest the data will become paramount.

 

As we collect monumental amounts of data, how do we place a value on this data? How can we decide which data is valuable today, which data will be valuable next week, and which data will be valuable sometime in the future (or not valuable at all)? Knowing the answers to these questions can help determine where to invest, or not invest, based on the value being extracted from the data. To my knowledge, no software company, let alone a storage vendor, has ever attempted to develop tools for determining the relative value of data. NetApp, however, is providing thought leadership in this area, in conjunction with research being conducted at the University of California – San Diego (UCSD).

 

As the primary sponsor of the “Enterprise Data Growth Index” project at UCSD, NetApp is working with researchers to understand the elements that determine the usefulness of data. The first result of this research is a data taxonomy, shown below. In this taxonomy, data exists in one of three states: creation, consumption, and persistence. The amount of time spent in each state, and the logical path traveled from state to state, can provide valuable clues about how the data is being used and the value this data is bringing to the organization. Eventually, tools could be developed that map data flow from point to point and provide information to pinpoint the data that is making the greatest impact.
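To make the idea of "time spent in each state" and "path traveled" a little more concrete, here is a small Python sketch. It is purely my own illustration and not the UCSD taxonomy or any NetApp tool: the three state names come from the taxonomy described above, but the `DataState` enum, the `summarize` function, the event format, and the sample timestamps are all hypothetical.

```python
from collections import defaultdict
from enum import Enum


class DataState(Enum):
    """The three states named in the taxonomy."""
    CREATION = "creation"
    CONSUMPTION = "consumption"
    PERSISTENCE = "persistence"


def summarize(events, end_time):
    """Summarize one data object's history.

    `events` is a time-ordered list of (timestamp_hours, DataState) pairs,
    each marking the moment the object entered a state (hypothetical format).
    Returns (hours spent in each state, ordered path of state names).
    """
    time_in_state = defaultdict(float)
    path = []
    for i, (start, state) in enumerate(events):
        # A state lasts until the next event, or until end_time for the last one.
        stop = events[i + 1][0] if i + 1 < len(events) else end_time
        time_in_state[state] += stop - start
        path.append(state.value)
    return dict(time_in_state), path


# Made-up history: created, read shortly after, stored, then read again later.
events = [
    (0.0, DataState.CREATION),
    (2.0, DataState.CONSUMPTION),
    (5.0, DataState.PERSISTENCE),
    (100.0, DataState.CONSUMPTION),
]
durations, path = summarize(events, end_time=104.0)

for state, hours in durations.items():
    print(f"{state.value}: {hours} hours")
print(" -> ".join(path))
```

In this made-up history the object spends most of its life in persistence but returns to consumption later on. A pattern like that is the kind of clue such a tool might surface when judging whether stored data is still delivering value.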

 

I’ve provided some additional links below for anyone who is interested in this topic, along with a link to my blog, where I’ll be going into much more detail on this and other forward-looking data topics. The NetApp/UCSD research is still in its infancy, but it demonstrates the realization that changes are occurring in the data storage industry. In this blogger’s opinion, storage vendors that place emphasis on data harvesting tools are the ones that will persist and thrive as we travel towards the year 2040.

 

UCSD Data Taxonomy

UCSD Research site

http://clds.ucsd.edu/

 

Wolfram Data Summit 2012

http://www.wolframdatasummit.org/2012/

 

NetApp Agile Data Infrastructure

http://www.netapp.com/us/technology/agile-data-infrastructure.html?REF_SOURCE=bnrbonelvhp

 

Larry Freeman’s Blog: “Ask Dr Dedupe”

https://communities.netapp.com/community/netapp-blogs/drdedupe