The Duality of Storage

By Peter Corbett, Vice President & Chief Architect of NetApp


I’ve spent most of my career working on storage, file system and database technology of one form or another.  This has given me some insight that others might find valuable.  I have a good vantage point on storage technology, and how it is evolving to meet the changing needs of its consumers, many of which are large organizations and enterprises running or using a variety of applications.  In any industry, it is important to understand trends, and the impacts of those trends, and this is no less true in the storage industry than in any other technology area.   I’d like to share some of my thoughts in a series of blog postings.


It is an interesting time in the storage industry.  There are several trends at work that individually would each have a large impact on how storage is designed and deployed, but that combined are leading to even larger shifts.  In this first posting, I will discuss the fundamental nature of storage systems, and how the underlying requirements of storage are in many ways unique compared to other technology areas.

Storage has some unique properties that make it different from computing and networking.  While some stored data is ephemeral, and can thus be created and destroyed quickly without much need to consider it outside of the application context in which it is needed, most data has tangible and potential value long after it is first created.  Much data can sit idle for long periods of time, even decades, without ever being accessed by an end user or application, yet it cannot be deleted.  Data is difficult to move, and it must be cataloged, managed, owned and kept secure.


The title of this posting is the Duality of Storage, and that is what I want to talk about now.  There are two timescales that are important with stored data.  The first is how quickly a user might need to access any byte of it.  The second is how long a user will need to keep it stored.  These two timescales differ from each other by many orders of magnitude.  Each of them also spans many orders of magnitude on its own, depending on the application and on the age of the data being accessed.

The first timescale is the operational requirement for use of the data.  We can think of this as the primary performance dimension.  It includes read and write latency and throughput or bandwidth.  All else being equal, faster is better.  But all else is not equal.  Faster is more expensive.  Faster requires different technology.  Faster requires locality to reduce latency.  Slower allows both cheaper technologies and greater distance.  Distance allows consolidation.  Consolidation allows the eventual crushing of data to its most efficient state of density and highest degree of sharing.


There are other factors at work here as well.  The durability of data – the degree of certainty that any particular byte of data can eventually be retrieved – also affects the cost of storing it.


Thus we find a duality in storage systems.  They function as an extension to the networking and compute infrastructure to store and retrieve data quickly (low latency) and at a high rate (high bandwidth).  Just how quickly and at how high a rate determines much about the design and ultimately the cost of the storage system.  But most of the data that is collected or generated by applications will have a much longer life than its brief bursts of operational glory.  It will be stored perhaps for years or decades, and it may be replicated, retrieved, scanned, moved, cataloged, migrated, and analyzed many times during its lifetime.

This long timescale dimension of data retention is a very interesting aspect of storage systems.  Much of the value of a storage system is in the handling of point-in-time replicas of data: backups, snapshots or versions.  The gap between the length of time old data must be stored and the speed with which it must be made available on demand creates a set of constraints on the operation of the storage system.  The optimization of the system to store retained data as efficiently as possible while keeping it acceptably accessible at all times is one of the great challenges of storage system design.

Put simply, time (in the form of increased acceptable latency) buys efficiency.  On the short timescales of operational latency, whether average access latency, maximum latency, first access latency, write latency, or post-write de-stage time, each additional time increment, whether microseconds, milliseconds, seconds, or hours, buys additional opportunity to store the data more efficiently and at lower cost.  Each increment of storage lifetime, whether seconds, hours, months or decades, motivates pushing the data to lower cost, more efficiently utilized storage.  Compute and networking have a dynamic similar to the operational latency dimension, but the second dimension, retention, is unique to storage.
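To make the idea that time buys efficiency a little more concrete, here is a minimal sketch in Python.  It is only an illustration under stated assumptions: the tier names, latencies and relative costs are placeholders I made up for the example, not measurements or prices of any real product, and the function name cheapest_tier is hypothetical.  The sketch picks the cheapest tier whose access latency still fits within the latency a consumer is willing to accept.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str
    access_latency_s: float   # assumed typical time to reach the data, in seconds
    relative_cost: float      # assumed cost per GB in arbitrary units, not real prices


# Illustrative placeholder values only. "flash" and "hdd" echo the media
# discussed in this post; "offline_archive" is a hypothetical third tier.
TIERS = (
    Tier("flash", 1e-4, 10.0),
    Tier("hdd", 1e-2, 2.0),
    Tier("offline_archive", 3600.0, 0.5),
)


def cheapest_tier(latency_budget_s: float,
                  tiers: Sequence[Tier] = TIERS) -> Optional[Tier]:
    """Return the lowest-cost tier that meets the latency budget, or None."""
    eligible = [t for t in tiers if t.access_latency_s <= latency_budget_s]
    if not eligible:
        return None
    return min(eligible, key=lambda t: t.relative_cost)


if __name__ == "__main__":
    # A larger acceptable latency never selects a more expensive tier.
    for budget in (1e-3, 1.0, 86400.0):
        tier = cheapest_tier(budget)
        print(f"budget {budget:g} s -> {tier.name if tier else 'no tier fits'}")
```

With these made-up numbers, relaxing the latency budget from a millisecond to a second to a day moves the choice from flash to HDD to the archive tier, and the cost per GB falls at each step.  A real system would also have to weigh bandwidth, durability and retention period, which this sketch leaves out.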


The application of solid state flash storage technology to large-scale storage systems has sharply intensified both the exposure and the exploitation of this duality.  Flash provides a fundamentally different technology point in the multidimensional latency/capacity/bandwidth/cost/durability space, and thus motivates a rethinking of how storage systems are designed to leverage flash efficiently for different workloads.  At the same time, hard disk drive technology is continuing to evolve, and it is changing the way systems are designed to efficiently handle the long tails of retention of different point-in-time images of data.  Flash fundamentally changes the opportunities to achieve high performance for the primary operations tier.  Ever more capacious HDDs change the dynamics of efficiently and durably storing large numbers of versions and snapshots of large data sets for long periods of time.


Optimizing both the operations layer, which will mostly be contained in solid state storage, and the retention layer, which will mostly be contained in low $/GB HDDs, is the task we undertake as builders of large-scale storage systems.  The available hardware technologies allow us, and in fact compel us, to optimize in both spaces at once.