Why Are High-Capacity SSDs Such a Big Deal?

By Dr. Mark Bregman, SVP & Chief Technology Officer, NetApp


15.36 TB SSD (Image via Samsung Newsroom)

Sometimes it’s tough to discern the value of new technologies when they first enter the market. But thankfully with new high-capacity solid-state drives (SSDs), it’s pretty clear they can deliver immediate customer value. That makes it even more confusing that some vendors within the storage industry are expressing alarm and uncertainty about the new 15TB SSDs.


The case for SSDs has been compelling from the start, even for some of the more expensive ones. This is particularly true for space-constrained data centers, mixed SAN and NAS environments, and applications that require both low latency and high capacity. As SSDs of all sizes have become more affordable, more types of customers can consider them. And many of these customers are managing workloads that are ideally suited to the performance profile of high-capacity SSDs, such as business-critical applications where they want to avoid latency spikes that are visible at the app layer.


At NetApp, when we saw the coming of large-capacity SSDs, we made the move to support them in our ONTAP operating system. We were the first to ship storage systems with this capability, but I hasten to add that we didn’t build in support for high-capacity SSDs simply to be the first to do so. We did it because we have customers who need it now, and many more who will need and want it very soon.


So how do you measure the impact of high-capacity drives on your business? Important factors include space efficiency, performance gains and cost of ownership. In all categories, the case for these SSDs keeps getting better. But to be thorough, let’s dig into each one.


From a space efficiency standpoint, you can’t beat the new high-capacity all-flash arrays, which give you up to 321.3TB of raw storage in a single 2U form factor. That means a single 2U system using 15.3TB drives can provide more than 1PB (1,000TB) of effective capacity. To achieve the same with even the highest density SFF hard disk drives would require 52U of rack space (a standard rack is 42U) and 18 times as much power. Along with the space reduction, the savings in power and cooling are huge. In data centers where every square foot or square meter counts, the gains here are considerable. So NetApp all-flash arrays that use high-capacity SSDs can now address customer use cases where such significant relief has historically been impractical.
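The space figures above can be checked with a little arithmetic. The sketch below uses only numbers quoted in that paragraph (321.3TB in 2U, 1PB effective, 52U for hard drives). The variable names are illustrative, and the data-reduction ratio it prints is simply what those numbers imply, not a published specification.

```python
# Figures quoted in the paragraph above; variable names are illustrative.
capacity_tb = 321.3       # capacity quoted for one 2U all-flash system
effective_tb = 1000.0     # "more than 1PB" of effective capacity
flash_rack_units = 2      # rack space for the all-flash system
hdd_rack_units = 52       # rack space quoted for the SFF hard drive equivalent

# Data reduction (deduplication, compression, compaction) implied by the two capacities.
implied_reduction = effective_tb / capacity_tb
# How many times more rack space the hard drive configuration needs.
rack_space_ratio = hdd_rack_units / flash_rack_units

print(f"Implied data-reduction ratio: {implied_reduction:.2f}:1")
print(f"HDD rack space vs. flash: {rack_space_ratio:.0f}x")
```

Under these inputs, reaching 1PB of effective capacity implies a data-reduction ratio of a little over 3:1, and the hard drive configuration occupies 26 times the rack space.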


When they first appeared, all-flash arrays were used to replace hard disk systems for workloads requiring high performance. These hard drive systems delivered performance density of about 100 to 200 IOPS per terabyte, whereas NetApp all-flash arrays with high-capacity SSDs can deliver 8x to 16x that number: 1,650 IOPS per terabyte. Now, with significant drops in the cost per gigabyte of SSDs and further improvements in deduplication, compression and compaction technologies, our all-flash arrays can replace traditional disk arrays even where performance isn’t the driving factor. So, customers can get the performance and efficiency of all-flash arrays along with the cost of ownership advantages. Of course, for applications that require the highest performance, customers can still choose smaller SSDs.
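The 8x to 16x range follows directly from the performance-density figures quoted above. A minimal sketch, using only those quoted numbers (the variable names are illustrative):

```python
# Performance density figures quoted above, in IOPS per terabyte.
hdd_iops_per_tb = (100, 200)   # range quoted for hard drive systems
flash_iops_per_tb = 1650       # figure quoted for all-flash arrays with high-capacity SSDs

# Compare flash against the best and worst hard drive cases.
low_multiple = flash_iops_per_tb / max(hdd_iops_per_tb)
high_multiple = flash_iops_per_tb / min(hdd_iops_per_tb)

print(f"Flash advantage: {low_multiple:.2f}x to {high_multiple:.1f}x")
```

This gives 8.25x against the fastest hard drive systems and 16.5x against the slowest, which rounds to the range stated in the text.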


Earlier I alluded to some storage vendors expressing alarm and uncertainty about high-capacity SSDs. This might have to do with theirs not being on the market yet. Not a biggie: they’ll catch up. The fact remains, however, that when they do, it will be with an expensive silo product that doesn’t play well across your infrastructure.


Meanwhile, you can experience the advantages of high-capacity SSDs in your environment now with NetApp. For more information on how easy it is to get NetApp all-flash arrays for your business, including free trials, check out our Flash 3-4-5 promotion today.


Thanks for the post. It raises some interesting questions I'm very curious about.


The biggest concern I see is the ongoing decision to continue to address SSD technology as a type of disk drive. Disks, by their very invention, involve heads, moving parts, cylinders and sectors. Why do we continue to address a memory device with such arcane protocols? It's similar to using random access disk drives to mimic tape drive behaviors (VTL) or having A.G. Bell send Morse code over his first voice lines. Fortunately, the popularity of VTL technology is fading, as I suspect the arcane addressing of Flash devices will eventually follow suit. I'm not aware that Morse code was ever sent over voice lines, but one never knows.


Where this becomes really serious is the whole concept of RAID protection for SSDs. A lot of error protection is already built into them and their reliability continues to climb. We generally don't mirror the system's DRAM; SSDs are pretty much DRAM that still exists after a power event. OK, so we have to protect against the entire device failing; 15TB of data could be a lot of lost data needing recovery. But we've been protecting large datastores for a long time using techniques more cost-effective than RAID. So why don't storage providers follow suit?


So one potential risk of large SSDs is the high cost required to protect them following traditional techniques. 321TB in a 2U shelf becomes REALLY expensive if only half of it is usable. Where is NetApp headed to address this situation and make it more cost effective for customers to invest in these large SSDs?


Bruce Clarke

NetApp Alum

Bruce, thanks for your comments and questions and it’s good to hear from you. To the points that are answerable, here you go…
We understand the point about addressing SSDs as disk drives and we envision a sensible move to NVMe. There are still a lot of advantages to using a legacy standards-based blocks interface for SSDs: as we see it today, designing a flaky infrastructure built on proprietary hardware is like reinventing an “even rounder” wheel. We choose to enjoy the benefits today while preparing for tomorrow.
RAID is a form of erasure coding and we could debate the efficiencies of one versus the other. We’re assuming RAID-DP and protecting against any two SSDs failing: use of large RAID groups helps us keep the parity overhead reasonable. It’s important to note that the performance of distributed data stores in degraded mode can be horrible. For a shared-nothing architecture, it’s likely that the optimal solution is two-level protection with RAID inside the node to protect against device failures and EC outside for node-level protection.
It’s certainly true that 50% overhead for mirroring would be expensive. But in the case of NetApp high-capacity all-flash arrays, we start with 367TB raw and that gets us 321TB of usable data drive capacity. Our RAID overhead for a single shelf is 12.5%, that is, two parity drives and one spare for 24 drives. A second RAID group would carry lower overhead (about 8.3%) because the spare is global.
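The overhead percentages above can be reproduced from the drive counts. The sketch below assumes 15.3TB drives (the size quoted in the post) and assumes that a second RAID group also spans 24 drives. That second assumption is not stated in the reply, but it is consistent with the 8.3% figure. The function and variable names are illustrative.

```python
# Assumptions: 15.3TB drives, 24 drives per shelf, RAID-DP (two parity drives),
# one global spare, and a second RAID group of the same 24-drive size.
drive_tb = 15.3
drives_per_shelf = 24
parity_drives = 2
spare_drives = 1

def overhead(non_data_drives, total_drives):
    """Fraction of drives in a group that hold no user data."""
    return non_data_drives / total_drives

raw_tb = drive_tb * drives_per_shelf
data_drives = drives_per_shelf - parity_drives - spare_drives
data_tb = drive_tb * data_drives

# The first shelf pays for parity plus the spare; a later group pays for parity only.
first_group = overhead(parity_drives + spare_drives, drives_per_shelf)
second_group = overhead(parity_drives, drives_per_shelf)

print(f"Raw: {raw_tb:.1f}TB, data drives: {data_drives}, usable: {data_tb:.1f}TB")
print(f"First shelf overhead: {first_group:.1%}, second group: {second_group:.1%}")
```

With these inputs, 24 drives give about 367TB raw, 21 data drives give about 321TB, and the overheads come out at 12.5% and 8.3%, matching the figures in the reply.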
In the immediate term, it’s important to keep in mind: 1) because we use analytics to detect problems early and copy out the drive contents before an entire drive fails, we can avoid a RAID rebuild in a majority of cases; and 2) observed failure rates for NetApp SSDs are much lower than HDDs.
In the longer term for large SSDs, we have proposed elastic capacity as a standard: that is, a fail-in-place/repair-in-place technique to deal with individual NAND die failures within an SSD. This approach repairs just the missing pieces of data, thereby minimizing disruption (in the form of performance impact) as well as data loss hazard. As you rightly describe, a 32TB SSD would have 512 individual NAND dies, any one of which could cause the whole SSD to fail, thereby forcing reconstruction of 32TB of data.
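To give a rough sense of why die-level repair matters, the sketch below compares the data affected by a single die failure with a whole-drive reconstruction. It assumes decimal terabytes and that capacity is spread evenly across the 512 dies, ignoring over-provisioning and any internal striping or ECC. Those simplifications are assumptions for illustration, not details from the reply.

```python
# Figures from the reply above; an even spread of capacity across dies is an assumption.
ssd_capacity_tb = 32
dies_per_ssd = 512

# Capacity behind a single die, in decimal gigabytes.
per_die_gb = ssd_capacity_tb * 1000 / dies_per_ssd
# Share of the drive's data touched by one die failure.
fraction_affected = 1 / dies_per_ssd

print(f"Data behind one die: {per_die_gb:.1f}GB")
print(f"Share of the drive affected: {fraction_affected:.3%}")
print(f"Whole-drive rebuild instead: {ssd_capacity_tb * 1000}GB")
```

Under these assumptions, repairing in place touches roughly 62.5GB, about 0.2% of the drive, rather than reconstructing all 32TB.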