How many LUNs do I need?

steiner · ‎2022-09-22

(notes on Consistency Groups in ONTAP)

This post is part 2 of 2. In the prior post, I explained consistency groups and how they exist in ONTAP in multiple forms. I’ll now explain the connection between consistency groups and performance, then show you some basic performance envelopes of individual LUNs and volumes.

Those two things may not seem connected at first, but they are. It all comes down to LUNs.

LUNs are accessed using the SCSI protocol, which has been around for over 40 years and is showing its age. The tech industry has worked miracles improving LUN technology over the years, but host OSs, drivers, HBAs, storage systems, and drives all impose limits on the performance of a single LUN.

The end result is this – sometimes you need more than one LUN to host your dataset in order to get optimum performance. If you want to take advantage of the advanced features of a modern storage array, you’ll need to manage those multiple LUNs together, as a unit. You’ll need a consistency group.

My previous post explained how ONTAP delivers consistency group management. This post explains how you figure out just how many LUNs you might need in that group, and how to ensure you have the simplest, most easily managed configuration.

Note: There are a lot of performance numbers shown below. They do NOT represent maximum controller performance. I did not do any tuning at all beyond implementing basic best practices. I created some basic storage configurations, configured a database, and ran some tests to illustrate my point. That's it.

ONTAP < 9.9.1

First, let’s go back a number of years to the ONTAP versions prior to the 9.9.1 release. There were some performance boundaries and requirements that contributed to the need for multiple LUNs in order to get optimum performance.

Test Procedure

To compare performance with differing LUN/volume layouts, I wrote a script that built one Oracle database each, using the following configurations:

1 LUN in a single volume
2 LUNs in a single volume
4 LUNs in a single volume
8 LUNs in a single volume
16 LUNs in a single volume

16 LUNs in a single volume
16 LUNs across 2 volumes
16 LUNs across 4 volumes
16 LUNs across 8 volumes
16 LUNs across 16 volumes

Warning: I normally ban anyone from using the term “IOPS” in my presence without providing a definition, because “IOPS” has a lot of different meanings. What’s the block size? Sequential or random ratio? Read/write mix? Measured from where? What’s my latency cutoff? All that matters.

In the graphs below, IOPS refers to random reads, using 8K blocks, as measured from the Oracle database. Most tests used 100% reads.
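To make that definition concrete, here is a minimal sketch of how an IOPS figure could be recorded together with the context it needs, plus the arithmetic for converting it to throughput. The class and field names are hypothetical, and it assumes "8K" means 8192 bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IopsDefinition:
    """Context an IOPS number needs before it means anything."""
    block_size_bytes: int   # assumed 8192 for the 8K blocks in this post
    read_fraction: float    # 1.0 means 100% reads
    random_fraction: float  # 1.0 means fully random access
    measured_from: str      # where the number was observed

    def throughput_mib_per_s(self, iops: float) -> float:
        """Convert an IOPS figure to MiB/s for this block size."""
        return iops * self.block_size_bytes / (1024 * 1024)


# The definition used for the graphs in this post.
post_definition = IopsDefinition(
    block_size_bytes=8 * 1024,
    read_fraction=1.0,
    random_fraction=1.0,
    measured_from="Oracle database",
)

# 100K IOPS at 8K blocks is roughly 781 MiB/s.
print(post_definition.throughput_mib_per_s(100_000))
```

The same IOPS number at a different block size is a very different amount of data, which is why the definition matters.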

I used SLOB2 to drive the workload. The results shown below are not theoretical storage maximums. They’re the result of a complicated test using an actual Oracle database, where a lot of IO has interdependencies on other IO. If you used a synthetic tool like fio, you’d see higher IOPS.

The question was - “How many LUNs do I need?” These tests used *one* volume. Multiple LUNs, but one volume. Let’s say you have a database. How many LUNs did you need in that LVM or Oracle ASM volume group to support your workload? What is the expected performance? Here’s the answer to that question when using a single AFF8080 controller prior to ONTAP 9.9.1.

There are three important takeaways from that test:

A single LUN hit the wall at about 35K IOPS.
A single volume hit the wall at about 115K IOPS.
The sweet spot for LUN count in a single volume was about 8, but there was some benefit going all the way to 16.

To rephrase that:

If you had a single workload that didn’t need more than 35K IOPS, just drop it on a single LUN.
If you had a single workload that didn’t need more than 115K IOPS, just drop it on a single volume, but distribute it across 8 LUNs.
If you had more than 115K IOPS, you would have needed more than one volume.
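Those three rules can be written down as a small lookup. This is only a sketch of the pre-9.9.1 rules of thumb above, using the approximate limits from this one test on a single AFF8080 controller; the function name and constants are hypothetical and are not sizing guidance for other platforms.

```python
# Approximate limits observed in the pre-9.9.1 test above (8K random reads).
SINGLE_LUN_LIMIT_IOPS = 35_000
SINGLE_VOLUME_LIMIT_IOPS = 115_000


def pre_991_layout(required_iops: int) -> str:
    """Return the simplest layout that covered the requirement before 9.9.1."""
    if required_iops <= SINGLE_LUN_LIMIT_IOPS:
        return "1 LUN in 1 volume"
    if required_iops <= SINGLE_VOLUME_LIMIT_IOPS:
        return "8 LUNs in 1 volume"
    return "more than 1 volume"


for need in (20_000, 90_000, 150_000):
    print(need, "->", pre_991_layout(need))
```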

That’s all <9.9.1 performance data, so let’s see what improved in 9.9.1 and how it erased a lot of those prior limitations and vastly simplified consistency group architectures.

ONTAP >= 9.9.1

Threading is important for modern storage arrays, because they are primarily used to support multiple workloads. On occasion, we’ll see a single database hungry enough to consume the full performance capabilities of an A900 system, but usually we see dozens of databases hosted per storage system.

We have to strike a balance between providing good performance to individual workloads while also supporting lots of independent workloads in a predictable, manageable way. Without naming names, there are some competitors out there whose products offer impressive performance with a single workload but suffer badly from poor and unpredictable performance with multiple workloads. One database starts stepping on another, different LUNs outcompete others for attention from the storage OS, and things get bad. One of the ways storage systems manage multiple workloads is through threading, where work is divided into queues that can be processed in parallel.

ONTAP 9.9.1 included many improvements to internal threading. In prior versions, SAN IO was essentially being serviced in per-volume queues. Normally, this was not a problem. Controllers would be handling multiple workloads running with a lot of parallelism, all the queues stayed busy all the time, and it was easy for customers to reach platform maximum performance.

Most of my work is in the database space, and we’d often have the One Big Huge Giant Database challenge. I’ve architected systems where a single database ate the maximum capabilities of 12, yes twelve controllers. If you only have one workload, it can be difficult trying to create a configuration that ensures all those threads are busy and processing IO all the time. You had to be careful to avoid having one REALLY busy queue, while others would be idle. The result is leaving potential performance on the table, and you would not get maximum controller performance.

Those concerns are 99% gone as of 9.9.1. There are still threaded operations, of course, but overall, the queues that led to those performance concerns don’t exist anymore. ONTAP services SAN IO more like a general pool of FC operations, spread across all the CPUs all the time.

To illustrate, let’s start with the same set of tests I showed for <9.9.1, with a single volume and varying numbers of LUNs in the diskgroup:

I see four important takeaways here:

A single LUN yields about 4X more IOPS than before.
A single LUN not only delivers 4X more IOPS, but the latency is also about 40% lower.
A single volume (with 8 LUNs) yields about 2X more IOPS.
A single volume (with 8 LUNs) delivers 2X more IOPS and with 40% lower latency.

That might seem simple, but there are a lot of implications to those four points. Here are some of the things you need to understand.

ONTAP is only part of the picture

The graph above does show that two LUNs are faster than one LUN, but it doesn’t say why. It’s not really ONTAP that is the limiting factor. It’s the SCSI protocol itself. Even if ONTAP were infinitely fast, with FC LUNs that delivered 0µs of latency, it can’t service IO it hasn’t received.

You also have to think about the host-side limits. Hosts also have queues, like per-LUN queues, and per-path queues, and HBA queues. You still need some parallelism up at the host level to get maximum performance.

In the tests above, you can see incremental improvements in performance as we bring more LUNs into play. I’m sure some of the benefits are a result of ONTAP parallelizing work better, but that’s only a small part of it. Most of the benefits flow from having more LUNs driven by the OS itself.
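One way to see why host-side parallelism matters is Little's Law: the number of IOs completed per second cannot exceed the number of outstanding IOs divided by the latency of each IO. The sketch below is illustrative only. The per-LUN queue depth of 32 and the 500µs latency are assumed values, not measurements from these tests, and real hosts also have per-path and HBA queue limits that this ignores.

```python
def iops_ceiling(lun_count: int, queue_depth_per_lun: int, latency_us: int) -> float:
    """Upper bound on IOPS if every per-LUN queue is kept full (Little's Law)."""
    outstanding_ios = lun_count * queue_depth_per_lun
    return outstanding_ios * 1_000_000 / latency_us


# Hypothetical host: queue depth 32 per LUN, 500 microseconds per IO.
for luns in (1, 2, 4, 8):
    print(luns, "LUNs ->", iops_ceiling(luns, 32, 500), "IOPS ceiling")
```

With these assumed numbers, one LUN cannot exceed 64K IOPS no matter how fast the storage is, because the host never has more than 32 IOs in flight. Adding LUNs raises that ceiling without changing anything on the array.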

The reason I wanted to explain this is because we have a lot of support cases about performance that aren’t exactly complaints, but are instead more like “Why isn’t my database faster than it is?” There’s always a bottleneck somewhere. If there wasn’t, all storage operations would complete in 0 microseconds, database queries would complete in 0 milliseconds, and servers would boot in 0 seconds.

We often discover that whatever the performance bottleneck might be, it ain’t ONTAP. The performance counters show the controller is nowhere near any limits, and in many cases, ONTAP is outright bored. The limit is usually up at the host. In my experience, the #1 cause of SAN performance complaints is an insufficient number of LUNs at the host OS layer. We therefore advise the customer to add more LUNs, so they can increase the parallelism through the host storage stack.

Yes, LUNs simply got faster

A lot of customers had single-LUN workloads that suddenly became a lot faster, because they updated to 9.9.1 or higher. Maybe it was a boot LUN that got faster and now patching is peppier. Maybe there was an application on a single LUN that included an embedded database, and now that application is suddenly a lot more responsive.

A volume of LUNs got faster too

Previously, I maxed out SAN IOPS in a single volume at about 115K IOPS. The limit roughly doubled to 240K IOPS in 9.9.1. That’s a big increase. IO-intensive workloads that previously required multiple volumes can be consolidated to a single volume. That means simpler management. You can create a single snapshot, clone a single volume, set a single QoS policy, or configure a single SnapMirror replication relationship.

Even if you don’t need the extra IOPS, you still get better performance

The latency dropped, too. Even a smaller database that only required 25K IOPS and was happily running on a single volume prior to 9.9.1 should see noticeably improved performance, because the response times of those individual 25K IOPS got better. Application response times get better, queries complete faster, and end users get happier.

How Many Volumes Do I Need?

I’d like to start by saying there is no best practice suggesting the use of one LUN per volume. I don’t know for sure where this idea originated, but I think it came from a very old performance benchmark whitepaper that included a 1:1 LUN:Volume ratio.

As mentioned above, it used to be important to distribute a workload across volumes in some cases, but it mostly only applied to single-workload configurations. If we were setting up a 10-node Oracle RAC cluster, and we wanted to push performance to the limit, and we wanted to get every possible IOP with the lowest possible latency, then we’d need perhaps 16 volumes per controller. There were often only a small number of LUNs on the system as a whole, so we may have used a 1:1 LUN:Volume ratio.

We didn’t HAVE to do that, and it’s in no way a best practice. We often just wanted to squeeze out a few extra percentage points of performance.

Also, don’t forget that there’s no value in unneeded performance. Configure what you need. If you only need 80K IOPS, do yourself a favor and configure a 2-LUN or perhaps 4-LUN diskgroup. It’s not hard to create more LUNs if you need them, but why do that? Why create unnecessary storage objects to manage? Why clutter up the output of commands like “lun show” with extra items that aren’t providing value? I often use the post office as an analogy – a 200MPH vehicle is faster than a 100MPH vehicle, but the neighborhood postal carrier won’t get any benefit from that extra performance.

If you have an unusual management need where one-LUN-per-volume makes more sense, that’s fine, but you have more things to manage, too. Look at the big picture and decide what’s best for you.

Want proof that multiple volumes don’t help? Check this out.

It’s the same line! In this example, I created a 16-LUN volume group and compared performance between configurations where those 16 LUNs were in a single volume, 2 volumes, 4, 8, and 16. There’s literally no difference, nor should there be. As mentioned above, ONTAP SAN processing as of 9.9.1 does not care whether the underlying LUNs are located in different volumes. The FC IO is processed as a common pool of FC IO operations.

Things get a little different when you introduce writes, because there still is some queuing behavior related to writes that may be important to you.

Write IO processing

If you have heavy write IO, you might want more than one volume. The graphs below illustrate the basic concepts, but these are synthetic tests. In the real world, especially with databases, you get different patterns of IO interdependencies.

For example, picture a banking database used to support online banking activity by customers. That will be mostly concurrent activity where a little extra latency doesn’t matter. If you need to withdraw money at the ATM, would you care if it took 2.001 seconds rather than the usual 2 seconds?

In contrast, if you have a banking database used for end-of-day processing, you have dependencies. Read #352 might only occur after read #351 has completed. A small increase in latency can have a ripple effect on the overall workload.
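The difference between those two cases comes down to simple arithmetic. In the sketch below, the IO count, latencies, and concurrency are hypothetical values chosen for illustration. It assumes every IO takes the same time and that at most `concurrency` IOs are in flight at once.

```python
import math


def elapsed_us(io_count: int, latency_us: int, concurrency: int) -> int:
    """Elapsed microseconds when at most `concurrency` IOs run in parallel."""
    waves = math.ceil(io_count / concurrency)
    return waves * latency_us


reads = 1_000_000

# Fully dependent chain: each read waits for the previous one.
serial_fast = elapsed_us(reads, 150, concurrency=1)
serial_slow = elapsed_us(reads, 300, concurrency=1)
print("serial:", serial_fast / 1e6, "s vs", serial_slow / 1e6, "s")

# Concurrent activity: 64 independent reads in flight at a time.
parallel_fast = elapsed_us(reads, 150, concurrency=64)
parallel_slow = elapsed_us(reads, 300, concurrency=64)
print("parallel:", parallel_fast / 1e6, "s vs", parallel_slow / 1e6, "s")
```

With these assumed numbers, an extra 150µs per IO adds 150 seconds to the serial job but only about 2.3 seconds to the concurrent one. No single IO is noticeably slower, yet the dependent workload feels every microsecond.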

The graphs below show what happens when one IO depends on a prior IO and latency increases. It’s also a borderline worst-case scenario.

First, let’s look at a repeat of my first 9.9.1 test, but this time I’m doing 70% reads and 30% writes. What happens?

The maximum measured IOPS dropped. Why? The reason is that writes are more expensive to complete than reads for a storage array. Obviously, platform maximums will be reduced as write IO becomes a larger and larger percentage of the IO, but this is just one volume. I’m nowhere near controller maximums. Performance remains awesome. I’m at about 150µs latency for most of the curve, and even at 100K IOPS, I’m only at 300µs of latency. That’s great, but it is slower than the 100% read IOPS test.

What you’re seeing is the result of read IOPS getting held back by the write IOPS. There were more IOPS available to my database from this volume, but they weren’t consumed, because my database was waiting on write IO to complete. The result is that the total IOPS dropped quite a bit.

Multi-Volume write IOPS

Here’s what happens when I spread these LUNs across two volumes.

Looks weird, doesn’t it? Why would 2 volumes be 2X as fast as a single volume, and why would 2, 4, 8, and 16 volumes perform about the same?

The reason is that ONTAP is establishing queues for writes. If I want to maximize write IOPS, I’m going to need more queues, which will require more volumes. The exact behavior can change between configurations and platforms, so there’s no true best practice here. I’m just calling out the potential need to spread your database across more than one volume.

Key takeaways:

If I have 16 LUNs, there is literally no benefit to splitting them amongst multiple volumes with a 100% read workload. Look at that earlier graph. The datasets all graphed as a single line.
With a 70% read workload, going from 1 volume to 2 showed a big improvement, but adding more volumes gained nothing further. That’s because, in my configuration, there are two queues for write processing within ONTAP. Two volumes are no different than 3 or 4 or 5 in terms of keeping those queues busy.
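The plateau after two volumes can be pictured with a deliberately simplified model. It assumes each volume feeds exactly one write queue and that volumes are spread evenly across the queues. That mapping is an assumption made for illustration, not a description of ONTAP internals, and the queue count of two applies only to the configuration tested here.

```python
def busy_write_queues(volume_count: int, write_queue_count: int = 2) -> int:
    """Write queues that can receive work, assuming one queue per volume."""
    return min(volume_count, write_queue_count)


for volumes in (1, 2, 4, 8, 16):
    print(volumes, "volumes ->", busy_write_queues(volumes), "busy write queues")
```

Under this model, one volume leaves a queue idle, two volumes keep both busy, and anything beyond two has no more queues left to fill.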

I also want to repeat – the graphs are the worst-case scenario. A real database workload shouldn’t be affected nearly as much, because reads and writes should be largely decoupled from one another. In my test, there are about two reads for each write with limited parallelization, and those reads do not happen until the write completes. That does happen with real-world database workloads, but very rarely. For the most part, real database read operations do not have to wait for writes to complete.

Summary

To recap:

If you’re using ONTAP <9.9.1 with FC SAN, upgrade. We’ve observed LUNs deliver 4X more IOPS at 40% lower latency.

Once you get to ONTAP 9.9.1 (or higher):

A single LUN is good for around 100K IOPS on higher-end controllers. That’s not an ONTAP limit, it’s an “all things considered” limit that is a result of ONTAP limits, host limits, network limits, typical IO sizes, etc. I’ve seen much, much better results in certain configurations, especially ESX. I’m only suggesting 100K as a rule-of-thumb.
For a single workload, a 4-LUN volume group on a single volume can hit 200K IOPS with no real tuning effort. More LUNs in that volume are desirable in some cases (especially with AIX due to its known host FC behavior), but it’s probably not worth the effort for typical SAN workloads.
If you know you’ve got a very, very write-heavy workload, you might want to split your workload into two volumes. If you’re that concerned about IOPS, you probably did that anyway, simply because you probably chose to distribute your LUNs across controllers. That’s a common practice – split each workload evenly across all controllers to achieve maximum performance, as well as guaranteed even loading across the entire cluster.

Lastly, don’t lose perspective.

It’s nice to have an AFF system with huge IOPS capabilities for the sake of consolidating lots of workloads, but I find admins obsess too much about individual workloads and target hypothetical performance levels that offer no real benefits.

I look at a lot of performance stats, and virtually every application and database workload I see plainly shows no storage performance bottleneck whatsoever. The performance limits are almost universally the SQL code, the application code, available raw network bandwidth, or Oracle RAC cluster contention. Storage is usually less than 5% of the problem. The spinning-disk days of spending your way out of performance problems are over.

Storage performance sizing should be about determining actual performance requirements, and then architecting the simplest, most manageable solution possible. The SAN improvements introduced in ONTAP 9.9.1 noticeably improve manageability as well as performance.