We are going to be implementing a couple of V3240 clusters using storage from an HP XP24000. The storage will actually be coming from a Thin Provisioning Pool made up of HDS AMS 2500s (external storage). I've got the NetApp best practices guide etc., but I'm wondering if anyone has a similar setup and has any gotchas or suggestions to offer. Also, is there any kind of impact (performance or otherwise) from this "double" virtualization: first virtualizing on the physical disk and then WAFL on top of that? BTW, we'll only be allocating CIFS/NFS from the V3240s.
We have implemented something similar using the XP12000 with a V3140 head unit. Overall the performance is very acceptable, and if you are purchasing the unit to add the technology as a bolt-on to the XP at a fraction of the cost, then you will be very happy.
We had a couple of issues when setting up the volumes mapped onto physical spindles on the XP and found that the 24+4 config worked best from the XP side. When presenting this to the NetApp, don’t use LUSEs and try to break your disks into a number divisible by 8. For us the calculation was 32x120ish GB for every 28+4 spindle count. Presenting larger LUNs to the NetApp was disastrous when following the best practice guide.
We are currently trying to solve a high latency issue generated by doing LUN backups using NDMP, which is the only other problem we have, and I am about to post for advice.
I agree with Walther. I have an AMS2500 set up with two pools of ~22TB each. We have carved 11x1.9TB LUNs from each of these pools and presented them to our N6060 (IBM rebranded NetApp) gateway running in HA mode. One of the pools on the AMS is for controller 0 on the N6060 and the other for controller 1. We get terrible write latency whenever there is a spike in write requests. We worked with HDS and narrowed the problem to the LUN queues on the AMS being filled during these spikes. I think that by using the larger LUNs we have limited the amount of write cache allocated on the AMS. From my experience this has caused more problems than would the WAFL overhead incurred by using more LUNs. Thoughts?
There are some issues with AMS and disk queues that are not related to the number or size of LUNs.
DataONTAP assigns each array LUN the lesser of 32 queues or (# LUNs on port/256). On a busy system with a lot of writes, we are probably going to overrun the available queues on the array, resulting in high latencies as we wait for the array to process requests we assumed it should have completed already.
The solution for this is to change the behavior of the NetApp initiators by tweaking this option: disk.target_port.cmd_queue_depth
From the command line: >options disk.target_port.cmd_queue_depth [#]
The default value is 256. For AMS, a value of 32 (or maybe less) will tend to resolve the queuing issues.
With respect to setting up two pools for a single HA pair, you are essentially cutting your performance in half. One larger pool for an HA pair would be recommended. I'm working on a Technical Report to be released later this year that will cover Best Practices for pools and other advanced array features.
Re queue depth: HDS recommended that a queue depth of "8" per LUN is the sweet spot on the AMS2500. I have 4 ports on the AMS that are zoned to 4 ports on the N6060 using 4 "1 to 1" zones. On the AMS the LUNs are assigned to a single host group which is assigned to those 4 ports. The host group consists of the 4 initiator ports on the N6060. Given the HA nature of the N6060, only two paths to an array LUN are active at once. There are 13x 1.9TB array LUNs presented to each N6060 controller (26 total). All of the LUNs are in the one host group on the AMS. Other than the performance lost by splitting the disks in the AMS into two pools, does the config seem sound to you? How would you calculate a setting for disk.target_port.cmd_queue_depth? I had adjusted the value from 256 to 48, but for the life of me, I cannot recall how I arrived at that number.
There is no hard-and-fast rule, unfortunately. In most cases, the default value works fine. In the cases where it has been an issue (heavily loaded AMS arrays), we've seen a return to acceptable performance with values ranging from 8 to 64. I tend to suggest starting at 32 and reducing it by 8 until the disk queuing issues resolve.
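The step-down approach described above can be written out as a short sketch. The function name is hypothetical; the starting value of 32, the step of 8, and the floor of 8 are taken from this thread, and stopping at 8 is an assumption based on the lowest value mentioned as having worked. The printed command is the `options` line quoted earlier in the thread.

```python
def queue_depth_candidates(start=32, step=8, floor=8):
    """Return the values to try for disk.target_port.cmd_queue_depth, highest first.

    Assumption: the floor of 8 is the lowest value reported as useful in this
    thread, not a documented minimum.
    """
    if step <= 0 or floor <= 0 or start < floor:
        raise ValueError("need step > 0, floor > 0 and start >= floor")
    return list(range(start, floor - 1, -step))


# Try each value in turn and stop at the first one where queue-full delays go away.
for depth in queue_depth_candidates():
    print(f"options disk.target_port.cmd_queue_depth {depth}")
```

Working from the high end downward matters because a value that is too low caps the maximum IOPs, so the goal is the largest value at which the array stops reporting queue-full conditions.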
WRT your array LUN configuration, as long as those 26 LUNs are fully provisioned within the pool, then you're following the Best Practice. If using traditional RAID groups, then as long as no two array LUNs in a given RAID group belong to the same V-Series aggregate (to avoid disk contention), you're fine.
Note also that you'll want as many of those LUNs as possible in the same aggregate. DataONTAP writes to an aggregate at a time, so the more disk IOPs available to a set of write operations, the better performance will be. Constrained, of course, by each LUN being written to not sharing disks with another LUN also being written to at that same moment.
I went ahead and reduced the disk.target_port.cmd_queue_depth from 48 to 32 to see if it helps any toward alleviating our write latency spikes. But I'm curious if you have any more information re that setting. I was reading your statement "DataONTAP assigns each array LUN the lesser of 32 queues or (# LUNs on port/256)" and I'm having trouble understanding. Maybe it would help if you could tell me what the risk is, or what I'm trading off if the disk.target_port.cmd_queue_depth were to be set too low.
The setting controls how many commands we will send the array at once. If we are getting queue full delays, then reducing it will set DataONTAP's expectations of array performance closer to reality.
You can either increase latency (with it set too high), or limit max IOPs (if it is set too low). Because setting it too low could reduce the max IOPs, we recommend starting at a bigger # and working your way down in smaller increments to find the right mix. In your case, by lowering it by 8 if 32 doesn't alleviate the problem. We want to avoid asking the array to do less than it can.
What I meant by "DataONTAP assigns each array LUN the lesser of 32 queues or (# LUNs on port/256)" was this:
1. If fewer than 8 LUNs, then each LUN gets at most 32 disk queues.
2. If 8 or more LUNs visible, each will get a max of (256 / #). For example, if you had 32 LUNs active on that port, each would get 8 queues (256/32=8)
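The two cases above reduce to one expression: the smaller of 32 and 256 divided by the number of LUNs on the port. Here is a minimal sketch of that arithmetic. The function name is hypothetical, rounding down is an assumption (the thread does not say how fractional results are handled), and whether a lowered option value replaces 256 in this formula is also an assumption.

```python
def queues_per_lun(luns_on_port, port_queue_depth=256, per_lun_cap=32):
    """Estimate the maximum queue slots each array LUN gets on one target port.

    Assumptions: integer (floor) division, and that port_queue_depth is the
    value of disk.target_port.cmd_queue_depth (default 256).
    """
    if luns_on_port < 1:
        raise ValueError("luns_on_port must be at least 1")
    return min(per_lun_cap, port_queue_depth // luns_on_port)


for n in (4, 8, 13, 32):
    print(n, queues_per_lun(n))
```

With the default of 256 this gives 32 for 4 or 8 LUNs, 19 for 13 LUNs, and 8 for 32 LUNs, matching the example in the post.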
Thanks for taking the time to explain. I don't remember that information being in the vseries/gateway implementation guide, or maybe I missed it, or maybe I just forgot.... At least now I better understand what I'm tweaking with that setting.
Yes, I would also agree with Walther! I also have an IBM rebranded NetApp gateway, an N6040 in HA configuration, with a DS4300 array at the backend.
I followed IBM's best practices to create RAID 5 LUNs of no more than (6+1), about 700 GB each, on the DS4300 to present to the N6040, and built several aggregates using 2 array LUNs each. A lot of space is wasted but I did not lose a lot of performance.
I have also set up a few V-Series systems with different storage below, and one thing not explained in any of the papers released by NetApp is that you have to be very careful not to do the following:
1. Create a large RAID group on the backend (say 10+1 RAID5)
2. Divide it into 2TB LUNs (say 5 x 2TB)
3. Present it to the NetApp, and put them into the same aggregate, which is a RAID0
This will give you very poor performance. It can be solved by making 5 aggregates or smaller RAID groups at the backend... I have always used smaller RAID groups at the backend, but it wastes a lot of space... with today's FC disks at 600GB you end up using only 5 disks in a RAID5...
So a feature request to NetApp would be to drop the 2TB LUN limit altogether... 🙂
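A layout can be checked for this pattern before the aggregates are built. The sketch below is illustrative only: the function name, the tuple layout, and the LUN, RAID group, and aggregate names are all hypothetical, and the mapping would have to be collected by hand from the array and the V-Series.

```python
from collections import defaultdict


def find_contention(luns):
    """Return the (raid_group, aggregate) pairs shared by more than one array LUN.

    luns: iterable of (lun_name, backend_raid_group, aggregate) tuples.
    """
    grouped = defaultdict(list)
    for name, raid_group, aggregate in luns:
        grouped[(raid_group, aggregate)].append(name)
    return {key: names for key, names in grouped.items() if len(names) > 1}


layout = [
    ("lun0", "rg1", "aggr1"),
    ("lun1", "rg1", "aggr1"),  # same RAID group and same aggregate: contention
    ("lun2", "rg1", "aggr2"),  # same RAID group, different aggregate: not flagged
    ("lun3", "rg2", "aggr1"),
]
print(find_contention(layout))
```

An empty result means no two array LUNs from the same backend RAID group sit in the same aggregate.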
We do not generally support the use of Thin Pools on the USP-V. When using Hitachi Dynamic Pools, we require the pool be fully provisioned. That is, if you have provided 100TB of LUNs to the V-Series, you should have 100TB of physical space (at least) in the pool. ONTAP will then handle thin provisioning to any hosts. As the USP-V lacks the means to ensure an administrator does not, even accidentally, over-provision the pool, we are not comfortable supporting HDP on the USP-V and its variants. Currently, the use of Dynamic Pools is limited to the AMS and VSP as noted on the V-Series Support Matrix.
With regard to the performance impact, it is difficult to answer. It will probably add a bit less than 1ms in response times for operations that have to go back to disk (fabric latencies), though some benefit is also seen from the additional cache on the USP-V. It really depends on the workload. We haven't done much performance work with arrays behind a USP-V, so HP or Hitachi may be better positioned to discuss the performance implications of external vs. internal disks.
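The fully provisioned requirement is a simple capacity comparison. A minimal sketch, with a hypothetical function name and made-up LUN sizes; only the 100TB figure comes from the example above:

```python
def pool_is_fully_provisioned(presented_lun_tb, physical_pool_tb):
    """True if the pool's physical capacity covers every LUN presented from it."""
    return physical_pool_tb >= sum(presented_lun_tb)


luns_tb = [10] * 10  # ten hypothetical 10TB LUNs, 100TB presented in total
print(pool_is_fully_provisioned(luns_tb, 100))  # physical space matches
print(pool_is_fully_provisioned(luns_tb, 80))   # pool is over-provisioned
```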
Of course, not every operation is going to go back to disk. Since the cache on the V-Series is Dedupe-aware, more blocks may be served directly from the V-Series. As always, YMMV.