Re: Improving Disk Latency Issues and Overall Performance for Microsoft SQL

msandona · ‎2011-06-02

Hello:

We have physical MS SQL servers (SQL Server 2008) that connect to LUNs over fibre channel. These LUNs sit in volumes (1 LUN per volume) that are on 600 GB, 15000 RPM SAS disks. The aggregate that the volumes are on is comprised of 36 disks. as well as the SQL VOLs/LUNs, we have 20 other volumes, mostly NFS VMware data stores on this aggregate.

We have 512 MB PAM II cards in our FAS 3170s. We are on version 7.3.5.1 of ONTAP.

The SQL LUNs have been having some disk latency issues, averaging in the 10-15 ms range but peaking in to 25+ ms. For the most part the performance summary for these volumes/LUNs show relatively low IOPS, averaging around 30-40 or so per volume/LUN. Server based performance monitoring also reported disk latency as a culprit the overall poor performance.

Based on this structure, is there any way to improve the SQL performance?

So far i've considered:

1. running reallocate - these volumes are not deduped and the only snapshots are the snapmirror snap. running 'reallocate measure" on one of the primary culprits generated this output: Allocation check on '/vol/FC_CCAMARPTDB_M' is 5, hotspot 20 (threshold 4), consider running reallocate.

2. moving underperforming databases to smaller aggregates and isolating one VOL/LUN per aggregate.

Any of these ideas make sense? Something else?

The application team is looking for a quick fix to the performance issues so obviously the simpler, the better. But I'm open to anything.

Thank you .

Matt

aborzenkov · ‎2011-06-02

What about total load on aggregate? Number of total IOPS, how busy disks are, etc? Aggregate is shared resource, so may be other systems keep it busy? In this case playing with flexshare could offer some improvements.

shaunjurr · ‎2011-06-02

Hi,

You definitely want to regularly run reallocate on your luns. If you have split log and database luns, then basically, you really only will need to reallocate the database luns. The even log writing and rotation causes little fragmentation. This will clear up some of the response time problems. Be advised, using incorrect options will cause larger snapshot deltas.

You will also want to look at the flexshare (a.k.a. priority command) scheduling subsystem. Here you will be able to more finely tune which volumes get higher I/O priority and which get to have priority in your PAM modules.

If you get both of these right, your problems should basically go away quickly.

I normally split log and database luns (using a volume mountpoint structure on the windows server) between the controllers so that I get full use of both controllers for each database, but that will be more of a task to achieve than the suggestions I made above.

Good luck.

msandona · ‎2011-06-03

aborzenkov and shaunjurr:

thanks for the reply.

Regarding reallocate - as i understand it, reallocate should not be run on deduplicated volumes - no problem as dedupe is not enabled on these db volumes. Also, there is concern about having snapshots on the volume as it can affect performance. Is this a concern only on first run or always?

Right now, these volumes have only their snapmirror snap. We use a server based product to do our SQL backups but are considering switching to SnapManager for SQL. Assuming we schedule weekly reallocates on these volumes, what happens when we start using snapmanager?

When reallocate is running, is there an observable performance degradation?

I appreciate your assistance.

Matt

shaunjurr · ‎2011-06-03

Hi,

The reallocate vs. dedupe (sis) debate certainly is an on-going one, but I think there are a bit too many that are looking for black and white answers when there are probably a good deal of "gray" cases. For the moment, this isn't really an issue in your setup anyway.

Again, you can avoid larger snapshot deltas by using the "-p" switch or by running reallocate a bit more often, or both. Depending on the churn in your databases, you might even want to run things every couple of days. You'll unfortunately have to check the results in the messages logs manually because NetApp hasn't yet delivered a method to track the need/results here.

Performance hits with reallocation isn't really an issue, but you probably want to stagger your runs. 'reallocate' is a low priority background process, so it shouldn't affect things too much (nothing is perfect, but it works pretty well). You can, again, fix things here to your advantage with the 'priority' command. I find using priority really evens out I/O performance on the whole. It seems like most operators, however, haven't used it very much, even if it also allows a better use of your PAM cards.

Use of SnapManager is really a tangential issue. The performance should be basically the same as using normal snapshots. Again, with 'priority' you can increase the priority of snapshot operations on volumes where performance is important. Performance degradation in combination with using snapshots on NetApp is negligible, unlike EMC and HDS. I think I've seen benchmarks where the hit was 2-3%, iirc.

Basically, what I wrote before should work well for you. I have used this with a good deal of success on a number of systems.

There's not a lot to fear here. Get your db luns reallocated and take a look at some TR's on FlexShare and PAM usage and setup 'priority'.

🙂

kkaushal2 · ‎2011-11-07

What Netapp are you using? Is SQL accessing the luns over preffered (primary) path?

lmunro_hug · ‎2011-11-07

Matt,

How often are you running snapmirror on the volumes that have the affected LUNs? Our links between sites are quite fast and we have a few snapmirrors going at the same time. We have noticed a perfomance hit on aggregates when a snapmirror is occuring. This is mainly on aggregates that are closer to their IOP limits. Look at your latencies for multiple volumes on a line graph and see if there is a pattern eg: increase in latency during snapmirror update on the hour etc. You can also compare these with aggregate ops and disk busy times when compared over the same period.

Also check if your snapshot deletes are occurring during this time aswell.

Have you worked out your aggregate max OPs attainable for the number of disks/raigroups you have and compared that to current OPs on the aggregate?

Luke

shaunjurr · ‎2011-11-08

I'm guessing that since this thread has been dead since June, that he either figured it out or is currently working somewhere else, hehe... Probably nothing to see here...