Data Protection
I have a LUN which holds the SQL data file, and it is showing increased LUN latency.
The data comes from this month's storage report, which shows LUN latency increasing for the database data file LUNs. Past experience has shown that users become affected at about the 11 ms mark, and action should be taken to avoid this scenario. So a NetApp support case was opened to try to address the performance issue.
How do I show whether this is a caching or a fragmentation issue? I have been looking through the perfstat reports and everything looks healthy apart from the latency. Not sure what "cp_dirty_allocation_blks" is, but it is 1000+.
Read   Write  Read   Write  Average  Queue   Lun
Ops    Ops    kB     kB     Latency  Length
0      0      32     0      7.54     0.07    /vol/sqf02/diskf.lun
0      0      37     0      7.00     1.00    /vol/sqf02/diske.lun

Read   Write  Read   Write  Average  Queue   Lun
Ops    Ops    kB     kB     Latency  Length
1      0      80     0      11.31    0.08    /vol/sqf02/diskf.lun
2      0      123    0      10.96    0.08    /vol/sqf02/diske.lun
CPU    NFS    CIFS   HTTP   Total  Net    kB/s   Disk   kB/s   Tape   kB/s   Cache  Cache  CP     CP     Disk   FCP    iSCSI  FCP    kB/s   iSCSI  kB/s
                                   in     out    read   write  read   write  age    hit    time   ty     util                 in     out    in     out
20%    0      1541   0      2140   376    5881   12354  7140   0      0      19     94%    16%    F      28%    595    2      9228   5290   1      0
26%    0      1162   0      2172   498    3993   14646  20264  0      0      18     95%    39%    3f     31%    1006   2      18393  8112   1      0
lun:sqf02/diske.lun-XXXXZZZZZ:display_name:/vol/sqf02/diske.lun
lun:sqf02/diske.lun-XXXXZZZZZ:read_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:write_ops:2/s
lun:sqf02/diske.lun-XXXXZZZZZ:other_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_data:36798b/s
lun:sqf02/diske.lun-XXXXZZZZZ:write_data:19872b/s
lun:sqf02/diske.lun-XXXXZZZZZ:queue_full:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:avg_latency:22.17ms <-------------------- Why?
lun:sqf02/diske.lun-XXXXZZZZZ:total_ops:3/s
lun:sqf02/diske.lun-XXXXZZZZZ:scsi_partner_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:scsi_partner_data:0b/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.0:98%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.2:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.3:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.4:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.5:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.6:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.7:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.0:86%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.2:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.3:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.4:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.5:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.6:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.7:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_partial_blocks:1%
lun:sqf02/diske.lun-XXXXZZZZZ:write_partial_blocks:13%
Thanks if you know the answer
Bren
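For anyone who wants to track these counters over time rather than read them by eye, here is a minimal Python sketch that parses counter lines in the `lun:<instance>:<counter>:<value>` shape shown above and flags latency above the 11 ms mark mentioned in the post. The function names and the `LATENCY_LIMIT_MS` constant are made up for illustration, and reading bucket 0 of the alignment histograms as "aligned I/O" is an assumption about how those counters are usually interpreted, not something confirmed in this thread.

```python
import re

# Sample counter lines copied from the post above (instance suffix as posted).
SAMPLE = """\
lun:sqf02/diske.lun-XXXXZZZZZ:display_name:/vol/sqf02/diske.lun
lun:sqf02/diske.lun-XXXXZZZZZ:avg_latency:22.17ms
lun:sqf02/diske.lun-XXXXZZZZZ:total_ops:3/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.0:98%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.0:86%
lun:sqf02/diske.lun-XXXXZZZZZ:read_partial_blocks:1%
lun:sqf02/diske.lun-XXXXZZZZZ:write_partial_blocks:13%
"""

# Hypothetical constant: the ~11 ms point where users notice, per the post.
LATENCY_LIMIT_MS = 11.0


def parse_counters(text):
    """Return {counter_name: numeric_value} for lines with a numeric value."""
    counters = {}
    for line in text.splitlines():
        parts = line.strip().split(":")
        if len(parts) != 4 or parts[0] != "lun":
            continue  # not a 'lun:<instance>:<counter>:<value>' line
        name, raw = parts[2], parts[3]
        match = re.match(r"-?\d+(?:\.\d+)?", raw)
        if match:  # skips non-numeric values such as display_name
            counters[name] = float(match.group())
    return counters


def summarize(counters):
    """Pick out the values most relevant to the question in the post."""
    return {
        "latency_over_limit": counters["avg_latency"] > LATENCY_LIMIT_MS,
        # Assumption: bucket 0 of the histogram represents aligned I/O.
        "bucket0_read_pct": counters["read_align_histo.0"],
        "bucket0_write_pct": counters["write_align_histo.0"],
        "partial_write_pct": counters["write_partial_blocks"],
    }


if __name__ == "__main__":
    print(summarize(parse_counters(SAMPLE)))
```

Feeding successive samples through `parse_counters` would give a simple latency trend to compare before and after any change.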
Sorry better image of graph...
Hi Bren,
Not saying this is definitely the case of fragmentation, but can you check the LUN in question against this issue first?
reallocate measure [-l logfile] [-t threshold] [-i interval] [-o] pathname | /vol/volname
Start a measure-only reallocation on the LUN, large file or volume.
A measure-only reallocation job is similar to a normal reallocation job except that only the check phase is performed. This allows the optimization of the LUN, large file or volume to be tracked over time, or measured ad-hoc.
Regards,
Radek
What sort of load does it put on the filer, can I run it during the day? I have a snapmirror of this volume. Can I run it on the remote site and still get a valid result?
Thanks
Bren
Yeah, as per this thread (which you're already familiar with) reallocation is rather poorly documented:
http://communities.netapp.com/message/20969#20969
My (informed) guesses:
- reallocate should be run on the original LUN, not its mirror, as logical to physical layout may be different at the destination, hence different results are likely in my opinion
- arguably there is some additional load on the filer (actual reads are undertaken), so running reallocate outside of peak hours seems to be a reasonable approach
Regards,
Radek
If 1 is good and 5 is very bad:
"Allocation check on '/vol/test/diske.lun' is 5, hotspot 0 (threshold
4), consider running reallocate."
I think that is a fair clue as to what is wrong...
Thanks
Bren
The actual scale goes up to 10:
http://now.netapp.com/NOW/knowledge/docs/ontap/rel732_vs/html/ontap/cmdref/man1/na_reallocate.1.htm
The threshold when a LUN, file or volume is considered unoptimized enough that a reallocation should be performed is given as a number from 3 (moderately optimized) to 10 (very unoptimized). [...] The default threshold is 4.
Having said that, I've heard stories from people who got fairly low numbers during measurement, yet when they actually ran reallocation, their performance vastly improved.
Regards,
Radek
Hi Bren,
Did you by any chance manage to verify that fragmentation was an issue, indeed?
Did you do actual reallocation & did it reduce latency?
Kind regards,
Radek
Still working on the issue with TSE. Have not found the root problem yet. Will post back with solution once it has been discovered.
Bren
TSE have said to run reallocate against the luns. Have to wait until the 7th April due to change control to get the results. Will post back if it is a success.
Thanks all for help
Bren
Hi Bren,
Many thanks for posting the update - fingers crossed for the positive outcome on/after the 7th of April!
Regards,
Radek
Early results are in. Reallocating the LUNs in the volume does reduce LUN latency for SQL Server, by 20% on my system! It is still too early to know whether the process was a success, but early results do look very good.
A total of 7 LUNs were reallocated on a single 56x 300 GB 15k disk aggregate on a FAS3070:
E: 134 GB took 26 min
F: 135 GB took 13 min
G: 100 GB took 26 min
H: 98 GB took 15 min
J: 185 GB took 12 min
K: 100 GB took 37 min - TL
I: 395 GB took 7 min - TL
Did not notice any issues with extra IO or CPU load on the filer during the work. Still waiting to find out how big the snapshots will be.
Hope this post helps you plan your own reallocation work.
Bren
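To help with planning a change window, the timings above can be turned into rough per-LUN rates. This is only a small Python sketch using the sizes and durations exactly as posted; the names `RUNS`, `rate_gb_per_min` and `summarize` are made up for illustration, and adding the minutes together assumes the jobs ran one after another, which the post does not state.

```python
# (drive letter, size in GB, minutes taken), copied from the post above.
RUNS = [
    ("E", 134, 26),
    ("F", 135, 13),
    ("G", 100, 26),
    ("H", 98, 15),
    ("J", 185, 12),
    ("K", 100, 37),  # transaction log LUN
    ("I", 395, 7),   # transaction log LUN
]


def rate_gb_per_min(size_gb, minutes):
    """Average reallocation rate for one LUN."""
    return size_gb / minutes


def summarize(runs):
    """Return (total GB, total minutes, overall GB/min).

    Assumption: minutes are additive, i.e. the jobs ran sequentially.
    """
    total_gb = sum(size for _, size, _ in runs)
    total_min = sum(minutes for _, _, minutes in runs)
    return total_gb, total_min, total_gb / total_min


if __name__ == "__main__":
    for name, size, minutes in RUNS:
        print(f"{name}: {rate_gb_per_min(size, minutes):.1f} GB/min")
    total_gb, total_min, overall = summarize(RUNS)
    print(f"total: {total_gb} GB in {total_min} min ({overall:.1f} GB/min)")
```

The spread between LUNs is large (1147 GB in 136 minutes overall, but K and I differ by an order of magnitude), so size alone is a poor predictor of run time; how fragmented each LUN was is a likely factor.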
Hi,
Many thanks for posting the results. I reckon many folks will learn a nice lesson based on your experience (that includes some NetApp chaps who, hmm, tend to forget that fragmentation may be an issue )
Re snapshots growing - if they didn't balloon straight after the reallocate run, you are completely safe in my opinion.
Regards,
Radek
Care to make a wager? The next snap will run at 7 pm tonight. I know what the average size is, and my guess is that it will be about 5% bigger than normal.
Bren
After 24 hours the LUN latency is still about 20% lower on average, and the snapshot size difference was negligible. So a FREE upgrade.
Recommend you try it. We are going to wait for a month to confirm results and then look into trying on other SQL servers.
Bren
Hi Bren,
To make the story complete - did you run reallocate with -p option?
I reckon that was the case because your snapshots didn't grow, but just double-checking...
Regards,
Radek
I used this command
reallocate start -f -p /vol/fasqf02/diskf.lun
Did both database and TL luns. Will look into setting up a scheduled
job to run the reallocate task automatically to stop the performance
tailing off with time.
Bren
Fantastic info... I've been using reallocate recently with a customer where a lot of Storage vMotion + dedup got the volumes extremely non-optimized. This is a pretty "mushy" topic right now, so real-world examples are much appreciated.
Andrew, we are in the same situation due to a migration using vMotion + dedupe... I wonder how it was resolved.
We need to undo sis and run reallocate in order to reduce fragmentation... We have around 12 volumes hosting 500+ VM OS volumes that are heavily deduped, and we will run out of space in the aggregate if we undo a-sis :( (thin on thin might work)
Reallocate measure on those volumes returns a value of 4 and suggests a reallocate.
On top of that, we got some misaligned VMs (10%, with moderate workload) as part of an expedited migration from a different storage vendor... It is just killing the 6080 even when the throughput/IOPS is pretty low... we are in the process of fixing those.
For what it's worth, you don't have to undo dedup to run reallocate. If you just run a regular reallocate, you might see larger snapshots and less SIS space savings (although it's not too bad in my experience so far). If you use the "-p" switch on the reallocate start command, that should help lower the larger snapshots/lost SIS savings issue but with some performance overhead (I'm currently trying to understand exactly how much performance impact).
An optimization level of 4 isn't horrible to be honest....I'd thought the scale only went to 10 until I saw a customer volume that came back as 14 (was seeing very high latencies on the volume when the rest of the system was fine).
Output from reallocate measure:
Tue Apr 20 01:52:14 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_prdc6t2os_n2_vol' is 3, hotspot 29 (threshold 4), consider running reallocate.
Tue Apr 20 01:54:43 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_uatc7t3os_n1_vol' is 4, hotspot 24 (threshold 4), consider running reallocate.
Tue Apr 20 01:58:31 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_prdc6t1os_n2_vol' is 4, hotspot 22 (threshold 4), consider running reallocate.
What does the hotspot value in the output mean?
I thought reallocate might not move/spread the deduplicated blocks in an effective manner without undoing sis... seems like you are right. The problem is that if the snapshots grow beyond limits we will be in a mess (the aggregate is 71@300 running at 90%)... I wish we could try this on the SnapMirror location, but that will not help anyway because the layout there is different altogether.
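With a dozen volumes to check, it may help to pull the numbers out of the measure messages instead of reading the log by hand. Below is a minimal Python sketch that extracts the path, allocation value, hotspot and threshold from lines in the format quoted above. The regular expression is written against those three sample lines only, and the field name `level` is a label chosen here, not NetApp terminology. It makes no attempt to interpret the hotspot figure.

```python
import re

# Matches the message body seen in the quoted log lines.
PATTERN = re.compile(
    r"Allocation check on '(?P<path>[^']+)' is (?P<level>\d+), "
    r"hotspot (?P<hotspot>\d+) \(threshold (?P<threshold>\d+)\)"
)

# Log lines copied from the post above.
LOG = """\
Tue Apr 20 01:52:14 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_prdc6t2os_n2_vol' is 3, hotspot 29 (threshold 4), consider running reallocate.
Tue Apr 20 01:54:43 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_uatc7t3os_n1_vol' is 4, hotspot 24 (threshold 4), consider running reallocate.
Tue Apr 20 01:58:31 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_prdc6t1os_n2_vol' is 4, hotspot 22 (threshold 4), consider running reallocate.
"""


def parse_checks(text):
    """Return one dict per 'Allocation check' message found in the text."""
    checks = []
    for match in PATTERN.finditer(text):
        checks.append({
            "path": match.group("path"),
            "level": int(match.group("level")),
            "hotspot": int(match.group("hotspot")),
            "threshold": int(match.group("threshold")),
        })
    return checks


if __name__ == "__main__":
    # List the volumes with the highest allocation value first.
    for check in sorted(parse_checks(LOG), key=lambda c: -c["level"]):
        print(check["path"], check["level"], check["hotspot"])
```

Sorting by the allocation value gives a simple order in which to schedule reallocate runs when aggregate space only allows a few volumes at a time.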