lun latency increasing

BrendonHiggins

I have a LUN which holds the SQL data file, and it is showing increased LUN latency.

The data, from this month's storage report, shows LUN latency increasing for the database data file LUNs.  Past experience has shown that users become affected at about the 11 ms mark, and action should be taken to avoid that scenario.  A NetApp support case has been opened to try to address the performance issue.

How do I show whether this is a caching or fragmentation issue?  I have been looking through the perfstat reports and everything looks healthy apart from the latency.  I am not sure what "cp_dirty_allocation_blks" is, but it is 1000+.

Read Write   Read  Write Average   Queue  Lun
  Ops   Ops     kB     kB Latency  Length
    0     0     32      0    7.54    0.07 /vol/sqf02/diskf.lun
    0     0     37      0    7.00    1.00 /vol/sqf02/diske.lun


Read Write   Read  Write Average   Queue  Lun
  Ops   Ops     kB     kB Latency  Length
    1     0     80      0   11.31    0.08 /vol/sqf02/diskf.lun
    2     0    123      0   10.96    0.08 /vol/sqf02/diske.lun
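
A rough way to check samples like the two above against the 11 ms mark mentioned earlier is sketched below. It assumes each data row has exactly the seven columns shown (read ops, write ops, read kB, write kB, average latency, queue length, LUN path); the function names and the `LATENCY_LIMIT_MS` constant are made up for this illustration and are not part of any NetApp tool.

```python
# Sketch: flag LUNs whose average latency reaches a chosen limit.
# Assumed row layout (from the samples in the post):
#   read_ops write_ops read_kB write_kB avg_latency_ms queue_length lun_path
LATENCY_LIMIT_MS = 11.0  # the "users become affected" mark quoted in the post

SAMPLE = """\
Read Write   Read  Write Average   Queue  Lun
  Ops   Ops     kB     kB Latency  Length
    1     0     80      0   11.31    0.08 /vol/sqf02/diskf.lun
    2     0    123      0   10.96    0.08 /vol/sqf02/diske.lun
"""


def parse_lun_stats(text):
    """Return a list of (lun_path, avg_latency_ms), one per data row."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have seven columns and end with a /vol/ path;
        # the two header lines fail this check and are skipped.
        if len(fields) == 7 and fields[6].startswith("/vol/"):
            rows.append((fields[6], float(fields[4])))
    return rows


def over_limit(rows, limit_ms=LATENCY_LIMIT_MS):
    """Return the LUN paths whose average latency is at or above the limit."""
    return [path for path, latency in rows if latency >= limit_ms]


if __name__ == "__main__":
    for path in over_limit(parse_lun_stats(SAMPLE)):
        print(f"{path} is at or above {LATENCY_LIMIT_MS} ms")
```

Feeding it periodic samples would show when a LUN starts crossing the limit, which is easier to track than eyeballing a monthly report.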

CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
20%     0  1541     0    2140   376  5881  12354   7140     0     0    19   94%  16%  F   28%    595     2  9228  5290     1     0
26%     0  1162     0    2172   498  3993  14646  20264     0     0    18   95%  39%  3f  31%   1006     2 18393  8112     1     0

lun:sqf02/diske.lun-XXXXZZZZZ:display_name:/vol/sqf02/diske.lun
lun:sqf02/diske.lun-XXXXZZZZZ:read_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:write_ops:2/s
lun:sqf02/diske.lun-XXXXZZZZZ:other_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_data:36798b/s
lun:sqf02/diske.lun-XXXXZZZZZ:write_data:19872b/s
lun:sqf02/diske.lun-XXXXZZZZZ:queue_full:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:avg_latency:22.17ms   <-------------------- Why?
lun:sqf02/diske.lun-XXXXZZZZZ:total_ops:3/s
lun:sqf02/diske.lun-XXXXZZZZZ:scsi_partner_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:scsi_partner_data:0b/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.0:98%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.2:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.3:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.4:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.5:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.6:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.7:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.0:86%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.2:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.3:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.4:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.5:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.6:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.7:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_partial_blocks:1%
lun:sqf02/diske.lun-XXXXZZZZZ:write_partial_blocks:13%
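
For anyone who wants to pull numbers out of counter dumps like the one above, here is a minimal parsing sketch. It assumes every line has the shape `object:instance:counter:value` and that the instance name contains no colon; the helper names are invented for this example. Summing histogram buckets 1 to 7 as a "misaligned share" is my own reading of the counters, not a documented formula.

```python
# Sketch: parse "object:instance:counter:value" lines into a dict.
SAMPLE = """\
lun:sqf02/diske.lun-XXXXZZZZZ:avg_latency:22.17ms   <---- Why?
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.0:98%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.0:86%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_partial_blocks:13%
"""


def parse_counters(text):
    """Map counter name to its raw value string (e.g. '22.17ms', '86%')."""
    counters = {}
    for line in text.splitlines():
        parts = line.split(":", 3)
        if len(parts) != 4:
            continue  # not a counter line
        _obj, _instance, name, value = parts
        tokens = value.split()
        # Keep only the first token so hand-written notes after the value are dropped.
        counters[name] = tokens[0] if tokens else ""
    return counters


def percent(value):
    """Convert a string such as '86%' to the float 86.0."""
    return float(value.rstrip("%"))


def misaligned_share(counters, kind):
    """Sum alignment histogram buckets other than bucket 0 for 'read' or 'write'."""
    prefix = f"{kind}_align_histo."
    return sum(
        percent(value)
        for name, value in counters.items()
        if name.startswith(prefix) and not name.endswith(".0")
    )


if __name__ == "__main__":
    c = parse_counters(SAMPLE)
    print("avg latency:", c["avg_latency"])
    print("writes in bucket 0:", percent(c["write_align_histo.0"]), "%")
    print("writes in buckets 1-7:", misaligned_share(c, "write"), "%")
```

With the values posted above, nearly all I/O lands in bucket 0, which is why alignment does not look like the obvious culprit here.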

Thanks if you know the answer

Bren


BrendonHiggins

Sorry better image of graph...

radek_kubka

Hi Bren,

Not saying this is definitely the case of fragmentation, but can you check the LUN in question against this issue first?

reallocate measure [-l logfile] [-t threshold] [-i interval] [-o] pathname | /vol/volname
Start a measure-only reallocation on the LUN, large file or volume.
A measure-only reallocation job is similar to a normal reallocation job except that only the check phase is performed. This allows the optimization of the LUN, large file or volume to be tracked over time, or measured ad-hoc.

Regards,

Radek

BrendonHiggins

What sort of load does it put on the filer? Can I run it during the day? I have a SnapMirror of this volume. Can I run it on the remote site and still get a valid result?

Thanks

Bren

radek_kubka

Yeah, as per this thread (which you're already familiar with) reallocation is rather poorly documented:

http://communities.netapp.com/message/20969#20969

My (informed) guesses:

- reallocate should be run on the original LUN, not its mirror, as the logical-to-physical layout may be different at the destination, so different results are likely in my opinion

- arguably there is some additional load on the filer (actual reads are undertaken), so running reallocate outside of peak hours seems to be a reasonable approach

Regards,
Radek

BrendonHiggins

If 1 is good and 5 is very bad:

"Allocation check on '/vol/test/diske.lun' is 5, hotspot 0 (threshold 4), consider running reallocate."

I think that is a fair clue as to what is wrong....

Thanks

Bren

radek_kubka

The actual scale goes up to 10:

http://now.netapp.com/NOW/knowledge/docs/ontap/rel732_vs/html/ontap/cmdref/man1/na_reallocate.1.htm

The threshold when a LUN, file or volume is considered unoptimized enough that a reallocation should be performed is given as a number from 3 (moderately optimized) to 10 (very unoptimized). [...] The default threshold is 4.

Having said that, I've heard stories from people who got fairly low numbers during measurement, yet when they actually ran the reallocation, their performance vastly improved.

Regards,
Radek

radek_kubka

Hi Bren,

Did you by any chance manage to verify that fragmentation was an issue, indeed?

Did you do actual reallocation & did it reduce latency?

Kind regards,

Radek

BrendonHiggins

Still working on the issue with TSE. Have not found the root problem yet. Will post back with solution once it has been discovered.

Bren

BrendonHiggins

TSE have said to run reallocate against the luns.  Have to wait until the 7th April due to change control to get the results.  Will post back if it is a success.

Thanks all for help

Bren

radek_kubka

Hi Bren,

Many thanks for posting the update - fingers crossed for the positive outcome on/after the 7th of April!

Regards,
Radek

BrendonHiggins

Early results are in.  Reallocating the LUNs in the volume does reduce LUN latency for SQL Server, by 20% in my system!  It is still too early to know whether the process was a success, but early results do look very good.

A total of 7 LUNs were reallocated on a single aggregate of 56 x 300 GB 15k disks on a FAS3070.

E:  134 GB took 26 min
F:  135 GB took 13 min
G:  100 GB took 26 min
H:  98 GB took 15 min
J:  185 GB took 12 min
K:  100 GB took 37 min - TL
I:  395 GB took 7 min - TL

Did not notice any issues with extra IO or CPU load on the filer during the work.  Still waiting to find out how big the snapshots will be.

Hope this post helps you plan your own reallocation work.
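
For planning purposes, the timings above can be turned into rough per-LUN rates. The sketch below only does arithmetic on the figures posted in this thread; the `RUNS` list and the function names are made up for the example. Note that the summed minutes are the sum of the individual job durations, which is not necessarily wall-clock time, since the post does not say whether the jobs overlapped.

```python
# Sketch: reallocation rate per LUN from the sizes and durations posted above.
# Each entry is (drive letter, size in GB, duration in minutes).
RUNS = [
    ("E", 134, 26),
    ("F", 135, 13),
    ("G", 100, 26),
    ("H", 98, 15),
    ("J", 185, 12),
    ("K", 100, 37),
    ("I", 395, 7),
]


def rate_gb_per_min(size_gb, minutes):
    """Average reallocation rate for one LUN, in GB per minute."""
    return size_gb / minutes


def summarize(runs):
    """Return (total GB, summed minutes, overall GB per minute)."""
    total_gb = sum(size for _, size, _ in runs)
    total_min = sum(minutes for _, _, minutes in runs)
    return total_gb, total_min, total_gb / total_min


if __name__ == "__main__":
    for drive, size, minutes in RUNS:
        print(f"{drive}: {rate_gb_per_min(size, minutes):.1f} GB/min")
    total_gb, total_min, overall = summarize(RUNS)
    print(f"all: {total_gb} GB in {total_min} min of job time, {overall:.1f} GB/min")
```

The spread between LUNs is wide, so a single average is a poor predictor; the duration probably depends far more on how fragmented each LUN was than on its size.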

Bren

radek_kubka

Hi,

Many thanks for posting the results. I reckon many folks will learn a nice lesson based on your experience (that includes some NetApp chaps who, hmm, tend to forget that fragmentation may be an issue).

Re snapshots growing - if they didn't balloon straight after the reallocate run, you are completely safe in my opinion.

Regards,

Radek

BrendonHiggins

Care to make a wager? The next snap will run at 7 pm tonight. I know what the average size is, and my guess is about 5% bigger than normal.

Bren

BrendonHiggins

After 24 hours the LUN latency is still 20% faster on average, and the snapshot size difference was negligible.  So, a FREE upgrade.

Recommend you try it.  We are going to wait for a month to confirm results and then look into trying on other SQL servers.

Bren

radek_kubka

Hi Bren,

To make the story complete - did you run reallocate with the -p option?

I reckon that was the case because your snapshots didn't grow, but just double-checking...

Regards,
Radek

BrendonHiggins

I used this command

reallocate start -f -p /vol/fasqf02/diskf.lun

Did both database and TL LUNs. Will look into setting up a scheduled job to run the reallocate task automatically to stop the performance tailing off with time.

Bren

amiller_1

Fantastic info....I've been using reallocate recently with a customer where a lot of Storage vMotion + dedup got the volumes extremely non-optimized. This is a pretty "mushy" topic right now, so real-world examples are much appreciated.

anantha_dommeti

Andrew, we are in the same situation due to migration using vMotion + dedupe... I wonder how it was resolved.

We need to undo SIS and run reallocate in order to reduce fragmentation... We have around 12 volumes hosting 500+ VM OS volumes that are heavily deduped, and we will run out of space in the aggregate if we undo A-SIS :( (thin on thin might work).

Reallocate measure on those volumes gives a value of 4 and suggests running a reallocate.

On top of that, we have some misaligned VMs (10%, with moderate workload) as a result of an expedited migration from a different storage vendor... It is just killing the 6080 even though the throughput/IOPS is fairly low... We are in the process of fixing those.

amiller_1

For what it's worth, you don't have to undo dedup to run reallocate. If you just run a regular reallocate, you might see larger snapshots and less SIS space savings (although it's not too bad in my experience so far). If you use the "-p" switch on the reallocate start command, that should help lower the larger snapshots/lost SIS savings issue but with some performance overhead (I'm currently trying to understand exactly how much performance impact).

An optimization level of 4 isn't horrible to be honest....I'd thought the scale only went to 10 until I saw a customer volume that came back as 14 (was seeing very high latencies on the volume when the rest of the system was fine).

anantha_dommeti

Output from reallocate measure:

Tue Apr 20 01:52:14 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_prdc6t2os_n2_vol' is 3, hotspot 29 (threshold 4), consider running reallocate.
Tue Apr 20 01:54:43 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_uatc7t3os_n1_vol' is 4, hotspot 24 (threshold 4), consider running reallocate.
Tue Apr 20 01:58:31 EDT [filer: wafl.reallocate.check.highAdvise:info]: Allocation check on '/vol/vm_prdc6t1os_n2_vol' is 4, hotspot 22 (threshold 4), consider running reallocate.
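
If you collect many of these messages, a small parser makes it easier to compare volumes over time. The sketch below assumes the message wording stays exactly as in the lines above; the regular expression and function names are my own and are not part of any NetApp tooling. It only extracts the numbers and does not try to reproduce the filer's advice logic (the first line above recommends a reallocate at level 3 with threshold 4, so the advice is evidently not a plain level-versus-threshold comparison).

```python
# Sketch: extract the numbers from "Allocation check on ..." log messages.
import re

PATTERN = re.compile(
    r"Allocation check on '(?P<path>[^']+)' is (?P<level>\d+), "
    r"hotspot (?P<hotspot>\d+) \(threshold\s+(?P<threshold>\d+)\)"
)

SAMPLE = (
    "Tue Apr 20 01:54:43 EDT [filer: wafl.reallocate.check.highAdvise:info]: "
    "Allocation check on '/vol/vm_uatc7t3os_n1_vol' is 4, hotspot 24 "
    "(threshold 4), consider running reallocate."
)


def parse_allocation_check(line):
    """Return a dict with path, level, hotspot and threshold, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "level": int(match.group("level")),
        "hotspot": int(match.group("hotspot")),
        "threshold": int(match.group("threshold")),
    }


def at_or_above_threshold(result):
    """True when the measured level has reached the configured threshold."""
    return result["level"] >= result["threshold"]


if __name__ == "__main__":
    result = parse_allocation_check(SAMPLE)
    print(result, at_or_above_threshold(result))
```

Logging the parsed level and hotspot per volume after each measure run would give a simple trend to watch alongside the latency figures.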

What does the hotspot value in the output mean?

I thought reallocate might not move/spread the deduplicated blocks in an effective manner without undoing SIS... It seems you are right. The problem is that if the snapshots grow beyond limits we will be in a mess (the aggregate is 71@300 running at 90%)... I wish we could try this at the SnapMirror location, but that will not help anyway because the structure there is different altogether.
