lun latency increasing

I have a LUN that holds the SQL data file, and it is showing increased LUN latency.

The data is from this month's storage report, which shows LUN latency increasing for the database data file LUNs.  Past experience has shown that users are affected at about the 11 ms mark, and action should be taken before latency reaches that point.  A NetApp support case has been opened to address the performance issue.

How do I show whether this is a caching or a fragmentation issue?  I have been looking through the perfstat reports and everything looks healthy apart from the latency.  I am not sure what "cp_dirty_allocation_blks" is, but it is 1000+.

Read Write   Read  Write Average   Queue  Lun
  Ops   Ops     kB     kB Latency  Length
    0     0     32      0    7.54    0.07 /vol/sqf02/diskf.lun
    0     0     37      0    7.00    1.00 /vol/sqf02/diske.lun


Read Write   Read  Write Average   Queue  Lun
  Ops   Ops     kB     kB Latency  Length
    1     0     80      0   11.31    0.08 /vol/sqf02/diskf.lun
    2     0    123      0   10.96    0.08 /vol/sqf02/diske.lun

CPU   NFS  CIFS  HTTP   Total    Net kB/s   Disk kB/s     Tape kB/s Cache Cache  CP   CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
                                  in   out   read  write  read write   age   hit time  ty util                 in   out    in   out
20%     0  1541     0    2140   376  5881  12354   7140     0     0    19   94%  16%  F   28%    595     2  9228  5290     1     0
26%     0  1162     0    2172   498  3993  14646  20264     0     0    18   95%  39%  3f  31%   1006     2 18393  8112     1     0

lun:sqf02/diske.lun-XXXXZZZZZ:display_name:/vol/sqf02/diske.lun
lun:sqf02/diske.lun-XXXXZZZZZ:read_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:write_ops:2/s
lun:sqf02/diske.lun-XXXXZZZZZ:other_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_data:36798b/s
lun:sqf02/diske.lun-XXXXZZZZZ:write_data:19872b/s
lun:sqf02/diske.lun-XXXXZZZZZ:queue_full:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:avg_latency:22.17ms   <-------------------- Why?
lun:sqf02/diske.lun-XXXXZZZZZ:total_ops:3/s
lun:sqf02/diske.lun-XXXXZZZZZ:scsi_partner_ops:0/s
lun:sqf02/diske.lun-XXXXZZZZZ:scsi_partner_data:0b/s
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.0:98%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.2:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.3:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.4:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.5:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.6:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.7:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.0:86%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.1:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.2:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.3:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.4:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.5:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.6:0%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.7:0%
lun:sqf02/diske.lun-XXXXZZZZZ:read_partial_blocks:1%
lun:sqf02/diske.lun-XXXXZZZZZ:write_partial_blocks:13%
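
For anyone who wants to check counters like these automatically, here is a minimal Python sketch. It assumes the `lun:<instance>:<counter>:<value>` layout shown above and that the instance name contains no colon. The function names and the `USER_IMPACT_MS` constant are made up for illustration (the 11 ms figure is the one quoted earlier in this post), and reading bucket 0 of the alignment histograms as "aligned I/O" is an assumption, not something confirmed in this thread.

```python
import re

# A few counter lines copied from the output above.
SAMPLE = """\
lun:sqf02/diske.lun-XXXXZZZZZ:display_name:/vol/sqf02/diske.lun
lun:sqf02/diske.lun-XXXXZZZZZ:avg_latency:22.17ms
lun:sqf02/diske.lun-XXXXZZZZZ:read_align_histo.0:98%
lun:sqf02/diske.lun-XXXXZZZZZ:write_align_histo.0:86%
lun:sqf02/diske.lun-XXXXZZZZZ:read_partial_blocks:1%
lun:sqf02/diske.lun-XXXXZZZZZ:write_partial_blocks:13%
"""

USER_IMPACT_MS = 11.0  # latency at which users are affected, per the post


def parse_lun_stats(text):
    """Return {counter_name: raw_value} for every 'lun:' line in text."""
    counters = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("lun:"):
            continue
        # Split into: 'lun', instance, counter, value (value may contain ':').
        parts = line.split(":", 3)
        if len(parts) != 4:
            continue
        counters[parts[2]] = parts[3]
    return counters


def to_number(value):
    """Extract the leading number from values such as '22.17ms' or '86%'."""
    match = re.match(r"[-+]?\d+(?:\.\d+)?", value)
    return float(match.group()) if match else None


stats = parse_lun_stats(SAMPLE)
latency_ms = to_number(stats["avg_latency"])
aligned_writes_pct = to_number(stats["write_align_histo.0"])
partial_writes_pct = to_number(stats["write_partial_blocks"])

if latency_ms > USER_IMPACT_MS:
    print(f"avg_latency {latency_ms} ms is above the {USER_IMPACT_MS} ms mark")
print(f"writes in bucket 0: {aligned_writes_pct}%, partial writes: {partial_writes_pct}%")
```

A script like this only flags the symptom; it does not say whether caching or fragmentation is the cause.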

Thanks if you know the answer.

Bren

Re: lun latency increasing

Sorry, here is a better image of the graph...

Re: lun latency increasing

Hi Bren,

I'm not saying this is definitely a case of fragmentation, but can you check the LUN in question for it first?

reallocate measure [-l logfile] [-t threshold] [-i interval] [-o] pathname | /vol/volname
Start a measure-only reallocation on the LUN, large file or volume.
A measure-only reallocation job is similar to a normal reallocation job except that only the check phase is performed. This allows the optimization of the LUN, large file or volume to be tracked over time, or measured ad-hoc.

Regards,

Radek

Re: lun latency increasing

What sort of load does it put on the filer? Can I run it during the day? I have a SnapMirror of this volume. Can I run it at the remote site and still get a valid result?

Thanks

Bren

Re: lun latency increasing

Yeah, as per this thread (which you're already familiar with) reallocation is rather poorly documented:

http://communities.netapp.com/message/20969#20969

My (informed) guesses:

- reallocate should be run on the original LUN, not its mirror: the logical-to-physical layout may be different at the destination, so different results are likely, in my opinion

- there is arguably some additional load on the filer (actual reads are performed), so running reallocate outside of peak hours seems a reasonable approach

Regards,
Radek

Re: lun latency increasing

If 1 is good and 5 is very bad:

"Allocation check on '/vol/test/diske.lun' is 5, hotspot 0 (threshold 4), consider running reallocate."

I think that is a fair clue as to what is wrong....
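
If you log these measurements over time, a small parser makes them easier to compare. This is a rough Python sketch that assumes the message always has the exact wording quoted above; the function name and the returned dictionary keys are invented for illustration. It treats a level above the threshold as "needs reallocation", which matches this message (5 against 4); whether a level equal to the threshold also counts is not stated in this thread.

```python
import re

# Pattern built from the single message quoted above (assumed to be stable).
ALLOCATION_CHECK = re.compile(
    r"Allocation check on '(?P<path>[^']+)' is (?P<level>\d+), "
    r"hotspot (?P<hotspot>\d+) \(threshold\s+(?P<threshold>\d+)\)"
)


def parse_allocation_check(message):
    """Return path, level, hotspot and threshold, or None if no match."""
    match = ALLOCATION_CHECK.search(message)
    if match is None:
        return None
    return {
        "path": match.group("path"),
        "level": int(match.group("level")),
        "hotspot": int(match.group("hotspot")),
        "threshold": int(match.group("threshold")),
    }


message = (
    "Allocation check on '/vol/test/diske.lun' is 5, hotspot 0 "
    "(threshold 4), consider running reallocate."
)
result = parse_allocation_check(message)

# Assumption: a level strictly above the threshold means "reallocate".
needs_reallocate = result["level"] > result["threshold"]
print(result["path"], "level", result["level"], "reallocate:", needs_reallocate)
```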

Thanks

Bren

Re: lun latency increasing

The actual scale goes up to 10:

http://now.netapp.com/NOW/knowledge/docs/ontap/rel732_vs/html/ontap/cmdref/man1/na_reallocate.1.htm

The threshold when a LUN, file or volume is considered unoptimized enough that a reallocation should be performed is given as a number from 3 (moderately optimized) to 10 (very unoptimized). [...] The default threshold is 4.

Having said that, I've heard stories from people who got fairly low numbers during measurement, yet saw their performance improve vastly when they actually ran the reallocation.

Regards,
Radek

Re: lun latency increasing

Hi Bren,

Did you by any chance manage to verify that fragmentation was indeed the issue?

Did you run the actual reallocation, and did it reduce latency?

Kind regards,

Radek

Re: lun latency increasing

Still working on the issue with TSE. Have not found the root problem yet. Will post back with solution once it has been discovered.

Bren

Re: lun latency increasing

TSE have said to run reallocate against the LUNs.  Because of change control, I have to wait until the 7th of April to get the results.  Will post back if it is a success.

Thanks all for the help

Bren

Re: lun latency increasing

Hi Bren,

Many thanks for posting the update - fingers crossed for the positive outcome on/after the 7th of April!

Regards,
Radek

Re: lun latency increasing

Early results are in.  Reallocating the LUNs in the volume reduced LUN latency for SQL Server by 20% on my system!  It is still too early to know whether the process was a success, but early results look very good.

A total of 7 LUNs were reallocated on a single aggregate of 56 x 300 GB 15k disks on a FAS3070:

E:  134 GB took 26 min

F:  135 GB took 13 min

G:  100 GB took 26 min

H:  98 GB took 15 min

J:  185 GB took 12 min

K:  100 GB took 37 min - TL

I:  395 GB took 7 min - TL
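
To help with planning a change window, the timings above can be turned into rough per-LUN rates. This is only a back-of-the-envelope Python sketch using the figures in this post; the variable names are illustrative, and the total assumes the jobs ran one after another, which the post does not state.

```python
# (size in GB, duration in minutes) per drive letter, from the list above.
reallocations = {
    "E": (134, 26),
    "F": (135, 13),
    "G": (100, 26),
    "H": (98, 15),
    "J": (185, 12),
    "K": (100, 37),
    "I": (395, 7),
}

# GB processed per minute for each LUN.
rates = {lun: size / minutes for lun, (size, minutes) in reallocations.items()}

for lun, rate in sorted(rates.items(), key=lambda item: item[1]):
    size, minutes = reallocations[lun]
    print(f"{lun}: {size} GB in {minutes} min = {rate:.1f} GB/min")

total_gb = sum(size for size, _ in reallocations.values())
# Assumption: jobs ran sequentially, so the durations simply add up.
total_minutes = sum(minutes for _, minutes in reallocations.values())
print(f"Total: {total_gb} GB in {total_minutes} min")
```

The spread between the slowest and fastest LUN is large, so size alone is a poor predictor of run time; presumably it depends more on how badly each LUN was laid out.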

I did not notice any issues with extra I/O or CPU load on the filer during the work.  Still waiting to find out how big the snapshots will be.

Hope this post helps you plan your own reallocation work.

Bren

Re: lun latency increasing

Hi,

Many thanks for posting the results. I reckon many folks will learn a nice lesson based on your experience (that includes some NetApp chaps who, hmm, tend to forget that fragmentation may be an issue).

Re snapshots growing - if they didn't balloon straight after the reallocate run, you are completely safe in my opinion.

Regards,

Radek

Re: lun latency increasing

Care to make a wager? The next snap will run at 7 pm tonight. I know what the average size is, and my guess is that this one will be about 5% bigger than normal.

Bren

Re: lun latency increasing

After 24 hours the LUN latency is still 20% lower on average, and the snapshot size difference was negligible.  So a FREE upgrade.

I recommend you try it.  We are going to wait a month to confirm the results and then look into trying it on other SQL servers.

Bren

Re: lun latency increasing

Hi Bren,

To make the story complete - did you run reallocate with the -p option?

I reckon that was the case because your snapshots didn't grow, but just double-checking...

Regards,
Radek
