
Filerview Extremely slow on a FAS6030

rluthersutt

I have a FAS6030 cluster running Data ONTAP 7.2.6.1P3D8 that is exhibiting extremely poor response when working in FilerView.  Browsing to the FilerView page and logging in is very quick, but once you try to view a list of volumes, add a SnapMirror relationship, or perform similar tasks, it hangs.  I often have to select an option and leave the browser window for 5-10 minutes or more before it finally shows up and I can work.  We have also received periodic alerts in DFM indicating the filer has stopped communicating via SNMP.  Some CLI commands are also slow, including "snap list" and running df against an aggregate, although not nearly as bad as FilerView.  Only one of the head units in the cluster is experiencing this extreme slowness, even though both show similar statistics in the sysstat and statit output.

CPU performance is not horrible: it ranges between 22% and 65% when running sysstat -x 1.  In some cases it even appears as though the filer that is not experiencing the slowness is under the higher load.  Disk utilization doesn't seem to be a large problem either; no disks on the system are above 40-45% utilization.  We are using FlexShare on both heads.
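For reference, this is the sort of thing I'm running on each head to compare them (standard CLI; the interval and the statit collection window are arbitrary):

$ sysstat -x 1      # per-second CPU, protocol ops, and disk/network throughput
$ statit -b         # begin collecting per-disk statistics
                    # ...wait a minute or so under normal load...
$ statit -e         # end collection and print per-disk utilization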

If anyone has any ideas as to what could be causing these performance issues, I would greatly appreciate it.  I realize upgrading to Data ONTAP 7.3.x or higher would probably help, but that just isn't an option at the moment: we have a number of SQL Servers running older versions of SnapDrive that we have had problems upgrading, and that has put a hold on the ONTAP upgrades.

Thanks!

Bob


steve_simmons

We're seeing the same problems, though usually through the CLI. Here are the results of some timing tests from a few minutes ago. The general test process, run on a UNIX host with ssh-enabled access to a filer, was to issue the command

$ time ssh <filername> snap list <volname>

twice. Times are reported in the stock format of the UNIX time command. The overall results were consistently inconsistent: sometimes both runs were slow, sometimes both fast, sometimes one but not the other. A representative sample of result pairs and some analysis are below.
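Roughly, the wrapper looked like this; the filer name and volume list are placeholders, and this is a sketch rather than the exact script:

#!/bin/bash
# Sketch of the timing test: run 'snap list' twice per volume over ssh
# and let bash's time keyword report real/user/sys for each run.
# FILER and the volume names are placeholders, not our real ones.
FILER=src_filer
for VOL in vol_A vol_B vol_G vol_H
do
    echo "$(date '+%Y/%m/%d %H:%M:%S'): Doing $VOL"
    time ssh "$FILER" snap list "$VOL"
    time ssh "$FILER" snap list "$VOL"
done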

Version: NetApp Release 7.3.1.1: Mon Apr 20 22:58:46 PDT 2009

We ran this on all 15 volumes of a pair of servers. There seems to be no correspondence with volume size, number of snapshots, etc. Delays occurred running the commands on both src_filer and mirror_filer. There were no cases where the second query was significantly longer than the first.

Five of the volumes had been queried a few minutes before with a 'snap list volname'. Four of those five were fast on both queries; one showed a delay on the first query (20%). Of the ten that were not queried a few minutes before, five were slow on the first query (50%). Mind you, there aren't enough of these tests to be statistically significant, but they're a pretty solid lead, IMHO.

Details on a few query pairs follow.

Both queries fast (i.e., what we'd normally expect):

==========

2010/12/22 16:03:11: Doing vol_B
Volume vol_B
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
  1% ( 1%)    1% ( 1%)  Jul 14 11:01  hourly.0      
  1% ( 0%)    1% ( 0%)  Jul 14 00:00  nightly.0     
  1% ( 0%)    1% ( 0%)  Jul 13 23:01  hourly.1      

real    0m0.366s
user    0m0.020s
sys     0m0.000s
Volume vol_B
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
  1% ( 1%)    1% ( 1%)  Jul 14 11:01  hourly.0      
  1% ( 0%)    1% ( 0%)  Jul 14 00:00  nightly.0     
  1% ( 0%)    1% ( 0%)  Jul 13 23:01  hourly.1      

real    0m0.371s
user    0m0.020s
sys     0m0.000s

==========

First query slow, second fast:

==========

2010/12/22 16:00:55: Doing vol_A
Volume vol_A
working......

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 22 20:08  mirrorfiler(0101184681)_vol_A.12461 (snapmirror)
  0% ( 0%)    0% ( 0%)  Dec 22 11:00  hourly.0      
  1% ( 1%)    1% ( 1%)  Dec 22 00:00  nightly.0     
  2% ( 0%)    1% ( 0%)  Dec 21 23:01  hourly.1      

real    2m15.385s
user    0m0.010s
sys     0m0.000s
Volume vol_A
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 22 20:08  mirrorfiler(0101184681)_vol_A.12461 (snapmirror)
  0% ( 0%)    0% ( 0%)  Dec 22 11:00  hourly.0      
  1% ( 1%)    1% ( 1%)  Dec 22 00:00  nightly.0     
  2% ( 0%)    1% ( 0%)  Dec 21 23:01  hourly.1      

real    0m0.366s
user    0m0.010s
sys     0m0.010s

==========

First query slow, second slower than expected:

==========

2010/12/22 16:03:30: Doing vol_G
Volume vol_G
working.......................................................................................

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 22 20:39  src_filer(0101184645)_vol_G.11455
  0% ( 0%)    0% ( 0%)  Dec 22 19:39  src_filer(0101184645)_vol_G.11454

real    1m40.033s
user    0m0.000s
sys     0m0.000s
Volume vol_G
working.......................................................................................

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 22 20:39  src_filer(0101184645)_vol_G.11455
  0% ( 0%)    0% ( 0%)  Dec 22 19:39  src_filer(0101184645)_vol_G.11454

real    0m7.699s
user    0m0.010s
sys     0m0.010s

==========

As noted above, five of the volumes had been queried with 'snap list' a few minutes earlier. Four of those were fast on both queries; this is the one that still showed a delay on the first query:

==========

2010/12/22 16:05:17: Doing vol_H
Volume vol_H
working....

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 22 11:00  hourly.0      
  1% ( 1%)    1% ( 0%)  Dec 22 00:01  nightly.0     
  1% ( 0%)    1% ( 0%)  Dec 21 23:01  hourly.1      

real    0m28.653s
user    0m0.020s
sys     0m0.000s
Volume vol_H
working...

  %/used       %/total  date          name
----------  ----------  ------------  --------
  0% ( 0%)    0% ( 0%)  Dec 22 11:00  hourly.0      
  1% ( 1%)    1% ( 0%)  Dec 22 00:01  nightly.0     
  1% ( 0%)    1% ( 0%)  Dec 21 23:01  hourly.1      

real    0m0.385s
user    0m0.020s
sys     0m0.000s

==========
