ontap freezes on aggr snapshot

gdefevere · ‎2011-09-30

Config is a MetroCluster with 2x FAS3240 running ONTAP 8.0.1P2, with 5 shelves of 15k 600GB FC disks. We also have 512GB PAM II cards.

We saw latency spikes every hour in the OM - PA protocol latency graph. Looking deeper, we discovered that they are caused by the aggr snapshot that is taken for SyncMirror. Going a bit further with the analysis, I ran sysstat during the time frame of the aggr snapshot and we discovered a freeze of the filer: we had back-to-back CPs for 27 seconds and therefore serious latency (see sysstat output below). Connected systems are AIX over FC, Windows over iSCSI and VMware over NFS, and all of them of course felt the interruption during this period! Our aggr is only 40% used and we have 68 disk spindles (x2 for reading because of SyncMirror).

CPU NFS CIFS HTTP Total Net-in-kB/s Net-out-kB/s Disk-read-kB/s Disk-write-kB/s Tape-read-kB/s Tape-write-kB/s Cache-age Cache-hit CP-time CP-ty Disk-util OTHER FCP iSCSI FCP-in-kB/s FCP-out-kB/s iSCSI-in-kB/s iSCSI-out-kB/s

79% 2747 0 0 7520 81088 63658 92875 157089 0 0 0s 82% 100% :s 55% 0 3409 1364 71319 56361 5723 17213

81% 1747 1 0 5562 77486 24025 102560 231960 0 0 3s 96% 87% Fn 71% 0 3026 788 59242 48075 13202 7886

64% 1902 0 0 5191 61920 21241 83031 320343 0 0 0s 91% 100% :s 97% 5 2946 338 38916 46337 707 3949

79% 1578 0 0 5834 69657 21601 84776 149099 0 0 0s 87% 100% Zn 42% 0 3739 517 90449 50907 6633 8767

66% 1461 0 0 3162 35360 12700 64848 414208 0 0 0s 93% 100% :s 100% 0 1412 289 60144 23585 5301 1653

75% 2380 0 0 8100 65834 40643 46856 85196 0 0 5s 83% 100% #s 39% 0 4557 1163 132917 20372 5120 17028

72% 1914 0 0 5606 38792 15885 73208 314572 0 0 6s 90% 100% bn 82% 0 2922 770 36707 15983 12661 6417

62% 2745 1 0 6868 40748 15584 97240 297140 0 0 0s 72% 100% :s 89% 3 3598 521 41025 65620 2123 6265

70% 3395 0 0 9257 144651 40463 65710 54210 0 0 0s 90% 100% #s 27% 0 4718 1144 69526 39316 9722 17465

56% 301 1 0 3399 6935 10582 29620 0 0 0 9s 96% 100% #s 12% 0 2375 722 1374 31335 2110 4818

22% 136 0 0 2831 3222 9392 27468 48 0 0 0s 65% 100% #s 14% 0 1696 999 479 25160 360 4260

18% 268 0 0 2431 4937 5354 8964 16 0 0 11s 70% 100% #s 18% 0 1518 645 749 15322 168 2517

14% 41 0 0 1778 1826 4429 9856 0 0 0 11s 69% 100% #s 11% 3 1256 478 108 14064 172 1996

4% 19 0 0 342 1266 920 14884 48 0 0 11s 79% 100% #s 7% 0 243 80 141 11329 74 442

9% 0 0 0 461 1807 845 9272 0 0 0 11s 77% 100% #s 6% 0 416 45 88 8416 559 169

4% 0 0 0 427 405 180 7380 0 0 0 11s 74% 100% #s 10% 0 414 13 55 7257 16 45

13% 0 0 0 715 563 806 8448 64 0 0 11s 62% 100% #s 6% 192 465 58 81 4914 0 391

10% 0 0 0 976 232 2484 11188 4248 0 0 16s 64% 100% #s 25% 0 716 260 29 7557 9 1202

11% 0 0 0 616 323 433 16928 0 0 0 16s 73% 100% #s 18% 0 588 28 84 11682 0 119

3% 0 0 0 197 228 79 11876 48 0 0 16s 82% 100% #s 5% 0 190 7 13 10235 0 33

CPU NFS CIFS HTTP Total Net-in-kB/s Net-out-kB/s Disk-read-kB/s Disk-write-kB/s Tape-read-kB/s Tape-write-kB/s Cache-age Cache-hit CP-time CP-ty Disk-util OTHER FCP iSCSI FCP-in-kB/s FCP-out-kB/s iSCSI-in-kB/s iSCSI-out-kB/s

15% 0 0 0 267 406 1589 11752 0 0 0 16s 63% 100% #s 7% 0 191 76 5 8356 0 762

5% 0 0 0 327 315 467 16392 16 0 0 16s 83% 100% #s 8% 0 272 55 7 12914 0 221

17% 0 1 0 464 207 1106 16076 48 0 0 2s 55% 100% #s 8% 8 332 123 12 47344 1 524

7% 0 1 0 276 109 477 15084 0 0 0 5s 79% 100% #s 6% 0 271 4 117 10467 0 20

9% 0 0 0 240 855 225 21600 0 0 0 23s 92% 100% #s 13% 0 215 25 5 18424 4 98

5% 0 0 0 340 3218 122 16912 48 0 0 23s 85% 100% #s 8% 0 331 9 8 14441 6 41

6% 0 0 0 374 197 603 9556 16 0 0 23s 77% 100% #s 9% 0 304 70 7 7012 8 289

1% 0 0 0 35 98 50 752 0 0 0 23s 42% 100% #s 4% 3 27 5 5 45 0 20

5% 0 0 0 64 472 313 1368 48 0 0 23s 6% 100% #s 5% 0 28 36 5 16 0 147

7% 0 0 0 87 331 4244 4016 0 0 0 10s 29% 100% #s 5% 0 13 74 0 11 90 1757

5% 0 0 0 62 92 325 1140 0 0 0 10s 56% 100% #s 7% 0 24 38 82 44 1 152

2% 0 0 0 65 77 225 936 64 0 0 10s 65% 100% #s 17% 0 39 26 66 11 0 106

5% 0 0 0 32 330 52 1244 0 0 0 10s 58% 100% #s 4% 3 24 5 1 33 1 20

51% 5290 0 0 8008 74293 15583 44886 115624 0 0 8s 87% 99% bn 48% 0 2382 336 37159 2438 10377 2666

99% 3771 0 0 9380 135469 24379 63451 148182 0 0 34s 57% 100% :n 60% 0 4822 787 92249 12047 15235 5355

54% 960 0 0 3444 10253 5329 59309 368883 0 0 0s 91% 100% #n 100% 0 2182 302 4309 16145 1369 1988

40% 177 0 0 3686 8822 8721 28908 101816 0 0 0s 78% 100% #s 35% 0 3314 195 1985 25871 2104 3343

22% 102 0 0 3369 5089 303 5968 4336 0 0 0s 65% 100% #s 25% 3 3253 11 537 22568 35 45

25% 120 1 0 3107 5090 4514 6404 0 0 0 3s 63% 100% #s 8% 0 2803 183 75 19562 67 1630

75% 2742 0 0 6758 62862 17007 51733 178109 0 0 3s 84% 100% bn 49% 0 3311 705 39570 7460 9402 6532

CPU NFS CIFS HTTP Total Net-in-kB/s Net-out-kB/s Disk-read-kB/s Disk-write-kB/s Tape-read-kB/s Tape-write-kB/s Cache-age Cache-hit CP-time CP-ty Disk-util OTHER FCP iSCSI FCP-in-kB/s FCP-out-kB/s iSCSI-in-kB/s iSCSI-out-kB/s

85% 4901 1 0 11245 37277 21186 63653 333495 0 0 3s 91% 100% :s 100% 0 5565 778 50615 17568 10370 8959

76% 6309 0 0 14614 71993 41035 52008 130273 0 0 4s 81% 100% #s 68% 0 5497 2808 61456 25820 28863 16140

42% 5828 0 0 9955 10710 32847 22072 112872 0 0 5s 72% 100% #s 34% 3 2371 1753 2634 15646 2591 14199

72% 3646 0 0 7777 29749 23536 46044 163310 0 0 7s 87% 100% bn 53% 0 2754 1377 41209 8213 11391 10523

76% 4767 0 0 9253 43227 12916 63163 381066 0 0 0s 93% 100% :n 100% 4 3359 1123 47357 12763 9338 4260

85% 5233 0 0 15317 100412 38845 82104 115912 0 0 0s 78% 100% #s 58% 0 6384 3700 73959 56073 41421 15987

44% 5608 0 0 9926 11256 39865 38521 100195 0 0 9s 70% 100% #f 35% 0 2182 2136 2290 18793 2855 17516

85% 2906 0 0 7642 36207 17897 85460 299099 0 0 7s 90% 100% bn 79% 3 3483 1250 52082 36332 12693 7376

I don't think this is normal!? What could be the reason?
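For anyone who wants to measure such a freeze from captured sysstat output rather than by eye, here is a rough Python sketch. It assumes each data row has the 23 fields shown in the header above. It treats a sample as "stalled" when the CP ty column starts with `#` (as I read the sysstat man page: a CP continuing from the previous interval with the next one already queued) and total ops/s are below a threshold. The function names and the threshold of 1000 ops/s are my own assumptions, not anything from ONTAP, and the sample interval is not stated in the post, so the result is a count of samples rather than seconds.

```python
# Column names for one sysstat data row, in the order of the header above.
COLUMNS = [
    "cpu", "nfs", "cifs", "http", "total",
    "net_in", "net_out", "disk_read", "disk_write",
    "tape_read", "tape_write", "cache_age", "cache_hit",
    "cp_time", "cp_ty", "disk_util",
    "other", "fcp", "iscsi",
    "fcp_in", "fcp_out", "iscsi_in", "iscsi_out",
]


def parse_row(line):
    """Return a dict for a data row, or None for header and blank lines."""
    fields = line.split()
    # Data rows have 23 fields and start with a CPU percentage such as "79%".
    if len(fields) != len(COLUMNS) or not fields[0].endswith("%"):
        return None
    row = dict(zip(COLUMNS, fields))
    row["total"] = int(row["total"])
    return row


def longest_stall(lines, max_total_ops=1000):
    """Length (in samples) of the longest run of consecutive stalled samples.

    max_total_ops is a hypothetical threshold; tune it to your normal load.
    """
    best = current = 0
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue  # skip repeated headers and blank lines
        stalled = row["cp_ty"].startswith("#") and row["total"] < max_total_ops
        current = current + 1 if stalled else 0
        best = max(best, current)
    return best


# A few rows copied from the output above: one busy, three stalled, one busy.
sample = [
    "70% 3395 0 0 9257 144651 40463 65710 54210 0 0 0s 90% 100% #s 27% 0 4718 1144 69526 39316 9722 17465",
    "4% 19 0 0 342 1266 920 14884 48 0 0 11s 79% 100% #s 7% 0 243 80 141 11329 74 442",
    "9% 0 0 0 461 1807 845 9272 0 0 0 11s 77% 100% #s 6% 0 416 45 88 8416 559 169",
    "4% 0 0 0 427 405 180 7380 0 0 0 11s 74% 100% #s 10% 0 414 13 55 7257 16 45",
    "51% 5290 0 0 8008 74293 15583 44886 115624 0 0 8s 87% 99% bn 48% 0 2382 336 37159 2438 10377 2666",
]
print(longest_stall(sample))  # 3
```

Feeding the whole capture through `longest_stall` gives the longest stalled stretch in samples; multiply by whatever interval sysstat was run with to get seconds.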

ian_iball · ‎2012-01-18

Hi;

Did you ever get a response or answer for this...?

We have noticed the exact same behaviour, but on a non-MetroCluster setup, so no SyncMirror between the nodes. We have 2x FAS3270 and just recently have been seeing freezes at certain points in the day. We also tracked this down to an aggr snapshot.

In our situation, the default snap sched for the aggrs was left in place, so we were seeing freezes at 9AM, 2PM, 7PM and midnight.

It would be interesting to see if anyone else is seeing similar freezes.
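The freeze times above line up with an aggregate snapshot schedule of `0 1 4@9,14,19` (weekly, nightly, hourly@hours), which I believe is the 7-mode default for aggregates. A small Python sketch of how such a schedule string maps to hours of the day follows. The function name is made up for illustration, and it assumes that weekly and nightly snapshots are taken at midnight and that an hourly count without an `@` list means every hour.

```python
def aggr_snapshot_hours(sched):
    """Hours of the day (0-23) at which a schedule string triggers a snapshot.

    sched has the form "weekly nightly hourly[@h1,h2,...]".
    Assumptions: weekly and nightly snapshots run at midnight; an hourly
    count with no @list means a snapshot every hour.
    """
    weekly, nightly, hourly = sched.split()
    hours = set()
    if int(weekly) > 0 or int(nightly) > 0:
        hours.add(0)  # midnight
    count, _, hour_list = hourly.partition("@")
    if int(count) > 0:
        if hour_list:
            hours.update(int(h) for h in hour_list.split(","))
        else:
            hours.update(range(24))
    return sorted(hours)


print(aggr_snapshot_hours("0 1 4@9,14,19"))  # [0, 9, 14, 19]
print(aggr_snapshot_hours("0 0 0"))          # []
```

The first result corresponds to midnight, 9AM, 2PM and 7PM; a schedule of `0 0 0` produces no scheduled snapshots at all.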

gdefevere · ‎2012-01-18

Hello,

We are hit by a bug (bug ID 488825). The fix is scheduled for sometime in April 2012 (8.0.4Px). We can't understand why it takes so long to get a fix, but what can we do? According to NetApp, only a few customers are hit by this bug, so it gets low priority. In the meantime we are struggling to keep the system going: offloading volumes to other systems, taking snapshots only where absolutely necessary and not deleting snapshots if possible. Every minor change has an influence on the system. In the meantime we have also hit a couple of other bugs. We upgraded to new hardware (FAS3240) and use ONTAP 8.0.2P4 (latest version) for the 64-bit aggr. ONTAP 8 is full of bugs. I have only one piece of advice: if it's not necessary for your config, leave it for the moment. We never had any issue with our FAS3040 and 7.x versions.

If you don't have a MetroCluster, you can disable the aggr snapshots (this is what the ONTAP 8 docs say). There is no point in taking them outside a MetroCluster config.

Kind Regards,

Geert

CHARLESNG2 · ‎2013-04-01

We had this same issue. It turned out to be a bug in ONTAP 8.1.2. We updated to ONTAP 8.1.2P3 and it resolved our issue. (Non-MetroCluster, FAS3240)

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=393877&app=portal