ONTAP Discussions
Brief Overview
NetApp Data ONTAP 8.1.4P9
2 x FAS6020 in a FlexPod setup (configured in 2013)
Aggr Fsata0 with Flash Pool (24 volumes, 124 TB used) (NFS datastores + vFiler CIFS)
Aggr sas0 (48 TB used)
Aggr sata (50 TB)
Aggr sata (45 TB)
NetApp connected to Nexus 5K via 10 GbE (FlexPod setup)
Summary of problems: poor read/write performance | SnapMirror taking ages to complete | very low throughput
We have been experiencing bad read/write latency since the summer. We upgraded to 8.1.4P9 in September, which made the problem go away for four weeks. Typical symptoms: users can't read or write small documents, such as a 1 MB Word document, or it takes up to a minute to open them. The problem seems to be our netapp02 controller: pings from controller 1 to controller 2 show high latency, but pinging anything else on the network from controller 2 returns sub-1 ms.
SnapMirror lag times are horrendous, please see the screenshot.
NetApp can't find anything wrong in the perfstats; we're thinking it's a bug or our config is incorrect somewhere.
Please could you guys help me investigate?
sysstat -m
 ANY  AVG  CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
100%  59%  58%  61%  58%  59%  59%  61%  59%  59%
100%  53%  52%  54%  52%  51%  52%  55%  53%  52%
 99%  50%  49%  51%  49%  49%  50%  53%  50%  50%
100%  57%  56%  58%  56%  56%  57%  59%  57%  57%
 99%  51%  50%  51%  50%  50%  51%  53%  51%  51%
 99%  48%  47%  49%  47%  47%  48%  51%  48%  48%
 99%  51%  49%  52%  49%  49%  51%  54%  51%  50%
 99%  50%  48%  51%  48%  49%  50%  52%  49%  49%
 99%  50%  49%  52%  49%  48%  50%  52%  50%  49%
 99%  48%  47%  50%  47%  48%  48%  51%  48%  49%
 99%  52%  52%  53%  52%  52%  52%  54%  52%  52%
 99%  51%  49%  52%  50%  50%  51%  54%  51%  51%
100%  53%  52%  54%  51%  52%  52%  56%  52%  53%
100%  53%  52%  55%  52%  53%  53%  56%  54%  53%
 99%  49%  48%  51%  48%  49%  49%  52%  49%  49%
Hi,
I would say the aggregate with SATA disks + Flash Pool is slowing your system down a bit, as it's the most heavily utilized aggregate based on the statit output you provided.
Flash Pool caches reads and random overwrites (operations smaller than 16 KB). That makes sense for CIFS shares with many small files, but not for datastores.
Datastores usually need quick read response, since the operating systems of your VMs live on those volumes. For that purpose, Flash Cache is the better solution (ideally with dedupe enabled).
Flash Cache is PCIe based, so it's much faster than accessing SSDs at the disk layer, which is what Flash Pool does.
Also, with Flash Pool, all hot blocks (the most frequently accessed blocks) need to be written to the SSDs during a consistency point. That means you have to wait for those writes to complete before you get the benefit of the cache.
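To illustrate that caching policy: the 16 KB random-overwrite cutoff is the one described above, but the function and its names are just my sketch, not a real ONTAP API.

```python
# Rough sketch of the Flash Pool caching behaviour described above.
# The 16 KB random-overwrite cutoff comes from the post; the function
# and its parameter names are illustrative, not an ONTAP interface.

def flash_pool_caches(op: str, size_bytes: int,
                      is_random: bool, is_overwrite: bool) -> bool:
    """Return True if this I/O would be cached on the Flash Pool SSDs."""
    if op == "read":
        # reads are eligible for read caching
        return True
    if op == "write":
        # only small random overwrites are write-cached
        return is_random and is_overwrite and size_bytes < 16 * 1024
    return False

# A 4 KB random overwrite (typical CIFS small-file churn) qualifies;
# a 64 KB sequential write (typical datastore streaming I/O) does not.
print(flash_pool_caches("write", 4096, True, True))
print(flash_pool_caches("write", 65536, False, False))
```

That's why CIFS workloads with lots of small files benefit much more than large sequential datastore traffic.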
I would recommend (if you have Flash Cache installed) creating a new datastore on the aggregate with SAS disks and migrating some heavily loaded VMs there.
Another thing that isn't best practice: you are mixing disk types on one controller. If you have SATA and SAS disks on the same controller, you slow down the consistency point, because it still has to wait for the SATA disks to complete their writes.
That's just my opinion 🙂
We have a 512 GB Flash Cache card.
I understand your view on keeping the same disk type per controller, but wouldn't that impact our N+1 redundancy? (That's what we were sold.)
Do you have compression or deduplication jobs which are running at the same time as your snapmirrors? How full are your aggregates? Have you done a reallocate measure to check for noncontiguous free space?
Hi @asulliva
We need to find out why it's taking so long to do a SnapMirror (the WAN is not the problem).
Do you have compression or deduplication jobs which are running at the same time as your SnapMirrors?
We have SnapMirrors running continuously, and dedupe runs concurrently with them, if it runs at all.
Path                                               State     Status  Progress
/vol/vol_188_data188_T3_01                         Enabled   Idle    Idle for 93:49:11
/vol/vol_dart_data                                 Disabled  Idle    Idle for 2950:24:43
/vol/vol_188_documentumsql_backup                  Disabled  Idle    Idle for 3102:04:50
/vol/vol_188_188cvma                               Disabled  Idle    Idle for 2943:03:06
/vol/vol_188direct_pool_1_j                        Disabled  Idle    Idle for 2950:23:22
/vol/vol_188direct_n                               Disabled  Idle    Idle for 2950:22:11
/vol/vol_188direct_pool2_d                         Disabled  Idle    Idle for 2949:41:44
/vol/vol_188direct_pool3_e                         Disabled  Idle    Idle for 2950:03:09
/vol/vol_188direct_q                               Disabled  Idle    Idle for 2950:21:51
/vol/vol_188direct_r                               Disabled  Idle    Idle for 2950:16:06
/vol/vol_188icedblive_s                            Disabled  Idle    Idle for 2942:49:07
/vol/vol_188icedblive_r                            Disabled  Idle    Idle for 2949:30:52
/vol/vol_188_dss_clust                             Enabled   Idle    Idle for 66:35:05
/vol/vol_188_dss_rdm_map                           Enabled   Idle    Idle for 43:32:23
/vol/vol_vfiler_medical_records_images0            Disabled  Idle    Idle for 2933:17:27
/vol/vol_vfiler_188doccache_cache                  Disabled  Idle    Idle for 2934:03:20
/vol/vol_188_dss_clust_file188                     Enabled   Idle    Idle for 14:53:40
/vol/vol_188_vfilercifs_departments_01             Enabled   Idle    Idle for 120:30:15
/vol/vol_188_vfilercifs_applications_01            Enabled   Idle    Idle for 20:47:53
/vol/vol_vfiler_records_images1                    Disabled  Idle    Idle for 2944:24:43
/vol/vol_vfiler_records_images2                    Disabled  Idle    Idle for 2944:24:43
/vol/vol_188_vfilercifs_backups_01                 Disabled  Idle    Idle for 3120:24:42
/vol/vol_188_vfilercifs_users_01                   Enabled   Idle    Idle for 20:42:47
/vol/vol_188_data188_T2_01                         Enabled   Idle    Idle for 12:23:52
/vol/vol_vfiler_images0_test                       Disabled  Idle    Idle for 2944:24:43
/vol/vol_vfiler_images1_train                      Disabled  Idle    Idle for 2944:23:40
/vol/vol_vfiler_medical_records_dart_images2_live  Disabled  Idle    Idle for 2944:23:43
/vol/vol_188_data188_T2_04                         Disabled  Idle    Idle for 3076:43:44
/vol/vol_vfiler_retinal_image188                   Disabled  Idle    Idle for 2943:28:41
/vol/vol_188_data188_T4_01                         Enabled   Idle    Idle for 165:00:53
/vol/vol_188_data188_T4_03                         Enabled   Idle    Idle for 148:12:06
/vol/vol_188_vfilercifs_archive_01                 Disabled  Idle    Idle for 2941:37:40
/vol/vol_188_vfilercifs_backups_02                 Disabled  Idle    Idle for 3123:58:09
/vol/vol_188_data188_T4_06                         Disabled  Idle    Idle for 3076:43:54
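For context, those "Idle for" counters are hours:minutes:seconds, so 2950:24:43 is roughly 123 days. A quick conversion sketch (plain Python, nothing ONTAP-specific; the sample values are taken from the output above):

```python
# Convert "Idle for HHHH:MM:SS" counters from the status output above
# into days, to see at a glance which jobs have effectively never run.

def idle_days(hms: str) -> float:
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return (hours + minutes / 60 + seconds / 3600) / 24

for vol, idle in [
    ("/vol/vol_188_data188_T2_01", "12:23:52"),    # ran recently
    ("/vol/vol_188_data188_T3_01", "93:49:11"),    # about 4 days ago
    ("/vol/vol_dart_data", "2950:24:43"),          # about 123 days: stale
]:
    print(f"{vol}: idle {idle_days(idle):.1f} days")
```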
How full are your aggregates?
A NetApp consultant from Neos said in a health check that we go up to 95% utilisation on a large aggregate; we are currently at 90% on Fsata0.
Have you done a reallocate measure to check for noncontiguous free space?
No, we haven't done this.
netapp02> sysstat -x 2
 CPU    NFS   CIFS   HTTP   Total    Net kB/s     Disk kB/s    Tape kB/s  Cache  Cache    CP  CP  Disk  OTHER   FCP  iSCSI   FCP kB/s   iSCSI kB/s
                                      in    out    read  write  read write   age    hit  time  ty  util                       in   out     in   out
 93%   7703   7416      0  15438   24008 114883  156912  66706     0     0     1    95%   69%  :    32%      3   316      0   155  2978      0     0
 94%   5332   6664      0  12511   24394 113986  191314  70969     0     0     1    96%   21%  Hn   45%      0   515      0   663  4527      0     0
 93%   6986   8064      0  15282   27105 101081  151629 119054     0     0    0s    94%  100%  :f   32%      0   232      0   120  2354      0     0
 91%   6623   7459      0  14320   24715 133290  197996 116298     0     0    0s    92%  100%  :f   31%      4   234      0   187  2642      0     0
 95%   6043   6910      0  13296   33680 123154  181144   9527     0     0   36s    95%   29%  Hn   36%      0   321      0   212  2306      0     0
 94%   4993   8246      0  13591   27255  66921  144321 145748     0     0    0s    96%  100%  :f   38%      2   350      0   187  3107      0     0
 91%   5346  10869      0  17902   25553  71091  155638 155790     0     0    0s    95%  100%  :v   35%      0  1687      0 14114  1678      0     0
 92%   4914   8204      0  13328   42350  78161  184114 119610     0     0     1    96%   46%  Hs   44%      0   210      0   200  1706      0     0
 84%   5295   7032      0  12760   26394  73475  142920 132128     0     0     1    95%  100%  :f   33%      4   407      0   292  3103      0     0
 88%   6525   8547      0  15443   38743  90686  115278  31376     0     0     1    96%   35%  :    43%    130   241      0   411  2083      0     0
 92%   6419   8500      0  15200   47308 118509  193627  80080     0     0     1    95%   18%  Hn   37%      2   279      0   598  2113      0     0
 92%   5533   8235      0  13942   20524  86481  170934 168575     0     0     1    93%  100%  :f   41%      0   174      0   133  1710      0     0
 92%   8000   6383      0  14973   48441 133375  165988  70304     0     0   47s    95%   66%  :    32%      0   590      0   632  4876      0     0
 97%   7145   6787      0  14285   30185 101420  175611  80692     0     0     1    97%   26%  Hs   42%      2   351      0   301  2997      0     0
 91%   6967   6911      0  14224   46559  96112  152779 122749     0     0    0s    94%  100%  :f   39%      0   346      0   645  2594      0     0
 88%   7634   7319      0  15310   28505 126086  160210 101060     0     0     1    95%  100%  :f   37%      2   355      0   198  3279      0     0
 97%   7885   6401      0  14636   42104 211693  299165 134110     0     0    0s    95%   58%  Hs   50%      3   347      0  1081  2406      0     0
 98%   7764   9916      0  18045  100886  88082  220507 176332     0     0   48s    93%  100%  :f   57%      1   364      0 31110 32819      0     0
 99%   8120   7809      0  16458   58469 115699  300433 213909     0     0     1    96%   99%  Zs   63%      5   524      0 42626 45206      0     0
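Counting the samples above where the CP time column sits at 100% (which, as I understand it, means the controller spent the whole interval inside consistency points, usually a sign of write/disk pressure):

```python
# Tally how many of the sysstat -x samples above spent the full
# interval in a consistency point (CP time = 100%). The values are
# the "CP time" column copied from the output.

cp_time = [69, 21, 100, 100, 29, 100, 100, 46, 100, 35,
           18, 100, 66, 26, 100, 100, 58, 100, 99]

saturated = sum(1 for t in cp_time if t >= 100)
print(f"{saturated} of {len(cp_time)} samples at 100% CP time "
      f"({100 * saturated / len(cp_time):.0f}%)")
```

Nearly half the samples are fully inside a CP, which seems to back up the write-pressure theory.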
Any advice, guys?
Do you think we need to stagger our SnapMirror schedule? All volumes start SnapMirror operations every 15 minutes.
Thanks
Umar
There are too many variables to really narrow it down. The snippet of Grafana output shows some pretty high disk utilization, so I'd start with that. Try disabling some tasks: is it OK to disable both SnapMirror and dedupe for a time and see whether performance returns to an acceptable level for the clients? If not, try disabling one or the other and see how it affects performance.
Reducing the frequency of SnapMirror jobs could help, and so could alternating dedupe and SnapMirror so that they aren't both running at the same time. You said the WAN isn't an issue, but you're averaging over 37 MB of data coming into the system every second (Net kB/s in); if you're replicating all of that data, then you need a WAN pipe that can support at least that much bandwidth (> 300 Mb/s). Check the SnapMirror transfer sizes to help determine how much bandwidth each volume needs, and divide that by how much throughput is available to determine your windows.
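To make that arithmetic concrete: 37 MB/s of ingest is 37 x 8 = 296 Mb/s, which is where the > 300 Mb/s figure comes from. A back-of-envelope sketch for sizing the replication windows (the transfer size and link speed below are made-up examples, not numbers from your system):

```python
# Back-of-envelope SnapMirror window math. The transfer size and WAN
# speed are example values; substitute your own from the last-transfer
# sizes in your SnapMirror status output and your actual link speed.

def transfer_minutes(transfer_mb: float, wan_mbit_per_s: float) -> float:
    """Minutes to replicate one transfer at full WAN utilization."""
    seconds = transfer_mb * 8 / wan_mbit_per_s
    return seconds / 60

# e.g. a 10 GB delta over a 300 Mb/s link:
print(f"{transfer_minutes(10_000, 300):.1f} min")  # 4.4 min
```

If the sum of those per-volume windows exceeds your 15-minute schedule interval, the transfers will start stacking up, which matches the lag you're seeing.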
Do a reallocate measure on the volumes to determine whether reallocation would help. The chains in your statit output are OK (not great, but not terrible either); it might be worth doing a reallocate measure on the aggregates as well. Be aware that reallocate consumes some I/O, so it could add latency if latency is already bad.
If you haven't opened a support case, I would do so. Reach out to your account team to have them escalate if needed.
Andrew