ONTAP Hardware
I'm currently dealing with a customer who has a FAS2220 HA pair running ONTAP 8.2.2 7-Mode. Unfortunately, it was installed incorrectly (he contacted my company to try to fix it): all data is stored in the root aggregate, ONTAP needs to be upgraded, shares and exports need to be redone, snapshots are disabled (and would be ineffective anyway because everything is in aggr0), and the vifs need to be redone. The list of problems goes on. They are in the process of purchasing a new unit, so we're waiting to do most of the work until it arrives; then they're going to move this unit to DR. Until then, he's worried about the read/write delays the system is having now.
Cabling looks ok. It's run out to 2 Cisco 2960s for redundancy. Only issue with the switches is that they are showing certain ports flapping (our network guy has the information on that).
The ifgrps seem to really need some work (same setup on both controllers): vif1 - e0a, e0c. vif2 - e0b, e0d. e0M has been disabled (even though it's cabled). vif1 and vif2 then seem to be teamed up as svif; that is, the two vifs are bonded into a separate vif. This svif is what's doing all the sending/receiving/management. I have no idea how someone did this, but let me know if you've seen it before.
So the customer is wanting me to solve the latency issue. Is it worth it to recreate all the ifgrps in hopes that solves the problem? He can't afford to have unscheduled downtime (it's a 24-7 company). Should we just wait another couple weeks until his new equipment is in and we can just reset everything?
Looking for any advice/guidance you have. Thanks.
A couple of things..
It is very common NOT to have a dedicated root aggregate on smaller systems. A dedicated root aggregate is not a requirement in 7-Mode, and not having one does not limit functionality in any way.
The multi-tier VIF/IFGRP that you see is also very common in smaller environments that do not have stacked switches.
For example:
Ports e0a and e0c are part of an LACP bond with both connections going to the same switch (vif1)
Ports e0b and e0d are part of an LACP bond with both connections going to the other switch (vif2)
vif1 and vif2 are then placed into an active/passive (single-mode) ifgrp called "svif".
So at any given time traffic is on one switch or the other.
Without stackable switches, this is the only way to provide link aggregation AND switch redundancy. The flapping could be the result of a misconfiguration on the switch or the NetApp. We would need to see the /etc/rc files from the controllers to determine exactly what is going on.
In regards to the performance..
What does 'sysstat -x 1' show? Are CPU or Disk Util % high? I would consider anything above 75% to be cause for concern or at least a place to start looking.
If the network is flapping, I suspect this also may have something to do with the performance issues.
Thanks for the help. I'll post both controllers' sysstat and /etc/rc. Controller NetApp1 doesn't seem to be having any problems, but disk utilization on NetApp2 is hitting 100%.
Controller 1 sysstat
NetApp1> sysstat -x 1
 CPU  NFS  CIFS  HTTP  Total     Net kB/s      Disk kB/s   Tape kB/s  Cache  Cache    CP  CP  Disk  OTHER  FCP  iSCSI  FCP kB/s  iSCSI kB/s
                                 in   out   read   write  read write    age    hit  time  ty  util                      in  out    in   out
  1%    0     0     0      0      5     1      0       0     0     0     47   100%    0%   -    0%      0    0      0    0    0     0     0
  1%    0    11     0     11      6    21     52      24     0     0     47    95%    0%   -    3%      0    0      0    0    0     0     0
  2%    0   239     0    244     73   358    204       0     0     0     47    92%    0%  T0    3%      5    0      0    0    0     0     0
  1%    0    48     0     48     16    72    796    1412     0     0     47    99%   19%   :   15%      0    0      0    0    0     0     0
  1%    0    10     0     10      5     2     28      24     0     0     47   100%    0%   -    9%      0    0      0    0    0     0     0
  1%    0     5     0      5      5     2     20       0     0     0     47   100%    0%   -   27%      0    0      0    0    0     0     0
  1%    0    15     0     15      5     3      0       0     0     0     47      -    0%   -    0%      0    0      0    0    0     0     0
  1%    0    55     0     59     15     9      8      24     0     0     47    99%    0%   -    3%      4    0      0    0    0     0     0
  1%    0    20     0     20      6     3     16       8     0     0     47   100%    0%   -    2%      0    0      0    0    0     0     0
  1%    0    16     0     16      7     3      8       0     0     0     47   100%    0%   -    4%      0    0      0    0    0     0     0
  1%    0     0     0      0      4     0      8      24     0     0     47   100%    0%   -    3%      0    0      0    0    0     0     0
  1%    0    58     0     58     16     9     16       0     0     0     47   100%    0%   -   17%      0    0      0    0    0     0     0
  1%    0     2     0     17      4     1      0       0     0     0     47   100%    0%   -    0%     15    0      0    0    0     0     0
  1%    0     0     0      0      7     0    780    1420     0     0     47   100%   18%   T   20%      0    0      0    0    0     0     0
  1%    0    37     0     37     11     6     28       0     0     0     47    99%    0%   -    9%      0    0      0    0    0     0     0
  1%    0     6     0      6      3     1      0       0     0     0     47   100%    0%   -    0%      0    0      0    0    0     0     0
  1%    0     4     0      4      7     1      8      24     0     0     47   100%    0%   -    3%      0    0      0    0    0     0     0
  2%    0     0     0      5      8     1     16       0     0     0     47   100%    0%   -   25%      5    0      0    0    0     0     0
  1%    0     0     0      0      5     0      0       8     0     0     47   100%    0%   -    1%      0    0      0    0    0     0     0
 CPU  NFS  CIFS  HTTP  Total     Net kB/s      Disk kB/s   Tape kB/s  Cache  Cache    CP  CP  Disk  OTHER  FCP  iSCSI  FCP kB/s  iSCSI kB/s
                                 in   out   read   write  read write    age    hit  time  ty  util                      in  out    in   out
  1%    0     0     0      0      4     0     16      24     0     0     47    88%    0%   -    3%      0    0      0    0    0     0     0
  1%    0     0     0      0      4     1     16       0     0     0     47   100%    0%   -    4%      0    0      0    0    0     0     0
Controller 2 sysstat
NetApp2> sysstat -x 1
 CPU  NFS  CIFS  HTTP  Total     Net kB/s      Disk kB/s   Tape kB/s  Cache  Cache    CP  CP  Disk  OTHER  FCP  iSCSI  FCP kB/s  iSCSI kB/s
                                 in   out   read   write  read write    age    hit  time  ty  util                      in  out    in   out
 14%    0   214     0    214  14156   485   1912  105296     0     0     3s    93%  100%  :f  100%      0    0      0    0    0     0     0
 26%    0  1169     0   1172  77341  1980   4668   24440     0     0     3s    73%   39%  Hf   68%      3    0      0    0    0     0     0
 11%    0   179     0    183  11923   763   3248   86436     0     0     3s    93%  100%  :f  100%      4    0      0    0    0     0     0
 17%    0   990     0    990  64805  1409   3924    7792     0     0     3s    53%   29%   :   74%      0    0      0    0    0     0     0
 17%    0   382     0    382  24533  1431   4000  107284     0     0     3s    79%   92%  Hf   95%      0    0      0    0    0     0     0
 25%    0  1172     0   1172  76275  2525   5720   12864     0     0     3s    82%   44%  Hn   74%      0    0      0    0    0     0     0
 14%    0   164     0    164  10882   536   2572  104812     0     0     3s    94%  100%  :f  100%      0    0      0    0    0     0     0
 12%    0   614     0    618  40395  1350   2420   11180     0     0     3s    77%   40%   :   63%      4    0      0    0    0     0     0
 22%    0   662     0    662  43514  1198   3456   35056     0     0     3s    90%   32%  Hf   50%      0    0      0    0    0     0     0
  7%    0    33     0     33   2191    51   1404   82424     0     0     3s    98%   93%   :   89%      0    0      0    0    0     0     0
  1%    0     0     0      0      6     2      0       0     0     0     3s   100%    0%   -    0%      0    0      0    0    0     0     0
  1%    0     0     0      0      4     0      0       0     0     0     3s   100%    0%   -    0%      0    0      0    0    0     0     0
  1%    0     0     0      4      3    30     16      32     0     0     3s    98%    0%   -    5%      4    0      0    0    0     0     0
  5%    0   252     0    252  16464   817    928       0     0     0     3s    71%    0%   -   16%      0    0      0    0    0     0     0
 26%    0  1088     0   1088  71821  1937   5300   42096     0     0     3s    78%   42%  Hf   70%      0    0      0    0    0     0     0
  8%    0   263     0    263  17460  1119   2236   69128     0     0     3s    51%  100%  :f   99%      0    0      0    0    0     0     0
 25%    0  1049     0   1049  68698  2569   4556   18928     0     0     9s    76%   35%  Hn   70%      0    0      0    0    0     0     0
 13%    0   147     0    152   9746  1132   2436   91668     0     0     9s    94%  100%  :f  100%      5    0      0    0    0     0     0
 14%    0   784     0    784  52038  1880   3056   10964     0     0     9s    64%   36%   :   73%      0    0      0    0    0     0     0
 CPU  NFS  CIFS  HTTP  Total     Net kB/s      Disk kB/s   Tape kB/s  Cache  Cache    CP  CP  Disk  OTHER  FCP  iSCSI  FCP kB/s  iSCSI kB/s
                                 in   out   read   write  read write    age    hit  time  ty  util                      in  out    in   out
 22%    0   497     0    497  32938  1624   4052   74340     0     0     9s    95%   79%  Hf   96%      0    0      0    0    0     0     0
 10%    0   595     0    595  28364  1648   2788   42392     0     0     9s    59%   71%   :  100%      0    0      0    0    0     0     0
 17%    0   900     0    900  50004  2389   3956   31589     0     0     9s    77%   26%  Hn   84%      0    0      0    0    0     0     0
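The 75% rule of thumb suggested earlier in the thread can be applied mechanically to output like the above. What follows is only a minimal sketch: it assumes the 23-column `sysstat -x` layout shown in this thread and whitespace-separated fields, and the names `COLUMNS`, `parse_row` and `busy_rows` are made up for illustration. The three sample rows are copied from the NetApp2 output.

```python
# Column order of a `sysstat -x` data row as it appears in this thread (23 fields).
# Assumption: other ONTAP releases or licensed protocols may change the layout.
COLUMNS = [
    "cpu", "nfs", "cifs", "http", "total", "net_in", "net_out",
    "disk_read", "disk_write", "tape_read", "tape_write",
    "cache_age", "cache_hit", "cp_time", "cp_type", "disk_util",
    "other", "fcp", "iscsi", "fcp_in", "fcp_out", "iscsi_in", "iscsi_out",
]
THRESHOLD = 75  # percent; the "cause for concern" level suggested in the thread


def parse_row(line):
    """Return a dict for a data row, or None for header or malformed lines."""
    fields = line.split()
    if len(fields) != len(COLUMNS) or not fields[0].endswith("%"):
        return None
    return dict(zip(COLUMNS, fields))


def pct(value):
    """Convert a field such as '68%' to the integer 68."""
    return int(value.rstrip("%"))


def busy_rows(lines, threshold=THRESHOLD):
    """Return the rows whose CPU or disk utilization exceeds the threshold."""
    hits = []
    for line in lines:
        row = parse_row(line)
        if row is None:
            continue
        if pct(row["cpu"]) > threshold or pct(row["disk_util"]) > threshold:
            hits.append(row)
    return hits


SAMPLE = [
    "CPU NFS CIFS HTTP Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP Disk OTHER FCP iSCSI FCP kB/s iSCSI kB/s",
    "14% 0 214 0 214 14156 485 1912 105296 0 0 3s 93% 100% :f 100% 0 0 0 0 0 0 0",
    "12% 0 614 0 618 40395 1350 2420 11180 0 0 3s 77% 40% : 63% 4 0 0 0 0 0 0",
    "1% 0 0 0 0 6 2 0 0 0 0 3s 100% 0% - 0% 0 0 0 0 0 0 0",
]

for row in busy_rows(SAMPLE):
    print(f"cpu={row['cpu']} disk_util={row['disk_util']} cp_type={row['cp_type']}")
```

Only the first sample row is flagged: its disk utilization is 100% while CPU sits at 14%, which points at the disks rather than the controller.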
Controller 1 /etc/rc
NetApp1> rdfile /etc/rc
#Auto-generated by setup Tue Jan 28 09:31:14 EST 2014
hostname RJ-NetApp1
ifgrp create multi vif1 -b ip e0a e0c
ifgrp create multi vif2 -b ip e0b e0d
ifgrp create single svif vif1 vif2
ifconfig svif `hostname`-svif mediatype auto partner svif mtusize 1500
route add default 192.168.1.1 1
routed on
options dns.domainname ronjon.corp
options dns.enable on
options nis.enable off
savecore
Controller 2 /etc/rc
NetApp2> rdfile /etc/rc
#Auto-generated by setup Tue Jan 28 09:33:21 EST 2014
hostname RJ-NetApp2
ifgrp create multi vif1 -b ip e0a e0c
ifgrp create multi vif2 -b ip e0b e0d
ifgrp create single svif vif1 vif2
ifconfig svif `hostname`-svif mediatype auto partner svif mtusize 1500
route add default 192.168.1.1 1
routed on
options dns.domainname ronjon.corp
options dns.enable on
options nis.enable off
savecore
Disk utilization at 100% is the issue.
aggr status -r (from both controllers) will show us how the raid groups are configured. From there we can see why you are having disk util issues.
aggr status
Controller 1
NetApp1> aggr status -r
Aggregate aggr0 (online, raid4) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal, block checksums)
RAID Disk  Device    HA  SHELF BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------    -------------  ----  ----  ----  -----  --------------      --------------
parity     0a.00.2   0a  0     2    SA:A  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.4   0a  0     4    SA:A  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.6   0a  0     6    SA:A  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.8   0a  0     8    SA:A  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.10  0a  0     10   SA:A  0     BSAS  7200   2538546/5198943744  2543634/5209362816
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk  Device    HA  SHELF BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------    -------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.0   0a  0     0    SA:A  0     BSAS  7200   2538546/5198943744  2543634/5209362816
Partner disks
RAID Disk  Device    HA  SHELF BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------    -------------  ----  ----  ----  -----  --------------      --------------
partner    0b.00.11  0b  0     11   SA:B  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.5   0b  0     5    SA:B  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.3   0b  0     3    SA:B  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.9   0b  0     9    SA:B  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.7   0b  0     7    SA:B  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.1   0b  0     1    SA:B  0     BSAS  7200   0/0                 2543634/5209362816
Controller 2
NetApp2> aggr status -r
Aggregate aggr0 (online, raid4) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal, block checksums)
RAID Disk  Device    HA  SHELF BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------    -------------  ----  ----  ----  -----  --------------      --------------
parity     0a.00.3   0a  0     3    SA:B  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.5   0a  0     5    SA:B  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.9   0a  0     9    SA:B  0     BSAS  7200   2538546/5198943744  2543634/5209362816
data       0a.00.7   0a  0     7    SA:B  0     BSAS  7200   2538546/5198943744  2543634/5209362816
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk  Device    HA  SHELF BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------    -------------  ----  ----  ----  -----  --------------      --------------
Spare disks for block checksum
spare      0a.00.1   0a  0     1    SA:B  0     BSAS  7200   2538546/5198943744  2543634/5209362816
spare      0a.00.11  0a  0     11   SA:B  0     BSAS  7200   2538546/5198943744  2543634/5209362816
Partner disks
RAID Disk  Device    HA  SHELF BAY  CHAN  Pool  Type  RPM    Used (MB/blks)      Phys (MB/blks)
---------  ------    -------------  ----  ----  ----  -----  --------------      --------------
partner    0b.00.10  0b  0     10   SA:A  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.2   0b  0     2    SA:A  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.6   0b  0     6    SA:A  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.4   0b  0     4    SA:A  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.8   0b  0     8    SA:A  0     BSAS  7200   0/0                 2543634/5209362816
partner    0b.00.0   0b  0     0    SA:A  0     BSAS  7200   0/0                 2543634/5209362816
Controller 2 only has 3 data drives. I am not sure what kind of workload you are trying to run, but that is not very many spindles. There are not enough IOPS to support the workload.
You have 2 spares on controller 2, so you could give up one of those spares to the aggregate and get a few more IOPS (this would require restriping your volumes). With systems this small, I usually do an "active/passive" configuration to provide a larger single pool of disks.
So instead of splitting the disks evenly, I would do the following:
Controller 1 (RAID-DP):
parity
dparity
data
data
data
data
data
data
spare
Controller 2 (RAID4):
parity
data
spare
Controller 1 gets all of the workload, and controller 2 is "passive" and will take over in case controller 1 fails.
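To make the spindle arithmetic concrete, here is a rough sketch comparing the current split (taken from the aggr status output above) with the proposed active/passive layout. The per-disk IOPS figure is purely an assumed planning number for a 7,200 RPM drive, not a measurement from this system, and the dictionary and function names are illustrative only.

```python
# Assumption: a rough planning figure for one 7,200 RPM drive, not a measured value.
ASSUMED_IOPS_PER_DISK = 75

# Disk roles per controller today, as reported by `aggr status -r` in this thread.
current = {
    "NetApp1": {"parity": 1, "data": 4, "spare": 1},
    "NetApp2": {"parity": 1, "data": 3, "spare": 2},
}

# Disk roles per controller in the proposed active/passive layout.
proposed = {
    "NetApp1": {"parity": 1, "dparity": 1, "data": 6, "spare": 1},
    "NetApp2": {"parity": 1, "data": 1, "spare": 1},
}


def total_disks(layout):
    """Count every disk across both controllers."""
    return sum(sum(roles.values()) for roles in layout.values())


def data_disks(layout, controller):
    """Number of data disks serving I/O on the given controller."""
    return layout[controller]["data"]


def rough_iops(disk_count, per_disk=ASSUMED_IOPS_PER_DISK):
    """Very rough random-I/O ceiling for a set of data disks."""
    return disk_count * per_disk


busy_now = data_disks(current, "NetApp2")    # the controller hitting 100% disk util
busy_then = data_disks(proposed, "NetApp1")  # the controller that would carry the load

print(f"disks in use: {total_disks(current)} now, {total_disks(proposed)} proposed")
print(f"data disks behind the workload: {busy_now} now, {busy_then} proposed")
print(f"rough IOPS ceiling: {rough_iops(busy_now)} now, {rough_iops(busy_then)} proposed")
```

Both layouts use the same 12 disks; the difference is that the workload would sit on 6 data spindles instead of 3, so the rough ceiling doubles whatever per-disk figure is assumed.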
The /etc/rc files look good. I have implemented this same configuration several times.
You just have to make sure the switch is configured for static EtherChannel (NetApp's "multi"), and not dynamic EtherChannel (NetApp's "lacp").
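As a quick way to check which switch-side mode each ifgrp expects, the `ifgrp create` lines from /etc/rc can be read programmatically. This is a sketch only: `ifgrp_modes` and `SWITCH_SIDE` are hypothetical names, the parser handles just the forms seen in this thread (an optional `-b <method>` followed by member names), and the Cisco IOS hints in the strings are the usual `channel-group` modes rather than anything taken from the customer's switches.

```python
# Hypothetical lookup: what each 7-Mode ifgrp type expects from the switch ports.
SWITCH_SIDE = {
    "multi": "static EtherChannel (on Cisco IOS typically 'channel-group <n> mode on')",
    "lacp": "dynamic EtherChannel via LACP (on Cisco IOS typically 'channel-group <n> mode active')",
    "single": "no port channel; only one member is active at a time",
}


def ifgrp_modes(rc_text):
    """Return {ifgrp_name: (mode, [members])} for every 'ifgrp create' line."""
    groups = {}
    for line in rc_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[:2] != ["ifgrp", "create"]:
            continue
        mode, name, rest = parts[2], parts[3], parts[4:]
        if rest[:1] == ["-b"]:
            rest = rest[2:]  # drop the load-balancing option and its value
        groups[name] = (mode, rest)
    return groups


# The ifgrp lines from the /etc/rc files posted in this thread.
RC = """\
ifgrp create multi vif1 -b ip e0a e0c
ifgrp create multi vif2 -b ip e0b e0d
ifgrp create single svif vif1 vif2
"""

for name, (mode, members) in ifgrp_modes(RC).items():
    hint = SWITCH_SIDE.get(mode, "unknown mode")
    print(f"{name}: {mode} over {', '.join(members)} -> {hint}")
```

For the files in this thread, vif1 and vif2 come out as "multi", so the switch ports for each pair should be in a static channel; if they were set up for LACP negotiation instead, that mismatch would be one place to look for the port flapping.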
Makes sense. Thanks a lot for the help.