2013-11-13 06:37 PM
Hello my fellow Minions!
We have filers that are presenting NFS volumes to VMware ESX servers using a VIF. The VIF was misconfigured at install time as single_mode rather than lacp. Is there a way to convert from single_mode to multi_mode + lacp without downtime to VMware, the ESXi hosts, or the VMs they present?
NetApp Release 8.0.5 7-Mode
slot 0: Dual 10G Ethernet Controller T320E-SFP/KR
c0a MAC Address: (auto-unknown-enabling)
c0b MAC Address: (auto-10g_kr-fd-up)
slot 0: Internal BGE 10/100 Ethernet Controller
e0M MAC Address: (auto-100tx-fd-up)
e0P MAC Address: (auto-100tx-fd-up)
slot 0: Dual 10/100/1000 Ethernet Controller G20
e0a MAC Address: (auto-1000t-fd-up)
e0b MAC Address: (auto-1000t-fd-up)
slot 0: SAS Host Adapter 0a
132 Disks: 114796.7GB
6 shelves with IOM3
slot 0: SAS Host Adapter 0b
132 Disks: 114796.7GB
6 shelves with IOM3
slot 0: Fibre Channel Target Host Adapter 0c
slot 0: Fibre Channel Target Host Adapter 0d
slot 0: Intel ICH USB EHCI Adapter u0a (0xdff01400)
boot0 Micron Technology Real SSD eUSB 2GB, rev 2.00/11.00, addr 2 1936MB 512B/sect (4BF0022700025202)
slot 1: Quad 10/100/1000 Ethernet Controller G20
e1a MAC Address: (auto-1000t-fd-up)
e1b MAC Address: (auto-1000t-fd-up)
e1c MAC Address: (auto-1000t-fd-up)
e1d MAC Address: (auto-1000t-fd-up)
slot 2: Dual 10G Ethernet Controller T320E-SFP+
e2a MAC Address: (auto-unknown-down)
e2b MAC Address: (auto-unknown-down)
vif1_nfs: 1 link, transmit 'none', Ifgrp Type 'single_mode' fail 'default'
Ifgrp Status Up Addr_set
e1d: state up, since 15Oct2013 12:49:30 (29+06:11:49)
e1b: state down, since 15Oct2013 12:50:33 (29+06:10:46)
e1c: state down, since 15Oct2013 12:49:35 (29+06:11:44)
e1a: state down, since 15Oct2013 12:49:30 (29+06:11:49)
2013-11-14 01:59 AM
It has been asked so many times ... You need to destroy the VIF and create a new one. This can be done non-disruptively using takeover/giveback: take over the filer, change /etc/rc, give back.
2013-11-14 07:13 AM
That was our first thought as well. We could do a cf takeover and, after the reboot and before the giveback, delete the existing VIF and create a new one as lacp, as shown below in the /etc/rc file. We might then reboot the filer one more time to make sure the changes to /etc/rc were all happy, happy, happy, and then move on to the next filer.
However, the NetApp PSE we had helping us with the issue said that since we had 100+ VM guests connected, he would not take any chances: he would shut down all the VMs, rebuild the ifgrp VIF, and then test connectivity. We also called NetApp support asking the same question and they said the same thing. The latest NetApp support ticket I opened about the same issue yielded the response "That would require some downtime on the interfaces. Depending on your NFS timeout settings, it might be possible to make the changes quick enough that they won't notice the outage. Does this answer your question?" To me, this suggests they think we can make the changes while all the VM guests are up, considering only the NFS timeouts and not any problems that may arise in the VM guests from losing connectivity to the NFS storage presenting their vmdk files. I would think some form of havoc would ensue from following that approach.
So basically, the responses we are receiving from support, the onsite NetApp PSE, and our storage vendor/reseller are split about 50/50 between doing it with all VMs offline and doing the takeover/giveback method. I don't know if we like those odds, and I wonder whether you have done the cf takeover, made the changes, and done the giveback in the past. To our team here, that seems like it would work, but we have limited experience with NetApp.
NTAP-TST-01> rdfile /etc/rc
#created by Scottie 11/05/2013
ifgrp create lacp vif1_nfs -b ip e0c e0d
ifconfig e0a 172.22.108.150 netmask 255.255.255.0
ifconfig vif1_nfs 172.22.108.151 netmask 255.255.255.0
ifconfig vif1_nfs up
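The listing above is from the test filer and uses e0c and e0d. As a rough sketch only, the equivalent lines for the filer whose ifgrp status is shown earlier might look like the following, assuming the same four member ports (e1a through e1d) stay in vif1_nfs. The address and netmask are placeholders, not values from this thread, and the switch ports for those four links would also have to be configured as an LACP port channel before the new ifgrp can come up.

```
# Sketch only: member ports taken from the ifgrp status output earlier in the thread.
# <nfs_ip> and <netmask> are placeholders for the existing vif1_nfs address.
ifgrp create lacp vif1_nfs -b ip e1a e1b e1c e1d
ifconfig vif1_nfs <nfs_ip> netmask <netmask>
ifconfig vif1_nfs up
```

After the giveback, ifgrp status vif1_nfs should report the Ifgrp Type as lacp rather than single_mode if the change took effect.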
2013-11-14 10:06 PM
Yes, I did it when moving a connection from 1GbE to 10GbE. This was a configuration with several Oracle databases on NFS.
In this case it is up to you. I always say that if downtime is possible, take the safe route and request downtime. You know, s*** happens...