
NetApp Ontap 9.2 Upgrade – review your network first

 

 

This is simply a request to review your network configuration before upgrading to Ontap 9.2. Still upgrade; just check first.

 

I discovered an issue just over a month ago, after an upgrade to Ontap 9.2.  NetApp has now confirmed it, but it's not a bug, just a change in the network stack that alters the way it routes particular protocols.

 

Had the change to the Ontap OS been known, I would have made infrastructure changes before the upgrade, rather than having to raise support tickets and now make changes to restore resilience to our infrastructure.

 

I will state that I have been very impressed with the performance of the new all flash NetApps and with the exception of one major bug, the systems have been bullet proof in general operation on our environment for the last year.

 

NetApp have removed a feature called “Fastpath” from Ontap 9.2. This feature stored the interface on which each incoming network packet arrived and ensured the reply went back out of the same interface.  It was originally implemented for performance, as it saved the time of checking the routing table. It also meant servers could send storage traffic to any NetApp virtual interface and the reply would egress from the same interface, irrespective of the network infrastructure.

 

During the upgrade to 9.2 we had several outages and lost monitoring from OnCommand Unified Manager (OCUM). We restored monitoring by moving OCUM to another subnet, but we had to live with some loss of NFS resilience until we had confirmation of the cause.

 

Though each virtual interface still has a profile (data, management, intercluster, etc.) for incoming traffic, the response packet now follows the routing table and can go out through any interface within the same SVM.

 

The loss of monitoring from OCUM was caused by the HTTPS responses to polling via the cluster management interface leaving the NetApp via an intercluster interface, which happened to be on the OCUM server’s subnet.
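
To make the OCUM case concrete, here is a rough Python sketch of the two behaviours. It is a toy model, not ONTAP's actual code: the `Lif` class, the function names and all IP addresses are made up for illustration, and the only rule it models is "reply out of the ingress interface" versus "reply out of an interface that sits on the client's subnet".

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Lif:
    """Hypothetical stand-in for a logical interface: a name plus address/prefix."""
    name: str
    cidr: str

    @property
    def network(self):
        return ipaddress.ip_interface(self.cidr).network


def reply_lif_fastpath(ingress, lifs, client_ip):
    """Old behaviour as described above: reply leaves via the ingress LIF."""
    return ingress


def reply_lif_routed(ingress, lifs, client_ip):
    """Simplified new behaviour: prefer a LIF directly connected to the client's subnet."""
    client = ipaddress.ip_address(client_ip)
    for lif in lifs:
        if client in lif.network:
            return lif
    # No directly connected subnet. A real system would follow its default
    # route; this toy simply falls back to the ingress LIF.
    return ingress


# Made-up addressing: the monitoring server shares a subnet with the intercluster LIF.
lifs = [Lif("cluster_mgmt", "10.0.1.10/24"), Lif("intercluster", "10.0.2.10/24")]
ocum_ip = "10.0.2.50"

before = reply_lif_fastpath(lifs[0], lifs, ocum_ip)
after = reply_lif_routed(lifs[0], lifs, ocum_ip)
print(before.name, after.name)
```

In this model the poll arrives on `cluster_mgmt` in both cases, but only the routed version answers from `intercluster`, which is the asymmetric path a firewall or the client itself may reject.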

 

NFS issues were caused by using a NAS interface on the same node as the SVM admin interface. Once I realised this, we moved all servers' NFS traffic to the node without the admin interface.

 

A new feature of Ontap 9.2, a tcpdump-like command, allowed us to view the traffic in Wireshark and confirm the asymmetric routing from the NetApp side.

 

We are now in the process of moving intercluster and admin interfaces to their own subnets.  Hopefully upon completion 9.3 will be GA and we can gain some more space back.

Re: NetApp Ontap 9.2 Upgrade – review your network first

Thanks for the feedback - "network tcpdump" is a much more friendly interface to "pktt", which has been around for many years. We have a KB article about it here - https://kb.netapp.com/app/answers/answer_view/a_id/1029833/

 

Regarding fastpath's removal, it is one of those "features" which blocked some other developments, and through our ActiveIQ telemetry, we were able to see very limited adoption of it. It reminds me of "ip proxy-arp" on some network devices, which enabled systems without correct subnet masks to still work. 

Re: NetApp Ontap 9.2 Upgrade – review your network first

Hi Alex,

 

I totally understand the why of FASTPATH removal, optimising code paths and improving the customer experience.  The question is on the how.  

 

Google FASTPATH, search the NetApp support and community sites, check the release notes, the best practice on Active IQ, and the upgrade procedure downloaded from the NetApp support site.  It's not mentioned anywhere!

 

As you have said, you knew exactly who was using it, and it's on by default.  A warning would have been nice.  Instead I had a support ticket open for over a month for failures during upgrade, in which I described the problem clearly and in detail, before I finally received a confession from NetApp support.

 

So I published this so customers would be aware and could review their environment, maybe disable FASTPATH before the upgrade and see what, if anything, fails.  Then, if required, they can make changes to their environment so they don't suddenly get failures during an upgrade.

Re: NetApp Ontap 9.2 Upgrade – review your network first


Does this still have an impact if an "ipspace" is in place?

Re: NetApp Ontap 9.2 Upgrade – review your network first

Hi Kanna,

 

You could potentially isolate cluster interconnects in their own IPspace, but the heads and each SVM only have one routing table, so I don't know how effective that would be.  I would discuss it with your network team first and treat the routing and IP stuff as if it was any other network device.

 

 

Regards,

 

Mark

Re: NetApp Ontap 9.2 Upgrade – review your network first

Deligatedgeek,

 

I absolutely got bit by this same bug as well when we upgraded to 9.2 the last week of December. Wish I had seen this post before then. I agree that the big issue is there was no documentation anywhere about this change, and we actually experienced an outage of some content as a result. It took a T2 NetApp tech, a T2 VMware tech, our virt team and myself about 5 hours to figure out the cause. I don't understand why NetApp wouldn't describe this change in Upgrade Advisor or the Release Notes to give people a heads up.

Re: NetApp Ontap 9.2 Upgrade – review your network first


Please read NetApp KB article 1072895: Network traffic not sent or sent out of an unexpected interface after upgrade to 9.2 or later.

(https://kb.netapp.com/app/answers/answer_view/a_id/1072895/)

 

Workaround/fix:

Connect directly to the LIF associated with the port where the data is egressing
- This may require creating a new LIF depending on protocols supported by the existing LIF
Create a more specific route directly back to the client
- This will cause traffic to egress out the interface where the route points
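
As a rough illustration of why the second workaround helps, the sketch below models plain longest-prefix-match route selection in Python. This is an assumption about how the lookup behaves in general, not a dump of ONTAP internals; `pick_route`, the port names and the addresses are all hypothetical.

```python
import ipaddress


def pick_route(routes, client_ip):
    """Return the egress port of the most specific route matching client_ip.

    routes is a list of (destination_cidr, egress_port) pairs.
    Returns None when no route matches.
    """
    client = ipaddress.ip_address(client_ip)
    matches = []
    for destination, port in routes:
        network = ipaddress.ip_network(destination)
        if client in network:
            matches.append((network.prefixlen, port))
    if not matches:
        return None
    # Longest prefix wins: a /32 host route beats a /0 default route.
    return max(matches)[1]


client = "192.0.2.50"  # made-up client address

# Only a default route: the reply leaves wherever the default points.
routes_before = [("0.0.0.0/0", "mgmt_port")]

# Add a host route back to the client via the port the data LIF lives on.
routes_after = routes_before + [("192.0.2.50/32", "data_port")]

print(pick_route(routes_before, client), pick_route(routes_after, client))
```

The more specific entry only changes the path for that one client, so every other destination still follows the default route.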

 

By the way: I think NetApp should at least refer to this KB article or place a warning in the Upgrade Advisor report when upgrading to ONTAP 9.2, as this can cause serious disruptions in production environments. It should be easy to detect from the AutoSupport data which the system reports back to NetApp.


Re: NetApp Ontap 9.2 Upgrade – review your network first

I spoke to NetApp and they told me their data showed this feature was no longer in use.  So maybe they missed all the systems that still are.

 

The knowledge base article you reference should have been in the release notes, and it was published last month, so that's 4 months after we upgraded, which means at least 5 months after 9.2 went GA.

Re: NetApp Ontap 9.2 Upgrade – review your network first

We are running 9.1p6 and planning on upgrading to 9.3.

I read the KB and understand the change about ip.fastpath. However, I am not clear on whether our environment would be exposed to the change or not.

 

1.  ip.fastpath enable is currently set to on and shown as deprecated. Does that mean ip.fastpath is not in use? If not, then we don't need to worry about the change, because we are not using fastpath anyway. Correct?

 

2. If we do need to worry about it, what changes should I make before the upgrade?

Suppose there are three LIFs: admin_lif, data_lif1, and data_lif2, each with a different gateway.

If the user comes in via data_lif1, what rule does the SVM follow to egress the traffic? Should I make sure the GW for data_lif2 has routes to reach the user who came in via data_lif1?

 

Can somebody please shed some light on the issue for me?

 

 

Re: NetApp Ontap 9.2 Upgrade – review your network first

Hi Netappmagic,

 

1.  ip.fastpath enable is currently set to on and shown as deprecated. Does that mean ip.fastpath is not in use? If not, then we don't need to worry about the change, because we are not using fastpath anyway. Correct?

 

If it's on, then it can be used.

 

2. If we do need to worry about it, what changes should I make before the upgrade?

Suppose there are three LIFs: admin_lif, data_lif1, and data_lif2, each with a different gateway.

If the user comes in via data_lif1, what rule does the SVM follow to egress the traffic? Should I make sure the GW for data_lif2 has routes to reach the user who came in via data_lif1?

 

This is missing some detail. My first question is: why do you have a different gateway for each? Are there 3 different subnets? Could you provide more information, i.e. IP/netmask and routing table?

 

You could build a dev NetApp on VMware that reflects your environment and turn ip.fastpath off.  Or, during a maintenance period, turn ip.fastpath off.

 

Do you have multiple interfaces on the same subnet? If so, then this should affect you.
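
If you have the LIF addresses to hand, a quick way to answer that is to group them by subnet. A minimal Python sketch, assuming you have copied the address and prefix length of each LIF into a dictionary by hand; the LIF names come from the question above, but the addresses are invented.

```python
import ipaddress
from collections import defaultdict


def shared_subnets(lifs):
    """Map each subnet that holds more than one LIF to the sorted LIF names on it.

    lifs is a dict of {lif_name: "address/prefixlen"}.
    """
    by_subnet = defaultdict(list)
    for name, cidr in lifs.items():
        # ip_interface accepts a host address with a prefix and exposes its network.
        by_subnet[ipaddress.ip_interface(cidr).network].append(name)
    return {
        str(network): sorted(names)
        for network, names in by_subnet.items()
        if len(names) > 1
    }


# Invented addresses for the three LIFs mentioned in the question.
example_lifs = {
    "admin_lif": "10.0.1.5/24",
    "data_lif1": "10.0.1.10/24",
    "data_lif2": "10.0.2.10/24",
}

overlaps = shared_subnets(example_lifs)
print(overlaps)
```

Any subnet that shows up in the result is one where a reply could leave through a different LIF than the one the client addressed, so those are the ones to test first.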

 

Attach a diagram of your infrastructure, or a photo of a hand-scribbled diagram.  Any more information would be helpful.

 

Do you currently have multiple interfaces on the same subnet?

Do you have