2017-12-12 09:14 AM
This is simply to ask you to review your configuration before upgrading to Ontap 9.2. Still upgrade, just check first.
I discovered an issue just over a month ago, after an upgrade to Ontap 9.2. It has now been confirmed by NetApp, but it's not a bug, just a change in the network stack that alters the way it routes particular protocols.
Had the change to the Ontap OS been known, I would have made infrastructure changes before the upgrade, rather than having to raise support tickets and now make changes to restore resilience to our infrastructure.
I will state that I have been very impressed with the performance of the new all-flash NetApps and, with the exception of one major bug, the systems have been bulletproof in general operation in our environment for the last year.
NetApp have removed a feature called “Fastpath” from Ontap 9.2. This feature stored the interface of each incoming network packet and ensured the reply went back out the same interface. It was originally implemented for performance, as it saved the time of checking the routing table. It also enabled servers to send storage traffic to any NetApp virtual interface, and the reply would egress from the same interface irrespective of the network infrastructure.
During the upgrade to 9.2 we had several outages and lost monitoring from OnCommand Unified Manager (OCUM). We restored monitoring by moving OCUM to another subnet, but we had to live with some loss of NFS resilience until we had confirmation of the cause.
Though each virtual interface still has a profile (data, management, intercluster, etc.) for incoming traffic, the response packet now follows the routing table and can go out through any interface within the same SVM.
The loss of monitoring from OCUM was caused by the HTTPS responses to polling via the cluster management interface leaving the NetApp via an intercluster interface, which happened to be on the OCUM server’s subnet.
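The OCUM case can be illustrated with a small toy model. This is only a sketch of the behaviour described above, not ONTAP's actual routing code: the LIF names, IP addresses, and the simplified "prefer a directly connected subnet, else use a default" rule are all assumptions made for illustration.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Lif:
    name: str   # hypothetical LIF name
    role: str   # e.g. "cluster-mgmt", "intercluster"
    address: ipaddress.IPv4Interface  # IP address plus prefix length


def egress_fastpath(ingress: Lif) -> Lif:
    """Fastpath-style behaviour: reply on the LIF the request arrived on."""
    return ingress


def egress_routed(client: ipaddress.IPv4Address, lifs: list[Lif], default: Lif) -> Lif:
    """Routing-table-style behaviour (simplified): if the client sits on a
    subnet that one of the LIFs is directly connected to, reply via that LIF;
    otherwise fall back to the LIF that reaches the default gateway."""
    connected = [lif for lif in lifs if client in lif.address.network]
    if connected:
        # Longest prefix wins, as in an ordinary routing table lookup.
        return max(connected, key=lambda lif: lif.address.network.prefixlen)
    return default


# Made-up addresses: OCUM polls the cluster management LIF, but the OCUM
# server itself lives on the same subnet as an intercluster LIF.
cluster_mgmt = Lif("cluster_mgmt", "cluster-mgmt", ipaddress.ip_interface("10.0.1.10/24"))
intercluster = Lif("intercluster_1", "intercluster", ipaddress.ip_interface("10.0.2.20/24"))
lifs = [cluster_mgmt, intercluster]
ocum = ipaddress.ip_address("10.0.2.50")

before = egress_fastpath(cluster_mgmt)                   # reply leaves via cluster_mgmt
after = egress_routed(ocum, lifs, default=cluster_mgmt)  # reply leaves via intercluster_1
print(before.name, "->", after.name)
```

In this model the request and the reply use different interfaces after the change, which is the asymmetric path that broke our monitoring.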
The NFS issues were caused by using a NAS interface on the same node as the SVM admin interface. Once I realised this, we moved all servers' NFS traffic to the node without the admin interface.
A new feature of Ontap 9.2, a tcpdump-like command, allowed us to view the traffic in Wireshark and confirm the asymmetric routing from the NetApp side.
We are now in the process of moving intercluster and admin interfaces to their own subnets. Hopefully upon completion 9.3 will be GA and we can gain some more space back.
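For anyone wanting to check their own layout, here is a rough sketch of the kind of review I mean: list the interfaces of one SVM and flag any subnet that is shared by interfaces with different roles. The input format is an assumption (name, role, address/prefix tuples typed in by hand, for example from "network interface show" output), and the names and addresses below are made up.

```python
import ipaddress
from collections import defaultdict


def mixed_role_subnets(lifs):
    """Return {subnet: [(name, role), ...]} for every subnet that carries
    LIFs of more than one role. `lifs` is an iterable of
    (name, role, "address/prefix") tuples for a single SVM."""
    by_subnet = defaultdict(list)
    for name, role, cidr in lifs:
        subnet = ipaddress.ip_interface(cidr).network
        by_subnet[subnet].append((name, role))
    return {
        subnet: members
        for subnet, members in by_subnet.items()
        if len({role for _, role in members}) > 1
    }


# Hypothetical inventory, transcribed by hand.
inventory = [
    ("cluster_mgmt", "cluster-mgmt", "10.0.1.10/24"),
    ("intercluster_1", "intercluster", "10.0.2.20/24"),
    ("nfs_data_1", "data", "10.0.2.30/24"),
    ("nfs_data_2", "data", "10.0.3.30/24"),
]

shared = mixed_role_subnets(inventory)
for subnet, members in shared.items():
    print(subnet, members)
```

A flagged subnet is not necessarily a problem, but it is a place where a reply could leave through an interface of a different role than the one the client addressed, so it is worth a look before upgrading.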
2017-12-12 06:39 PM
Thanks for the feedback - "network tcpdump" is a much more friendly interface to "pktt", which has been around for many years. We have a KB article about it here - https://kb.netapp.com/app/answers/answer_view/a_id/1029833/
Regarding fastpath's removal, it is one of those "features" which blocked some other developments, and through our ActiveIQ telemetry, we were able to see very limited adoption of it. It reminds me of "ip proxy-arp" on some network devices, which enabled systems without correct subnet masks to still work.
2017-12-13 02:51 AM
I totally understand the why of FASTPATH removal, optimising code paths and improving the customer experience. The question is on the how.
Google FASTPATH, search the NetApp support and community sites, check the release notes, the best practices on Active IQ, and the upgrade procedure downloaded from the NetApp support site. It's not mentioned anywhere!
As you have said, you knew exactly who was using it, and it's on by default. A warning would have been nice. Instead I had a support ticket open for over a month for failures during upgrade, in which I described the problem clearly and in detail, before I finally received a confession from NetApp support.
So I published this so customers would be aware and could review their environment, maybe disable FASTPATH before the upgrade and see what, if anything, fails. Then, if required, they can make changes to their environment so they don't suddenly get failures during an upgrade.
2018-02-05 02:30 AM
You could potentially isolate the cluster interconnects in their own IPspace, but the heads and each SVM only have one routing table, so I don't know how effective that would be. I would discuss it with your network team first and treat the routing and IP stuff as if it were any other network device.
2018-02-05 07:01 AM
I absolutely got bit by this same bug as well when we upgraded to 9.2 the last week of December. Wish I had seen this post before then. I agree that the big issue is there was no documentation anywhere about this change, and we actually experienced an outage of some content as a result. It took a T2 NetApp tech, a T2 VMware tech, our virt team and myself about 5 hours to figure out the cause. I don't understand why NetApp wouldn't describe this change in Upgrade Advisor or the Release Notes to give people a heads up.
4 weeks ago - last edited 4 weeks ago
Please read NetApp KB article 1072895: Network traffic not sent or sent out of an unexpected interface after upgrade to 9.2 or later.
By the way: I think NetApp should at least refer to this KB article or place a warning in the Upgrade Advisor report when upgrading to ONTAP 9.2, as this can cause serious disruptions in production environments. It should be easy to detect from the AutoSupport data which the system reports back to NetApp.
4 weeks ago
I spoke to NetApp and they told me their data showed this feature was no longer in use. So maybe they missed all the systems that still are.
The knowledge base article you reference should have been in the release notes, and it was published last month, so that's 4 months after we upgraded, which means at least 5 months after 9.2 went GA.