2017-12-12 09:14 AM
This is simply to ask you to review your configuration before upgrading to ONTAP 9.2. Still upgrade, just check first.
I discovered an issue just over a month ago, after an upgrade to ONTAP 9.2. It has now been confirmed by NetApp, but it's not a bug, just a change in the network stack that changes the way it routes particular protocols.
Had the change to ONTAP been known, I would have made infrastructure changes before the upgrade, rather than having to raise support tickets and now make changes to restore resilience to our infrastructure.
I will state that I have been very impressed with the performance of the new all-flash NetApps and, with the exception of one major bug, the systems have been bulletproof in general operation in our environment for the last year.
NetApp have removed a feature called “Fastpath” from ONTAP 9.2. This feature stored the interface of each incoming network packet and ensured the reply went back out of the same interface. It was originally implemented for performance, as it saved the time of checking the routing table. It also allowed servers to send storage traffic to any NetApp virtual interface, and the reply would egress from the same interface irrespective of the network infrastructure.
During the upgrade to 9.2 we had several outages and lost monitoring from OnCommand Unified Manager (OCUM). We restored monitoring by moving OCUM to another subnet, but we had to live with some loss of NFS resilience until we had confirmation of the cause.
Each virtual interface still has a profile (data, management, intercluster, etc.) for incoming traffic, but the response packet is now routed using the routing table and can go out through any interface within the same SVM.
The loss of monitoring from OCUM was caused by the HTTPS responses to polling via the cluster management interface leaving the NetApp via an intercluster interface, which happened to be on the OCUM server’s subnet.
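To make the OCUM case above concrete, here is a minimal Python sketch of the two egress behaviours. It is a simplified model, not ONTAP's actual implementation: the interface names, roles and addresses are made up for illustration, and the routing lookup only considers directly connected subnets plus a single default-route interface.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_interface
from typing import List


@dataclass(frozen=True)
class Lif:
    """A logical interface; name, role and address are hypothetical examples."""
    name: str
    role: str
    address: str  # CIDR notation, e.g. "10.0.1.10/24"

    @property
    def network(self):
        return ip_interface(self.address).network


def egress_with_fastpath(ingress: Lif) -> Lif:
    """Pre-9.2 behaviour as described above: reply leaves via the ingress interface."""
    return ingress


def egress_with_routing_table(lifs: List[Lif], client_ip: str, default_lif: Lif) -> Lif:
    """Simplified 9.2 behaviour: prefer an interface on the client's own subnet,
    otherwise fall back to the interface that carries the default route."""
    client = ip_address(client_ip)
    connected = [lif for lif in lifs if client in lif.network]
    if connected:
        # Longest prefix wins, as in an ordinary routing table lookup.
        return max(connected, key=lambda lif: lif.network.prefixlen)
    return default_lif


# Hypothetical layout: the monitoring server shares a subnet with the intercluster LIF.
cluster_mgmt = Lif("cluster_mgmt", "management", "10.0.1.10/24")
intercluster = Lif("ic1", "intercluster", "10.0.2.10/24")
lifs = [cluster_mgmt, intercluster]
ocum_ip = "10.0.2.50"

before = egress_with_fastpath(cluster_mgmt)                        # cluster_mgmt
after = egress_with_routing_table(lifs, ocum_ip, cluster_mgmt)     # ic1
print(f"fastpath reply via {before.name}, routed reply via {after.name}")
```

In this model the request arrives on the management interface but the routed reply leaves from the intercluster interface, which is the asymmetric path that broke monitoring for us.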
The NFS issues were caused by using a NAS interface on the same node as the SVM admin interface. Once I realised this, we moved all servers' NFS traffic to the node without the admin interface.
A new feature of ONTAP 9.2, a tcpdump-like command, allowed us to view the traffic in Wireshark and confirm the asymmetric routing from the NetApp side.
We are now in the process of moving intercluster and admin interfaces to their own subnets. Hopefully upon completion 9.3 will be GA and we can gain some more space back.
2017-12-12 06:39 PM
Thanks for the feedback - "network tcpdump" is a much more friendly interface to "pktt", which has been around for many years. We have a KB article about it here - https://kb.netapp.com/app/answers/answer_view/a_id/1029833/
Regarding fastpath's removal, it is one of those "features" which blocked some other developments, and through our ActiveIQ telemetry, we were able to see very limited adoption of it. It reminds me of "ip proxy-arp" on some network devices, which enabled systems without correct subnet masks to still work.
2017-12-13 02:51 AM
I totally understand the why of FASTPATH removal, optimising code paths and improving the customer experience. The question is on the how.
Google FASTPATH, search the NetApp support and community sites, check the release notes, the best practice on Active IQ, and the upgrade procedure downloaded from the NetApp support site. It's not mentioned anywhere!
As you have said, you knew exactly who was using it, and it's on by default. A warning would have been nice. Instead I had a support ticket open for over a month for failures during upgrade, in which I described the problem clearly and in detail, before I finally received a confession from NetApp support.
So I published this so customers would be aware and could review their environment, maybe disable FASTPATH before the upgrade and see what, if anything, fails. Then, if required, they can make changes to their environment so they don't suddenly get failures during an upgrade.
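As a rough aid for that review, the following Python sketch flags subnets where interfaces of different roles sit together within one SVM, which is the layout that caught us out. It assumes you have already exported your interface list by hand into (svm, name, role, CIDR) tuples; the inventory shown is invented, and the script does not talk to ONTAP at all.

```python
from collections import defaultdict
from ipaddress import ip_interface


def subnets_with_mixed_roles(lifs):
    """Group interfaces by (SVM, subnet) and return the groups holding more than one role.

    `lifs` is an iterable of (svm, name, role, cidr) tuples, a hypothetical
    hand-built inventory rather than any real ONTAP export format.
    """
    groups = defaultdict(list)
    for svm, name, role, cidr in lifs:
        network = ip_interface(cidr).network
        groups[(svm, str(network))].append((name, role))
    return {
        key: members
        for key, members in groups.items()
        if len({role for _, role in members}) > 1
    }


# Invented example inventory.
inventory = [
    ("cluster1", "cluster_mgmt", "management", "10.0.1.10/24"),
    ("cluster1", "ic1", "intercluster", "10.0.2.10/24"),
    ("svm_nfs", "nfs_lif1", "data", "10.0.3.11/24"),
    ("svm_nfs", "svm_admin", "management", "10.0.3.20/24"),
]

for (svm, network), members in subnets_with_mixed_roles(inventory).items():
    print(f"{svm} {network}: {members}")
```

Any subnet it reports is worth a closer look before the upgrade, since replies to traffic arriving on one interface may leave via another interface on that subnet once the routing table is in charge.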