NetApp ONTAP 9.2 Upgrade – review your network first

Deligatedgeek
20,982 Views

 

 

This is simply a request to review your network configuration before upgrading to ONTAP 9.2. Still upgrade, just check first.

I discovered an issue just over a month ago, after an upgrade to ONTAP 9.2. It has now been confirmed by NetApp, but it's not a bug, just a change in the network stack that alters the way it routes particular protocols.

Had the change to ONTAP been known, I would have made infrastructure changes before the upgrade, rather than having to raise support tickets and now make changes to restore resilience to our infrastructure.

I will state that I have been very impressed with the performance of the new all-flash NetApps and, with the exception of one major bug, the systems have been bulletproof in general operation in our environment for the last year.

NetApp have removed a feature called “Fastpath” from ONTAP 9.2. This feature stored the interface of each incoming network packet and ensured the response went back out of the same interface. It was originally implemented for performance, as it saved the time of checking the routing table. It also meant that servers could send storage traffic to any NetApp virtual interface and the reply would egress from the same interface, irrespective of the network infrastructure.
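A toy model may make the difference concrete. The sketch below is not how ONTAP implements either behaviour; the interface names, the addresses (taken from documentation ranges) and the routing table are all hypothetical. It contrasts a fastpath-style reply, which leaves through the interface the request arrived on, with a routed reply, which leaves through whichever interface the most specific matching route points to.

```python
import ipaddress

# Hypothetical interfaces and routes, for illustration only.
ROUTES = [
    # (destination network, egress interface)
    (ipaddress.ip_network("192.0.2.0/24"), "lif_data"),     # directly connected
    (ipaddress.ip_network("198.51.100.0/24"), "lif_mgmt"),  # directly connected
    (ipaddress.ip_network("0.0.0.0/0"), "lif_mgmt"),        # default route
]


def egress_fastpath_style(ingress_interface):
    """Reply leaves through whichever interface the request arrived on."""
    return ingress_interface


def egress_routed(client_ip, routes=ROUTES):
    """Reply leaves through the interface of the most specific matching route."""
    addr = ipaddress.ip_address(client_ip)
    matches = [(net, ifc) for net, ifc in routes if addr in net]
    if not matches:
        raise LookupError(f"no route to {client_ip}")
    # Longest prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# A client outside both connected subnets sends a request to lif_data.
client = "203.0.113.10"
print("fastpath-style reply via:", egress_fastpath_style("lif_data"))
print("routed reply via:", egress_routed(client))
```

With these assumed routes, the reply to that client leaves via lif_data under the fastpath-style rule, but via lif_mgmt once the routing table decides, which is the kind of asymmetry described in this post.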

 

During the upgrade to 9.2 we had several outages and lost monitoring from OnCommand Unified Manager (OCUM). We restored monitoring by moving OCUM to another subnet, but we had to live with some loss of NFS resilience until we had confirmation of the cause.

Though each virtual interface still has a profile (data, management, intercluster, etc.) for incoming traffic, the response packet now follows the routing table and can go out through any interface within the same SVM.

The loss of monitoring from OCUM was caused by the HTTPS responses to polling via the cluster management interface leaving the NetApp via an intercluster interface, which happened to be on the OCUM server’s subnet.

The NFS issues were caused by using a NAS interface on the same node as the SVM admin interface. Once I realised this, we moved all servers' NFS traffic to the node without the admin interface.

A new feature of ONTAP 9.2, a tcpdump-like command, allowed us to view the traffic in Wireshark and confirm the asymmetric routing from the NetApp side.

We are now in the process of moving intercluster and admin interfaces to their own subnets.  Hopefully upon completion 9.3 will be GA and we can gain some more space back.

18 REPLIES

AlexDawson
20,522 Views

Thanks for the feedback - "network tcpdump" is a much more friendly interface to "pktt", which has been around for many years. We have a KB article about it here - https://kb.netapp.com/app/answers/answer_view/a_id/1029833/

 

Regarding fastpath's removal, it is one of those "features" which blocked some other developments, and through our ActiveIQ telemetry, we were able to see very limited adoption of it. It reminds me of "ip proxy-arp" on some network devices, which enabled systems without correct subnet masks to still work. 

Deligatedgeek
20,691 Views

Hi Alex,

 

I totally understand the why of FASTPATH removal, optimising code paths and improving the customer experience.  The question is on the how.  

 

Google FASTPATH, search the NetApp support and community sites, check the release notes, the best practices on Active IQ, and the upgrade procedure downloaded from the NetApp support site. It's not mentioned anywhere!

As you have said, you knew exactly who was using it, and it's on by default. A warning would have been nice. Instead I had a support ticket open for over a month for failures during the upgrade, in which I described the problem clearly and in detail, before I finally received a confession from NetApp support.

So I published this so customers would be aware and could review their environment, maybe disable FASTPATH before the upgrade and see what, if anything, fails. Then, if required, they can make changes to their environment so they don't suddenly get failures during an upgrade.

Kannan_DWA
20,054 Views

Does this still have an impact if an "ipspace" is in place?

Deligatedgeek
19,987 Views

Hi Kannan,

You could potentially isolate the cluster interconnects in their own IPspace, but the heads and each SVM only have one routing table, so I don't know how effective that would be. I would discuss it with your network team first and treat the routing and IP configuration as you would for any other network device.

Regards,

 

Mark

TMADOCTHOMAS
19,964 Views

Deligatedgeek,

 

I absolutely got bit by this same bug as well when we upgraded to 9.2 the last week of December. Wish I had seen this post before then. I agree that the big issue is there was no documentation anywhere about this change, and we actually experienced an outage of some content as a result. It took a T2 NetApp tech, a T2 VMware tech, our virt team and myself about 5 hours to figure out the cause. I don't understand why NetApp wouldn't describe this change in Upgrade Advisor or the Release Notes to give people a heads up.

bvanderkolk
19,600 Views

Please read NetApp KB article 1072895: Network traffic not sent or sent out of an unexpected interface after upgrade to 9.2 or later.

(https://kb.netapp.com/app/answers/answer_view/a_id/1072895/)

 

Workaround/fix:

- Connect directly to the LIF associated with the port where the data is egressing.
  - This may require creating a new LIF, depending on the protocols supported by the existing LIF.
- Create a more specific route directly back to the client.
  - This will cause traffic to egress out of the interface where the route points.
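The second workaround relies on longest-prefix matching: a route that covers the client more specifically than the default route takes precedence. The following is a minimal sketch of that rule only, with a hypothetical routing table, client address and gateway names; it does not use any ONTAP command or API.

```python
import ipaddress


def pick_route(client_ip, routes):
    """Return the (network, gateway) pair with the longest matching prefix."""
    addr = ipaddress.ip_address(client_ip)
    candidates = [r for r in routes if addr in r[0]]
    return max(candidates, key=lambda r: r[0].prefixlen)


# Hypothetical routing table: only a default route at first.
routes = [(ipaddress.ip_network("0.0.0.0/0"), "gateway_a")]
client = "203.0.113.25"
before = pick_route(client, routes)[1]

# Add a more specific route covering the client's subnet via another gateway.
routes.append((ipaddress.ip_network("203.0.113.0/24"), "gateway_b"))
after = pick_route(client, routes)[1]

print(before, "->", after)
```

In this model the client's replies move from gateway_a to gateway_b as soon as the more specific route exists, while every other destination keeps using the default route.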

 

By the way: I think NetApp should at least refer to this KB article or place a warning in the Upgrade Advisor report when upgrading to ONTAP 9.2, as this can cause serious disruptions for production environments. It should be easy to detect from the AutoSupport data which the system reports back to NetApp.


Deligatedgeek
19,591 Views

I spoke to NetApp and they told me their data showed this feature was no longer in use.  So maybe they missed all the systems that still are.

 

The knowledge base article you reference should have been in the release notes, and it was published last month, so that's 4 months after we upgraded, which means at least 5 months after 9.2 went GA.

netappmagic
18,832 Views

We are running 9.1p6 and planning on upgrading to 9.3.

I read the KB and understand the change to ip.fastpath. However, I am not clear on whether our environment would be exposed to the change.

1. ip.fastpath.enable is currently set to on and is marked as deprecated. Does that mean ip.fastpath is not in use? If it is not in use, then we don't need to worry about the change, because we are not using fastpath anyway. Correct?

 

2. If we do need to worry about it, what changes should I make before the upgrade?

Suppose there are three LIFs: admin_lif, data_lif1, and data_lif2, each with a different gateway.

The user comes in via data_lif1. What rule does the SVM then follow to egress the traffic? Should I make sure the gateway for data_lif2 has routes to reach the user who came in via data_lif1?

 

Can somebody please shed some light on this issue for me?

Deligatedgeek
18,820 Views

Hi Netappmagic,

 

1. ip.fastpath.enable is currently set to on and is marked as deprecated. Does that mean ip.fastpath is not in use? If it is not in use, then we don't need to worry about the change, because we are not using fastpath anyway. Correct?

If it's on, then it can be used.

 

2. If we do need to worry about it, what changes should I make before the upgrade?

Suppose there are three LIFs: admin_lif, data_lif1, and data_lif2, each with a different gateway.

The user comes in via data_lif1. What rule does the SVM then follow to egress the traffic? Should I make sure the gateway for data_lif2 has routes to reach the user who came in via data_lif1?

 

This is missing detail, because my first question is why you have a different gateway for each. Are there 3 different subnets? Could you provide more information: IP/netmask and routing table?

You could build a dev NetApp on VMware that reflects your environment and turn ip.fastpath off. Or, during a maintenance period, turn ip.fastpath off.

Do you have multiple interfaces on the same subnet? If so, then this should affect you.

Attach a diagram of your infrastructure, or a photo of a hand-scribbled diagram. Any more information would be helpful.

 

Do you currently have multiple interfaces on the same subnet?

Do you have

netappmagic
16,422 Views

Thanks a lot for your prompt messages!

 

 

> net int show -vserver vserver10
  (network interface show)
            Logical         Status     Network            Current       Current Is
Vserver     Interface       Admin/Oper Address/Mask       Node          Port    Home
----------- --------------- ---------- ------------------ ------------- ------- ----
vserver10
            vserver10-01    up/up      10.192.19.101/24   node-01       a0a-180 true
            vserver10-02    up/up      10.192.19.102/24   node-02       a0a-180 true
            vserver10-admin up/up      10.192.108.12/24   node-01       e0i     true
            vserver10-c01   up/up      10.192.100.31/24   node-01       a0a-379 true
            vserver10-c02   up/up      10.192.100.32/24   node-02       a0a-379 true

 

 

As I see it, we have three subnets, 10.192.19, 10.192.108, and 10.192.100, and three gateways, one for each. Please let me know if the information here is enough.
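One way to answer the earlier "multiple interfaces on the same subnet?" question from this output is to group the LIFs by network. The snippet below is only a sketch: the LIF names and addresses are copied from the output above, and the helper name `group_by_subnet` is made up for illustration.

```python
import ipaddress
from collections import defaultdict

# LIF names and addresses copied from the output above.
LIFS = {
    "vserver10-01": "10.192.19.101/24",
    "vserver10-02": "10.192.19.102/24",
    "vserver10-admin": "10.192.108.12/24",
    "vserver10-c01": "10.192.100.31/24",
    "vserver10-c02": "10.192.100.32/24",
}


def group_by_subnet(lifs):
    """Map each subnet to the sorted list of LIF names addressed in it."""
    groups = defaultdict(list)
    for name, cidr in lifs.items():
        subnet = ipaddress.ip_interface(cidr).network
        groups[str(subnet)].append(name)
    return {subnet: sorted(names) for subnet, names in groups.items()}


for subnet, names in group_by_subnet(LIFS).items():
    print(subnet, names)
```

On this data the only shared subnets are the two pairs of data LIFs, each pair split across node-01 and node-02; the admin LIF sits alone in 10.192.108.0/24.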

 

 

One additional point is not clear to me. Based on my understanding, ip.fastpath.enable is enabled/disabled at the node level. So, if I disable it, will it affect all SVMs on the node?

bvanderkolk
16,399 Views

 

Make sure that your SMB/NFS clients (preferably) access the NetApp filer via the LIF in their local subnet.

For example you have the LIF in SubnetA with 10.192.19.X/24 and a LIF in SubnetB with 10.192.100.X/24.

ClientA is in SubnetA and tries to access the LIF in SubnetB. The (return) traffic from the NetApp filer will exit through the LIF in SubnetA and file access will break. I ran into this behaviour.

For clients accessing the NetApp filer in a subnet where there is no 'local' LIF, make sure that you configure the default gateway for that subnet and everything should work fine.
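The ClientA case can be turned into a quick paper check for any client/LIF pairing. This is a rough sketch of the behaviour described in this post, not of ONTAP's routing code: the two LIF addresses are taken from the output earlier in the thread, while the LIF labels, the client addresses and the function names are assumptions for illustration.

```python
import ipaddress

# Two LIFs mirroring the SubnetA/SubnetB case; the labels are hypothetical.
LIFS = {
    "lif_subnet_a": ipaddress.ip_interface("10.192.19.101/24"),
    "lif_subnet_b": ipaddress.ip_interface("10.192.100.31/24"),
}


def reply_lif(client_ip, lifs=LIFS):
    """Return the LIF whose connected subnet contains the client, or None."""
    addr = ipaddress.ip_address(client_ip)
    for name, iface in lifs.items():
        if addr in iface.network:
            return name
    return None  # no local LIF: the reply would follow a gateway route instead


def is_asymmetric(client_ip, target_lif, lifs=LIFS):
    """True when the reply is expected to leave through a different LIF."""
    local = reply_lif(client_ip, lifs)
    return local is not None and local != target_lif


# Hypothetical ClientA in SubnetA, first talking to the SubnetB LIF, then to its local LIF.
print(is_asymmetric("10.192.19.50", "lif_subnet_b"))
print(is_asymmetric("10.192.19.50", "lif_subnet_a"))
```

A client with no local LIF returns `None` from `reply_lif`, which corresponds to the default-gateway case above.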

 

About the ip.fastpath setting per SVM: I'm not 100 percent sure, but I think ip.fastpath is a global setting and so not configurable per SVM. Disabling or enabling it will affect your entire filer.

netappmagic
16,381 Views

I am still not sure whether there is anything we would have to do on the NetApp cluster side before the upgrade...

Deligatedgeek
16,346 Views

You have 2 testing options.

 

  1. Using the change process that you would use for the upgrade, tell everyone that there will be an outage. During this period, disable ip.fastpath and test connectivity from user IP ranges and application IP ranges, then select some servers and check their logs. Generally, I've found that OCUM lost connectivity with the NAS, and NFS broke if the LIF was in the same subnet as the admin LIF.
  2. Create a test environment on VMware. We have replicated all vfilers with non-default configurations in our dev environment and always test with the application teams before an upgrade.
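For the connectivity part of either test, a small script run from a client in each IP range can save some time. The sketch below only checks whether a TCP connection to a LIF completes; the function names and the example address are hypothetical, and the ports are the usual well-known ones (NFS 2049, SMB 445, HTTPS 443). A successful connect is not proof that the protocol works end to end, so the log checks above still apply.

```python
import socket

# Well-known TCP ports for the protocols discussed in this thread.
DEFAULT_PORTS = {"nfs": 2049, "smb": 445, "https": 443}


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port completes within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


def check_lifs(targets, timeout=3.0):
    """targets: iterable of (label, host, port). Returns {label: bool}."""
    return {label: tcp_reachable(host, port, timeout) for label, host, port in targets}


# Example with a hypothetical LIF address; run it before and after disabling fastpath:
# results = check_lifs([("nfs", "192.0.2.10", DEFAULT_PORTS["nfs"])])
```

Comparing the results from before and after the change, per client range, shows which paths depended on the old behaviour.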

Your SVM networking looks fine. What's the default gateway for the SVM, and are all clients within the same subnet as their LIFs?

bvanderkolk
16,343 Views

To prepare for the upgrade... just do the pre-checks/steps that you would normally do to prepare for an upgrade.

Use Upgrade Advisor, make configuration (metadata) and data backups of your filers, and know which client versions and protocols (SMB/NFS/iSCSI etc.) your clients are using and to which LIFs they are connecting. AFAIK it's not mandatory to disable fastpath before the upgrade.

Deligatedgeek
16,340 Views

Hi Bvanderkolk,

 

I would normally agree with you. I always get upgrade instructions direct from Active IQ and do all the prechecks, and yet we have had issues on 3 sequential upgrades, 2 of them major ones that required several WebEx sessions with support.

We know there is potential for stuff to break in this upgrade so why would you not test first?

 

Disable fastpath during a safe time and test. If stuff breaks or loses contact (etc.), then re-enable it and fix the infrastructure before the upgrade, rather than risk the failure that could be caused by doing the upgrade with an unknown result.

Do you work in an enterprise environment, directly with users and live servers?

 

Regards,

 

Mark

bvanderkolk
13,272 Views

Hi

 

I agree that testing is always a must and a preferred way of doing things, but sometimes it's not possible to take down your infrastructure outside a maintenance window. You could start your maintenance window by disabling fastpath and fixing any issues that may occur. After that you can decide to continue with the ONTAP upgrade or roll back any changes you've made.

Deligatedgeek
13,270 Views

I have stated that testing fastpath should be done under the same procedure you use for the upgrade.

 

Assuming even a monthly maintenance window, if it has waited this long, it can wait for testing.

 

Regards,

 

Mark

andrew_braker
12,856 Views

First off, a lot of great information in this thread. Thanks everyone, it's helping me plan for a 9.1 to 9.3 upgrade 🙂

 

A note though, NetApp must have updated their Upgrade Advisor, because I see they have this at the start:

 

Upgrade Plan Addendum
Please review the Upgrade Plan Addendum for additional checks, cautions, frequently asked questions, and errata. Incorporate
the addendum information into your upgrade preparation and execution process as appropriate. This is an all-inclusive addendum
covering all releases and is not context sensitive.

 

Link to Addendum: https://mysupport.netapp.com/ecm/ecm_get_file/ECMP12370318 (See step 3 about this situation).

 

Also, the Release Notes for 9.3 (at least) say Fast Path is removed.

 

 
