Re: A PCI error triggered from a memory error on the DRAM component of Converged Network Adapter ?

ASH2017 · 2017-09-21

Hi NetApp,

We received an advisory email from NetApp telling us to upgrade the ONTAP version before this issue impacts our customers. We were told to upgrade ASAP.

Looking at the bug and KB, it appears that non-correctable ECC errors on the CNA [UTA2] card trigger a PCI NMI error that results in a reboot. Basically, the node is failed over to prevent loss of data and to maintain data integrity, and is then failed back.

KB: https://kb.netapp.com/support/s/article/ka61A000000041fQAA/PCI-error-triggered-from-memory-error-condition-when-CNA-port-is-used-in-ethernet-mode-in-F...

BUGID: https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1026931

For dedicated NetApp clusters in a small environment this is not an issue, but for a managed services company with more than 50 clusters it's easier said than done. We first need to make sure everything connected to NetApp is compatible, using the Matrix site, and only then proceed to upgrade DR first and PROD next. With so many clusters it may well take some time.
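To illustrate the sequencing I have in mind, here is a minimal Python sketch. It assumes a hand-maintained inventory; the `Cluster` fields, the cluster names, and the `compat_checked` flag are hypothetical and are not pulled from any NetApp tool. It only orders clusters whose connected components have already been verified against the Matrix site, with DR ahead of PROD.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    role: str             # "DR" or "PROD" (hypothetical labels)
    compat_checked: bool  # True once everything connected was verified on the Matrix site


def upgrade_order(clusters):
    """Split clusters into (ready, blocked); ready ones are sorted DR first, then PROD."""
    rank = {"DR": 0, "PROD": 1}
    ready = sorted(
        (c for c in clusters if c.compat_checked),
        key=lambda c: (rank[c.role], c.name),
    )
    blocked = [c for c in clusters if not c.compat_checked]
    return ready, blocked


# Made-up inventory, for illustration only.
clusters = [
    Cluster("prod-a", "PROD", True),
    Cluster("dr-a", "DR", True),
    Cluster("prod-b", "PROD", False),
]
ready, blocked = upgrade_order(clusters)
print("upgrade in this order:", [c.name for c in ready])
print("compatibility check pending:", [c.name for c in blocked])
```

With more than 50 clusters, the "blocked" list is the part that takes the time, since each entry means another round of compatibility checks before the upgrade can even be scheduled.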

Concern: My concern is the insufficient explanation of this bug, both in the KB and in the bug report itself.

CNA [UTA2] - Can be used in two personality modes:

1. FC only

2. CNA (FCoE) - Protocols allowed: FC, iSCSI, CIFS & NFS

CNA [UTA2] - Provides hardware offload support for iSCSI and FCoE. I believe there is no offloading for CIFS/NFS; data is just passed along as with any other Ethernet NIC.

My questions are:

1. Does this bug affect customers using CNA personality mode for CIFS/NFS only? If so, how does it impact them?

2. Looking at the advisory, it appears the solution is to upgrade ONTAP. Does that mean there is nothing wrong with the hardware or firmware of the CNA device? Will ONTAP perhaps do some early detection and reset the non-correctable ECC errors before it panics?

3. The workaround is, I must say, very confusing to read. It says to change any unused CNA-mode port to FC mode. What do you mean by that? Does it mean that ports which are in CNA mode and offline will still be impacted? And what about the ports that are in CNA mode at the moment and serving data to customers? I thought a workaround is always for the current situation, not for something that is unused.

Those are the three key questions for now. I would also really appreciate it if you could let us know whether any particular logs in the NetApp logs directory might show errors indicating that we are closing in on the bug mentioned.

We have a large NetApp customer base, so we would really appreciate it if someone from NetApp could help us answer these queries.

Many thanks,

-Ashwin

ASH2017 · 2017-09-21

Hi Folks,

I noticed the KB mentioned in my earlier post was briefly removed and then reinstated. I'm not sure what exactly was updated.

Usually, for correctable ECC errors, there is a threshold that depends on the platform. Once the count is over the threshold, the module needs replacement. For uncorrectable memory errors it's often module replacement straight away, which is exactly what these two bug IDs suggest.
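Put as a rough Python sketch, the usual handling I'm describing looks like the following. This is only my understanding of common practice, not NetApp's actual logic; the function name and the threshold value in the example are made up, since the real threshold is platform-specific.

```python
def ecc_action(correctable_count, uncorrectable_count, threshold):
    """Return the usual action for a memory module given its ECC error counts.

    threshold is the platform-specific limit for correctable errors
    (the caller must supply it; no real value is assumed here).
    """
    if uncorrectable_count > 0:
        # Uncorrectable errors usually mean replacement straight away.
        return "replace"
    if correctable_count > threshold:
        # Correctable errors only matter once they exceed the platform threshold.
        return "replace"
    return "monitor"


# Example with a made-up threshold of 10 correctable errors.
print(ecc_action(correctable_count=3, uncorrectable_count=0, threshold=10))
print(ecc_action(correctable_count=25, uncorrectable_count=0, threshold=10))
print(ecc_action(correctable_count=0, uncorrectable_count=1, threshold=10))
```

Under this logic there is no branch where an uncorrectable error is resolved by software alone, which is why the third bug ID below puzzles me.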

As per this Bug ID: QLogic 8324 Fiber channel Host Adapter on FAS80x0

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=910422

Solution: Failover-giveback and replace Hardware QLogic 8324

As per this Bug ID: PCI NMI error triggers from the QLogic FC/10GbE CNA on FAS80x0 systems

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=917715

Solution: Failover-giveback and replace Hardware motherboard

However, according to this bug ID, the solution is an ONTAP upgrade:
https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1026931

How is an uncorrectable ECC memory error resolved by an ONTAP code upgrade? Does that mean there is nothing wrong with the physical hardware or firmware? How does a code upgrade resolve this hard error condition?

Thanks,

-Ashwin

bobshouseofcards · 2017-09-21

Hi Ashwin -

I'm not NetApp and don't claim to represent them in any manner. I have had to deal with multiple QLogic CNA adapter issues over a few years under Clustered Data ONTAP, so I thought I'd share some experiences around your questions.

First - since you have such a large installation, I advise you to leverage your NetApp support lines (Service Account Manager, direct NetApp Support, account System Engineer, etc.) to get more details. The community site can certainly get you some response and insights, but it isn't a primary support mechanism for this type of question.

Some of the QLogic CNA card issues have required updated firmware, which is delivered through Data ONTAP updates (not separately). For this one, it isn't clear whether a card firmware component is involved or just an ONTAP code change to mitigate the issue. I suspect both card firmware and ONTAP code have been updated.

With these types of issues, you tend to get them or you don't. But if you aren't having the issue, you are not necessarily immune from it. It's very much like the two types of motorcycle riders - there are those that have spilled a bike and those that have yet to do so. It's just a matter of time. If you are not having issues with your QLogic CNA cards, count yourself lucky, but address the ONTAP release as soon as you can. If you have had issues, then expect more and address the ONTAP release faster.

My experience has been that when any QLogic CNA issue happens, it just does - no warning or impending doom messages. Of course, even if you got them, the usual answer is to restart the node to reset the card, so such warnings aren't really helpful.

And yes, in my opinion, for any port that is currently in CNA mode and doesn't need to be for iSCSI, CIFS, NFS, or FCoE, change the port mode. It's the CNA personality that messes up in this case. It doesn't matter what protocol runs over the CNA mode; it's the mode that counts.
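If it helps to make that rule concrete, here is a small Python sketch of how I'd pick the ports to change. It works on a hand-built inventory list; the record fields (`node`, `adapter`, `mode`, `protocols`) and the sample values are hypothetical and are not the output format of any ONTAP command.

```python
# Protocols that actually require the CNA personality, per the rule above.
CNA_PROTOCOLS = {"iscsi", "cifs", "nfs", "fcoe"}


def ports_to_switch(ports):
    """Return ports that are in CNA mode but carry none of the CNA-only protocols."""
    return [
        p for p in ports
        if p["mode"] == "cna" and not (set(p["protocols"]) & CNA_PROTOCOLS)
    ]


# Made-up inventory, for illustration only.
inventory = [
    {"node": "node1", "adapter": "0e", "mode": "cna", "protocols": ["nfs"]},
    {"node": "node1", "adapter": "0f", "mode": "cna", "protocols": []},
    {"node": "node1", "adapter": "0g", "mode": "fc", "protocols": ["fc"]},
]
for port in ports_to_switch(inventory):
    print(port["node"], port["adapter"], "-> candidate for FC mode")
```

Ports like the first one, which are in CNA mode because they genuinely serve NFS or CIFS, stay on the list of exposed ports until the ONTAP release is addressed.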

Hope this helps you.

Bob Greenwald

Senior System Engineer | cStor

NCIE SAN ONTAP, Data Protection

ASH2017 · 2017-09-22

Hi Bob,

Thanks for replying to this thread, and sharing your experience. I really appreciate it.

Yes, we have already engaged NetApp here; they are indeed the ones who forwarded this alert to us. I was hoping for more detailed information regarding this issue.

Most non-correctable ECC errors lead to hardware replacement, and that's standard practice.

It would be nice to have more detailed information on what exactly is fixed in the ONTAP upgrade with respect to non-correctable ECC errors on the QLogic adapter.

Also, it looks like there is no workaround for customers currently serving data in CNA personality mode, apart from node failover and giveback. The other workaround is to switch the personality to FC mode, but that doesn't serve the CIFS/NFS protocols.

Of course, we are already planning the ONTAP upgrade, but it's going to take some time, considering that it involves not just ONTAP but also the other infrastructure that interacts with NetApp: OCUM, CommVault, ESXi, WFA, vCenter, and the HBAs on the hosts.

Thanks,

-Ashwin

BRIAN_BERKLEY · 2017-09-29

We're getting this error on our AFF-A300 pair in our cluster. I upgraded to 9.1P7 on 11SEPT17. It has core dumped and failed over 4 times in the last 18 hours. Waiting on support to get back to me.

The other half of our cluster is a FAS-2650 HA pair, so I'm a bit concerned about this not being fixed in 9.1P7, and spreading. All of the adapters show as being in CNA mode.

ASH2017 · 2017-09-30

Thanks for sharing this update. It's very shocking to hear. As per bug ID 1026931, the issue has been fixed since ONTAP 8.3.2P12 for all affected filers configured in CNA mode on FAS80x0, FAS8200, FAS2650, FAS2620, AFF80x0, AFF A300, or AFF A200 storage systems.

We are currently in the process of upgrading to 8.3.2P12 [minor upgrade], and looking at your update, it appears this is still not fixed. Is something wrong in the hardware/firmware of the CNA, or in a 'certain driver stack', that is causing this?

I raised a ticket with NetApp, but the response was as expected: a reply based purely on the bug ID and KB that are already published on the support site.

Please do share with us any update you get from NetApp.

Thanks,
-Ashwin