AFF

A700s QSFP+ connection via EX4550-32F switch doesn't work

CloudShareian
5,183 Views

We are using a stack of two EX4550-32F switches, with Juniper QSFP+ modules in PIC 1 on both switches, to connect a NetApp A700s to the data network. For the sake of the question I will address only one of the connections we currently have to the switches.

We are using the following setup as advised by Netapp: 

CloudShareian_0-1721551373705.png

Node B is currently connected with only e0e to the QSFP+ module in the 2nd EX4550-32F, using a QSFP+-40G-CU3M cable (recognized by the Juniper switch, but shown on the NetApp as "CISCO-JPC"). The switch port is et-1/1/0, configured in access mode. A LIF is configured with an IP directly on e0e.

The current status is that the network port show command reports a healthy status for e0e on Node B, with the correct duplex, MTU, etc.

The same is true of the et-1/1/0 interface on the Juniper.

But if I run the ifstat e0e command on Node B I see CRC errors accumulating on e0e (I don't see any errors on et-1/1/0 interface on the Juniper).

CloudShareian_1-1721552347723.png

My question is: what could be causing the CRC errors (which are probably why we have no connectivity on the LIF that uses this port), and what else should we check?
Our vendor states that all the hardware is compatible.

Thank you.

1 ACCEPTED SOLUTION

TMACMD
5,103 Views

Yeah, my guy says this is on the switch end. I use those all the time on NetApp to Cisco switches without incident. To other switch manufacturers, not so much. Even with Arista, it can be hit or miss depending on switch model and OS revision. Using optics levels the playing field. Use vendor-supported optics at either end and use a neutral fibre to connect them.


20 REPLIES

TMACMD
5,140 Views

Please show the rest of the ifstat output!

 

I’ve never had any good luck using twinax between any NetApp and a Juniper switch. I've pretty much always used optical transceivers on both ends.

You could try:

 turn off auto negotiation on both ends

 forcing flow control to none on both ends

 forcing speed to 40000 on both ends

 forcing duplex to full on both ends

 

 then physically remove the cable, zero the stats (ifstat -z -a) and plug the cable back in

 

 If it doesn’t work, get APPROVED/SUPPORTED optics for both ends.

Twinax is great but doesn’t always work. It requires some level of support in the code on both ends.

I’ve seen customers get special twinax cables with each end programmed differently (like Cisco at one end and Intel at the other).
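
One way to judge whether a change helped is to zero the stats, take two readings of the port's error counters some time apart, and look at what grew. A minimal Python sketch of that comparison, assuming the counter values are copied by hand from the ifstat output; the field names (`crc_errors`, `rx_frames`) and the sample numbers are hypothetical, not actual ONTAP output:

```python
def counter_growth(before, after):
    """Return how much each counter grew between two readings."""
    return {name: after[name] - before.get(name, 0) for name in after}


def still_accumulating(before, after, field="crc_errors"):
    """True if the given error counter increased between the readings."""
    return counter_growth(before, after).get(field, 0) > 0


# Hypothetical readings taken some time apart after zeroing the stats.
reading_1 = {"rx_frames": 1200, "crc_errors": 0}
reading_2 = {"rx_frames": 5400, "crc_errors": 37}

growth = counter_growth(reading_1, reading_2)
print(growth)
print("CRC errors still accumulating:", still_accumulating(reading_1, reading_2))
```

If the CRC counter keeps growing right after a zero and replug, the cable or transceiver pairing is still suspect; if it stays at zero while frames are received, the change likely helped.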

CloudShareian
5,111 Views

Thank you for your comment! 🙏
We have actually already ordered optical transceivers and cables from our vendor to try that, because my theory was exactly that the cables we are using are no good.

We will try your suggestion as soon as those arrive this week and hopefully it will solve our issue.

CloudShareian
5,110 Views

Full 'ifstat e0e' for the port in question:

CloudShareian_1-1721564646252.png

 

 

TMACMD
5,106 Views

It’s not that the cables are no good; it’s that one or both ends (NetApp and/or switch) don’t have sufficient programming to support that cable.

Just curious, are you able to send a picture of the cable ends (with markings)? If not, no worries. The optics are the best bet in this case anyway.

CloudShareian
5,105 Views

Here are the cables used (both ends are the same):

CloudShareian_1-1721567395649.png

andris
4,956 Views

The A700s 40GbE onboard ports are known to work well with Cisco 40GbE Twinax, generally. It is most likely an issue on the switch side. You can consider using the switch vendor's supported CR4 cables or as others have suggested... go with optical (SR4).

TMACMD
5,104 Views

Yeah, my guy says this is on the switch end. I use those all the time on NetApp to Cisco switches without incident. To other switch manufacturers, not so much. Even with Arista, it can be hit or miss depending on switch model and OS revision. Using optics levels the playing field. Use vendor-supported optics at either end and use a neutral fibre to connect them.

CloudShareian
5,100 Views

Thank you for your help!
I will post our results when we have them.

CloudShareian
4,756 Views
We replaced the QSFP+-40G-CU3M (37-1317-03) twinax cables with: 
EX4550-EM-2QSFP (711-039086) QSFP+-40G-eSR4 (740-045627) (Juniper side)
QSFP, OPT, TRANSCEIVER, 40GbE, SR (332-00389) (NetApp side)
CABLE,OPT,OM4,MPO/MPO,TYPE B,2M (112-00326) (MPO cable)
We did this on all 4 ports across both A700s nodes: e0e, e0j / e0e, e0j.
All ports show link and all ports recognize a 40GbE connection.
We managed to configure an IP on 2 of the ports and got a stable ping.
On all 4 ports (e0e, e0j / e0e, e0j) on both nodes we see CRC errors being generated constantly.
I think we may need to try yet another type of transceiver and cable, even though the parts we are using appear to be compatible.
CloudShareian_0-1722271196898.jpeg

 

TMACMD
4,733 Views

A few things to try. Are they currently connected as individual ports or a port-channel?

 

Are the CRCs on the NetApp side (system node run -node * ifstat e0e, then repeat with e0j), on the switch side, or both? Are you seeing just CRCs, or is anything populated in the Xon/Xoff/Pause fields? (If those are populated, you have flow control enabled somewhere... turn it off!)

 

First, pull the cables and wipe the ends with a clean, lint-free cloth. I know some places take the caps off the ends and they get dirty. Not saying that is the case, but clean the ends. Try hitting all the optics with a puff of air. NOT COMPRESSED AIR. Just those little "puffers".

 

You can clear ONTAP stats by rebooting (clears all stats) or with: system node run -node * ifstat -z -a (zeroes all interfaces, at least for ifstat)

 

Try setting BOTH sides to: 40G, full duplex, no flow control

Some switches do not auto-negotiate nicely. Even with Cisco, I occasionally need to force speed/duplex.

 

CloudShareian
4,671 Views
  1. e0e and e0j of node-A are configured as a port channel; e0e and e0j on node-B are configured as individual ports, and those are the ports I have a steady ping on.
  2. The CRC errors are on all the ports, and there are no errors in the xon/xoff/pause fields. We see the CRC errors only on the RECEIVE side of the NetApp; the switch shows no errors, neither in the port statistics nor in the logs.
    CloudShareian_0-1722326865662.png
  3. We will definitely try to clean them, hopefully today or tomorrow. Thank you for the tip.
  4. I zeroed all the interfaces every time we tried reconnecting different cables etc., and the errors came back instantly.
  5. I will try to set everything to static and see if it changes anything, thank you
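
For context on why the errors show up only on the receive side: a CRC error is counted by the receiver, which recomputes the frame check sequence (FCS) over each incoming frame and compares it with the FCS the sender appended. Frames damaged in the switch-to-NetApp direction are therefore counted on the NetApp, while the switch's own counters stay clean. A minimal Python sketch of that check, assuming `zlib.crc32` as a stand-in for the NIC's hardware CRC-32 (it uses the same polynomial as Ethernet, though a real NIC also deals with bit ordering and framing on the wire); the frame bytes are made up:

```python
import zlib


def fcs(frame: bytes) -> int:
    """CRC-32 over the frame bytes, as an unsigned 32-bit value."""
    return zlib.crc32(frame) & 0xFFFFFFFF


def receiver_accepts(frame: bytes, sent_fcs: int) -> bool:
    """The receiver recomputes the CRC and compares it with the sender's FCS."""
    return fcs(frame) == sent_fcs


# Hypothetical frame contents; the sender computes the FCS before transmitting.
sent = b"example payload travelling from switch to NetApp port e0e"
sent_fcs = fcs(sent)

# Simulate a single bit flipped in transit (bad cable, dirty optic, ...).
damaged = bytearray(sent)
damaged[10] ^= 0x01
damaged = bytes(damaged)

print(receiver_accepts(sent, sent_fcs))     # True: frame arrived intact
print(receiver_accepts(damaged, sent_fcs))  # False: counted as a CRC error
```

The sender has no way to know the frame was damaged after it left, which is consistent with a switch port that reports zero errors.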

CloudShareian
4,625 Views

Strangely, I'm getting this when trying to set the speed of the port to 40GbE:

CloudShareian_0-1722334639538.png

How is that possible? What am I doing wrong?

 

TMACMD
4,620 Views

Just curious,

 Based on that output it looks like the ports are currently in breakout mode!

 

 What’s the output of:

 net port show -node * -port e0*

 

TMACMD
4,620 Views

And by the way, if it is in breakout mode, which I believe it is, there’s a good chance the twinax will work after the port is converted back to 40GbE.

CloudShareian
4,617 Views

We tried the twinax cables after converting the ports to 40GbE and saw CRC errors and no ping.

CloudShareian
4,617 Views

Here is the output, all ports are 40GbE

CloudShareian_0-1722339350895.png

I ran the nicadmin convert command on both the e0a and e0f NICs (which changes the mode for the whole pair of NICs: e0a + e0e and e0f + e0j) to 40GbE when we got the device. The command ran successfully.
We followed a similar procedure:
https://kb.netapp.com/on-prem/ontap/OHW/OHW-KBs/Convert_Ethernet_port_between_40GbE_and_4x10GbE_personality

TMACMD
4,610 Views

Ok. Try it in a two step process. 
First, just run the port modify command but set only -autonegotiate-admin to false. If that works, then try setting the rest.

I’ve never seen a case where the port speed could not be set.

TMACMD
4,610 Views

This looks like it might be your issue. I just found it:

 

https://kb.netapp.com/on-prem/ontap/Ontap_OS/OS-KBs/CRC_Errors_on_XL710_40GbE_ports_after_ONTAP_reboot

 

There are a few suggestions there.
Basically, try direct optics on the switch side and try changing the debounce timer on the switch.

andris
4,750 Views

X65402 (332-00389) is also a 40GBASE-eSR4 QSFP transceiver and is supported on AFF A700s onboard ports.

Maybe it's time to start suspecting the EX4550-EM-2QSFP module?

CloudShareian
4,671 Views

Thank you, we will try to replace the modules, and you are correct about the X65402. The funny thing is that some vendors' modules, such as the FINISAR ones we use, are marked as QSFP and SR rather than QSFP+ and SR4. It's just an observation and probably a big red herring.
