<SOLVED> netif.linkErrors: Excessive link errors on network interface

aknight
I was seeing this spewed in the "event log show" output on the CLI of our 9.5P5 cluster today and started investigating. I went searching for information and found one old topic that didn't have any concise replies, so I figured I'd share here.

 

Errors looked like this:

9/26/2019 09:32:21  <node-name>        ERROR         netif.linkErrors: Excessive link errors on network interface e0b. Might indicate a bad cable, switch port, or NIC, or that a cable connector is not fully inserted in a socket. On a 10/100 port, might indicate a duplex mismatch.
9/26/2019 08:18:36  <node-name>        ERROR         netif.linkErrors: Excessive link errors on network interface e0b. Might indicate a bad cable, switch port, or NIC, or that a cable connector is not fully inserted in a socket. On a 10/100 port, might indicate a duplex mismatch.
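
For anyone wanting to tally these across a long log, here is a minimal Python sketch that counts netif.linkErrors events per interface. It assumes the event log show output has been saved as plain text in the format above. The name count_link_errors is made up for illustration and is not part of any NetApp tooling.

```python
import re
from collections import Counter

# Matches the interface name in the message text shown above, e.g. "e0b".
LINK_ERROR_RE = re.compile(
    r"netif\.linkErrors: Excessive link errors on network interface ([^\s.]+)\."
)

def count_link_errors(log_text):
    """Count netif.linkErrors events per interface (hypothetical helper)."""
    hits = Counter()
    for line in log_text.splitlines():
        match = LINK_ERROR_RE.search(line)
        if match:
            hits[match.group(1)] += 1
    return hits
```

Calling count_link_errors on the saved log text returns a Counter keyed by interface name, which makes it easy to see whether one port dominates.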

I checked error counters on the cluster switch and didn't see anything there.

 

Then I went to the node itself and started digging into that interface:

nacl01::> node run -node nacl01-h7 ifstat e0b

-- interface  e0b  (96 days, 20 hours, 17 minutes, 6 seconds) --

RECEIVE
 Total frames:      135g | Frames/second:   16234  | Total bytes:       743t
 Bytes/second:    88830k | Total errors:      520k | Errors/minute:       4 
...

...ouch. 520k errors since the counters were last reset. No wonder the event log was complaining.
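
If you want to pull those counters into a script, a rough Python sketch follows. It assumes the cells are separated by "|" as in the output above and that the k/m/g/t suffixes are plain decimal multipliers, which is my assumption and not something confirmed by the ifstat documentation. The name parse_ifstat_counters is hypothetical.

```python
import re

# Assumption: suffixes are decimal multipliers (k = 10^3, g = 10^9, t = 10^12).
SUFFIX = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}
VALUE_RE = re.compile(r"(\d+)([kmgt]?)")

def parse_ifstat_counters(text):
    """Return {counter name: integer value} from ifstat-style output."""
    counters = {}
    for line in text.splitlines():
        for cell in line.split("|"):
            name, sep, value = cell.partition(":")
            match = VALUE_RE.fullmatch(value.strip())
            # Skip headers, prompts, and anything that is not "name: number".
            if sep and match:
                counters[name.strip()] = int(match.group(1)) * SUFFIX[match.group(2)]
    return counters
```

Because the displayed values are rounded (520k, not an exact count), treat the parsed numbers as approximate.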

 

I have had issues with SFP+ optics partially failing in the past, so that was the first place I looked. I replaced the optic on the filer itself, reset the counters (ifstat -z e0b), and waited... errors came back, but at a slower rate.

 

Then I went to the cluster switch, replaced the optic there, reset the counters again, and waited... VOILA! No more errors.
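
Comparing "slower rate" against the baseline is easier with a number. Here is a small Python sketch that turns an error total and the elapsed time since the last counter reset into an average errors-per-minute figure. The helper name errors_per_minute is made up for illustration, and the 520k input is the rounded value from the ifstat output above, so the result is approximate.

```python
def errors_per_minute(total_errors, days=0, hours=0, minutes=0, seconds=0):
    """Average error rate over the time since the counters were last reset."""
    elapsed_minutes = days * 1440 + hours * 60 + minutes + seconds / 60
    if elapsed_minutes <= 0:
        raise ValueError("elapsed time must be positive")
    return total_errors / elapsed_minutes

# Baseline from the ifstat header: 96 days, 20 hours, 17 minutes, 6 seconds.
baseline = errors_per_minute(520_000, days=96, hours=20, minutes=17, seconds=6)
print(f"baseline: {baseline:.2f} errors/minute")
```

That works out to roughly 3.7 errors per minute on average, in the same ballpark as the Errors/minute value ifstat showed. Running the same calculation after each optic swap and counter reset gives a like-for-like comparison.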

 

In my troubleshooting experience, if your cables are well run and not stepped on or anything crazy, these types of network errors are about 20x as likely to be caused by optics as by cables.

 

Anyway, just wanted to share an updated version of this error and my experience.
