Receive side flow control? nfsd.tcp.close.idle.notify:warning

spmkayser · ‎2010-07-29

Greetings,

we are facing a saturated 1 Gbit/s link on a FAS system, but what exactly is the following flow-control-related log entry trying to tell me?

[nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (x.x.x.x) where receive side flow control has been enabled. There are 67032 bytes in the receive buffer. This socket is being closed from the deferred queue.

Does this point to Ethernet flow control kicking in? Or TCP flow control? Who triggered what? And what's happening to the 67032 remaining bytes in the receive buffer?

Sebastian

marcconeley · ‎2010-09-09

Hi Sebastian,

Did you ever find a solution to this problem??

I am receiving similar errors on my filer every 10 seconds, and I am also suffering NFS disconnects during the evenings (all my ESX hosts are being disconnected!).

In my case, 192.168.102.6 is the IP of one of my ESX hosts - and the warning message continues even if I take this ESX host offline!

Thu Sep 9 14:53:07 CEST [Filer: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (192.168.102.6) where receive side flow control has been enabled. There are 0 bytes in the receive buffer.

I am not sure if the warning is related to the disconnects, but it seems likely.

Any help would be much appreciated.

//Marc

spmkayser · ‎2010-09-09

Marc,

no, we didn't investigate the error message any further. Ideally, one would open a support case and have a NetApp engineer explain what exactly the warning is about. What we did was monitor the aggregated network links between the filer and the switch (with Cacti, but it can be done with other tools). We found the NFS traffic to be unequally balanced (IP-based link selection policy), with one link being saturated. So we chose different VMkernel IPs to distribute the traffic across the links. Now the IP traffic is balanced, there's plenty of headroom for further throughput, and there are no more timeouts, disconnects, or hiccups.
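To illustrate why the choice of VMkernel IPs matters with an IP-based policy, here is a minimal sketch. It assumes a simplified hash (XOR of the last octets of source and destination, modulo the number of links); the real algorithms on the filer and the switch may well differ, and the filer address and the function names are made up for the example.

```python
from collections import Counter
from ipaddress import IPv4Address


def pick_link(src: str, dst: str, n_links: int) -> int:
    """Hypothetical IP hash: XOR the last octets, then modulo the link count."""
    s = int(IPv4Address(src)) & 0xFF
    d = int(IPv4Address(dst)) & 0xFF
    return (s ^ d) % n_links


def link_distribution(sources, dst, n_links):
    """Count how many source IPs end up on each link for one destination."""
    return Counter(pick_link(s, dst, n_links) for s in sources)


# Made-up addresses: one filer interface group with two links.
filer = "192.168.102.10"
even_only = ["192.168.102.2", "192.168.102.4", "192.168.102.6", "192.168.102.8"]
mixed = ["192.168.102.2", "192.168.102.3", "192.168.102.4", "192.168.102.5"]

print(link_distribution(even_only, filer, 2))  # every host lands on the same link
print(link_distribution(mixed, filer, 2))      # hosts are split across both links
```

Under this assumed hash, client IPs that differ only in their higher bits all map to the same link, which matches the kind of imbalance we saw in Cacti.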

If you open a NetApp support case regarding the warning messages, it would be great if you could share possible findings.

Sebastian

braidenjudd · ‎2010-12-02

Having the exact same problem. What version of Ontap are you using? Did it start happening after an upgrade?

marcconeley · ‎2010-12-03

We are now using 7.3.4P2. I think before we were using 7.3.2P2.

I can't now remember if it was directly related to an OnTap upgrade - we only spotted it because we had high CPU & latency on the filer and were suffering disconnects from our NFS LUN (our ESX servers were dropping connections to the filer).

We spent many hours trying to fix this and opened a case with our support vendor; we then tried a number of different things:

- we enabled "Flow control receive on" on our Cisco switch interfaces to the filer.

- we edited the RC files on the filer to specify "FLOW CONTROL ON" on the interfaces (even though this always displayed as ON when looking at ifstat -a).

- we restarted the controllers.

I have no idea which of these helped matters, but after making these changes and restarting the controller, the message cleared. Making these changes without restarting didn't help; it was only after the restart that we saw an improvement.

I also have no idea what caused this issue as we have 2 filers, both with dual controllers, all configured the same way on the switch and filer side - but only 1 controller developed this error.

After we cleared the error message we then started a project to reduce the load on this controller (offloading database and Exchange verifications, moving high I/O volumes to the 2nd controller etc). Since then we haven't seen this error again nor experienced any disconnects on our ESX hosts.

I hope this information helps.

Marc

abuchmann · ‎2011-04-26

hi marc

we experience the same problem and i would like to try what you have configured.

but one question:

you wrote "flow control on", does this mean "ifconfig e0x flowcontrol full"?

kind regards,

adrian

marcconeley · ‎2011-04-26

Hi Adrian,

Yes. But I believe we had some problems when trying to set this value through the BMC console or Filerview - therefore we included this in the RC files and restarted (by performing a takeover/giveback) the controllers.

But you're right:

ifconfig e0a flowcontrol full

Actually, I am currently dealing with exactly the same error message again! And once again it has occurred after suffering major performance problems. In this latest occurrence it appears our Exchange verify pushed us above our IOPS limit; the filer CPU and latency times then hit the roof, all our VM hosts started complaining about disk write times being too high or crashed, and this "receive side flow control" error message has returned.

The IP address it's whining about in my case is the IP assigned to the VMkernel port on one of my ESX hosts. Just like last time, turning off this ESX host or assigning it a new IP address doesn't fix the problem. I am sure the problem will remain until I perform a restart of my filer controller 😕

I think that most people who have seen this error have maxed out their filer somehow (IOPS or saturated Gb links), and then get stuck with this filer error. I guess it's a kind of "flag" on the filer that gets raised and doesn't clear itself. It obviously doesn't rely on the actual IP address still being alive or active.

Can I ask you which IP address the warning is referring to in your case? Is it the VMKernel port on one of your ESX hosts?

Regards,

Marc

abuchmann · ‎2011-04-27

Hi Marc

No, we're testing a configuration based on a Solaris 10 Veritas Cluster and Oracle 11gR2 with Direct NFS mounts and we weren't able to take backups using SnapManager for Oracle. So yesterday, I moved the affected volumes to another, less productive filer (same flowcontrol settings) to check if the problem also occurs...and it did. After that, I applied the settings you posted.

With "flowcontrol receive on" on the catalyst and "ifconfig e0b flowcontrol full" configured on the filer we were able to get rid of the "nfsd.tcp.close.idle.notify:warning". The backup jobs using SMO are now also working like they should.

I don't think that this error has anything to do with performance problems in our environment, because both filers (FAS6280 & FAS3160) were not at all maxed out at the time it occurred.

But is it normal that flowcontrol has such a huge impact?

kind regards,

Adrian

netappgomez · ‎2011-05-03

Hello,

I agree with Marc on this. We recorded the exact same error during very high IOPS. We are smoking our spindles! It continues to report the error below. It only reports it from one host on the NFS kernel port we have for that network.

"Tue May 3 25:75:78 NAN [BLAHBLAH: nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to client (x.x.x.x) where receive side flow control has been enabled. There are 0 bytes in the receive buffer."

Network settings: flow control full.

We upgraded from 7.2 family to 7.3.5PX about 45 days ago and our filers didn't report the error until we performed some high IOPS tasks yesterday.

Netapp's possible solution is to failover.

https://kb.netapp.com/support/index?page=content&id=2013194

I won't be able to fail over for a while.

marcconeley · ‎2011-05-05

Great! Thanks for that link.

And I can confirm that today I performed a takeover/giveback and it cleared the error!

Although the article says it's safe to ignore this error unless you receive many of them, last night we had more high I/O and then we started getting the same complaints about a second IP address (another one of our ESX hosts). Ignoring the error wasn't an option as all our VMs started complaining about time-outs on their virtual disks - even though by this time the high I/O period was finished and everything had returned to normal.

Windows servers log errors like this one:

Event Type: Error
Event Source: symmpi
The device, \Device\Scsi\symmpi1, did not respond within the timeout period.

Personally, I am not convinced by this "non-disruptive" theory. I would recommend that you monitor your servers closely while this error persists on your filer and schedule a takeover/giveback ASAP.
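One way to keep an eye on the filer side is to tally the warnings per client from the messages log. The following is only a rough sketch: it assumes the log lines look like the ones quoted in this thread, and the function name and sample list are made up for illustration.

```python
import re
from collections import Counter

# Matches the warning text as quoted in this thread; other ONTAP versions
# may word the message differently (assumption).
PATTERN = re.compile(
    r"nfsd\.tcp\.close\.idle\.notify:warning\]: "
    r"Shutting down idle connection to client \((?P<client>[^)]+)\).*?"
    r"There are (?P<bytes>\d+) bytes in the receive buffer"
)


def tally_warnings(lines):
    """Return (warnings per client, total buffered bytes per client)."""
    counts = Counter()
    buffered = Counter()
    for line in lines:
        m = PATTERN.search(line)
        if m:
            counts[m["client"]] += 1
            buffered[m["client"]] += int(m["bytes"])
    return counts, buffered


# Sample lines copied from the posts above, plus one unrelated line.
sample = [
    "Thu Sep 9 14:53:07 CEST [Filer: nfsd.tcp.close.idle.notify:warning]: "
    "Shutting down idle connection to client (192.168.102.6) where receive side "
    "flow control has been enabled. There are 0 bytes in the receive buffer.",
    "[nfsd.tcp.close.idle.notify:warning]: Shutting down idle connection to "
    "client (x.x.x.x) where receive side flow control has been enabled. There are "
    "67032 bytes in the receive buffer. This socket is being closed from the "
    "deferred queue.",
    "some unrelated log line",
]

counts, buffered = tally_warnings(sample)
for client, n in counts.most_common():
    print(client, n, buffered[client])
```

Feeding it the filer's messages file (for example via `open(path)`) shows at a glance whether the warning sticks to one client IP or spreads to others, as it did in our case.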

Thanks for the info.

Marc