About a week ago, an FAS2040 (8.0.2P5 7-Mode) stopped accepting HTTPS and SSH connections. SSH immediately disconnects, HTTPS returns "Error 503 - not enough server resources". I was finally able to get in through the serial console and started seeing messages like:
Cannot open the /etc/registry.0 file. Too many open files in system
Unable to update '/etc/registry.local' (err=Too many open files in system)
The storage side (mostly iSCSI with some CIFS) appears to be working normally so far, but this has me very nervous.
Running "maxfiles" from the CLI shows that none of the in-use file counts is anywhere close to the maximum (none is over 1% of it). I've run a couple of other status commands with no indication of anything unusual. Some of the volumes are close to being full, but that's normal.
Any pointers on what to try? The filer has been running for several years without any problems, and nothing has really changed on it recently. The last reboot was about 14 months ago (physical move), but it ran with a similar load for almost three years straight before that with no problems.
Good news/bad news type of answer here. It isn't so much that the clients have a ton of open files in volumes; rather, it's that internal Data ONTAP applications do. Consider the text from "bug" 746936:
Each Data ONTAP application has a limit of 2,048 opened files. When an
application hits this limit, it will no longer be able to open new files, which
can prevent it from writing to disk or communicating on the network.
This can often happen because too many external applications are connecting to
a Data ONTAP application.
If you use scripts in your environment, you can reduce the load on opened files
by ensuring that the scripts reuse the same network connections, files, or
The "bug" isn't something that will be fixed; it simply describes what happens when an environment pushes past the capability of the NetApp system, in this case by driving too much load at one of its applications. The "applications" in question are internal Data ONTAP processes, such as the SSH daemon or the web server service. Remember that under the covers Data ONTAP is a BSD-derived OS, so each individual process has a per-process file and memory limit, just as a process in any regular BSD installation would. Somewhere one of those processes has run up against its limit and is disrupting your system. So that's the good part: there is an explanation for the underlying cause.
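The per-process descriptor limit described above can be demonstrated on any POSIX system. This is a minimal sketch in plain Python (nothing NetApp-specific here): it lowers the process's own soft RLIMIT_NOFILE and opens files until the kernel refuses, which is essentially what happens to a Data ONTAP application when it hits its 2,048-file cap.

```python
import errno
import resource
import tempfile

# Every process gets a per-process file-descriptor limit; the 2,048-file
# cap on internal Data ONTAP applications is the same mechanism.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")

# Lower our own soft limit so the failure is quick and safe to reproduce.
resource.setrlimit(resource.RLIMIT_NOFILE, (64, hard))

open_files = []
err = None
try:
    while True:
        open_files.append(tempfile.TemporaryFile())
except OSError as exc:
    err = exc.errno  # EMFILE: this process is out of descriptors
    print(f"failed after {len(open_files)} open files: "
          f"{errno.errorcode[err]}")
finally:
    for f in open_files:
        f.close()
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```

Once the limit is hit, every further open() fails until descriptors are released, which is why a daemon in that state can't accept new sessions or write files even though the system as a whole looks healthy.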
Now the hard part: finding out what is triggering the condition. A cluster failover/giveback would obviously clear the situation for a while, since it forces all processes to stop and restart. That's the quick "fix", but of course the situation is likely to come back. As the "bug" suggests, I'd start with the environment. Is there some process or monitoring tool that regularly opens connections to your nodes? Does it release those connections properly, or does it simply open new ones over time while holding onto the old ones? I've certainly run into that type of situation before, where a monitoring tool kept opening SSH sessions to run command lines but, due to an implementation error, never formally released the old ones.
Failing that, at the diagnostic privilege level you can get close to command-line-level access to the underlying BSD; as I recall you can run a "ps" command and kill individual processes if needed. Unfortunately I've been away from 7-Mode for quite a while and I don't trust that my information is current enough to guide you, so hopefully someone else can. But like the cluster failover routine, killing (and hopefully restarting) an underlying process is only a temporary fix. You'll still need to determine why that specific process is being asked to open so many files.