During periods of heavy activity, we are seeing LogFull CPs intermixed with HighWaterOps CPs at the rate of 1 or 2 every 100 secs. A coworker says the LogFull CPs will cause the filer to return an IO Error and drop writes. I claim it will merely slow the filer down (perhaps a lot) since the write will just block (perhaps for a second or two, even.)
All our activity is NFS3 over TCP and all mounts are hard, so all writes should hang unless the app interrupts the IO. (Will the filer interrupt the IO? I am a NetApp newbie; I expect a hard write to hang forever unless the app or operator wants to interrupt it.)
How can LogFull CPs be eliminated? (buy more hardware, my guess.)
Is there a document somewhere that explains what exactly is happening when a LogFull CP happens?
I've attached a pretty Cacti graph that shows the CP rates at a time of heavy activity. I think the rates are "per second", so "100m" (100 milli, i.e. 0.1 per second) means "one every 10 sec", etc.
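For anyone reading the graph the same way, here is a minimal sketch of that conversion. It assumes the "m" suffix is the usual SI milli prefix and that the plotted value is events per second; the function names are made up for illustration and are not part of Cacti.

```python
# Hypothetical helpers for reading a per-second rate off a graph axis.
# Assumption: the axis uses SI suffixes, so "100m" means 100 milli = 0.1/sec.

SI_DIVISORS = {"m": 1000.0, "u": 1000000.0}  # milli, micro


def parse_rate(label: str) -> float:
    """Turn an axis label such as '100m' or '2' into events per second."""
    label = label.strip()
    suffix = label[-1]
    if suffix in SI_DIVISORS:
        return float(label[:-1]) / SI_DIVISORS[suffix]
    return float(label)


def seconds_between_events(rate_per_second: float) -> float:
    """Average interval between events for a given per-second rate."""
    if rate_per_second <= 0:
        raise ValueError("rate must be positive")
    return 1.0 / rate_per_second


if __name__ == "__main__":
    for label in ("100m", "20m", "10m"):
        rate = parse_rate(label)
        print(label, "->", round(seconds_between_events(rate), 1), "sec between CPs")
```

By the same arithmetic, "1 or 2 every 100 secs" would show up on the graph as roughly 10m to 20m.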
- your filer receives more data than it can physically write to disks
So the first step is to analyze disk write performance. In the former case it may be possible to increase it without adding more hardware (e.g. fragmentation may impact disk performance by inducing more disk IO than is necessary). In the latter I am afraid throwing in more hardware and redistributing data between old and new is the only option.
A Log Full CP happens when half of NVRAM fills before the 10-second timer expires. That indicates a busy system, but it is not the same as back-to-back CPs, where the second log fills before the first has finished flushing. A Log Full CP by itself isn't a horrible thing, since the filer is able to flush one log before the other fills. If you get back-to-back CPs, then you should be concerned.
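To make the distinction concrete, here is a toy model of the three cases, not a description of how the filer actually schedules CPs. It assumes constant ingest and flush rates, and every name and number in it (classify_cp, the MB/s figures, the 256 MB half-log size) is hypothetical; only the 10-second timer comes from the description above.

```python
# Toy model only: constant rates, two equal NVRAM log halves.
CP_TIMER_SECONDS = 10.0  # the timer mentioned above


def classify_cp(ingest_mb_s: float, flush_mb_s: float, half_log_mb: float) -> str:
    """Classify the CP a steady workload would trigger in this toy model."""
    if min(ingest_mb_s, flush_mb_s, half_log_mb) <= 0:
        raise ValueError("all arguments must be positive")

    # Time for incoming writes to fill one half of the log.
    fill_time = half_log_mb / ingest_mb_s
    if fill_time >= CP_TIMER_SECONDS:
        return "timer"  # the timer fires before the half fills

    # Time to flush a full half to disk.
    flush_time = half_log_mb / flush_mb_s
    if fill_time < flush_time:
        return "back-to-back"  # other half fills before the flush is done
    return "log full"  # half filled early, but the flush still keeps up


if __name__ == "__main__":
    for ingest in (10, 100, 300):  # hypothetical MB/s figures
        print(ingest, "MB/s ->", classify_cp(ingest, 200, 256))
```

In this simplified picture, back-to-back CPs appear exactly when the ingest rate exceeds what the disks can flush, which is why they are the case to worry about.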
Thanks for the replies. To narrow the question down a little: with a hard-mounted file system, shouldn't I expect IOs to just hang rather than return an IO error if the incoming writes exceed the performance of the filer, at least for the duration of consecutive back-to-back CPs? Is this correct?
There is also the question of whether an app might be setting a timer that times out IOs on its own.
And we can see when back-to-back CPs happen with SNMP, which is a better metric than Log Full CPs.