2015-02-17 07:07 AM
2015-02-20 02:14 PM
Is this on a cDOT version of Data ONTAP?
We've been experiencing the same issue for about six months, ever since we took delivery of our new NetApp and started using it in October of last year. About every two to four weeks we start to see multiple messages like this in the error logs:
wafl.memory.statusLowMemory: WAFL is running low on memory, with 377MB
This is a pair of 3220s running 8.2.1P3. When this issue occurs, latency is extremely high and most of our dependent systems fail. We have to watch for this message specifically and, when it occurs, manually fail the controller over, reboot it, and give back the storage. We missed a cue this last weekend, and the controller completely dropped the storage out from under the vCenter and the entire server farm failed. Production hospital systems - failed.
We have a new case open with NetApp, with almost no results so far, and tomorrow it will have been a week.
I would just say that you need to watch the controller this message is showing up on very, very carefully to avoid the situation we found ourselves in here.
2015-02-22 08:58 AM - edited 2015-02-22 11:09 AM
I have the same version you have, but on a pair of V3250s. Do you have a Flash Cache card and, if so, how big is it? We had the issue you describe this weekend and I had to do a takeover/giveback of our active cluster node.
I'm starting to think that the issue is with the Flash Cache cards having to take large I/O loads. We moved our DW environment 2 weeks ago, and every day we are doing about 30,000 IOPS on the Flash Cache cards alone. I'm guessing that it might be affecting CP (consistency point) processing or something.
2015-02-23 01:23 PM
We have 512 GB Flash Cache cards. I was so tempted to "disable" the one on the controller that is having a problem but I need to research this more before I take any action.
NetApp is indicating they believe it is a software bug as well but cannot tell me specifically what the bug is or how to reconfigure to avoid it.
In the meantime I sit and watch for wafl low memory errors so I can trigger a failover/giveback.
No word from my NetApp support folks today.
2015-02-23 01:35 PM
NetApp support is suggesting I run a diagnostic tool during the peak time. I'm planning to run it tonight and again Saturday night. The tool is called perstatgui. I also set an alert to email us if it sees any entry from wafl.memory.status and wafl.memory.status.LowMemory, using the event route and event add-destination commands. I will update on my issue when I get more information.
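For anyone who would rather scan exported log text than sit and watch for the message, here is a minimal Python sketch. It assumes the log lines look like the wafl.memory.statusLowMemory message quoted earlier in this thread; the function name and the sample lines are hypothetical, and the exact message format on your system may differ.

```python
import re

# Pattern based on the message text quoted earlier in this thread.
# The exact format on a given system is an assumption.
LOW_MEMORY_RE = re.compile(
    r"wafl\.memory\.statusLowMemory.*?with\s+(\d+)\s*MB", re.IGNORECASE
)


def find_low_memory_events(lines):
    """Return (line_number, megabytes) for each low-memory message found."""
    events = []
    for number, line in enumerate(lines, start=1):
        match = LOW_MEMORY_RE.search(line)
        if match:
            events.append((number, int(match.group(1))))
    return events


if __name__ == "__main__":
    # Hypothetical sample input; in practice, read lines from an exported log.
    sample = [
        "some unrelated log line",
        "wafl.memory.statusLowMemory: WAFL is running low on memory, with 377MB",
    ]
    for number, megabytes in find_low_memory_events(sample):
        print(f"line {number}: WAFL low memory, {megabytes} MB reported")
```

This only finds the message in text you already have; it does not replace the on-box email alert, and how you export or collect the logs is up to your environment.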
2015-02-27 01:36 PM - edited 2015-02-27 01:47 PM
OK, in our case what I found is that we have a POC virtualized Commvault system that came online in the last 3 weeks. The POC system was running on volumes that were being deduplicated. We don't know if it kept generating too much data for the Flash Cache card to handle or if it threw bad data at the Flash Cache card. Flash Cache caches metadata and random reads. Deduplication uses the metadata information, so it's going to use the Flash Cache heavily. My guess is that, since we have multiple volumes that were logically virtualized by Commvault as one volume, there were a lot of random reads, and in combination with those volumes being set to dedup, it must have done something negative to the Flash Cache card.
We did notice that only 1 of the nodes was being affected, so I was able to rule out the new data warehouse since it was operating on the 2nd node.
Maybe some of these questions will help you find out which system is causing your issue:
- Are you running everything off of 1 node or 2 nodes?
- Does the wafl error state that it's coming from one node or both nodes?
- Is your backup system on separate disk or inside the Netapp SAN?
- Are you deduplicating every volume? Including SQL data volumes?
- Do you have all your volumes running the deduplication schedule using the default sun-sat@0 schedule?
The things I did to bring down the performance load as much as possible were the following:
- Made sure not to deduplicate SQL databases/logs (not worth the processing compared to the low space savings)
- Moved all non-production deduplicated volumes to run at a different time on the weekend with a qos-policy of background
- Changed some of the deduplicated volumes to automatic or disabled deduplication. I became picky about what should be deduplicated.
- Moved the POC Commvault test to a separate SAN so it doesn't touch the Flash Cache
A drastic thing you can do if you need to buy time is to disable deduplication for now across all the volumes while you track what system could be generating all the data. Hope this helps.
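To illustrate the rescheduling step in the list above, here is a small Python sketch that spreads volumes over weekend slots instead of leaving them all on the default sun-sat@0. The schedule strings only mirror the day@hour shape of that default; the function name, volume names, and slot spacing are hypothetical, and applying the schedules to the controller is not shown.

```python
def stagger_schedules(volumes, days=("sat", "sun"), start_hour=0, step_hours=2):
    """Assign each volume its own day@hour slot so jobs do not all start together."""
    # Build the available slots in order: all hours for the first day, then the next.
    slots = [
        (day, hour)
        for day in days
        for hour in range(start_hour, 24, step_hours)
    ]
    if len(volumes) > len(slots):
        raise ValueError("more volumes than available slots; reduce step_hours")
    return {vol: f"{day}@{hour}" for vol, (day, hour) in zip(volumes, slots)}


if __name__ == "__main__":
    # Hypothetical non-production volume names.
    plan = stagger_schedules(["vol_test1", "vol_test2", "vol_dev1"])
    for volume, schedule in plan.items():
        print(f"{volume}: {schedule}")
```

With the defaults, each volume gets a slot two hours after the previous one, starting Saturday at midnight. Adjust the days and spacing to fit your own quiet periods.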