ONTAP Hardware

WAFL is running low on memory, likely due to Volume Migration (URGENT)

NetApp_Journeyman

Hi,

We're in the process of migrating volumes to a different aggregate. Things were going smoothly once I started relying on unthrottled moves, but I was hit with the following error message yesterday:

wafl.memory.statusLowMemory: WAFL is running low on memory, with 8XXMB remaining.

I did some preliminary checking, and I assume this is due to the volume migration putting a heavy load on the PAM cards. Additionally, the cluster LIFs were not at home.

How serious is this issue, and could it potentially lead to devastating consequences? Please advise.

1 ACCEPTED SOLUTION

loganb

At a minimum, you should be on 9.3P7 to avoid this bug. It affects various processes.

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1118890

In some cases, entire user space memory pages become corrupted and cause memory failures.

15 REPLIES

Drew_C

If this is truly urgent, please open a case with our support team.

Community Manager \\ NetApp

NetApp_Journeyman

I fully intend to.

However, if anyone knows the cause, I would appreciate an answer nonetheless.

SpindleNinja

I second what Drew said.

Vol moves shouldn't affect the system like that; they are background processes. This could be a bug, though.

TMACMD

How many volumes are moving without throttling?

You can only do a few at a time unless you want to run out of memory.

Try doing fewer moves. Wait until some finish. Do more.
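
For anyone who wants to script this, the "fewer moves, wait, do more" approach can be sketched roughly as below. This is only a sketch under assumptions: `start_move`, `is_finished`, and `wait` are hypothetical hooks you would have to implement yourself around whatever tooling you use to start and poll volume moves. They are not ONTAP APIs, and the code cannot tell you what a safe concurrency cap is for your system.

```python
from collections import deque
from typing import Callable, Iterable, List


def run_moves_in_batches(
    volumes: Iterable[str],
    max_concurrent: int,
    start_move: Callable[[str], None],
    is_finished: Callable[[str], bool],
    wait: Callable[[], None],
) -> List[str]:
    """Keep at most max_concurrent moves running; start another only when one finishes.

    start_move, is_finished and wait are hypothetical caller-supplied hooks,
    not ONTAP APIs. Returns the volumes in the order their moves completed.
    """
    if max_concurrent < 1:
        raise ValueError("max_concurrent must be at least 1")

    pending = deque(volumes)
    running: List[str] = []
    finished: List[str] = []

    while pending or running:
        # Fill free slots, never exceeding the concurrency cap.
        while pending and len(running) < max_concurrent:
            vol = pending.popleft()
            start_move(vol)
            running.append(vol)

        # Collect the moves that have completed since the last check.
        done = [v for v in running if is_finished(v)]
        for v in done:
            running.remove(v)
            finished.append(v)

        # Nothing finished yet: pause before polling again.
        if not done:
            wait()

    return finished
```

In practice `wait` would probably just sleep for a polling interval, and `max_concurrent` would be kept low (the cap is the knob to turn down if low-memory warnings appear).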

NetApp_Journeyman

I'd usually move 4 volumes without throttling during weekdays and a few more (8-10) on the weekend.

I've been doing this for about 2 months now, so I'm wondering why this wasn't an issue before. Will going easier on the system resolve the issue?

paul_stejskal

Are you on 9.6, but not on the most recent patch release (for example GA, P1, or P2)? We've seen some similar issues. If you haven't opened a case, please do so as a P1.

NetApp_Journeyman

We're currently on 9.3P3.

We were in the middle of upgrading our systems before the pandemic hit. We plan on upgrading to 9.5P10 when we get the chance.

And yes, I've created a case with NetApp.

paul_stejskal

Very good. Case number? Feel free to PM me if you want.

NetApp_Journeyman

No problem, the case number is #2008299487.

NetApp_Journeyman

I'd like to ask a question: what happens if I run out of WAFL memory?

Would it cause the move jobs to fail, or would the system be affected directly? And is it a known bug for ONTAP 9.3P3? I can't seem to find it anywhere in the documentation.

paul_stejskal

Yes. See https://mysupport.netapp.com/NOW/cgi-bin/relcmp.on?notfirst=Go%21&rels=9.3P3%2C9.3P17&what=fix and use Ctrl+F to search for "memory" for examples.

To confirm, Support's Core Analysis team would need a core file from a panic. If you run out of memory, the system will panic.

It might be helpful to get some performance data if possible. 9.3 may have some of the counters, so a perf archive covering the time of the issue would be good to review.

NetApp_Journeyman

Two more error messages have cropped up over the weekend as I migrated more volumes.

The first was wafl.readdir.expired and the second was wafl.cp.toolong. The first error message is a bug, according to NetApp. The corrective action seems to come down to restructuring the directory, which simply isn't feasible. I assume upgrading will resolve this issue.

The second error message is directly related to my migration attempts as well, and there doesn't appear to be anything I can do to address that one. WAFL seems to be the consistent culprit.

I still have a case open with NetApp, but I'm hoping that someone here can shed light on the issue.
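
To check whether one subsystem really is the common factor across a batch of messages, the event names can be tallied by their first dotted component. The sketch below assumes each line is either a bare event name or has the shape `event.name: message text`, as in the message quoted at the top of this thread. Other log layouts would need a different parser.

```python
from collections import Counter
from typing import Iterable


def event_name(line: str) -> str:
    """Return the event name: the text before the first colon, if there is one."""
    return line.split(":", 1)[0].strip()


def count_by_subsystem(lines: Iterable[str]) -> Counter:
    """Count events by the first dotted component of their name (e.g. 'wafl')."""
    counts: Counter = Counter()
    for line in lines:
        name = event_name(line)
        if name:  # skip blank lines
            counts[name.split(".", 1)[0]] += 1
    return counts


# The three events mentioned in this thread; the first is quoted as posted.
events = [
    "wafl.memory.statusLowMemory: WAFL is running low on memory, with 8XXMB remaining.",
    "wafl.readdir.expired",
    "wafl.cp.toolong",
]
print(count_by_subsystem(events))
```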

loganb

At a minimum, you should be on 9.3P7 to avoid this bug. It affects various processes.

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1118890

In some cases, entire user space memory pages become corrupted and cause memory failures.
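
A quick way to check a release string against that minimum is sketched below. It assumes release strings shaped like the ones in this thread (`9.3P3`, `9.3P7`, `9.5P10`, with a bare `9.6` treated as GA, i.e. patch 0), and it compares patch numbers numerically so that P17 sorts after P7. The function names are made up for illustration. It only orders release strings: whether a different release family actually contains the fix still has to be confirmed from the bug report itself.

```python
import re
from typing import Tuple

# major.minor, optional .maintenance, optional P<patch>; "p" and "P" both accepted.
_RELEASE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:P(\d+))?$", re.IGNORECASE)


def parse_release(release: str) -> Tuple[int, int, int, int]:
    """Turn a string like '9.3P7' into a comparable tuple (9, 3, 0, 7)."""
    match = _RELEASE.match(release.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {release!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance or 0), int(patch or 0))


def meets_minimum(current: str, minimum: str) -> bool:
    """True if current is at or above minimum when compared numerically."""
    return parse_release(current) >= parse_release(minimum)


print(meets_minimum("9.3P3", "9.3P7"))   # the release reported in this thread
print(meets_minimum("9.3P17", "9.3P7"))  # numeric, not string, comparison
```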

NetApp_Journeyman

Yikes, that sounds terrifying.

Thanks for alerting me; unfortunately, updating won't be feasible due to the pandemic. How risky is it for me to remain on 9.3P3?

paul_stejskal

The wafl.readdir.expired message can still occur on newer versions. 9.5 has some readdir optimizations.
