ONTAP Hardware

WAFL is running low on memory, likely due to Volume Migration (URGENT)

NetApp_Journeyman
9,742 Views

Hi

 

We're in the process of migrating volumes to a different aggregate. Things were going smoothly once I started relying on unthrottled moves, but I was hit with the following error message yesterday:

 

wafl.memory.statusLowMemory: WAFL is running low on memory, with 8XXMB remaining.

 

I did some preliminary checking, and I assume this is due to the volume migration putting a heavy load on the PAM cards. Additionally, the cluster LIFs were not on their home ports.

 

How serious is this issue, and could it lead to devastating consequences? Please advise.

1 ACCEPTED SOLUTION

loganb
9,490 Views

At a minimum, you should be on 9.3P7 to avoid this bug. It affects various processes.

 

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1118890

 

In some cases, entire user space memory pages become corrupted and cause memory failures.


15 REPLIES

Drew_C
9,739 Views

If this is truly urgent, please open a case with our support team.

Community Manager \\ NetApp

NetApp_Journeyman
9,731 Views

I fully intend to.

 

However, if anyone knows the cause, I would appreciate an answer nonetheless.

SpindleNinja
9,717 Views

I second what Drew said. 

 

 

Vol moves shouldn't affect the system like that; they are background processes. This could be a bug, though.

 

TMACMD
9,661 Views

How many volumes are moving without throttling?

You can only do a few at a time unless you want to run out of memory.

 

Try doing fewer moves. Wait until some finish. Do more.
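
As a rough sketch of that "fewer at a time" approach (not NetApp tooling): the loop below keeps at most `max_concurrent` moves in flight and only starts another once one has finished. The `start_move` and `is_finished` callbacks are hypothetical placeholders for however you start and poll a vol move in your environment, and the default cap of 4 is only an illustration, not a recommended limit.

```python
import time
from collections import deque
from typing import Callable, Iterable, List


def run_moves_in_batches(
    volumes: Iterable[str],
    start_move: Callable[[str], None],   # hypothetical: kicks off one vol move
    is_finished: Callable[[str], bool],  # hypothetical: polls one vol move
    max_concurrent: int = 4,             # illustrative cap, not a NetApp limit
    poll_seconds: float = 0.0,
) -> List[str]:
    """Start moves in order, never exceeding max_concurrent in flight."""
    pending = deque(volumes)
    active: List[str] = []
    completed: List[str] = []

    while pending or active:
        # Top up to the cap with the next pending volumes.
        while pending and len(active) < max_concurrent:
            vol = pending.popleft()
            start_move(vol)
            active.append(vol)

        # Poll every in-flight move once and retire the finished ones.
        still_active: List[str] = []
        for vol in active:
            if is_finished(vol):
                completed.append(vol)
            else:
                still_active.append(vol)
        active = still_active

        # Wait between polls only while something is still running.
        if active and poll_seconds > 0:
            time.sleep(poll_seconds)

    return completed
```

In practice you would set `poll_seconds` to something like a few minutes so the loop does not hammer the cluster with status checks.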

NetApp_Journeyman
9,652 Views

I'd usually move 4 volumes without throttling during weekdays and a bit more (8-10) on the weekend.

 

I've been doing this for about 2 months now, so I'm wondering why this wasn't an issue before. Will going easier on the system resolve the issue?

paul_stejskal
9,643 Views

Are you on 9.6 but not on the most recent patch release (for example GA, P1, or P2)? We've seen some similar issues. If you haven't opened a case, please do so as a P1.

NetApp_Journeyman
9,638 Views

We're on 9.3P3 currently.

 

We were in the middle of upgrading our systems before the pandemic hit. We plan on upgrading to 9.5P10 when we get the chance.

 

And yes, I've created a case with NetApp.

paul_stejskal
9,636 Views

Very good. Case number? Feel free to PM me if you want.

NetApp_Journeyman
9,634 Views

No problem, the case number is #2008299487

 

 

NetApp_Journeyman
9,589 Views

I'd like to ask a question: what happens if I run out of WAFL memory?

 

Would it cause the move jobs to fail, or would the system itself be affected? And is it a known bug for ONTAP 9.3P3? I can't seem to find it in the documentation anywhere.

paul_stejskal
9,587 Views

Yes. See https://mysupport.netapp.com/NOW/cgi-bin/relcmp.on?notfirst=Go%21&rels=9.3P3%2C9.3P17&what=fix and press Ctrl+F to search for "memory" for examples.
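
If you want to compare a different pair of releases, the query string in that link can be rebuilt as sketched below. This is only an illustration: the parameter names (`notfirst`, `rels`, `what`) and the base address are copied from the link as posted, their meaning is inferred from the link alone, and `build_relcmp_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base address copied from the link as posted; assumed unchanged.
RELCMP_BASE = "https://mysupport.netapp.com/NOW/cgi-bin/relcmp.on"


def build_relcmp_url(from_release: str, to_release: str) -> str:
    """Build a release-comparison link for two ONTAP releases."""
    params = {
        "notfirst": "Go!",                         # taken verbatim from the posted link
        "rels": f"{from_release},{to_release}",    # comma-separated release pair
        "what": "fix",                             # taken verbatim from the posted link
    }
    # urlencode percent-encodes "!" as %21 and "," as %2C, as in the posted link.
    return f"{RELCMP_BASE}?{urlencode(params)}"


print(build_relcmp_url("9.3P3", "9.3P17"))
```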

 

To confirm, Support's Core Analysis team would need a core file from a panic. If the system runs out of memory, it will panic.

 

It might be helpful to get some performance data if possible. 9.3 may have some of the counters, so a perf archive covering the time of the issue would be good to review.

NetApp_Journeyman
9,494 Views

Two more error messages have cropped up over the weekend as I migrated more volumes.

 

The first was wafl.readdir.expired and the second was wafl.cp.toolong. The first error message is a bug, according to NetApp. Corrective action seems to come down to restructuring the directory, which simply isn't feasible. I assume upgrading will resolve this issue.

 

The second error message is directly related to my migration attempts as well, and there doesn't appear to be anything I can do to address that one. WAFL seems to be the consistent culprit.

 

I still have a case open with NetApp, but I'm hoping that someone here can shed light on the issue.

loganb
9,491 Views

At a minimum, you should be on 9.3P7 to avoid this bug. It affects various processes.

 

https://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=1118890

 

In some cases, entire user space memory pages become corrupted and cause memory failures.
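
For anyone checking several systems against that minimum, here is a small sketch of the comparison. It assumes release strings in the forms seen in this thread ("9.3P3", "9.5P10", "9.6"); anything else, such as RC or debug builds, is rejected rather than guessed at. The function names are hypothetical.

```python
import re
from typing import Tuple

# Matches e.g. "9.3P3", "9.3p7", "9.5P10", "9.6" (a missing patch means P0).
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)(?:\.(\d+))?(?:P(\d+))?$", re.IGNORECASE)


def parse_ontap_version(text: str) -> Tuple[int, int, int, int]:
    """Return (major, minor, maintenance, patch) as integers."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized release string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return (int(major), int(minor), int(maintenance or 0), int(patch or 0))


def meets_minimum(current: str, minimum: str) -> bool:
    """True if `current` is at or above `minimum`, compared numerically."""
    return parse_ontap_version(current) >= parse_ontap_version(minimum)


print(meets_minimum("9.3P3", "9.3P7"))   # the release discussed in this thread
```

Comparing integer tuples rather than raw strings matters here: as plain text, "9.3P17" would sort before "9.3P7".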

NetApp_Journeyman
9,487 Views

Yikes, that sounds terrifying

 

Thanks for alerting me; unfortunately, updating won't be feasible due to the pandemic. How risky is it for me to remain on 9.3P3?

paul_stejskal
9,484 Views

The wafl.readdir.expired message can still occur on newer versions. 9.5 has some readdir optimizations.
