Data Ontap 8.1 upgrade - RLW_Upgrading process and other issues

davidrnexon · ‎2012-06-24

Hi,

We recently upgraded Data ONTAP from version 8.0.2 to 8.1. We read through the release notes, Upgrade Advisor and the upgrade notes, then proceeded with the upgrade, which went quite smoothly, BUT...

Nowhere in the release notes or the 8.1 documentation does it mention that, after the upgrade, a background process runs that can dramatically degrade the performance of your filer. If anyone from NetApp reads this, can you please ask to have this caveat added to the release notes and Upgrade Advisor?

Right after the upgrade, a background process begins that is called rlw_upgrading. RLW is short for RAID Protection Against Lost Writes. It is new functionality added in Data ONTAP 8.1.

To see this process you need to be in priv set diag and then run aggr status <aggr_name> -v.

The issue is that while this process is running, if your dedupe jobs kick in, the CPU skyrockets to 99% and filer latency goes through the roof. The only way to keep the filer usable is to either disable all dedupe or set all dedupe schedules to manual.

The problem is, this background process has been running for the last 3 weeks on one filer, and the last 2 weeks on another filer.

I have a case open with NetApp at the moment, but was wondering if anyone else has experience with this, or any recommendations or commands we could use to see how long this process has left to complete, because no one seems to know much about this process or function.

For the last 2-3 weeks we have not been able to run any deduplication without severely impacting filer latency.

scottgelb · ‎2012-06-24

From the support site I found this forum link. Can you view it? It is a post noting that the rlw process isn't in the release notes; the user describes a process still running after a long time and is waiting for a support answer. Lost write protection was an option before 8.1 and I don't know what this new protection enhances, but I am interested to hear what support comes back with. Did you get a response?

https://forums.netapp.com/thread/30084

davidrnexon · ‎2012-06-24

Hey Scott, thanks for the reply. Yeah, I could open the forum link and it's exactly the same problem. If I type vol scrub status -v in the CLI I can see the following:

vol scrub: status of /aggr3/plex0/rg0 :
Current scrub is 2% complete (suspended).
First full scrub not yet completed.

vol scrub: status of /aggr0/plex0/rg0 :
Scrub is not active.
Last full scrub completed: Sun Jun 17 06:36:30 EST 2012

vol scrub: status of /aggr1/plex0/rg0 :
Current scrub is 25% complete (suspended).
Last full scrub completed: Sun Jun 10 05:15:30 EST 2012

vol scrub: status of /aggr1/plex0/rg1 :
Current scrub is 26% complete (suspended).
Last full scrub completed: Sun May 20 06:32:16 EST 2012

vol scrub: status of /aggr2/plex0/rg0 :
Current scrub is 21% complete (suspended).
Last full scrub completed: Sun May 6 02:22:28 EST 2012

The only aggregate that no longer has the rlw_upgrading status is aggr0. As you can see from the status above, it is also not in the middle of any scrub. Could this be the reason why the other aggregates are reporting rlw_upgrading: they have not completed a full scrub since the upgrade to Data ONTAP 8.1?
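For anyone who wants to check this across several filers, here is a rough Python sketch that reads scrub status text in the format pasted above and lists the aggregates that still have a suspended scrub. The function names are made up for illustration, the parsing only assumes the exact line layout shown in this post, and how you capture the text from the filer (ssh, rsh, copy and paste) is left out.

```python
import re

# Sample text copied from the scrub status output earlier in this post.
SAMPLE = """\
vol scrub: status of /aggr3/plex0/rg0 :
Current scrub is 2% complete (suspended).
First full scrub not yet completed.

vol scrub: status of /aggr0/plex0/rg0 :
Scrub is not active.
Last full scrub completed: Sun Jun 17 06:36:30 EST 2012

vol scrub: status of /aggr1/plex0/rg0 :
Current scrub is 25% complete (suspended).
Last full scrub completed: Sun Jun 10 05:15:30 EST 2012

vol scrub: status of /aggr1/plex0/rg1 :
Current scrub is 26% complete (suspended).
Last full scrub completed: Sun May 20 06:32:16 EST 2012

vol scrub: status of /aggr2/plex0/rg0 :
Current scrub is 21% complete (suspended).
Last full scrub completed: Sun May 6 02:22:28 EST 2012
"""


def parse_scrub_status(text):
    """Return one dict per RAID group found in the status text (hypothetical helper)."""
    groups = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"vol scrub: status of (\S+) :", line)
        if header:
            current = {
                "raid_group": header.group(1),
                "percent": None,       # stays None when no scrub is in progress
                "suspended": False,
                "last_full": None,     # stays None if no full scrub is reported
            }
            groups.append(current)
            continue
        if current is None:
            continue
        progress = re.match(r"Current scrub is (\d+)% complete(?: \((\w+)\))?\.", line)
        if progress:
            current["percent"] = int(progress.group(1))
            current["suspended"] = progress.group(2) == "suspended"
            continue
        last = re.match(r"Last full scrub completed: (.+)", line)
        if last:
            current["last_full"] = last.group(1)
    return groups


def suspended_aggregates(groups):
    """Aggregate names (first path component) that have a suspended scrub."""
    names = []
    for group in groups:
        if group["suspended"]:
            aggr = group["raid_group"].strip("/").split("/")[0]
            if aggr not in names:
                names.append(aggr)
    return names


if __name__ == "__main__":
    for name in suspended_aggregates(parse_scrub_status(SAMPLE)):
        print(name)
```

On the sample above this picks out every aggregate except aggr0, which matches the one aggregate that has already left the rlw_upgrading state.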

davidrnexon · ‎2012-06-24

I'm going to test this tonight on one 2040 filer that we have. I'm going to resume the aggr scrub and see what the status reports back once it's finished. It's at 94% now.

For anyone interested, the CLI command to resume the scrub on an aggregate is aggr scrub resume <aggr_name>, followed by aggr scrub status -v to check the status. Fingers crossed this is it. Will write back once it's finished the scrub.

davidrnexon · ‎2012-06-24

After resuming the scrub last night on the aggregate, it is now complete and the aggregate no longer shows rlw_upgrading. It is now showing rlw_on, which means the upgrade has completed. I will try on one more aggregate on a different filer just to make sure this is the resolution. I'll post back my results.

aborzenkov · ‎2012-06-24

Could someone explain how lost write protection works or point to any materials?

davidrnexon · ‎2012-06-24

This is what I got back from NetApp:

"What it does is add additional data into the parity block that is stored when a write is completed to protect against uncommon disk malfunctions where the system is not able to detect that data wasn’t actually written to disk"

But you're right, I can't find any information on rlw in the 8.1 documentation. Luckily Scott pointed out the previous forum post above.

scottgelb · ‎2012-06-24

We had lost write protection before but haven't seen this new version or enhancement of it documented yet. Hopefully someone posts more info on it.

mgiard214 · ‎2012-06-29

We have the EXACT same problem. We upgraded from 7.3.6 to 8.1P1 and upgraded our SATA aggregate to 64-bit. When dedup started, CPU hit 100%. We have set dedup to manual, and the RLW_upgrading process has now been running for 8 days. I ran aggr scrub status -v and the output shows 2% complete (suspended) and 3% complete (suspended) for the two RAID groups in the aggregate.

While dedup was running, I ran statit -b, waited 60 seconds, and ran statit -e; the xfers column showed 115 IOPS on the SATA disks, which are rated for about 40 IOPS. When I turned off dedup, it went down to 1. I've been waiting for the RLW process to finish so I can run the same commands with only dedup running, to see if there is an issue with dedup on 8.1P1 (it ran fine with 7.3.6).

To follow up on the theory about the aggr scrub, I opened a case with NetApp and specifically asked if the aggr scrub process had anything to do with the RLW_upgrading process, and they said:

"I just wanted to follow-up on our conversation. The percentages listed in the aggr status -v have no relation to the rlw_upgrading process."

So at this point, I am just waiting for the RLW process to complete, and I hope you get more information from NetApp and from your tests.
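To put the statit numbers quoted above in proportion, here is a quick back-of-the-envelope calculation in Python. The helper name is made up, and the 40 IOPS rating is the approximate figure from the post, not a measured value.

```python
def load_ratio(observed_xfers, rated_iops):
    """How many times the rated IOPS the disks were being asked to do."""
    return observed_xfers / rated_iops


# Figures from the post: 115 xfers with dedup running, 1 with dedup off,
# against SATA disks rated for roughly 40 IOPS (approximate).
with_dedup = load_ratio(115, 40)
without_dedup = load_ratio(1, 40)

print(f"dedup on:  {with_dedup:.1%} of rated IOPS")
print(f"dedup off: {without_dedup:.1%} of rated IOPS")
```

So with dedup running, the disks were being driven at close to three times what they are rated for, which lines up with the latency we saw.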

davidrnexon · ‎2012-06-29

Hi, actually the scrub does have a relation with RLW.

I wouldn't wait for the scrub to finish on its own. It will probably take months, because it only runs in the background when the filer has low utilization.

On ours, I found a low utilization period and resumed the scrub manually with "aggr scrub resume <aggr name>". Once the scrub completed, RLW_Upgrading changed to RLW_ON.

With all these bugs, I'm wondering if ONTAP 8.1 was released a bit too early?

mgiard214 · ‎2012-06-29

I forgot to add that upgrading to 8.1P1 caused a filer shutdown because of a bug that has no fix. http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=611922

scottgelb · ‎2012-06-29

The silent corruption of any FlexVol isn't encouraging. Then there's the question of how to determine whether it occurred after a core dump...

davidrnexon · ‎2012-06-29

If you have one 32-bit aggregate and another 64-bit one, and your root vol sits on the 32-bit aggr, an option is to use snapmirror with 8.1 to mirror the root vol to the 64-bit aggr, set the new volume as root, and either reboot or failover and failback the filer to put it in place.

scottgelb · ‎2012-06-29

That gets around the aggr upgrade, and the quick outage or NDU seems worth it. Maybe the best way is to convert flexvols by mirroring to 64-bit until the BURT on aggr convert has a fix... but I wonder if there are similar BURTs for 32 to 64 mirror conversion.

aborzenkov · ‎2012-06-29

32 to 64 does initiate conversion after snapmirror break. Of course, it is "32 bit volume on 64 bit aggregate" so conditions may not be exactly the same ...

scottgelb · ‎2012-06-29

True. Not for this Burt but hopefully not related yet. The corruption Burts really scare everyone. Me too.

dimitrik · ‎2013-04-10

- bug 611922 has been fixed since 8.1P3 per the link you sent...

rpaivacesce · ‎2012-06-29

Apparently this is a bug that is fixed in 8.1.1RC1.

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193

mgiard214 · ‎2012-06-29

David, great job testing the RLW_upgrading issue by manually running the aggregate scrub. You were 100% correct. I'm going to follow the same manual scrub procedure as you, then re-enable my dedup once it completes and see what happens. I heard back from my NetApp case owner as follows:

You can trigger a manual scrub by running the following command:

aggr scrub start aggrname

…but I wouldn’t recommend you do this during production hours as this has the potential to impact the performance of your system.

The scrub % is sort of related but it’s a different process that needs to be marked as complete before the rlw moves to the _on state.

davidrnexon · ‎2012-06-29

When you type in aggr scrub status <aggr name> -v, if you see the aggr or raid groups as (suspended) and they have already done a few percent, then use the command aggr scrub resume <aggr name> so the scrub resumes from the point where it was suspended. It should take less time.

I think if you type in aggr scrub start aggrname, it will start the scrub process from the beginning again. That is not a bad thing, but depending on the size of your aggregates and the utilization of the filer, it could take quite a few hours. The good thing is, if you are seeing some impact from the command you can pause it and resume it later.
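The resume-versus-start rule above can be written down as a tiny Python helper. This is only a sketch of the reasoning in this post: the function name is made up, it just builds the command string for you to run yourself, and the assumption that start restarts from zero is my understanding rather than something confirmed by NetApp.

```python
def scrub_command(aggr_name, percent_complete, suspended):
    """Pick the scrub command for an aggregate based on its reported status.

    percent_complete is None when the status says "Scrub is not active".
    Returns None when a scrub already appears to be running.
    """
    if percent_complete is not None and not suspended:
        # A scrub is in progress and not suspended: nothing to issue.
        return None
    if suspended and percent_complete is not None and percent_complete > 0:
        # Partly done and suspended: resume so the work already done is kept.
        return f"aggr scrub resume {aggr_name}"
    # No scrub in progress: start one (assumed to begin from the start).
    return f"aggr scrub start {aggr_name}"


if __name__ == "__main__":
    print(scrub_command("aggr1", 25, True))
    print(scrub_command("aggr0", None, False))
```

Either way, pick a low utilization window before running the command it suggests.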

mgiard214 · ‎2012-07-02

good advice, I'll do that.