2012-06-24 03:06 AM
We recently upgraded Data ONTAP from version 8.0.2 to 8.1. We read through the release notes, upgrade advisor and the upgrade notes, then proceeded with the upgrade, which was quite smooth, BUT...
Nowhere in the release notes or 8.1 documentation does it mention that, after the upgrade, a background process runs that can dramatically degrade the performance of your filer. If anyone from Netapp reads this, can you please ask for this caveat to be added to the release notes and upgrade advisor?
Right after the upgrade, a background process called rlw_upgrading begins. RLW is short for RAID Protection Against Lost Writes. It is new functionality added in Data ONTAP 8.1.
To see this process you need to be in priv set diag and then run aggr status <aggr_name> -v.
The issue is that while this process is running and your dedupe jobs kick in, the CPU will skyrocket to 99% and filer latency goes through the roof. The only way to keep the filer running acceptably is to either disable all dedupe or set all dedupe schedules to manual.
The problem is, this background process has been running for the last 3 weeks on one filer, and the last 2 weeks on another filer.
I have a case open with Netapp at the moment, but was wondering if anyone else has experience with this, or any recommendations or commands for us to see how long this process has left to complete, because no one seems to know much about this process or function?
For the last 2-3 weeks we have not been able to run any deduplication without severely impacting filer latency.
2012-06-24 04:33 AM
From the support site I found this forum link. Can you view it? It is a post pointing out that the rlw process isn't in the release notes; the user describes the process still running after a long time and is waiting for an answer from support. Lost write protection was an option before 8.1, and I don't know what this new protection enhances, but I am interested to hear what support comes back with. Did you get a response?
2012-06-24 04:52 AM
Hey Scott, thanks for the reply. Yeah, I could open the forum link and it's exactly the same problem. If I type vol scrub status -v in the CLI I can see the following:
vol scrub: status of /aggr3/plex0/rg0 :
Current scrub is 2% complete (suspended).
First full scrub not yet completed.
vol scrub: status of /aggr0/plex0/rg0 :
Scrub is not active.
Last full scrub completed: Sun Jun 17 06:36:30 EST 2012
vol scrub: status of /aggr1/plex0/rg0 :
Current scrub is 25% complete (suspended).
Last full scrub completed: Sun Jun 10 05:15:30 EST 2012
vol scrub: status of /aggr1/plex0/rg1 :
Current scrub is 26% complete (suspended).
Last full scrub completed: Sun May 20 06:32:16 EST 2012
vol scrub: status of /aggr2/plex0/rg0 :
Current scrub is 21% complete (suspended).
Last full scrub completed: Sun May 6 02:22:28 EST 2012
The only aggregate that no longer has the rlw_upgrading status is aggr0. As you can see from the status above, it is also not in the middle of any scrub. Could this be the reason why the other aggregates are reporting rlw_upgrading: because they have not completed a full scrub since the upgrade to Data ONTAP 8.1?
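For anyone who wants to check this across a lot of aggregates, here is a rough Python sketch that reads scrub status text in the shape pasted above and lists the aggregates that still have a scrub partway through. The function names are my own, the sample is just two entries copied from the output above, and treating a progress line without a "(suspended)" tag as "in progress" is an assumption. It only tests the theory in this post; it does not prove that the scrub and rlw_upgrading are linked.

```python
import re
from collections import defaultdict

# Header and progress lines as they appear in the output pasted in this thread.
HEADER = re.compile(
    r"^vol scrub: status of /(?P<aggr>[^/]+)/(?P<plex>[^/]+)/(?P<rg>[^/\s]+)\s*:"
)
PROGRESS = re.compile(r"Current scrub is (\d+)% complete(?: \((\w+)\))?")


def parse_scrub_status(text):
    """Return {aggregate: [raid-group dict, ...]} parsed from scrub status text."""
    groups = defaultdict(list)
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = {
                "rg": header.group("rg"),
                "percent": None,    # stays None when no scrub is partway through
                "state": "idle",
                "last_full": None,  # stays None when no full scrub is reported
            }
            groups[header.group("aggr")].append(current)
            continue
        if current is None:
            continue
        progress = PROGRESS.search(line)
        if progress:
            current["percent"] = int(progress.group(1))
            # Assumption: no parenthesised tag means the scrub is running.
            current["state"] = progress.group(2) or "in progress"
        elif line.startswith("Last full scrub completed:"):
            current["last_full"] = line.split(":", 1)[1].strip()
    return dict(groups)


def aggregates_with_unfinished_scrub(groups):
    """Names of aggregates where at least one RAID group has a partial scrub."""
    return sorted(
        aggr
        for aggr, rgs in groups.items()
        if any(rg["percent"] is not None for rg in rgs)
    )


SAMPLE = """\
vol scrub: status of /aggr3/plex0/rg0 :
Current scrub is 2% complete (suspended).
First full scrub not yet completed.
vol scrub: status of /aggr0/plex0/rg0 :
Scrub is not active.
Last full scrub completed: Sun Jun 17 06:36:30 EST 2012
"""

if __name__ == "__main__":
    print(aggregates_with_unfinished_scrub(parse_scrub_status(SAMPLE)))
```

On the two sample entries this prints ['aggr3'], which lines up with aggr0 being the only aggregate here that is not mid-scrub.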
2012-06-24 05:17 AM
I'm going to test this tonight on one 2040 filer that we have. I'm going to resume the aggr scrub and see what the status reports back once it's finished. It's at 94% now.
For anyone interested, the CLI command to resume the scrub on an aggregate is aggr scrub resume <aggr_name>, followed by aggr scrub status -v to check the status. Fingers crossed this is it. Will write back once it's finished the scrub.
2012-06-24 03:37 PM
After resuming the scrub last night on the aggregate, it is now complete and the aggregate does not show rlw_upgrading. It is now showing rlw_on, which means the upgrade has completed. I will try on one more aggregate on a different filer just to make sure this is the resolution. I'll post back my results.
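If you capture the aggr status <aggr_name> -v text to a file while waiting, a small Python sketch like this can tell you whether an aggregate has flipped from rlw_upgrading to rlw_on. It makes no assumption about the layout of that output and only searches for the two rlw_ tokens mentioned in this thread; the function names are mine, and the strings in the example are placeholders, not real filer output.

```python
import re

# Any token of the form rlw_<letters>; only rlw_upgrading and rlw_on
# are known from this thread.
RLW_TOKEN = re.compile(r"\brlw_[a-z]+\b")


def rlw_states(status_text):
    """Distinct rlw_* tokens found in the text, in order of first appearance."""
    seen = []
    for token in RLW_TOKEN.findall(status_text):
        if token not in seen:
            seen.append(token)
    return seen


def rlw_upgrade_finished(status_text):
    """True when rlw_on is present and rlw_upgrading is not."""
    states = rlw_states(status_text)
    return "rlw_on" in states and "rlw_upgrading" not in states


if __name__ == "__main__":
    # Placeholder strings for illustration only, not actual command output.
    print(rlw_upgrade_finished("placeholder fields, rlw_upgrading"))
    print(rlw_upgrade_finished("placeholder fields, rlw_on"))
```

Run it against the saved text for each aggregate; anything still reporting rlw_upgrading comes back as not finished.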
2012-06-24 03:38 PM
This is what I got back from Netapp:
"What it does is add additional data into the parity block that is stored when a write is completed to protect against uncommon disk malfunctions where the system is not able to detect that data wasn’t actually written to disk"
But you're right, I can't find any information on rlw in the 8.1 documentation. Luckily Scott pointed out the previous forum post above.
2012-06-24 03:47 PM
We had lost write protection before but haven't seen this new version or enhancement of it documented yet. Hopefully someone posts more info on it.
2012-06-29 06:58 AM
We have the EXACT same problem. We upgraded from 7.3.6 to 8.1P1. We upgraded our SATA aggregate to 64 bit. When dedup started, CPU hit 100%. We have set dedup to manual and the RLW_upgrading process has been running now for 8 days. I ran aggr scrub status -v and the output shows 2% complete (suspended) and 3% complete (suspended) for the two RAID groups in the aggregate.
When dedup was running, I ran statit -b, waited 60 seconds, and ran statit -e, and the xfers column shows the IOPS on the SATA disks at 115, when they are rated for about 40 IOPS. When I turned off dedup, it went down to 1. I've been waiting for the RLW process to finish so I can run the command when just dedup is running, to see if there is an issue with dedup on 8.1P1 (it ran fine with 7.3.6).
To follow up on the theory about the aggr scrub, I opened a case with Netapp and specifically asked if the aggr scrub process had anything to do with the RLW_upgrading process, and they said:
"I just wanted to follow-up on our conversation. The percentages listed in the aggr status -v have no relation to the rlw_upgrading process."
So at this point, I am just waiting for the RLW process to complete and hope you get more information from Netapp and from your tests.
2012-06-29 07:04 AM
I forgot to add that upgrading to 8.1P1 caused a filer shutdown because of a bug that has no fix. http://support.netapp.com/NOW/cgi-bin/bol?Type=Det