VMware Solutions Discussions

Data ONTAP 8.1 upgrade - rlw_upgrading process and other issues

davidrnexon
29,739 Views

Hi,

We recently upgraded Data ONTAP from version 8.0.2 to 8.1. We read through the release notes, Upgrade Advisor, and the upgrade notes, then proceeded with the upgrade, which was quite smooth BUT..

Nowhere in the release notes or 8.1 documentation does it mention that after the upgrade there is a background process that runs which can dramatically degrade the performance of your filer. If anyone from NetApp reads this, can you please ask for this caveat to be added to the release notes and Upgrade Advisor.

Right after upgrading, a background process called rlw_upgrading begins. RLW is short for RAID Protection Against Lost Writes. It is new functionality added in Data ONTAP 8.1.

To see this process you need to be in priv set diag and then run aggr status <aggr_name> -v
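For reference, the command sequence looks roughly like this (Data ONTAP 7-Mode syntax as used in this thread; this is a sketch, and the exact status output varies by release):

```
filer> priv set diag
filer*> aggr status aggr0 -v
           Aggr State      Status            Options
          aggr0 online     raid_dp, aggr     ...
                           rlw_upgrading
filer*> priv set admin
```

While the scanner is running, rlw_upgrading appears among the aggregate's status flags; once the first full scrub completes you would expect to see rlw=on in the options instead.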

The issue is, while this process is running, if your dedupe jobs kick in, the CPU skyrockets to 99% and filer latency goes through the roof. The only way to keep the filer running acceptably is to either disable all dedupe or set all dedupe schedules to manual.

The problem is, this background process has been running for the last 3 weeks on one filer, and for the last 2 weeks on another.

I have a case open with NetApp at the moment, but I was wondering if anyone else has experience with this, or has any recommendations/commands for seeing how long this process has left to complete, because no one seems to know much about this process or function?

Because for the last 2-3 weeks we have not been able to run any deduplication without severely impacting filer latency.

107 REPLIES

davidrnexon
7,992 Views

Hi Craig, thanks for your update. Have you applied this workaround, and did it solve the issue?

We are running 8.1.1RC1, and the BURT says the issue has been resolved in 8.1.1RC1 and 8.1.1GA.

craigbeckman
7,992 Views

I am implementing it later today.

Found this thread where others have successfully made the change:

https://communities.netapp.com/message/81479

craigbeckman
7,992 Views

I have applied the fix for BURT 526941 (ACP module bug), and IOPS to aggr0 have dropped significantly.

CPU is still very high at this stage.

craigbeckman
7,994 Views

My CPU issues have now also been identified as Bug 590193: WAFL background filesystem scanner may cause high CPU usage (deswizzling)

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=590193&app=portal

The fix is in DOT 8.1.1.

davidrnexon
7,942 Views

Hi Craig, thanks for posting the update. This is definitely a long process to resolve. We are currently stepping through each VM and making sure it is aligned, shortening our RAID group sizes by creating new aggregates, migrating VMs and LUNs, destroying the existing aggregates, and re-adding the disks to aggregates with the correctly sized RAID groups.

We are currently down to 2 aggregates: one aggregate fully optimized, with the correct RAID group size and every VM aligned correctly; the other not at the optimal RAID group size, with a few VMs still needing to be aligned.

This has taken about 3 weeks to get to this level, and we still experience high CPU. We are getting quite good throughput and lower latencies now, but CPU still jumps high, especially at night when all the backups kick in. We are also still not running any dedupe jobs.

When are you scheduled to upgrade to 8.1.1GA? I'm eager to know whether the GA release fixes your issue.

radek_kubka
7,942 Views

we are currently [...] shortening our raid group sizes

Hmm, it's a long thread / story with many interesting twists already.

Are you saying that RAID groups being too big had anything to do with the problems? Or rather the fact that RAID groups within the same aggregate had different sizes?

Regards,

Radek

davidrnexon
7,700 Views

Yeah, it's definitely a massive thread, but there is very valuable information from everyone. There are optimal RAID group sizes depending on your system and the disks you run in it; I have them for the 3240 and 2040 if you need them, covering SAS and SATA disks. If the RAID group size is too big, the system has to work harder to read through all the disks. For example, we had a RAID group size of 24; we now have a RAID group size of 16 (at this moment we actually have an aggregate with 2 x 16-disk RAID groups, and once I migrate the last machines off a previous aggregate we'll end up with 3 x 16-disk RAID groups). We then run a reallocate job on each volume to spread the data across all RGs.

I don't think NetApp recommends having an aggregate with RAID groups of different sizes; even though it can be done, it's not optimal.
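The per-volume reallocate step mentioned above looks roughly like this (7-Mode syntax; the volume name is just an example, and this is a sketch rather than a tested procedure):

```
filer> reallocate start -f /vol/vm_datastore01     # force a one-time full reallocation
filer> reallocate status -v /vol/vm_datastore01    # check progress
```

On volumes with snapshots, the -p (physical) option is worth considering, since a plain reallocation can inflate snapshot space usage.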

radek_kubka
7,700 Views

RG sizes should be the same (or similar) within the same aggregate - but this isn't very well documented (e.g. nothing in the TR-3437 mentioned above in this thread).

With regard to what this size actually should be, the story is far less straightforward to me. The Storage Subsystem Technical FAQ (available via the Field Portal) gives some guidelines, but they are mostly driven by the formula "how to get to the maximum aggregate size whilst having RGs of the same/similar size".

There isn't much proof (if any) that RG size 16 is any better than RG size 20.

Hence, as I wrote a few posts earlier in this thread, the RG size story looks like a dark art to me.

davidrnexon
6,915 Views

It depends what you want your storage for. To maximize storage space you would go with a large RAID group size, but performance will definitely suffer; if you are after performance, you would go with a smaller RAID group size, with the idea of creating multiple RAID groups, though this comes at a price in usable space due to double parity. A RAID group size of 16, for example, makes the system work less hard, because it only needs to pass through 16 disks, as opposed to a RAID group size of 24. We use RAID group size 16 now because we run a 3240 with 600GB SAS, and we have seen improvement in latency and throughput after dropping from RG 24.

aborzenkov
6,915 Views

large raid group size [...] will definately suffer performance

Why? Could you explain the reasons in more detail?

it only needs to pass through 16 disks, as opposed to a raid group size of 24

Could you explain what you mean by "needs to pass through 16 disks"? Under which conditions does that happen? What operation triggers "passing through 16 disks"?

davidrnexon
6,915 Views

Because with RAID-DP, data is written in stripes across the RAID group. Writing a stripe across a 24-disk RG takes more time than writing the stripe across a 16-disk RAID group.

radek_kubka
6,915 Views

Sorry, but it doesn't stack up.

Firstly, partial stripes can be written (e.g. upon a timer). Secondly, you write more data with longer stripes (and to more spindles).

thomas_glodde
6,870 Views

It's not only about writing bigger stripes; it's about read-ahead for parity calculation. If you have an incomplete stripe on a highly utilized aggregate, you might need to read a few blocks from the disks to complete the stripe and calculate parity.

statit disk statistics:
cpreads/writes << 1 is good

statit RAID statistics:
partial stripes / full stripes << 1 is good
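To make the read-ahead argument concrete, here is a toy Python model of per-stripe disk operations. This is a back-of-envelope sketch, not NetApp's actual WAFL write allocator: it simply assumes a RAID-DP group has two parity disks, and that a partial-stripe write must first read the untouched data blocks in the stripe so both parities can be recomputed.

```python
# Toy model (illustrative only, not NetApp's documented WAFL behavior):
# count disk reads/writes for one stripe write in a RAID-DP raid group.

def stripe_write_ops(rg_size: int, blocks_written: int) -> dict:
    """Rough per-stripe operation count for a RAID-DP group of rg_size disks."""
    data_disks = rg_size - 2                 # RAID-DP: two parity disks
    if not 1 <= blocks_written <= data_disks:
        raise ValueError("blocks_written must fit within one stripe")
    reads = data_disks - blocks_written      # read-ahead to complete the stripe
    writes = blocks_written + 2              # new data blocks plus both parities
    return {"reads": reads, "writes": writes, "total": reads + writes}

# Full-stripe writes need no read-ahead, whatever the RG size:
print(stripe_write_ops(16, 14))   # {'reads': 0, 'writes': 16, 'total': 16}
print(stripe_write_ops(24, 22))   # {'reads': 0, 'writes': 24, 'total': 24}
# A partial stripe on the wider RG forces more read-ahead for parity:
print(stripe_write_ops(24, 4))    # {'reads': 18, 'writes': 6, 'total': 24}
```

In this simplified model the penalty of a wide RAID group only shows up on partial stripes, which matches the point above: on a busy aggregate that can no longer fill stripes, wider groups mean more blocks to read back before parity can be calculated.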

davidrnexon
7,994 Views

Actually, I just checked all the releases where this bug is fixed, and they also include ONTAP 8.1.1 RC1, which is what we are running. I hope the upgrade fixes your issue though.

Do you know if you have optimal RAID group sizes, and are all your disks/VMs aligned correctly?

mdvillanueva
6,964 Views

Hi all:

Hopefully I am not too late to add to this. We recently upgraded to 8.1.1 as well and encountered the rlw_upgrading issue. Our filer has two aggregates. One aggregate has already completed the first scrub, but the other one (the one that contains vol0) is not yet done. I have not seen rlw=on on the aggregate that completed the first scrub when I run aggr status -v.

8.1.1 looks full of issues.

davidrnexon
6,964 Views

Hi, you can force the scrub to continue during your off hours until a full scrub has completed: aggr scrub start <aggr_name> (scrubs run against an aggregate, or a specific plex or RAID group within it).
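A rough 7-Mode command sketch for driving the scrub to completion (the aggregate name matches the status output quoted later in this thread; option names should be double-checked against your release):

```
filer> aggr scrub status -v aggr0_sas_15k
filer> aggr scrub resume aggr0_sas_15k     # continue a suspended scrub
filer> options raid.scrub.duration -1      # no time limit: run each scrub to completion
```

Running the scrub outside business hours limits the latency impact, at the cost of taking longer to reach the first full scrub that RLW needs.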

Also, how are your aggregates set up? I.e. what is your current system, which drives do you have in which aggregate, and how many RAID groups do you currently have per aggregate? What does your storage contain: virtual machines, iSCSI/FCP LUNs, CIFS shares, etc.?

Lastly, have you sent any perfstats over to NetApp for analysis?

mdvillanueva
6,964 Views

Hi David,

Yes, I scheduled it to run nightly as opposed to the default weekly schedule.

We have a FAS3240 with DS4243 shelves. We have one aggregate of SAS disks with two RAID groups and one SATA aggregate with three RAID groups.

We have NFS for VMs, CIFS, and iSCSI LUNs. I did send a perfstat when I started noticing the CPU spike right after the upgrade to 8.1.1, and that is how they discovered the rlw upgrade.

Here is the current status of scrubbing.

filer22*> aggr scrub status -v

aggr scrub: status of /aggr0_sas_15k/plex0/rg0 :

Current scrub is 58% complete (suspended).

First full scrub not yet completed.

aggr scrub: status of /aggr0_sas_15k/plex0/rg1 :

Current scrub is 54% complete (suspended).

First full scrub not yet completed.

aggr scrub: status of /aggr1_sata/plex0/rg0 :

Current scrub is 62% complete (suspended).

Last full scrub completed: Sat Sep 22 14:48:02 EDT 2012

aggr scrub: status of /aggr1_sata/plex0/rg1 :

Scrub is not active.

Last full scrub completed: Thu Sep 27 03:32:50 EDT 2012

aggr scrub: status of /aggr1_sata/plex0/rg2 :

Current scrub is 46% complete (suspended).

Last full scrub completed: Wed Sep 26 04:20:47 EDT 2012

aborzenkov
6,964 Views

What is your issue, actually? The rlw_upgrading status is NOT an issue by itself.

mdvillanueva
6,934 Views

Hi Aborzenkov,

The problem is, I am still trying to figure out what's going on. I was told that after the scrubbing of the aggregate I would see something like “rlw=on” when I run aggr status -v. I don't see that on the aggregate that has already completed the first scrub.

We have more CPU spikes now than before the upgrade. OnCommand System Manager 2.0 has become flaky since the upgrade. I also get more NBT log suppressions.

aborzenkov
6,934 Views

Well, if you put more load on your system by running aggregate scrubbing more often, you should expect some impact …

mdvillanueva
7,194 Views

I only run it during off hours. In spite of that, I still get CPU spikes during work hours that we didn't get before… I was also told that there are changes to WAFL that are causing the performance changes, but that it should gradually get back to normal.
