VMware Solutions Discussions

Data Ontap 8.1 upgrade - RLW_Upgrading process and other issues

davidrnexon
39,240 Views

Hi,

We recently upgraded ONTAP from version 8.0.2 to 8.1. We read through the release notes, Upgrade Advisor and the upgrade notes, then proceeded with the upgrade, which went quite smoothly, BUT..

Nowhere in the release notes or the 8.1 documentation does it mention that after the upgrade a background process runs that can dramatically degrade the performance of your filer. If anyone from NetApp reads this, can you please ask to have this caveat added to the release notes and Upgrade Advisor.

Right after the upgrade, a background process called rlw_upgrading begins. RLW is short for RAID Protection Against Lost Writes. It is new functionality added in Data ONTAP 8.1.

To see this process you need to be in priv set diag and then run aggr status <aggr_name> -v
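If you capture that command's output to a text file, a small script can tell you whether any aggregate is still in the upgrading state. This is only a rough sketch: it assumes the state shows up as a plain token in the text (rlw_upgrading and rlw_on are the two names mentioned in this thread), the function names are made up, and the sample string is a hypothetical placeholder, not real ONTAP output.

```python
import re


def rlw_states(status_text: str) -> list[str]:
    """Return every rlw_* token found in captured `aggr status -v` text."""
    return re.findall(r"\brlw_[a-z_]+\b", status_text)


def still_upgrading(status_text: str) -> bool:
    """True if at least one rlw_upgrading token appears in the text."""
    return "rlw_upgrading" in rlw_states(status_text)


if __name__ == "__main__":
    # Hypothetical placeholder text, NOT real ONTAP output.
    sample = "aggr_example online, rlw_upgrading"
    print("still upgrading:", still_upgrading(sample))
```

Running it against a fresh capture every day gives a crude yes/no answer, but it cannot say how much work is left.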

The issue is that while this process is running and your dedupe jobs kick in, the CPU skyrockets to 99% and filer latency goes through the roof. The only way to keep the filer usable is to either disable all dedupe or set all dedupe schedules to manual.

The problem is, this background process has been running for the last 3 weeks on one filer, and the last 2 weeks on another filer.

I have a case open with NetApp at the moment, but was wondering if anyone else has experience with this, or any recommendations/commands for us to see how long this process has left to run, because no one seems to know much about this process or function?

For the last 2-3 weeks we have not been able to run any deduplication without severely impacting filer latency.

107 REPLIES

BRAVEBELL
6,080 Views

Sorry, I am now working with NetApp storage, but I don't know how to install it. Please send the NetApp simulator.

BOUCHERPHILIPPE
6,038 Views

Hi all

I previously opened this thread on rlw_upgrading: https://forums.netapp.com/thread/30084

We have experienced problems with rlw_upgrading after upgrading to 8.1.

We configured scrub to run all the time. After one week, all aggregates have status rlw_on (aggr status -v in diag mode).

But we still have the same problem with high CPU, and vol commands take a very long time to execute (more than 10 minutes for a vol offline!).

NetApp found that our problem matches bug 518288, which is fixed in 8.1P3.

Yesterday we upgraded to 8.1P3. Vol commands still take a very long time to execute.

We will wait one week to determine whether the CPU problem has been resolved.

Regards

mdvillanueva
6,008 Views

Hi Boucher,

Thanks for the tip. I did not know that I had to be in diag mode to see rlw_on. Now I see it on the aggregate that completed scrubbing.

May I know why you guys chose not to upgrade to 8.1.1?

BOUCHERPHILIPPE
6,008 Views

Hi

8.1.1 has some new features that could interact with our configuration.

So we prefer to upgrade to 8.1P3, which only fixes some bugs.

But if our problem is not solved, we plan to upgrade to 8.1.1 (or 8.1.x).

Regards

michailidisa
6,310 Views

Hello all, I have upgraded a 3140 at one of the biggest telco companies and the system suffers serious performance issues... I mean about 30-40 IOPS with 8.1.1... All this came after upgrading to 8.1 and opening a case with NetApp, who proposed that I upgrade to 8.1.1 for several performance fixes... I am thinking very seriously about downgrading to 8.0.3, which the customer had before with no issues... Are all these "bugs" because of the RAID lost-write protection and the disk scrubbing?

Could it be that the FAS3140 has only 8 GB of memory, and that is not enough to support this operation?

RENEARROW
6,269 Views

I think I have the same problem on ONTAP 8.1.1.

After the ONTAP upgrade, WAFL scan processes make the filer use 100% disk and CPU for several days.

After 3-5 days, the filer stops responding (high read latency) and NFS/iSCSI is disconnected.

I have seen this twice now, on ONTAP 8.1 and 8.1.1 and on different systems, FAS3210 and FAS3140.

dburkland
6,310 Views

Just out of curiosity did you update to 8.1.1GA or 8.1.1P1?

michailidisa
6,310 Views

8.1.1 GA

davidrnexon
6,269 Views


Have you logged a fault with NetApp to grab some perfstats etc. (the first thing they will ask is whether all your VMs and disks are aligned)? If so, let us know what they say. Are you using your SAN for virtual machines or dedicated LUNs, CIFS shares, etc.?

How many aggregates do you have and how many RAID groups per aggregate?

mikeymac1
6,408 Views

Wonderful thread! I'm actually running 8.1 with no problems on my DR NetApp system. It's lightly used, though. I was just investigating upgrading the DR system and my production systems (which are still at 7.3.2) to 8.1.1. Thankfully, I found this thread. I'll hold off until this issue is resolved.

aborzenkov
6,257 Views

Well … this thread lists half a dozen issues at least and none of them is related to RLW upgrade. So which issue do you have in mind specifically? ☺

michailidisa
6,258 Views

Dear all

After many, many emails with NetApp, escalations, etc., the conclusion is:

1. Misaligned LUNs in combination with deduplication are a killer for a NetApp environment.

2. Check the disk utilization on the aggregates (priv set advanced, statit -b, then statit -e after 10 minutes).

3. If disk utilization is too high, try to spread the workload.

4. Never go over 85% space utilization on the aggregates.

5. SATA disks and SAS disks on the same controller can cause many performance issues.

6. Turn the aggr nosnap option ON.

7. Disable deduplication for a period of time and check the disk utilization again.
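Item 4 is easy to check with a few lines of script once you have the used and total sizes for each aggregate. A minimal sketch, assuming you type the numbers in yourself; the aggregate names and sizes below are invented for illustration and the function name is hypothetical.

```python
def over_threshold(used_kb: int, total_kb: int, limit_pct: float = 85.0) -> bool:
    """True if space utilization is strictly above limit_pct percent."""
    if total_kb <= 0:
        raise ValueError("total_kb must be positive")
    return 100.0 * used_kb / total_kb > limit_pct


if __name__ == "__main__":
    # Invented example figures (used KB, total KB), not from a real filer.
    aggregates = {
        "aggr_example_a": (900, 1000),
        "aggr_example_b": (500, 1000),
    }
    for name, (used, total) in aggregates.items():
        flag = "OVER 85%" if over_threshold(used, total) else "ok"
        print(f"{name}: {100.0 * used / total:.1f}% used, {flag}")
```

The 85% figure is the one quoted in the list above; pass a different limit_pct if your own guidance differs.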

Normally, in an environment installed according to NetApp best practices, there will be no problem with 8.1.1.

In my case we ordered Flash Cache modules to speed up reads, because we have to copy all the data to another storage system, rebuild the FAS3140 from scratch and then bring the data back.

davidrnexon
6,258 Views

I'm kind of leaning towards the conclusion that any ONTAP version before 8.1 handled a "non-optimized" SAN much better than 8.1 does. 8.1 really lets you know if you have some issues. I'm still not completely ruling out that there could be some bugs in 8.1.x.

In regards to your list above, item 5, I've heard that with 8.1.1 this doesn't make any difference. However, we still isolate SAS and SATA to different controllers. It's also better to have one large aggregate with multiple RAID groups than many smaller aggregates with only 1 RAID group, because the controller can read from multiple RAID groups at the same time. Remember that when you add new disks you need to run a reallocate -f -p /vol/volname, which will spread the data out across the new disks. This is in priv set advanced mode.

michailidisa
6,258 Views

Yeah, the reallocation is a must!!!!

8.1.1 just brought to light all the misconfiguration of this specific 3140... I believe that with the right config + PAM the system will fly...!!!

radek_kubka
6,258 Views

I'm not liking this - I will explain why.

All these recommendations make some sense and are fairly well known (albeit some not always feasible, like not mixing SATA & SAS on the same controller).

But it all sounds like hiding behind 'best practices', rather than solving the actual problem.

It is as simple as this:

A "not ideally" configured filer runs on ONTAP 8.0.2 (or 7.3.x) and the performance is acceptable. After an upgrade to 8.1.x performance goes through the floor.

Surely there must be more to it, rather than tweaking well known parameters, like aggregate utilisation, etc.

Regards,
Radek

davidrnexon
6,258 Views

Hi Radek, I totally agree with you. Apart from some of the responses on this thread where NetApp did identify one or two issues that were being hit and recommended upgrading to 8.1.1GA, we don't have much more from them than advice to follow best practices.

ERICBLECKE
5,993 Views

Okay, I have a couple of questions (upgraded to 8.1.1 - filer 2 panicked and we had to do a wafl iron - but that seems fine now)  There was a BURT that was referenced on the root cause, but it had only been seen like 4 or 5 times before.

I added 2 sata disk shelves and grew my aggregate by 20 disks (on both filers) - Performance took a major hit - found this thread and set the scrubs to run for 7 hours each night (last night being the first) - thanks for this thread - at least users aren't complaining

     1.  How long do the initial scrubs take to complete?  Do I leave the scheduled scrubs running during the week forever, or is this a monthly/yearly maintenance type action?  I see that some scrubs have been done in the past, but it doesn't follow a set pattern that I can see...

          a.  Currently sitting at ~ 20% complete on each of the raid groups - in theory, it will be this weekend before the scrubs complete?

     2.  How long does a volume reallocate take?  This is my next step especially since I added so many disks - unless I can do them at the same time as a scrub?

     3.  Why does it take so long to get a reallocate measure back?  Is there a way I can run that command across the system and get a summary?  Seems to take about an hour to get a result from each volume - multiply that by 20 volumes, it could take a while just to find out which volumes NEED to be reallocated
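A rough back-of-the-envelope for the numbers above, as a sketch only: it assumes the scrub advances at the same rate every night (about 20% per 7-hour window, as seen after the first night) and that reallocate measure runs on one volume at a time at about an hour each. Neither assumption is confirmed, and the function names are made up.

```python
import math


def nights_remaining(percent_done: float, percent_per_night: float) -> int:
    """Nights still needed, assuming a constant scrub rate per nightly window."""
    if percent_per_night <= 0:
        raise ValueError("percent_per_night must be positive")
    return math.ceil((100.0 - percent_done) / percent_per_night)


def total_measure_hours(volumes: int, hours_per_volume: float) -> float:
    """Total wall-clock hours if volumes are measured one after another."""
    return volumes * hours_per_volume


if __name__ == "__main__":
    print("scrub nights left:", nights_remaining(20.0, 20.0))
    print("measure hours for 20 volumes:", total_measure_hours(20, 1.0))
```

With those inputs the scrub would need about four more nightly windows, and measuring all 20 volumes back to back would take roughly 20 hours.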

Thanks for any feedback I appreciate it!

davidrnexon
6,257 Views

you're right, i've added to the thread title "and other issues" just for you

mdvillanueva
6,257 Views

Yes, I am almost regretting upgrading to 8.1.1 now. The RLW upgrade is done, but now every time a deduplication runs, the CPU spikes and the latency goes up for almost all volumes. Our Exchange environment is driving me nuts because it is so sensitive to sudden changes in latency.

davidrnexon
6,257 Views

Hi mdvillanueva, we were in the exact same position. We had to turn all dedupes to manual (i.e. not run them at all) until we got all our alignment, aggregate/RAID group layout, etc. corrected. We now only run dedupe on about 3 volumes. Any time dedupe kicks in we see a CPU spike, but latency does not seem to be that bad.

michailidisa
5,992 Views

Are read IOPS to an aligned LUN really so much better than to a misaligned LUN?

I see 15-20 IOPS on a misaligned LUN and about 70-80 on an aligned LUN. Even if I do lun create -s 50g -t vmware and map it to a Windows host, lun show -v says the LUN is aligned, and in lun alignment show I see all the reads and writes in blocks 0-7, but I know that the LUN is not aligned... I am afraid that even if I correct all the errors in my environment I will still have just acceptable performance. Downgrading is an option, but when I noticed that I would have to revert all the sis metadata files back to 8.0.3, downgrading just disappeared from my head...!!! Has anyone tried to adjust their environment to NetApp best practices, and can you tell us whether performance improved as NetApp insists?
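For anyone following along who wants to sanity-check alignment from the host side: WAFL works in 4 KB blocks, so a partition is aligned when its starting byte offset is a multiple of 4096. A minimal sketch of that arithmetic; the function name is made up, the sector size of 512 bytes is assumed, and the two start sectors are just the common examples (63 was the classic MBR default on older Windows releases, 2048 is the usual 1 MiB start on newer ones).

```python
WAFL_BLOCK_BYTES = 4096  # WAFL block size
SECTOR_BYTES = 512       # assumed logical sector size


def is_aligned(start_sector: int, sector_bytes: int = SECTOR_BYTES) -> bool:
    """True if the partition's starting byte offset falls on a 4 KB boundary."""
    return (start_sector * sector_bytes) % WAFL_BLOCK_BYTES == 0


if __name__ == "__main__":
    for sector in (63, 2048):
        offset = sector * SECTOR_BYTES
        print(f"start sector {sector} (offset {offset} bytes): aligned={is_aligned(sector)}")
```

A start at sector 63 gives an offset of 32256 bytes, which is not a multiple of 4096, so a guest 4 KB I/O can straddle two WAFL blocks. This only checks the partition offset; it says nothing about what lun show -v or lun alignment show report on the filer side.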

thanks!
