
Data Ontap 8.1 upgrade - RLW_Upgrading process and other issues

davidrnexon
39,134 Views

Hi,

We recently upgraded Data ONTAP from version 8.0.2 to 8.1. We read through the release notes, Upgrade Advisor and the upgrade notes, then proceeded with the upgrade, which went quite smoothly, BUT...

Nowhere in the release notes or the 8.1 documentation does it mention that, after the upgrade, a background process runs that can dramatically degrade the performance of your filer. If anyone from NetApp reads this, can you please ask to have this caveat added to the release notes and Upgrade Advisor.

Right after upgrading, a background process called rlw_upgrading begins. RLW is short for RAID Protection Against Lost Writes. It is new functionality added in Data ONTAP 8.1.

To see this process you need to be in priv set diag and then run aggr status <aggr_name> -v
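If you capture that output to a text file, a check like the following could flag whether the process is still listed. This is only a minimal sketch: it assumes the captured text contains the token rlw_upgrading while the process is active, and the sample strings are made up for illustration, not real filer output.

```python
def rlw_upgrade_in_progress(status_output: str) -> bool:
    """Return True if the captured status text mentions rlw_upgrading."""
    # Case-insensitive, since the name is written both as
    # rlw_upgrading and RLW_Upgrading in this thread.
    return "rlw_upgrading" in status_output.lower()


# Hypothetical sample text, NOT real filer output.
sample_during = "aggr0 online raid_dp, aggr, rlw_upgrading"
sample_after = "aggr0 online raid_dp, aggr"

print(rlw_upgrade_in_progress(sample_during))
print(rlw_upgrade_in_progress(sample_after))
```

This only tells you whether the flag is present; it says nothing about how much work the process has left.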

The issue is that while this process is running and your dedupe jobs kick in, the CPU skyrockets to 99% and filer latency goes through the roof. The only way to keep the filer running acceptably is to either disable all dedupe or set all dedupe schedules to manual.

The problem is, this background process has been running for the last 3 weeks on one filer, and the last 2 weeks on another filer.

I have a case open with NetApp at the moment, but was wondering if anyone else has experience with this, or any recommendations/commands for us to see how long this process has left to complete, because no one seems to know much about this process or function?

For the last 2-3 weeks we have not been able to run any deduplication without severely impacting filer latency.


davidrnexon
6,776 Views

I wouldn't worry about downgrading; it's too late now I think, unless you want to involve NetApp, especially if your RLW_Upgrading process is complete. What do you mean when you say you know that the LUN is not aligned even though you created it with the -t vmware option?

michailidisa
6,776 Views

Normally, if you want to map a LUN to a Windows 2008 server, you do

lun create -s 20g -t windows_2008, then you map the LUN to the host, and from Windows you use diskpart to align the LUN with NetApp at a 64 KB block size.

If you do lun create -s 20g -t linux and map it to a Windows server, will the LUN be aligned?
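For reference, the alignment arithmetic on the host side can be sketched as below. It assumes 512-byte sectors and uses the 4 KB WAFL block size: a partition is aligned when its starting offset is a multiple of 4096 bytes. The sketch only checks the guest partition offset; it does not model what the -t option does on the filer, so it does not answer the question above by itself.

```python
WAFL_BLOCK = 4096  # bytes; WAFL works in 4 KB blocks
SECTOR = 512       # bytes; assumed sector size


def is_aligned(offset_bytes: int, block: int = WAFL_BLOCK) -> bool:
    """A partition is aligned if its start offset is a multiple of the block size."""
    return offset_bytes % block == 0


legacy_offset = 63 * SECTOR    # classic MBR start at sector 63 = 32256 bytes
diskpart_offset = 64 * 1024    # 64 KB offset set with diskpart = 65536 bytes

print(legacy_offset, is_aligned(legacy_offset))      # 32256 False
print(diskpart_offset, is_aligned(diskpart_offset))  # 65536 True
```

A start at sector 63 leaves every 4 KB guest block straddling two WAFL blocks, which is why the 64 KB offset is used.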

craigbeckman
6,776 Views

Would have to agree with David... I have never known anyone who has downgraded the firmware on an array. It would have to be totally unusable before you would consider such a risky procedure.

michailidisa
6,783 Views

Also, downgrading from 8.1.1 to 8.0.3, for example, is not only risky, it's also a very long procedure because you have to revert all the metadata from 8.1.1 to 8.0.3.

vladimirzhigulin
6,775 Views

What are those notorious "best practices" you are talking about? Just wondering here... I have a couple of spare FAS3270s which I plan to re-install with 8.1.1GA, and reading this thread doesn't make me feel confident, as I'm more than happy with 8.0.2P4 7-Mode at the moment.

Regards,

Vladimir

IRVING_POPOVETSKY
6,924 Views

So we can put this thread to rest:

1. For those who have upgraded from 8.0.x directly to 8.1.1: How was this issue resolved for you? Were the root-cause bug IDs the same as for 8.1.0 or different?

2. Has anyone compared the post-upgrade performance of 8.1.2RC to 8.1.1? 8.1.2RC2 has a long list of bug fixes over 8.1.1 that may be related to this issue.

Please post your results.

aborzenkov
6,924 Views

I am again obliged to ask: which issue? This thread names at least half a dozen issues. And again, the RLW_Upgrading status/process is not an issue.

davidrnexon
6,924 Views

Hi Irving, I haven't really seen any posts on here where the user upgraded to a specific version and their problems went away.

I read through the long list of bug fixes for 8.1.2RC2, and it's a little concerning for those of us on earlier releases to see all the resolved issues for system panics.

I'd also like to hear from anyone who experienced system problems after upgrading to 8.1.x but has now upgraded to 8.1.2RC2.

dburkland
6,931 Views

I just upgraded a pair of 3240s from 8.0.x to 8.1.1P1 and as a precaution I am currently performing the following steps:

a. Disable dedupe schedules

b. Perform manual scrub of each aggregate *Still waiting on some raid groups to complete*

c. Re-enable Dedupe schedules

Most of the scrubs completed quickly; however, one aggregate seems to be taking its time (it has been a few days and it's only at 33%). After the ONTAP upgrade I now wish I had let the dedupe schedules run, to see if resource utilization would still increase due to the RLW_Upgrading and dedupe processes running at the same time.

Dan
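The three steps above can be sketched as an ordered command plan. This is a hedged sketch only: the command strings (sis config -s, aggr scrub start, aggr scrub status -v) are my assumption of the 7-Mode syntax and should be checked against the man pages for your release, and the volume, aggregate and schedule values are hypothetical. Nothing is executed; the function only builds the list.

```python
def precaution_plan(volumes, aggregates, schedule):
    """Build the ordered command list for steps a, b and c (nothing is run)."""
    cmds = []
    # a. Disable dedupe schedules ("-" is assumed to mean "no schedule").
    cmds += [f"sis config -s - {vol}" for vol in volumes]
    # b. Start a manual scrub of each aggregate, then check progress.
    cmds += [f"aggr scrub start {aggr}" for aggr in aggregates]
    cmds.append("aggr scrub status -v")
    # c. Re-enable the dedupe schedules once the scrubs have completed.
    cmds += [f"sis config -s {schedule} {vol}" for vol in volumes]
    return cmds


# Hypothetical names, for illustration only.
plan = precaution_plan(["/vol/vol1"], ["aggr0"], "sun-sat@0")
for cmd in plan:
    print(cmd)
```

Keeping step c last matters: the point of the precaution is that dedupe does not run while the scrubs are still in progress.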

michailidisa
6,931 Views

Hello all

After 1 month of continuous testing of 8.1.1 on 2 systems, one 2240 and one 3140, I have to say the following...

The 2240 has 24 SAS disks, runs 8.1.1 and is connected to 1 server running ESXi 5 with 12 VMs on it, including 2 Exchange servers, 1 Oracle and many more.

The 3140 has 3 DS14mk2 shelves with SATA disks and 1 4243 with SAS disks, running ESXi 4 with many VMs on it, also running Exchange 2010 DAGs and more.

The 2240 without Flash Cache performs pretty well with its 6 GB of memory per controller, whereas the 3140 with 4 GB of memory per controller does not. I have an official answer from NetApp that with 8.1.1 there is only 1.5 GB of memory left on the 3140 to run Data ONTAP, which is not enough. When deduplication was enabled I saw disk utilization at about 94% and CPU hitting 91%; when I disabled it the system behaved better.

When we installed Flash Cache on the 3140 (running at 128 GB, not 256 GB, because otherwise the system will panic) the performance increased a lot... My conclusion about 8.1.1 is that the problem appears on old systems like the 3140, 3210 etc. I have 12 2240 systems installed with 8.1.1 with heavy environments on them and nobody complains to me about performance...!!!

I also tried installing the 2240 without best practices (misaligned LUNs, 2 or more aggregates etc.) and noticed that I had about 240 IOPS on the Exchange servers. I wiped the system and installed everything from scratch, configuring everything according to NetApp best practices (I even installed the OS on the VMs with the WinPE CD in order to create a 64 KB partition). I did everything according to what NetApp recommends, reallocated all the data, tested the environment and had about 360 IOPS... Misaligned LUNs and many aggregates could be a serious performance issue.
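To put the two IOPS figures in perspective, here is the relative gain worked out. It uses only the approximate numbers quoted above (about 240 and about 360 IOPS), so the result is approximate as well.

```python
def percent_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


misconfigured_iops = 240   # approximate figure quoted above
best_practice_iops = 360   # approximate figure quoted above

print(percent_gain(misconfigured_iops, best_practice_iops))  # 50.0
```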

So I think that there is nothing wrong with the RLW procedure running... and it's all about memory and correct configuration... That is of course my opinion.

lafoucrier
6,931 Views

Hello michailidisa,

We are experiencing the same behaviour with 8.1.1 on a FAS3240 and, by the way, must prepare to migrate our FAS3140 HA to this version.

We also use dedup and I have a question about it:

The CPU and disk activity peak you observed with dedup: was it because dedup was running at the time, or just because dedup was simply "enabled" (not running) on the volumes?

When you disabled dedup, do you mean that you simply turned it off (disabled it) on the volumes, or did you do a complete revert (priv set diag; sis undo) of dedup on those volumes?

Thanks in advance,

Best regards,

Yannick

michailidisa
6,931 Views

Hello Yannick

I noticed the peak while dedup was running (the customer had deduplication running every day), so I just disabled dedup with sis off (I didn't run sis undo).

The goal was just to disable deduplication on the misaligned LUNs, which caused the peaks on the disks and CPU.

After 3 weeks dedup is still not enabled, because I don't want to have the same impact even though Flash Cache is installed on the systems.

Just in case:

I am running benchmarks (Iometer) at many customers,

so... I can upload, in some time, the results from customers running 8.1.1 and 8.0.3.

lafoucrier
6,931 Views

Thanks for your feedback and research!

I also note that Flash Cache on the FAS3140 helped to increase performance in your context.

Maybe it is a solution in mine too: my FAS3140 HA is a QSM destination and a VSM source, so Flash Cache could store metadata in a low-latency cache to help relieve memory usage. That's just a hypothesis, and not planned for me at the moment.

Also in your test plan, did you try to disable RLW on filers to see the performance impact of this new feature?

aborzenkov
6,957 Views

"Also in your test plan, did you try to disable RLW on filers to see the performance impact of this new feature?"

I do not think you can (at least, using documented interfaces). RLW is an integral part of how RAID works in these versions.

michailidisa
6,957 Views

Beware that Flash Cache in the 3140 can only run at 128 GB and is only supported in 8.1.1.

I haven't tried to disable RLW because I wanted to see the impact on the filers with this feature running. (I don't know if it's possible to disable RLW.)

As I can see in your post, you also have a 3240... and you are facing problems with 8.1.1.

I also have one here in Greece, installed with about 100 SATA disks, no Flash Cache, very well configured (no misalignments, many disks in 1 aggregate etc.).

I saw about 700 IOPS on an Exchange 2010, which is pretty good I guess...

lafoucrier
6,957 Views

It seems that you can disable RLW: https://kb.netapp.com/support/index?page=content&id=3013583

but that could be a problem if you upgrade later to 8.2 (you can't re-enable it for the moment).

I have a FAS3240 (on 8.1P3; sorry, I thought we were already on 8.1.1...) and we are inclined to either 1. disable dedup entirely (disk usage impact) or 2. reschedule dedup jobs to off-peak hours (less disk usage impact).

But in parallel I'm looking into all the other causes (RLW may be one) that could lead to the extra resource consumption observed since 8.1 and could explain why NetApp tells you that the FAS3140 is too short on memory for DOT 8.1*.

michailidisa
6,957 Views

8.0.3, if I am not mistaken, is 160+ MB and 8.1.1 is about 205 MB,

so there are roughly 40 MB more of code and other stuff in these versions, which probably use more features or more memory to operate...

When ONTAP boots there is a point where it says "there are 1.5gb free mem for data ontap".

I don't know if the RLW process has already started at that point to "eat" memory from the 4 GB of each controller.

If RLW is the only thing added in 8.1.x then it is surely a memory eater,

and I base this on the fact that our 2240 with 6 GB per controller has no problem operating with 8.1.1.

lafoucrier
7,229 Views

Thanks for the information, that's much more helpful.

If somebody has the possibility to disable RLW in their context to see the behavior (performance impact) with and without it, that could be interesting.

Best regards,

aborzenkov
7,229 Views

Thank you for the reference. Good article indeed.

lafoucrier
7,229 Views

It seems that NetApp explains here that RLW has no effect on performance:

https://forums.netapp.com/thread/30084?start=15&tstart=0

but I'm trying to get more information on RLW and the exact enhancement it offers over the existing lost-write protection mechanisms already in DOT + WAFL (block checksum, physical + logical ID, scrubs, RAID-DP...).

dburkland
7,229 Views

Thanks for the link @lafoucrier, I have bookmarked the KB article listed there.

Dan
