VMware Solutions Discussions

Data ONTAP 8.1 upgrade - RLW_Upgrading process and other issues

davidrnexon

Hi,

We recently upgraded Data ONTAP from version 8.0.2 to 8.1. We read through the release notes, Upgrade Advisor and the upgrade notes, then proceeded with the upgrade, which was quite smooth, BUT..

Nowhere in the release notes or 8.1 documentation does it mention that after the upgrade there is a background process that runs that can dramatically degrade the performance of your filer. If anyone from NetApp reads this, can you please have this caveat added to the release notes and Upgrade Advisor.

Right after upgrading, a background process begins that is called rlw_upgrading. RLW is short for RAID protection against Lost Writes. It is new functionality added in Data ONTAP 8.1.

To see this process you need to be in priv set diag and then run aggr status <aggr_name> -v.
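For example (aggr01 is just our aggregate name here):

priv set diag
aggr status aggr01 -v

Look for rlw_upgrading in the aggregate's state list. Once the process finishes, the state changes to rlw_on.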

The issue is that while this process is running and your dedupe jobs kick in, the CPU skyrockets to 99% and filer latency goes through the roof. The only way to keep the filer usable is to either disable all dedupe or set all dedupe schedules to manual.
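If it saves anyone a lookup, the per-volume commands are along these lines (7-Mode syntax; double-check against the man pages for your release):

sis off /vol/vol_name (turns dedupe off on the volume entirely)
sis config -s - /vol/vol_name (keeps dedupe enabled but removes the schedule, so it only runs manually)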

The problem is that this background process has been running for the last 3 weeks on one filer, and the last 2 weeks on another filer.

I have a case open with NetApp at the moment, but I was wondering if anyone else has experience with this, or any recommendations/commands that would let us see how long this process has left to complete, because no one seems to know much about this process or function?

Because for the last 2-3 weeks we have not been able to run any deduplication without severely impacting filer latency.


davidrnexon

Just thought I'd update everyone on this issue. We have been working with NetApp for almost one month now and it's with the highest level engineers. They have basically told us that the 8.1GA code has issues that can seriously degrade system performance to the point where it's almost unusable. There is no workaround for ONTAP 8.1; the fix is to upgrade to 8.1.1RC1. There is a public BURT that has been released, which you can read here: http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=510586

We are just waiting on some more test results and we will be upgrading to 8.1.1RC1 soon. I will write back to let everyone know if the performance returns to normal.

mgiard214

David,

I just wanted to let you know about our case update.  Our aggr scrub completed, the RLW upgrade completed, I manually restarted dedupe and, unfortunately, we still have the same problem of 100% CPU and 95% disk utilization.  Once dedupe is stopped, everything is normal.  I re-engaged the NetApp escalation engineers, submitted all the perfstat and statit information, and they told me the same thing you were told: that the performance characteristics we are seeing on our FAS3240 are very similar to issues others have had that should be addressed in 8.1.1RC1.  They also indicated that 8.1.1GA should be coming out within the next couple of weeks or so with additional fixes.  My BURT for the controller shutting down is not addressed in the 8.1.1 releases. I hope to hear back from you that your upgrade to 8.1.1RC1 fixed your issues.

davidrnexon

I'm surprised Data ONTAP 8.1GA hasn't been removed from the downloads section with all the problems it's causing everyone.

We are upgrading this Sunday night, so I will definitely let you know.

dburkland

How did the upgrade go?

I'm glad my organization has held off on 8.1; hopefully 8.1.1 goes GA soon though!

craigbeckman

Does anyone reading this have any idea when 8.1.1 goes GA?

mgiard214

I was told last week by Netapp escalation support that 8.1.1GA was scheduled to be released by the end of July.  However, the tech could not guarantee they would meet that deadline.

mgiard214

8.1.1 is available today on the NOW site for download!!

http://support.netapp.com/NOW/download/swchangelog.shtml

davidrnexon

Awesome, thanks for the update. Can anyone with this issue who plans to install 8.1.1GA let us know whether it resolves any of the issues?

davidrnexon

A quick update: we have upgraded to ONTAP 8.1.1RC1 and still face the same performance problems. We have some 256GB Flash Cache cards "on loan" that we put into the system last night. It's still a bit early to tell, but so far today everything seems to be running OK. We haven't tried to run a dedupe yet; once we do I'll update the post here.

craigbeckman

Not happy to hear that you have the same issues on 8.1.1RC1, as we were considering upgrading to that release on the recommendation of NetApp support.

We upgraded from 8.0.2p6 to 8.1 a few weeks ago and have had very high CPU and CIFS latency issues even though we haven't enabled dedupe or compression as yet.

I still have a ticket open with Netapp support and am trying to get some answers as to what the next steps are.

davidrnexon

Another update: the system has now been running with the 256GB Flash Cache modules for a few days. It seems to be responding much faster, though we have been advised that the modules will need to be increased to 512GB. We will also be adding another shelf of 24 x 600GB SAS drives, setting a raid group size of 16 (which is recommended for a 3240 with 600GB SAS), creating a new aggregate, and migrating data over to it.
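For anyone following along, the aggregate creation step will be something like this (our naming; 7-Mode syntax, verify before running):

aggr create aggr_new -r 16 16

i.e. 16 disks in a single raid group of size 16. You can also pick the disks explicitly with -d if you want control over which spares get used.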

NetApp also identified a server that was absolutely hammering the system; we have moved it to a single physical server running on its own local storage. This has also improved performance.

We will then be able to destroy the existing aggregates and add those existing 600GB disks into the new aggregate. This will take some time, but I'll post updates along the way. If anyone has questions, don't hesitate to ask.

GTNDADABO

davidrnexon wrote:

Another update: the system has now been running with the 256GB Flash Cache modules for a few days. It seems to be responding much faster, though we have been advised that the modules will need to be increased to 512GB. We will also be adding another shelf of 24 x 600GB SAS drives, setting a raid group size of 16 (which is recommended for a 3240 with 600GB SAS), creating a new aggregate, and migrating data over to it.

NetApp also identified a server that was absolutely hammering the system; we have moved it to a single physical server running on its own local storage. This has also improved performance.

We will then be able to destroy the existing aggregates and add those existing 600GB disks into the new aggregate. This will take some time, but I'll post updates along the way. If anyone has questions, don't hesitate to ask.

Warning on aggregate size: you want to keep your raid groups as uniform as possible.  The new recommendation is a raid group size of 12-20, whichever value gives you the most uniform raid groups.  An aggregate is only as strong as its smallest raid group.

Also, if you add disks to an aggregate, remember to reallocate any volumes in it.
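For example, for each volume in the grown aggregate (7-Mode syntax; -f forces a one-time full reallocation scan, and adding -p is often suggested to avoid blowing out snapshot space, so read the man page first):

reallocate start -f /vol/vol_name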

davidrnexon

Thanks for the reply. According to NetApp there are specific RAID group sizes you should use depending on the disks you have in the system; see the image below.

Another note on adding disks to an aggregate: we have been told to add either a full raid group or at least half of one at a time. For example, with a raid group size of 16, it is recommended to add 16 disks or 8 disks at a time, not 1 or 2 disks at a time. Also remember you must keep 2 spare disks per disk type.
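So for a raid group size of 16, adding half a group would be something like this (7-Mode syntax; the system pulls from matching spares):

aggr add aggr01 8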

radek_kubka

RG size is a bit of a dark art (or personal preference). The table you are referring to suggests the optimal RG size for maximising an aggregate. If you are not going to grow your aggregate that much, you may be better off with a different value.

GTNDADABO

Radek Kubka wrote:

RG size is a bit of a dark art (or personal preference). The table you are referring to suggests the optimal RG size for maximising an aggregate. If you are not going to grow your aggregate that much, you may be better off with a different value.

I've found TR-3437 particularly useful for practicing this art.  I certainly recommend tailoring RG size to drive type/size.

craigbeckman

8.1P2 was released a few days ago and it looks like all my current issues have been addressed!

http://support.netapp.com/NOW/download/software/ontap/8.1P2/

davidrnexon

Hi Craig, do you plan on upgrading to this release soon? I would be interested to hear whether it resolves the issues. Also, what version of ONTAP are you running now?

craigbeckman

Hi David, we have 6210s running ONTAP 8.1 (we recently upgraded from 8.0.2p6).

While our CPU and protocol latency increased with 8.1, we have not experienced significant performance impact and will wait for 8.1.1GA.

I am however going to increase the scrub schedule to run from 11pm each night until about 5am to complete the initial scrub (and apparently the RLW_upgrading process):

options raid.scrub.schedule 360m@mon@23,360m@tue@23,360m@wed@23,360m@thu@23,360m@fri@23,360m@sat@23,360m@sun@23
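To watch how far the scrubs have progressed (and with them, apparently, the RLW upgrade), this should show a percent-complete per raid group:

aggr scrub status -v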

Will let you know how this goes.

Craig

davidrnexon

Sounds good. In our situation, completing scrubs on the aggregates and clearing the RLW_Upgrading process didn't make any difference to performance, and upgrading to 8.1.1RC1 did not make any difference either. Only after installing the PAM cards did we see some improvement. With the same load and the same utilization going from 8.0.2 to 8.1 to 8.1.1, we only saw problems in 8.1 and above.

craigbeckman

DOT 8.1.1 is GA according to the support site.

The RLW_upgrading process is almost complete since I increased the scrubs.

RLW is on for most of my aggregates with only a couple to complete.

craigbeckman

RLW is now on for all my aggregates, but it has not made any difference to the high CPU.

Having said that, NetApp support have now identified the problem.

See details in BURT: 526941 (http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=526941&app=portal).

This ACP module bug causes excessive checksum verification processes, resulting in high CPU and high I/O load on vol0.

The recommended workaround is to:

Remove checksum firmware files from "/etc/acpp_fw":

* ACP-IOM3.xxxx.AFW.FVF

* ACP-IOM6.xxxx.AFW.FVF

where; "xxxx" is the firmware file version.

After removing the FVF files, stop and start ACP.
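For anyone else who hits this: the files live in the root volume (e.g. reachable through a CIFS or NFS mount of /vol/vol0/etc/acpp_fw), and from memory ACP can be stopped and started with the option toggle below (please confirm the procedure with your support engineer first):

options acp.enabled off
options acp.enabled on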
