ONTAP Discussions

Duration of "container block reclamation"?

uptimenow
12,385 Views

Hi,

How long can a container block reclamation run on a volume after the deletion of a snapshot? I would assume the process/thread would terminate in a reasonable amount of time, but I occasionally come across situations where container block reclamations take over 80 hours to finish. Would that still be considered normal, or an indication that something is wrong with the system's performance?

Thanks in advance,

Filip

1 ACCEPTED SOLUTION

uptimenow
12,385 Views

I'll answer my own question with some information I found.

There are a couple of parameters that affect the performance of the container block reclamation scans.

First of all, there is the "wafl scan speed". It defaults to 0, which means it auto-tunes itself; changing it will obviously have an effect on all WAFL scans.
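As a rough illustration (assuming the usual 7-Mode console, where the wafl commands sit at advanced privilege; exact behaviour may differ per release), running the command without an argument shows the current value, and giving it a number changes it:

    priv set advanced
    wafl scan speed
    wafl scan speed 0
    priv set admin

Setting it to 0 puts it back to the auto-tuning behaviour mentioned above.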

For the container block reclamation scans, there is:

    setflag wafl_blk_reclaim_secs_max xxx

    setflag wafl_blk_reclaim_secs_min yyy

where the default values for FAS32x0 ONTAP 7.3/8.0 systems are xxx=3600 and yyy=300. I don't know if they are different on other architectures and/or ONTAP versions. To the best of my understanding, they determine how long a scan should take.

Note that there are a couple of other flags (see "priv set diag ; printflag").
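For example, before touching anything, the current values can be displayed at the diag level (a sketch; I'm assuming printflag accepts a flag name the same way setflag does, and diag-level flags should really only be changed under NetApp guidance):

    priv set diag
    printflag wafl_blk_reclaim_secs_min
    printflag wafl_blk_reclaim_secs_max
    priv set admin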

We found out about this the hard way: on one of our customer's systems, a third-line VSM destination, the CPU was pegged at 100% in the Kahuna domain, caused by these container block reclamation scans. That was with only about 20-40 VSM relations (10-200 GB volume sizes). Looking at the "wafl scan status" output, we saw that the number of blocks to scan was quite high, e.g. for a normal volume:

    Volume xxxxx:
    Scan id                   Type of scan     progress
       1186    container block reclamation     block 7655 of 8032 (fbn 5519)

However, on that third-line system, we had over 180,000 blocks per volume.

We found out why: those volumes had been provisioned by Protection Manager, and PM had created them with a size equal to the containing aggregate (20 TB) and the guarantee set to none.
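To recognise this situation, checking the scan status and the volume's size and guarantee is enough. A sketch with a hypothetical volume name vsm_dst_vol01 ("wafl scan status" needs advanced privilege; vol size and vol options are standard commands):

    priv set advanced
    wafl scan status vsm_dst_vol01
    priv set admin
    vol size vsm_dst_vol01
    vol options vsm_dst_vol01

In our case that showed the full 20 TB size and the guarantee set to none.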

As a result, we had serious performance issues because of the container block reclamation scans.

(solution: recreate the volumes as non-thin provisioned volumes and (re)import the relations in PM)


6 REPLIES


aborzenkov
12,385 Views

So - did you adjust any of these to fix your performance issues? If yes, how exactly (increased, decreased)?

uptimenow
12,385 Views

By increasing the values we gave the system more time to perform the container block reclamation scans. We increased them to 3600 (min) and 14400 (max), based on what a local NetApp PSE found in some case notes. From our understanding this means the WAFL scans now get between 1 and 4 hours to finish; before, the interval was 5 minutes to 1 hour.
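For reference, the change itself looked roughly like this (diag level, values as described above):

    priv set diag
    setflag wafl_blk_reclaim_secs_min 3600
    setflag wafl_blk_reclaim_secs_max 14400
    priv set admin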

 

Changing this had an immediate and quite dramatic impact: the Kahuna CPU load dropped right away.

 

However, we considered this to be a workaround, because many of the destination volumes on this system had a VSM replication schedule of once per hour. In the past we had already experimented with lowering the "wafl scan speed" (the default is 0, meaning auto-tuned; under normal load the system tunes it to around 1300, and we lowered it to 300). That also resulted in a Kahuna CPU drop, but the system could no longer finish the WAFL scans in time, which meant it could not take any new snapshots on a volume until the previous block reclamation had finished.
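That earlier experiment, for completeness (advanced privilege, same caveats as before):

    priv set advanced
    wafl scan speed 300
    priv set admin

and later "wafl scan speed 0" again to return it to auto-tuning.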

 

Ideally, you do not want to change any of the wafl tuning flags. I consider them to be aids to help you find the root cause.

 

We ended up recreating the volumes as non-thin provisioned volumes. We recreated the VSM relations manually and then imported them into PM. The wafl scan speed and setflags were set back to the default values.
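Per destination volume that came down to something like the following, shown here with hypothetical names (srcfiler, src_vol01, dst_vol01, aggr1) and an arbitrary right-sized, volume-guaranteed size; the destination has to be restricted before the VSM baseline:

    vol create dst_vol01 -s volume aggr1 200g
    vol restrict dst_vol01
    snapmirror initialize -S srcfiler:src_vol01 dst_vol01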

 

The performance improvement (Kahuna CPU drop) was also very, very noticeable. The more volumes were "corrected", the lower the Kahuna CPU utilization got.

 

Hope this helps.

mahala
12,385 Views

I am really interested to know whether you applied the solution you mentioned (recreate the volumes as non-thin provisioned volumes and (re)import the relations in PM)?

And if yes, what was the impact on performance, mainly on Kahuna domain utilization?

uptimenow
12,385 Views

Hi, see my other reply in this thread.

It's easy to measure the Kahuna CPU utilization drop caused by changing the wafl setflags: it's immediate (give it 30 seconds at most to settle) and very noticeable (i.e. from 100% down to 5%).
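For anyone wanting to watch that, we looked at the per-domain CPU breakdown; a sketch, assuming your release supports the -M option of sysstat at advanced privilege:

    priv set advanced
    sysstat -M 1

The Kahuna column is the one to watch here.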

By recreating the volumes and re-establishing the Volume SnapMirrors one by one, the result is obviously less immediately visible: with about a quarter of the volumes done, we could already see the filer's response time improving. Now, with all volumes done, the filer behaves completely normally, i.e. after the hourly VSM updates it stays busy for about five minutes and is then pretty much idle until the next scheduled update.

We preferred this solution over changing some hidden flags.

SAROJSAHUMSC
12,385 Views

Thanks for the information you have shared.

Hello dear friends, can you give a step-by-step procedure for configuring an active/active cluster, and for testing it when one controller has an unplanned shutdown?

Thanks and Regards

Saroj
