Disk copy temporarily suspended and will resume automatically...

Beardmann · ‎2020-05-27

I'm in the process of emptying a disk shelf on an AFF8080 in order to move to a newer A700 system...

The AFF8080 is a two-node system with partitioned 3.8T SSD disks... pretty standard Root-Data-Data partitioning.

One RG on each node sharing three DS224-12 shelves...

We have emptied one of the two aggregates and we are now in the process of copying around the partitions in order to empty one of the three shelves...

It all worked fine for two of the three RAID groups, but the last RG seems to stall on us...

We basically run a command like:

disk partition replace -action start -partition 4.1.10.P2 -replacement 1.10.4.P2

And the copy starts, which we can see with "storage aggregate show-status -aggregate DATA02"

And it does indeed show us:

shared 4.1.2 0 SSD - 1.74TB 3.49TB (replacing, copy in progress)
shared 4.0.0 0 SSD - 1.74TB 3.49TB (copy 0% completed)

So far so good...

But... it never gets past 0%... in fact, in the event log we can see the following:

event log show

5/27/2020 17:22:31 NETAPP01-02 NOTICE raid.rg.diskcopy.aborted: /DATA02/plex0/rg2: disk copy from 0d.01.2P2 to 4a.00.0P2 aborted at disk block 5248 after 53:38.94. Reason: Disk copy temporarily suspended and will resume automatically..

And we have a lot of these notices and none of them gets past block 5248... it's been like this for an hour now... (53 mins.)
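For anyone who wants to check this from a saved copy of the event log rather than by eye, here is a minimal Python sketch. It assumes the log text has been exported to a string, and the regular expression is modeled only on the single event line quoted above, so it may need adjusting for other ONTAP releases. The names `ABORT_RE`, `abort_blocks` and `is_stalled` are made up for this illustration.

```python
import re

# Pattern modeled on the single event line quoted above (assumption: other
# ONTAP releases may word the message differently).
ABORT_RE = re.compile(
    r"raid\.rg\.diskcopy\.aborted: (?P<rg>\S+): disk copy from (?P<src>\S+) "
    r"to (?P<dst>\S+) aborted at disk block (?P<block>\d+)"
)


def abort_blocks(log_text):
    """Map each (source, destination) pair to its abort blocks, in log order."""
    blocks = {}
    for match in ABORT_RE.finditer(log_text):
        key = (match["src"], match["dst"])
        blocks.setdefault(key, []).append(int(match["block"]))
    return blocks


def is_stalled(blocks):
    """True when a copy aborted at least twice and never left the same block."""
    return len(blocks) >= 2 and len(set(blocks)) == 1


# The event quoted above; repeated below to mimic the many identical entries.
EVENT = (
    "5/27/2020 17:22:31 NETAPP01-02 NOTICE raid.rg.diskcopy.aborted: "
    "/DATA02/plex0/rg2: disk copy from 0d.01.2P2 to 4a.00.0P2 aborted at "
    "disk block 5248 after 53:38.94. Reason: Disk copy temporarily suspended "
    "and will resume automatically.."
)

for pair, seen in abort_blocks(EVENT + "\n" + EVENT).items():
    print(pair, seen, "stalled" if is_stalled(seen) else "progressing")
```

A pair whose block list never changes across several aborts is the stalled case described here. A list with increasing block numbers would mean the copy is slow but moving.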

There is a bit of load on the aggregate...

NETAPP01::*> statistics aggregate show

NETAPP01 : 5/27/2020 17:24:15

                          *Total Read Write      Read     Write Latency
Aggregate            Node    Ops  Ops   Ops     (Bps)     (Bps)    (us)
--------- --------------- ------ ---- ----- --------- --------- -------
   DATA02     NETAPP01-02  23307 9001  8351 190295040 178327552     284

NETAPP01::*> node run -node NETAPP01-02 -command sysstat -u 1
 CPU   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP           CP_Ty            Disk
       ops/s      in    out    read  write    read  write    age    hit  time  [T--H--F--N--B--O--#--:]  util
 79%   16839     186    397   81384  41384       0      0     9s    99%    0%   0--0--0--0--0--0--0--0     3%
 77%   16378     237    475   92520  40840       0      0     9s    98%    0%   0--0--0--0--0--0--0--0     3%
 76%   17495     593    813   80588  39768       0      0     9s    99%    0%   0--0--0--0--0--0--0--0     3%
 79%   16614     383   1152   94324  43460       0      0     9s    98%   11%   0--0--0--0--0--0--0--0     3%

As you can see there is quite some CPU load on the system... but that's all because of this copy... even though it does not seem to do anything...

In this sysstat check I have stopped the disk replace...

NETAPP01::*> node run -node NETAPP01-02 -command sysstat -u 1
 CPU   Total     Net   kB/s    Disk   kB/s    Tape   kB/s  Cache  Cache    CP           CP_Ty            Disk
       ops/s      in    out    read  write    read  write    age    hit  time  [T--H--F--N--B--O--#--:]  util
 47%   18766      99     44  185051  54054       0      0    44s    95%  100%   0--0--0--0--0--0--0--1     4%
 30%   17002      41   4040  155132 114824       0      0    44s    99%  100%   0--0--0--0--0--0--0--1     2%
 38%   17862       8     11  135448  69704       0      0    24s    99%  100%   0--0--0--0--0--0--0--1     1%
 32%   14811      11     17  130132  61452       0      0    28s    98%  100%   0--0--0--0--0--0--0--1     1%
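To put a number on the difference between the two sysstat runs, here is a small Python sketch that averages the CPU column. It uses only the first two columns (CPU and total ops/s) of the rows pasted above, and `mean_cpu` is a helper name invented for this example. Four one-second samples per run is a very small sample, so treat the result as a rough indication only.

```python
def mean_cpu(sysstat_rows):
    """Average the first column (CPU %) over sysstat data rows."""
    rows = [line.split() for line in sysstat_rows.strip().splitlines()]
    values = [int(fields[0].rstrip("%")) for fields in rows]
    return sum(values) / len(values)


# First two columns of the sysstat rows quoted above.
DURING_COPY = """
 79%   16839
 77%   16378
 76%   17495
 79%   16614
"""

AFTER_STOP = """
 47%   18766
 30%   17002
 38%   17862
 32%   14811
"""

during = mean_cpu(DURING_COPY)
after = mean_cpu(AFTER_STOP)
print(during, after, during - after)  # prints 77.75 36.75 41.0
```

So in these samples the node averages roughly 41 percentage points more CPU while the copy is attempting to run, at a similar ops/s level.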

I have managed to replace 15 out of 24... but the last 9 just won't start and hang as described above...

The raid.resync.perf_impact option is set to medium, and I'm not too keen on raising it to high...

There do not seem to be any other errors on the system...

I'm just trying the community before opening a case, maybe someone has the golden key to this? 😉

/Heino

maffo · ‎2020-05-28

I believe this might be due to the fact that you are attempting to copy just a partition, can you try copying the whole disk instead?
::> disk partition replace -action start -partition 4.1.10 -replacement 1.10.4

Beardmann · ‎2020-05-28

I'm sorry maffo, but your suggested command doesn't make sense...

There are two different commands... "disk partition replace" and "disk replace"...

The disk partition replace command needs a partition input... 1.0.0.P1...

And using the "disk replace" would replace the whole disk with all three partitions which we do not want...

I'm currently waiting for Fujitsu/NetApp support to wake up 😉

/Heino

maffo · ‎2020-05-28

Apologies, I copied and pasted the command, but I did indeed mean to suggest "disk replace" to replace the whole disk.

maffo · ‎2020-05-28

If you want to PM me the serial number of the AFF8080s I can try having a look