ONTAP Discussions
I'm in the process of emptying a disk shelf on an AFF8080 in order to move it to a newer A700 system...
The AFF8080 is a two-node system with partitioned 3.8T SSD disks... pretty standard Root-Data-Data partitioning.
One RG on each node, sharing three DS224-12 shelves...
We have emptied one of the two aggregates and we are now in the process of copying the partitions around in order to empty one of the three shelves...
It all worked fine for two of the three RAID groups, but the last RG seems to stall on us...
We basically run a command like:
disk partition replace -action start -partition 4.1.10.P2 -replacement 1.10.4.P2
And the copy starts, which we can see with "storage aggregate show-status -aggregate DATA02".
And it does indeed show us:
shared 4.1.2 0 SSD - 1.74TB 3.49TB (replacing, copy in progress)
shared 4.0.0 0 SSD - 1.74TB 3.49TB (copy 0% completed)
So far so good...
But... it never gets past 0%... in fact, in the event log we can see the following:
event log show
5/27/2020 17:22:31 NETAPP01-02 NOTICE raid.rg.diskcopy.aborted: /DATA02/plex0/rg2: disk copy from 0d.01.2P2 to 4a.00.0P2 aborted at disk block 5248 after 53:38.94. Reason: Disk copy temporarily suspended and will resume automatically..
And we have a lot of these notices, and none of them gets past block 5248... it's been like this for almost an hour now... (53 mins.)
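For anyone who wants to check a saved copy of the event log for the same pattern, here is a minimal Python sketch. It assumes the raid.rg.diskcopy.aborted message looks exactly like the line quoted above; the function names (parse_abort, copy_is_stalled) and the "at least two events" threshold are made up for illustration and are not part of ONTAP.

```python
import re

# Matches the raid.rg.diskcopy.aborted message as quoted in this thread.
# Assumption: other ONTAP releases may word the message differently.
ABORT_RE = re.compile(
    r"raid\.rg\.diskcopy\.aborted: (?P<rg>\S+): disk copy from (?P<src>\S+) "
    r"to (?P<dst>\S+) aborted at disk block (?P<block>\d+)"
)


def parse_abort(line):
    """Return the RAID group, source, target and block of an abort event, or None."""
    m = ABORT_RE.search(line)
    if m is None:
        return None
    return {
        "rg": m["rg"],
        "src": m["src"],
        "dst": m["dst"],
        "block": int(m["block"]),
    }


def copy_is_stalled(lines, min_events=2):
    """True if there are at least min_events aborts and all stop at the same block."""
    blocks = [e["block"] for e in map(parse_abort, lines) if e is not None]
    return len(blocks) >= min_events and len(set(blocks)) == 1


sample = (
    "5/27/2020 17:22:31 NETAPP01-02 NOTICE raid.rg.diskcopy.aborted: "
    "/DATA02/plex0/rg2: disk copy from 0d.01.2P2 to 4a.00.0P2 aborted at "
    "disk block 5248 after 53:38.94. Reason: Disk copy temporarily suspended "
    "and will resume automatically.."
)
print(parse_abort(sample))
```

Feeding it the lines from "event log show" and calling copy_is_stalled(lines) gives a quick yes/no on whether every abort happened at the same block, which is what we are seeing here.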
There is a bit of load on the aggregate...
NETAPP01::*> statistics aggregate show
NETAPP01 : 5/27/2020 17:24:15
                          *Total Read Write      Read     Write Latency
Aggregate Node               Ops  Ops   Ops     (Bps)     (Bps)    (us)
--------- --------------- ------ ---- ----- --------- --------- -------
DATA02    NETAPP01-02      23307 9001  8351 190295040 178327552     284
NETAPP01::*> node run -node NETAPP01-02 -command sysstat -u 1
CPU Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP_Ty Disk
ops/s in out read write read write age hit time [T--H--F--N--B--O--#--:] util
79% 16839 186 397 81384 41384 0 0 9s 99% 0% 0--0--0--0--0--0--0--0 3%
77% 16378 237 475 92520 40840 0 0 9s 98% 0% 0--0--0--0--0--0--0--0 3%
76% 17495 593 813 80588 39768 0 0 9s 99% 0% 0--0--0--0--0--0--0--0 3%
79% 16614 383 1152 94324 43460 0 0 9s 98% 11% 0--0--0--0--0--0--0--0 3%
As you can see there is quite some CPU load on the system... but that's all because of this copy... even though it does not seem to do anything...
In this sysstat check I have stopped the disk replace...
NETAPP01::*> node run -node NETAPP01-02 -command sysstat -u 1
CPU Total Net kB/s Disk kB/s Tape kB/s Cache Cache CP CP_Ty Disk
ops/s in out read write read write age hit time [T--H--F--N--B--O--#--:] util
47% 18766 99 44 185051 54054 0 0 44s 95% 100% 0--0--0--0--0--0--0--1 4%
30% 17002 41 4040 155132 114824 0 0 44s 99% 100% 0--0--0--0--0--0--0--1 2%
38% 17862 8 11 135448 69704 0 0 24s 99% 100% 0--0--0--0--0--0--0--1 1%
32% 14811 11 17 130132 61452 0 0 28s 98% 100% 0--0--0--0--0--0--0--1 1%
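To put a rough number on the difference between the two sysstat samples above, here is a small Python sketch that averages the CPU column of each. The values are copied from the output above; with only four one-second samples per run this is an indication, not a measurement.

```python
from statistics import mean

# CPU % column from the two sysstat runs above (four 1-second samples each).
cpu_during_copy = [79, 77, 76, 79]    # partition replace running
cpu_copy_stopped = [47, 30, 38, 32]   # partition replace stopped

avg_during = mean(cpu_during_copy)
avg_stopped = mean(cpu_copy_stopped)

print(f"average CPU with copy running: {avg_during:.2f}%")   # 77.75%
print(f"average CPU with copy stopped: {avg_stopped:.2f}%")  # 36.75%
print(f"difference: {avg_during - avg_stopped:.2f} points")  # 41.00 points
```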
I have managed to replace 15 out of 24... but the last 9 just won't start and hang as described above...
The raid.resync.perf_impact option is set to medium, and I'm not too keen on raising it to high...
There do not seem to be any other errors on the system...
I'm just trying the community before opening a case, maybe someone has the golden key to this? 😉
/Heino
I believe this might be due to the fact that you are attempting to copy just a partition, can you try copying the whole disk instead?
::> disk partition replace -action start -partition 4.1.10 -replacement 1.10.4
I'm sorry maffo, but your suggested command doesn't make sense...
There are two different commands... "disk partition replace" and "disk replace"...
The disk partition replace command needs a partition input... 1.0.0.P1...
And using the "disk replace" would replace the whole disk with all three partitions which we do not want...
I'm currently waiting for Fujitsu/NetApp support to wake up 😉
/Heino
Apologies, that was a copy-and-paste mistake; I meant to suggest "disk replace" to replace the whole disk.
If you want to PM me the serial number of the AFF8080s I can try having a look