Ask The Experts
Hi Team,
We are getting a continuous alert on a NetApp 7-Mode array: (Active/Active Controller Error ev1netapp12/FAS6290/)
Global Status: NonCritical; There are not enough spare disks. /aggr0/plex0/rg0: Please add spare disks to any pool supporting block checksums with minimum size: Used 560000/1146880000 (MB/blks)
ev1netapp12> aggr status -s
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block checksum
spare 12d.00.21 12d 0 21 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 12d.00.23 12d 0 23 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 1c.00.20 1c 0 20 SA:A 0 SSD N/A 190532/390209536 190782/390721968
spare 1c.00.22 1c 0 22 SA:A 0 SSD N/A 190532/390209536 190782/390721968
ev1netapp12> aggr status -f
Broken disks (empty)
All disks are in the assigned state. 4-5 days back, the sysconfig -a output was reporting 2 disks in a failed state.
Slot 1:
15.8 : NETAPP X412_S15K7560A15 NA08 560.0GB 520B/sect (Failed)
Slot 11:
15.8 : NETAPP X412_S15K7560A15 NA08 560.0GB 520B/sect (Failed)
Yesterday's sysconfig -a output did not show any failed disks.
Today, 1 disk has been replaced for Slot 11 but not for Slot 1 yet, and we are still receiving the above alert (There are not enough spare disks). Seeking your assistance.
Regards,
Dnyaneshwar
Hi,
Are you sure you have matching spare disks for the complaining RAID group?
'ev1netapp12> aggr status -s' is only reporting 190GB SSDs; there are probably no spare disks matching the disk type used in /aggr0/plex0/rg0.
To get a clearer idea, send this output:
filer> sysconfig -r
Thanks for the 'sysconfig -r' output. I will review it and come back to you shortly.
Hi,
Cause: There are no matching spares for the RAID group on filer ev1netapp12 [it's your root aggregate].
Filer: ev1netapp12
Aggregate aggr0 (online, raid_dp) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 1a.13.0 1a 13 0 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
parity 1a.12.14 1a 12 14 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 1a.10.0 1a 10 0 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
Pool0 spare disks [None of these are applicable]
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block checksum
spare 12d.00.21 12d 0 21 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 12d.00.23 12d 0 23 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 1c.00.20 1c 0 20 SA:A 0 SSD N/A 190532/390209536 190782/390721968
spare 1c.00.22 1c 0 22 SA:A 0 SSD N/A 190532/390209536 190782/390721968
However, on the other node, ev1netapp11:
Aggregate aggr0 (online, raid_dp) (block checksums)
Plex /aggr0/plex0 (online, normal, active)
RAID group /aggr0/plex0/rg0 (normal, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 11d.34.0 11d 34 0 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
parity 1d.32.0 1d 32 0 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
data 1d.31.0 1d 31 0 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
Pool0 spare disks [There are matching spares, hence it is not complaining]
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block checksum
spare 1d.32.14 1d 32 14 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688 (not zeroed)
spare 1d.31.22 1d 31 22 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096 (not zeroed)
spare 12a.20.21 12a 20 21 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 12a.20.23 12a 20 23 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 1b.20.20 1b 20 20 SA:A 0 SSD N/A 190532/390209536 190782/390721968
spare 1b.20.22 1b 20 22 SA:A 0 SSD N/A 190532/390209536 190782/390721968
Suggestion:
You have two matching spare disks attached to node ev1netapp11:
spare 1d.32.14 1d 32 14 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688 (not zeroed)
spare 1d.31.22 1d 31 22 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096 (not zeroed)
If you wish to get rid of that error, you could assign one of the two disks above to node ev1netapp12.
You can use the following commands to unown the disk, for example.
Before that, make sure auto-assign is off:
filer> options disk.auto_assign
If it's on, set it to 'off', then run the following commands:
ev1netapp11> disk assign 1d.32.14 -s unowned -f
Then, on the other node:
ev1netapp12> disk assign 1d.32.14
Once this disk appears under the spares for ev1netapp12, it should stop complaining.
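Once both commands have run, a quick way to confirm the move (a sketch using commands already shown in this thread; the disk name is the one picked above):
ev1netapp12> disk show -v      (the disk should now list ev1netapp12 as its owner)
ev1netapp12> aggr status -s    (and it should appear under the Pool0 spare disks)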
Hi Team,
I am executing the below commands and getting the below messages.
ev1netapp11> options disk.auto_assign off
You are changing option disk.auto_assign, which applies to
both members of the HA configuration in takeover mode.
This value must be the same on both HA members to ensure correct
takeover and giveback operation.
Fri Oct 4 08:33:46 PDT [ev1netapp11:reg.options.cf.change:warning]: Option disk.auto_assign changed on one cluster node.
ev1netapp11> options disk.auto_assign
disk.auto_assign off (value might be overwritten in takeover)
Could you please provide the detailed plan, as I am new to this system.
Thanks in advance.
Regards,
Dnyaneshwar
Those are absolutely normal messages. Because it's an HA pair, the setting should be the same on both nodes. Nothing to worry about.
Once the disk assign task is completed, you can set the option back to 'on' on both nodes.
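For example (a sketch; run it on each node once the reassignment is done):
ev1netapp11> options disk.auto_assign on
ev1netapp12> options disk.auto_assign on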
You mentioned that the option should be set back to 'on' only after the disk assign task completes.
We faced the below issue:
ev1netapp12> disk assign 1d.32.14
Fri Oct 4 08:49:23 PDT [ev1netapp12:diskown.changingOwner:info]: changing ownership for disk 11d.32.14 (S/N 6SL5V9KL0000N329345F) from unowned (ID 4294967295) to ev1netapp12 (ID 1874380489)
ev1netapp12> Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.recons.missing:notice]: RAID group /hybrid_aggr_db2/plex0/rg5 is missing 1 disk(s).
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.recons.info:notice]: Spare disk 11d.32.14 will be used to reconstruct one missing disk in RAID group /hybrid_aggr_db2/plex0/rg5.
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.recons.start:notice]: /hybrid_aggr_db2/plex0/rg5: starting reconstruction, using disk 11d.32.14
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg0
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg1
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg2
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg3
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg4
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg6
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /hybrid_aggr_db2/plex0/rg7
Fri Oct 4 08:49:40 PDT [ev1netapp12:raid.rg.spares.low:warning]: /aggr0/plex0/rg0
Fri Oct 4 08:49:40 PDT [ev1netapp12:callhome.spares.low:error]: Call home for SPARES_LOW
Hi Team,
Please reply ASAP.
I can see the spare disk was removed from ev1netapp11, but it has not been assigned to ev1netapp12 as a spare.
Current status:
ev1netapp11> aggr status -s
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block checksum
spare 1d.31.22 1d 31 22 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096 (not zeroed)
spare 12a.20.21 12a 20 21 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 12a.20.23 12a 20 23 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 1b.20.20 1b 20 20 SA:A 0 SSD N/A 190532/390209536 190782/390721968
spare 1b.20.22 1b 20 22 SA:A 0 SSD N/A 190532/390209536 190782/390721968
ev1netapp11> Fri Oct 4 09:00:53 PDT [ev1netapp11:wafl.scan.ownblocks.done:info]: Completed block ownership calculation on aggregate aggr0. The scanner took 186 ms.
ev1netapp12> aggr status -s
Pool1 spare disks (empty)
Pool0 spare disks
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
Spare disks for block checksum
spare 12d.00.21 12d 0 21 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 12d.00.23 12d 0 23 SA:B 0 SSD N/A 190532/390209536 190782/390721968
spare 1c.00.20 1c 0 20 SA:A 0 SSD N/A 190532/390209536 190782/390721968
spare 1c.00.22 1c 0 22 SA:A 0 SSD N/A 190532/390209536 190782/390721968
Please suggest what we should do.
If you are new to NetApp, I suggest you call support; they will walk you through the steps.
It's important to understand how spares work.
If a spare is picked for reconstruction, that means there is a failed disk; a matching spare is picked automatically. Once the failed disk is replaced, it will appear as a spare, unless something else happens. The quick checks below show how to verify this state.
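A sketch of those checks, using commands already discussed in this thread:
filer> vol status -f      (lists any failed disks)
filer> aggr status -r     (shows each RAID group and any reconstruction in progress)
filer> aggr status -s     (shows what is currently available as a spare)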
I am out now, will see your post later.
Hi Sir,
Thanks for the reply.
Reconstruction is in progress in the RAID group, using the reassigned disk:
RAID group /hybrid_aggr_db2/plex0/rg5 (reconstruction 27% completed, block checksums)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
dparity 1a.11.16 1a 11 16 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
parity 1a.13.14 1a 13 14 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.15.0 11a 15 0 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 1a.10.15 1a 10 15 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.13.17 11a 13 17 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.16.14 11a 16 14 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.14.15 11a 14 15 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 1a.11.17 1a 11 17 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
data 11a.14.11 11a 14 11 SA:B 0 SAS 15000 560000/1146880000 560208/1147307688
data 11a.14.23 11a 14 23 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 1a.10.16 1a 10 16 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 1a.12.16 1a 12 16 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.16.15 11a 16 15 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.14.16 11a 14 16 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
data 1a.11.18 1a 11 18 SA:A 0 SAS 15000 560000/1146880000 560208/1147307688
data 1a.13.16 1a 13 16 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 11a.15.16 11a 15 16 SA:B 0 SAS 15000 560000/1146880000 560208/1147307688
data 1a.10.17 1a 10 17 SA:A 0 SAS 15000 560000/1146880000 560879/1148681096
data 11d.32.14 11d 32 14 SA:B 0 SAS 15000 560000/1146880000 560208/1147307688 (reconstruction 27% completed)
data 11a.16.16 11a 16 16 SA:B 0 SAS 15000 560000/1146880000 560879/1148681096
Sir,
We have tried to locate the failed disks, but aggr status -f is showing empty output. There may be some missing/bad/failed disks, but we need to identify them.
We have already ordered 2 new disks with the matching (SAS 15000 560000/1146880000 560208/1147307688) configuration, and we have those fresh disks handy.
Please let us know how to proceed.
Regards,
Dnyaneshwar
Dnyaneshwar
Thanks for the update. I can see that the disk was picked for reconstruction, so that's OK and normal.
Have you got 2 spare disks handy, but are unable to identify the failed one?
Please send:
filer> aggr status -r
filer> vol status -f
filer> disk show -v
Can you ask someone to physically inspect the shelves and look for any failed disk?
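If nothing is obvious visually, one way to point the HW team at a specific drive is to blink its locate LED from the advanced privilege level (only a sketch; the disk name is just an example and the exact command can vary by release, so treat this as an assumption and check your documentation):
filer> priv set advanced
filer*> led_on 11d.32.14     (light the LED on the suspect disk)
filer*> led_off 11d.32.14    (turn it off again when done)
filer*> priv set admin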
The HW admin checked and didn't find an amber light on any of the disks.
How can we add a newly arrived disk to the pool so that it acts as a spare and the alert (Not enough spare disks) stops coming?
Thanks and Regards,
Dnyaneshwar
Hi,
If there are no faults and all the disks are:
1) SOLID GREEN or BLINKING GREEN: if the HW engineer has confirmed this state, it means all the disks in all the shelves are fine.
2) You can only insert the new spare disks if a bay is empty. If every bay is populated and all the disks are green, then you cannot insert those disks; there are no spares left on that node, everything has been consumed. (A cross-check is sketched after the command requests below.)
Also give this output from both nodes:
filer> fcstat device_map
filer> storage show hub
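As a cross-check on whether every bay is really populated, you could also compare the disks reported per shelf (a sketch using standard 7-Mode commands):
filer> sysconfig -a           (lists each shelf and the drives it sees)
filer> storage show disk -p   (lists every disk with its paths, one line per disk)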
Hi Sir,
Apologies for the delay in replying.
They are saying all the lights are green, but we can't fully rely on them because these guys only handle HW replacement.
The HW team also told us that there is no empty bay. So where should we place these new drives?
ev1netapp12> fcstat device_map
Cannot complete operation on channel 0a; link is DOWN
Cannot complete operation on channel 0b; link is DOWN
Cannot complete operation on channel 0c; link is DOWN
Cannot complete operation on channel 0d; link is DOWN
ev1netapp12> storage show hub
No hubs found.
ev1netapp12>
DFM manages these HA nodes.
Please find the DFM output for these 2 nodes; the ev1netapp12 status is Error.
[root@dfm ~]# dfm report storage-systems
Warning: Use of this command for listing and viewing reports has been
deprecated by 'dfm report list' and 'dfm report view' commands respectively.
Object ID Type Status Storage System Model Serial Number System ID
--------- ------------------------ ------ --------------------------------- ------- ------------- ----------
251215 Active/Active Controller Normal ev1netapp11.ev1.yellowpages.com FAS6290 XXXXXXXXXXXX YYYYYYYYYY
251214 Active/Active Controller Error ev1netapp12.ev1.yellowpages.com FAS6290 WWWWWWWWWWWW ZZZZZZZZZZ
Hi Sir,
Could you please give us an update on this issue?
Thanks in advance.
Regards,
Dnyaneshwar
Hi,
Please give us this output from both nodes:
filer> environment status
filer> options raid.min_spare_count
Please note: if all the bays are populated and there are no failed disks [vol status -f], that means there are no spares dedicated to the node for that disk type; it simply indicates poor planning when the system was set up.
If that's the case, there are only two options:
1) Add another shelf; if you cannot, then:
2) Set the option raid.min_spare_count to '0' [zero]. This will stop the warning 'monitor.globalStatus.nonCritical: There are not enough spare disks', but it is not ideal and not recommended: it means that when a disk fails, your RAID group stays in a degraded state until the failed disk is replaced (praying no other disk fails). A sketch of the change follows below.
It's your call.
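If you do go that route, the change itself is small (a sketch; set it on both nodes, and remember it only hides a genuine low-spares condition):
filer> options raid.min_spare_count      (check the current value; the default is 1)
filer> options raid.min_spare_count 0    (suppress the 'not enough spare disks' warning)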
Hi Sir,
2 days back, one disk failure happened on ev1netapp11. The HW team replaced the failed disk with a new one, and the same warning message (Not enough spare disks) is now coming for ev1netapp11 as well.
Currently, "Not enough spare disks" warning messages are coming for both nodes, ev1netapp11 and ev1netapp12.
Please find the attached output for environment status.
Regards,
Dnyaneshwar
Hi,
I checked both outputs. There is nothing wrong with your shelves; everything is working fine. All the shelves are fully loaded, so there is no scope for you to add more disks to them.
Comments: There are no spares in your filers, and as long as this option is set to '1', they will continue to report the warning:
filer> options raid.min_spare_count 1
Options:
1) Set raid.min_spare_count to '0' (as sketched earlier).
2) Decommission an unused aggregate to free up some spares (see the sketch below).
3) Add an additional shelf.
You have to take this decision.
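For option 2, the general idea looks like this (only a sketch; 'old_aggr' is a placeholder name, the aggregate must contain no volumes and be offline, and destroying it wipes everything on it):
filer> aggr offline old_aggr    (take the unused aggregate offline first)
filer> aggr destroy old_aggr    (destroys the aggregate; its disks return to the spare pool)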