shelf status critical condition

drp_sar_sys
46,657 Views

Hi everyone,

I work for France Telecom and have been administering about six NetApp filers for the last 8 months. Recently, /etc/messages has started showing alarms of this type:

[windata8602:monitor.shelf.fault:CRITICAL]: Fault reported on disk storage shelf attached to channel 0b. Please check fans, power, and temperature.

When I grep for "status" in the output of the "environment status shelf" command, I get this for some shelves:

        Shelf status: information condition
        Shelf status: critical condition

However, the output of "environment status chassis all" seems OK:

Temperature ok
PSU 1 ok
PSU 2 ok
Voltage ok
SYS FAN ok
NVRAM6-temperature-3 ok
NVRAM6-battery-3 ok

Am I fretting for nothing?


Thanks a lot

Dan

danielpr

Hi Dan,

You can check the global status via FilerView. If it still shows Critical, check this KB article:

https://now.netapp.com/Knowledgebase/solutionarea.asp?id=kb23174

Thanks

Daniel

matthewt

The command "environment status chassis" only looks at the storage appliance itself (its onboard power supplies, fans, etc.).

It looks like you have 2 shelves that have some type of issue. These errors will often cause a case to be generated, but the standard response for those cases is to email the contact and ask them to inspect the shelf or module and make sure the condition is real. If the customer does not see the email, or does not respond, the case will automatically close. A power problem on the customer side can cause these conditions, so it's not appropriate to always ship a part until power fluctuations or maintenance are ruled out.

You need to go through all the lines of the "environment status shelf" output for those shelves on loop 0b to see where the issue really is. That output is roughly 30 lines long for each shelf. There are several classes of errors that can produce that output. It is important to address issues if you see them.

1.) You may not have redundant power to the shelf (and not having both PSUs running can have implications for how the disks respond in high-load situations).

2.) You may not be cooling effectively.

3.) Firmware won't download to a shelf with an error condition.

Here is an example, on which I have highlighted the fields you should watch for: 

Channel: 1d
        Shelf: 3
        SES device path: local access: 1d.49
        Module type: ESH4; monitoring is active
        Shelf status: normal condition
        SES Configuration, via loop id 49 in shelf 3:
         logical identifier=0x50050cc0021082c9
         vendor identification=XYRATEX
         product identification=DS14-Mk4-FC
         product revision level=1313
        Vendor-specific information:
         Product Serial Number: OPS6773421082C9
         Optional Settings: 0x00
        Status reads attempted: 222163; failed: 0
        Control writes attempted: 7415; failed: 0
        Shelf bays with disk devices installed:
          13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
          with error: none
        Power Supply installed element list: 1, 2; with error: none
        Power Supply information by element:
          [1] Serial number: PMC643620239587
              Type: <N/A>
              Firmware version: <N/A>
          [2] Serial number: PMC643620239618
              Type: <N/A>
              Firmware version: <N/A>
        Power control element status: ok
        Cooling Unit installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none

        Shelf temperatures by element:
          [1] 26 C (78 F) (ambient)  Normal temperature range
          [2] 31 C (87 F)  Normal temperature range
          [3] 34 C (93 F)  Normal temperature range

        Temperature thresholds by element:
          [1] High critical: 50 C (122 F); high warning 40 C (104 F)
              Low critical:  0C (32 F); low warning 10 C (50 F)
          [2] High critical: 63 C (145 F); high warning 53 C (127 F)
              Low critical:  0C (32 F); low warning 10 C (50 F)
          [3] High critical: 63 C (145 F); high warning 53 C (127 F)
              Low critical:  0C (32 F); low warning 10 C (50 F)
        ES Electronics installed element list: 1, 2; with error: none
        ES Electronics reporting element: 1

        ES Electronics information by element:
          [1] Serial number: IMS69813312CAFF
              CPLD version: <N/A>
          [2] Serial number: IMS698133130DC8
              CPLD version: <N/A>
        Embedded Switching Hub installed element list: 1, 2; with error: none

        Shelf mapping (shelf-assigned addresses) for channel 1d:
          Shelf 1:  29  28  27  26  25  24  23  22  21  20  19  18  17  16
          Shelf 2:  45  44  43  42  41  40  39  38  37  36  35  34  33  32
          Shelf 3:  61  60  59  58  57  56  55  54  53  52  51  50  49  48
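
To make the cooling check concrete, here is a minimal Python sketch that compares each temperature reading with its high-warning threshold. It assumes the output of one shelf has been saved as text (for example, captured over rsh or ssh from an admin host) and that the line layout matches the example above. The function name `temperature_headroom` is made up for illustration and is not part of any NetApp tool.

```python
import re

# Lines copied from the example shelf output above (one shelf only).
SAMPLE = """\
        Shelf temperatures by element:
          [1] 26 C (78 F) (ambient)  Normal temperature range
          [2] 31 C (87 F)  Normal temperature range
          [3] 34 C (93 F)  Normal temperature range

        Temperature thresholds by element:
          [1] High critical: 50 C (122 F); high warning 40 C (104 F)
              Low critical:  0C (32 F); low warning 10 C (50 F)
          [2] High critical: 63 C (145 F); high warning 53 C (127 F)
              Low critical:  0C (32 F); low warning 10 C (50 F)
          [3] High critical: 63 C (145 F); high warning 53 C (127 F)
              Low critical:  0C (32 F); low warning 10 C (50 F)
"""

# "[n] 26 C ..." is a reading; "[n] High critical: ..." is a threshold line.
READING = re.compile(r"^\s*\[(\d+)\]\s+(-?\d+)\s*C\b")
HIGH = re.compile(
    r"^\s*\[(\d+)\]\s+High critical:\s*(-?\d+)\s*C.*high warning\s+(-?\d+)\s*C"
)


def temperature_headroom(shelf_text):
    """Return {element: degrees C left before the high-warning threshold}.

    Assumption: shelf_text holds the output for a single shelf; with
    several shelves in one string, later readings would overwrite earlier ones.
    """
    readings, warnings = {}, {}
    for line in shelf_text.splitlines():
        m = HIGH.match(line)
        if m:
            warnings[int(m.group(1))] = int(m.group(3))
            continue
        m = READING.match(line)
        if m:
            readings[int(m.group(1))] = int(m.group(2))
    return {el: warnings[el] - t for el, t in readings.items() if el in warnings}


print(temperature_headroom(SAMPLE))
```

A small or negative headroom on any element would point at the cooling case (item 2), while comfortable margins on every element suggest looking at power instead.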

netappnasadmin

I agree with the three conditions Matthew stated:

1.) You may not have redundant power to the shelf (and not having both PSUs running can have implications for how the disks respond in high-load situations).

2.) You may not be cooling effectively.

3.) Firmware won't download to a shelf with an error condition.

Because of the three reasons above, you get the error messages regarding the shelf, drp.

Thanks

adaikkap

Are your filers running ONTAP 7.3.2?

Regards

adai

bjornkoopmans

I had the exact same issue. It was easily resolved by upgrading ONTAP to the latest 7.3.2. Apparently this is a known bug.

adaikkap

Please refer to Bugs Online for the fixed version of Data ONTAP and the workaround.

http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=383376

Regards

adai

drp_sar_sys

Hi guys, and thanks a lot for your quick answers. My apologies for responding so late, but I've been on vacation...

To respond to some of you:

One filer that reports the aforementioned errors runs this version of ONTAP:

NetApp Release 7.0.5: Wed Aug  9 00:27:38 PDT 2006

The other:

NetApp Release 7.0.5: Wed Aug  9 00:27:38 PDT 2006

Despite being THAT old, they work flawlessly, but if it's a known bug that generates these messages, then I suppose it won't be a problem to upgrade to the latest stable version. Concerning what Matthew said, I don't see any particular errors when sifting through the output (except for what I've already noted):

environment status | grep -i "status" | grep -v normal
PSU Status ok
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Shelf status: critical condition
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574360; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Shelf status: critical condition
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
        Status reads attempted: 3574361; failed: 0
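
A filter on "status" alone drops the "Channel:" and "Shelf:" lines, so the critical entries above cannot be tied to a particular shelf. A minimal Python sketch that keeps that context is shown here. It assumes the unfiltered "environment status shelf" output has been saved as text and that each shelf block carries "Shelf:" and "Shelf status:" lines as in the example posted earlier in the thread. The sample data after the first block (shelf numbers on channel 0b) is invented for illustration, and `abnormal_shelves` is a hypothetical helper name.

```python
import re

# The first block is from the example earlier in the thread; the channel 0b
# entries are hypothetical and only show what a non-normal shelf could look like.
SAMPLE = """\
Channel: 1d
        Shelf: 3
        SES device path: local access: 1d.49
        Module type: ESH4; monitoring is active
        Shelf status: normal condition
Channel: 0b
        Shelf: 1
        Shelf status: critical condition
        Shelf: 2
        Shelf status: information condition
"""


def abnormal_shelves(text):
    """Return (channel, shelf, status) for every shelf that is not 'normal condition'."""
    channel = shelf = None
    found = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("Channel:"):
            channel = line.split(":", 1)[1].strip()
        elif re.fullmatch(r"Shelf:\s*\d+", line):
            # Matches "Shelf: 3" but not "Shelf status: ..." or "Shelf 1:  29 28 ..."
            shelf = int(line.split(":", 1)[1])
        elif line.startswith("Shelf status:"):
            status = line.split(":", 1)[1].strip()
            if status != "normal condition":
                found.append((channel, shelf, status))
    return found


for channel, shelf, status in abnormal_shelves(SAMPLE):
    print(f"channel {channel} shelf {shelf}: {status}")
```

The sketch remembers the last channel it saw, so it works whether or not the "Channel:" line is repeated for every shelf on the same loop.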

environment status | egrep -i "failed|err"
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 41; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 45; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 40; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 84; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 21; failed: 0
        Power Supply installed element list: 1, 2; with error: 2
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 16; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 25; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 20; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 17; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574415; failed: 0
        Control writes attempted: 13; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 12; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 12; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 24; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 67; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 95; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 207; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 18; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 17; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 23; failed: 0
        Power Supply installed element list: 1, 2; with error: 1
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 53; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 183; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 443; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 141; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 746; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
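
Because the pattern "failed|err" also matches every "with error: none" line, the entries that name a failing element are easy to overlook. The output above contains two of them: "Power Supply installed element list: 1, 2; with error: 2" and "... with error: 1". A minimal Python sketch that reports only those lines follows. It assumes the filtered output has been saved as text and that each shelf block starts with a "Status reads attempted" line, as it does in the listing above. The name `elements_with_errors` is made up for illustration.

```python
import re

# Lines copied from the filtered output above: one clean block, one with a PSU error.
SAMPLE = """\
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 45; failed: 0
        Power Supply installed element list: 1, 2; with error: none
        Cooling Element installed element list: 1, 2; with error: none
        Status reads attempted: 3574416; failed: 0
        Control writes attempted: 21; failed: 0
        Power Supply installed element list: 1, 2; with error: 2
        Cooling Element installed element list: 1, 2; with error: none
"""

ELEMENT_LIST = re.compile(
    r"^\s*(.+?) installed element list: .*; with error: (.+)$"
)


def elements_with_errors(filtered_text):
    """Return (block_number, element_type, failing_elements) for non-'none' entries.

    Assumption: every shelf block begins with a 'Status reads attempted' line,
    so counting those lines gives the position of the shelf in the output.
    """
    block = 0
    hits = []
    for line in filtered_text.splitlines():
        if line.lstrip().startswith("Status reads attempted:"):
            block += 1
            continue
        m = ELEMENT_LIST.match(line)
        if m and m.group(2).strip() != "none":
            hits.append((block, m.group(1), m.group(2).strip()))
    return hits


for block, kind, failing in elements_with_errors(SAMPLE):
    print(f"shelf block {block}: {kind} element(s) {failing} in error")
```

Counted this way, the two Power Supply entries in the posted output appear to sit in the 5th and 19th shelf blocks, which lines up with the positions of the two "Shelf status: critical condition" lines in the first listing.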
