Performance Diagnosis in Performance Advisor didn't show failed disk

I tested the "Diagnosis" option of Performance Advisor and found that a manually failed disk did not appear in the diagnosis.

I manually failed a disk on the filer to degrade its performance, then started copying data to a volume that the disk belonged to. Latency increased considerably. I then ran a PA diagnosis on the filer for the period during which the disk was failed and the data was being copied. The diagnosis did not show a failed or reconstructing disk.

stx601na08> disk fail 0b.29
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
  Disk /test_aggr1/plex0/rg0/0b.29

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      data      0b.29   0b    1   13  FC:B   -  ATA   7200 423111/866531584  423889/868126304
***
Really prefail disk 0b.29? y
disk fail: The following disk was prefailed: 0b.29
Disk 0b.29 has been prefailed.  Its contents will be copied to a
replacement disk, and the prefailed disk will be failed out.

stx601na08> sysconfig -r
Aggregate test_aggr1 (online, raid4) (block checksums)
  Plex /test_aggr1/plex0 (online, normal, active)
    RAID group /test_aggr1/plex0/rg0 (normal)

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      parity    0a.29   0a    1   13  FC:A   -  ATA   7200 423111/866531584  423889/868126304
      data      0b.29   0b    1   13  FC:B   -  ATA   7200 423111/866531584  423889/868126304 (prefail, copy in progress)
      -> copy   0c.112  0c    7   0   FC:B   -  ATA   7200 423111/866531584  635858/1302238304 (copy 0% completed)

Re: Performance Diagnosis in Performance Advisor didn't show failed disk

Please note that 'Performance Diagnosis' is performed against the last configuration available to DFM.

The rule 'Disk Reconstruction in Progress' depends on the configuration of the disks in the RAID group.

The RAID configuration is updated when diskmon runs (every 4 hours by default). You did not get the expected error message because the diagnosis was not performed against the latest configuration.

It is recommended that the DFM server be set as the SNMP trap receiver. The storage system generates a 'disk:failed' trap; when the DFM server receives it, it invokes diskmon, which updates the (RAID) configuration.
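
The timing issue described above can be illustrated with a small sketch. It is only a model of the reasoning, not DFM code: the function names and the timeline are hypothetical, and the only value taken from this thread is the 4-hour default interval.

```python
from datetime import datetime, timedelta

# Default disk monitoring interval as stated in this thread.
DISKMON_INTERVAL = timedelta(hours=4)


def diagnosis_sees_event(event_time, last_monitor_run, diagnosis_time):
    """Return True if the monitor ran after the event and before the diagnosis.

    Only then does the stored RAID configuration used by the diagnosis
    reflect the event. (Hypothetical model, not a DFM API.)
    """
    return event_time <= last_monitor_run <= diagnosis_time


def next_scheduled_run(last_monitor_run, interval=DISKMON_INTERVAL):
    """Time of the next scheduled monitor run, absent a trap or manual run."""
    return last_monitor_run + interval


# Hypothetical timeline: monitor ran at 08:00, disk failed at 09:30,
# diagnosis was started at 10:00.
last_run = datetime(2010, 1, 1, 8, 0)
disk_failed = datetime(2010, 1, 1, 9, 30)
diagnosis = datetime(2010, 1, 1, 10, 0)

print(diagnosis_sees_event(disk_failed, last_run, diagnosis))  # False
print(next_scheduled_run(last_run))  # 2010-01-01 12:00:00
```

In this model, a trap-triggered monitor run shortly after the failure moves the last run time past the event, so a diagnosis started afterwards would see it.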

Re: Performance Diagnosis in Performance Advisor didn't show failed disk

The disk monitor, which runs as part of Operations Manager, generates the disk failed event, and the disk monitoring interval is 4 hours by default. It looks like the disk monitor did not run in your case, so Operations Manager did not generate the disk failed event, and that is why the PA diagnosis did not show it.

You can force the disk monitor (and all other monitors) to run on your filer with the following command:

dfm host discover <host-name-or-id>

Then check whether Operations Manager generates the disk failed event and whether it shows up in PA too.
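
If you want to script this step, a minimal sketch could look like the following. It assumes the dfm CLI is on the PATH of the machine where it runs; the function names and the dry_run parameter are my own and not part of any NetApp tooling. The only command used is the one given above.

```python
import subprocess


def build_discover_command(host):
    """Return the argument list for 'dfm host discover <host-name-or-id>'."""
    host = str(host).strip()
    if not host:
        raise ValueError("host name or ID must not be empty")
    return ["dfm", "host", "discover", host]


def run_discover(host, dry_run=True):
    """Run the discover command, or just return it as a string if dry_run.

    With dry_run=False this requires the dfm CLI to be installed and
    raises subprocess.CalledProcessError if the command fails.
    """
    cmd = build_discover_command(host)
    if dry_run:
        return " ".join(cmd)
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


# Dry run with the filer name from the original post.
print(run_discover("stx601na08"))  # dfm host discover stx601na08
```

The dry run is the default so the command can be reviewed before it is executed against a production Operations Manager server.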

Regards

Harish

Re: Performance Diagnosis in Performance Advisor didn't show failed disk

Hi Muhammad,

I suspect that the way you failed the disk does not trigger any events.

With "disk fail <disk>" you do not fail the disk in a way that causes a reconstruct.

As you can see from the command output, the disk is "pre-failed".

filer> disk fail 0a.18
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
  Disk /aggr0/plex0/rg0/0a.18

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   0a.18   0a    1   2   FC:A   -  FCAL 10000 136000/278528000  137422/281442144
***
Really prefail disk 0a.18?

In this state ONTAP just tries to copy all readable data from this disk to a spare disk.

Also your output from sysconfig -r shows a copy operation, not a reconstruct.

If you really want to force a reconstruct of a raid group, use the command "disk fail -i <disk>".

This fails the disk immediately. No data copy is involved; the disk is completely reconstructed onto a new spare disk.

Compare the following output to the one above carefully.

filer> disk fail -i 0a.18
*** You are about to fail the following file system disk ***
  Disk /aggr0/plex0/rg0/0a.18

      RAID Disk Device  HA  SHELF BAY CHAN Pool Type  RPM  Used (MB/blks)    Phys (MB/blks)
      --------- ------  ------------- ---- ---- ---- ----- --------------    --------------
      dparity   0a.18   0a    1   2   FC:A   -  FCAL 10000 136000/278528000  137422/281442144
***
Really fail disk 0a.18?

Now PerfAdvisor should show the reconstruct event.
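
To tell the two cases apart in saved sysconfig -r output, a rough sketch like the one below could classify each disk line by its trailing status annotation. The copy annotations are taken from the output earlier in this thread; the wording of the reconstruct annotation in the last sample line is an assumption on my part, so check it against your own output. The function name and labels are hypothetical.

```python
import re

# Matches a trailing parenthesised status annotation on a disk line,
# e.g. "(prefail, copy in progress)" or "(copy 0% completed)".
STATUS_RE = re.compile(r"\(([^()]*)\)\s*$")


def classify_disk_line(line):
    """Classify one disk line from sysconfig -r.

    Returns "copy" for a prefail/copy operation, "reconstruct" for a
    reconstruction, "normal" when there is no trailing annotation,
    and "other" for any annotation not recognised here.
    """
    match = STATUS_RE.search(line)
    if not match:
        return "normal"
    status = match.group(1).lower()
    if "prefail" in status or status.startswith("copy"):
        return "copy"
    if "reconstruct" in status:  # assumed wording, not shown in this thread
        return "reconstruct"
    return "other"


samples = [
    "parity    0a.29   0a    1   13  FC:A   -  ATA   7200 423111/866531584  423889/868126304",
    "data      0b.29   0b    1   13  FC:B   -  ATA   7200 423111/866531584  423889/868126304 (prefail, copy in progress)",
    "-> copy   0c.112  0c    7   0   FC:B   -  ATA   7200 423111/866531584  635858/1302238304 (copy 0% completed)",
    # Hypothetical line; the annotation text is assumed.
    "data      0b.29   0b    1   13  FC:B   -  ATA   7200 423111/866531584  423889/868126304 (reconstruction 5% completed)",
]

for sample in samples:
    print(classify_disk_line(sample))
```

Only pass disk lines to this function; the aggregate, plex, and RAID group header lines also end in parentheses and would be reported as "other" or misread.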

regards, Niels