Solved: Performance Diagnosis in Performance Advisor didn't show failed disk

muhammad_i_pasha · ‎2010-06-07

I ran the following test of the "Diagnosis" option in Performance Advisor and found that a manually failed disk did not appear in the diagnosis.

I manually failed a disk on the filer to degrade its performance, then started copying data to a volume that the disk was part of. Latency increased considerably. I then ran a PA diagnosis on the filer for the period during which the disk was failed and the data was being copied. The diagnosis showed neither a failed disk nor a reconstructing disk.

stx601na08> disk fail 0b.29
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
Disk /test_aggr1/plex0/rg0/0b.29
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
data 0b.29 0b 1 13 FC:B - ATA 7200 423111/866531584 423889/868126304
***
Really prefail disk 0b.29? y
disk fail: The following disk was prefailed: 0b.29
Disk 0b.29 has been prefailed. Its contents will be copied to a
replacement disk, and the prefailed disk will be failed out.

stx601na08> sysconfig -r
Aggregate test_aggr1 (online, raid4) (block checksums)
Plex /test_aggr1/plex0 (online, normal, active)
RAID group /test_aggr1/plex0/rg0 (normal)
RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
--------- ------ ------------- ---- ---- ---- ----- -------------- --------------
parity 0a.29 0a 1 13 FC:A - ATA 7200 423111/866531584 423889/868126304
data 0b.29 0b 1 13 FC:B - ATA 7200 423111/866531584 423889/868126304 (prefail, copy in progress)
-> copy 0c.112 0c 7 0 FC:B - ATA 7200 423111/866531584 635858/1302238304 (copy 0% completed)

niels · ‎2010-06-07

Hi Muhammad,

I suspect that the way you failed the disk does not trigger any events.

With "disk fail <disk>" you do not fail the disk in a way that causes a reconstruct.

As you can see from the command output, the disk is "pre-failed".

filer> disk fail 0a.18
*** You are about to prefail the following file system disk, ***
*** which will eventually result in it being failed ***
Disk /aggr0/plex0/rg0/0a.18

      RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)    Phys (MB/blks)
      --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
      dparity   0a.18   0a    1   2   FC:A   - FCAL 10000 136000/278528000 137422/281442144
***
Really prefail disk 0a.18?

In this state ONTAP just tries to copy all readable data from this disk to a spare disk.

Also your output from sysconfig -r shows a copy operation, not a reconstruct.

If you really want to force a reconstruct of a raid group, use the command "disk fail -i <disk>".

This will fail the disk immediately. No data copy is involved; the disk is completely reconstructed onto a new spare disk.

Compare the following output to the one above carefully.

filer> disk fail -i 0a.18
*** You are about to fail the following file system disk ***
Disk /aggr0/plex0/rg0/0a.18

      RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks)    Phys (MB/blks)
      --------- ------ ------------- ---- ---- ---- ----- --------------    --------------
      dparity   0a.18   0a    1   2   FC:A   - FCAL 10000 136000/278528000 137422/281442144
***
Really fail disk 0a.18?

Now PerfAdvisor should show the reconstruct event.
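
To confirm which state the RAID group is in before rerunning the diagnosis, the status annotations in the sysconfig -r output can be checked mechanically. The following is only a minimal sketch, not a NetApp tool: the function name classify_disks is hypothetical, the parsing is based solely on the sample rows posted in this thread, and the "reconstruct" keyword is an assumption, since no reconstruct output appears here.

```python
import re

# Trailing "(...)" annotation on a disk row, e.g. "(copy 0% completed)".
STATUS_RE = re.compile(r"\(([^()]*)\)\s*$")

# First-column values that mark a disk row in the samples from this thread.
DISK_ROLES = {"data", "parity", "dparity", "copy"}

# Disk rows copied from the sysconfig -r output posted above.
SAMPLE = """\
parity 0a.29 0a 1 13 FC:A - ATA 7200 423111/866531584 423889/868126304
data 0b.29 0b 1 13 FC:B - ATA 7200 423111/866531584 423889/868126304 (prefail, copy in progress)
-> copy 0c.112 0c 7 0 FC:B - ATA 7200 423111/866531584 635858/1302238304 (copy 0% completed)
"""


def classify_disks(sysconfig_output):
    """Return (device, state) pairs for disk rows that carry a status annotation."""
    results = []
    for raw in sysconfig_output.splitlines():
        tokens = raw.split()
        if tokens and tokens[0] == "->":  # copy-target rows start with "->"
            tokens = tokens[1:]
        if len(tokens) < 2 or tokens[0] not in DISK_ROLES:
            continue  # headers, aggregate/plex/RAID group lines, blanks
        match = STATUS_RE.search(raw.strip())
        if not match:
            continue  # disk row without any annotation
        status = match.group(1).lower()
        if "copy" in status:
            state = "copy"
        elif "reconstruct" in status:  # ASSUMPTION: wording not shown in this thread
            state = "reconstruct"
        else:
            state = "other"
        results.append((tokens[1], state))
    return results


if __name__ == "__main__":
    print(classify_disks(SAMPLE))  # [('0b.29', 'copy'), ('0c.112', 'copy')]
```

If every annotated disk comes back as "copy", the disk was only prefailed and no reconstruct is running, which matches the situation in the original post.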

regards, Niels


rbalaji · ‎2010-06-07

Please note that 'Performance Diagnosis' is performed against the last configuration available to DFM.

The rule 'Disk Reconstruction in Progress' depends on the configuration of the disks in the RAID group.

The RAID configuration is updated when diskmon runs (every 4 hours by default). Hence you would not have seen the expected message, because the diagnosis was not performed against the latest configuration.

It is recommended to set the DFM server as the SNMP trap receiver. The storage system generates a 'disk:failed' trap; when the DFM server receives it, it invokes diskmon, which updates the (RAID) configuration.

harish · ‎2010-06-07

The disk monitor running as part of Operations Manager generates the disk failed event, and the disk monitoring interval is 4 hours by default. It looks like the disk monitor did not run in your case, so Operations Manager did not generate the disk failed event, and that's why PA diagnosis did not show it.

You can force the disk monitor (and all other monitors) to run on your filer with the following command:

dfm host discover <host-name-or-id>

and then see if the disk failed event gets generated by Operations Manager and shows up in PA too.
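
If the refresh needs to be scripted, for example right before a diagnosis run, the command above can be wrapped as in the sketch below. It assumes the dfm CLI is on the PATH of the machine running the script. The helper names build_discover_command and refresh_host are hypothetical, and only the "dfm host discover" invocation quoted above is used.

```python
import shutil
import subprocess


def build_discover_command(host):
    """Build the argv for 'dfm host discover <host-name-or-id>'."""
    host = str(host).strip()
    if not host or host.startswith("-"):
        raise ValueError("host must be a non-empty host name or id")
    return ["dfm", "host", "discover", host]


def refresh_host(host, dry_run=True):
    """Return the command; also run it when dry_run is False and dfm is installed."""
    command = build_discover_command(host)
    if dry_run or shutil.which("dfm") is None:
        return command  # nothing is executed in this case
    subprocess.run(command, check=True)  # raises CalledProcessError on failure
    return command


if __name__ == "__main__":
    # Dry run: only prints the argv that would be executed.
    print(refresh_host("stx601na08"))
```

The dry-run default is a deliberate choice so the sketch can be tried on a machine without Operations Manager installed.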

Regards

Harish

niels · ‎2010-06-07