Solved: Drives going bad

greg_epps · ‎2010-11-21

Greetings,

I've had four drives fail in three days on a FAS2020 disk shelf. That seems high even by the lowest standards. Does anyone know if I could be doing something to cause this? I haven't been unseating and reseating them or anything physical like that, but I have been practicing with the system configuration, which has involved resetting them to factory status and zeroing them quite a few times. It seems unlikely that would cause the drives to go bad. Could I be misinterpreting the signs: an orange light on the drive, and FilerView showing the drive as failed? Is it a physical failure, or could there be a way to recover the drive? At this point they appear to be physical failures and unrecoverable, but if anyone knows otherwise, it would be great to learn what you know.

Thanks.

ekashpureff · ‎2010-11-21

Is this an old eBay or NetApp graveyard kind of unit that may have been mistreated in the past?

You could run 'disk shm_stats' or look at the syslog messages from shm to get a clue why the drives are being failed.

You might decide to run 'disk maint' against these drives.

You may also try unfailing them, by going into advanced priv mode and doing a 'disk unfail'.
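
For reference, the suggestions above might look roughly like this at the console. This is a sketch only, assuming Data ONTAP 7-mode; the prompt "filer>" and the placeholder <disk_name> are hypothetical and need to be replaced with your own system's values, and the trailing "#" notes are annotations rather than part of the commands.

```
filer> vol status -f                     # list disks currently marked failed
filer> aggr status -r                    # RAID layout, including any reconstruction in progress
filer> disk shm_stats                    # storage health monitor statistics
filer> rdfile /etc/messages              # syslog, including the shm entries
filer> disk maint start -d <disk_name>   # run maintenance tests on a suspect disk
filer> priv set advanced
filer*> disk unfail <disk_name>          # <disk_name> is a placeholder
filer*> priv set admin
```

Dropping back to admin privilege afterwards is a precaution, since advanced mode exposes commands that can damage data if misused.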

I hope this response has been helpful to you.

At your service,

Eugene E. Kashpureff
ekashp@kashpureff.org
Fastlane NetApp Instructor and Independent Consultant
http://www.fastlaneus.com/ http://www.linkedin.com/in/eugenekashpureff

(P.S. I appreciate points for helpful or correct answers.)

greg_epps · ‎2010-11-21

Eugene,

Thanks so much for your help!

The systems are brand new. I'm currently working with 5 but eventually have to get nearly 80 up and running. They're operating in a dusty desert environment, obviously not the best place for any computer system, but we have many other servers operating in the same environment without a high rate of disk failure.

I haven't yet found any indication of what is causing the 'broken' and 'failed' notifications on the drives, but two of them seem to be back up. They're currently operating in slots 0 and 13. The one in slot 0 seemed to recover on its own, as its status changed from failed to 'dparity reconstructing' before I really did anything other than halt and boot. I used disk unfail on the other and now its status is normal as well. This is good news because I was beginning to worry about being able to keep enough spares on hand if drives were to continue failing at such a high rate.

I'm not really sure where to find the syslog messages you mentioned, other than the reports in FilerView, which don't reveal much information.

Thanks again,

Greg

greg_epps · ‎2010-11-22

Please disregard the question about the syslog; I found it. However, the one drive's status has been 'dparity reconstructing' for about 7 hours, even though there's no data on the drives. Is this normal?

Thanks,

Greg