Checksum error, bad data, WAFL inconsistent

KemDatacenter · 2013-05-15

Today I noticed some rather worrying messages on one of our filers, saying that there are four bad blocks on one of the volumes, that WAFL is inconsistent, and that a scrub is starting. Interestingly, I haven't received any messages from Unified Manager, nor do I see any errors on the volume and aggregate in question.

What concerns me is the absence of any message saying that WAFL has recovered the blocks from parity data. So the questions are:

1. Is the volume now corrupted or not?
2. Why hasn't the filer marked the disk drive as failed and started rebuilding to a spare drive?
3. What do I need to do to recover from the issue?

filer_name> Thu May 16 08:54:05 EST [filer_name: raid.cksum.wc.blkErr:EMERGENCY]: Checksum error due to wafl context mismatch on volume volume_name, Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF], block 31885141, buftree id 0, inode number 101, snapid 106, file block 45778970, level 0: checksum context has buftree id 137615, file block 76494408.

Thu May 16 08:54:05 EST [filer_name: raid.cksum.wc.blkErr:EMERGENCY]: Checksum error due to wafl context mismatch on volume volume_name, Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF], block 31885144, buftree id 0, inode number 101, snapid 106, file block 45778973, level 0: checksum context has buftree id 8351367, file block 213319903.

...

Thu May 16 08:54:05 EST [filer_name: raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF], block #31885141

...

Thu May 16 08:54:06 EST [filer_name: raid.multierr.bad.block:CRITICAL]: Marking on 'Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF]', block number 31885141, as bad block.

...

Thu May 16 08:54:06 EST [filer_name: wafl.incons.userdata.vol:error]: WAFL inconsistent: volume volume_name has a corrupted user data block. Note: Any new Snapshot copies might contain this inconsistency.

Thu May 16 08:54:06 EST [filer_name: wafl.raid.incons.userdata:error]: WAFL inconsistent: bad user data block 1208910165 (vvbn:76495103 fbn:45778970 level:0) in inode (fileid:101 snapid:106 file_type:15 disk_flags:0x8402) in volume volume_name.

...

Thu May 16 08:54:06 EST [filer_name: coredump.micro.completed:info]: Microcore (/etc/crash/micro-core.151702107.2013-05-15.22_54_06) generation completed

Thu May 16 08:54:15 EST [filer_name: raid.rg.scrub.start:notice]: /aggr0/plex0/rg1: starting scrub

...

Thu May 16 08:54:41 EST [filer_name: asup.smtp.sent:notice]: Cluster Notification mail sent: Cluster Notification from filer_name (WAFL INCONSISTENT) ERROR

Thu May 16 08:54:43 EST [filer_name: asup.smtp.sent.minicore:notice]: Core file 'micro-core.151702107.2013-05-15.22_54_06' sent to NetApp
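
For anyone sifting through a longer log of this kind, here is a minimal Python sketch that groups the reported bad disk blocks by disk. It assumes the messages look exactly like the lines above; the regular expressions and the function names `parse_line` and `blocks_by_disk` are my own illustration, not anything shipped with Data ONTAP.

```python
import re
from collections import defaultdict

# Patterns are assumptions based only on the message layout shown in this post.
HEADER_RE = re.compile(r"\[(?P<host>[^:\]]+): (?P<event>[\w.]+):(?P<severity>\w+)\]")
DISK_RE = re.compile(r"Disk (?P<disk>/\S+)")
# Skip "file block N" so that only the on-disk block number is picked up.
BLOCK_RE = re.compile(r"(?<!file )\bblock (?:number |#)?(?P<block>\d+)")


def parse_line(line):
    """Return event, severity, disk and disk block for one message, or None."""
    header = HEADER_RE.search(line)
    if header is None:
        return None
    disk = DISK_RE.search(line)
    # Only treat the block number as a disk block when a disk is named.
    block = BLOCK_RE.search(line) if disk else None
    return {
        "event": header.group("event"),
        "severity": header.group("severity"),
        "disk": disk.group("disk") if disk else None,
        "block": int(block.group("block")) if block else None,
    }


def blocks_by_disk(lines):
    """Map each disk path to the sorted list of block numbers reported on it."""
    found = defaultdict(set)
    for line in lines:
        rec = parse_line(line)
        if rec and rec["disk"] and rec["block"] is not None:
            found[rec["disk"]].add(rec["block"])
    return {disk: sorted(blocks) for disk, blocks in found.items()}


# Four of the messages quoted above, copied verbatim.
SAMPLE = [
    "Thu May 16 08:54:05 EST [filer_name: raid.data.lw.blkErr:CRITICAL]: Bad data detected on Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF], block #31885141",
    "Thu May 16 08:54:06 EST [filer_name: raid.multierr.bad.block:CRITICAL]: Marking on 'Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF]', block number 31885141, as bad block.",
    "Thu May 16 08:54:06 EST [filer_name: wafl.incons.userdata.vol:error]: WAFL inconsistent: volume volume_name has a corrupted user data block. Note: Any new Snapshot copies might contain this inconsistency.",
    "Thu May 16 08:54:15 EST [filer_name: raid.rg.scrub.start:notice]: /aggr0/plex0/rg1: starting scrub",
]

if __name__ == "__main__":
    for disk, blocks in blocks_by_disk(SAMPLE).items():
        print(disk, blocks)
```

Messages that name no disk (the WAFL and scrub lines) are parsed but deliberately left out of the per-disk summary.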

aborzenkov · 2013-05-16

1. It is possible that the corruption is confined to some snapshots and the active file system is OK. Only support can tell.

2. The error is a software one, not a hardware one, so there is no reason to mark the disk as bad. It may still have been caused by hardware; again, support could probably analyze it.

3. The first thing you need to do is open a case.

KemDatacenter · 2013-06-02

How can a bad block be a software issue? Do you mean a possible Data ONTAP bug?

cliffwilliams44 · 2013-07-01

Have you resolved this? We are dealing with this right now.

gavin_meadows · 2013-07-01

Contact support, as you may need to run wafliron, and you should only do so after consulting NetApp.

cliffwilliams44 · 2013-07-01

We have corrected the problem, then failed the drive and replaced it. All is good now.


Darkstar · 2013-07-04

No, this is indeed a hardware problem. It has to do with the drive firmware not reporting media errors in time (note that the drives in this post are on firmware NA00, while NA03 is the latest).

See http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=606576 for the bug.

There were some TSBs sent to partners last year that mentioned this problem (I don't remember the number right now).
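
To check which drives in a log are still on an older firmware revision, something like the following Python sketch could help. It assumes the bracketed drive descriptor always has the form `[NETAPP <model> <firmware>]` followed by `S/N [<serial>]`, as in the messages in this thread; the function names `drive_firmware` and `not_at_revision` are hypothetical, and the expected revision is whatever you pass in.

```python
import re

# Assumed layout, taken from the log lines in this thread:
#   [NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF]
DRIVE_RE = re.compile(r"\[NETAPP (?P<model>\S+) (?P<firmware>\S+)\]")
SERIAL_RE = re.compile(r"S/N \[(?P<serial>[^\]]+)\]")


def drive_firmware(lines):
    """Map each drive serial number to its (model, firmware) pair."""
    drives = {}
    for line in lines:
        drive = DRIVE_RE.search(line)
        serial = SERIAL_RE.search(line)
        if drive and serial:
            drives[serial.group("serial")] = (
                drive.group("model"),
                drive.group("firmware"),
            )
    return drives


def not_at_revision(drives, expected):
    """Return serial -> firmware for drives whose firmware differs from `expected`."""
    return {serial: fw for serial, (_, fw) in drives.items() if fw != expected}


# One message from the original post, copied verbatim.
LINE = (
    "Thu May 16 08:54:05 EST [filer_name: raid.data.lw.blkErr:CRITICAL]: "
    "Bad data detected on Disk /aggr0/plex0/rg0/1a.71 Shelf 4 Bay 7 "
    "[NETAPP X291_S15K7420F15 NA00] S/N [3SK1Z4PQ00009123NQHF], block #31885141"
)

if __name__ == "__main__":
    print(not_at_revision(drive_firmware([LINE]), expected="NA03"))
```

The comparison is a plain inequality rather than an ordering, since I would not assume that revision strings sort in release order.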

-Michae

colsen · 2015-07-28

This is an old post, but we were just informed of the following BURT, which seems to match what you ran into. No idea whether you resolved the error or the problem went away, but here's the bug:

http://mysupport.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=724468

Cheers,

Chris