2013-07-17 03:30 PM
Hello NetApp Communities!
We have been experiencing spikes in latency (1000ms+) at or around the same time we see the errors in the messages file regarding stripe writes and block reassignment:
Wed Jul 17 16:10:37 CDT [filerA:raid.tetris.media.recommend.reassign.err:i
Wed Jul 17 16:10:37 CDT [filerA:raid.rg.readerr.repair.data:debug]: Fixing bad data on Disk /aggr0/plex0/rg0/4a.02.17 Shelf 2 Bay 17 [NETAPP X412_HVIPC560A15 NA02] S/N [XXXXXX], block #3088720
Is it normal and expected behavior to experience a noticeable spike in latency when the filer recommends block reassignment during stripe writes? We saw the spike in latency across all protocols.
It makes sense to me that there could be some increase in latency, especially if this is a sign that the disk is failing. However, the spike we observed today (1,000ms+) is something we would like to avoid in the future. Normal latency hovers around 1ms. We are running ONTAP 8.1.2P4 7-Mode on a V3240.
2013-07-17 04:38 PM
Do you have a spare disk on this controller?
IMHO you have a disk issue on shelf 2 bay 17. You can remove this disk and let ONTAP allocate a spare disk to rebuild your RAID group.
Before doing this, check the resource utilization on your controller (processors, disks, etc.) and choose the best time for your business, so that the rebuild does not compete with your production workload.
All the best,
NetApp - Enjoy it!
2013-07-17 06:03 PM
We do have spares available. We typically let the filer fail the disk on its own rather than doing a proactive replacement. However, we have never seen such a large spike in latency due to a disk showing signs of failure.
In this case, I may pursue a proactive disk replacement, as I found a read error on the same disk a few days ago in the messages file. I'll be opening a case shortly, so hopefully NetApp support agrees that the disk needs to be replaced.
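
For anyone wanting to check whether the same disk keeps turning up in the messages file, here is a minimal Python sketch that groups read-error repair events by disk. It assumes every line follows the exact layout of the raid.rg.readerr.repair.data line quoted above; other ONTAP releases may format it differently. The function name repairs_per_disk and the idea of passing a local copy of the messages file on the command line are illustrative assumptions, not a NetApp tool.

```python
import re
import sys
from collections import defaultdict

# Pattern modeled on the single repair line quoted in this thread.
# Assumption: other repair events use the same field order.
REPAIR_RE = re.compile(
    r"^(?P<when>\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2}) \S+ "
    r"\[[^:\]]+:raid\.rg\.readerr\.repair\.data:\w+\]: "
    r".*?Disk (?P<disk>\S+) Shelf (?P<shelf>\d+) Bay (?P<bay>\d+)"
    r".*?block #(?P<block>\d+)"
)


def repairs_per_disk(lines):
    """Return {disk path: [(timestamp, block number), ...]} for repair events."""
    found = defaultdict(list)
    for line in lines:
        match = REPAIR_RE.search(line)
        if match:
            found[match.group("disk")].append(
                (match.group("when"), int(match.group("block")))
            )
    return dict(found)


if __name__ == "__main__":
    # Hypothetical usage: python repairs.py <local copy of the messages file>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for disk, events in sorted(repairs_per_disk(handle).items()):
            print(f"{disk}: {len(events)} repair event(s)")
            for when, block in events:
                print(f"  {when}  block #{block}")
```

A disk that shows up repeatedly over several days is a reasonable thing to mention when opening the support case. The script only counts what the log reports and says nothing about whether the disk should actually be failed.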