2011-09-15 05:01 AM
I'm fairly new to NetApp and this is my first post on the forums, so I hope I have posted my question in the correct area and don't sound like too much of a noob.
Anyway, we recently had a two-node NetApp FAS2020 SAN installed, primarily as a repository for all the files in our DMS (Document Management System). However, I have had a few issues with file corruption.
Prior to the SAN being installed, we were using a Marathon EverRun HA solution on Windows Server 2003, with the DMS documents stored on direct-attached storage (DAS). This solution worked OK for around 3 years with no issues until we started to run out of space, hence the SAN.
The SAN has been connected to the Marathon environment via iSCSI, and a 1.89TB volume has been created. All files were copied from the DAS to the SAN, and everything appeared to be working fairly well. However, we ran into an issue where all the switches used for iSCSI rebooted (this was caused by a bug in the firmware installed on the switches). Because the iSCSI Initiator did not reconnect cleanly until the server was rebooted, we ended up with a number of corrupt files and had to run CHKDSK on the volume to resolve the problems. This appeared to complete successfully, and new files have since been written without problems.
A few weeks after this incident, the event log is now reporting (Source: Ntfs, Event ID: 55) that the disk structure is corrupt and that CHKDSK needs to be run, which is slightly concerning!
We have around 946GB of the 1.89TB in use, with approximately 22 million files stored on the volume.
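For a sense of scale, here is a rough sketch of what those figures imply for how full the volume is and the average file size. The function name `volume_stats` is made up for illustration, and it assumes the sizes above are reported in binary units (1 GB = 2**30 bytes), which may not match how the tools actually report them.

```python
# Figures quoted in the post; binary units are an assumption.
USED_GB = 946
VOLUME_TB = 1.89
FILE_COUNT = 22_000_000


def volume_stats(used_gb, volume_tb, file_count):
    """Return percent of the volume in use and the mean file size in KiB."""
    used_bytes = used_gb * 2**30
    volume_bytes = volume_tb * 2**40
    return {
        "percent_used": 100 * used_bytes / volume_bytes,
        "avg_file_kib": used_bytes / file_count / 1024,
    }


stats = volume_stats(USED_GB, VOLUME_TB, FILE_COUNT)
print(f"{stats['percent_used']:.1f}% used, ~{stats['avg_file_kib']:.0f} KiB per file")
```

So the volume is roughly half full and is made up of a very large number of small files (tens of KiB each on average).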
Does anyone have any suggestions on what the cause of this problem may be?
2011-09-29 10:08 AM
I'm not certain of the root cause of the file corruption, but I would like to offer a much faster and less disruptive way to locate the corrupt files than CHKDSK. One of our customers with a multi-terabyte share had a similar problem and, using our solution, examined the CIFS error messages to find the location of the corrupt files, which they then restored from backup. In this case the customer examined CIFS metrics, but similar metrics are available for NFS. For IP-based SANs, the ExtraHop system also provides health and performance metrics based on analysis of the iSCSI protocol.
I'd like to invite you to try the same technique using our free http://www.networktimeout.com tool, which demonstrates the capabilities of our ExtraHop system in an offline fashion. You'll need a packet capture of network activity to the SAN in question, preferably covering 5+ minutes. Upload that to http://www.networktimeout.com and you'll be able to examine error messages per device to find the location of the corrupt files.
You can also check out our blog post on this particular problem at http://www.extrahop.com/post/blog/performance-metrics/cifs-errors/
Hope that helps!
Technical Marketing Manager, ExtraHop Networks