Data Backup and Recovery
After we moved our information stores to NetApp and began using SnapManager for Exchange, some users started seeing Outlook timeout errors. The actual snapshot of the 100 GB Exchange database takes almost no time, but the verify takes approximately 20 minutes. The verify maxes out my iSCSI connection on the Exchange server, degrading performance for users. Is anybody else running into this? Do most people even run verifies? Is it possible to run the verify from a server that is not your Exchange server? Do most people do that?
Philip Arnason
Senior IS Lead
I'll answer my own question. Starting at the bottom of page 202 of the admin guide, there are four different ways to manage the verify. My favorite is the one that throttles the speed of the verify.
Philip Arnason
Senior IS Lead
We have a FAS3070 with 2 TB of Exchange 2003 'cluster' database in 16 stores. The verify takes just under 4 hours and it maxes out the filer head for CPU, disk IOPS and FCP. We use a second server for verifications because the load is too high on the production Exchange box. This also allows the verification server to be rebooted when the verification process fails (which happens at least once a month), with no effect on live Exchange services.
We can only run verification once a day because of the load it places on the storage.
While your verification run is going, have a look at your LUN stats.
lun stats -i 5
I have found that a latency above 20 ms means trouble is starting, and a queue above 2 for more than 2 or 3 refreshes is a very bad sign. If you have Operations Manager you can alert on these if you need to. I sit next to the Exchange team and just check when the phone rings....
I have a ton of stuff on SME and perfstat if you want to know more about it.
Sorry, I forgot to add the throttling stuff.
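That rule of thumb can be written down as a small check. This is only a sketch: it assumes you have already extracted per-interval latency (ms) and queue length values from the lun stats output by some means, and the function name, parameters, and sample numbers are hypothetical. It does not attempt to parse the real lun stats output format.

```python
def assess_lun_samples(samples, latency_ms_limit=20.0, queue_limit=2.0, sustained=3):
    """Classify a series of (latency_ms, queue_length) samples.

    Returns "bad" if the queue stays above queue_limit for `sustained`
    consecutive samples, "warning" if any sample exceeds the latency
    limit, and "ok" otherwise. Thresholds follow the rule of thumb in
    the post; `sustained` is an assumed reading of "2 or 3 refreshes".
    """
    warning = False
    queue_run = 0
    for latency_ms, queue_len in samples:
        if latency_ms > latency_ms_limit:
            warning = True
        # Count consecutive samples with a deep queue; reset on recovery.
        queue_run = queue_run + 1 if queue_len > queue_limit else 0
        if queue_run >= sustained:
            return "bad"
    return "warning" if warning else "ok"


if __name__ == "__main__":
    # Hypothetical samples taken at 5-second refreshes.
    print(assess_lun_samples([(5, 0), (25, 1), (8, 1)]))
    print(assess_lun_samples([(25, 3), (30, 3), (28, 4)]))
```

The queue check resets whenever a refresh drops back under the limit, so a single spike does not count as a sustained problem.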
Call SnapBackup.exe with the -p 500 switch.
[19:30:01.026] ESEUTIL throttling is enabled.
[19:30:01.026] Pause/throttle: 1000 ms per 500 I/O's.
[19:30:01.026] This is an operation which will only run database verification.
Testing has shown that 500 gives the best trade-off between time taken and system load in our environment.
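To get a rough feel for what a "1000 ms per 500 I/Os" throttle costs in wall-clock time, here is a back-of-the-envelope sketch. It assumes each I/O is a 64K read and that the pauses simply add to the unthrottled run time; the function names, the 100 GB database size, and the 80 MB/sec unthrottled rate are illustrative assumptions, not measurements from our system.

```python
def throttle_pause_seconds(db_bytes, io_bytes=64 * 1024, ios_per_pause=500, pause_ms=1000):
    """Total time spent paused, assuming one pause of pause_ms
    after every ios_per_pause reads of io_bytes each."""
    total_ios = db_bytes / io_bytes
    return (total_ios / ios_per_pause) * (pause_ms / 1000.0)


def throttled_duration_seconds(db_bytes, unthrottled_mb_per_sec, **throttle):
    """Unthrottled read time plus accumulated pause time (simple additive model)."""
    read_seconds = db_bytes / (unthrottled_mb_per_sec * 1024 * 1024)
    return read_seconds + throttle_pause_seconds(db_bytes, **throttle)


if __name__ == "__main__":
    db = 100 * 1024**3  # assumed 100 GB database
    total = throttled_duration_seconds(db, unthrottled_mb_per_sec=80)
    print(f"estimated throttled verify: {total / 60:.0f} minutes")
```

Real runs will differ, since user load, caching, and fragmentation all change the unthrottled rate, so treat the result as an order-of-magnitude estimate.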
Thanks for the additional info. We put the verification job on another (virtual) server, and we are getting slightly better performance, but it is still unacceptable. We will likely end up throttling the job so the storage doesn't get hit as hard. We also have an SE coming onsite in a couple of hours; I'll write another follow-up once we have the final results.
Philip Arnason
Senior IS Lead
Verification uses eseutil for Exchange 2003 and chksgfiles for Exchange 2007. In both cases, the database is read in 64K chunks and the checksum of each page is verified. The SME backup log should tell you the rate at which the verification occurs. 70-90 MB/sec is not uncommon.
A 1 GbE iSCSI connection is maxed at about 120 MB/sec throughput. At 80 MB/sec verification, you're not hitting the limitations of the iSCSI connection. More likely it's a disk bottleneck; you're constrained by the number of spindles in the aggregate from which the volume containing your LUN is carved. A modern 15K FC spindle can achieve about 150 random 8K IOPS at a 20 ms response time. If I have 13 spindles in a RAID-DP RAID group in my aggregate, I have 11 data spindles in my aggregate. The aggregate can support about 150*11 = 1650 random 8K IOPS.
chksgfiles and eseutil both read in 64K chunks. If I have absolutely no fragmentation in my filesystem, then 64K IOs are only slightly slower than 8K IOs. The more fragmentation I have, the worse the sequential read performance gets, until one 64K read costs approximately the same as eight 8K reads. If your CPU is jumping up there and writes/cpreads is less than 1, I'd say you're fragmented. The fragmentation happens because of short writes, and builds up over time. You can check your fragmentation level with measure layout. For strategies to reduce fragmentation, see:
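The spindle arithmetic above is easy to redo for your own aggregate. A minimal sketch, assuming a single RAID-DP group (two parity disks) and the 150 IOPS per spindle figure quoted above; the function name and defaults are hypothetical:

```python
def aggregate_random_iops(spindles, parity_disks=2, iops_per_spindle=150):
    """Approximate random 8K IOPS an aggregate can sustain at ~20 ms,
    counting only data spindles (RAID-DP uses two parity disks per group)."""
    data_spindles = spindles - parity_disks
    return data_spindles * iops_per_spindle


if __name__ == "__main__":
    # 13 spindles in one RAID-DP group, as in the example above.
    print(aggregate_random_iops(13))
```

With more than one RAID group in the aggregate, you would subtract two parity disks per group rather than two in total.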
http://now.netapp.com/NOW/knowledge/docs/ontap/rel7251/html/ontap/bsag/11perf-f.htm
You'll want to carefully read both options and pick the one that is best for you.
A lot of times the goal isn't to get the verification going as fast as you can. You need to balance verification IO against user activity. You can do this by scheduling the verification in off hours, or by throttling the verification. If you throttle, your verification will take longer. Go back to the example of 13 drives in the aggregate: if verification proceeds at 1400 64K IOs per second and I have no fragmentation in my filesystem, that only leaves 200 or so IOPS for user load before I start breaking 20 ms response times. For that 100 GB database, the verification would complete in under half an hour at that rate. I think you said you were completing in 20 minutes, which is about 88 MB/sec. If this is a 9-5 shop and I verified the backup at 10 PM, that might not be a problem. If I had users accessing email at all hours, maybe I wouldn't want to go full bore on verification, but would instead throttle 50% and let it take an hour to complete.
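To make the trade-off concrete, here is a small sketch of the same arithmetic. It assumes 64K reads, no fragmentation, and the 1650 IOPS aggregate ceiling from the 13-drive example; the function names are hypothetical and the numbers are estimates, not measurements.

```python
def verify_throughput_mb_per_sec(verify_iops, io_kb=64):
    """Throughput implied by a given rate of 64K verification reads."""
    return verify_iops * io_kb / 1024.0


def verify_duration_minutes(db_gb, verify_iops, io_kb=64):
    """Time to read the whole database once at the given IO rate."""
    db_mb = db_gb * 1024.0
    return db_mb / verify_throughput_mb_per_sec(verify_iops, io_kb) / 60.0


def user_headroom_iops(aggregate_iops, verify_iops):
    """IOPS left for users before the aggregate passes its ~20 ms ceiling."""
    return max(aggregate_iops - verify_iops, 0)


if __name__ == "__main__":
    for iops in (1400, 700):  # full speed vs. roughly half speed
        print(
            f"{iops} IOPS: {verify_throughput_mb_per_sec(iops):.1f} MB/sec, "
            f"{verify_duration_minutes(100, iops):.0f} min, "
            f"{user_headroom_iops(1650, iops)} IOPS left for users"
        )
```

Halving the verification rate roughly doubles the run time while freeing most of the aggregate for user IO, which is the balance described above.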
John