2009-09-04 07:41 AM
IHAC currently using a NetApp FC SAN connected to Windows 2003; they are running backup-to-disk on top of the LUN (NTFS-formatted).
After running for a while, write performance on the FC LUN drops significantly (from close to 100MB/s to around 40MB/s). A closer look finds that the LUN is highly fragmented and, because of the staging unit design, the file system always runs at around 90% full.
Our competitor is educating them that this behaviour appears on NetApp FC SANs only.
I am urgently looking for any third-party documentation supporting the theory that a highly fragmented NTFS LUN, combined with a file system close to or above 90% full, will introduce significant write performance degradation. That would show the above issue will appear regardless of which SAN vendor they are using.
Any input is highly appreciated!!!
2009-09-04 01:52 PM
Is the competitor HP by any chance?
I'm not sure if I have any supporting evidence to prove the point, but it is simple storage economics. RAID algorithms always attempt to group writes together to improve disk layout and read performance. The more free space you have, the better write and read performance is. The less free space you have, the fewer full RAID stripes can be written in succession, and this causes read performance to suffer. This is a global challenge of any RAID system really.
Having said that, the NetApp storage system utilizes free space in the entire aggregate. So although it is best practice to keep the volumes with a fair proportion of free space, so long as your aggregate has adequate space, the NetApp storage should not be the bottleneck or cause for fragmentation.
It is much more likely that Windows is causing this issue. Windows never tells the storage that blocks have been freed when you delete data, so from the array's point of view the LUN is constantly filling towards 100% even as Windows purges data. This causes Windows itself to fragment the data significantly.
If a large amount of data is being written to and removed from the Windows file system, you can take several steps using NetApp tools and technology to improve this. You could schedule a regular reallocation scan from the NetApp side to greatly improve the performance of this LUN. In ONTAP 7.3 this is completely transparent to any snapshots, so this would definitely be a recommended step. http://now.netapp.com/NOW/knowledge/docs/ontap/rel7311/html/ontap/sysadmin/tuning/task/t_oc_tun_reallocate-creating-scan-schedule.html
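As a sketch, a weekly scan on a 7-mode system might look like the following from the ONTAP CLI. The volume and LUN names are placeholders, and the schedule string (minute, hour, day-of-month, day-of-week) should be checked against the docs linked above:

```
ontap> reallocate on
ontap> reallocate schedule -s "0 23 * 6" /vol/backupvol/backup_lun
ontap> reallocate schedule /vol/backupvol/backup_lun
```

The intent here is a scan every Saturday at 23:00, ahead of the weekend backup window; the second `reallocate schedule` call just prints the schedule back so you can verify it.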
If you are using SnapDrive, you could also look at scheduling space reclamation more regularly. This frees up the blocks that Windows doesn't release, so new writes can be laid out better. If you have a scheduled batch job, run the space reclaimer before it starts.
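For example (the exact `sdcli` syntax varies between SnapDrive releases, and G is a placeholder drive letter, so verify against your version's documentation):

```
C:\> sdcli spacereclaimer analyze -d G
C:\> sdcli spacereclaimer start -d G
```

The analyze pass estimates how much space can be handed back to the storage before you commit to the (I/O-intensive) reclaim itself.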
I'll never tire of hearing "WAFL degrades performance over time", and I've yet to see it actually proved when you follow the best practices. These best practices aren't too tricky to deploy or even restrictive on storage utilisation. Let me know if you need any more information on this, and I'd be interested to know the results.
2009-09-07 01:55 AM
As much as I love NetApp, I hate this pact of silence around fragmentation issues in LUN environments.
It is a problem, and yes, it can & should be cured via regular reallocating. IMHO messaging around this is far too weak & many people are getting stung by it due to a lack of properly advertised, decent & clear information.
Re ballooning snapshots due to reallocate being run - I believe the only way to dodge this is to use physical reallocation:
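A minimal sketch of a physical reallocation from the ONTAP CLI (the path is a placeholder):

```
ontap> reallocate start -p /vol/backupvol/backup_lun
```

With `-p`, ONTAP optimises the physical layout without rewriting the logical block pointers that snapshots share, which is why the snapshots don't balloon the way they can with a normal reallocate.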
2009-09-14 08:04 PM
So... I'm a bit more mixed here. I ran a SAN-only FAS3050 for years (4+, to be precise, before we added NFS as well) and never had performance problems, although we ran volumes pretty full.
As a partner engineer now, I like knowing about reallocate as an option (and like to use Performance Advisor to help people understand before just jumping to it), but so far I haven't seen many cases where fragmentation in SAN use cases has actually been an issue.
2009-09-17 01:35 AM
From my experience fragmentation manifests itself under specific circumstances, so I agree it may be a non-existent issue for many installations.
What Richard has described though perfectly fits the bill: pure sequential read/write operations tend to perform poorly if a LUN is heavily fragmented. So anything around backup to / from LUN (performed via external host, i.e. not snapshots) is likely to be affected.
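A rough back-of-envelope model shows why sequential I/O is so sensitive to this. The figures below (100 MB/s streaming rate, 10 ms per fragment for the seek) are illustrative assumptions, not measurements:

```shell
#!/bin/sh
# Model a sequential read of a 1 GiB region as one streaming transfer
# plus one seek per fragment. All figures are assumed round numbers.
size_mb=1024      # region size
stream_mbps=100   # streaming rate of the disks, MB/s
seek_ms=10        # penalty per fragment (seek + rotational delay), ms

for frags in 1 2000 6000; do
    xfer_ms=$(( size_mb * 1000 / stream_mbps ))   # pure transfer time
    total_ms=$(( xfer_ms + frags * seek_ms ))     # plus seek penalties
    echo "$frags fragment(s): ~$(( size_mb * 1000 / total_ms )) MB/s effective"
done
```

With a couple of thousand fragments per gigabyte the model lands in the same ballpark as the ~40 MB/s reported at the top of the thread, even though the disks themselves are perfectly healthy.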
One of my customers is doing SQL dumps every day (they do not use SMSQL) & then stream them to tape via NDMP. They run reallocate every Sunday & the performance is always the best on Monday, gradually degrading over the week.
2009-12-24 09:20 AM
Our solution for the SQL writes (not that performance was an issue) was to write the SQL data out to a CIFS share. Vol size set to 700 gigs with 80% reserved. The DBA overwrites his files each day, and then we can use snapshots if needed. This way we don't have LUN reserve issues, and the databases from multiple SQL servers (our SQL cluster plus a few little SQL instances) can be written to a single spot. We then SnapMirror this data to DR.
2010-05-19 11:43 AM
We are seeing the same issue backing up 600GB LUNs on Windows 2003 servers with millions of small files. We researched the issue for months and believe it's related to fragmentation. If we restore the data from tape to a fresh LUN we can back it up in about 8 hours instead of 30. When I run a reallocate measure on the LUN it comes back with a 3, which means it's not fragmented. Windows defrag likewise tells me the LUN is not fragmented. Does running a reallocate with -p make a difference? Also, those of you who run a reallocate on a weekly basis: do you delete all your snapshots before running it? I was told I need to delete the snapshots for the reallocate to run.
2010-05-19 12:35 PM
Millions of small files are never going to perform fantastically. I reckon you may find that a small level of fragmentation is causing a snowball effect on performance. I'd certainly give a reallocate a test and see if it makes a difference, especially since a fresh restore from tape backs up well afterwards.
I've run reallocations a couple of times and not needed to delete the snapshots. I think this recommendation exists because snapshots can grow significantly in size after the reallocate, so it may cause you space issues. I've seen pretty decent results running 7.3.2.
All I can recommend is to give it a test and see what happens. You could look at running off a FlexClone and splitting it off to give yourself a volume to play with.
2010-05-21 02:29 AM
The reallocate -p option (physical) gives the benefit of snapshots not growing, hence it is fine to run it & leave snapshots as they are.
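To the measure question above, a sketch of what this might look like (the path is a placeholder; check the reallocate man page for the exact flags on your ONTAP release):

```
ontap> reallocate measure -o /vol/vol1/lun1    # one-shot measurement of current layout
ontap> reallocate start -f -p /vol/vol1/lun1   # force a full, physical reallocation
```

`-f` forces the reallocation even when the measured optimisation looks acceptable, which is relevant here, since the measure returned 3 ("not fragmented") yet the backup behaviour still suggests poor sequential layout.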
Re running reallocate on a FlexClone volume: I doubt this provides any level of separation from the original volume, as the FlexClone points to 'live' blocks anyway & they would get reallocated (hmm, if running reallocate against a FlexClone is possible at all...).
Here is a good story from Bren describing the problem, the solution & one happy customer: