I have a bit of a puzzle with a client's storage system. The setup is quite simple: physical Windows 2003 servers with SnapDrive-attached LUNs over iSCSI on a FAS2020.
Issue: all the LUNs they've set up for their Windows Server platforms suffered a sudden drop in performance. Our subsequent analysis of the system showed only fragmentation issues as the probable culprit - up to 12 on one of the LUNs. Everything else seemed to be working fine: network tested, LUNs properly aligned, all services up and running.
After several reallocation jobs across all volumes/LUNs, the fragmentation levels are now below 1.9. In some cases this brought somewhat better performance, but on one particular LUN the READ performance is atrociously poor. Most of the time it's under 1-2 MB/s, even when there's hardly any workload on the storage side! 😞
The problem is that this LUN is extremely important for them. Moreover, it was the only reason for setting up a storage system - to raise its data availability. We've opened a case with NetApp support and sent them the perfstat reports they requested but, by the looks of it, it will take a long time to resolve the problem.
Now... Since the LUN in question is only 150 gigabytes in size, our first reflex was to move the data to a local disk on the server until we figure out what's wrong with iSCSI. But the read performance is so poor it would take us at least 8-10 hours to complete the migration. There are a few snapshots we also tried copying from, but since they're point-in-time images of that same slow LUN, performance was equally disappointing.
So I've mapped a new LUN to the same server and tested its performance. Surprisingly, it was OK. So now I'm thinking - if we could somehow migrate the data to a new LUN internally, perhaps we can avoid the slow read problem. Any ideas on how to do that? Do you suppose that would help?
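A quick back-of-the-envelope sketch of the migration time, in Python. The function name and the sample rates are illustrative assumptions, not measurements from this system; only the 150 GB size comes from the post. By this arithmetic, 8-10 hours corresponds to an average of roughly 4 to 5 MB/s, while a sustained 1-2 MB/s would take considerably longer.

```python
def transfer_hours(size_gb: float, rate_mb_per_s: float) -> float:
    """Hours needed to copy size_gb gigabytes at rate_mb_per_s megabytes per second."""
    if rate_mb_per_s <= 0:
        raise ValueError("rate must be positive")
    size_mb = size_gb * 1024  # binary units: 1 GB = 1024 MB
    return size_mb / rate_mb_per_s / 3600


# Assumed sample rates; replace them with whatever the host actually sustains.
for rate in (1, 2, 5):
    print(f"{rate} MB/s -> {transfer_hours(150, rate):.1f} h")
```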
Have you tried to map the problem LUN to another server? What type of data is on the LUN?
If you take a snapshot and map it and the problem still exists, I would say that the data on the LUN may be the issue here.
A manual copy would be slow, but you could run it overnight via ndmpcopy, rsync, or another copy tool from the local nodes to a freshly mapped LUN.
For the snapshot route: take a snapshot with SnapDrive, map it to the same server or another one, and see whether the problem persists.
If the load on the filer isn't huge, you may try cloning the LUN in question & then splitting the clone, so from the read perspective all blocks will be 'new'. And then mount it & see whether anything has improved.
When splitting the cloned LUN, I'd have to provide enough space for it on the same volume? Also, when the split is initiated, I assume NetApp will run an internal copying process which would be limited by the RAID group performance?
When splitting the cloned LUN, I'd have to provide enough space for it on the same volume?
Yes. I suggested LUN cloning because you don't need a FlexClone license for it. If you have the license, though, volume-level cloning could be even better.
Also, when the split is initiated, I assume NetApp will run an internal copying process which would be limited by the RAID group performance?
Yes. I am not saying it should improve anything if you apply standard rules and common sense. Yet from what you have said, this LUN behaves in a very unusual way, hence the hypothesis that there is something seriously wrong with the blocks, or the block layout, it originally sits on (which may or may not be true).
Not sure about the type of data, could be a database/application for GPS tracking. LUN itself is at 83% capacity.
Yes, it would certainly have to be an overnight job.
I'll try mapping a LUN snapshot to a different server first. Most other servers experienced the same iSCSI performance drop, so I'm doubtful, but if I get better performance that should prove that the issue is on the server side and not NetApp, correct?
If you have space on the 2020, you could create a new volume and do an ndmpcopy at the volume level. I have used ndmpcopy for both VMware volumes and SQL database volumes; what I like is the fact that it copies the tree structure across.
For example, I moved 2 x 60 GB FC LUNs last night, from a 3070 to a 6080. One took 16 mins, the other 19 mins across Ethernet, before reconfiguring the FC switch with the new settings. It puts the qtree and LUN right under the volume you specified in the command.
With the VMware one I actually did this during working hours on a test/dev system and there was no outage. The ndmpcopy command does tie up the filer's command line, and it can be slightly off-putting when you see very little being displayed on the filer for 20 mins.
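To put those timings in perspective, here is a small Python sketch of the implied throughput. It assumes the LUN size means 60 gigabytes and that each copy ran at a steady rate; the function name is made up for illustration. The figures describe a 3070-to-6080 copy and say nothing about what a FAS2020 with SATA disks would manage.

```python
def throughput_mb_per_s(size_gb: float, minutes: float) -> float:
    """Average MB/s for copying size_gb gigabytes in the given number of minutes."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return size_gb * 1024 / (minutes * 60)  # 1 GB = 1024 MB


# The two copies quoted above: 60 GB in 16 minutes and in 19 minutes.
for mins in (16, 19):
    print(f"60 GB in {mins} min ~ {throughput_mb_per_s(60, mins):.0f} MB/s")
```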
I will eventually have to copy the data, yes. The trick is that this 2020 box has only 12 internal SATA drives and it's copying data onto itself, so it might take quite a while to move 140 gigabytes. Ugh.
We'll see after I test what Watan and Radek proposed.
Trick for ndmpcopy:
Don't run it on NetApp's CLI with the ndmpcopy command; run it from the Windows command line through a remote shell instead (rsh <nodename> ndmpcopy). That way nothing gets tied up. 😉
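If you'd rather script that than type it, a minimal Python sketch of the same idea follows. The filer name and volume paths are hypothetical placeholders, and it assumes rsh access from the host to the filer is already set up. It only wraps the plain "ndmpcopy source destination" form; any extra options your situation needs are not covered here.

```python
import subprocess


def build_ndmpcopy_command(filer, src_path, dst_path):
    """Return the argv list that runs ndmpcopy on the filer through rsh."""
    return ["rsh", filer, "ndmpcopy", src_path, dst_path]


def run_ndmpcopy(filer, src_path, dst_path, dry_run=True):
    """Print the command when dry_run is set, otherwise run it and wait."""
    cmd = build_ndmpcopy_command(filer, src_path, dst_path)
    if dry_run:
        print(" ".join(cmd))
        return None
    # check=True raises CalledProcessError if rsh exits with a non-zero status.
    return subprocess.run(cmd, check=True)


# Hypothetical names; dry run only, so nothing is sent to any filer.
run_ndmpcopy("filer1", "/vol/vol_old/lun_dir", "/vol/vol_new")
```

Leaving dry_run at its default lets you check the exact command line before anything touches the filer.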
I've done as you suggested. I created a snapshot clone of the LUN, another clone from a different LUN and a brand new test LUN.
I've mapped all 3 to different servers and tested read/write performance. Only the initial problem was observed - slow read performance on the original LUN and its clone. Everything else was nominal. It seems that network/host issues can be ruled out since only that single LUN is causing problems.
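For anyone wanting to repeat this kind of read comparison, a rough Python sketch of a sequential read test is below. This is not the tool used for the tests above; the function name and the 1 MB block size are assumptions. The host's file cache can inflate the number, so point it at a large file that hasn't been read recently.

```python
import time


def sequential_read_mb_per_s(path, block_size=1024 * 1024, max_bytes=None):
    """Read a file sequentially and return (bytes_read, average MB/s)."""
    total = 0
    start = time.perf_counter()
    # buffering=0 skips Python's own buffer; the OS cache is still in play.
    with open(path, "rb", buffering=0) as f:
        while max_bytes is None or total < max_bytes:
            chunk = f.read(block_size)
            if not chunk:
                break  # end of file
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (total / (1024 * 1024)) / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

Call it with the same file size and block size on each mapped LUN (for example, a large database file on the original LUN versus the test LUN) and compare the rates.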
This is very unusual... 😕
P.S. The data on the LUN consists of SQL databases and logs.
It will require application downtime on our client's side, but I guess we'll have to try an overnight clone & split, since we're running out of ideas and technical support has been no help. Does splitting a clone mean that the new blocks will have the same layout as the parent object, or not? If not, perhaps there might be an improvement...
If you have isolated the LUN itself as the problem, then it really seems like you have an incorrect LUN layout. There are KB articles on checking alignment yourself based on output from the stats command (or just find the volume stats in the output from a perfstat run). You haven't mentioned much as far as disk (aggr) layout/sizes. Using a little FAS2020 and SATA disks is probably not the ideal basis for running SQL databases, but that might just be the financial realities of your situation.
I guess my suggestion is to make a new LUN and copy your data over to it. All of the cloning operations, as you have discovered, will simply copy the wrong layout over to a new LUN.
If you had a 2020 cluster, I would have suggested even splitting the database
You might be able to look at the 'priority' command as well, if you want to give a higher I/O priority to a select few volumes.