2013-08-26 05:56 AM
So we had an interesting issue over the weekend where we saw nfs write latency spikes up to 32,000,000 microseconds in our environment. We have PA enabled, but i'm not sure what other views/countes I should look at to try to correlate if this was a filer issue or not a filer issue.
Do we need a custom view for this. I think i'm also looking at something like statit for the disks ut% and xfers, to try to see if the disks were slammed. Is there a view for that as well?
Thanks in advance
2013-08-27 07:10 AM
I've seen 2 different types of problems for what you're reporting:
1. DFM itself was having a problem - I noticed that sometimes our perf collection lags behind while the rest of DFM services run fine and this will be apparent if you don't see performance data collected for a period of time right before the huge spike(you'll see a straight line from one minute to another indicating there was no data collection between those minutes). It's almost as if you're seeing the sum of the values from the period that DFM couldn't poll for perf data. I know there's a huge amount of I/O going to/from the LUN that hosts the perf data on our DFM server when this happens.
Based on the number you quoted, I'd suspect that's what happened and not an actual performance problem. Our performance problems rarely, if ever, hits 1,000,000 us and that raises all sorts of complaints. Anytime we get a number like that it's almost always a DFM service or some other DFM problem in general.
2. There was an actual problem - If you had an I/O bully(or shark), then it'd be best to write a script or at least do this over CLI to find the volume that generated the I/O that corresponds to the latency spike. You can do this by hand by looking at every volume's I/O during that time, but it's usually not practical and we do it often enough that writing a script made sense for us. What we did is basically take the output from dfm perf data retrieve -o $object -C $counter -b "start of problem period" -e "end of problem period" -m max and report back the volume with the largest I/O.
so that line above can look like this: dfm perf data retrieve -o filer1 -C volume:throughput -b "2013-08-27 09:00:00" -e "2013-08-27 09:30:00" -m max.
This line will show you lines and lines of volume:throughput data for each volume, and at the end of each volume's throughput data there will be the max value for that period
I think I've seen your posts about powershell(and I use powershell too, especially to work with DFM), so I think you'll find this approach helpful. check out the output from a line like that, and you'll want to use(or at least I ended up using) select-string for the last line for each volume since that's where dfm reports the max value,
We found it beneficial to use variables for those parameters above so that we're not restricted to just running it against a filer or a volume(if you group filers by certain groups in DFM you can run it against that group), and so we can specify the counters we want and so we can find more than maximum. We almost always use max or mean switch and not the others, though.
This approach has been very helpful for us in figuring out who the bully is, and if the filer was CPU or disk bound as a result.
For determining whether it was processor or disk issue, we use processor:processor_busy and disk:disk_busy. If you're running 7.x, then if even one core hits 95%+ you can pretty much assume it was a processor issue, We noticed that the processor utilization became very even across every core once we went to 8.1.x. We skipped 8.0 when going from 7 to 8. When we're having a disk-performance issue we see the disk_busy counter hitting the same 95-99% type of numbers. Typically when there's a disk performance issue it will become apparent in another way, which is that the volumes outside the busy disks' aggr don't report high volume:read_latency numbers. write_latency counter for those volumes will report high numbers once you have a processor or disk performance issue, but if it's a disk issue then the read performance problem is usually limited to just that aggr. If your filer is processor bound then all your latency counters will reprt high numbers.
2013-08-27 07:36 AM
Excellent post... I was digging into this more and found out the following
DFM stopped collecting data on this object from 23:00 until 2:51am the next day. I've dumped out the numbers from the cli from a latency number and noticed that gap. I checked other counters and noticed the same, so me trying to correlate those counters to find root cause will be very difficult. This is very aggravating... I also, just exported numbers from different objects and there is a gap the entire time. I'm wondering if DFM backups are happening during that time which is why no data is getting captured.. I am checking into this now..
So, my thought process from yesterday went out the window as well. I was working on multi-level triggers in PA, so I created a threshold with an emergency alert of 65000 microseconds for 5 minutes and the thought process was to have an event triggered in DFM, and then have the alert kick off a perfstat. I'm obviously stuck on the last part of triggering the perfstat, I'm trying to wrap powershell around that part.. TBD...
I'm very interested to see what you use powershell for DFM, as I use it a lot. I post a lot in the powershell forum for filer based, but I also posted powershell script to do elastic growth rate trending on aggrs from DFM and it got no traction, so I kind of gave up posting, but I have some other cool scripts that I would be willing to discuss with you. You can PM me and we can chat about that.
I haven't had some success dumping out from dfm perf data retrieve and wrapping powershell around it. I'm curious as to how you handled that aspect of it. I don't understand the statement where you said, that Netapp reports the highest last?
Also, with this statement, "We found it beneficial to use variables for those parameters above so that we're not restricted to just running it against a filer or a volume(if you group filers by certain groups in DFM you can run it against that group), and so we can specify the counters we want and so we can find more than maximum. We almost always use max or mean switch and not the others, though."
I'm really interested in hearing more about this.. Are you creating custom objects, and using measure-object etc....
2013-08-27 08:11 PM
If you're backing up the perf data using the archive method, then I believe it does stop perf data collection. I'm not sure if that should come with that 32 second spike, however. Missing perf data is a problem of its own and our perf data was getting so large that it was taking hours for us to back it up too, so we ended up switching to snapshot based backups.
We have a threshold like yours setup, but we pair it with another metric(volume:throughput) to identify either the bully, or the volume(s) most affected by the bully in terms of I/O. Otherwise if every volume is seeing high latencies and you get that many more emails even if some of them aren't generating I/O. There are some cases where the filer is at the point where it can satisfy the bully volume's I/O below the latency threshold and no other volume is moving the amount of data specified in the alarm I mentioned and they can still suffer(like a low I/O & latency sensitive process), so YMMV with that.
I switch back and forth with using ONTAP module for doing filer related tasks vs using "dfm run cmd", especially with DFM monitoring 7-mode filers. WIth the DFM server monitoring Clustered ONTAP filers, more of the work is done with ONTAP module, largely because the version of DFM we have in place for cDOT doesn't/won't work with some of the standard monitoring/conformance stuff we do with 7-modes(dedupe and autosize checks, for example).
If you are looking to do some graphing like dfm perf view command, then you may be running into the issue of powershell handling binary data differently than the regular prompt. I'm not sure if that's what you mean by growth rate trending, but when I was trying to use dfm perf view to generate graphs on space usage and so forth, that's what I had to work around. There's probably a .NET library to handle this, but I instead opted to get the script to write a simple batch file and had it run the batch file in the regular prompt because I knew I could do that quicker. There is no style point in that, however.
I didn't run into any odd problems with putting the dfm perf data output into powershell. I wrote it a while ago so I see there are some things I may do differently today, but it's run-of-the-mill string manipulation stuff and doesn't involve any use of objects or properties as far as the absorbing/manipulating the output. What I'm talking about when I say DFM reporting the max value last, here's what I mean:
2013-08-27 17:01:44 5018.000
2013-08-27 17:02:44 5072.550
2013-08-27 17:03:44 5385.433
2013-08-27 17:04:44 6178.200
Timestamp filer1:/vol0:throughput (max)
2013-08-27 17:04:44 6178.200
You see that DFM picks out the peak value at 17:04 and puts it 2 lines under (max) at the end, which is the mode that was specified. The three things I'm interested are the the timestamp, the value(6178), and the object name, which is vol0. I just manipulate the output so that it doesn't have that ------ line and organize the object name, timestamp, and the value more neatly. I do use new-object cmdlet for that part, and I attach those parameters as noteproperties. The rest is the standard array-of-objects code. Create an array, keep adding a new object to it with those 3 properties, and then after you're done, you can sort the objects by the values in descending order so that you can view the values from the highest value to the lowest. If you have 50 volumes on a filer(in any object), then the script will have 50 lines in the array with the name, timestamp, and the value it got from DFM.
2013-08-28 04:27 AM
What's your metrics for volume throughput.. I would like to pair it together like you are saying, but i'm not exactly quite sure how you set it up. Can you export that threshold template you setup, and send it on over?
Check my post on elastic aggregate growth in the powershell forum, it should have the script attached to what I did to determine growth rate trending on aggrs to see when they fill up.. I will try to dig up link or I can send you the code..
2013-08-28 08:53 PM
Well, I didn't know I could export a template. Good to know. I opened it up and it's got some company-specific stuff in there so I don't think I can attach that, but I can tell you the command to create the threshold I've got:
dfm perf threshold create -o filer1 -e 50ms_15MBs_test -d 300 -C "volume:throughput 15000000 b_per_sec upper,volume:write_latency 50000 microsec upper"
But to be honest I actually use threshold templates and not thresholds, and I'm not sure how to do it from CLI. I tried CLI for the first time tonight and I couldn't figure it out, so here's a screenshot of what I did in PA:
After that's created, just attach Objects(filers) to the template and you should be set. You can also create a template from your existing threshold, and I think that'll also achieve the same. We have 4.0.2 & 5.x in the environment, and that screenshot up there is from 4.0.2. There's a weird bug on it where that 50000 microsec value might get changed into 50 microsec if you try to edit the threshold. I don't mean 50 milliseconds. I mean micro. So you may have to add 3 extra zeroes to get what you want. For DFM gurus, this behavior, on 4.0.2, is reproducible 100% of the time after a threshold is created. Yes, I'm aware that we shouldn't be on 4.0.2 and we will move all of 7-mode monitoring over to 5.2 eventually.
2013-08-29 05:08 AM
So you don't combine write-latency and read_latency with the volume throughput? Now, do you create alarms based on this? I usually create templates as well. Let me play with this for a little..
2013-08-29 01:22 PM
I'm a bit fuzzy on how the templates are coupled with alarms, but I think what happens is that when a threshold in a threshold template is breached by an object(usually a filer) that's attached to the template, it creates an error event, triggering an alarm with Error severity. I could be wrong.
On your first question, no, I do not combine write_la with read, but this is more out of my own experience and not any specific documentation/guidance I received from NetApp, and it is based on the workload type I see here, so this may not work for the type of workload your filers are seeing. But how I came to create a threshold without read_latency(or avg_latency) is that the primary metric that our Oracle DBAs used to determine that they were having a storage performance problem(almost always accurately, too) is by watching their log sync wait numbers, which I understood to be how Oracle measures write performance with redo logs. We could pretty much match up Oracle Enterprise Manager's graph of "log sync wait" with the write_latency graph for that volume in PA.
We also noticed that ESX complaints can be matched up with it. We can match up other database performance problem with it. I'm sure you can do this with avg_latency, too but while looking at these problems we noticed that there's a pretty big gap between read_latency and write_latency, and with avg_latency & write_latency to a lesser extent. Taking the most recent example, the I/O bully from the last alarm I picked out is reporting 250ms write_latency and 33ms read_latency. This volume is doing mostly reads and is moving a lot of data so 33ms latency doesn't get a lot of sympathy from me, but guess what? The redo log vols were also seeing write_latency of around 250ms, and they care more about the 250ms number than the bully. And this level of disparity between the two counters for us is very common. To be fair, avg_latency during that time is around 100ms. I think anyone would agree that 100ms can be classified a problem and I wouldn't disagree either, but I think the choice of write_latency over avg or read_latency has helped us identify problems better(more frequently) historically.
So, for the type of data layout we've got, this level of filtering works for us. Sure, we can put in threshold of 50ms write_latency + 10ms read_latency + 15MBs of throughput and properly identify this particular problem I mentioned and you can see now that you can go crazy picking different counters and tweaking values, but we made a judgment call that we don't need the third condition for a threshold and it's worked ok for us.