The broader your monitoring, the better you can correlate issues.
In a case like this, you want to collect:
On the NetApp:
- IOPS per volume and per aggregate
- latency per volume and per aggregate
- LUN queueing
- CPU usage (total and per CPU), FCP stats, etc.
- physical disk utilization
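Collecting both the per-volume and per-aggregate views matters because volumes on the same aggregate share spindles. A minimal Python sketch of that roll-up (the volume and aggregate names are made up for illustration):

```python
from collections import defaultdict

# Hypothetical per-volume samples: (volume, containing aggregate, IOPS)
samples = [
    ("vm_datastore1", "aggr0", 4200),
    ("vm_datastore2", "aggr0", 9800),
    ("dept_share",    "aggr1", 1500),
]

def iops_per_aggregate(samples):
    """Sum per-volume IOPS up to the containing aggregate."""
    totals = defaultdict(int)
    for volume, aggregate, iops in samples:
        totals[aggregate] += iops
    return dict(totals)

print(iops_per_aggregate(samples))  # {'aggr0': 14000, 'aggr1': 1500}
```

The point: even if `vm_datastore1` itself is quiet, the 14,000 IOPS hitting `aggr0` as a whole can still be what drives its latency up.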
But you also want performance information from your fabric switches and from your ESX hosts: datastore latency from the ESX host's point of view, total IOPS, and IOPS per virtual machine.
Then you'll be able to see if:
- there was a latency issue on the datastore volume on the NetApp
- if so, whether it was due to IO load on that volume or on other volumes in the same aggregate. Depending on the workload it could be the NetApp CPU, but that is unusual; far more often it is IOs per spindle.
- if it was that datastore volume, whether a specific VM generating an unusual workload was causing it
- or whether the issue was somewhere in the Fibre Channel network
We use LogicMonitor to track all of this in one place.