I read the guidelines for storage sizing that NetApp and Liquidware Labs released. How can you accurately measure IOPS when sampling every 5 minutes? Five seconds could be an eternity from an IOPS perspective. At 5 minutes, it seems you could miss a lot of activity and demand.
The Liquidware team pointed something out to me today:
We measure the average and peak values for machine level IOPS and application level IO Rate within the sampling period. We provide the average and peak values for IOPS + IO Rate for the machine and all applications running on the machine within that 1 min period. So effectively, we will capture the peak values during that period and not at the time of our sampling. Due to this approach we will not miss any peaks that occurred during our sampling interval. Also, the sampling interval can be configured down to 1 min as well.
So even though the sampling period is 5 minutes by default, you are in fact getting all of the data from in between the samples. Thus you will definitely catch the peaks you were worried about. With VDI in particular, you want to watch for times when all the desktops peak. As I said, a single desktop peaking likely will not affect the environment, but if something (virus scan, patch update, etc.) causes all the desktops to peak at the same time, then this can be a major issue.
I guess I'm not getting something. If you're sampling every 5 minutes, where does the 1 minute sample come from, and how are you capturing a peak if you're only sampling once every 5 minutes? I don't know how Liquidware monitors, so I'm sure I'm missing something. Do you poll at the millisecond level and only write one record to the database every five minutes to keep the database size down, or are you polling every 5 minutes and grabbing the value at that point in time? I still contend 5 minutes is a long interval; that's only 12 records in an hour. Spindles are expensive, so under-provisioning or over-provisioning is a big deal, especially for an SMB customer. It can be the difference in whether a project gets funded or not. Clearly I'm not a "storage" expert and I'm sure I'm not getting this, but I appreciate your input and explanations.
Sorry for the long delay. I decided I had better follow up with Liquidware directly to make sure I was not making assumptions. Here is the data I received:
We collect Read IOPS, Write IOPS and IO transfer rate (Kb/s) for both applications and machines.
Our CID keys (agents) collect the information locally on the machine, then send it back to our Hub. Each agent follows 2 schedules:
Call back frequency: How often the agent sends the information back to the hub
Sampling interval: How often we capture the local data (for continuous metrics such as CPU, memory, disk IOPS, etc.)
The sampling interval basically determines how granular the information will be and can be set as low as 1 minute.
For each sampling interval, we collect the average resource used during that period, NOT the value at the time of the collection.
In other words, we ask the system for the past sample interval's AVG value with a single API call every sample interval.
The system returns the past X min AVG value and not the current realtime value. “X” may be as low as 1 minute.
Example: 1 hour call back frequency, 5 minute sampling interval
For each application, the CID key will report 12 values representing the average metric (read IOPS, write IOPS…) observed over each 5 minute interval + one peak value + the average over the call back period ((value 1 + value 2 + … + value 12) / 12).
The peak is the highest average value among the 12 samples.
This method ensures a very accurate average and that any peaks would be collected.
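To make the 1 hour / 5 minute example concrete, here is a rough Python sketch of the arithmetic described above. It is not Liquidware's code: the agent is described as asking the system for each interval's average with a single API call, whereas this sketch simulates that by averaging hypothetical per-second readings. The function name `callback_report` and the sample workload are made up for illustration.

```python
def callback_report(per_second_iops, sampling_interval_s=300):
    """Summarise one call-back period of (hypothetical) per-second IOPS readings.

    Returns the per-interval averages, the peak (highest interval average),
    and the average over the whole call-back period.
    """
    interval_averages = []
    for start in range(0, len(per_second_iops), sampling_interval_s):
        window = per_second_iops[start:start + sampling_interval_s]
        interval_averages.append(sum(window) / len(window))
    return {
        "interval_averages": interval_averages,
        # Peak as described in the post: the highest of the interval averages.
        "peak": max(interval_averages),
        # (value 1 + value 2 + ... + value 12) / 12 for a 1 hour call back.
        "callback_average": sum(interval_averages) / len(interval_averages),
    }


# Assumed workload: a steady 10 IOPS for an hour, with a 30 second burst
# of 210 IOPS that falls inside the third 5 minute interval.
hour = [10] * 3600
for second in range(700, 730):
    hour[second] = 210

report = callback_report(hour)
print(len(report["interval_averages"]))  # 12 values per call back
print(report["peak"])                    # 30.0
```

In this made-up hour the burst is not lost, since it lifts the third interval's average from 10 to 30, and that interval becomes the reported peak.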
CREEZ111, I would agree if we were sizing for a major server workload. With VDI, however, the IO spikes of any one desktop are not usually enough to affect the environment as a whole. It is the aggregated IO load from hundreds or thousands of desktops that we are sizing for.
That's not to say we don't care about IO spikes. Certainly, if all the desktops spike their IO at the same time, that is a huge problem. These are usually induced events though (OK everyone, download this file on my count, ready? 1...2...3...): things like patch updates, virus definitions, etc.
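A quick sketch of why coincident spikes matter more than individual ones. All numbers here are invented for illustration (120 desktops, a baseline of 10 IOPS, a spike of 100 IOPS, 12 intervals), and `profile` and `aggregate_iops` are hypothetical helpers, not part of any Liquidware tool.

```python
def profile(spike_interval, intervals=12, baseline=10, spike=100):
    """One desktop's per-interval average IOPS with a single spike."""
    return [spike if i == spike_interval else baseline for i in range(intervals)]


def aggregate_iops(desktop_profiles):
    """Sum the per-interval IOPS across all desktops."""
    return [sum(values) for values in zip(*desktop_profiles)]


desktops = 120

# Spikes spread evenly over the hour: 10 desktops spike in each interval.
staggered = [profile(d % 12) for d in range(desktops)]

# Induced event: every desktop spikes in the same interval.
coincident = [profile(3) for _ in range(desktops)]

staggered_peak = max(aggregate_iops(staggered))
coincident_peak = max(aggregate_iops(coincident))

print(staggered_peak)   # 2100
print(coincident_peak)  # 12000
```

Each desktop does exactly the same amount of IO in both cases. Only the timing differs, and the aggregate peak the storage has to absorb goes from 2100 to 12000 IOPS.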
What we really want to capture with the Liquidware tool is the profile of the desktops in general over the course of days or even weeks. That is just too much data if we shorten the sampling interval.
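For a sense of scale, here is some back-of-the-envelope arithmetic on sample counts. It assumes one stored value per metric per sampling interval and a hypothetical 1,000 desktops. How the tool actually stores its data is not covered here, so treat these as raw sample counts only.

```python
def samples_per_desktop(sampling_interval_s, days):
    """Number of samples one desktop produces for one metric over `days`."""
    return days * 24 * 3600 // sampling_interval_s


desktops = 1000  # assumed fleet size
week = 7

for label, interval_s in [("5 min", 300), ("1 min", 60), ("5 sec", 5)]:
    per_desktop = samples_per_desktop(interval_s, week)
    print(label, per_desktop, per_desktop * desktops)
# 5 min 2016 2016000
# 1 min 10080 10080000
# 5 sec 120960 120960000
```

That is per metric, so read IOPS, write IOPS and transfer rate for every application on every machine multiply it further.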