I have quite an interesting situation. We have two 3170s in a cluster pair running a VDI solution, and almost every day around 3am the FCP IOPS start climbing, say from 7,000 to 13,000 by about 4am, and then drop again by 5am.
Our latency is below 10ms for almost the whole 24 hours, which I believe is good. The real problem is the peak: between about 3:45am and 4:10am latency goes crazy, to 100-150ms. The spike lasts only those 10-15 minutes, then it gradually returns to normal, down to the 20s and then the 10s by 5am.
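To pin down the exact window of a spike like this from exported latency samples, a minimal sketch could look like the following. It assumes the samples are available as (time, latency in ms) pairs; the function name `find_spike_windows`, the 100ms threshold and the sample values are illustrative only and are not taken from the perfstat.

```python
def find_spike_windows(samples, threshold_ms=100.0):
    """Return (start, end, peak_ms) for each run of consecutive samples
    whose latency is at or above threshold_ms."""
    windows = []
    start = last = peak = None
    for ts, latency in sorted(samples):
        if latency >= threshold_ms:
            if start is None:
                start, peak = ts, latency   # a new spike begins
            else:
                peak = max(peak, latency)   # spike continues
            last = ts
        elif start is not None:
            windows.append((start, last, peak))  # spike ended
            start = None
    if start is not None:                   # spike still open at end of data
        windows.append((start, last, peak))
    return windows


# Illustrative values only, shaped like the pattern described above.
samples = [
    ("03:00", 8.0), ("03:30", 9.0), ("03:45", 110.0), ("03:55", 150.0),
    ("04:10", 120.0), ("04:20", 25.0), ("05:00", 10.0),
]

for start, end, peak in find_spike_windows(samples):
    print(f"spike from {start} to {end}, peak {peak:.0f} ms")
```

Running the same check over several nights would show whether the window really is fixed, which points towards a scheduled job rather than random load.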
When I searched the web for an answer, I found this link, which is very close to what I see ->
Can someone please take a look at this and reply?
I do have perfstats for the filers in case they are needed -> 2001524925 (filers nsaunsw068 and nsaunsw069).
Thanks in advance for any help.
Are you experiencing the infamous symptoms of VMDK misalignment? Can you speak to this?
Please pardon my BlackBerry thumbmanship ...
Stetson M. Webster
Professional Services Consultant
NCIE-SAN, NCIE-B&R, SCSN-E, VCP
NetApp Professional Services - East
Thanks for the info. A few things to clear up.
-The aggr snapshot is no longer there; it was disabled along with some options for the MPIO error threshold notice that they told us to clear. We have disabled it and yet the behavior is the same. The perfstat is from one month ago, and the big change we have made since then is:
-Dedupe was set to run on Saturday and Sunday only, but the behavior is still the same: between 3:50am and 4:15am latency freaks out, with vtic resources out errors (which is basically saying "I can't take any more connections or serve them"?)
BUT believe me guys, just today when I checked the FCP latency, it was flat. We just want to see whether it stays that way next week; if so, I will let you guys know who did what among the other service providers, from the VMware to the network guys.
I can't believe I did not see the latency spike tonight, just when I started posting about it after weeks of the issue. I can promise we did not make any changes on the filer!!
I will be out of the office for a week. I hope all goes well, crossing my fingers.
Any chance this is a boot storm? We have our VDI machines set to throttle down to 10% running after business hours, then turn back on prior to the beginning of the next business day.
Just another suggestion, probably nothing, but are a lot of snapshots taken in that period? I am also thinking of the aggregate snapshots.
But because these are VDIs, the problem is more likely on the client side: tasks run from management tools like SCCM, Altiris, Tivoli, ... They like to start tasks on all the machines at the same moment. The same goes for antivirus products.
Just wanted to share some info on this.
So, the story so far in short: "finally we believe it is caused by the McAfee AV". It is almost nailed down to that (no prize for guessing here!!), but there are other things to note here.
What also happens is that this load from the clients combines with a cocktail of block alignment issues and MPIO traffic getting routed over the wrong paths. We found that, because of the wrong paths being set, the VTIC adapter runs out of resources during this time and latency goes crazy!!
So, things to check:
-Check priv set diag; stats show lun, get the histo buckets 0-7 and find out how bad the alignment is.
-Check lun stats -o -i 1 /lunpath and get the partner load (zero the counters first if you need to). Also check for QFULLs!! If you have any, fix them.
-Ask the VM guys to get the path settings from the VM side.
-Check statit for those times. Don't be surprised if you see B2B or any big B in the CP type (I don't want to generalize this, but it is a good pointer to check the CPs, and if that shows nothing, all the sysstat -x 1 columns).
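For the alignment check in the first item, a rough sketch of how the histogram could be summarized is shown below. As I understand it, bucket 0 of the align histogram counts I/O that starts on a 4 KB block boundary and buckets 1-7 count I/O that starts 1 to 7 sectors (512 bytes each) past a boundary, so a high share outside bucket 0 suggests misalignment. The input line format, the LUN path and the percentages here are assumptions made for illustration; adjust the pattern to whatever your stats show lun output actually looks like.

```python
import re

# Assumed line shape: "...read_align_histo.3:12%" (verify against real output).
ALIGN_LINE = re.compile(r"(read|write)_align_histo\.(\d):\s*(\d+)%")


def parse_align_histo(text):
    """Collect the read/write alignment buckets 0-7 as percentages."""
    histo = {"read": [0] * 8, "write": [0] * 8}
    for match in ALIGN_LINE.finditer(text):
        kind, bucket, pct = match.group(1), int(match.group(2)), int(match.group(3))
        if bucket <= 7:
            histo[kind][bucket] = pct
    return histo


def misaligned_percent(buckets):
    """Share of I/O that did not start on a block boundary (buckets 1-7)."""
    return sum(buckets[1:])


# Hypothetical output for a hypothetical LUN path.
sample_output = """\
lun:/vol/vdi_vol/lun0:read_align_histo.0:12%
lun:/vol/vdi_vol/lun0:read_align_histo.1:1%
lun:/vol/vdi_vol/lun0:read_align_histo.2:1%
lun:/vol/vdi_vol/lun0:read_align_histo.3:1%
lun:/vol/vdi_vol/lun0:read_align_histo.4:1%
lun:/vol/vdi_vol/lun0:read_align_histo.5:1%
lun:/vol/vdi_vol/lun0:read_align_histo.6:1%
lun:/vol/vdi_vol/lun0:read_align_histo.7:80%
lun:/vol/vdi_vol/lun0:write_align_histo.0:5%
lun:/vol/vdi_vol/lun0:write_align_histo.7:90%
"""

histo = parse_align_histo(sample_output)
for kind in ("read", "write"):
    print(f"{kind}: {misaligned_percent(histo[kind])}% of I/O outside bucket 0")
```

A histogram concentrated in a single non-zero bucket, as in this made-up sample, is the pattern you would expect from guests whose partitions start at an offset that is not a multiple of 4 KB.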
As of now there are some good indications that we might get some relief after fixing the block alignment and the MPIO paths!!
Moral of the story!! With VMware or any virtual environment, make sure you have the basics done right, else it's a pain in .....
I thought I would post a useful update on this episode.
- We were finally able to move the AV load, and there was a dramatic shift in latency from 100+ms to just 5-10ms. Wow, that is good.
- The other finding is that after the HUK for ESX was installed, the partner ops were fixed and the vtic resources issue was fixed.
- Alignment is yet to be fixed, but it is far better now; things are mostly getting under control.
Moral of the story: use the HUK and, most importantly, always test and confirm all is OK before going into production.