Solved: Re: I/O problem FAS8040 Netapp

abdoul001 · ‎2019-03-27

hi,

We have been facing a problem for about a month: VMs are disconnected from the storage, and in the VMware log we can see this:

2019-03-24T22:38:15.173Z cpu4:33383)ScsiDeviceIO: 2613: Cmd(0x439e4bb84280) 0x8a, CmdSN 0x80000018 from world 36469 to dev "naa.600a098038303841635d4a2d2d575a63" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2019-03-24T22:39:42.333Z cpu10:33383)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x8a (0x43a640b73680, 36213) to dev "naa.600a098038303772735d4a37734f4844" on path "vmhba2:C0:T0:L0" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-03-24T22:39:42.333Z cpu10:33383)ScsiDeviceIO: 2613: Cmd(0x43a640b73680) 0x8a, CmdSN 0x800e0038 from world 36213 to dev "naa.600a098038303772735d4a37734f4844" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2019-03-24T22:43:57.540Z cpu14:33383)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x8a (0x43a64b6f6e40, 35971) to dev "naa.600a098038303841635d4a2d2d575a59" on path "vmhba2:C0:T1:L3" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-03-24T22:43:57.540Z cpu14:33383)ScsiDeviceIO: 2613: Cmd(0x43a64b6f6e40) 0x8a, CmdSN 0x8000007b from world 35971 to dev "naa.600a098038303841635d4a2d2d575a59" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
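
For anyone reading these lines later: the `H:`, `D:` and `P:` fields are the host, device and plugin status of the failed command. In VMware's published guidance on interpreting SCSI sense codes, host status 0x2 is listed as BUS_BUSY, and 0x8a / 0x2a are the standard SCSI opcodes for WRITE(16) / WRITE(10). A host-side status with `D:0x0` usually suggests the failure was reported by the HBA or transport layer rather than by the array as a check condition, but that is a rule of thumb, not a diagnosis. The sketch below is a minimal Python decoder; `decode_line` and the lookup tables are illustrative names, and only a few common status values are included.

```python
import re

# Host status values (subset), as listed in VMware's guidance on
# interpreting SCSI sense codes in ESXi logs.
HOST_STATUS = {
    0x0: "OK",
    0x1: "NO_CONNECT",
    0x2: "BUS_BUSY",
    0x3: "TIMEOUT",
    0x5: "ABORT",
    0x7: "ERROR",
    0x8: "RESET",
}

# Standard SCSI opcodes for the common block read/write commands.
OPCODES = {0x28: "READ(10)", 0x2A: "WRITE(10)", 0x88: "READ(16)", 0x8A: "WRITE(16)"}

# The opcode follows "Cmd(<pointer>) " in ScsiDeviceIO lines and "Cmd " in NMP lines.
OPCODE_RE = re.compile(r"Cmd(?:\(0x[0-9a-f]+\))? (0x[0-9a-f]+)")
DEVICE_RE = re.compile(r'to dev "([^"]+)"')
PATH_RE = re.compile(r'on path "([^"]+)"')
STATUS_RE = re.compile(r"H:(0x[0-9a-f]+) D:(0x[0-9a-f]+) P:(0x[0-9a-f]+)")


def decode_line(line):
    """Return a dict describing one failed-command log line, or None."""
    status = STATUS_RE.search(line)
    if status is None:
        return None  # not a SCSI failure line
    host, device, plugin = (int(value, 16) for value in status.groups())
    opcode = OPCODE_RE.search(line)
    dev = DEVICE_RE.search(line)
    path = PATH_RE.search(line)
    op = int(opcode.group(1), 16) if opcode else None
    return {
        "opcode": OPCODES.get(op, hex(op) if op is not None else "unknown"),
        "device": dev.group(1) if dev else None,
        "path": path.group(1) if path else None,  # only NMP lines carry a path
        "host_status": HOST_STATUS.get(host, hex(host)),
        "device_status": device,
        "plugin_status": plugin,
    }
```

For the 22:39:42 NMP line above, this reports a WRITE(16) to naa.600a098038303772735d4a37734f4844 on path vmhba2:C0:T0:L0 with host status BUS_BUSY.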

We need urgent help, please!

SpindleNinja · ‎2019-03-27

Are all of the best practice settings applied to VMware?

Any changes to the environment in the last month?

Are all parts (server, storage, switch, hypervisor) compatible in the IMT?

abdoul001 · ‎2019-03-28

hi,

We did an upgrade.

Previously we had 1 stack; in the upgrade we added a second stack to the controllers.

are you there? we really need help

SpindleNinja · ‎2019-03-28

I would open a P1 support case with NetApp if you need help ASAP; that will give you the fastest response time possible.

https://www.netapp.com/us/contact-us/support.aspx

As far as the upgrade goes... do you mean you did an ONTAP upgrade or added some disks?

abdoul001 · ‎2019-03-28

We added 9 new shelves to the newly created stack, and after that high I/O latency started on all VMs and on both stacks.

konnerth · ‎2019-03-28

I agree with the other poster. Sounds like there is more to this than just the virtualization software. I would open a P1 case ASAP. They can take a comprehensive look.

---Karl

SpindleNinja · ‎2019-03-28

There could be a lot going on there, honestly. Open a support case and have a perfstat run on the cluster.

abdoul001 · ‎2019-03-29

This is the response from VMware:

2019-03-24T23:04:41.380Z cpu22:33383)NMP: nmp_ThrottleLogForDevice:3298: Cmd 0x2a (0x439e4b6ccd00, 36451) to dev "naa.600a098038303841635d4a2d2d575a63" on path "vmhba2:C0:T1:L6" Failed: H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0. Act:EVAL
2019-03-24T23:04:41.380Z cpu22:33383)ScsiDeviceIO: 2613: Cmd(0x439e4b6ccd00) 0x2a, CmdSN 0x8000004b from world 36451 to dev "naa.600a098038303841635d4a2d2d575a63" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2019-03-24T23:05:01.627Z cpu14:33383)ScsiDeviceIO: 2613: Cmd(0x439e4b7a09c0) 0x2a, CmdSN 0x80000073 from world 36451 to dev "naa.600a098038303841635d4a2d2d575a63" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
2019-03-24T23:16:25.753Z cpu10:33383)ScsiDeviceIO: 2613: Cmd(0x439e40b95680) 0x2a, CmdSN 0x800e0017 from world 35697 to dev "naa.600a098038303772735d4a37734f484c" failed H:0x2 D:0x0 P:0x0 Possible sense data: 0x0 0x0 0x0.
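
To see how widely the failures are spread, the same lines can be grouped by LUN. The sketch below is a small Python helper, not a VMware tool: `summarize` is an illustrative name, and it assumes that an NMP line and a ScsiDeviceIO line carrying the same command pointer describe the same failed command, so each pointer is counted once per device. Pointers can be reused over a long log, so treat the counts as approximate.

```python
import re
from collections import defaultdict

# Command pointer: "Cmd(0x...)" in ScsiDeviceIO lines, "(0x..., world)" in NMP lines.
CMD_RE = re.compile(r"\((0x[0-9a-f]+)[,)]")
DEVICE_RE = re.compile(r'to dev "([^"]+)"')
PATH_RE = re.compile(r'on path "([^"]+)"')


def summarize(lines):
    """Group failed commands by device, de-duplicating on the command pointer."""
    commands = defaultdict(set)  # device -> set of command pointers
    paths = defaultdict(set)     # device -> set of paths seen in NMP lines
    for line in lines:
        if "H:0x" not in line:
            continue  # skip anything that is not a SCSI failure line
        dev = DEVICE_RE.search(line)
        cmd = CMD_RE.search(line)
        if not dev or not cmd:
            continue
        commands[dev.group(1)].add(cmd.group(1))
        path = PATH_RE.search(line)
        if path:
            paths[dev.group(1)].add(path.group(1))
    return {
        device: {"failed_commands": len(pointers), "paths": sorted(paths[device])}
        for device, pointers in commands.items()
    }
```

Run over the four lines above, it counts two distinct failed commands on naa.600a098038303841635d4a2d2d575a63 (path vmhba2:C0:T1:L6) and one on naa.600a098038303772735d4a37734f484c, which fits the picture of writes failing on more than one LUN.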

-- We tried to bring production back up by registering the orphaned VMs on host .170, and they powered on successfully
-- In the meantime host .171, which houses the vCenter and also had the VMs, went into a non-responding state
-- We rebooted the VM and it came back online; once it was online, host .170 went into a not-responding state
-- The VM LMS has 3 disks with 2 levels of snapshots

[root@localhost:/vmfs/volumes/583f0917-bc1e434e-c61b-000af74ef094/LMS] ls -ltrh | grep -i delta
-rw------- 1 root root 948.7G Dec 18 16:46 LMS_1-000001-delta.vmdk
-rw------- 1 root root 908.4G Mar 24 16:47 LMS_1-000002-delta.vmdk
-rw------- 1 root root 215.2G Mar 24 16:47 LMS-000001-delta.vmdk

[root@localhost:/vmfs/volumes/583f0917-bc1e434e-c61b-000af74ef094/LMS] ls -ltrh /vmfs/volumes/59a78fc2-1606aed3-675e-000af74ef094/LMS | grep -i delta
-rw------- 1 root root 46.7G Dec 15 23:56 LMS-000002-delta.vmdk
-rw------- 1 root root 46.5G Mar 24 16:47 LMS-000003-delta.vmdk

-- Even if we try to power it on, it gets stuck at 33%; we told the customer we will have to wait until it completes or do a consolidation
-- Since reads and writes are failing on all the datastores, we told the customer to engage the storage vendor to check further
-- Checked the driver and firmware on the HBA, and they are compatible

-- With storage commands failing with the above errors, hosts .170 and .171 are going into a not-responding state one after the other, and it is difficult for us to troubleshoot until the storage issue is fixed.
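
The snapshot listing in this reply also shows how much delta data a consolidation would have to work through. The sketch below is a minimal Python example that totals the `-delta.vmdk` sizes from `ls -lh` style output; `delta_sizes_gib` is an illustrative name, and it assumes the size is the fifth column with an optional K/M/G/T suffix in powers of 1024.

```python
import re

# Multipliers that convert each ls -h suffix to GiB (assumed 1024-based).
TO_GIB = {"": 1 / 1024 ** 3, "K": 1 / 1024 ** 2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}

SIZE_RE = re.compile(r"([\d.]+)([KMGT]?)")


def delta_sizes_gib(ls_output):
    """Map each *-delta.vmdk file name to its size in GiB."""
    sizes = {}
    for line in ls_output.splitlines():
        fields = line.split()
        # Expected columns: mode, links, owner, group, size, month, day, time, name
        if len(fields) < 9 or not fields[-1].endswith("-delta.vmdk"):
            continue
        match = SIZE_RE.fullmatch(fields[4])
        if match:
            sizes[fields[-1]] = float(match.group(1)) * TO_GIB[match.group(2)]
    return sizes


first_datastore = """\
-rw------- 1 root root 948.7G Dec 18 16:46 LMS_1-000001-delta.vmdk
-rw------- 1 root root 908.4G Mar 24 16:47 LMS_1-000002-delta.vmdk
-rw------- 1 root root 215.2G Mar 24 16:47 LMS-000001-delta.vmdk
"""

total = sum(delta_sizes_gib(first_datastore).values())
print(f"delta files on first datastore: {total:.1f}G")
```

For the two listings above, this comes to about 2072.3G of delta files on the first datastore and 93.2G on the second, which helps explain why waiting for the operation or consolidating is slow while writes to the datastores are failing.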

The second stack was added 6 months after the first one. The first stack is connected to an IBM blade chassis and had been working fine all those months, but the problem started when we added the second stack, which is connected to an HP blade (LUNs on the second stack are accessed by the HP blade).

We have a replication site where the configuration is the same, and there everything is fine.

abdoul001 · ‎2019-03-29

We have opened a case with NetApp and are waiting for feedback; the perfstat was sent yesterday.

JakeOff · ‎2019-05-17

I also noticed such a problem, I hope it has already been fixed.

)))

abdoul001 · ‎2019-05-19

We fixed our problem