ONTAP Discussions

Flash Accel question

borismekler
7,309 Views

Now that version 1.2 is out with support for vSphere 5.1 and vMotion, I'm preparing to deploy it. Before I do, there's one thing I haven't been able to find an answer to in the documentation: how does it handle cache device failures? That is, if I give it just one SSD (rather than RAID1 or RAID10) and that SSD fails, will I simply get performance degradation, or will my cached VMs (or worse, the entire host, including non-cached VMs) crash? The same goes for PCIe devices, which can't be RAIDed in the first place.

13 REPLIES

SLASH5K200
7,219 Views

Howdy,

I can't see the benefit of RAID1/10 for a read cache. If the SSD or PCIe device were to die, your data would still be safe; your reads would then be served from Flash Cache or, failing that, from the spindles.

C

borismekler
7,219 Views

I know the reads will come from spindles. My question is how gracefully the failure itself is handled when the entire cache device disappears from the host or starts throwing weird errors. Will the VMs bluescreen? Will the host crash? Will either the VMs or the host require a reboot? Can Flash Accel trigger an automatic vMotion to hosts where the cache is still alive? When I replace the faulty device, will I need a reboot to get the cache back online?

SLASH5K200 (accepted solution)
7,310 Views

That's a good question.

I just kicked off a SQLIO run on a test VM with the Flash Accel cache enabled and, after a few minutes, removed the datastore that holds our RDMps for Flash Accel on the host where this VM lives, to simulate the loss of a cache device.

The VM stayed alive running the SQLIO test, and Flash Cache started to kick in. The VM was fine.

Flash Accel hit ratio %: [screenshot not preserved]

Flash Cache: [screenshot not preserved]

When I look at the Flash Accel homepage, it now also says: [screenshot not preserved]

To repair this, I migrated the VM to another host that had access to the existing RDMp datastore, disabled the cache for the VM, and enabled it again. Now it's working again, no restart required!

If you ask me, that's pretty cool!

Hope that helps.

C

borismekler
7,219 Views

Thank you very much, this helps indeed - now I know I can safely deploy single SSDs, or RAID0 if I need more capacity.

liviu_ianasi
7,219 Views

Hi everyone,

I'm also testing Flash Accel and followed your example with the RDMp datastore failure test. When I offline the LUN/datastore, my test VM shuts down: VMware HA kicks in and tries to restart the VM on another host.

SLASH5K200
7,219 Views

Interesting...

My environment:

  • Cisco UCS
    • B200 M3 blade, 256 GB of RAM, LSI 400 GB SLC WarpDrive
    • ESXi 5.0 - current patch set
      • Windows 2008 R2 - all current patches, including additional patches required for MPIO, SnapDrive, etc.

  • NetApp 3240AE
    • Clustered ONTAP 8.1.3
    • Flash Cache
    • SAS

I presented the datastore to each ESXi host via iSCSI, and I ran my test against an iSCSI LUN presented within a Windows host configured with clustered file services (this was a test HA SQL environment).

I'm not in a position to do any re-testing against the operating system drive hosted on a VMDK, which may be the difference here.

Cheers,

Chris

Message was edited by: Chris Anders. Added LSI card to B200 M3 spec.

borismekler
7,219 Views

Wait one - I was under the impression that the current version of Flash Accel doesn't support MSCS. I have a few environments similar to what you tested (Windows Server 2008 R2 on top of vSphere, with in-guest iSCSI LUNs used for SQL Server 2008 R2 on MSCS) which could benefit from Flash Accel (the filers are FAS2040/2220/2240, so Flash Cache is not an option). But when I asked in a recent NetApp/LSI webcast about Flash Accel whether MSCS is supported, I was told that it's not supported in 1.2 and may be added in 1.3. Was that incorrect?

SLASH5K200
7,219 Views

Interesting...

From the Flash Accel GUI I could see the mapped LUNs on both hosts; however, only one of the hosts had the LUNs mounted and was writing to them.

Active Node: [screenshot not preserved]

Passive Node: [screenshot not preserved]

I gave 10 GB of cache to each of the two SQL hosts with migration enabled, which meant I burnt 20 GB of cache on each blade (each SQL host was on a separate blade).
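A rough sketch of that cache arithmetic in Python, for anyone sizing a similar setup. It assumes (this is inferred from the numbers in this test, not from Flash Accel documentation) that with migration enabled each VM's cache allocation is reserved on every blade it may move to. The function name and the VM and blade names are made up for illustration.

```python
from typing import Dict


def reserved_cache_gb(
    placement: Dict[str, str],
    cache_gb: Dict[str, int],
    migration_enabled: bool,
) -> Dict[str, int]:
    """Toy model of per-blade cache reservation (an assumption, not product behaviour).

    placement maps VM name -> blade it currently runs on.
    cache_gb maps VM name -> cache size assigned to that VM, in GB.
    """
    blades = sorted(set(placement.values()))
    reserved = {blade: 0 for blade in blades}
    for vm, home in placement.items():
        # Assumed rule: with migration enabled, every blade sets aside
        # space for the VM; otherwise only the blade it runs on does.
        targets = blades if migration_enabled else [home]
        for blade in targets:
            reserved[blade] += cache_gb[vm]
    return reserved


# Hypothetical names for the two SQL hosts and the two blades.
placement = {"sql-a": "blade-1", "sql-b": "blade-2"}
cache = {"sql-a": 10, "sql-b": 10}

print(reserved_cache_gb(placement, cache, migration_enabled=True))
print(reserved_cache_gb(placement, cache, migration_enabled=False))
```

Under that assumption, two 10 GB allocations with migration enabled come to 20 GB reserved on each blade, which matches what was observed here; without migration the model gives 10 GB per blade.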

I did some simple testing: I ran some I/O and watched the cache do its job, then failed over to the other node, re-ran some tests, and watched the second cache do its job.

(The screenshots above don't represent that test. I just pulled them now, and the server has since been restarted.)

The cache was cold as I migrated between SQL hosts, but that was to be expected. To be honest, I didn't even check whether this configuration was supported; I just tested it since 1.2 supports iSCSI within the host, and to my surprise it did the job!

*shrug*

I'm not saying it's "supported", but it certainly passed the "wow, this is cool, let's try this in UAT" test!

Cheers,

Chris

liviu_ianasi
7,219 Views

My environment looks like this:

Dell R720

  • ESXi 5.1, latest patches
    • Fusion-io ioDrive2, 750 GB, latest SCSI driver for ESXi 5.1

NetApp 2240AA

  • NFS exports for VMware datastores

Test Machine

  • Windows 2008 R2 x64 - no MPIO or any special apps installed. Just Iometer for testing.

Only one ESXi host is involved in the test, but it is part of a cluster configured with HA and DRS.

To make Flash Accel work, I presented an iSCSI LUN to the ESXi host to store the pRDM file. All other VM disks are on an NFS datastore.

Everything works fine until I offline the LUN presented over iSCSI. At that moment ESXi throws an error that it cannot find the raw disk and shuts down the VM to restart it on another host. No other host has the iSCSI datastore (for the moment), so the VM remains powered off.

I think it's expected behavior for ESX HA to try to restart the VM on another host in the cluster when it loses connection to the LUN, but that is not how I would like it to react.

Anyway, losing the iSCSI datastore is not a likely scenario: the NetApp is active/active, so making iSCSI redundant is no problem. I will do some more tests, but this time I will actually fail the Fusion-io card to see the result.

SLASH5K200
6,194 Views

So the main difference I see, apart from the ESXi version, is that you're using the Fusion-io card and I'm using the LSI card, which I believe is presented to the ESXi host differently.

I'm not sure how else you can simulate the card failure without physically pulling it out, but I'm interested to see how you go.

Cheers,

Chris

SLASH5K200
6,194 Views

In response to your comment:

"I think it's expected behavior from ESX HA to try and restart the VM to another host in the cluster when it loses connection to the LUN but that is not the way I'd wish it should react."

That seems a tad odd to me. I have seen datastores go missing on me numerous times, especially when demonstrating NFS failover on 7-Mode installs to customers, and instead of dying, the VM pauses while ESXi tries to resolve the missing datastore.

Interesting...

liviu_ianasi
6,194 Views

The VM will pause for a given amount of time while ESX tries to restore the NFS datastore (if you use NetApp VSC, those settings, such as the NFS timeout, are applied by VSC), but eventually HA will restart it on a different host.

For the Fusion-io card I've used a dedicated driver, so I think I'm OK on that part.

Did you enable the cache on the OS disk as well, or only on the disk presented as an iSCSI LUN directly in the VM (Windows iSCSI software initiator)?

chittur
6,194 Views

Chris,

Have you deployed Flash Accel in a production environment?

Thanks!

Kumar

Flash Accel TME.