ONTAP Discussions

FabricPool Testing and unexpected results


Our organization is planning to deploy FabricPool, and we are doing some testing prior to rollout.  We want to see how it behaves, what caveats there may be, and how it performs.

We have StorageGRID as our S3 target and ONTAP 9.8 on FAS.  We have successfully set up FabricPool and have a set of test volumes; some have their tiering policy set to 'auto' and some to 'snapshot-only'.  We have confirmed (using 'volume show-footprint') that most of the data on the volumes has been sent to the object store.
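For reference, this is roughly what the setup and verification looked like from the cluster shell (SVM and volume names here are placeholders, and output is abbreviated):

```
::> volume modify -vserver svm1 -volume testvol1 -tiering-policy snapshot-only
::> volume modify -vserver svm1 -volume testvol2 -tiering-policy auto
::> volume show-footprint -volume testvol1
```

The footprint output includes a capacity-tier line showing how many blocks (and what percentage of the volume) have been moved to the object store.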


One of our tests asks the question, "what happens if communications between ONTAP and StorageGRID are lost or interrupted?"  To simulate a failure, we removed the StorageGRID HA Group (the virtual IP) which ONTAP uses to talk to the S3 bucket.  As expected, ONTAP throws error events and labels the object store as 'unavailable'.
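The object store's availability can be checked from the cluster shell; something like the following (object-store name is a placeholder, output abbreviated):

```
::> storage aggregate object-store show
Aggregate    Object Store Name    Availability
---------    -----------------    ------------
aggr1        sg_bucket_01         unavailable
```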


Now, my assumption is that if I attempt to access snapshots or files on my tiered volume, I should get a failure or error.  But I have not been able to make it fail.


Example: one volume has the snapshot-only tiering policy, and the volume footprint shows that all the snapshots have been sent to the capacity tier.  Yet I can copy data out of a number of snapshots just fine, even though the StorageGRID is effectively off the network.

Similarly, for the volume with the 'auto' policy, nearly 100% of the data blocks (396 GB of a 400 GB volume) are on the capacity tier, yet I can read at least 20 GB of data from the volume.


Not a major issue, but it is confusing that I can still read data blocks which were allegedly removed from the performance tier.
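One conceptual model that is consistent with this behavior (a sketch only, with hypothetical names; this is not ONTAP's actual internals): a block that the footprint reports as "tiered" may still have a local copy on the performance tier, and reads only need the object store when no local copy exists.

```python
# Conceptual sketch of a FabricPool-style read path (hypothetical names,
# NOT ONTAP internals): a block counted as "tiered" can still be served
# locally if a copy remains on the performance tier.

class ObjectStoreUnavailable(Exception):
    pass

class CompositeAggregate:
    def __init__(self):
        self.performance_tier = {}   # block_id -> data (local SSD/HDD)
        self.capacity_tier = {}      # block_id -> data (S3 object store)
        self.object_store_online = True

    def tier_block(self, block_id):
        """Copy a cold block to the object store. The local copy may
        linger until the space is actually needed for new writes."""
        if not self.object_store_online:
            raise ObjectStoreUnavailable("cannot tier: object store offline")
        self.capacity_tier[block_id] = self.performance_tier[block_id]
        # The footprint now counts this block as tiered, but the local
        # copy has not necessarily been evicted yet.

    def read(self, block_id):
        if block_id in self.performance_tier:   # local copy or cache hit
            return self.performance_tier[block_id]
        if not self.object_store_online:
            raise ObjectStoreUnavailable("block exists only on capacity tier")
        return self.capacity_tier[block_id]

agg = CompositeAggregate()
agg.performance_tier["b1"] = b"hello"
agg.tier_block("b1")              # footprint says "tiered"
agg.object_store_online = False   # simulate losing the SG VIP
print(agg.read("b1"))             # → b'hello', served from the local copy
```

Under this model, a read only fails during the outage when the requested block genuinely has no local copy left, which would explain why the test reads keep succeeding.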


Any hints on how FabricPool really works under the covers?



Update / new information:


To make REALLY sure that the ONTAP cluster could not communicate with the StorageGRID, we also disabled the intercluster LIFs on our test cluster (as stated above, we had already deleted the StorageGRID VIP that was used for communication).


To run another test, we changed the tiering policy on one of our volumes from 'none' to 'all'.


To our surprise, ONTAP seems to be showing that the data from the volume has been tiered.  We watched 'volume show-footprint' go from 0% on the capacity tier to 93%.  We have no idea where the data went, or if it was moved at all.


We can still see the volume and read its files over a CIFS connection.  


Not sure if you have read this link:  Learn about Cloud Tiering (netapp.com).  I think you have to consider the cooling periods of the tiering policies.   As mentioned already, tiering is a background, asynchronous process.


We know the data has been tiered...

When we set the tiering policy of a volume to 'all', WAFL immediately begins tiering data to the object store.  The cooling period does not come into play in this case.


We can see the objects being created in the S3 bucket, and 'volume show-footprint' indicates that over 90% of the blocks have been tiered.  We have proven this.
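For example, the objects can be listed directly against the StorageGRID S3 endpoint (bucket name and endpoint URL here are placeholders for our environment):

```
$ aws s3 ls s3://fabricpool-bucket --endpoint-url https://sg-gateway.example.com:10443
```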


Once that is done, our test disables network communication between WAFL and the object store.  ONTAP alerts us that the object store is unavailable.


After tiering 93% of blocks to the object store and blocking communication, I can still read ALL of the data from the volume.  This goes against our assumptions, so we simply want to understand it in more detail.



That is odd. I would want to see it. Do you have a Support case open? I believe you, but it should give an error or something at a minimum.


Data is tiered in the background, asynchronously. How long ago was the data moved? CIFS doesn't get disrupted when the object store is offline.


I'd have to look at it more to see the behavior, as it does seem odd at first glance.