Solved: ONTAP Behaviours Dealing with FabricPool & StorageGRID and Connectivity Issues

JFM · 2022-05-02

Our customer is in the final stages of putting their brand-new StorageGRID environment into service for use with FabricPool (source = A400). They are asking a few very tricky questions that the SG documentation does not seem to cover. All of the questions relate to the resiliency of the environment (A400 + SG), specifically the side effects of possible network failures between the A400 and the SG.

1. For read operations, more specifically the retrieval of cold blocks from the SG to the performance tier: what is the configured timeout value (in ms) before an I/O error is returned to the clients? Is this value exposed anywhere via the command line on ONTAP or SG? More details about the read logic between the cold tier and the performance tier in that situation would be very welcome.
2. For write operations, we understand that data migration from the performance tier to the capacity tier (SG) is less impacted by a network link failure (this is an assumption and we could be wrong here, as ONTAP would probably retry “put/write” operations before giving up on the data migration process). What is the actual behaviour in that scenario?
3. For all types of I/O operations: what happens if the primary load balancer fails? How long before ONTAP retries a write or read operation to the capacity tier through the other load balancer? What logic is used to switch to the other load balancer? Is the load balancer aware of the current network status and active topology? Is ONTAP informed when the performance tier cannot use one load balancer over the other? Again, the more details, the better.

As you might rightly guess, our customer is wary of using FabricPool with a StorageGRID instance without a good understanding of how ONTAP handles the various conditions involving the network and the load balancers.

Presales SE at ESI Technologies

JFM · 2022-05-12

I'm surprised to see a lack of responses to this question. Maybe it was poorly worded. I was expecting some input, especially for question #3. I'll be patient on this one. 🙂

Mjizzini · 2022-05-12

1. When the object store is unavailable, data tiering from the ONTAP performance tier to the capacity tier is suspended, and data cannot be retrieved from the capacity tier.
Reasons for the object store to become unavailable:
There is no response from the capacity tier for a thousand consecutive S3 operations.
There is no response from the capacity tier for two minutes.
Requests to the capacity tier time out continuously (10-second timeout).
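
To make the three conditions above easier to reason about, here is a rough Python model of them. It is only an illustrative sketch and not ONTAP's actual implementation: the class, method, and constant names are hypothetical, the thresholds are simply the figures quoted above, and treating a request that exceeds the 10-second timeout as a "no response" is my own assumption.

```python
from dataclasses import dataclass

# Thresholds quoted in the post above. How ONTAP combines them internally
# is an assumption made for this sketch.
MAX_CONSECUTIVE_FAILURES = 1000   # "thousand consecutive S3 operations"
MAX_SILENCE_SECONDS = 120.0       # "two minutes"
REQUEST_TIMEOUT_SECONDS = 10.0    # per-request timeout


@dataclass
class ObjectStoreHealth:
    """Hypothetical tracker of S3 request outcomes against the capacity tier."""
    consecutive_failures: int = 0
    last_success_time: float = 0.0

    def record(self, now: float, elapsed: float, responded: bool) -> None:
        """Record one S3 request that finished (or was abandoned) at time `now`."""
        # Assumption: a request fails if it got no response or hit the timeout.
        if responded and elapsed < REQUEST_TIMEOUT_SECONDS:
            self.consecutive_failures = 0
            self.last_success_time = now
        else:
            self.consecutive_failures += 1

    def is_unavailable(self, now: float) -> bool:
        """Return True if either threshold from the list above is crossed."""
        too_many = self.consecutive_failures >= MAX_CONSECUTIVE_FAILURES
        too_quiet = (
            self.consecutive_failures > 0
            and now - self.last_success_time >= MAX_SILENCE_SECONDS
        )
        return too_many or too_quiet
```

In this model a single successful request resets the failure counter, so the store would be considered available again as soon as the capacity tier answers. Whether ONTAP recovers that quickly is not something this thread confirms.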

Regarding client retries:

SMB client: Retries are client dependent.
NFS client: Retries after 5 seconds and hangs until connectivity is reestablished.
SAN client: The application might need to be restarted so that the read can be retried.
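
For applications that should survive a short outage without a restart, an application-level retry around the read is one option. The following is a minimal sketch under my own assumptions: `read_with_retry` is a hypothetical helper, the 5-second delay merely mirrors the NFS figure above, and whether a retry at this level helps at all depends on the protocol and on how the client surfaces the error.

```python
import time
from typing import Callable


def read_with_retry(
    path: str,
    attempts: int = 5,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Read a whole file, retrying on I/O errors up to `attempts` times."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            with open(path, "rb") as handle:
                return handle.read()
        except OSError:
            # Give up after the last attempt and let the caller see the error.
            if attempt == attempts - 1:
                raise
            # Otherwise wait before retrying, in case connectivity comes back.
            sleep(delay)
    raise AssertionError("unreachable")
```

The `sleep` parameter exists only so the wait can be replaced in tests. Note that a hard-mounted NFS client that hangs will block inside the read itself, so this loop only comes into play when an error is actually returned.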

2. As long as there is no risk of filling up the performance tier, writes will not be affected.

3. Regarding the primary and secondary load balancers, I don't have exact timeouts. The links below may help.

StorageGRID load balancer Third-party and global load balancers

Load Balancing NetApp StorageGRID guide
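
Since exact failover timeouts are not available, one way to get a feel for them is to probe both load balancer endpoints from an admin host while pulling a link in a test window. The sketch below does a plain TCP reachability check and returns the first endpoint that answers. It says nothing about how ONTAP itself picks an endpoint, and the host names and port in the usage note are hypothetical placeholders.

```python
import socket
from typing import Callable, Optional, Sequence, Tuple

Endpoint = Tuple[str, int]  # (host, port)


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the endpoint succeeds within timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        return False


def first_reachable(
    endpoints: Sequence[Endpoint],
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> Optional[Endpoint]:
    """Return the first endpoint, in order, that the probe reports as reachable."""
    for endpoint in endpoints:
        if probe(endpoint):
            return endpoint
    return None
```

For example, `first_reachable([("sg-lb1.example.com", 443), ("sg-lb2.example.com", 443)])` would return the second tuple if only the second load balancer accepts connections, or `None` if neither does. Running it in a loop with timestamps during a failover test gives a rough measure of how long the switch takes from the network's point of view.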

JFM · 2022-05-13

Now we're talking! Thanks for the heads up! Food for thought, certainly.