
Spectrum Scale file system (GPFS) hang problem




3 Node RHEL 8.2 Cluster

Spectrum Scale 5.1.1 Replicated File System on 2 NetApp Clusters (FAS8200)


When one of the NetApp SANs goes offline (either taken offline from the NetApp classic view or because its mapping was removed on the SAN switch), the GPFS cluster hangs and only recovers when the SAN comes back online.


When the SAN is removed, multipath shows the disks as faulty, whereas in GPFS the disk status remains unchanged and is still shown as up.

The issue described above does not occur when SANs from other vendors are used.



I don't think that's related to ONTAP SAN vs. "other SANs"; the question is simply whether your MPIO is properly configured at the OS level. If it's not, then when failures happen, they won't be detected by GPFS either.




The key MPIO settings are at the OS level, but some configuration still has to be done in Spectrum Scale (the NSD-related configuration steps).
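As a rough illustration of what "check the OS level" could look like, here is a minimal Python sketch that scans `multipath -ll`-style text and flags maps that have no active paths while the `queue_if_no_path` feature is set. With that feature, dm-multipath queues I/O instead of returning an error once all paths are lost, so the layer above never sees a failure. Whether that is the cause in this particular setup is an assumption, not something the thread confirms. The sample text, the WWIDs, and the function names `summarize` and `hang_candidates` are made up for the example.

```python
import re

# Illustrative text in the style of `multipath -ll`; WWIDs and devices are made up.
SAMPLE = """\
mpatha (3600a0980aaaaaaaaaaaaaaaaaaaaaaaa) dm-0 NETAPP,LUN C-Mode
size=10G features='3 queue_if_no_path pg_init_retries 50' hwhandler='1 alua' wp=rw
|-+- policy='service-time 0' prio=0 status=enabled
| |- 1:0:0:0 sdb 8:16 failed faulty running
| `- 2:0:0:0 sdc 8:32 failed faulty running
mpathb (3600a0980bbbbbbbbbbbbbbbbbbbbbbbb) dm-1 NETAPP,LUN C-Mode
size=10G features='0' hwhandler='1 alua' wp=rw
`-+- policy='service-time 0' prio=50 status=active
  |- 3:0:0:0 sdd 8:48 active ready running
  `- 4:0:0:0 sde 8:64 failed faulty running
"""

# Path line: H:C:T:L, device name, major:minor, dm state, path state, online state.
PATH_RE = re.compile(r"(\d+:\d+:\d+:\d+)\s+(\S+)\s+\d+:\d+\s+(\S+)\s+(\S+)\s+(\S+)")
FEATURES_RE = re.compile(r"features='([^']*)'")


def summarize(text):
    """Return {map_name: {"active": n, "failed": n, "queues_io": bool}}."""
    maps = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        path = PATH_RE.search(line)
        if path and current is not None:
            # Count the path by its device-mapper state (third captured field).
            key = "active" if path.group(3) == "active" else "failed"
            maps[current][key] += 1
        elif line.startswith("size="):
            feats = FEATURES_RE.search(line)
            if feats and current is not None:
                maps[current]["queues_io"] = "queue_if_no_path" in feats.group(1).split()
        elif line[0] not in " |`":
            # A line that starts at column 0 and is not a tree line opens a new map.
            current = line.split()[0]
            maps[current] = {"active": 0, "failed": 0, "queues_io": False}
    return maps


def hang_candidates(maps):
    """Maps with no usable path whose I/O would be queued rather than failed."""
    return [name for name, m in maps.items() if m["active"] == 0 and m["queues_io"]]


if __name__ == "__main__":
    print(hang_candidates(summarize(SAMPLE)))
```

On a real host the text would come from running `multipath -ll` as root rather than from a hard-coded string. Maps reported by `hang_candidates` are the ones where I'd look at the queueing-related multipath settings first.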


GPFS should use DevMapper to detect that some disks are down and then use the replicas, unless strict replication is configured but can't be satisfied.






What you could do is create one test VM, connect it to ONTAP (I assume this is FC or iSCSI?) and create a single-node Spectrum Scale cluster in the VM. First confirm correct behavior (failure, recovery) at the OS level, and then confirm that Spectrum Scale properly uses those multipath devices. If Spectrum Scale isn't correctly handling disk failures (including complete path disconnection), then the NSDs are not correctly configured (or DevMapper isn't, but I assume you wouldn't test GPFS unless all the DevMapper tests had passed).
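To make the second half of that test concrete, here is a small Python sketch of the comparison I'd do after pulling a path: line up what the OS reports per multipath device with what Spectrum Scale reports per NSD, and list the disagreements. All three inputs are hypothetical hand-built dictionaries (in practice they could be collected from `multipath -ll` and `mmlsdisk`), and the NSD and device names are invented for the example.

```python
def find_mismatches(nsd_to_mpath, active_paths, gpfs_availability):
    """Compare OS-level path counts with the availability GPFS reports.

    nsd_to_mpath:      {nsd_name: multipath_device}   (hypothetical mapping)
    active_paths:      {multipath_device: number_of_active_paths}
    gpfs_availability: {nsd_name: "up" or "down"}
    """
    mismatches = []
    for nsd, mpath in sorted(nsd_to_mpath.items()):
        paths = active_paths.get(mpath, 0)
        avail = gpfs_availability.get(nsd, "unknown")
        if paths == 0 and avail == "up":
            # The symptom from the original post: OS sees a failure, GPFS does not.
            mismatches.append((nsd, mpath, "GPFS reports up, but no active paths"))
        elif paths > 0 and avail == "down":
            mismatches.append((nsd, mpath, "paths are back, but GPFS still reports down"))
    return mismatches


if __name__ == "__main__":
    # Made-up state after disconnecting the storage behind mpatha.
    nsd_to_mpath = {"nsd_site_a": "mpatha", "nsd_site_b": "mpathb"}
    active_paths = {"mpatha": 0, "mpathb": 2}
    gpfs_availability = {"nsd_site_a": "up", "nsd_site_b": "up"}
    for row in find_mismatches(nsd_to_mpath, active_paths, gpfs_availability):
        print(row)
```

An empty result during the failure test would mean both layers agree about which disks are gone. A row like the first one reproduces the behavior described in the question and points at configuration rather than at the storage vendor.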




Generally speaking, E-Series would be a better choice of back-end storage for Spectrum Scale, but just as with ONTAP, MPIO would have to work and Spectrum Scale would have to be correctly configured for a similar setup to survive the failure of a replica system.