SolidFire and HCI

[SolidFire SF4805] All 10 Drives Failed Simultaneously — Suspected SAS Controller Failure

wv

Hi Community,

I'm dealing with a SolidFire SF4805 node issue and would appreciate any advice, especially since this unit is out of warranty.

---

Environment
- Cluster: 4-node SolidFire SF4805, Element OS 11.8.0.23
- Affected Node: SF03 (Node ID: 3)
- Cluster is still operational with remaining 3 nodes

---

Symptoms
- All 10 Samsung SSDs on SF03 show Status = **failed** simultaneously
- Active Drives on SF03 = 0
- Node Status = active (node is online but serving no storage)
- Replication Port = "-" (not participating in cluster replication)

 

Active Alerts:
- `hardwareConfigMismatch` — MPTSAS_BIOS_VERSION = Unknown (expected != Unknown)
- `hardwareConfigMismatch` — MPTSAS_FIRMWARE_VERSION = Unknown (expected != Unknown)
- `irqBalanceFailed` — mpt3sas0-msix0 through msix7 interrupts not found
- `networkConfig` — eth1 and eth3 down
- `notUsingLACPBondMode` — Bond10G not using LACP

 

Hardware Check Output (xCheck):
```
MPTSAS_BIOS_VERSION: Passed=false, actual=Unknown
MPTSAS_FIRMWARE_VERSION: Passed=false, actual=Unknown
(All other components: CPU, RAM, NIC, BIOS, iDRAC → Passed=true)
```
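
For reference, the same faults can be pulled over the cluster's JSON-RPC API; here's a minimal sketch using the documented ListClusterFaults method (the MVIP, credentials, and API version below are placeholders for my cluster):

```
# Minimal sketch: list current cluster faults via the Element JSON-RPC API.
# MVIP, credentials, and the API version in the URL are placeholders.
import requests

MVIP = "10.0.0.100"           # cluster management virtual IP (placeholder)
AUTH = ("admin", "password")  # cluster admin credentials (placeholder)

resp = requests.post(
    f"https://{MVIP}/json-rpc/11.0",
    json={"method": "ListClusterFaults", "params": {"faultTypes": "current"}, "id": 1},
    auth=AUTH,
    verify=False,  # SolidFire clusters typically present a self-signed cert
)
for fault in resp.json()["result"]["faults"]:
    print(fault["nodeID"], fault["code"], fault["details"])
```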

---

My Analysis
All symptoms point to the SAS HBA (LSI, mpt3sas driver) not being detected by the OS. Since all 10 drives failed at exactly the same time rather than individually, I believe the drives themselves are likely still healthy; the controller is simply not being recognized on the PCIe bus.

Firmware versions are also behind:
- iDRAC: running 2.40.40.40 (current: 2.75.75.75)
- BIOS: running 2.2.5 (current: 2.8.0)

---

Questions
1. Can anyone confirm this is a SAS controller hardware failure rather than a firmware/software issue?
2. What is the exact SAS controller model used in the SF4805? (Trying to source a replacement)
3. Has anyone successfully replaced the SAS controller on an SF-series node and recovered the drives/data?
4. If I reseat or replace the controller and the node rejoins the cluster, will Element OS automatically re-add the drives, or is there a manual process?
5. Any risk of data loss on the drives themselves if the controller is replaced?

---

What I've Tried
- Verified all alerts in Element OS UI (Reporting → Alerts)
- Confirmed Node Details: all 10 drive slots showing failed
- Reviewed hardware check JSON output
- Cross-referenced firmware versions against NetApp docs: https://docs.netapp.com/us-en/element-software/hardware/fw_storage_nodes.html#sf_nodes

---

Any guidance is greatly appreciated. Thanks in advance!

1 Reply

elementx

First off, I can't tell you for certain what the root cause is.

I'd boot from a Live Linux CD and run diagnostics to see what exactly is broken. You could also get in through the BMC/iDRAC (root/calvin by default, or whatever it's been changed to).
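
As a first check from the Live environment: see whether the LSI HBA shows up on the PCIe bus at all. A quick sketch, equivalent to grepping `lspci -nn` output (0x1000 is the LSI Logic / Broadcom PCI vendor ID):

```
# Quick sketch: scan sysfs for any LSI Logic / Broadcom device (vendor 0x1000).
# Run from a Live Linux boot; equivalent to `lspci -nn | grep -i 1000:`.
from pathlib import Path

found = False
for dev in Path("/sys/bus/pci/devices").iterdir():
    vendor = (dev / "vendor").read_text().strip()
    if vendor == "0x1000":  # LSI Logic / Broadcom SAS
        device = (dev / "device").read_text().strip()
        print(f"SAS HBA candidate at {dev.name}: vendor={vendor} device={device}")
        found = True
if not found:
    print("No LSI/Broadcom device on the PCIe bus -> card, slot, or board problem")
```

If nothing shows up, the card (or the slot/board) is the problem; if the device is there but mpt3sas won't bind, it smells more like firmware.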

 

The drives are likely still good, although by now their data has been rebuilt (assuming there was enough free capacity on the rest of the boxes). So the data has already been "lost" and rebuilt (since RF2 is used), and you don't have to worry about the drives now. If you recover the box and bring it back online, you'll probably need to re-add the drives, as they were likely declared dead after 10-15 minutes offline. Once you add them back, SF will rebalance. If you replace the motherboard, you'd have to join the node to the cluster as a new node, and that might be risky (I don't know if there's a BIOS/model/version check in the software, so I wouldn't spend money on a new motherboard from the server maker or eBay).
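
If it comes to re-adding them, it's roughly this via the API. A sketch only: MVIP and credentials are placeholders, ListDrives/AddDrives are the documented cluster methods, and nodeID 3 is your SF03:

```
# Rough sketch: find drives reporting "available" on node 3 and re-add them.
# MVIP and credentials are placeholders; adjust the API version to your cluster.
import requests

MVIP = "10.0.0.100"
AUTH = ("admin", "password")

def rpc(method, params=None):
    r = requests.post(
        f"https://{MVIP}/json-rpc/11.0",
        json={"method": method, "params": params or {}, "id": 1},
        auth=AUTH,
        verify=False,
    )
    r.raise_for_status()
    return r.json()["result"]

drives = rpc("ListDrives")["drives"]
available = [d["driveID"] for d in drives
             if d["status"] == "available" and d["nodeID"] == 3]
print("Re-adding drives:", available)
if available:
    rpc("AddDrives", {"drives": [{"driveID": i} for i in available]})
```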

 

You can look up the hardware model in the BMC, or on the healthy boxes (via the API* from the CLI, PowerShell, etc., or the HCC UI), and try to source the failed part.

 

[*] https://docs.netapp.com/us-en/element-software/api/reference_element_api_gethardwareinfo.html
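
A minimal sketch of that GetHardwareInfo call against a healthy node's per-node API (node IP and credentials are placeholders; per-node calls go to port 442):

```
# Minimal sketch: dump a healthy node's hardware inventory via GetHardwareInfo,
# then look for the SAS controller entry. Node IP and credentials are placeholders.
import json
import requests

NODE_IP = "10.0.0.11"         # management IP of a healthy node (placeholder)
AUTH = ("admin", "password")

resp = requests.post(
    f"https://{NODE_IP}:442/json-rpc/11.0",   # per-node API endpoint
    json={"method": "GetHardwareInfo", "params": {}, "id": 1},
    auth=AUTH,
    verify=False,
)
print(json.dumps(resp.json()["result"]["hardwareInfo"], indent=2))
```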

 

P.S. SolidFire isn't ONTAP hardware, so this isn't the right board for this post.
