ONTAP Hardware

**[SolidFire SF4805] All 10 Drives Failed Simultaneously — Suspected SAS Controller Failure**

wv

Hi Community,

I'm dealing with a SolidFire SF4805 node issue and would appreciate any advice, especially since this unit is out of warranty.

---

**Environment**
- Cluster: 4-node SolidFire SF4805, Element OS 11.8.0.23
- Affected Node: SF03 (Node ID: 3)
- Cluster is still operational on the remaining 3 nodes

---

**Symptoms**
- All 10 Samsung SSDs on SF03 show Status = **failed** simultaneously
- Active Drives on SF03 = 0
- Node Status = active (node is online but serving no storage)
- Replication Port = "-" (not participating in cluster replication)

**Active Alerts:**
- `hardwareConfigMismatch` — MPTSAS_BIOS_VERSION = Unknown (expected != Unknown)
- `hardwareConfigMismatch` — MPTSAS_FIRMWARE_VERSION = Unknown (expected != Unknown)
- `irqBalanceFailed` — mpt3sas0-msix0 through msix7 interrupts not found
- `networkConfig` — eth1 and eth3 down
- `notUsingLACPBondMode` — Bond10G not using LACP

**Hardware Check Output (xCheck):**
```
MPTSAS_BIOS_VERSION: Passed=false, actual=Unknown
MPTSAS_FIRMWARE_VERSION: Passed=false, actual=Unknown
(All other components: CPU, RAM, NIC, BIOS, iDRAC → Passed=true)
```
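For anyone wanting to reproduce the check: the per-drive state can also be pulled programmatically via the Element API's `ListDrives` method. Here's a minimal sketch of filtering that response for failed drives on one node (the response shape with `driveID`/`nodeID`/`status` fields is my assumption from the API docs; the sample values are illustrative, not my actual drive IDs):

```python
import json

def failed_drives_for_node(list_drives_response: str, node_id: int) -> list:
    """Return driveIDs reported as 'failed' for one node.

    Assumes the Element API ListDrives response shape:
    {"result": {"drives": [{"driveID": ..., "nodeID": ..., "status": ...}]}}
    """
    drives = json.loads(list_drives_response)["result"]["drives"]
    return [d["driveID"] for d in drives
            if d["nodeID"] == node_id and d["status"] == "failed"]

# Canned two-drive response (illustrative values only):
sample = json.dumps({"result": {"drives": [
    {"driveID": 21, "nodeID": 3, "status": "failed"},
    {"driveID": 22, "nodeID": 2, "status": "active"},
]}})
print(failed_drives_for_node(sample, 3))  # → [21]
```

On a live cluster the JSON would come from a POST to `https://<MVIP>/json-rpc/11.0` with `{"method": "ListDrives", "params": {}, "id": 1}`.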

---

**My Analysis**
All symptoms point to the **SAS HBA controller (mpt3sas / LSI) being undetectable** by the OS. Since all 10 drives failed at exactly the same time rather than individually, I believe the drives themselves are likely still healthy — the controller is simply not being recognized on the PCIe bus.
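If anyone with a support shell on their node wants to sanity-check the same theory, the quickest test of "controller not recognized on the PCIe bus" is whether `lspci` still enumerates the SAS function at all. A small sketch of that scan as a pure text filter (the device string is an example of what a healthy LSI/Broadcom HBA line looks like, not a confirmed SF4805 part):

```python
import re

def find_sas_hbas(lspci_output: str) -> list:
    """Return lspci lines that look like an LSI/Broadcom SAS HBA.

    Pure text scan so it can run against saved output; on a live system
    you would feed it subprocess.run(["lspci"], capture_output=True,
    text=True).stdout instead.
    """
    pattern = re.compile(r"(LSI|Broadcom|SAS[23]\d{3})", re.IGNORECASE)
    return [line for line in lspci_output.splitlines()
            if "SAS" in line and pattern.search(line)]

# Illustrative healthy output — one enumerated SAS-3 HBA:
healthy = ("01:00.0 Serial Attached SCSI controller: "
           "LSI Logic SAS3008 PCI-Express Fusion-MPT SAS-3")
print(find_sas_hbas(healthy))  # one match → HBA enumerated
print(find_sas_hbas(""))       # empty → HBA missing from the bus
```

An empty result (and no `mpt3sas` lines in `dmesg`) would match the xCheck `Unknown` values above: the OS never bound the driver because the device isn't visible.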

Firmware versions are also behind:
- iDRAC: running 2.40.40.40 (current: 2.75.75.75)
- BIOS: running 2.2.5 (current: 2.8.0)

---

**Questions**
1. Can anyone confirm this is a SAS controller hardware failure rather than a firmware/software issue?
2. What is the exact SAS controller model used in the SF4805? (Trying to source a replacement)
3. Has anyone successfully replaced the SAS controller on an SF-series node and recovered the drives/data?
4. If I reseat or replace the controller and the node rejoins the cluster, will Element OS automatically re-add the drives, or is there a manual process?
5. Any risk of data loss on the drives themselves if the controller is replaced?
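Regarding question 4, my current understanding (please correct me) is that after a controller replacement the drives typically reappear as **available** rather than rejoining automatically, and must be re-added via the UI (Cluster → Drives → Add) or the `AddDrives` API method. A sketch of building that JSON-RPC request body, assuming the documented request shape (drive IDs here are placeholders):

```python
import json

def build_add_drives_payload(drive_ids: list) -> str:
    """Build an Element API AddDrives JSON-RPC request body.

    Assumes the request shape: method "AddDrives" with a "drives" list
    of {"driveID": n} objects, POSTed to https://<MVIP>/json-rpc/11.0
    (version per cluster).
    """
    return json.dumps({
        "method": "AddDrives",
        "params": {"drives": [{"driveID": d} for d in drive_ids]},
        "id": 1,
    })

print(build_add_drives_payload([21, 22]))
```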

---

**What I've Tried**
- Verified all alerts in Element OS UI (Reporting → Alerts)
- Confirmed Node Details: all 10 drive slots showing failed
- Reviewed hardware check JSON output
- Cross-referenced firmware versions against NetApp docs: https://docs.netapp.com/us-en/element-software/hardware/fw_storage_nodes.html#sf_nodes

---

Any guidance is greatly appreciated. Thanks in advance!
