Problem Description: cfbi01n2 has taken over cfbi01n1
Summary: This ticket covers a failover on an IBM N7900 filer pair in which cfbi01n2 has taken over cfbi01n1. Over this period I have resolved multiple existing errors on the filers. After we fixed three intermittent FC errors, the network issue, and then the issue with the AT-FCX modules, cfbi01n1 panicked with the error below. I no longer see any hardware-related issues, or any reason for the giveback to fail, other than the panic message below. We also upgraded the software to the latest version, 8.2.5P1.
Old Panic Message:
Tue Sep 25 19:31:04 CDT [cfbi01n1:mgr.stack.string:notice]: Panic string: protection fault on VA 0 code = 0 cs:rip = 0x20:0xffffffff83f74e2e in SK process wafl_hipri on release 8.2.4P4
This panic has since cleared.
New Panic Message:
Tue Oct 9 18:41:30 CDT [cfbi01n1:mgr.partner.stack.saved:notice]: Cluster takeover has saved partPanic string: protection fault on VA 0 code = 0 cs:rip = 0x20:0xffffffff86429c01 in SK process NwkThd_03 on release 8.2.5P1
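To compare the two panic messages field by field, a small parser can help. The following is a minimal sketch that assumes the console line layout seen in the two messages above; the field names (reason, va, code, cs_rip, process, release) are illustrative labels chosen here, not official ONTAP terminology.

```python
import re

# Assumed layout, taken from the two panic lines in this ticket.
# Group names are illustrative labels, not official ONTAP field names.
PANIC_RE = re.compile(
    r"Panic string: (?P<reason>.+?) on VA (?P<va>\S+) code = (?P<code>\S+) "
    r"cs:rip = (?P<cs_rip>\S+) in SK process (?P<process>\S+) "
    r"on release (?P<release>\S+)"
)


def parse_panic(line):
    """Return the panic fields found in a console log line, or None."""
    match = PANIC_RE.search(line)
    return match.groupdict() if match else None


old_line = (
    "Tue Sep 25 19:31:04 CDT [cfbi01n1:mgr.stack.string:notice]: "
    "Panic string: protection fault on VA 0 code = 0 "
    "cs:rip = 0x20:0xffffffff83f74e2e in SK process wafl_hipri "
    "on release 8.2.4P4"
)
new_line = (
    "Tue Oct 9 18:41:30 CDT [cfbi01n1:mgr.partner.stack.saved:notice]: "
    "Cluster takeover has saved partPanic string: protection fault on VA 0 "
    "code = 0 cs:rip = 0x20:0xffffffff86429c01 in SK process NwkThd_03 "
    "on release 8.2.5P1"
)

old = parse_panic(old_line)
new = parse_panic(new_line)

# Keep only the fields whose values differ between the two panics.
changed = {key: (old[key], new[key]) for key in old if old[key] != new[key]}

for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")
```

Run against the two messages, this shows the same fault type (protection fault on VA 0) but a different instruction pointer, process (wafl_hipri versus NwkThd_03), and release.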
The following summarizes certain events and related action items.
8/6/18- Confirmed that we are not seeing any hardware errors on cfbi01n2 and cfbi01n1 after the visual triage. We checked the SFPs and cables associated with ports 2a and 8b. Requested that the NAS team perform a cf giveback.
8/15/18- While troubleshooting the 2a/8b FC data ports, we discovered that each filer node is connected to only one FC loop (cfbi01n1 to Loop A, cfbi01n2 to Loop B) for the following FC data interface ports: 2a/8b, 2b/8a, 2c/8c. The other FC data interface ports on these filers are configured for dual FC loop connections: 0a/0d, 0b/0e, 0c/0h. This results in what is called a "mixed-path" configuration (partial single path, partial dual path). In this type of configuration, the data served over the single-path ports is exposed to a single point of failure.
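The path classification described in the 8/15/18 entry can be sketched as follows. This is only an illustration: the loop assignments are transcribed by hand from the description above for cfbi01n1 (Loop A only on the single-path port pairs), not read from the filer, and the function name and labels are hypothetical.

```python
# Loops reachable per FC port pair, transcribed from the ticket text
# for cfbi01n1. This is hand-entered data, not filer output.
loops_by_port_pair = {
    "2a/8b": {"A"},
    "2b/8a": {"A"},
    "2c/8c": {"A"},
    "0a/0d": {"A", "B"},
    "0b/0e": {"A", "B"},
    "0c/0h": {"A", "B"},
}


def classify_paths(loops_by_pair):
    """Split port pairs into single-path and dual-path and label the config."""
    single = sorted(p for p, loops in loops_by_pair.items() if len(loops) < 2)
    dual = sorted(p for p, loops in loops_by_pair.items() if len(loops) >= 2)
    if single and dual:
        config = "mixed-path"
    elif single:
        config = "single-path"
    else:
        config = "dual-path"
    return config, single, dual


config, single, dual = classify_paths(loops_by_port_pair)
print(f"configuration: {config}")
print(f"single point of failure exposure: {', '.join(single)}")
print(f"dual path: {', '.join(dual)}")
```

Any port pair listed as single path loses access to its shelves if that one loop fails, which matches the exposure seen on the 2a/8b loop.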
8/17/18- To summarize the activity performed on cfbi01n1/cfbi01n2 to bring cfbi01n1 back into the HA config:
1. Recheck/reseat all cabling, SFPs and shelf modules for cfbi01n1 ports 2a/8b
2. Replacement of possibly failing module in shelf 1 for FC loop A (the only active/configured loop for cfbi01n1)
3. Test loop A connectivity for cfbi01n1 and note intermittent error with shelf 4 for FC loop A
4. Replacement of module in shelf 4 for FC loop A
5. Boot of cfbi01n1 is no longer blocked by multiple loop A errors for ports 2a/8b, and the node is in “ready for giveback” state
6. Console logs still show an intermittent error for 2a.63/8b.63 (shelf 4), but the node is considered healthy for giveback to rejoin the HA pair
7. Giveback completed for cfbi01n1, triggering autosupports from both nodes to review
8. Giveback failed within an hour
8/18/18- The NAS team received reports from clients that NAS storage was in a disconnected state on 70 ESXi hosts and that cfbi01n1-nas10.sldc.sbc.com could not be reached from the ESXi hosts or the Windows jump server. After the node was taken over again the issue cleared, which leads us to suspect a routing issue. The NAS team resolved it internally.
9/6/18- We had issues with cfbi01n1 again. 80 servers were not able to access nas10, and we could not ping the gateway (220.127.116.11) from the NetApp arrays. I had T2 do a vif failover, after which they were able to ping the gateway, but the filer panicked and failed over, and a core file was dumped. The panic occurred when we tried to fail over to the other interface in the ifgroup (from e5a to e5b).
9/10/18- Opened a case with Cisco; the module was identified as faulty, and Cisco replaced it on 9/13/18.
9/25/18- The next window for giveback was approved. We identified faults on 2c/8c prior to the filer failing back over: 8c was hard down on cfbi01n1 and 2c was hard down on cfbi01n2. Our FE remained onsite as we suspected a fault on the filer. We traced the cables from both filer heads and found the shelves in question. We noticed the LEDs were out on shelf 1, module A/In port and on shelf 6, module A/Out port. After authorization from the NAS team we replaced both I/O modules. The LED status came back to solid green and the NAS team proceeded with a cf giveback, which failed again within 1 hour.
10/9/18- Upgraded to 8.2.5P1. The giveback failed again within 1.5 hours.