ONTAP Discussions
Hello All,
I am in some trouble here and would appreciate some help in resolving the issue.
I have two filers, Filer1 and Filer2, configured as an HA pair. Both are FAS3160 systems running Data ONTAP 8.1.1 7-Mode. When I log in to OnCommand System Manager and go to "HA Configuration", it shows the Interconnect status under both filers as Down (red X). Under Active/Active State, it shows Failover under Filer1 and Takeover under Filer2. Please check the attached screenshot.
How can I get the HA config back on my filers?
Thanks - AG
Are the two physical filers up and running? Please check by logging to the console.
Thank you, Aleksandar, for your reply.
Yes, both filers are up and I can ping both of them from my PC. I can also connect to both filers through OnCommand System Manager and work on volumes, etc. Only the HA status shows as in the earlier attachment.
I also connected to Filer-2 over SSH, and this is what I get there:
Filer-2(takeover)> Thu May 16 10:00:00 AST [Filer-2:kern.uptime.filer:info]: 10:00am up 3 days, 1:36 5 NFS ops, 219901 946 CIFS ops, 0 HTTP ops, 0 FCP ops, 0 iSCSI ops
Thu May 16 10:00:00 AST [Filer-2:monitor.shelf.fault:CRITICAL]: Fault reported on disk storage shelf attached to channel 0a . Please check fans, power supplies, disks, and temperature sensors.
Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyRVnag:error]: Cluster Interconnect sessions with partner have been DOWN for 4414 minute(s)
Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyNicDownTime:info]: Interconnect adapter link #0 has been down for 4414 minutes
Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyNicDownTime:info]: Interconnect adapter link #1 has been down for 4414 minutes
Thanks - AG
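For anyone reading the log above, it can help to put the 4414 minutes of interconnect downtime next to the reported uptime of 3 days, 1:36. The following is only a small arithmetic sketch; the helper name split_minutes is made up for illustration and the numbers are copied from the messages quoted above.

```python
def split_minutes(total_minutes):
    """Split a minute count into (days, hours, minutes)."""
    days, rem = divmod(total_minutes, 24 * 60)
    hours, minutes = divmod(rem, 60)
    return days, hours, minutes

# Values taken from the Filer-2 console messages quoted above.
interconnect_down_minutes = 4414
uptime_minutes = 3 * 24 * 60 + 1 * 60 + 36  # "up 3 days, 1:36"

print(split_minutes(interconnect_down_minutes))    # (3, 1, 34)
print(uptime_minutes - interconnect_down_minutes)  # 2
```

So, going by these two messages, the interconnect appears to have been down for almost the whole time Filer-2 has been up, starting roughly two minutes after it booted.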
Yes, but Filer-2 is in takeover mode. How are the two filers connected to each other? Check the cables.
Unfortunately the filers are at a remote location so I can't check cables. Is there a way to check things remotely? Sorry.
If the cables are disconnected, you will just see the ports offline without knowing why.
Execute ifconfig -a on both filers to see whether the interconnect ports are down on both of them.
I am not able to SSH to Filer-1 as it says "Shell not supported on takeover partner"
On filer-2, this is the result of ifconfig -a
Filer-2(takeover)> ifconfig -a
e0a: flags=0x2fec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 1500
inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x
partner inet 10.166.x.x (e0a)
ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full
e0b: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
e3a: flags=0x170e866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
e3b: flags=0x170e866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
e4a: flags=0x2fec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 1500
inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x
partner inet 10.166.17.11 (e4a)
ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full
e4b: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
e4c: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
e4d: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
e0M: flags=0x2bec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM,MGMT_PORT> mtu 1500
inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x noddns
partner inet 10.166.x.x (e0M)
ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full
lo: flags=0x1be8049<UP,LOOPBACK,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 8160
inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1
ether 00:00:00:00:00:00 (VIA Provider)
losk: flags=0x40a400c9<UP,LOOPBACK,RUNNING> mtu 9188
inet 127.0.20.1 netmask 0xff000000 broadcast 127.0.20.1
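With this many ports, a quick way to see which ones report link up or down is to scan the media string in parentheses on each "ether" line. Below is a rough sketch, assuming the output has been saved as text in the same layout as pasted above; the function name link_states and the shortened SAMPLE are illustrative only, and the script says nothing about whether any of these ports carries the HA interconnect.

```python
import re

# Shortened sample in the same layout as the ifconfig -a output above.
SAMPLE = """\
e0a: flags=0x2fec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 1500
inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x
ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full
e0b: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
"""

HEADER = re.compile(r"^(\w+): flags=")
MEDIA = re.compile(r"^ether \S+ \(([^)]*)\)")

def link_states(text):
    """Map interface name -> 'up' or 'down' based on the media string."""
    states = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        media = MEDIA.match(line)
        if media and current:
            if media.group(1).endswith("-up"):
                states[current] = "up"
            elif media.group(1).endswith("-down"):
                states[current] = "down"
            # Other media strings (e.g. loopback) are ignored.
    return states

print(link_states(SAMPLE))
```

Applied to the full output above, this would list e0a, e4a and e0M as up and the remaining Ethernet ports as down.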
OK, log in to Filer-2 and try to perform a giveback with this command: cf giveback
Even if the interconnect state under both filers shows down?
You can always try. Can you connect to the RLM on Filer-1 and check its status?
I tried cf giveback on Filer-2 and it says:
Filer-2(takeover)> cf giveback
Partner not waiting for giveback, giveback cancelled.
To do a giveback without checking for partner readiness, please either set option "cf.giveback.check.partner" to "off" before doing "cf giveback" again, or do "cf giveback -f".
The first choice disables checking for all future "cf giveback", until it's turned back to "on". The second choice is good for this giveback only.
Yes I am able to connect to RLM on Filer-1 and it says: The system has booted in maintenance mode.
May 13 05:24:49 [localhost:mgr.boot.reason_ok:notice]: System rebooted after a power down due to environmental condition.
May 13 05:24:49 [localhost:callhome.reboot.unknown:info]: Call home for REBOOT
Ipspace "acp-ipspace" created
May 13 05:24:51 [localhost:acp.configWarn:debug]: Could not configure ACP administrator due to invalid Ethernet port in maintenance mode.
May 13 05:24:53 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
May 13 05:24:59 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
May 13 05:25:05 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
May 13 05:25:11 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
May 13 05:25:17 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
May 13 05:25:23 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
halt
May 13 05:25:25 [localhost:kern.cli.cmd:debug]: Command line input: the command is 'halt'. The full command line is 'halt'.
May 13 05:25:29 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.
Do you want me to run any specific command on the RLM of Filer-1, or should I do cf giveback -f?
If there's an environmental issue which caused this, you need to figure out what it is *before* performing a giveback, and get it resolved. From the RLM, check "events" and see if it gives you any details as to what happened. Once that is resolved, whatever it may be, I'd also consider doing a diagnostic boot and running through the tests/checks just to make sure everything is fine.
Something else that comes to mind: are there no open NetApp cases for this system? I *think* the box would have asup'd when it had an environmental problem and then rebooted itself, but don't hold me to that...
Hi Mike,
Yes, there was a power outage in the DC which caused all the servers to shut down. So how do I do a diagnostic boot?
Thanks - AG
Hi Asrar,
I was looking into this myself today; here is a link to the Diagnostics guide for 31xx systems: https://library.netapp.com/ecm/ecm_download_file/ECMP1112531
This guide should give you the details you need to run the proper diagnostic tests Aleksander has mentioned. To enter diag mode, enter the following at the LOADER prompt:
LOADER> boot_diags
Hi,
I connected through the RLM and this is what I see in the messages: "Fault reported on disk storage shelf attached to channel 0a. Please check fans, power, and temperature."
Filer-2(Takeover)> environment status shelf 0a
Channel: 0a
Shelf: 2
SES device path: local access: 0a.32
Module type: ESH4; monitoring is active
Shelf status: non-critical condition
SES Configuration, via loop id 32 in shelf 2:
logical identifier=0x50050cc002112326
vendor identification=XYRATEX
product identification=DS14-Mk2-FC
product revision level=1414
Vendor-specific information:
Product Serial Number: OPS445022112326
Optional Settings: 0x00
Status reads attempted: 19272; failed: 0
Control writes attempted: 320; failed: 0
Shelf bays with disk devices installed:
13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0
with error: none
Power Supply installed element list: 1, 2; with error: none
Power Supply information by element:
[1] Serial number: PMA441920116102 Part number: <N/A>
Type: 34
Firmware version: <N/A> Swaps: 0
[2] Serial number: PMA441920116091 Part number: <N/A>
Type: 34
Firmware version: <N/A> Swaps: 0
Power control element status: power control status not supported (DS14Mk4 shelf required)
Cooling Unit installed element list: 1, 2; with error: 2
Temperature Sensor installed element list: 1, 2, 3; with error: none
Shelf temperatures by element:
[1] 26 C (78 F) (ambient) Normal temperature range
[2] 36 C (96 F) Normal temperature range
[3] 35 C (95 F) Normal temperature range
Temperature thresholds by element:
[1] High critical: 50 C (122 F); high warning: 40 C (104 F)
Low critical: 0 C (32 F); low warning: 10 C (50 F)
[2] High critical: 63 C (145 F); high warning: 53 C (127 F)
Low critical: 0 C (32 F); low warning: 10 C (50 F)
[3] High critical: 63 C (145 F); high warning: 53 C (127 F)
Low critical: 0 C (32 F); low warning: 10 C (50 F)
ES Electronics installed element list: 1, 2; with error: none
ES Electronics reporting element: 1
ES Electronics information by element:
[1] Serial number: IMS6981331F5949 Part number: <N/A>
CPLD version: <N/A> Swaps: 0
[2] Serial number: IMS6981331F597B Part number: <N/A>
CPLD version: <N/A> Swaps: 0
Embedded Switching Hub installed element list: 1, 2; with error: none
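In a long dump like this, the part that matters is usually the "with error:" field at the end of each "installed element list" line. As a rough sketch (the function name faulty_elements and the trimmed SAMPLE are illustrative, and it only handles the single-line form of these entries, not the multi-line disk bay list), the flagged elements can be pulled out like this:

```python
import re

# A few lines copied from the environment status output above.
SAMPLE = """\
Power Supply installed element list: 1, 2; with error: none
Cooling Unit installed element list: 1, 2; with error: 2
Temperature Sensor installed element list: 1, 2, 3; with error: none
ES Electronics installed element list: 1, 2; with error: none
"""

PATTERN = re.compile(
    r"^(.+?) installed element list: ([\d, ]+); with error: (.+)$"
)

def faulty_elements(text):
    """Return {element type: [element numbers reported with error]}."""
    faults = {}
    for raw in text.splitlines():
        match = PATTERN.match(raw.strip())
        if not match:
            continue
        name, _installed, errors = match.groups()
        if errors.strip().lower() != "none":
            faults[name] = [int(n) for n in errors.split(",")]
    return faults

print(faulty_elements(SAMPLE))  # {'Cooling Unit': [2]}
```

In the output pasted above, the only element reported with an error is cooling unit 2; the power supplies, temperature sensors and ES electronics all show "none".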
Take a look at the solution section of these KB articles:
https://kb.netapp.com/support/index?page=content&id=2011427&locale=en_US
https://kb.netapp.com/support/index?page=content&id=2013504&locale=en_US
https://kb.netapp.com/support/index?page=content&id=2013351&locale=en_US
Basically, check power, cabling, etc. (I've reseated the PSU before and the alerts have cleared). If it still shows a fault after all of that, swap the PSU with a replacement.