HA Configuration: Interconnect Status is DOWN on both Filers

ASRARGUNA

Hello All,

I am in some trouble here and would appreciate some help in resolving the issue.

I have two filers, Filer1 and Filer2, configured as an HA pair. Both are FAS3160 systems running Data ONTAP 8.1.1 7-Mode. When I log in to OnCommand System Manager and go to "HA Configuration", it shows the Interconnect status under both filers as Down (red X). Under Active/Active State, it shows Failover under Filer1 and Takeover under Filer2. Please check the attached screenshot.

How can I get the HA config back on my filers?

Thanks - AG

aleksandar_stefanov

Are the two physical filers up and running? Please check by logging in to the console.

ASRARGUNA

Thank you, Aleksandar, for your reply.

Yes, both filers are up and I can ping both of them from my PC. I can also connect to both filers through OnCommand System Manager and work on volumes, etc. Only the HA status looks wrong, as shown in the earlier attachment.

I also connected to Filer-2 over SSH, and this is what I see there:

Filer-2(takeover)> Thu May 16 10:00:00 AST [Filer-2:kern.uptime.filer:info]:  10:00am up 3 days,  1:36 5 NFS ops, 219901  946 CIFS ops, 0 HTTP ops, 0 FCP ops, 0 iSCSI ops

Thu May 16 10:00:00 AST [Filer-2:monitor.shelf.fault:CRITICAL]: Fault reported on disk storage shelf attached to channel 0a . Please check fans, power supplies, disks, and temperature sensors.

Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyRVnag:error]: Cluster Interconnect sessions with partner have been DOWN for 4414 minute(s)

Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyNicDownTime:info]: Interconnect adapter link #0 has been down for 4414 minutes

Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyNicDownTime:info]: Interconnect adapter link #1 has been down for 4414 minutes

Thanks - AG
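
For anyone reading logs like the ones above, it can help to turn the "down for N minutes" counters into days and hours and compare them with the uptime line. Below is a minimal Python sketch of that arithmetic. The helper names (link_downtime, split_minutes) are hypothetical, and the regular expression assumes the message wording shown in this thread.

```python
import re

# Sample lines copied from the console output in this thread.
LOG_LINES = [
    "Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyNicDownTime:info]: "
    "Interconnect adapter link #0 has been down for 4414 minutes",
    "Thu May 16 10:00:10 AST [Filer-2:cf.ic.hourlyNicDownTime:info]: "
    "Interconnect adapter link #1 has been down for 4414 minutes",
]

# Assumes the message wording shown above; other releases may differ.
DOWN_RE = re.compile(r"link #(\d+) has been down for (\d+) minutes")


def link_downtime(lines):
    """Return {link_number: minutes_down} for every matching log line."""
    result = {}
    for line in lines:
        match = DOWN_RE.search(line)
        if match:
            result[int(match.group(1))] = int(match.group(2))
    return result


def split_minutes(total):
    """Convert a minute count to a (days, hours, minutes) tuple."""
    days, remainder = divmod(total, 24 * 60)
    hours, minutes = divmod(remainder, 60)
    return days, hours, minutes


downtime = link_downtime(LOG_LINES)
for link, minutes in sorted(downtime.items()):
    d, h, m = split_minutes(minutes)
    print(f"link #{link}: down {d}d {h}h {m}m")
```

4414 minutes works out to 3 days, 1 hour, 34 minutes, which is within a couple of minutes of the "up 3 days, 1:36" uptime in the same output. That suggests both interconnect links have been down essentially since Filer-2 last booted.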

aleksandar_stefanov

Yes, but Filer-2 is in takeover mode. How are the two filers connected to each other? Check the cables.

ASRARGUNA

Unfortunately, the filers are at a remote location, so I can't check the cables. Is there a way to check things remotely? Sorry.

aleksandar_stefanov

If the cables are disconnected, you will just see the ports offline, without knowing what is going on.

Run ifconfig -a on both filers to see whether the interconnect ports are down on both.

ASRARGUNA

I am not able to SSH to Filer-1; it says "Shell not supported on takeover partner".

On Filer-2, this is the result of ifconfig -a:

Filer-2(takeover)> ifconfig -a

e0a: flags=0x2fec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 1500

        inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x

partner inet 10.166.x.x (e0a)

ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full

e0b: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full

e3a: flags=0x170e866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full

e3b: flags=0x170e866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full

e4a: flags=0x2fec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 1500

        inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x

partner inet 10.166.17.11 (e4a)

ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full

e4b: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full

e4c: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full

e4d: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500

ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full

e0M: flags=0x2bec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM,MGMT_PORT> mtu 1500

        inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x noddns

partner inet 10.166.x.x (e0M)

ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full

lo: flags=0x1be8049<UP,LOOPBACK,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 8160

        inet 127.0.0.1 netmask 0xff000000 broadcast 127.0.0.1

ether 00:00:00:00:00:00 (VIA Provider)

losk: flags=0x40a400c9<UP,LOOPBACK,RUNNING> mtu 9188

        inet 127.0.20.1 netmask 0xff000000 broadcast 127.0.20.1
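
A long ifconfig -a listing like this is easier to read as a per-port summary. The following is a minimal Python sketch, assuming the 7-Mode output layout shown above (a header line with flags, then an "ether" line whose media string ends in -up or -down). The function name link_states is hypothetical. It only summarizes the ports that ifconfig -a lists; whether the HA interconnect adapters appear there at all is not established in this thread.

```python
import re

# Two interfaces copied from the output above, enough to show both cases.
IFCONFIG_SAMPLE = """\
e0a: flags=0x2fec867<UP,BROADCAST,RUNNING,MULTICAST,MULTIHOST,PARTNER_UP,TCPCKSUM> mtu 1500
        inet 10.166.x.x netmask 0xffffff00 broadcast 10.166.x.x
        ether 00:a0:x:x:x:x (auto-100tx-fd-up) flowcontrol full
e0b: flags=0x270c866<BROADCAST,RUNNING,MULTICAST,TCPCKSUM> mtu 1500
        ether 00:a0:x:x:x:x (auto-unknown-down) flowcontrol full
"""

HEADER_RE = re.compile(r"^(\w+): flags=\w+<([^>]*)>")
MEDIA_RE = re.compile(r"\(auto-[\w-]*-(up|down)\)")


def link_states(text):
    """Map interface name -> {'configured_up': bool, 'link': 'up'/'down'/None}."""
    states = {}
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = header.group(1)
            flags = header.group(2).split(",")
            states[current] = {"configured_up": "UP" in flags, "link": None}
            continue
        media = MEDIA_RE.search(line)
        if media and current:
            states[current]["link"] = media.group(1)
    return states


states = link_states(IFCONFIG_SAMPLE)
for name, info in states.items():
    print(name, info)
```

In the full output above, only e0a, e4a, and e0M are configured UP with link up; the remaining Ethernet ports report "auto-unknown-down".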

aleksandar_stefanov

OK, log in to Filer-2 and try a giveback with this command: cf giveback

ASRARGUNA

Even if the interconnect state under both filers shows down?

aleksandar_stefanov

You can always try. Can you connect to the RLM on Filer-1 and check its status?

ASRARGUNA

I tried cf giveback on Filer-2 and it says:

Filer-2(takeover)> cf giveback

Partner not waiting for giveback, giveback cancelled.

To do a giveback without checking for partner readiness, please either set option "cf.giveback.check.partner" to "off" before doing "cf giveback" again, or do "cf giveback -f".

The first choice disables checking for all future "cf giveback", until it's turned back to "on". The second choice is good for this giveback only.

Yes, I am able to connect to the RLM on Filer-1, and it says the system has booted in maintenance mode:

May 13 05:24:49 [localhost:mgr.boot.reason_ok:notice]: System rebooted after a power down due to environmental condition.

May 13 05:24:49 [localhost:callhome.reboot.unknown:info]: Call home for REBOOT

Ipspace "acp-ipspace" created

May 13 05:24:51 [localhost:acp.configWarn:debug]: Could not configure ACP administrator due to invalid Ethernet port in maintenance mode.

May 13 05:24:53 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

May 13 05:24:59 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

May 13 05:25:05 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

May 13 05:25:11 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

May 13 05:25:17 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

May 13 05:25:23 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

halt

May 13 05:25:25 [localhost:kern.cli.cmd:debug]: Command line input: the command is 'halt'. The full command line is 'halt'.

May 13 05:25:29 [localhost:cf.fmns.skipped.disk:notice]: While releasing the reservations in "Waiting For Giveback" state Failover Monitor Node State(fmns) module skipped the disk 0a.54 that is owned by 151742736 and reserved by 151742631.

Do you want me to run any specific command on the RLM of Filer-1, or should I do cf giveback -f?

mike_burris

If an environmental issue caused this, you need to figure out what it is *before* performing a giveback, and get it resolved. From the RLM, check "events" and see if it gives you any details about what happened. Once that is resolved, whatever it may be, I'd also consider doing a diagnostic boot and running through the tests/checks just to make sure everything is fine.

Something else that comes to mind: are there no open NetApp cases for this system? I *think* the box would have asup'd when it had an environmental problem and then rebooted itself, but don't hold me to that...

ASRARGUNA

Hi Mike,

Yes, there was a power outage in the DC, which caused all the servers to shut down. So how do I do a diagnostic boot?

Thanks - AG

SIRTECHIE42

Hi Asrar,

I was looking into this myself today. Here is a link to the Diagnostics guide for 31xx systems: https://library.netapp.com/ecm/ecm_download_file/ECMP1112531

This guide should give you the details you need to run the diagnostic tests Mike mentioned. To enter diag mode, enter the following at the LOADER prompt:

LOADER> boot_diags

ASRARGUNA

Hi,

I connected through the RLM, and this is what I see in the messages: "Fault reported on disk storage shelf attached to channel 0a. Please check fans, power, and temperature."

Filer-2(Takeover)> environment status shelf 0a

Channel: 0a

        Shelf: 2

        SES device path: local access: 0a.32

        Module type: ESH4; monitoring is active

        Shelf status: non-critical condition

        SES Configuration, via loop id 32 in shelf 2:

         logical identifier=0x50050cc002112326

         vendor identification=XYRATEX

         product identification=DS14-Mk2-FC

         product revision level=1414

        Vendor-specific information:

         Product Serial Number: OPS445022112326

         Optional Settings: 0x00

        Status reads attempted: 19272; failed: 0

        Control writes attempted: 320; failed: 0

        Shelf bays with disk devices installed:

          13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0

          with error: none

        Power Supply installed element list: 1, 2; with error: none

        Power Supply information by element:

          [1] Serial number: PMA441920116102  Part number: <N/A>

              Type: 34

              Firmware version: <N/A>  Swaps: 0

          [2] Serial number: PMA441920116091  Part number: <N/A>

              Type: 34

              Firmware version: <N/A>  Swaps: 0

        Power control element status: power control status not supported (DS14Mk4 shelf required)

        Cooling Unit installed element list: 1, 2; with error: 2

        Temperature Sensor installed element list: 1, 2, 3; with error: none

        Shelf temperatures by element:

          [1] 26 C (78 F) (ambient)  Normal temperature range

          [2] 36 C (96 F)  Normal temperature range

          [3] 35 C (95 F)  Normal temperature range

        Temperature thresholds by element:

          [1] High critical: 50 C (122 F); high warning: 40 C (104 F)

              Low critical:  0 C (32 F); low warning: 10 C (50 F)

          [2] High critical: 63 C (145 F); high warning: 53 C (127 F)

              Low critical:  0 C (32 F); low warning: 10 C (50 F)

          [3] High critical: 63 C (145 F); high warning: 53 C (127 F)

              Low critical:  0 C (32 F); low warning: 10 C (50 F)

        ES Electronics installed element list: 1, 2; with error: none

        ES Electronics reporting element: 1

        ES Electronics information by element:

          [1] Serial number: IMS6981331F5949  Part number: <N/A>

              CPLD version: <N/A>  Swaps: 0

          [2] Serial number: IMS6981331F597B  Part number: <N/A>

              CPLD version: <N/A>  Swaps: 0

        Embedded Switching Hub installed element list: 1, 2; with error: none
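
The fault in a listing like this sits in the "installed element list: ...; with error: ..." lines, which are easy to miss. Here is a minimal Python sketch that pulls out only the element types reporting an error. It assumes the line format shown above, and the function name faulted_elements is hypothetical.

```python
import re

# The element summary lines copied from the shelf output above.
SHELF_SAMPLE = """\
        Power Supply installed element list: 1, 2; with error: none
        Cooling Unit installed element list: 1, 2; with error: 2
        Temperature Sensor installed element list: 1, 2, 3; with error: none
        ES Electronics installed element list: 1, 2; with error: none
        Embedded Switching Hub installed element list: 1, 2; with error: none
"""

ELEMENT_RE = re.compile(
    r"^\s*(.+?) installed element list: ([\d, ]+); with error: (.+)$"
)


def faulted_elements(text):
    """Return {element_type: [faulted element numbers]} for non-'none' errors."""
    faults = {}
    for line in text.splitlines():
        match = ELEMENT_RE.match(line)
        if not match:
            continue
        name, _installed, errors = match.groups()
        errors = errors.strip()
        if errors != "none":
            faults[name] = [int(item) for item in errors.split(",")]
    return faults


faults = faulted_elements(SHELF_SAMPLE)
print(faults)
```

For the output above, the only entry is Cooling Unit element 2. The power supplies, temperature sensors, and shelf modules all report "with error: none", and all three temperatures are within their normal range.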

mike_burris

Take a look at the solution section of these KB articles:

https://kb.netapp.com/support/index?page=content&id=2011427&locale=en_US

https://kb.netapp.com/support/index?page=content&id=2013504&locale=en_US

https://kb.netapp.com/support/index?page=content&id=2013351&locale=en_US

Basically, check power, cabling, etc. (I've reseated a PSU before and had the alerts clear.) If it still shows a fault after all of that, swap the PSU for a replacement.
