Solved: Re: FAS2240-2 Inaccessible Filer - HA Down

KamiMcLeod · ‎2017-12-26

Hi All

Really hoping someone can help here as I am very new to FAS

One of our arrays had a disk fail, so we replaced it last week, assigned ownership and assumed the rebuild had started. Since then:

1. We cannot CLI to the filer (1A) hosting this LUN - it asks for credentials and then closes once we enter these

2. We now cannot GUI into the clustered pair - previously this was working partially - could not access disks or aggregates for the filer with the dead disk - now it sits there "authenticating to filer 1A" and never completes

3. Snapmirror between this filer (1A) and a partner unit (2A) has stopped

The disk that eventually died was in rebuild back in October. The filer started a repair of the disk and ground the systems to a halt. We lost CLI access then as well, and clients very much noticed the rebuild as their data access was slow, but it recovered eventually and we haven't had any reports since.

HA has apparently been offline since that point though, which we were not alerted to.

1A
Message:
HA mode, but takeover of partner is disabled due to reason : status of backup mailbox is uncertain.

CLI from partner filer:
1B> cf monitor
current time: 27Dec2017 09:54:56
UP 68+07:35:13, partner '1A', CF monitor enabled
VIA Interconnect is up (link up), takeover capability on-line
partner may be down, last partner update TAKEOVER_ENABLED (20Oct2017 22:59:23)
takeover scheduled 00:00:15

1B> cf status
1A may be down, takeover will be initiated in 15 seconds.
VIA Interconnect is up (link up).

1B> cf hw_assist status
Local Node(1B) Status:
Active: 1B monitoring alerts from partner(1A)
port 4444 IP address 192.168.1.15
Partner Node(1A) Status:
Active: 1A monitoring alerts from partner(1B)
port 4444 IP address 192.168.1.14

I am not sure where to go from here, as this is our first time managing a FAS unit, but I am almost at the point of moving all the data off this LUN to protect our clients' setup.

Any help would be extremely appreciated!
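
A note on reading the `cf monitor` output above: the gap between `current time` and `last partner update` shows how long 1B has gone without a status update from 1A. Below is a minimal Python sketch of that arithmetic. The helper names (`parse_stamp`, `partner_update_age`) are hypothetical, and the timestamp layout is assumed purely from the output pasted in this post, not from any documented format.

```python
import re
from datetime import datetime, timedelta

# Month abbreviations as they appear in the pasted output (e.g. "27Dec2017").
MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Assumed layout, taken from the post: DDMonYYYY HH:MM:SS
STAMP = re.compile(r"(\d{1,2})([A-Za-z]{3})(\d{4}) (\d{2}):(\d{2}):(\d{2})")


def parse_stamp(text):
    """Return the first DDMonYYYY HH:MM:SS timestamp found in text."""
    match = STAMP.search(text)
    if match is None:
        raise ValueError(f"no timestamp found in: {text!r}")
    day, mon, year, hh, mm, ss = match.groups()
    return datetime(int(year), MONTHS[mon.capitalize()], int(day),
                    int(hh), int(mm), int(ss))


def partner_update_age(cf_monitor_output):
    """Time elapsed between 'current time' and 'last partner update'."""
    now = last = None
    for line in cf_monitor_output.splitlines():
        line = line.strip()
        if line.startswith("current time:"):
            now = parse_stamp(line)
        elif "last partner update" in line:
            last = parse_stamp(line)
    if now is None or last is None:
        raise ValueError("expected both timestamps in the output")
    return now - last


# The two relevant lines, copied from the output in this post.
SAMPLE = """\
current time: 27Dec2017 09:54:56
partner may be down, last partner update TAKEOVER_ENABLED (20Oct2017 22:59:23)
"""

age = partner_update_age(SAMPLE)
print(age)  # prints 67 days, 10:55:33
```

For the output in this post the age comes out at a little over 67 days, which fits with HA having been down since the October rebuild.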

KamiMcLeod · ‎2018-01-21

As an update in case anyone comes across this thread: the controller eventually crashed. The LUNs went completely offline, yet when we accessed the SAN via direct console it responded perfectly and believed itself to be in perfect health.

Until I ran "vol status" - then I lost it completely.

We had to hard reset the controller, which finally forced the LUN to fail over to the partner filer.

On restart, the controller performed a filesystem boot repair and a mailbox disk repair.

All happy for now.


kahuna · ‎2017-12-27

- is the LUN on node 1A still serving data?

- what about the Service Processor? Is it configured and working? (check the command 'sysconfig -a')

- if the SP is not working, what about attaching a cable to the console port?

- what is the ONTAP version?

KamiMcLeod · ‎2017-12-28

Hi Kahuna

Yes, the LUN on 1A is still functional.

We cannot access the CLI on 1A, so I cannot run that command. We haven't attached a cable just yet as the SAN isn't nearby.

ONTAP version = NetApp Release 8.2.3 7-Mode

kahuna · ‎2017-12-28

"cannot run that command" - you can find the output of the 'sysconfig -a' command in the daily autosupport (if the system is sending). In there, you will find the IP address of the Service Processor (if configured)

SSH to the SP IP. username: 'naroot' - password is the root password. Once in, run the command 'system console'

If that doesn't work, you'd have to use a console cable and manage the node from there

dbenadib · ‎2017-12-30

Is the system in takeover now? If so, 1A may be stuck at the LOADER prompt, in which case you should use either a console cable or the SP to make it boot...

kahuna · ‎2017-12-30

According to the output of 'cf status' and the message below, the system is NOT in takeover:

"HA mode, but takeover of partner is disabled due to reason : status of backup mailbox is uncertain"
