FAS2050 storage controller not communicating

Nicolas_ja · ‎2019-08-27

Hello All,

We're currently experiencing issues with our FAS2050 storage controller. I could not find any documentation on the FAS2050, I assume because it is already End-of-Life. Our FAS2050 supports our VMware VDI infrastructure (Horizon 7/vCenter 6.5) by storing the profiles for our VDI customers. It runs two storage controllers: one holds the user profiles and the other stores miscellaneous files.

Over the weekend, we had to power cycle our server room. In preparation, we were shutting down our VDI environment when all power to our servers and the FAS2050 was suddenly cut, causing a hard shutdown.

After rebooting, our VMware/VDI environment came back up without too many issues, but one of the two storage controllers, the one holding our VDI profiles, didn't come up as expected. There is an amber light on the exclamation point LED in the front of the chassis as well as green and amber lights on the NICs in the back (on the bottom module, although the top only has green NIC lights).

We utilize NetApp OnCommand System Manager to view the storage locations for our profiles and now we can't discover or add that location back in.

I've been scouring the interwebs for any documentation on the FAS2050 that's not about upgrading it to newer versions.

aborzenkov · ‎2019-08-27

What do you see on the console of the controller in question? What does “cf status” on the good controller say?

The only hardware specific documentation is related to parts replacement; everything else is the same so just use Data ONTAP manuals for your version.

Nicolas_ja · ‎2019-08-28

Using Data ONTAP 7.3.1.1 FilerView, the status page shows: "This node has taken over. /vol/vol_mb1_db1 is full (using or reserving 100% of space and 0% of inodes, using 100% of reserve)." I'm very much a beginner in the storage world so I'm not sure if the volume being full would have any impact on the takeover process.

Nicolas_ja · ‎2019-08-28

Edit to the recent post above:

The status was taken from the operational controller.

Unfortunately, I don't have a working serial cable so I'm going to try to PuTTY into it. I'll post any results from my findings.

Thank you guys for your time and assistance with this.

-Nick

AlexDawson · ‎2019-08-27

Hi there! Yes, you're right that this system is now end of support from us, sorry.

Can you connect via RLM or serial cable (RJ-45 9600 8N1)? What does the controller say? With any luck, it will just have "Waiting for giveback" scrolling on the console, and you can just type "cf giveback" into the other controller.

This would suggest that when one node came up successfully, it took over for the other one, but didn't have failover set up properly. This recent thread goes through the process of validating HA for a system of that era - https://community.netapp.com/t5/FAS-and-V-Series-Storage-Systems-Discussions/Restart-the-Controllers-7-mode-HA/td-p/149941

Please share the output of the serial console and we'll see what to do next.

Nicolas_ja · ‎2019-08-28

Unfortunately, we were not able to console into the controller that is down. We don't have a working serial cable to physically console into the system, so we had to resort to PuTTY (which wouldn't work due to network issues).

On the operational controller, after logging in, the prompt displayed:

"controllerhostname"(takeover)>

We ran a cf status and got the following result:

"partner has been taken over by <controllerhostname>"

The resource owned by the downed controller, in this case, is the volume containing our VDI profiles. VDI users still cannot connect to their profiles, which leads us to believe that the takeover didn't complete as intended, even though the status suggests otherwise.

Would a restart of the FAS chassis be recommended or would we need to manually run the takeover commands instead? Thank you again for your help.

-Nick

AlexDawson · ‎2019-08-28

Hi there,

The serial cables are the same pinout as the light blue Cisco serial cables, if that helps.

If you don't have any other option, just run "cf giveback" on the surviving node and hope for the best. It does a number of checks before giving it back entirely, and it sounds like you're having an outage right now, so it won't get any worse.

The volume being full is bad, but I believe it wouldn't cause this behavior.

Hope this helps!

Nicolas_ja · ‎2019-08-28

Alex,

I ran the cf giveback command and got the following output:

survivingnode (takeover)> cf giveback
survivingnode (takeover)> Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.misc.operatorGiveback:info]: Cluster monitor: giveback initiated by operator
Wed Aug 28 13:03:50 EDT [survivingnode(takeover): disk.failed.abortGiveback:warning] Failed disk 0a.77 should be removed before the giveback command is invoked.
Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.rsrc.givebackVeto:error] Cluster monitor: disk check : giveback cancelled due to active state
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info] CIFS: Warning for server \\DC-001: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-001: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info] CIFS: Warning for server \\DC-002: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-002: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info] CIFS: Warning for server \\DC-003: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-003: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:08:56 EDT [survivingnode(takeover): asup.smtp.host:info]: Autosupport cannot connect to host xxx.xxx.xxx.xxx (Network comm problem) for message: DISK FAILED
Wed Aug 28 13:08:56 EDT [survivingnode(takeover): asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the auto.support.mailhost option. (DISK FAILED)
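
For anyone working through a console capture like the one above, here is a minimal Python sketch that picks out the messages that blocked the giveback. It assumes every message carries a bracketed [node: event.name:severity] tag, as in the pasted output. The function names giveback_blockers and failed_disks are hypothetical helpers, not part of any NetApp tool.

```python
import re

# Matches the bracketed tag seen in the pasted console output, for example
# [survivingnode(takeover): cf.rsrc.givebackVeto:error]
# The trailing colon after the bracket is optional because the capture
# shows both forms.
TAG_RE = re.compile(
    r"\[(?P<node>[^:\]]+):\s*(?P<event>[\w.]+):(?P<severity>\w+)\]:?\s*(?P<msg>.*)"
)
FAILED_DISK_RE = re.compile(r"Failed disk (\S+)")


def giveback_blockers(lines):
    """Return (event, message) pairs for warning/error events that mention giveback."""
    blockers = []
    for line in lines:
        match = TAG_RE.search(line)
        if match is None:
            continue
        is_problem = match.group("severity") in ("warning", "error")
        if is_problem and "giveback" in match.group("event").lower():
            blockers.append((match.group("event"), match.group("msg")))
    return blockers


def failed_disks(lines):
    """Return every disk name reported in a 'Failed disk <name>' message."""
    return [disk for line in lines for disk in FAILED_DISK_RE.findall(line)]


# Lines copied from the capture in this thread.
console = [
    "Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.misc.operatorGiveback:info]: "
    "Cluster monitor: giveback initiated by operator",
    "Wed Aug 28 13:03:50 EDT [survivingnode(takeover): disk.failed.abortGiveback:warning] "
    "Failed disk 0a.77 should be removed before the giveback command is invoked.",
    "Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.rsrc.givebackVeto:error] "
    "Cluster monitor: disk check : giveback cancelled due to active state",
]

for event, msg in giveback_blockers(console):
    print(event, "->", msg)
print("failed disks:", failed_disks(console))
```

In this capture the two flagged events both point at the same cause: the failed disk 0a.77 vetoed the giveback. The CIFS and AutoSupport errors are network symptoms and are not picked up by this filter.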

One thing to note: prior to running the cf giveback command above, we power cycled the entire chassis to see if everything would come up okay, but no luck.

SpindleNinja · ‎2019-08-28

I would get access to the second controller at this point, either via ssh into the SP/BMC or a console cable, to see what it's displaying.

Are there any amber lights on any of the disks, or any ports on the rear showing offline?

AlexDawson · ‎2019-08-28

@Nicolas_ja wrote:

Wed Aug 28 13:03:50 EDT [survivingnode(takeover): disk.failed.abortGiveback:warning] Failed disk 0a.77 should be removed before the giveback command is invoked.

Drive 77 should be the left-most drive of the 4th shelf along (see https://library.netapp.com/ecm/ecm_download_file/ECMP1112854 ). It should have an orange LED on it.

Follow that advice, then try "cf giveback" again.
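
The arithmetic behind locating a disk from its name can be sketched in a few lines of Python. This assumes the DS14-style loop addressing, where each shelf ID reserves a block of 16 device IDs and the bay is the remainder, which is consistent with disk 77 landing on the 4th shelf. The function name decode_disk is hypothetical, and the physical position of a given bay number should be confirmed against the shelf documentation.

```python
def decode_disk(name, ids_per_shelf=16):
    """Split a disk name such as '0a.77' into (adapter, shelf ID, bay).

    Assumption: device ID = ids_per_shelf * shelf ID + bay number,
    as on DS14-style FC-AL shelves. Other shelf types may differ.
    """
    adapter, _, device = name.partition(".")
    shelf, bay = divmod(int(device), ids_per_shelf)
    return adapter, shelf, bay


# Disk names that appear in this thread.
for disk in ("0a.77", "0b.80", "0a.87"):
    adapter, shelf, bay = decode_disk(disk)
    print(f"{disk}: adapter {adapter}, shelf ID {shelf}, bay {bay}")
```

Under that assumption, 0a.77 decodes to shelf ID 4, bay 13 on adapter 0a.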

aborzenkov · ‎2019-08-28

One should never perform giveback without having console access to the controller which was taken over - especially in this case. If “cf status” does not indicate that the partner is “ready for giveback”, it means either the partner did not boot or there is some communication issue. Blindly performing giveback in this state can easily result in outage and data loss.

Console connection (either directly or via RLM/SP/BMC) is really a must when doing any maintenance in NetApp.
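
The check described above can be written down as a small Python sketch. It only matches phrases quoted in this thread ("ready for giveback", "taken over", "is up"); the exact wording of cf status output on other Data ONTAP versions is an assumption, and the function name giveback_state is hypothetical.

```python
def giveback_state(cf_status_output):
    """Classify cf status text into a rough giveback decision.

    Returns one of:
      'ready'   - partner reports ready for giveback (still confirm on its console)
      'wait'    - takeover is active but the partner is not reported ready
      'normal'  - no takeover in effect, giveback is not needed
      'unknown' - none of the phrases seen in this thread were found
    """
    text = cf_status_output.lower()
    # Check the most specific phrase first: a takeover message can also
    # contain the "ready for giveback" hint.
    if "ready for giveback" in text:
        return "ready"
    if "taken over" in text:
        return "wait"
    if "is up" in text:
        return "normal"
    return "unknown"


# Status strings quoted in this thread.
print(giveback_state("partner has been taken over by survivingnode"))
print(giveback_state("cluster enabled, partner is up"))
```

A result of "wait" is the situation warned about here: look at the partner's console before running cf giveback.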

AlexDawson · ‎2019-08-29

A fair comment there: if at all possible, check the partner console. Then again, you also shouldn’t run production workloads on unsupported hardware, or operate them without out-of-band access and regular takeover/giveback tests done during ONTAP upgrades, but here we are.

Nicolas_ja · ‎2019-08-29

Unfortunately, we do not have any replacement disks lying around to replace the failed disk (0a.77). I was hoping to run the following command to force a giveback but the command line did not recognize it:

storage failover giveback -ofnode <nodename>

Is it possible to just remove the disk and leave the bay empty?

Sincerely,

Nick

AlexDawson · ‎2019-08-29

Hi there,

That command is for modern ONTAP and won't work on your system.

Yes, you can remove that drive and leave the slot unused (but for airflow management, just leave the drive in the slot unplugged).

Nicolas_ja · ‎2019-09-10

Hello Alex,

We removed the failed hard drive but after attempting the giveback command, it showed yet another failed drive that is stopping the process from continuing. Since we do have multiple drives in the chassis with amber lights, we expected a waterfall of failed drive errors. We made the decision to not keep pulling failed drives out of the chassis and left them in there, ultimately going forward with the cf giveback -f command which completed successfully.

Now, after entering "cf status", the output displayed is "cluster enabled, partner is up". Even though this is the case, we still cannot ping or add the controller in NetApp OnCommand System Manager via its IP or hostname.

I feel like it's something small that we're just missing in order to get it communicating properly.

-Nick

SpindleNinja · ‎2019-09-10

Do you have a good backup of all the data that's on this 2050? Multiple drive failures are not good.

As far as what's wrong with the other node, what happens if you console into it? Does anything display? Are there any lights on the rear? There is a "!" LED on each controller; is either of those lit up?

Nicolas_ja · ‎2019-09-11

We do not have a backup.

At the start of this thread, we couldn't physically console into the down storage controller (fas01) or PuTTY into it. After recently acquiring a laptop, we're now able to console in and log in to fas01. There was an amber light (back of chassis) on the "!" indicator for the down storage controller (fas01) but after performing a reboot (as mentioned below), the amber light went away.

Once consoled in, we're greeted with the login prompt as normal. Ran a cf status and got a return of "cluster enabled, fas02 is up". What's interesting here is that after running the same cf status command on fas02 (the working storage controller), we get "cluster enabled, partner is up". So fas02 is not able to see the hostname of fas01 but fas01 can see fas02's hostname.

We rebooted fas01 and, after consoling back into it, received several log messages. Shortened versions are below:

Interconnect link is UP
Connection for cfo_rv failed
broken disks errors (as expected)
- Broken Disk 0b.80 Shelf ? Bay ? detected prior to assimilation. It should be removed
- Broken Disk 0a.87 Shelf ? Bay ? detected prior to assimilation. It should be removed

After running sysconfig and ifconfig on fas01, we found that it had lost the two IPs that were attached to the e0a and e0b interfaces. We ran the same commands on fas02 and saw the two lost IPs in the ifconfig -a output. For example:

(on fas02) ifconfig -a

e0a: ...partner inet 1.2.3.4 (redacted actual IP) (not in use)

e0b: ...partner inet 1.2.3.5 (redacted actual IP) (not in use)

I take it that the IPs mentioned above were the IPs originally attached to those interfaces before fas01 took the initial power hit (which caused this whole issue). We used the ifconfig commands to assign those IPs back to the e0a and e0b interfaces.

After attaching the IPs to the interfaces, we can ping our Domain Controller as well as several DNS servers but can't ping individual PCs. We then used the route add command to add the default gateway for good measure (assuming that IP/route was also lost).
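
Reading the partner addresses off the surviving node can be sketched in Python as follows. It assumes saved ifconfig -a text where each interface starts with an unindented "name:" header and the partner address appears as "partner inet <address>", which is the shape of the abbreviated lines quoted above. The function name partner_addresses is hypothetical and the sample addresses are the redacted placeholders from this post.

```python
import re

# An interface block starts with an unindented "<name>:" header.
IFACE_RE = re.compile(r"^(?P<iface>[A-Za-z0-9_-]+):")
# The partner address as it appears in the lines quoted in this thread.
PARTNER_RE = re.compile(r"partner inet (?P<ip>\d{1,3}(?:\.\d{1,3}){3})")


def partner_addresses(ifconfig_output):
    """Map each interface name to the partner IPv4 address listed under it."""
    result = {}
    current = None
    for line in ifconfig_output.splitlines():
        header = IFACE_RE.match(line)
        if header:
            current = header.group("iface")
        partner = PARTNER_RE.search(line)
        if partner and current is not None:
            result[current] = partner.group("ip")
    return result


# Placeholder addresses, as redacted in the post above.
sample = """\
e0a: ...partner inet 1.2.3.4 (not in use)
e0b: ...partner inet 1.2.3.5 (not in use)
"""

print(partner_addresses(sample))
```

The resulting mapping gives a list to compare against the addresses configured on fas01's e0a and e0b.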

SpindleNinja · ‎2019-09-12

Sounds like the giveback wasn't fully finished?

Do the /etc/rc and hosts files look good/match?

How are your aggrs looking? (aggr status)