FAS and V-Series Storage Systems Discussions


FAS2050 storage controller not communicating

Hello All,

We're currently experiencing issues with our FAS2050 storage controller. I could not find any documentation on the FAS2050, presumably because it is already End-of-Life. Our FAS2050 supports our VMware VDI infrastructure (Horizon 7/vCenter 6.5) by storing the profiles for our VDI customers. It runs two storage controllers: one holds the user profiles and the other stores miscellaneous files.

Over the weekend, we had to power cycle our server room. In preparation, we were in the process of shutting down our VDI environment when all power to our servers and the FAS2050 was suddenly cut, causing a hard shutdown.

After rebooting, our VMware/VDI environment came back up without too many issues, but one of the two storage controllers, the one holding our VDI profiles, didn't come up as expected. There is an amber light on the exclamation point LED on the front of the chassis, as well as green and amber lights on the NICs in the back (on the bottom module; the top module only has green NIC lights).

We use NetApp OnCommand System Manager to view the storage locations for our profiles, and now we can't discover that location or add it back in.

I've been scouring the interwebs for any FAS2050 documentation that isn't about upgrading it to newer versions.


Re: FAS2050 storage controller not communicating

What do you see on the console of the controller in question? What does “cf status” on the good controller say?

 

The only hardware specific documentation is related to parts replacement; everything else is the same so just use Data ONTAP manuals for your version.


Re: FAS2050 storage controller not communicating

Hi there! Yes, you're right that this system is now end of support from us, sorry. 

 

Can you connect via RLM or serial cable (RJ-45, 9600 8N1)? What does the controller say? With any luck, it will just have "Waiting for giveback" scrolling on the console, and you can just type "cf giveback" into the other controller.

 

This would suggest that when one node came up successfully, it took over for the other one, but didn't have failover set up properly. This recent thread goes through the process of validating HA for a system of that era: https://community.netapp.com/t5/FAS-and-V-Series-Storage-Systems-Discussions/Restart-the-Controllers-7-mode-HA/td-p/149941

 

Please share the output of the serial console and we'll see what to do next.


Re: FAS2050 storage controller not communicating

We're running Data ONTAP 7.3.1.1. In FilerView, the status page shows: "This node has taken over. /vol/vol_mb1_db1 is full (using or reserving 100% of space and 0% of inodes, using 100% of reserve)." I'm very much a beginner in the storage world, so I'm not sure if the volume being full would have any impact on the takeover process.

Re: FAS2050 storage controller not communicating

Edit to the recent post above:

The status was taken from the operational controller. 

 

Unfortunately, I don't have a working serial cable, so I'm going to try to connect to it with PuTTY. I'll post any results from my findings.

 

Thank you guys for your time and assistance with this.

 

-Nick


Re: FAS2050 storage controller not communicating

Unfortunately, we were not able to console into the controller that is down. We don't have a working serial cable to physically console into the system, so we had to resort to PuTTY, which wouldn't work due to network issues.

 

On the operational controller, after logging in, the prompt displayed:

"controllerhostname"(takeover)>

 

We ran a cf status and got the following result:

"partner has been taken over by <controllerhostname>"

 

The resource owned by the downed controller, in this case, is the volume containing our VDI profiles. VDI users still cannot connect to their profiles, which leads us to believe that the takeover didn't complete as intended, even though the status suggests otherwise.

 

Would a restart of the FAS chassis be recommended, or would we need to manually run the takeover commands instead? Thank you again for your help.

 

-Nick


Re: FAS2050 storage controller not communicating

Hi there,

 

The serial cables are the same pinout as the light blue Cisco serial cables, if that helps.

 

If you don't have any other option, just run "cf giveback" on the surviving node and hope for the best. It does a number of checks before giving it back entirely, and it sounds like you're having an outage right now, so it won't get any worse.

 

The volume being full is bad, but I believe it wouldn't cause this behavior.

 

Hope this helps!


Re: FAS2050 storage controller not communicating

Alex,

 

I ran the cf giveback command and got the following output:

 

survivingnode (takeover)> cf giveback
survivingnode (takeover)> Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.misc.operatorGiveback:info]: Cluster monitor: giveback initiated by operator
Wed Aug 28 13:03:50 EDT [survivingnode(takeover): disk.failed.abortGiveback:warning] Failed disk 0a.77 should be removed before the giveback command is invoked.
Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.rsrc.givebackVeto:error] Cluster monitor: disk check : giveback cancelled due to active state
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info] CIFS: Warning for server \\DC-001: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-001: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info] CIFS: Warning for server \\DC-002: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-002: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info] CIFS: Warning for server \\DC-003: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-003: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:08:56 EDT [survivingnode(takeover): asup.smtp.host:info]: Autosupport cannot connect to host xxx.xxx.xxx.xxx (Network comm problem) for message: DISK FAILED
Wed Aug 28 13:08:56 EDT [survivingnode(takeover): asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts frmo the auto.support.mailhost option. (DISK FAILED)

 

One thing to note: prior to running the cf giveback command above, we power cycled the entire chassis to see if everything would come up okay, but no luck.
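
If it helps anyone reading console output like the paste above, here is a minimal Python sketch that picks out the events that blocked the giveback. It assumes the line layout seen in this paste (timestamp, then [node: event.name:severity], then the message). The names parse_line, giveback_blockers and failed_disks are made up for illustration and are not part of any NetApp tool.

```python
import re
from typing import List, Optional

# Layout assumed from the console paste in this thread (not an official spec):
#   <timestamp> [<node>: <event.name>:<severity>] <message>
# Some lines have a colon after the closing bracket, some do not.
LINE_RE = re.compile(
    r"^(?P<ts>.+?) \[(?P<node>[^:\]]+): (?P<event>[\w.]+):(?P<sev>\w+)\]:? (?P<msg>.*)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Return the fields of one console line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def giveback_blockers(lines: List[str]) -> List[dict]:
    """Keep warning/error events whose event name mentions giveback."""
    events = [e for e in map(parse_line, lines) if e]
    return [
        e for e in events
        if e["sev"] in ("warning", "error") and "giveback" in e["event"].lower()
    ]


def failed_disks(lines: List[str]) -> List[str]:
    """Collect disk names from 'Failed disk <name>' messages."""
    disks = []
    for e in filter(None, map(parse_line, lines)):
        m = re.search(r"Failed disk (\S+)", e["msg"])
        if m:
            disks.append(m.group(1))
    return disks
```

Fed the lines from the paste, this should flag the disk.failed.abortGiveback and cf.rsrc.givebackVeto events and ignore the CIFS and AutoSupport noise, which makes it easier to see that the failed disk is what vetoed the giveback.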


Re: FAS2050 storage controller not communicating

I would get access to the second controller at this point, either via SSH into the SP/BMC or with a console cable, to see what it's displaying.

 

Are there any amber lights on any of the disks, or any ports on the rear showing offline?


Re: FAS2050 storage controller not communicating


@Nicolas_ja wrote:


Wed Aug 28 13:03:50 EDT [survivingnode(takeover): disk.failed.abortGiveback:warning] Failed disk 0a.77 should be removed before the giveback command is invoked.


Drive 77 should be the left-most drive of the 4th shelf along (see https://library.netapp.com/ecm/ecm_download_file/ECMP1112854). It should have an orange LED on it.

 

Follow the advice in that message (remove the failed disk), then try "cf giveback" again.
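
For anyone trying to locate a disk from a name like 0a.77, here is a small Python sketch of the arithmetic. It assumes the FC-attached shelf convention where the device ID equals shelf ID times 16 plus the bay number; other attach types are named differently, so treat this as a rough aid and check the shelf documentation linked above. The function name decode_disk_name is made up for illustration.

```python
def decode_disk_name(name: str) -> dict:
    """Split a disk name such as '0a.77' into adapter, shelf ID and bay.

    Assumption: device ID = shelf ID * 16 + bay (FC-attached shelves).
    """
    adapter, _, device = name.partition(".")
    device_id = int(device)
    # divmod returns (quotient, remainder): shelf ID and bay under the assumption above.
    shelf, bay = divmod(device_id, 16)
    return {"adapter": adapter, "device_id": device_id, "shelf": shelf, "bay": bay}


if __name__ == "__main__":
    print(decode_disk_name("0a.77"))
```

Under that assumption, 0a.77 works out to adapter 0a, shelf ID 4, bay 13, since 77 = 4 * 16 + 13.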


Re: FAS2050 storage controller not communicating

One should never perform a giveback without having console access to the controller which was taken over, especially in this case. If “cf status” does not indicate the partner is “ready for giveback”, it means either the partner did not boot or there is some communication issue. Blindly performing a giveback in this state can easily result in an outage and data loss.

 

A console connection (either directly or via RLM/SP/BMC) is really a must when doing any maintenance on NetApp systems.
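
The rule of thumb above can be written down as a tiny Python pre-check. This is only a sketch: it looks for the phrase "ready for giveback" quoted above in whatever text you captured from "cf status", the exact wording may differ between Data ONTAP versions, and the function names are hypothetical. It does not talk to the filer at all.

```python
def partner_ready_for_giveback(cf_status_output: str) -> bool:
    """True only if the captured cf status text explicitly reports readiness."""
    return "ready for giveback" in cf_status_output.lower()


def giveback_advice(cf_status_output: str, have_partner_console: bool) -> str:
    """Turn the two checks from the post above into a go / no-go message."""
    if not have_partner_console:
        return "stop: get console access to the partner first"
    if not partner_ready_for_giveback(cf_status_output):
        return "stop: partner not reported ready; check its console"
    return "ok: partner reports ready for giveback"


if __name__ == "__main__":
    # Status text as reported earlier in this thread, with no partner console.
    print(giveback_advice("partner has been taken over by survivingnode", False))
```

With the status reported earlier in this thread and no console on the partner, both checks fail, which is the situation being warned about.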

 


Re: FAS2050 storage controller not communicating

A fair comment there: if at all possible, check the partner console. But you also shouldn't run production workloads on unsupported hardware, or operate it without out-of-band access and regular takeover/giveback tests done during ONTAP upgrades, but here we are.


Re: FAS2050 storage controller not communicating

Unfortunately, we do not have any replacement disks lying around to replace the failed disk (0a.77). I was hoping to run the following command to force a giveback, but the command line did not recognize it:

 

storage failover giveback -ofnode <nodename>

 

Is it possible to just remove the disk and leave the bay empty?

 

 

Sincerely,

Nick


Re: FAS2050 storage controller not communicating

Hi there,

 

That command is for modern ONTAP and won't work on your system.

 

Yes, you can remove that drive and leave the slot unused (but for airflow management, just leave the drive in the slot, unplugged).


Re: FAS2050 storage controller not communicating

Hello Alex,

 

We removed the failed hard drive but after attempting the giveback command, it showed yet another failed drive that is stopping the process from continuing. Since we do have multiple drives in the chassis with amber lights, we expected a waterfall of failed drive errors. We made the decision to not keep pulling failed drives out of the chassis and left them in there, ultimately going forward with the cf giveback -f command which completed successfully.

 

Now, after entering "cf status", the output displayed is "cluster enabled, partner is up". Even so, we still cannot ping the controller or add it in NetApp OnCommand System Manager via its IP or hostname.

 

I feel like it's something small that we're just missing in order to get it communicating properly. 

 

-Nick


Re: FAS2050 storage controller not communicating

Do you have a good backup of all the data that's on this 2050? Multiple drive failures are not good.

 

As far as what's wrong with the other node, what happens if you console in to it? Does anything display? Are there any lights on the rear? There is a ! on each controller; is either of those lit up?
