We're currently experiencing issues with our FAS2050 storage controller. I couldn't find any documentation on the FAS2050, presumably because it's already End-of-Life. Our FAS2050 supports our VMware VDI infrastructure (Horizon 7/vCenter 6.5) by storing the profiles for our VDI customers. It runs two storage controllers: one holds the volume for user profiles, and the other handles storage for miscellaneous files.
Over the weekend, we had to power-cycle our server room. In preparation, we were in the process of shutting down our VDI environment when suddenly all power to our servers, as well as to the FAS2050, was cut, causing a hard shutdown.
After rebooting, our VMware/VDI environment came back up without too many issues, but one of the two storage controllers, the one holding our VDI profiles, didn't come up as expected. There is an amber light on the exclamation-point LED on the front of the chassis, as well as green and amber lights on the NICs in the back (on the bottom module; the top module only has green NIC lights).
We use NetApp OnCommand System Manager to view the storage locations for our profiles, and now we can't discover or re-add that location.
I've been scouring the internet for any documentation regarding the FAS2050 that isn't about upgrading it to something newer.
Hi there! Yes, you're right that this system is now end of support from us, sorry.
Can you connect via the RLM or a serial cable (RJ-45, 9600 8N1)? What does the controller say? With any luck, it will just have "Waiting for giveback" scrolling on the console, and you can simply run "cf giveback" on the other controller.
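If you don't have a cable handy, the out-of-band controller can also get you to the console over the network; on a 2050 that's the BMC rather than an RLM. Assuming it was ever configured with an IP (the address below is just a placeholder), the session looks roughly like:

ssh naroot@192.168.1.50
system console

The first command logs into the BMC (the password is the root password), and "system console" from the BMC prompt attaches you to the serial console.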
Using Data ONTAP with FilerView, the status page shows:
"This node has taken over. /vol/vol_mb1_db1 is full (using or reserving 100% of space and 0% of inodes, using 100% of reserve)."
I'm very much a beginner in the storage world, so I'm not sure whether the volume being full would have any impact on the takeover process.
Unfortunately, we were not able to console into the downed controller: we don't have a working serial cable, and PuTTY over the network wouldn't work due to network issues.
On the operational controller, after logging in, the prompt displayed "survivingnode (takeover)>". We ran "cf status" and got the following result:
"partner has been taken over by <controllerhostname>"
The resource owned by the downed controller, in this case, is the volume containing our VDI profiles. VDI users are still unable to connect to their profiles, which leads us to believe the takeover process didn't complete as intended, even though the status suggests otherwise.
Would a restart of the FAS chassis be recommended, or would we need to manually run the takeover commands instead? Thank you again for your help.
The serial cables are the same pinout as the light blue Cisco serial cables, if that helps.
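Terminal settings are the standard 9600 8N1. From a laptop with a USB-to-serial adapter, something like this works (the device name depends on your adapter):

screen /dev/ttyUSB0 9600

Or in PuTTY: connection type Serial, speed 9600, 8 data bits, no parity, 1 stop bit.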
If you don't have any other option, just run "cf giveback" on the surviving node and hope for the best. It does a number of checks before giving it back entirely, and it sounds like you're having an outage right now, so it won't get any worse.
The volume being full is bad, but I believe it wouldn't cause this behavior.
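Once things are stable it's still worth clearing space on that volume. From the 7-Mode console it would look roughly like this (the snapshot name is just an example; pick an old one from the list):

df -h /vol/vol_mb1_db1
snap list vol_mb1_db1
snap delete vol_mb1_db1 nightly.7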
I ran the cf giveback command and got the following output:
survivingnode (takeover)> cf giveback
survivingnode (takeover)> Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.misc.operatorGiveback:info]: Cluster monitor: giveback initiated by operator
Wed Aug 28 13:03:50 EDT [survivingnode(takeover): disk.failed.abortGiveback:warning]: Failed disk 0a.77 should be removed before the giveback command is invoked.
Wed Aug 28 13:03:50 EDT [survivingnode(takeover): cf.rsrc.givebackVeto:error]: Cluster monitor: disk check: giveback cancelled due to active state
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info]: CIFS: Warning for server \\DC-001: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-001: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info]: CIFS: Warning for server \\DC-002: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-002: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.infoMsg:info]: CIFS: Warning for server \\DC-003: Connection terminated.
Wed Aug 28 13:04:00 EDT [survivingnode(takeover): cifs.server.errorMsg:error]: CIFS: Error for server \\DC-003: Error while negotiating protocol with server STATUS_IO_TIMEOUT.
Wed Aug 28 13:08:56 EDT [survivingnode(takeover): asup.smtp.host:info]: Autosupport cannot connect to host xxx.xxx.xxx.xxx (Network comm problem) for message: DISK FAILED
Wed Aug 28 13:08:56 EDT [survivingnode(takeover): asup.smtp.unreach:error]: Autosupport mail was not sent because the system cannot reach any of the mail hosts from the auto.support.mailhost option. (DISK FAILED)
One thing to note: prior to running the cf giveback command above, we power-cycled the entire chassis to see if everything would come up okay, but no luck.
One should never perform a giveback without console access to the controller that was taken over, especially in a case like this. If "cf status" does not indicate the partner is "ready for giveback", it means either the partner did not boot or there is a communication issue. Blindly performing a giveback in this state can easily result in an outage and data loss.
A console connection (either directly or via the RLM/SP/BMC) is really a must when doing any maintenance on a NetApp system.
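For reference, a healthy pre-giveback state looks roughly like this on the surviving node (exact wording varies by ONTAP release), with the taken-over node's console scrolling "Waiting for giveback":

survivingnode (takeover)> cf status
survivingnode has taken over partner.
partner is ready for giveback.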
A fair comment there; if at all possible, check the partner console. Then again, you also shouldn't run production workloads on unsupported hardware, or operate them without out-of-band access and regular takeover/giveback tests during ONTAP upgrades, but here we are.
Unfortunately, we do not have any replacement disks lying around to replace the failed disk (0a.77). I was hoping to run the following command to force a giveback, but the command line did not recognize it:
storage failover giveback -ofnode <nodename>
Is it possible to just remove the disk and leave the bay empty?
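In the meantime, from what I can find in old 7-Mode documentation, these commands should list the disks ONTAP considers failed and show the RAID layout, so we can see what we're dealing with before pulling anything:

vol status -f
aggr status -r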
We removed the failed hard drive, but after attempting the giveback command it reported yet another failed drive blocking the process. Since we have multiple drives in the chassis with amber lights, we expected a waterfall of failed-drive errors. We decided not to keep pulling failed drives out of the chassis and left them in, ultimately going forward with the "cf giveback -f" command, which completed successfully.
Now, after entering "cf status", the output is "cluster enabled, partner is up". Even so, we still cannot ping the controller or add it back into NetApp OnCommand System Manager via its IP or hostname.
I feel like it's something small that we're just missing in order to get it communicating properly.
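In case it helps anyone point us in the right direction, these are the basic checks we're planning to run from that controller's console (the gateway address below is a placeholder for ours):

ifconfig -a
rdfile /etc/rc
rdfile /etc/hosts
netstat -rn
ping 192.168.1.1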