Generic question about high availability

sandeept · 2015-01-31

Someone in our company accidentally performed a pre-production task in production, and this brought a file system offline. This in turn caused a prolonged outage.

It seems like a failover of some sort should have kicked in. I don't know the nature of the "offline" system or exactly what was in the script.

In general, if a file system goes offline, is there a failover that should kick in?

paulstringfellow · 2015-02-01

We'd probably need more detail on what you mean by a "file system" going offline, and also a little more about your setup.

In essence, though, if you have an HA 7-Mode build and a controller fails, then you will get a failover and all resources should be presented from the alternate controller.

If a file system has gone offline because of a full volume, for example, then this is not an HA incident but an admin incident: the volume would go offline, but no failover would occur.

If you can provide more detail, I'm sure we can advise.

sandeept · 2015-02-02

Thanks for the reply.

It's a touchy subject at work since it caused a major issue. Regardless, our larger team is trying to see if there is something to learn from it. I'll ask around for this info.

Do you know if there are processes that are normal for file system teams but may not be for middleware teams? For example, even if you accidentally run a script in production that is quite destructive and should only be run during a release, the program simply doesn't have the correct credentials to run. I have little to no experience, but file system scripts are certainly not run as often as the scripts of the middleware teams that run the servers where the applications are, so maybe this extra level of security (if feasible) could be afforded. This is a different topic from the one I posted, but it all comes down to prevention.
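As a rough sketch of the kind of safeguard described above (not anything taken from the actual environment in question), a destructive script could refuse to start unless it is running where it is allowed to run, and unless the operator has typed the target name by hand. The `DEPLOY_ENV` variable, the environment names, and the function names here are all hypothetical.

```python
import os


class ProductionGuardError(RuntimeError):
    """Raised when a destructive script is about to run where it should not."""


def require_allowed_environment(allowed, env=None):
    """Return the current environment name, or raise if it is not allowed.

    DEPLOY_ENV is a made-up variable name for this sketch. A missing
    variable is treated as "unknown" and therefore refused.
    """
    env = os.environ if env is None else env
    current = env.get("DEPLOY_ENV", "unknown")
    if current not in allowed:
        raise ProductionGuardError(
            f"refusing to run: environment is {current!r}, "
            f"allowed: {sorted(allowed)}"
        )
    return current


def confirm_target(expected_host, typed_host):
    """Require the operator to type the target host name exactly.

    This does not stop a determined user, but it makes it harder to
    run the script against the wrong system by reflex.
    """
    if typed_host.strip().lower() != expected_host.lower():
        raise ProductionGuardError(
            f"confirmation {typed_host!r} does not match target {expected_host!r}"
        )
```

A migration script would call both checks before doing anything else. This only covers the "wrong place, wrong time" case; restricting which credentials exist on which management server would still be a separate measure.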

sandeept · 2015-02-02

Here is what I was able to find out. Keep in mind that I have no experience with enterprise file systems.

A file system was "un-exported", which means no hosts were able to access it. An NFS storage migration script was accidentally run in production, from within the management server. The file system was then re-exported and the affected applications were restarted.

What happened was that the application binaries all come from one place, and that place was un-exported.
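Since the outage came down to an export disappearing, a small check along the following lines could at least flag the problem quickly. It is only a sketch: it assumes the standard `showmount -e` client command is installed and that the storage answers it, and the host name and paths used with it are made up for illustration.

```python
import subprocess


def exported_paths(showmount_output):
    """Parse `showmount -e` output into a set of exported paths.

    The first line is a header ("Export list for <host>:"); each later
    line starts with the exported path, followed by the allowed clients.
    Paths containing spaces are not handled in this sketch.
    """
    paths = set()
    for line in showmount_output.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("export list for"):
            continue
        paths.add(line.split()[0])
    return paths


def missing_exports(showmount_output, expected):
    """Return the expected paths that are absent from the export list."""
    present = exported_paths(showmount_output)
    return sorted(path for path in expected if path not in present)


def query_exports(host, timeout=30):
    """Run `showmount -e <host>` and return its text output."""
    result = subprocess.run(
        ["showmount", "-e", host],
        capture_output=True,
        text=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout
```

For example, `missing_exports(query_exports("filer01"), ["/vol/apps"])` would return `["/vol/apps"]` if that export had been removed, which a scheduled job could turn into an alert.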

paulstringfellow · 2015-02-05

Hi, and thanks for the update.

So it sounds like what happened was that the file system was disconnected from the production hosts, as opposed to it being a problem with the NetApp NFS export.

Assuming this didn't take down the export at the NetApp level, I'm not sure there is much you can do other than what you suggested yourself, which is to ensure security and processes are in place to stop this being run again.

At the NetApp level, access to the exports can be controlled as you'd expect. However, the right user with the right security rights doing the wrong thing is difficult to stop!

I'm not sure there is a lot you can do other than a review of policy and procedure, I'd think.

Sorry I couldn't be more help; NFS scripting is not really my area of expertise!