OnCommand Unified Host 1.2... nothing but problems.

JEREMY_SUNSHINE · ‎2012-12-19

Hey all,

Experiencing nothing but problems since we started trying to use OnCommand host for our VMware volumes.

Currently, we are having two severe issues. First, DFM seems to lose connectivity to the host agent. All backups fail from that point on, and restarting the host service hangs, so I have to reboot the server to restore functionality.

Second, the local backup schedule isn't being followed consistently. I have it set to keep hourlies for a day, dailies for a week, weeklies for a month, and monthlies for 3 months. Right now, however, I have the last hourly, an hourly from 9 days ago, and a daily from 2 weeks ago (anything older is a snapshot from the previous backup setup). It also isn't consistent across the datasets: one of the other datasets has a few more dailies kept, even though all of them are on the same local policy.

I have a case open with support about this, but I've been getting very little traction (no contact in the last 2 days, despite attempts on my part to contact them) and am looking for any help I can get.

Not sure what info will be needed to troubleshoot, but here are some salient details: OC core 5.1, OC host 1.2, FAS3210s. Both software packages are installed on Win 2k8 R2-64, all living on the same subnet. OC Host is installed on the VMware vCenter server, and VMware is all on 4.1.

Thanks for any help you may be able to provide.

JEREMY_SUNSHINE · ‎2012-12-19

Oops, ONTAP version is 8.0.1P4 7-Mode

kryan · ‎2012-12-19

Jeremy,

Your problem description reads like a server resource problem.

You stated that OCHP is installed on the Win2k8 vCenter server.

1) Do you have MS hotfix 2577795 installed on that server?

http://support.microsoft.com/kb/2577795

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=536261

2) Are the resources reserved for the vCenter/OCHP VM (assuming it is one)?

3) What CPU/RAM resources does the server have reserved?

JEREMY_SUNSHINE · ‎2012-12-19

Ryan,

Thanks for your reply.

Downloading the hotfix now.

As for your other questions: we don't really have resources reserved for any of our VMs, since we generally run at 25 to 50 percent of our VMware cluster capacity. To be clear, OC core is installed on a VM as well. Utilization on that server tends to run very low too, but taking our cluster utilization into consideration, would you still recommend I reserve resources?

JEREMY_SUNSHINE · ‎2012-12-19

Update: the hotfix installer claims the update isn't applicable.

kryan · ‎2012-12-19

Always reserve the recommended resources for OCUM and OCHP servers to prevent resource issues.

That hotfix should be applicable for any Win2k8 server, so perhaps you already had it installed?

C:\>systeminfo |findstr 2577795
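
If the same check has to be repeated on several servers, it can be scripted. Below is a minimal Python sketch, assuming the hotfix shows up in the `systeminfo` output as `KB2577795`. The function names `hotfix_listed` and `read_systeminfo` and the sample text are illustrative only and are not part of any NetApp or Microsoft tool.

```python
import re
import subprocess


def hotfix_listed(systeminfo_text: str, kb_number: str) -> bool:
    """Return True if the given KB number appears in systeminfo output."""
    # Match the number as a whole token, with or without the "KB" prefix,
    # so 2577795 does not match inside a longer number such as 25777950.
    pattern = r"(?<!\d)(?:KB)?" + re.escape(kb_number) + r"(?!\d)"
    return re.search(pattern, systeminfo_text, flags=re.IGNORECASE) is not None


def read_systeminfo() -> str:
    """Run systeminfo (Windows only) and return its text output."""
    result = subprocess.run(
        ["systeminfo"], capture_output=True, text=True, check=True
    )
    return result.stdout


# Hypothetical excerpt of systeminfo output, for illustration only.
SAMPLE = """Hotfix(s):                 2 Hotfix(s) Installed.
                           [01]: KB2577795
                           [02]: KB976902
"""
```

On the server itself, `hotfix_listed(read_systeminfo(), "2577795")` should answer the same question as the findstr command above.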

JEREMY_SUNSHINE · ‎2012-12-19

Kevin,

I get no result from the systeminfo command, nor did I find the hotfix listed in "control panel>programs & features>view installed updates" (based on a search for the number).

Is there a whitepaper on the recommended resources for these VMs? If it helps, we're only running 2 filers with 3 total controllers.

kryan · ‎2012-12-19

Apologies, the hotfix is applicable to any Windows Server 2008 R2 server.

There is no published documentation that I am aware of for the reserved resources, although it has been requested that it be added to the Installation/Setup guides.

It has been observed that reserving server resources has resolved "out of memory" and missed scheduled job conditions in the past, particularly on the OCHP server, which kicks off the backup jobs and then registers them to UM once completed (similar to SnapManager product integration scenarios).

JEREMY_SUNSHINE · ‎2012-12-19

Kevin,

I reserved the entire memory allocated to each server, and 10000 MHz CPU.

The vCenter/OCHP server is Server 2008 R2 standard.

This is the error I'm encountering in the jobs that are currently failing (somewhat of a new symptom, in that a reboot hasn't cleared it up):

OnCommandHSVMware: hsBackup8 1ddbaefad80a96414abc3b00bf865b18: Failed to connect to vCenter Server <servername>.

JEREMY_SUNSHINE · ‎2012-12-20

Just to update, I fixed the problem that was causing backups to outright fail, so I'll monitor it to see if the resource reservations make a difference.

I did, however, notice that the one dataset on which mirroring isn't currently functional (it wouldn't initiate the mirror due to space constraints) seems to have had no problems retaining backups to schedule. Is it possible that the problem is a conflict between the local backup policy and the storage service being used for mirroring?

Edit: I somewhat lied above. Upon looking into the dataset in more detail, it did create the mirror at some point after the first try failed.

So to summarize, the current situation is that one dataset is working right and 5 others aren't.

kryan · ‎2012-12-21

Jeremy,

When you run the command "dfpm backup list DATASET_ID" on the OC UM server, are there primary backups listed or no backups listed?

The backups not running on the OCHP side can be resource related. Typically this failure coincides with errors in the system/app event logs regarding resource constraints - are you seeing any such errors in your event logs?

JEREMY_SUNSHINE · ‎2012-12-21

Kevin,

To bring you up to date, and to clarify a misconception:

I was wrong about one of the datasets working differently: that volume actually still has its pre-OCHP dataset, which is working correctly.

As far as the backups go, let me be clear: all the scheduled backups seem to be happening. The problem is that the retention settings are not being adhered to: the backups are getting deleted long before the retention schedule calls for their deletion.

kryan · ‎2012-12-21

The retention settings are controlled in 2 locations:

1) The OCHP/primary backup retention is controlled by the "Local Policy" visible in the virtual dataset within the OCUM UI (not the NMC).

2) The secondary backup retention is specified by the protection policy assigned to the storage service and must be viewed or edited from the OCUM server CLI or the NMC.

OCUM Core should not be deleting the OCHP primary-side backups that do not meet the configured retention. If you suspect this is the case, the controller audit logs should be inspected to determine which system is authenticating to delete the backups.

JEREMY_SUNSHINE · ‎2012-12-21

Kevin,

The retention policies on both the protection policy attached to the storage service and the local policy specify keeping hourly snapshots for two days, dailies for a week, and weeklies for a month.

What I'm seeing is more akin to each snapshot type getting deleted as soon as a newer one is made, i.e., I have one hourly, one daily, and one weekly at any given time.
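
For reference, the retention described in this post can be turned into a quick expectation check. This is a rough Python sketch, assuming a backup should survive until its creation time plus the retention period for its type, and taking "a month" to mean 30 days. The function names and the timestamps in the usage note are made up for illustration and do not come from OnCommand.

```python
from datetime import datetime, timedelta

# Retention per backup type, as described in the post above.
# Assumption: "a month" is treated as 30 days.
RETENTION = {
    "hourly": timedelta(days=2),
    "daily": timedelta(weeks=1),
    "weekly": timedelta(days=30),
}


def expires_at(created: datetime, retention_type: str) -> datetime:
    """Earliest time at which a backup of this type may be deleted."""
    return created + RETENTION[retention_type]


def should_still_exist(created: datetime, retention_type: str, now: datetime) -> bool:
    """True if the backup is still inside its retention window at `now`."""
    return now < expires_at(created, retention_type)
```

For example, `should_still_exist(datetime(2012, 12, 20, 13, 0), "hourly", datetime(2012, 12, 21, 12, 0))` is true under these assumptions, so an hourly taken the previous afternoon should still be on hand rather than being removed when the next hourly is created.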

kryan · ‎2012-12-21

It would be beneficial to see the output of the following:

1) "dfpm backup list DATASET_ID"

2) "dfpm policy node get -q SECONDARY_POLICY_ID"

3) "dfpm policy get -q PRIMARY_POLICY_ID"
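
If it is easier to gather all three outputs in one go, they could be collected with a small script run on the OC UM server. This is a minimal Python sketch, assuming `dfpm` is on the PATH. The function names are illustrative, and the ID arguments are placeholders that need to be replaced with real values.

```python
import subprocess
from typing import List


def build_commands(
    dataset_id: str, secondary_policy_id: str, primary_policy_id: str
) -> List[List[str]]:
    """Return the three dfpm command lines listed above as argument lists."""
    return [
        ["dfpm", "backup", "list", dataset_id],
        ["dfpm", "policy", "node", "get", "-q", secondary_policy_id],
        ["dfpm", "policy", "get", "-q", primary_policy_id],
    ]


def collect_output(commands: List[List[str]]) -> str:
    """Run each command and return one text report with all outputs."""
    sections = []
    for argv in commands:
        # check=False: keep going and record stderr even if a command fails.
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        sections.append("> " + " ".join(argv) + "\n" + result.stdout + result.stderr)
    return "\n".join(sections)
```

Calling `print(collect_output(build_commands("DATASET_ID", "SECONDARY_POLICY_ID", "PRIMARY_POLICY_ID")))` with real IDs substituted would print the three outputs one after another, ready to paste into a reply.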

JEREMY_SUNSHINE · ‎2012-12-21

Backup Id  Backup Version        Retention Type  Retention Duration (in seconds)  Node Name     Description  Properties(Name=Value)
---------  --------------------  --------------  -------------------------------  ------------  -----------  ----------------------
181147     21 Dec 2012 09:00:00  hourly          172800                           Primary data               CreateVmwareSnapshot=false IncludeIndependentDisks=false
180548     20 Dec 2012 13:00:00  hourly          172800                           Mirror                     CreateVmwareSnapshot=false IncludeIndependentDisks=false
180525     20 Dec 2012 13:00:00  hourly          172800                           Primary data               CreateVmwareSnapshot=false IncludeIndependentDisks=false
177890     16 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=t