Data Infrastructure Management Software Discussions

OnCommand Unified Host 1.2... nothing but problems.

Hey all,

We've had nothing but problems since we started trying to use OnCommand Host for our VMware volumes.

Currently, we are having two severe issues. First, DFM seems to lose connectivity to the host agent, and all backups fail from that point on. Restarting the host service hangs, so I have to reboot the server to restore functionality.

Secondly, the local backup retention schedule isn't being followed consistently. I have it set to keep hourlies for a day, dailies for a week, weeklies for a month, and monthlies for 3 months. Right now, however, I have the last hourly, an hourly from 9 days ago, and a daily from 2 weeks ago (anything older is a snapshot from the previous backup setup). It also isn't consistent across the datasets: one of the other datasets has a few more dailies kept, even though all of them are on the same local policy.

I have a case open with support about this, but I've been getting very little traction (no contact in the last 2 days, despite attempts on my part to contact them) and am looking for any help I can get.

Not sure what info will be needed to troubleshoot, but here are some salient details: OC Core 5.1, OC Host 1.2, FAS3210s. Both software packages are installed on Win 2k8 R2 64-bit, all living on the same subnet. OC Host is installed on the VMware vCenter server, and VMware is all on 4.1.

Thanks for any help you may be able to provide.

Re: OnCommand Unified Host 1.2... nothing but problems.

Oops, ONTAP version is 8.0.1P4 7-Mode

Re: OnCommand Unified Host 1.2... nothing but problems.

Jeremy,

Your problem description reads like a server resource problem.

You stated that OCHP is installed on the Win2k8 vCenter server.

1) Do you have MS hotfix 2577795 installed on that server?

http://support.microsoft.com/kb/2577795

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=536261

2) Are the resources reserved for the vCenter/OCHP VM (assuming it is one)?

3) What CPU/RAM resources does the server have reserved?

Re: OnCommand Unified Host 1.2... nothing but problems.

Ryan,

Thanks for your reply.

Downloading the hotfix now.

As for your other questions: we don't really have resources reserved for any of our VMs, since we generally run at 25 to 50 percent of our VMware cluster capacity. To be clear, OC Core is installed on a VM as well. Utilization on that server also tends to run very low. Taking our cluster utilization into consideration, would you still recommend that I reserve resources?

Re: OnCommand Unified Host 1.2... nothing but problems.

Update: the hotfix installer claims the update isn't applicable.

Re: OnCommand Unified Host 1.2... nothing but problems.

Always reserve the recommended resources for OCUM and OCHP servers to prevent resource issues.

That hotfix should be applicable for any Win2k8 server, so perhaps you already had it installed?

C:\>systeminfo |findstr 2577795

Re: OnCommand Unified Host 1.2... nothing but problems.

Kevin,

I get no result from the systeminfo command, nor did I find the hotfix listed in "Control Panel > Programs & Features > View installed updates" (based on a search for the number).

Is there a whitepaper on the recommended resources for these VMs? If it helps, we're only running 2 filers with 3 total controllers.

Re: OnCommand Unified Host 1.2... nothing but problems.

Apologies, the hotfix is applicable to any Windows Server 2008 R2 server.

There is no published documentation that I am aware of for the reserved resources, although a request has been made to add it to the Installation/Setup guides.

Reserving server resources has been observed to resolve "out of memory" and missed scheduled job conditions in the past, particularly on the OCHP server, which kicks off the backup jobs and then registers them with UM once they complete (similar to SnapManager product integration scenarios).

Re: OnCommand Unified Host 1.2... nothing but problems.

Kevin,

I reserved the entire memory allocated to each server, and 10000 MHz of CPU.

The vCenter/OCHP server is Server 2008 R2 Standard.

This is the error I'm currently seeing in the failed jobs (somewhat of a new symptom, in that a reboot hasn't cleared it up): OnCommandHSVMware: hsBackup8 1ddbaefad80a96414abc3b00bf865b18: Failed to connect to vCenter Server <servername>.

Re: OnCommand Unified Host 1.2... nothing but problems.

Just to update, I fixed the problem that was causing backups to outright fail, so I'll monitor it to see if the resource reservations make a difference.

I did, however, notice that the one dataset on which mirroring isn't currently functional (it wouldn't initiate the mirror due to space constraints) seems to have had no problems retaining backups to schedule. Is it possible that the problem is a conflict between the local backup policy and the storage service being used for mirroring?

Edit: I somewhat lied above. Upon looking into the dataset in more detail, it did create the mirror at some point after the first attempt failed.

So to summarize, the current situation is that one dataset is working right and 5 others aren't.

Re: OnCommand Unified Host 1.2... nothing but problems.

Jeremy,

When you run the command "dfpm backup list DATASET_ID" on the OC UM server, are primary backups listed, or no backups at all?

Backups not running on the OCHP side can be resource related. Typically this failure coincides with errors in the system/application event logs regarding resource constraints. Are you seeing any such errors in your event logs?

Re: OnCommand Unified Host 1.2... nothing but problems.

Kevin,

To bring you up to date, and to clear up a misconception:

I was wrong about one of the datasets working differently: that volume actually still has its pre-OCHP dataset, which is working correctly.

As far as the backups go, let me be clear: all the scheduled backups seem to be happening. The problem is that the retention settings are not being adhered to: the backups are getting deleted long before the retention schedule calls for their deletion.

Re: OnCommand Unified Host 1.2... nothing but problems.

The retention settings are controlled in 2 locations:

1) The OCHP/primary backup retention is controlled by the "Local Policy" visible in the virtual dataset within the OCUM UI (not the NMC).

2) The secondary backup retention is specified by the protection policy assigned to the storage service and must be viewed or edited from the OCUM server CLI or the NMC.

OCUM Core should not be deleting OCHP primary-side backups that have not yet reached the configured retention. If you suspect this is the case, the controller audit logs should be inspected to determine which system is authenticating to delete the backups.

Re: OnCommand Unified Host 1.2... nothing but problems.

Kevin,

The retention settings on both the protection policy attached to the storage service and the local policy specify keeping hourly snapshots for two days, dailies for a week, and weeklies for a month.

What I'm seeing is more akin to each snapshot type getting deleted as soon as a newer one is made, i.e., I have one hourly, one daily, and one weekly at any given time.

Re: OnCommand Unified Host 1.2... nothing but problems.

It would be beneficial to see the output of the following:

1) "dfpm backup list DATASET_ID"

2) "dfpm policy node get -q SECONDARY_POLICY_ID"

3) "dfpm policy get -q PRIMARY_POLICY_ID"

Re: OnCommand Unified Host 1.2... nothing but problems.

Backup Id  Backup Version        Retention Type  Retention Duration (in seconds)  Node Name     Description  Properties(Name=Value)
---------  --------------------  --------------  -------------------------------  ------------  -----------  --------------------------------------------------------
   181147  21 Dec 2012 09:00:00  hourly          172800                           Primary data               CreateVmwareSnapshot=false IncludeIndependentDisks=false
   180548  20 Dec 2012 13:00:00  hourly          172800                           Mirror                     CreateVmwareSnapshot=false IncludeIndependentDisks=false
   180525  20 Dec 2012 13:00:00  hourly          172800                           Primary data               CreateVmwareSnapshot=false IncludeIndependentDisks=false
   177890  16 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177872  16 Dec 2012 22:00:01  daily           604800                           Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177183  16 Dec 2012 00:00:01  weekly          2419200                          Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177168  16 Dec 2012 00:00:01  weekly          2419200                          Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177120  15 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177102  15 Dec 2012 22:00:01  daily           604800                           Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
   176352  14 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   176335  14 Dec 2012 22:00:01  daily           604800                           Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
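To make the listing above easier to reason about, here is a rough Python sketch that tallies the surviving backups per node and retention type and measures the gaps between them. The rows are transcribed by hand from the output above; the function names (`tally`, `gap_hours`) are made up for illustration and are not part of any NetApp tool.

```python
from collections import Counter
from datetime import datetime

# Rows copied from the "dfpm backup list" output above:
# (backup id, backup version, retention type, node name)
BACKUPS = [
    (181147, "21 Dec 2012 09:00:00", "hourly", "Primary data"),
    (180548, "20 Dec 2012 13:00:00", "hourly", "Mirror"),
    (180525, "20 Dec 2012 13:00:00", "hourly", "Primary data"),
    (177890, "16 Dec 2012 22:00:01", "daily", "Mirror"),
    (177872, "16 Dec 2012 22:00:01", "daily", "Primary data"),
    (177183, "16 Dec 2012 00:00:01", "weekly", "Mirror"),
    (177168, "16 Dec 2012 00:00:01", "weekly", "Primary data"),
    (177120, "15 Dec 2012 22:00:01", "daily", "Mirror"),
    (177102, "15 Dec 2012 22:00:01", "daily", "Primary data"),
    (176352, "14 Dec 2012 22:00:01", "daily", "Mirror"),
    (176335, "14 Dec 2012 22:00:01", "daily", "Primary data"),
]


def parse_version(text):
    """Parse a backup version timestamp such as '21 Dec 2012 09:00:00'."""
    return datetime.strptime(text, "%d %b %Y %H:%M:%S")


def tally(backups):
    """Count the listed backups per (node name, retention type)."""
    return Counter((node, rtype) for _id, _version, rtype, node in backups)


def gap_hours(backups, node, rtype):
    """Hours between consecutive backups of one type on one node, newest first."""
    times = sorted(
        (parse_version(version)
         for _id, version, t, n in backups
         if n == node and t == rtype),
        reverse=True,
    )
    return [(newer - older).total_seconds() / 3600
            for newer, older in zip(times, times[1:])]


if __name__ == "__main__":
    for (node, rtype), count in sorted(tally(BACKUPS).items()):
        print(f"{node:12s} {rtype:7s} {count}")
    print("primary hourly gaps (h):", gap_hours(BACKUPS, "Primary data", "hourly"))
    print("primary daily gaps (h): ", gap_hours(BACKUPS, "Primary data", "daily"))
```

On this data the primary node holds only two hourlies, 20 hours apart, and three dailies from 14 to 16 Dec with nothing from 17 to 20 Dec, which is the kind of hole described earlier in the thread.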

c:\Users\DSCC_Admin\Desktop\DFMDCv2>dfpm policy node get -q mirror

nodeId=1

nodeName=Primary data

hourlyRetentionCount=2

hourlyRetentionDuration=172800

dailyRetentionCount=2

dailyRetentionDuration=604800

weeklyRetentionCount=1

weeklyRetentionDuration=2419200

monthlyRetentionCount=0

monthlyRetentionDuration=7257600

backupScriptPath=

backupScriptRunAs=

failoverScriptPath=

failoverScriptRunAs=

snapshotScheduleId=49

snapshotScheduleName=Sunday at midnight with daily and hourly

lagWarningEnabled=Yes

lagWarningThreshold=129600

lagErrorEnabled=Yes

lagErrorThreshold=172800

nodeId=2

nodeName=Mirror

c:\Users\DSCC_Admin\Desktop\DFMDCv2>dfpm policy get -q "VMware local backup policy - AMARG"

Name=VMware local backup policy - AMARG

Description=VMware local backup, Customized for AMARG retention

Type=vmware

backupScript=

hourlyRetentionCount=1

hourlyRetentionDuration=172800

dailyRetentionCount=1

dailyRetentionDuration=604800

weeklyRetentionCount=1

weeklyRetentionDuration=2419200

monthlyRetentionCount=1

monthlyRetentionDuration=7257600

lagWarningThreshold=129600

lagErrorThreshold=172800

Those retention counts worry me, but I don't remember seeing them as a separate option when configuring these policies.
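For anyone reading the raw output, the duration fields are in seconds. A quick Python sketch converting the values shown above into days follows. It only does the arithmetic and assumes nothing about how the retention counts and durations interact inside DFM; the dictionary and function names are illustrative.

```python
SECONDS_PER_DAY = 86400

# Values copied from the two policy outputs above (they are identical in both).
RETENTION_SECONDS = {
    "hourly": 172800,
    "daily": 604800,
    "weekly": 2419200,
    "monthly": 7257600,
}
LAG_SECONDS = {
    "lagWarningThreshold": 129600,
    "lagErrorThreshold": 172800,
}


def to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    for name, seconds in {**RETENTION_SECONDS, **LAG_SECONDS}.items():
        print(f"{name}: {seconds} s = {to_days(seconds):g} days")
```

This gives 2 days for hourlies, 7 days for dailies, 28 days for weeklies, and 84 days for monthlies, with lag thresholds of 1.5 and 2 days. The durations therefore match what was configured, so the durations themselves do not explain the early deletions.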
