
OnCommand Unified Host 1.2... nothing but problems.

JEREMY_SUNSHINE

Hey all,

We've been experiencing nothing but problems since we started trying to use OnCommand Host for our VMware volumes.

Currently, we are having two severe issues. First, DFM seems to lose connectivity to the host agent, and all backups fail from that point on. Restarting the host service hangs, so I have to reboot the server to restore functionality.

Secondly, the local backup schedule isn't being followed consistently. I have it set to keep hourlies for a day, dailies for a week, weeklies for a month, and monthlies for 3 months. Right now, however, I have only the last hourly, an hourly from 9 days ago, and a daily from 2 weeks ago (anything older is a snapshot from the previous backup setup). It also isn't consistent across datasets: one of the other datasets has a few more dailies kept, even though all of them are on the same local policy.

I have a case open with support about this, but I've been getting very little traction (no contact in the last 2 days, despite attempts on my part to contact them) and am looking for any help I can get.

Not sure what info will be needed to troubleshoot, but here are some salient details: OC core 5.1, OC host 1.2, FAS3210s. Both software packages are installed on Win 2k8 R2-64, all living on the same subnet. OC Host is installed on the VMware vCenter server, and VMware is all on 4.1.

Thanks for any help you may be able to provide.


JEREMY_SUNSHINE

Oops, ONTAP version is 8.0.1P4 7-Mode

kryan

Jeremy,

Your problem description reads like a server resource problem.

You stated that OCHP is installed on the Win2k8 vCenter server.

1) Do you have MS hotfix 2577795 installed on that server? 

http://support.microsoft.com/kb/2577795

http://support.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=536261

2) Are the resources reserved for the vCenter/OCHP VM (assuming it is one)?

3) What CPU/RAM resources does the server have reserved?

JEREMY_SUNSHINE

Ryan,

Thanks for your reply.

Downloading the hotfix now.

As for your other questions: we don't really have resources reserved for any of our VMs, since we generally run at 25 to 50 percent of our VMware cluster capacity. To be clear, OC core is installed on a VM as well. Utilization on that server tends to run very low too, but taking our cluster utilization into consideration, would you still recommend that I reserve resources?

JEREMY_SUNSHINE

Update: the hotfix installer claims the update isn't applicable.

kryan

Always reserve the recommended resources for OCUM and OCHP servers to prevent resource issues.

That hotfix should be applicable for any Win2k8 server, so perhaps you already had it installed?

C:\>systeminfo |findstr 2577795
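The same check can be scripted against saved or captured `systeminfo` output. What follows is only a minimal Python sketch: it assumes the hotfix shows up in the output as `KB2577795` (or as the bare number), and the function name `hotfix_listed` and the sample text are illustrative, not taken from any NetApp or Microsoft tool.

```python
import re

def hotfix_listed(systeminfo_output: str, kb_number: str) -> bool:
    """Return True if the KB number appears as a whole token in the output."""
    # Accept "KB2577795" or a bare "2577795", but reject a longer number
    # that merely contains the same digits.
    pattern = re.compile(
        r"(?<!\d)(?:KB)?" + re.escape(kb_number) + r"(?!\d)", re.IGNORECASE
    )
    return bool(pattern.search(systeminfo_output))

# Hypothetical excerpt of the hotfix section of systeminfo output.
sample = """Hotfix(s):                 2 Hotfix(s) Installed.
                           [01]: KB2506014
                           [02]: KB2577795
"""

print(hotfix_listed(sample, "2577795"))
print(hotfix_listed(sample, "2533552"))
```

On the server itself, the text could be captured with `subprocess.run(["systeminfo"], capture_output=True, text=True).stdout` and passed to the same function.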

JEREMY_SUNSHINE

Kevin,

I get no result from the systeminfo command, nor did I find the hotfix listed in "control panel>programs & features>view installed updates" (based on a search for the number).

Is there a whitepaper on the recommended resources for these VMs?  If it helps, we're only running 2 filers with 3 total controllers.

kryan

Apologies, the hotfix is applicable to any Windows Server 2008 R2 server.

There is no published documentation that I am aware of for the reserved resources, although it has been requested that this be added to the Installation/Setup guides.

It has been observed that reserving server resources has resolved "out of memory" and missed scheduled job conditions in the past, particularly on the OCHP server, which kicks off the backup jobs and then registers them to UM once completed (similar to SnapManager product integration scenarios).

JEREMY_SUNSHINE

Kevin,

I reserved the entire memory allocated to each server, and 10000 MHz CPU.

The vCenter/OCHP server is Server 2008 R2 standard.

This is the error I'm encountering in the failed jobs currently happening: (somewhat of a new symptom, in that a reboot hasn't cleared it up.)  OnCommandHSVMware: hsBackup8 1ddbaefad80a96414abc3b00bf865b18: Failed to connect to vCenter Server <servername>. 

JEREMY_SUNSHINE

Just to update, I fixed the problem that was causing backups to outright fail, so I'll monitor it to see if the resource reservations make a difference.

I did, however, notice that the one dataset on which mirroring isn't currently functional (it wouldn't initiate the mirror due to space constraints) seems to have had no problems retaining backups to schedule. Is it possible that the problem is a conflict between the local backup policy and the storage service being used for mirroring?

Edit: I somewhat lied above. Looking into the dataset in more detail, I see that it did create the mirror at some point after the first try failed.

So to summarize, current situation is that one dataset is working right, 5 others aren't.

kryan

Jeremy,

When you run the command "dfpm backup list DATASET_ID" on the OC UM server are there primary backups listed or no backups listed?

The backups not running on the OCHP side can be resource related.  Typically this failure coincides with errors in the system/app event logs regarding resource constraints - are you seeing any such errors in your event logs?

JEREMY_SUNSHINE

Kevin,

To bring you up to date, and to clarify a misconception:

I was wrong about one of the datasets working differently: that volume actually still has its pre-OCHP dataset, which is working correctly.

As far as the backups go, let me be clear: all the scheduled backups seem to be happening.  The problem is that the retention settings are not being adhered to: the backups are getting deleted long before the retention schedule calls for their deletion.

kryan

The retention settings are controlled in 2 locations:

1) The OCHP/primary backup retention is controlled by the "Local Policy" visible in the virtual dataset within the OCUM UI (not the NMC).

2) The secondary backup retention is specified by the protection policy assigned to the storage service and must be viewed or edited from the OCUM server CLI or the NMC.

OCUM Core should not be deleting the OCHP primary side backups that do not meet the configured retention.  If you suspect this is the case, the controller audit logs should be inspected to determine which system is authenticating to delete the backups.

JEREMY_SUNSHINE

Kevin,

The retention policies on both the protection policy attached to the storage service and the local policy specify keeping hourly snapshots for two days, dailies for a week, and weeklies for a month.

What I'm seeing is more akin to each snapshot type getting deleted as soon as a newer one is made, i.e., I have one hourly, one daily, and one weekly at any given time.

kryan

It would be beneficial to see the output of the following:

1) "dfpm backup list DATASET_ID"

2) "dfpm policy node get -q SECONDARY_POLICY_ID"

3) "dfpm policy get -q PRIMARY_POLICY_ID"

JEREMY_SUNSHINE
Backup Id  Backup Version        Retention Type  Retention Duration (in seconds)  Node Name     Description  Properties(Name=Value)
---------  --------------------  --------------  -------------------------------  ------------  -----------  --------------------------------------------------------
   181147  21 Dec 2012 09:00:00  hourly          172800                           Primary data               CreateVmwareSnapshot=false IncludeIndependentDisks=false
   180548  20 Dec 2012 13:00:00  hourly          172800                           Mirror                     CreateVmwareSnapshot=false IncludeIndependentDisks=false
   180525  20 Dec 2012 13:00:00  hourly          172800                           Primary data               CreateVmwareSnapshot=false IncludeIndependentDisks=false
   177890  16 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177872  16 Dec 2012 22:00:01  daily           604800                           Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177183  16 Dec 2012 00:00:01  weekly          2419200                          Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177168  16 Dec 2012 00:00:01  weekly          2419200                          Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177120  15 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   177102  15 Dec 2012 22:00:01  daily           604800                           Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false
   176352  14 Dec 2012 22:00:01  daily           604800                           Mirror                     CreateVmwareSnapshot=true IncludeIndependentDisks=false
   176335  14 Dec 2012 22:00:01  daily           604800                           Primary data               CreateVmwareSnapshot=true IncludeIndependentDisks=false

c:\Users\DSCC_Admin\Desktop\DFMDCv2>dfpm policy node get -q mirror
nodeId=1
nodeName=Primary data
hourlyRetentionCount=2
hourlyRetentionDuration=172800
dailyRetentionCount=2
dailyRetentionDuration=604800
weeklyRetentionCount=1
weeklyRetentionDuration=2419200
monthlyRetentionCount=0
monthlyRetentionDuration=7257600
backupScriptPath=
backupScriptRunAs=
failoverScriptPath=
failoverScriptRunAs=
snapshotScheduleId=49
snapshotScheduleName=Sunday at midnight with daily and hourly
lagWarningEnabled=Yes
lagWarningThreshold=129600
lagErrorEnabled=Yes
lagErrorThreshold=172800

nodeId=2
nodeName=Mirror

c:\Users\DSCC_Admin\Desktop\DFMDCv2>dfpm policy get -q "VMware local backup policy - AMARG
Name=VMware local backup policy - AMARG
Description=VMware local backup, Customized for AMARG retention
Type=vmware
backupScript=
hourlyRetentionCount=1
hourlyRetentionDuration=172800
dailyRetentionCount=1
dailyRetentionDuration=604800
weeklyRetentionCount=1
weeklyRetentionDuration=2419200
monthlyRetentionCount=1
monthlyRetentionDuration=7257600
lagWarningThreshold=129600
lagErrorThreshold=172800

Those retention counts worry me, but I don't remember seeing them as a separate option when configuring these policies.
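For reading listings like the one above, here is a small Python sketch that turns the `key=value` lines into a dictionary and converts each retention duration from seconds to days. The helper names `parse_policy` and `retention_summary` are made up for illustration, and the sketch assumes a single-policy listing where no key is repeated (the multi-node output above repeats `nodeId` and `nodeName`, so it would need splitting first).

```python
def parse_policy(output: str) -> dict:
    """Parse 'key=value' lines into a dict, skipping anything without '='."""
    settings = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            settings[key] = value
    return settings

def retention_summary(settings: dict) -> dict:
    """Map each backup type to (retention count, retention duration in days)."""
    summary = {}
    for kind in ("hourly", "daily", "weekly", "monthly"):
        count = int(settings[f"{kind}RetentionCount"])
        days = int(settings[f"{kind}RetentionDuration"]) / 86400  # seconds per day
        summary[kind] = (count, days)
    return summary

# Retention lines copied from the local policy listing above.
local_policy = """\
hourlyRetentionCount=1
hourlyRetentionDuration=172800
dailyRetentionCount=1
dailyRetentionDuration=604800
weeklyRetentionCount=1
weeklyRetentionDuration=2419200
monthlyRetentionCount=1
monthlyRetentionDuration=7257600
"""

print(retention_summary(parse_policy(local_policy)))
# {'hourly': (1, 2.0), 'daily': (1, 7.0), 'weekly': (1, 28.0), 'monthly': (1, 84.0)}
```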

kryan

Counts can only be modified through the CLI; there is no way to modify them in the NMC.  The retention behavior will be a combination of the counts and time specified.

You are using a mirror protection policy, so there is no "retention" for the secondary - it will be an exact copy of the primary volume as of when the last protection job was kicked off.

Your primary retention times match the secondary policy's primary node settings, but the counts do not.  The primary retention policy will be in control of the primary backups.

For example, based on the settings below you will always keep 1 backup of each style (hourly, daily, weekly, monthly) no matter its age.  However, you will also keep every backup that is younger than the time set in the policy.

c:\>dfpm policy get -q "VMware local backup policy - AMARG
Name=VMware local backup policy - AMARG
Description=VMware local backup, Customized for AMARG retention
Type=vmware
backupScript=
hourlyRetentionCount=1
hourlyRetentionDuration=172800  (2 days)
dailyRetentionCount=1
dailyRetentionDuration=604800  (7 days)
weeklyRetentionCount=1
weeklyRetentionDuration=2419200  (28 days)
monthlyRetentionCount=1
monthlyRetentionDuration=7257600  (84 days)
lagWarningThreshold=129600
lagErrorThreshold=172800

Your dataset currently has 2 hourly backups, 3 daily backups, and 1 weekly backup on the primary.  Based on their age and your primary policy settings, this appears to match.

The hourly backups are both under 2 days old.  The daily backups are all under the 7-day mark - expect to lose the oldest hourly and daily backups on the next backup.  That weekly backup is only 5 days old and therefore should remain for at least 3 more weeks, and longer if you do not make a new weekly style backup.

Backup Id  Backup Version        Retention Type  Retention Duration (in seconds)  Node Name
---------  --------------------  --------------  -------------------------------  ------------
   181147  21 Dec 2012 09:00:00  hourly          172800                           Primary data
   180525  20 Dec 2012 13:00:00  hourly          172800                           Primary data
   177872  16 Dec 2012 22:00:01  daily           604800                           Primary data
   177168  16 Dec 2012 00:00:01  weekly          2419200                          Primary data
   177102  15 Dec 2012 22:00:01  daily           604800                           Primary data
   176335  14 Dec 2012 22:00:01  daily           604800                           Primary data
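The rule described in this reply (always keep the newest `count` backups of each type, plus every backup younger than that type's duration) can be sketched in Python as below. This is only an illustration of the rule as stated here, not the actual DataFabric Manager logic; the function name `backups_to_keep` and the evaluation time are assumptions, and the date parsing assumes an English locale for the month abbreviation.

```python
from datetime import datetime, timedelta

# (backup id, timestamp, retention type) for the primary-node backups listed above.
BACKUPS = [
    (181147, "21 Dec 2012 09:00:00", "hourly"),
    (180525, "20 Dec 2012 13:00:00", "hourly"),
    (177872, "16 Dec 2012 22:00:01", "daily"),
    (177168, "16 Dec 2012 00:00:01", "weekly"),
    (177102, "15 Dec 2012 22:00:01", "daily"),
    (176335, "14 Dec 2012 22:00:01", "daily"),
]

# retention type -> (count, duration in seconds), from the local policy above.
POLICY = {
    "hourly": (1, 172800),
    "daily": (1, 604800),
    "weekly": (1, 2419200),
    "monthly": (1, 7257600),
}

def backups_to_keep(backups, policy, now):
    """Keep a backup if it is among the newest `count` of its type,
    or if it is younger than the duration configured for its type."""
    keep = set()
    for kind, (count, duration) in policy.items():
        # Newest first, so the list index is the backup's rank within its type.
        of_kind = sorted(
            ((datetime.strptime(ts, "%d %b %Y %H:%M:%S"), bid)
             for bid, ts, k in backups if k == kind),
            reverse=True,
        )
        for rank, (when, bid) in enumerate(of_kind):
            if rank < count or now - when < timedelta(seconds=duration):
                keep.add(bid)
    return keep

# Assumed evaluation time: midday on the day the listing was taken.
now = datetime(2012, 12, 21, 12, 0, 0)
print(sorted(backups_to_keep(BACKUPS, POLICY, now)))
```

With the assumed time of 21 Dec 2012 12:00, all six primary backups are within their durations and are kept. Moving `now` forward shows the oldest dailies and the older hourly dropping out while one backup of each type always remains.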

JEREMY_SUNSHINE

If I understand your response correctly, then the counts are not the reason that I'm not keeping my full 2 days of hourlies. For example, the policy should be retaining 2 days of hourlies, with a minimum of 1 backup retained, but no maximum.

If so, then where else should I look for the cause of my retention problems?

kryan

You are correct about the behavior that you should be seeing.

I would need to see data on the backups that are getting removed prematurely to make further comments.

Did you say you have an open support case?

JEREMY_SUNSHINE

Indeed,

2003782819

kryan

That does not appear to be a case related to this behavior.   You would want to open a case for the behavior so that specific data can be collected and analyzed.

Specifically, you will want to provide a DFMDC, new AutoSupports, and the system IDs of the controllers involved, in addition to the controller audit logs, which show the backups being purged and allow investigation of what is deleting them.  You might also include "dfpm job detail JOBID" for a few backups.
