We have experienced nothing but problems since we started trying to use OnCommand Host for our VMware volumes.
Currently, we are having two severe issues. First, DFM seems to lose connectivity to the host agent, and all backups fail from that point on. Restarting the host service hangs, so I have to reboot the server to restore functionality.
Secondly, the local backup schedule isn't being followed consistently. I have it set to keep hourlies for a day, dailies for a week, weeklies for a month, and monthlies for 3 months. Right now, however, I have the last hourly, an hourly from 9 days ago, and a daily from 2 weeks ago (past this are snapshots from the previous backup setup). It also isn't consistent across all the datasets: one of the other datasets has a few more dailies kept, even though all of them are on the same local policy.
I have a case open with support about this, but I've been getting very little traction (no contact in the last 2 days, despite attempts on my part to contact them) and am looking for any help I can get.
Not sure what info will be needed to troubleshoot, but here are some salient details: OC Core 5.1, OC Host 1.2, FAS3210s. Both software packages are installed on Win 2k8 R2 64-bit, all living on the same subnet. OC Host is installed on the VMware vCenter server, and VMware is all on 4.1.
Thanks for any help you may be able to provide.
Your problem description reads like a server resource problem.
You stated that OCHP is installed on the Win2k8 vCenter server.
1) Do you have MS hotfix 2577795 installed on that server?
2) Are the resources reserved for the vCenter/OCHP VM (assuming it is one)?
3) What CPU/RAM resources does the server have reserved?
Thanks for your reply.
Downloading the hotfix now.
As for your other questions: we don't really have resources reserved for any of our VMs, since we generally run at 25 to 50 percent of our VMware cluster capacity. To be clear, OC Core is installed on a VM as well. Utilization on that server tends to run very low too, but given our cluster utilization, would you still recommend that I reserve resources?
Always reserve the recommended resources for OCUM and OCHP servers to prevent resource issues.
That hotfix should be applicable for any Win2k8 server, so perhaps you already had it installed?
C:\>systeminfo |findstr 2577795
I get no result from the systeminfo command, nor did I find the hotfix listed in "Control Panel > Programs & Features > View installed updates" (based on a search for the number).
Is there a whitepaper on the recommended resources for these VMs? If it helps, we're only running 2 filers with 3 total controllers.
Apologies, the hotfix is applicable to any Windows Server 2008 R2 server.
There is no published documentation that I am aware of for the reserved resources, although it has been requested that it be added to the Installation/Setup guides.
Reserving server resources has been observed to resolve "out of memory" and missed scheduled job conditions in the past, particularly on the OCHP server, which kicks off the backup jobs and then registers them to UM once completed (similar to SnapManager product integration scenarios).
I reserved the entire memory allocated to each server, and 10000 MHz CPU.
The vCenter/OCHP server is Server 2008 R2 Standard.
This is the error I'm currently encountering in the failed jobs (somewhat of a new symptom, in that a reboot hasn't cleared it up): OnCommandHSVMware: hsBackup8 1ddbaefad80a96414abc3b00bf865b18: Failed to connect to vCenter Server <servername>.
Just to update, I fixed the problem that was causing backups to outright fail, so I'll monitor it to see if the resource reservations make a difference.
I did, however, notice that the one dataset that mirroring isn't currently functional on (it wouldn't initiate the mirror due to space constraints) seems to have had no problems with retaining backups to schedule. Is it possible that the problem is a conflict between the local backup policy and the storage service being used for mirroring?
Edit: I somewhat lied above. Upon looking into the dataset in more detail, it decided to create the mirror some time after the first try failed.
So to summarize, the current situation is that one dataset is working right and 5 others aren't.
When you run the command "dfpm backup list DATASET_ID" on the OC UM server are there primary backups listed or no backups listed?
The backups not running on the OCHP side can be resource related. Typically this failure coincides with errors in the system/app event logs regarding resource constraints - are you seeing any such errors in your event logs?
To bring you up to date, and to clarify a misconception:
I was wrong about one of the datasets working differently: that volume actually still has its pre-OCHP dataset, which is working correctly.
As far as the backups go, let me be clear: all the scheduled backups seem to be happening. The problem is that the retention settings are not being adhered to: the backups are getting deleted far before the retention schedule calls for their deletion.
The retention settings are controlled in 2 locations:
1) The OCHP/primary backup retention is controlled by the "Local Policy" visible in the virtual dataset within the OCUM UI (not the NMC).
2) The secondary backup retention is specified by the protection policy assigned to the storage service and must be viewed or edited from the OCUM server CLI or the NMC.
OCUM Core should not be deleting OCHP primary-side backups that have not yet reached the configured retention. If you suspect this is the case, the controller audit logs should be inspected to determine which system is authenticating to delete the backups.
The retention policies on both the protection policy attached to the storage service and the local policy specify keeping hourly snapshots for two days, dailies for a week, and weeklies for a month.
What I'm seeing is more akin to each snapshot type getting deleted as soon as a newer one is made, i.e., I have one hourly, one daily, and one weekly at any given time.
It would be beneficial to see the output of the following:
1) "dfpm backup list DATASET_ID"
2) "dfpm policy node get -q SECONDARY_POLICY_ID"
3) "dfpm policy get -q PRIMARY_POLICY_ID"
Backup Id  Backup Version        Retention Type  Retention Duration (in seconds)
---------  --------------------  --------------  -------------------------------
181147     21 Dec 2012 09:00:00  hourly          172800
180548     20 Dec 2012 13:00:00  hourly          172800
180525     20 Dec 2012 13:00:00  hourly          172800
177890     16 Dec 2012 22:00:01  daily           604800
177872     16 Dec 2012 22:00:01  daily           604800
177183     16 Dec 2012 00:00:01  weekly          2419200
177168     16 Dec 2012 00:00:01  weekly          2419200
177120     15 Dec 2012 22:00:01  daily           604800
177102     15 Dec 2012 22:00:01  daily           604800
176352     14 Dec 2012 22:00:01  daily           604800
176335     14 Dec 2012 22:00:01  daily           604800
c:\Users\DSCC_Admin\Desktop\DFMDCv2>dfpm policy node get -q mirror
snapshotScheduleName=Sunday at midnight with daily and hourly
c:\Users\DSCC_Admin\Desktop\DFMDCv2>dfpm policy get -q "VMware local backup policy - AMARG"
Name=VMware local backup policy - AMARG
Description=VMware local backup, Customized for AMARG retention
Those retention counts worry me, but I don't remember seeing them as a separate option when configuring these policies.
Counts can only be modified from the CLI; there is no way to modify them in the NMC. The retention behavior will be a combination of the counts and the time specified.
You are using a mirror protection policy so there is no "retention" for the secondary - it will be an exact copy of the primary volume when the last protection was kicked off.
Your primary retention times match the secondary policy's primary node settings, but the counts do not. The primary retention policy will be in control of the primary backups.
For example, based on the settings below you will always keep at least 1 backup of each style (hourly, daily, weekly, monthly) no matter its age. In addition, you will keep every backup that is younger than the time set in the policy.
c:\>dfpm policy get -q "VMware local backup policy - AMARG"
Name=VMware local backup policy - AMARG
Description=VMware local backup, Customized for AMARG retention
hourlyRetentionDuration=172800 (2 days)
dailyRetentionDuration=604800 (7 days)
weeklyRetentionDuration=2419200 (28 days)
monthlyRetentionDuration=7257600 (84 days)
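As a quick sanity check on the annotations above, the durations can be converted from seconds to days. This is only illustrative arithmetic: the values are copied from the policy output, while the dictionary names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86400

# Values copied from the policy output above; the names are illustrative only.
retention_seconds = {
    "hourly": 172800,
    "daily": 604800,
    "weekly": 2419200,
    "monthly": 7257600,
}

# Integer division is exact here because every value is a whole number of days.
retention_days = {
    kind: seconds // SECONDS_PER_DAY
    for kind, seconds in retention_seconds.items()
}

for kind, days in retention_days.items():
    print(f"{kind}: {days} days")
```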
Your dataset currently has 2 hourly backups, 3 daily backups, and 1 weekly primary backup. Based on their age and your primary policy settings this appears to match.
The hourly backups are both under 2 days old. The daily backups are all under the 7 day mark - expect to lose the oldest hourly and daily backups on the next backup. That weekly backup is only 5 days old and therefore should remain for at least 3+ weeks from today and longer if you do not make a new weekly style backup.
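The rule described in this reply (keep every backup younger than its retention duration, and never drop below the retention count for each type) can be modeled roughly as follows. This is only a sketch of the behavior as described here, not how OnCommand implements it: the function name, the tuple layout, and the minimum count of 1 per type are assumptions made for illustration. The durations and sample backups are taken from the output quoted in this thread.

```python
from datetime import datetime, timedelta

# Durations in seconds, copied from the policy output in this thread.
RETENTION_SECONDS = {
    "hourly": 172800,
    "daily": 604800,
    "weekly": 2419200,
    "monthly": 7257600,
}

# Primary backups from the listing in this thread: (id, timestamp, type).
SAMPLE = [
    (181147, datetime(2012, 12, 21, 9, 0, 0), "hourly"),
    (180525, datetime(2012, 12, 20, 13, 0, 0), "hourly"),
    (177872, datetime(2012, 12, 16, 22, 0, 1), "daily"),
    (177168, datetime(2012, 12, 16, 0, 0, 1), "weekly"),
    (177102, datetime(2012, 12, 15, 22, 0, 1), "daily"),
    (176335, datetime(2012, 12, 14, 22, 0, 1), "daily"),
]


def backups_to_keep(backups, now, durations=RETENTION_SECONDS, min_count=1):
    """Return the ids of backups this simplified model would retain.

    min_count is an assumed per-type retention count, not a value
    read from the real policy.
    """
    by_type = {}
    for backup in backups:
        by_type.setdefault(backup[2], []).append(backup)

    keep = set()
    for kind, items in by_type.items():
        # Newest first, so the first min_count entries are always kept.
        items.sort(key=lambda backup: backup[1], reverse=True)
        limit = timedelta(seconds=durations[kind])
        for index, (backup_id, taken_at, _) in enumerate(items):
            if index < min_count or now - taken_at <= limit:
                keep.add(backup_id)
    return keep


if __name__ == "__main__":
    # "now" is an assumed time on the day of the listing.
    print(sorted(backups_to_keep(SAMPLE, datetime(2012, 12, 21, 10, 0, 0))))
```

With a "now" on 21 Dec 2012, this model keeps all six listed backups, which agrees with the assessment that the current listing matches the policy. If backups are disappearing earlier than this model predicts, something other than the duration and count settings is removing them.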
Backup Id  Backup Version        Retention Type  Retention Duration (in seconds)  Node Name
---------  --------------------  --------------  -------------------------------  ------------
181147     21 Dec 2012 09:00:00  hourly          172800                           Primary data
180525     20 Dec 2012 13:00:00  hourly          172800                           Primary data
177872     16 Dec 2012 22:00:01  daily           604800                           Primary data
177168     16 Dec 2012 00:00:01  weekly          2419200                          Primary data
177102     15 Dec 2012 22:00:01  daily           604800                           Primary data
176335     14 Dec 2012 22:00:01  daily           604800                           Primary data
If I understand your response correctly, then the counts are not the reason that I'm not keeping my full 2 days of hourlies, for example: the policy should be retaining 2 days of hourlies, with a minimum of 1 backup retained, but no maximum.
If so, then where else should I look for the cause of my retention problems?
You are correct in the behavior that you should be seeing.
I would need to see data on the backups that are getting removed prematurely to make further comments.
Did you say you have an open support case?
That does not appear to be a case related to this behavior. You would want to open a case for the behavior so that specific data can be collected and analyzed.
Specifically, you will want to provide a DFMDC, new autosupports, and the system IDs of the controllers involved, in addition to the controller audit logs, to show the backups that are being purged and allow investigation of what is deleting them. You might also include "dfpm job detail JOBID" for a few backups.
I understand your concern as to the correctness of the case. However, please understand that while we have been working on this issue via that case, the tech support person has not been updating it to reflect that (I spent much of yesterday on the phone with him). He told me that he would open another case as you suggest, but never did so.
He told me yesterday he would be out of the office today.