VMware Solutions Discussions

VMware SRM Reprotect Fails with "Peer array ID provided in the SRM input is incorrect"

CJSCOLTRINITY

I have the following setup:

Site-A

VMware vCenter 5.1 U1

VMware SRM 5.1.1

NetApp SRA 2.0.1P2

FAS3140C - Data ONTAP 8.1.2P4

Site-B

VMware vCenter 5.1 U1

VMware SRM 5.1.1

NetApp SRA 2.0.1P2

FAS3140C - Data ONTAP 8.1.2P4

I have created a basic Protection Group at Site-A containing a single VM with a single VMDK hard disk on an NFS volume.

The NFS volume is SnapMirrored to Site-B.
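
For reference, this is a standard 7-Mode volume SnapMirror relationship, defined in /etc/snapmirror.conf on the destination filer; a minimal entry looks something like the following (the filer names and schedule here are illustrative only, not my actual configuration):

    # source:volume               destination:volume           args  minute hour dom dow
    siteA-filer:NFS_VMware_Test   siteB-filer:NFS_VMware_Test  -     0-59/15 * * *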

I can perform a planned migration to Site-B, reprotect, and then perform another planned migration back to Site-A. But when I then attempt to reprotect again, so that I am ready for another recovery to Site-B, the reprotect fails on the first step, Configure Storage to Reverse Direction, with: "Error - Failed to reverse replication for failed over devices. SRA command 'prepareReverseReplication' failed. Peer array ID provided in the SRM input is incorrect. Check the SRM logs to verify correct peer array ID."

I cannot see anything in the filer logs or in the SRM logs to indicate what the issue is.

This happens for every Protection Group I create, so it is not isolated to just this one volume.  I have also tried with iSCSI VMFS volumes and get exactly the same results.

If I create a Protection Group at Site-B, I can recover to Site-A but cannot reprotect it to fail it back to Site-B.

Initially I thought the issue was that I could do a failback but couldn't perform a second reprotect because the SnapMirrors were left in the wrong state, but I can now see that the issue is that I cannot perform a reprotect from Site-A to Site-B at all.

I have completely uninstalled SRM at both locations, removed the SRM database at both locations and started again, but I still get the same issue.

I've actually got IBM N series N6040 controllers and am using the IBM-branded Data ONTAP and SRA.  I have a call open with VMware and IBM but am not getting very far.

Has anyone seen this issue before and got a solution?

3 REPLIES

CJSCOLTRINITY

I can now see where the error is occurring. When the SRA runs the reverseReplication.pl script with the command prepareReverseReplication, it is not detecting the broken-off SnapMirror; I am seeing the following error in the SRM logs:

validating if path NFS_VMware_Test is valid mirrored device for given peerArrayId

curr state = invalid, prev state = ,source name:DRNSERIES02:NFS_VMware_Test, destination:NSERIES02:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:NSERIES02

Skipping /vol/NFS_VMware_Test as peerArrayId DRNSERIES02  is not valid

If I run a snapmirror status on the controller at the site I have just failed over to (i.e. Site-A), I see the SnapMirror is broken off, and the source and destination filers are listed as the filer host names rather than an IP address, FQDN or connection name.
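
For reference, the broken-off relationship is visible on the controller itself; snapmirror status shows something along these lines (an illustrative reconstruction of 7-Mode output rather than an exact capture from my system):

    NSERIES02> snapmirror status
    Snapmirror is on.
    Source                       Destination                State       Lag       Status
    DRNSERIES02:NFS_VMware_Test  NSERIES02:NFS_VMware_Test  Broken-off  00:05:32  Idle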

When I do a reprotect after a failover to Site-B, I see the following equivalent messages in the SRM logs:

validating if path NFS_VMware_Test is valid mirrored device for given peerArrayId

curr state = , prev state = broken-off,source name:NSERIES02:NFS_VMware_Test, destination:DRNSERIES02:NFS_VMware_Test, path=NFS_VMware_Test, arrayID:DRNSERIES02

radek_kubka

Does this thread give any promising clues?

https://communities.netapp.com/message/93675#93675

Regards,

Radek

CJSCOLTRINITY

I have now identified what is causing this issue and can work around it until there is a fix from NetApp available.

The issue was that the filer at the Protected Site had the same name as the filer at the Recovery Site, just with a prefix added: in my case the Recovery Site filer was named NSERIES01 and the Protected Site filer was DRNSERIES01.  Remember that I had already performed a fail-over and fail-back, so the Protected Site was my original Recovery Site; that is why the filer at the Protected Site for this Protection Group is DRNSERIES01 while the Recovery Site has NSERIES01.

When the Reprotect task is run, the first step calls the SRA with the command prepareReverseReplication, which in turn calls reverseReplication.pl to check that the SnapMirror is broken off.  The script gets the status of all of the SnapMirrors from the filer at the Recovery Site (NSERIES01 in this case) and then goes through each of them looking for a match of local-filer-name:volume-name in the source of the SnapMirror; for my test group it was attempting to match NSERIES01:NFS_VMware_Test.  At this point the source of the SnapMirror is DRNSERIES01:NFS_VMware_Test, which is correct, but because the script uses an unanchored pattern-matching test it matches NSERIES01:NFS_VMware_Test to DRNSERIES01:NFS_VMware_Test, since the former is contained within the latter.  It then checks whether the destination of the SnapMirror matches the peerArrayID (DRNSERIES01 in this case), which it does not, as the destination is, correctly, NSERIES01, and it therefore reports that the peerArrayID is incorrect.  If there is no match on local-filer-name:volume-name in the source of the SnapMirror, the script goes on to check the destination of the SnapMirror; when it finds a match there it checks whether the peerArrayID matches the source of the SnapMirror, and if it does, it then checks that the status of the SnapMirror is broken-off.
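
To illustrate the behaviour, here is a minimal Perl sketch of the comparison (my own reconstruction, not the actual SRA code; the variable names are mine):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Second reprotect: the script looks for the local filer:volume name
    # in the source of the SnapMirror relationship.
    my $local  = "NSERIES01:NFS_VMware_Test";      # local-filer-name:volume-name
    my $source = "DRNSERIES01:NFS_VMware_Test";    # actual SnapMirror source

    # Unanchored pattern match: a false positive, because $local is
    # contained within $source.
    print "unanchored: match (false positive)\n" if $source =~ /\Q$local\E/;

    # Exact comparison: no match, which is the correct result here.
    print "exact: match\n" if $source eq $local;

    # First reprotect (opposite direction): DRNSERIES01:NFS_VMware_Test is
    # not contained in NSERIES01:NFS_VMware_Test, so no false positive, and
    # the script falls through to the destination check as intended.
    my $local_dr  = "DRNSERIES01:NFS_VMware_Test";
    my $source_dr = "NSERIES01:NFS_VMware_Test";
    print "reverse direction: no false positive\n"
        unless $source_dr =~ /\Q$local_dr\E/;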

I never hit the issue on the first reprotect because DRNSERIES01:NFS_VMware_Test is not contained within the source of the SnapMirror (NSERIES01:NFS_VMware_Test), so the script goes on to the next test of checking for DRNSERIES01:NFS_VMware_Test in the destination of the SnapMirror, which it finds; it then checks DRNSERIES01 against the destination, which also matches, and finally confirms that the SnapMirror relationship is broken-off.

I had changed the volume name on DRNSERIES01 a while ago because I thought the issue might have been due to the volume names being the same, but I had changed it by putting a suffix of _Repl on the end, and therefore the script was still matching NSERIES01:NFS_VMware_Test to DRNSERIES01:NFS_VMware_Test_Repl.

I have now configured my test group as follows:

Protected Site

Filer Name = NSERIES01

Volume Name = NFS_VMware_Test_HMR

Recovery Site

Filer Name = DRNSERIES01

Volume Name = NFS_VMware_Test_EWC

I can now recover to the Recovery Site, Reprotect, Fail-Back, Reprotect and Fail-Over again, and can continue performing recoveries and reprotects over and over, as often as you can re-record on a Scotch VHS tape!

If the filer at Site-B had a suffix on its name instead of a prefix (e.g. if it were named NSERIES01DR), or had a completely different name, then I would never have hit this bug in the SRA.  I will be waiting for NetApp to fix the SRA.  In the meantime I will be renaming all of my volumes at the recovery site to avoid this issue.
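
If you want to check whether a given pair of names is exposed to this bug, a throwaway Perl snippet like this will do it (my own, purely illustrative; substitute your own filer:volume pairs):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Each pair is [protected-site, recovery-site] as filer:volume strings.
    my @pairs = (
        [ "NSERIES01:NFS_VMware_Test_HMR", "DRNSERIES01:NFS_VMware_Test_EWC" ],
    );

    for my $pair (@pairs) {
        my ( $prot, $reco ) = @$pair;
        # Unsafe if either name is a substring of the other, since that is
        # exactly what the SRA's unanchored match trips over.
        if ( index( $prot, $reco ) >= 0 or index( $reco, $prot ) >= 0 ) {
            print "UNSAFE: '$prot' and '$reco' overlap as substrings\n";
        }
        else {
            print "OK: '$prot' / '$reco'\n";
        }
    }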
