Data Backup and Recovery
I am having an issue with SnapDrive for Unix 4.1.1 when creating a FlexClone volume off a snapshot so that we can back up that volume.
I have two parent volumes, one on each filer head. The vol options are as follows:
napp1> vol options xdb3vol0
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=167690, schedsnapname=ordinal,
fs_size_fixed=off, compression=off, guarantee=volume, svo_enable=off,
svo_checksum=off, svo_allow_rman=off, svo_reject_errors=off,
no_i2p=off, fractional_reserve=0, extent=off, try_first=volume_grow,
read_realloc=off, snapshot_clone_dependency=off
napp1>
A snap was created yesterday:
Filesystem total used avail capacity Mounted on
/vol/xdb3vol0/ 1850GB 1754GB 96GB 95% /vol/xdb3vol0/
/vol/xdb3vol0/.snapshot 462GB 63GB 399GB 14% /vol/xdb3vol0/.snapshot
But when the AIX admin tries to connect to the snapshot and create a flexclone volume we get the following error:
Sun May 2 15:14:06 CDT [napp1: wafl.volume.clone.fractional_rsrv.changed:info]: Fractional reservation for clone 'Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot' was changed to 100 percent because guarantee is set to 'file' or 'none'.
Sun May 2 15:14:11 CDT [napp1: wafl.volume.clone.created:info]: Volume clone Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot of volume xdb3vol0 was created successfully.
Creation of clone volume 'Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot' has completed.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun13 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun12 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun11 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun10 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun15 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun16 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:11 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot/lun14 has been taken offline to prevent map conflicts after a copy or move operation.
Sun May 2 15:14:19 CDT [napp1: wafl.vol.autoSize.done:info]: Automatic increase size of volume 'Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot' by 70464308 kbytes done.
Sun May 2 15:14:31 CDT [napp1: wafl.vol.autoSize.fail:info]: Unable to grow volume 'Snapdrive_xdb3vol0_volume_clone_from_snap_vgmfgdss_snapshot' to recover space: Volume cannot be grown beyond maximum growth limit
If this is a FlexClone that is backed by a snapshot, then shouldn't it utilize the snapshot for any deltas? What changes do I need to make to the parent volume?
What is the point of having snapshots and FlexClone if I have to have 100% of the space reserved for it? If, for some unforeseen reason, more than 20% of the data were to change, I wouldn't mind the snap and FlexClone being auto-deleted as long as the parent volume and its LUNs stay online. It would seem to be an inefficient use of resources if I have to have 100% fractional reserve.
I am still having the issue:
05/07/10 10:20:17 STATUS:INFORM ERRCODE:999 connect_snap: Snap connect started.
connecting vgksdss:
connecting lun napp1:/vol/virtual1sd1/lun1
creating unrestricted volume clone napp1:/vol/Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot ... success
connecting lun napp2:/vol/virtual1sd2/lun2
creating unrestricted volume clone napp2:/vol/Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot ... success
mapping new lun(s) ... done
discovering new lun(s) ... done
Importing vgksdssbc
Successfully connected to snapshot napp1:/vol/virtual1sd1:snap_vgksdss
disk group vgksdssbc containing host volumes
bclvksdss_log
bclvksdss_fs (filesystem: /bc/FS1)
0002-245 Command error: cannot write Flexclone metadata to napp1:/vol/Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot storage system volume.
05/07/10 10:22:52 STATUS:ERROR ERRCODE:32 connect_snap: snapdrive connect failed with return code 18.
from_snap_vgksdss_snapshot' was changed to 100 percent because guarantee is set to 'file' or 'none'.
Fri May 7 10:22:51 CDT [napp2: wafl.volume.clone.created:info]: Volume clone Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot of volume virtual1sd2 was created successfully.
Creation of clone volume 'Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot' has completed.
Fri May 7 10:22:51 CDT [napp2: wafl.vol.full:notice]: file system on volume Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot is full
Fri May 7 10:22:51 CDT [napp2: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot/lun2 has been taken offline to prevent map conflicts after a copy or move operation.
Fri May 7 10:23:04 CDT [napp2: lun.map:info]: LUN /vol/Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot/lun2 was mapped to initiator group xvirt1=4
Fri May 7 10:23:08 CDT [napp2: wafl.vol.full:notice]: file system on volume Snapdrive_virtual1sd2_volume_clone_from_snap_vgksdss_snapshot is full
Hi,
Have you got vol autosize on source volume? Also, what is your snap reserve?
As a non-intrusive test, set snap reserve to zero and try again. Set it back to what it was before after the test.
Eric
Yes. I set vol autosize on source volume, and I have turned it off in a "test" environment. I have tried with snap reserve set to 0 and same result. snap reserve at 0 and fractional reserve at 20 and same result. Not sure what I am missing here. There is no clear guide as to what I should have set for this to work, other than what appears to be trial and error. I don't like trial and error very much as it gives the illusion I don't know what I am doing. Which might be true in this case, but it isn't for lack of ingesting hundreds of postings, various TR readings and endless google searches plus a case open for over a week with IBM on the issue. They seem less knowledgeable than I in this circumstance.
Now, I have been able to get it to work, unreliably and still with errors on the filer side but luckily with no fault to the source volume, by setting the snap reserve to 0, fractional reserve to 20 and throwing a couple hundred more GBs at the volume. This has let me get the required blocks off to tape before we destroy the flexclone volume and delete the snapshot.
When the snapshot has been taken and the flexclone active, I go from close to 700GB free in each volume (1.7TB used volumes) to about 300GB free. Where did the 400GB go? It seems like I am requiring a lot more space to make these operations successful. If that is the case, so be it, we need it to work but it wasn't the behavior we were expecting.
Now, if we had been told that to make FlexClones work we would need a 220GB volume for a 100GB LUN (like some articles suggest), then I would have asked long ago what the value is in the NetApp over, say, a similar Tier 2 array that has snap reserve pools. Instead we are told that FlexClones are thin provisioned and require no additional space other than the snapshot that keeps track of all changes between the current state of the AFS and the time the FlexClone was generated off the backing snapshot. Now I agree that things can get hairy pretty quickly if we were then going to the FlexClone and making adjustments to it.
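For reference, the "220GB volume for a 100GB LUN" figure can be reproduced with simple arithmetic. The following is only a rough sketch of the commonly quoted sizing rule (LUN size, plus overwrite reserve from fractional reserve, plus room for snapshot deltas). The function name and the 20GB snapshot delta are assumptions made for illustration, not values taken from the articles mentioned above.

```python
def volume_size_gb(lun_gb, fractional_reserve_pct, snapshot_delta_gb):
    """Rough volume size for one space-reserved LUN with snapshots.

    Assumed model: LUN + overwrite reserve (fractional reserve applied
    to the LUN size) + expected snapshot delta.
    """
    overwrite_reserve_gb = lun_gb * fractional_reserve_pct / 100
    return lun_gb + overwrite_reserve_gb + snapshot_delta_gb


# 100GB LUN, 100% fractional reserve, assumed 20GB of changed blocks
print(volume_size_gb(100, 100, 20))  # 220.0

# Same LUN with fractional reserve at 0: only the delta is added
print(volume_size_gb(100, 0, 20))    # 120.0
```

The gap between the two results is the overwrite reserve, which is the space this thread is arguing about.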
What an interesting example of bad interaction between different features. I guess NetApp will eventually have to provide explicit autosize control during FlexClone creation, just like the one for volume guarantees.
Now my question: does it actually make the clone connect fail, or is it just a cosmetic issue? Testing it (without SD involved) I can reproduce this behaviour, but the clone is created, I can online the LUN, and the clone does not actually consume any space in the aggregate (even though it seems to):
Mon May 10 18:09:46 MSD [wafl.vol.autoSize.fail:info]: Unable to grow volume 'test1_clone1' to recover space: Volume cannot be grown beyond maximum growth limit
simsim*> lun show
/vol/test1/lun1 70m (73400320) (r/w, online)
/vol/test1_clone1/lun1 70m (73400320) (r/w, online)
simsim*> df -r
Filesystem kbytes used avail reserved Mounted on
/vol/test1/ 102400 72028 30372 0 /vol/test1/
/vol/test1/.snapshot 0 40 0 0 /vol/test1/.snapshot
/vol/test1_clone1/ 122880 122880 0 (71832) /vol/test1_clone1/
/vol/test1_clone1/.snapshot 0 52 0 0 /vol/test1_clone1/.snapshot
simsim*> aggr show_space aggr0
Aggregate 'aggr0'
Total space WAFL reserve Snap reserve Usable space BSR NVLOG A-SIS
1024000KB 102400KB 46080KB 875520KB 0KB 0KB
Space allocated to volumes in the aggregate
Volume Allocated Used Guarantee
test1 103156KB 72492KB volume
test1_clone1 1236KB 408KB none
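As a sanity check on the aggr show_space numbers above, the columns can be tied together with a few lines of arithmetic. This is just a sketch using the figures pasted in this post; the variable and function names are made up for illustration.

```python
def usable_space_kb(total_kb, wafl_reserve_kb, snap_reserve_kb):
    """Usable aggregate space after WAFL reserve and aggregate snap reserve."""
    return total_kb - wafl_reserve_kb - snap_reserve_kb


# Figures from the aggr show_space output for aggr0
usable = usable_space_kb(1024000, 102400, 46080)
print(usable)  # 875520, matching the "Usable space" column

# Space allocated to the parent and the guarantee=none clone
allocated_kb = {"test1": 103156, "test1_clone1": 1236}
print(usable - sum(allocated_kb.values()))  # 771128 KB left unallocated
```

The clone accounts for only 1236KB of allocation in the aggregate, even though df -r reports the clone volume itself as full.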
Also notice that this issue happens only if you create a non-space-reserved clone. If your clone is space reserved, fractional_reserve is not forced to 100. You do need extra space in the aggregate though:
test1 103156KB 72736KB volume
test1_clone1 31168KB 160KB volume
The space you will need is roughly the free space in the parent volume, which is somewhat logical.
Oh, and BTW answering your question:
When the snapshot has been taken and the flexclone active, I go from close to 700GB free in each volume (1.7TB used volumes) to about 300GB free. Where did the 400GB go?
They have been reserved by virtue of fractional_reserve being set to 100. In my example above you can see that it tries to reserve the 70M contained in the base snapshot. It fails to do so, but because the clone's space guarantee is "none", NetApp ignores this error and lets me continue.
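To put a number on that, here is a minimal sketch of the reservation, assuming the overwrite reserve is simply the fractional reserve percentage applied to the size of the space-reserved LUN. That is a simplification, and the function name is hypothetical.

```python
def overwrite_reserve_kb(lun_bytes, fractional_reserve_pct):
    """Approximate overwrite reserve in KB for one space-reserved LUN."""
    return lun_bytes * fractional_reserve_pct // (100 * 1024)


# The 70m LUN from the simulator example (73400320 bytes)
print(overwrite_reserve_kb(73400320, 100))  # 71680

# With fractional reserve at 0 nothing is set aside for overwrites
print(overwrite_reserve_kb(73400320, 0))    # 0
```

The 71680KB result is close to the (71832) shown in the reserved column of df -r for test1_clone1. The simple model does not account for the remaining 152KB.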
We have a source volume with following options:
napp1> vol options virtual1sd1
nosnap=off, nosnapdir=off, minra=off, no_atime_update=off, nvfail=off,
ignore_inconsistent=off, snapmirrored=off, create_ucode=on,
convert_ucode=off, maxdirsize=167690, schedsnapname=ordinal,
fs_size_fixed=off, compression=off, guarantee=volume, svo_enable=off,
svo_checksum=off, svo_allow_rman=off, svo_reject_errors=off,
no_i2p=off, fractional_reserve=5, extent=off, try_first=volume_grow,
read_realloc=off, snapshot_clone_dependency=off
napp1>
napp1> df -Vh virtual1sd1
Filesystem total used avail capacity Mounted on
/vol/virtual1sd1/ 15GB 10GB 4583MB 70% /vol/virtual1sd1/
/vol/virtual1sd1/.snapshot 0KB 720KB 0KB ---% /vol/virtual1sd1/.snapshot
napp1>
napp1> snap list virtual1sd1
Volume virtual1sd1
working...
%/used %/total date name
---------- ---------- ------------ --------
0% ( 0%) 0% ( 0%) May 10 13:13 snap_vgksdss (busy,vclone)
napp1>
After the snap is created and connected, the following occurs:
Mon May 10 13:15:47 CDT [napp1: wafl.volume.clone.fractional_rsrv.changed:info]: Fractional reservation for clone 'Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot' was changed to 100 percent because guarantee is set to 'file' or 'none'.
Mon May 10 13:15:51 CDT [napp1: wafl.volume.clone.created:info]: Volume clone Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot of volume virtual1sd1 was created successfully.
Creation of clone volume 'Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot' has completed.
Mon May 10 13:15:51 CDT [napp1: wafl.vol.full:notice]: file system on volume Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot is full
Mon May 10 13:15:51 CDT [napp1: lun.newLocation.offline:warning]: LUN /vol/Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot/lun1 has been taken offline to prevent map conflicts after a copy or move operation.
Mon May 10 13:16:19 CDT [napp1: lun.map:info]: LUN /vol/Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot/lun1 was mapped to initiator group xvirt1=3
Mon May 10 13:16:23 CDT [napp1: wafl.vol.full:notice]: file system on volume Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot is full
Now, if I had vol autosize on the source volume, I would be spammed with vol autosize failure. If I had snap autodelete on the parent I would see that the containing snapshot would be deleted after flexclone volume creation. I have tried fractional_reserve at 0 and at 20 with same result. Initially my volume was 10.1GB and the containing lun 10GB. It is a space reserved LUN because I want to guarantee writes and not have it go offline (this happened earlier in our migration and I was not too happy about it).
So now on my test volume I have 50% free space in the volume and I am having errors from SnapDrive reporting a metadata write error. SnapDrive then strands the clone and can no longer interact with it. It requires Storage Admin to go in and unmap the devices and destroy the flexclone volume and delete the snapshots. The storage commands from Snap Drive also incorrectly report the status of the flexclone volume, making it appear to be split from the backing snapshot, even though I see differently at the filer level.
Please, show
df -A for aggregate containing virtual1sd1
aggr show_space for aggregate containing virtual1sd1
vol options Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot
df -r Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot
df -h Snapdrive_virtual1sd1_volume_clone_from_snap_vgksdss_snapshot
I bumped into the same issue today, on an AIX install too, with 5% FR and 20% SR. The flexclone would fail with the same metadata write error.
In this case the flexvols were full, 100% utilized as the customer had created LUNs of the same size of the flexvol and had intended to use the 20%SR for the snapshots. But reducing the SR to 0% to free space didn't help as the same SDU operation would fail during the metadata write operation.
Quick fix was to modify snapdrive.conf to create a lunclone instead of a flexclone.
For this customer it wouldn't make any difference as they plan to refresh the luns (SMO clone) every night.
We may have to create a burt for that SDU behavior; SDU shouldn't be touching the %FR values unless we tell it to.
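The post above does not name the snapdrive.conf setting that was changed. As far as I recall from the SnapDrive for UNIX documentation, the relevant variable is san-clone-method, with lunclone as one of its values, but treat both the variable name and the quoted name="value" line format as assumptions and check them against your own snapdrive.conf. A small sketch of making that edit on the file contents, with a hypothetical helper name:

```python
import re


def set_conf_option(conf_text, name, value):
    """Return conf_text with name="value" set.

    Any existing line for the option (active or commented out) is dropped
    and the new setting is appended at the end.
    """
    pattern = re.compile(r"^\s*#?\s*" + re.escape(name) + r"\s*=")
    kept = [line for line in conf_text.splitlines() if not pattern.match(line)]
    kept.append(f'{name}="{value}"')
    return "\n".join(kept) + "\n"


# Sample contents are made up for illustration, not copied from a real file
sample = '#san-clone-method="unrestricted"\ndefault-transport="fcp"\n'
print(set_conf_option(sample, "san-clone-method", "lunclone"))
```

Keep a copy of the original file before changing it, and test a snap connect afterwards to confirm SDU is creating a LUN clone rather than a volume clone.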
I don't think the issue here is directly related to SnapDrive and its interaction with FlexClones. This seems to be a fundamental change in the way FlexClones and their reserves (volume, fractional, snap) behave with Ontap >= 7.2.6.1. Here is my example of the problem. Note the 'volume' guarantee and the fractional reserve set to zero on the source volume.
egna10a> df -A egna10a_aggr01
Aggregate kbytes used avail capacity
egna10a_aggr01 10154150324 6355373044 3798777280 63%
egna10a_aggr01/.snapshot 0 0 0 ---%
egna10a> df -g egna10a_vol001
Filesystem total used avail capacity Mounted on
/vol/egna10a_vol001/ 5500GB 4640GB 859GB 84% /vol/egna10a_vol001/
snap reserve 0GB 19GB 0GB ---% /vol/egna10a_vol001/..
This does NOT happen with Ontap 7.2.5.1. The cloned volume takes on ALL the source volume's attributes, including the fractional reserve.
So it seems now the only real way to thin provision with Netapp is to set the volume guarantee on the source to 'none'. No way!
Hi Greg,
You said: "So it seems now the only real way to thin provision with Netapp is to set the volume guarantee on the source to 'none'. No way!"
No way? Why not? We do it here, and there are no problems with this as long as it's done properly and managed a bit. We've claimed back 20TB in our non-prod environment doing this. That's $$$$$$, mate.
Eric
Ontap 7.2.6.1 effectively broke this functionality
Guess what? It is not a bug, it is a feature
https://now.netapp.com/cgi-bin/bol?Type=Detail&Display=280845 :
Vol clone incorrectly allows fractional reserve to be set to 0 and guarantee to be set to 'none' or 'file'
Yeah, a feature that was added without any thought. See the following bug:
http://now.netapp.com/NOW/cgi-bin/bol?Type=Detail&Display=348466
Still flawed though. How the heck can you reset the FR back to zero if there isn't enough space to first create the clones?
What a mess. I am actively working this with the Netapp Dudes. Hopefully they will get it straight down the road.
Hi Greg,
I've read the last posting and I've got a better understanding of your issue now. I wasn't aware of this, so thanks for flagging it.
I had a look at the bug you provided; it's not a serious bug, per NetApp at least:
Bug Severity 5 - Suggestion
Which more or less deems it to be an RFE (request for enhancement), and the fix looks to be an upgrade of Ontap. Upgrades are never painless, I know.
Do keep us up to date on this issue.
Eric
Hi Eric,
Yes, you are correct. Netapp really doesn't see it as a bug. The intent of the feature change was to provide a level of protection for over-writes on the clones themselves.
I really don't understand Netapp's change here. If there are customers that want to protect their clones, why not just create the clones with space guarantee set to volume and fractional reserve set to 100%?
The point is, to some the clones are <just> as important as the parent volumes, which is fine. Buy 2x the disks and protect them just as they would the source volumes.
Instead, they took away functionality that saved money, and differentiated Netapp from the other vendors.
greg
It says fixed in 7.3.3. Do you know what actually has been fixed and did you have a chance to verify it? I do not remember seeing anything explicit about it in RN or documentation; browsing 7.3.3 manuals now, the statement that fractional reserve cannot be changed for file or none guaranteed volumes did disappear.
As for the original problem ... well, it was a bug, because even the 7.2.5.1 manual quite clearly stated that FR is fixed to 100% unless volume guarantee is none.
Hi,
Initially you said..
<Guess what? It is not a bug, it is a feature >
Now you say..
<As for the original problem ... well, it was a bug, because even the 7.2.5.1 manual quite clearly stated that FR is fixed to 100% unless volume guarantee is none.>
Not sure what you are trying to convey with regard to 7.2.5.1? It works as expected.
I have not tested 7.3.3 yet. It just came out of the oven. Sticking with 7.3.2P4 for a few more weeks.
greg
We will be updating to 7.3.3 in August barring any showstoppers. I'll inform if the behavior has changed. We are living with the current "arrangement".