Re: Copy offload unsuccessful (Source host details not found)

dinko_mitic · ‎2015-08-18

Hi,

I'm running RedHat OSP 6 (Juno) and cDOT 8.2.3 and running into problems trying to configure the copy offload tool.

My Glance and Cinder exports are sitting on different FlexVols on the same SVM.

This is the error message I see in the cinder volume.log

19819: DEBUG cinder.volume.drivers.netapp.nfs req-05579758-eb32-469b-a9e8-7cc2da916740 ca2aa482f10f4b6d9c801556d9cb040a 78a01d0cb8254c6b9d48e8fb23c6048f - - - Image location not in the expected format file:///var/lib/glance/images/99006fcf-b02e-4142-ad99-a86181449035_check_get_nfs_path_segs /usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py:552
19819: ERROR cinder.volume.drivers.netapp.nfs req-05579758-eb32-469b-a9e8-7cc2da916740 ca2aa482f10f4b6d9c801556d9cb040a 78a01d0cb8254c6b9d48e8fb23c6048f - - - Copy offload workflow unsuccessful. Source host details not found.
19819: TRACE cinder.volume.drivers.netapp.nfs Traceback (most recent call last):
19819: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py", line 1171, in copy_image_to_volume
19819: TRACE cinder.volume.drivers.netapp.nfs self._try_copyoffload(context, volume, image_service, image_id)
19819: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py", line 1197, in _try_copyoffload
19819: TRACE cinder.volume.drivers.netapp.nfs image_id)
19819: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py", line 1260, in _copy_from_img_service
19819: TRACE cinder.volume.drivers.netapp.nfs raise exception.NotFound(_("Source host details not found."))
19819: TRACE cinder.volume.drivers.netapp.nfs NotFound: Source host details not found.
19819: TRACE cinder.volume.drivers.netapp.nfs

Also, we are using QCOW2 images, I understand that's RAW is preferred, but I thought QCOW2 should still work.

According to the cloning flowchart this is failing when it tries to verify whether Cinder and Glance exports are on the same SVM. Is there a way to further debug this? I'm not sure if my config files are incorrect, if there's a DNS issue as seen from either the OpenStack controller or the NetApp, or something else.

Thank you!

dcain · ‎2015-08-18

Seeing the "file://" vs "nfs://" in your log output for the Image location makes me think that the Glance API call isn't returning the location metadata to Cinder.

See if my instructions for RHEL-OSP6 help:

Copy the copy offload tool binary the /usr/local/bin directory. Name it na_copyoffload_64.
Open the /etc/cinder/cinder.conf file and modify netapp_copyoffload_tool_path from None to /usr/local/bin/na_copyoffload_64.
In the same file, modify glance_api_version from 1 to 2.
Open the /etc/glance/glance-api.conf file, and under the [DEFAULT] stanza add the following contents:

[DEFAULT]
show_multiple_locations=True

In the same file, modify show_image_direct_url from False to True.
In the same file, modify filesystem_store_metadata_file from None to /etc/glance/metadata.json.
Create the /etc/glance/metadata.json file and modify it with the following contents (where nfs_lif_ip1 is your NFS LIF for Glance):

{
    "share_location": "nfs://<<nfs_lif_ip1>>/glance",
    "mount_point": "/var/lib/glance/images",
    "type": "nfs"
}

Add the cinder user to the glance group with the following command:

gpasswd –a cinder glance

Restart the Cinder subsystem to pick up the changes.

systemctl restart openstack-cinder-{api,scheduler,volume}

Restart the Glance subsystem to pick up the changes.

systemctl restart openstack-glance-{api,registry}

Now go back and delete and re-upload your favorite image back into Glance. This is very important as the metadata information should now be properly inserted into the Glance DB in MariaDB, specifically the image_locations row. You should see the metadata from the metadata.json file there after fixing your configuration files and restarting. Look at my environment, entry 10 is where metadata information isn't being passed properly (which results in that fun "source host not found" error) versus line 13 which has the CORRECT metadata information:

| 10 | 3ca9cfbf-1eb2-4ab5-80e3-9fe3745e82e3 | file:///var/lib/glance/images/3ca9cfbf-1eb2-4ab5-80e3-9fe3745e82e3 | 2015-08-18 19:12:58 | 2015-08-18 19:25:08 | 2015-08-18 19:25:08 |       1 | {}                                                                                                       | deleted |

| 13 | 65058e29-cb21-43a5-93c3-b73aa4bb8701 | file:///var/lib/glance/images/65058e29-cb21-43a5-93c3-b73aa4bb8701 | 2015-08-18 19:26:38 | 2015-08-18 19:26:38 | NULL                |       0 | {"mount_point": "/var/lib/glance/images", "type": "nfs", "share_location": "nfs://192.168.67.14/glance"} | active  |

Regarding QCOW vs. RAW, the NetApp Cinder driver automatically converts QCOW to RAW format when images are copied to Cinder volumes. QCOW2 is not Live Migration safe on NFS when the cache=writeback setting is enabled, which is commonly used for performance improvement of QCOW2. If space savings are the desired outcome for the Image Store, raw format files are actually created as sparse files on the NetApp storage system. Deduplication within NetApp FlexVol volumes happens globally rather than only within a particular file, resulting in much better aggregate space efficiency than QCOW2 can provide.

dinko_mitic · ‎2015-08-25

Hi, thank you so much for your response. I'm getting further now, there were definitely a few steps I was missing.

It looks like the na_copyoffload_64 actually gets invoked now, but fails with return code 13.

cinder.service.Service object at 0x22c91d0>> run outlasted interval by 157.81 sec
12576: ERROR cinder.volume.drivers.netapp.nfs [req-c7310859-493d-4595-91e4-49821f35b5b5 a945ababdaa945a3affa39e062c31121 3fa915bbdd8d4c68b1497fcb3c5bc9cd - - -] Copy offload workflow unsuccessful. Unexpected error while running command.
Command: None
Exit code: -
Stdout: u"Unexpected error while running command.\nCommand: /usr/bin/na_copyoffload_64 10.140.231.5 10.140.231.5 /phllnasnet1000v_vol4_qa_glance/2de1c20b-d6b6-49be-8a74-131c2fb30a2d /phllnasnet1000v_vol6_qa_cinder/e1bbc35d-d814-4148-935e-f45d83dd139d\nExit code: 13\nStdout: u'Program exiting with return code 13.\\n'\nStderr: u''"
Stderr: None
12576: TRACE cinder.volume.drivers.netapp.nfs Traceback (most recent call last):
12576: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py", line 1171, in copy_image_to_volume
12576: TRACE cinder.volume.drivers.netapp.nfs self._try_copyoffload(context, volume, image_service, image_id)
12576: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py", line 1197, in _try_copyoffload
12576: TRACE cinder.volume.drivers.netapp.nfs image_id)
12576: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py", line 1282, in _copy_from_img_service
12576: TRACE cinder.volume.drivers.netapp.nfs check_exit_code=0)
12576: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/utils.py", line 142, in execute
12576: TRACE cinder.volume.drivers.netapp.nfs return processutils.execute(*cmd, **kwargs)
12576: TRACE cinder.volume.drivers.netapp.nfs File "/usr/lib/python2.7/site-packages/cinder/openstack/common/processutils.py", line 200, in execute
12576: TRACE cinder.volume.drivers.netapp.nfs cmd=sanitized_cmd)
12576: TRACE cinder.volume.drivers.netapp.nfs ProcessExecutionError: Unexpected error while running command.
12576: TRACE cinder.volume.drivers.netapp.nfs Command: /usr/bin/na_copyoffload_64 10.140.231.5 10.140.231.5 /phllnasnet1000v_vol4_qa_glance/2de1c20b-d6b6-49be-8a74-131c2fb30a2d /phllnasnet1000v_vol6_qa_cinder/e1bbc35d-d814-4148-935e-f45d83dd139d
12576: TRACE cinder.volume.drivers.netapp.nfs Exit code: 13
12576: TRACE cinder.volume.drivers.netapp.nfs Stdout: u'Program exiting with return code 13.\n'
12576: TRACE cinder.volume.drivers.netapp.nfs Stderr: u''

However, when I invoke that command from my OS controller as root, it completes and returns code 0, and I can manually verify that the file is copied to the cinder volume.

# /usr/bin/na_copyoffload_64 10.140.231.5 10.140.231.5 /phllnasnet1000v_vol4_qa_glance/2de1c20b-d6b6-49be-8a74-131c2fb30a2d /phllnasnet1000v_vol6_qa_cinder/e1bbc35d-d814-4148-935e-f45d83dd139d
Program exiting with return code 0.

Any thoughts on what the errod code 13 is?

dinko_mitic · ‎2015-08-25

It looks like code 13 is permissions related, it looks like my Cinder user does have permissions to read the glance images, but does not have permissions to create a new file on the Cinder NFS export.

Permissions for the cinder mount path are

drwxr-xr-x. nobody nobody

And yet when the cinder service creates a new volume, it does get created with 666 permissions.

This is the ontap command we used to create all the exports

export-policy rule create -policyname default -clientmatch x.x.x.x/x -rwrule any -rorule any -protocol nfs -superuser any

Are we doing something wrong with nfs expors for cinder, or am I missing something else?

dcain · ‎2015-08-25

Error 13 is permission related.

Be sure that the FlexVol on the NetApp FAS device is owned by both user and group of 165 (cinder):

fas8040-openstack::> volume show -vserver osp6-svm -volume cinder -fields user,group
vserver volume user group
-------- ------ ------ ------
osp6-svm cinder cinder cinder

If not, issue the following:

fas8040-openstack::> volume modify -vserver osp6-svm -volume cinder -user 165 -group 165

See if that helps.

dinko_mitic · ‎2015-08-25

The cinder flexvol didn't have the right user,group permissions, they were 0. After I modified it "-user 165 -group 165" the manual copy and copyoffload seem to work without issues.

I do have a couple of questions based on this

Did we do something wrong when provisioning the flexvols or installing OpenStack? My glance volumes were correctly using user/group 161, but I don't think we took any manual steps to achieve that. How can I ensure that future cinder flexvols have the correct user/group assignment?
Should there be a log message indicating a successful copyoffload? I'm assuming it's working because there are no errors and I don't see anything in /var/lib/cinder/conversion/ which is where the image would be copied temporarily if the copyoffload didn't work.

dcain · ‎2015-08-25

The Copy Offload tool only helps on the first copy into the destination FlexVol. To really make sure it is working as intended, go back and delete the "img-cache-uuidxyz" in the Cinder FlexVol and try your copy operation again. The tool will be invovked and a log message can be seen in volume.log as shown here:

2015-08-18 16:22:08.256 4919 DEBUG cinder.volume.drivers.netapp.nfs [req-019b37ac-d6d0-4e2c-a27e-a05f802eba38 e2d06a57b1c94e8facd822da5bebe5c6 168007cc962f45e5b6f218c46e0d2857 - - -] Trying copy from image service using copy offload. _copy_from_img_service /usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/nfs.py:1253
2015-08-18 16:22:08.865 4919 DEBUG cinder.openstack.common.processutils [req-019b37ac-d6d0-4e2c-a27e-a05f802eba38 e2d06a57b1c94e8facd822da5bebe5c6 168007cc962f45e5b6f218c46e0d2857 - - -] Running cmd (subprocess): /usr/local/bin/na_copyoffload_64 192.168.67.14 192.168.67.14 /glance/b4ae061e-c293-4edf-b563-7c5bbc7f91eb /cinder/2043a9ee-2011-4ae4-adc9-94b63bb74318 execute /usr/lib/python2.7/site-packages/cinder/openstack/common/processutils.py:158
2015-08-18 16:22:19.600 4919 INFO cinder.volume.drivers.netapp.nfs [req-019b37ac-d6d0-4e2c-a27e-a05f802eba38 e2d06a57b1c94e8facd822da5bebe5c6 168007cc962f45e5b6f218c46e0d2857 - - -] Copied image b4ae061e-c293-4edf-b563-7c5bbc7f91eb to volume 451c8b1d-27e0-4841-adee-5dbcc138c77c using copy offload workflow.

Copy Offload will not run if the image from glance is cached locally on the destination FlexVol, as it is not needed. See the Flowchart here of this process.

You did nothing wrong, but do ensure that the Glance FlexVol is created and owned by user/group 161 upon creation would be my advise, along with the Cinder FlexVol(s) owned by user/group 165. Seeing nobody:nobody is generally not a good thing from a permission perspective.

dinko_mitic · ‎2015-08-27

Doing some testing after I deleted the cached images, and am seeing some failures with larger images.

It looks like my nova-compute times out after 3 minutes, since the image takes longer than that to get created. I will try to tweak these parameters, but this still seems odd

nova.conf:#block_device_allocate_retries=60
nova.conf:#block_device_allocate_retries_interval=3

When the copyoffload process kicks off, I see the file being created on the cinder nfs volume, but it's a fairly slow process, taking more than 3 minutes for a ~40GB file to get copied. The copy should be happening between 2 FlexVols on the same aggregate.

I was under the impression that FlexClone would do this copy much quicker, but looking at the flowchart you referenced, FlexClone is used only when the Glance and Cinder NFS exports are on the same FlexVol. Is that correct? What does the "Use the NetApp Copy Offload tool to create destination image by copying within cluster" actually mean on the filer? Is there any way to track the file copy operation? I'm concerned that it may be taking longer than actually copying to the OpenStack controller back and forth over a 10Gb network.

dcain · ‎2015-08-28

Be sure that you have the vstorage feature and NFS v4.0 enabled on the SVM in play here.

vserver nfs modify -vserver <yournamehere> -vstorage enabled –v4.0 enabled

You do have the FlexClone license installed too, right?

myFAS::> system license show

Serial Number: netapp
Owner: myFAS
Package           Type    Description           Expiration
----------------- ------- --------------------- --------------------
Base              site    Cluster Base License  -
NFS               site    NFS License           -
iSCSI             site    iSCSI License         -
FlexClone         site    FlexClone License     -
...

I just copied a ~2gb image in my environment and it took about 15 seconds from launching the VM (using copy offload) until it was available on the network. I don't have a 40gb image to test right at this moment. Do you get similar results with smaller images? Any log information you can share?

FlexClone is done automatically IF Cinder and Glance share the same FlexVol. If Cinder and Glance are on different FlexVols, the cache is checked first in the destination Cinder mount. If it exists, FlexClone uses that as no Copy Offload would be necessary. If the cache does not exist for the image in Glance, the Copy Offload Tool path is executed if all criteria are satisfied. If not, a regular copy occurs.

Copy offload unsuccessful (Source host details not found)

Copy offload workflow unsuccessful (Return code 13)

Aggregate dedupe VAAI Copy offload

Copy Offload Utility Image cloning unsuccessful

How to get more detail information in output?

There are no matching storage systems found for the selected SCP