OpenStack Discussions

Create volume from image cache became slow in cinder 2015.1.3

iwao
11,667 Views

 

Hello. We are using cDOT8.3.1P2 and OpenStack Kilo on RHEL7.2.

 

After we updated Cinder from 2015.1.2 to 2015.1.3, volume creation from the image cache became very slow.
We confirmed that it now takes 30 - 60 seconds to create a volume (though it sometimes finishes in 1 second, as before).

 

When we created volumes, the cinder-volume log usually showed that the post-clone method failed with the message "Discover file retries exhausted."

2016-07-04 17:27:19.729 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" acquired by "_do_clone" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444
2016-07-04 17:27:19.729 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Cloning from cache to destination volume-b2ea88bb-044d-4309-834e-6686cb3954df
2016-07-04 17:27:19.855 3978 DEBUG cinder.volume.drivers.netapp.dataontap.client.client_cmode [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Cloning with params volume sys02, src img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd,dest volume-b2ea88bb-044d-4309-834e-6686cb3954df, vserver test001-svm01 clone_file /usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/dataontap/client/client_cmode.py:376
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "update_stale" acquired by "_update_stale_vols" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "update_stale" released by "_update_stale_vols" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Releasing file lock "/var/lib/cinder/tmp/cinder-img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" after holding it for 0.183s release /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:227
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" released by "_do_clone" :: held 0.183s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456
2016-07-04 17:27:19.912 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Performing post clone for volume-b2ea88bb-044d-4309-834e-6686cb3954df
2016-07-04 17:28:05.924 3978 WARNING cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Discover file retries exhausted.
2016-07-04 17:28:05.925 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Image cloning unsuccessful for image 9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd. Message: NFS file could not be discovered.

We never saw this message before the package update.

 

Are there any bugs in Cinder 2015.1.3?

 


Best Regards.


Bishoy
11,587 Views

Hello,

 

Can you try reverting it? If that fixes the issue I will open a bug for it, but it isn't happening in my environment. It could be that I don't have much workload, though.

 

Thanks,

Bishoy Mushraky

iwao
11,514 Views

Thanks Bishoy.


We reverted it and that fixed the issue. I have written up the details of what we confirmed below; please take a look.

 

[Detail]
We found "Discover file retries exhausted." message was raised in "_discover_file_till_timeout" method. We assume that this method is the bottleneck of the issue.

cinder/volume/drivers/netapp/dataontap/nfs_base.py
    def _post_clone_image(self, volume):
        """Do operations post image cloning."""
        LOG.info(_LI('Performing post clone for %s'), volume['name'])
        vol_path = self.local_path(volume)
        if self._discover_file_till_timeout(vol_path):   # <- this method logs "Discover file retries exhausted."
            self._set_rw_permissions(vol_path)           # <- next step
            self._resize_image_file(vol_path, volume['size'])
            return True
        raise exception.InvalidResults(
            _("NFS file could not be discovered."))

So we measured the time between the _discover_file_till_timeout call and the next step (_set_rw_permissions), in both 2015.1.2 and 2015.1.3, using a command like the one below (debug and verbose logging must be enabled).

sudo grep -E "post clone|permission" /var/log/cinder/volume.log
---
2016-07-13 15:12:50.450 10129 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-0855671f-8e1a-4f1f-942f-ab59901bc594 - - - - -] Performing post clone for volume-34576af3-391b-4823-ac03-bb0289acd240
2016-07-13 15:12:50.450 10129 DEBUG cinder.volume.drivers.remotefs [req-0855671f-8e1a-4f1f-942f-ab59901bc594 - - - - -] File path /var/lib/cinder/mnt/6471519ef924c4401a6b1ab15ea30341/volume-34576af3-391b-4823-ac03-bb0289acd240 is being set with permissions: 660 _set_rw_permissions /usr/lib/python2.7/site-packages/cinder/volume/drivers/remotefs.py:301
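
For reference, the elapsed time for each request can be computed from those two log lines with a small script along these lines (a rough sketch, not part of Cinder; it assumes the plain oslo.log line format shown above and our log path):

#!/usr/bin/env python
# Rough sketch: print the delay between "Performing post clone" and the
# following _set_rw_permissions line for each request in the volume log.
import re
from datetime import datetime

LOG_FILE = '/var/log/cinder/volume.log'
TS_FORMAT = '%Y-%m-%d %H:%M:%S.%f'
LINE_RE = re.compile(r'(\S+ \S+) \d+ \w+ \S+ \[(req-\S+)')

post_clone = {}  # request id -> timestamp of "Performing post clone"
with open(LOG_FILE) as f:
    for line in f:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), TS_FORMAT)
        req = m.group(2)
        if 'Performing post clone' in line:
            post_clone[req] = ts
        elif 'is being set with permissions' in line and req in post_clone:
            delta = (ts - post_clone.pop(req)).total_seconds()
            print('%s: %.1f sec' % (req, delta))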

As a result, we confirmed that the volume creation delay does not happen with Cinder 2015.1.2, but it does happen with 2015.1.3.

 

Here are the full results for the time between _discover_file_till_timeout and _set_rw_permissions, over 20 volumes.

trial     2015.1.2    2015.1.3
try-1     0 sec       Timeout
try-2     0 sec       30 sec
try-3     0 sec       30 sec
try-4     0 sec       0 sec
try-5     0 sec       30 sec
try-6     0 sec       30 sec
try-7     0 sec       30 sec
try-8     0 sec       30 sec
try-9     0 sec       30 sec
try-10    0 sec       0 sec
try-11    0 sec       0 sec
try-12    0 sec       30 sec
try-13    0 sec       30 sec
try-14    0 sec       30 sec
try-15    0 sec       30 sec
try-16    0 sec       30 sec
try-17    0 sec       30 sec
try-18    0 sec       30 sec
try-19    0 sec       30 sec
try-20    0 sec       30 sec

 

In these trials we hit the "Discover file retries exhausted." message only once, but the creation delay happened frequently. So, if you can't reproduce the "Discover file retries exhausted." message, could you please check the time between _discover_file_till_timeout and _set_rw_permissions instead?

 

Our cinder.conf is below, in case it helps.

[DEFAULT]
host = cinder

# API
osapi_volume_listen = ***
osapi_volume_workers = 4
api_paste_config = /etc/cinder/api-paste.ini
enable_v1_api = True
enable_v2_api = True

# Log
debug = True
verbose = True
log_dir = /var/log/cinder
use_syslog = False
use_stderr = False

# Messaging
rpc_backend = cinder.openstack.common.rpc.impl_kombu
notification_driver = log
control_exchange = openstack
amqp_durable_queues = False
rpc_response_timeout = 600

# Scheduler
scheduler_driver = cinder.scheduler.filter_scheduler.FilterScheduler

# Glance
glance_api_servers = http://***
glance_api_version = 2
glance_num_retries = 0
glance_api_insecure = False
glance_api_ssl_compression = False

# Authentication
auth_strategy = keystone

# Availability Zone
storage_availability_zone = nova
default_availability_zone = nova

# Backend
enabled_backends = test_svm01_1, ... (other backends)

[BRCD_FABRIC_EXAMPLE]

[CISCO_FABRIC_EXAMPLE]

[database]
idle_timeout = 3600
max_retries = 10
retry_interval = 10
min_pool_size = 1
connection = ***

[fc-zone-manager]

[keymgr]

[keystone_authtoken]

[matchmaker_redis]

[matchmaker_ring]

[oslo_messaging_amqp]

[oslo_messaging_qpid]

[oslo_messaging_rabbit]
rabbit_hosts = *** (3 hosts ENV)
rabbit_use_ssl = False
rabbit_userid = ***
rabbit_password = ***
rabbit_virtual_host = /
rabbit_ha_queues = True

[profiler]

[test_svm01_1]
volume_backend_name = netapp-sys01
volume_driver = cinder.volume.drivers.netapp.common.NetAppDriver
netapp_server_hostname = ***
netapp_server_port = 80
netapp_storage_protocol = nfs
netapp_storage_family = ontap_cluster
netapp_login = ***
netapp_password = ***
netapp_vserver = test-svm01
netapp_copyoffload_tool_path = /usr/lib/netapp/na_copyoffload_64
nfs_shares_config = /etc/cinder/nfs.shares1
nas_secure_file_operations = false

 

Best Regards,
Iwao Endo

iwao
11,380 Views

Hi Bishoy,

 

We're still working on this problem, but we can't find the solution yet.
Could you please open a bug for it? If you need more information about it, please tell me.

 

Thanks,
Iwao Endo

Bishoy
11,361 Views

Hello, I'm currently consulting my colleagues in Engineering and will get back to you.

Bishoy
11,351 Views

Hello Iwao,

 

Engineering is currently looking into it for a permanent fix. In the meantime, I have a workaround for you to try, to see whether it resolves your issue for now.

 

In nfs_base.py, please add os.stat(path) as the last line of _discover_file_till_timeout, as shown below:

 

    def _discover_file_till_timeout(self, path, timeout=45):
        """Checks if file size at path is equal to size."""
        # Sometimes nfs takes time to discover file
        # Retrying in case any unexpected situation occurs
        retry_seconds = timeout
        sleep_interval = 2
        while True:
            if os.path.exists(path):
                return True
            else:
                if retry_seconds <= 0:
                    LOG.warning(_LW('Discover file retries exhausted.'))
                    return False
                else:
                    time.sleep(sleep_interval)
                    retry_seconds -= sleep_interval
                    os.stat(path)    # <- the line to add

 

Bishoy
11,298 Views

Hello iwao

 

Did this change solve your issue?

 

BR,

Bishoy

iwao
11,292 Views

Hello Bishoy,

 

Thank you for providing the workaround, but it didn't solve the issue.

 

We added the os.stat(path) line to nfs_base.py and tried creating a volume. However, we found that the post-clone method now failed immediately, as below.

[Tue Jul 26 09:43:13.892 2016] 2016-07-26 09:39:12.101 2719 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-fb15b8d9-f832-4458-a0d8-f75915e2d18f - - - - -] Performing post clone for volume-362c9cb8-1c85-441e-9a94-815167bf3600
[Tue Jul 26 09:43:13.892 2016] 2016-07-26 09:39:14.102 2719 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-fb15b8d9-f832-4458-a0d8-f75915e2d18f - - - - -] Image cloning unsuccessful for image 62387d8b-c35a-4a41-ab8c-77a6a71e222c. Message: [Errno 2] No such file or directory: '/var/lib/cinder/mnt/fec1b5659637844e65bf0bbceb1678c3/volume-362c9cb8-1c85-441e-9a94-815167bf3600'

So I added a try-except block as below, but it also didn't improve the volume creation time.

                 else:
                     time.sleep(sleep_interval)
                     retry_seconds -= sleep_interval
+                    try:
+                        os.stat(path)
+                    except Exception:
+                        pass

     def _is_cloneable_share(self, image_location):
         """Finds if the image at location is cloneable."""

 

Thanks,
Iwao Endo

Bishoy
11,227 Views

iwao
10,850 Views

Hello Bishoy,

 

Thank you for opening the bug.

 

We added "nfs_mount_options = lookupcache=pos" to cinder.conf according to comment #2 on #1608596, and it worked.

 

Volume creation from the cache is now much faster than before.

 

We really appreciate your cooperation.

 

 

Best Regards,

Iwao Endo
