OpenStack Discussions
OpenStack Discussions
Hello. We are using cDOT8.3.1P2 and OpenStack Kilo on RHEL7.2.
After we updated the version of Cinder from 2015.1.2 to 2015.1.3, volume creation from image cache became very slow.
We confirmed that it takes 30 - 60 seconds to create volume(but sometimes it is finished in 1 seconds as before).
When we created volumes, cinder-volume log usually showed that post cloning method was failed with message "Discover file retries exhausted."
2016-07-04 17:27:19.729 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" acquired by "_do_clone" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444 2016-07-04 17:27:19.729 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Cloning from cache to destination volume-b2ea88bb-044d-4309-834e-6686cb3954df 2016-07-04 17:27:19.855 3978 DEBUG cinder.volume.drivers.netapp.dataontap.client.client_cmode [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Cloning with params volume sys02, src img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd,dest volume-b2ea88bb-044d-4309-834e-6686cb3954df, vserver test001-svm01 clone_file /usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/dataontap/client/client_cmode.py:376 2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "update_stale" acquired by "_update_stale_vols" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444 2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "update_stale" released by "_update_stale_vols" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456 2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Releasing file lock "/var/lib/cinder/tmp/cinder-img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" after holding it for 0.183s release /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:227 2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" released by "_do_clone" :: held 0.183s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456 2016-07-04 17:27:19.912 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Performing post clone for volume-b2ea88bb-044d-4309-834e-6686cb3954df 2016-07-04 17:28:05.924 3978 WARNING cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Discover file retries exhausted. 2016-07-04 17:28:05.925 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Image cloning unsuccessful for image 9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd. Message: NFS file could not be discovered.
We never found this message before package update.
Are there any bugs in Cinder 2015.1.3?
Best Regards.
Solved! See The Solution
We have opened a bug under https://bugs.launchpad.net/cinder/+bug/1608596
Hello,
Can you try to revert it back, if it fixes the issue I will open a bug for it but in my ENV isn't happening. It could be that I don't have much of workload though.
Thanks,
Bishoy Mushraky
Thanks Bishoy.
We reverted it back and it fixed the issue. Then I wrote the detail what we had confirmed. Please check it.
[Detail]
We found "Discover file retries exhausted." message was raised in "_discover_file_till_timeout" method. We assume that this method is the bottleneck of the issue.
cinder/volume/drivers/netapp/dataontap/nfs_base.py
548 def _post_clone_image(self, volume): 549 """Do operations post image cloning.""" 550 LOG.info(_LI('Performing post clone for %s'), volume['name']) 551 vol_path = self.local_path(volume) 552 if self._discover_file_till_timeout(vol_path): <- This method raises "Discover file retries exhausted" message. 553 self._set_rw_permissions(vol_path) <- Next step 554 self._resize_image_file(vol_path, volume['size']) 555 return True 556 raise exception.InvalidResults( 557 _("NFS file could not be discovered."))
So, We calculated the time between _discover_file_till_timeout method and next step(_set_rw_permissions) both in 2015.1.2 and 2015.1.3 using command like below(need debug and verbose log options).
sudo grep -E "post clone|permission" /var/log/cinder/volume.log --- 2016-07-13 15:12:50.450 10129 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-0855671f-8e1a-4f1f-942f-ab59901bc594 - - - - -] Performing post clone for volume-34576af3-391b-4823-ac03-bb0289acd240 2016-07-13 15:12:50.450 10129 DEBUG cinder.volume.drivers.remotefs [req-0855671f-8e1a-4f1f-942f-ab59901bc594 - - - - -] File path /var/lib/cinder/mnt/6471519ef924c4401a6b1ab15ea30341/volume-34576af3-391b-4823-ac03-bb0289acd240 is being set with permissions: 660 _set_rw_permissions /usr/lib/python2.7/site-packages/cinder/volume/drivers/remotefs.py:301
As a result, We confirmed that the volume creation delay was not happened in cinder 2015.1.2, but it happened in cinder 2015.1.3.
This is the all results of the time between _discover_file method and _set_rw_permissions in 20 volumes.
trials | 2015.1.2 | 2015.1.3 |
try-1 | 0sec | Timeout |
try-2 | 0sec | 30sec |
try-3 | 0sec | 30sec |
try-4 | 0sec | 0sec |
try-5 | 0sec | 30sec |
try-6 | 0sec | 30sec |
try-7 | 0sec | 30sec |
try-8 | 0sec | 30sec |
try-9 | 0sec | 30sec |
try-10 | 0sec | 0sec |
try-11 | 0sec | 0sec |
try-12 | 0sec | 30sec |
try-13 | 0sec | 30sec |
try-14 | 0sec | 30sec |
try-15 | 0sec | 30sec |
try-16 | 0sec | 30sec |
try-17 | 0sec | 30sec |
try-18 | 0sec | 30sec |
try-19 | 0sec | 30sec |
try-20 | 0sec | 30sec |
In this time, We encountered "Discover file retries exhausted." message only once, but creation delay was frequently happened. So, If you can't reproduce message "Discover file retries exhausted.", could you please check the time between _discover_file and _set_rw_permissions method?
Our cinder.conf might help our understanding.
Best Regards,
Iwao Endo
Hi Bishoy,
We're still working on this problem, but we can't find the solution yet.
Could you please open a bug for it? If you need more information about it, please tell me.
Thanks,
Iwao Endo
Hello I'm currently consulting my Colleauges in Engineering and will back to you.
Hello Iwao,
Engineering is currntely looking at it for a permenant fix. I have a workaround for you to be tried out if it will resolve your issue for now.
At the file nfs_base.py please add the last line os.stat(path) as below:
def _discover_file_till_timeout(self, path, timeout=45): """Checks if file size at path is equal to size.""" # Sometimes nfs takes time to discover file # Retrying in case any unexpected situation occurs retry_seconds = timeout sleep_interval = 2 while True: if os.path.exists(path): return True else: if retry_seconds <= 0: LOG.warning(_LW('Discover file retries exhausted.')) return False else: time.sleep(sleep_interval) retry_seconds -= sleep_interval os.stat(path)
Hello iwao
Did the this change solve your issue?
BR,
Bishoy
Hello Bishoy,
Thank you for providing the workaround, but it didn't solve the issue.
We added the os.stat(path) line to the nfs_base.py and tried to create volume. However, We found post clone method was immediately failed like below.
[Tue Jul 26 09:43:13.892 2016] 2016-07-26 09:39:12.101 2719 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-fb15b8d9-f832-4458-a0d8-f75915e2d18f - - - - -] Performing post clone for volume-362c9cb8-1c85-441e-9a94-815167bf3600 [Tue Jul 26 09:43:13.892 2016] 2016-07-26 09:39:14.102 2719 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-fb15b8d9-f832-4458-a0d8-f75915e2d18f - - - - -] Image cloning unsuccessful for image 62387d8b-c35a-4a41-ab8c-77a6a71e222c. Message: [Errno 2] No such file or directory: '/var/lib/cinder/mnt/fec1b5659637844e65bf0bbceb1678c3/volume-362c9cb8-1c85-441e-9a94-815167bf3600'
So, I added try-except block as below, but it also didn't improve volume creation time.
else: time.sleep(sleep_interval) retry_seconds -= sleep_interval + try: + os.stat(path) + except Exception: + pass def _is_cloneable_share(self, image_location): """Finds if the image at location is cloneable."""
Thanks,
Iwao Endo
We have opened a bug under https://bugs.launchpad.net/cinder/+bug/1608596
Hello Bishoy,
Thank you for opening the bug.
We added "nfs_mount_options = lookupcache=pos" to cinder.conf according to comment #2 on #1608596, and it worked.
Volume creation from cache became very fast than before.
We really appreciate your cooperation.
Best Regards,
Iwao Endo