Hello. We are using cDOT8.3.1P2 and OpenStack Kilo on RHEL7.2.
After we updated the version of Cinder from 2015.1.2 to 2015.1.3, volume creation from image cache became very slow. We confirmed that it takes 30 - 60 seconds to create volume(but sometimes it is finished in 1 seconds as before).
When we created volumes, cinder-volume log usually showed that post cloning method was failed with message "Discover file retries exhausted."
2016-07-04 17:27:19.729 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" acquired by "_do_clone" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444
2016-07-04 17:27:19.729 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Cloning from cache to destination volume-b2ea88bb-044d-4309-834e-6686cb3954df
2016-07-04 17:27:19.855 3978 DEBUG cinder.volume.drivers.netapp.dataontap.client.client_cmode [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Cloning with params volume sys02, src img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd,dest volume-b2ea88bb-044d-4309-834e-6686cb3954df, vserver test001-svm01 clone_file /usr/lib/python2.7/site-packages/cinder/volume/drivers/netapp/dataontap/client/client_cmode.py:376
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "update_stale" acquired by "_update_stale_vols" :: waited 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:444
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "update_stale" released by "_update_stale_vols" :: held 0.000s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Releasing file lock "/var/lib/cinder/tmp/cinder-img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" after holding it for 0.183s release /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:227
2016-07-04 17:27:19.911 3978 DEBUG oslo_concurrency.lockutils [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Lock "img-cache-9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd" released by "_do_clone" :: held 0.183s inner /usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py:456
2016-07-04 17:27:19.912 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Performing post clone for volume-b2ea88bb-044d-4309-834e-6686cb3954df
2016-07-04 17:28:05.924 3978 WARNING cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Discover file retries exhausted.
2016-07-04 17:28:05.925 3978 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-66fc87d4-f8ab-4773-86a8-a0e428379949 - - - - -] Image cloning unsuccessful for image 9a5b1eec-0ad6-4ef0-9706-2e98eb259bfd. Message: NFS file could not be discovered.
We never found this message before package update.
Can you try to revert it back, if it fixes the issue I will open a bug for it but in my ENV isn't happening. It could be that I don't have much of workload though.
Thanks,
Bishoy Mushraky
Re: Create volume from image cache became slow in cinder 2015.1.3
We reverted it back and it fixed the issue. Then I wrote the detail what we had confirmed. Please check it.
[Detail] We found "Discover file retries exhausted." message was raised in "_discover_file_till_timeout" method. We assume that this method is the bottleneck of the issue.
cinder/volume/drivers/netapp/dataontap/nfs_base.py 548 def _post_clone_image(self, volume):
549 """Do operations post image cloning."""
550 LOG.info(_LI('Performing post clone for %s'), volume['name'])
551 vol_path = self.local_path(volume)
552 if self._discover_file_till_timeout(vol_path): <- This method raises "Discover file retries exhausted" message.
553 self._set_rw_permissions(vol_path) <- Next step
554 self._resize_image_file(vol_path, volume['size'])
555 return True
556 raise exception.InvalidResults(
557 _("NFS file could not be discovered."))
So, We calculated the time between _discover_file_till_timeout method and next step(_set_rw_permissions) both in 2015.1.2 and 2015.1.3 using command like below(need debug and verbose log options).
sudo grep -E "post clone|permission" /var/log/cinder/volume.log
---
2016-07-13 15:12:50.450 10129 INFO cinder.volume.drivers.netapp.dataontap.nfs_base [req-0855671f-8e1a-4f1f-942f-ab59901bc594 - - - - -] Performing post clone for volume-34576af3-391b-4823-ac03-bb0289acd240
2016-07-13 15:12:50.450 10129 DEBUG cinder.volume.drivers.remotefs [req-0855671f-8e1a-4f1f-942f-ab59901bc594 - - - - -] File path /var/lib/cinder/mnt/6471519ef924c4401a6b1ab15ea30341/volume-34576af3-391b-4823-ac03-bb0289acd240 is being set with permissions: 660 _set_rw_permissions /usr/lib/python2.7/site-packages/cinder/volume/drivers/remotefs.py:301
As a result, We confirmed that the volume creation delay was not happened in cinder 2015.1.2, but it happened in cinder 2015.1.3.
This is the all results of the time between _discover_file method and _set_rw_permissions in 20 volumes.
trials
2015.1.2
2015.1.3
try-1
0sec
Timeout
try-2
0sec
30sec
try-3
0sec
30sec
try-4
0sec
0sec
try-5
0sec
30sec
try-6
0sec
30sec
try-7
0sec
30sec
try-8
0sec
30sec
try-9
0sec
30sec
try-10
0sec
0sec
try-11
0sec
0sec
try-12
0sec
30sec
try-13
0sec
30sec
try-14
0sec
30sec
try-15
0sec
30sec
try-16
0sec
30sec
try-17
0sec
30sec
try-18
0sec
30sec
try-19
0sec
30sec
try-20
0sec
30sec
In this time, We encountered "Discover file retries exhausted." message only once, but creation delay was frequently happened. So, If you can't reproduce message "Discover file retries exhausted.", could you please check the time between _discover_file and _set_rw_permissions method?
We're still working on this problem, but we can't find the solution yet. Could you please open a bug for it? If you need more information about it, please tell me.
Thanks, Iwao Endo
Re: Create volume from image cache became slow in cinder 2015.1.3
Engineering is currntely looking at it for a permenant fix. I have a workaround for you to be tried out if it will resolve your issue for now.
At the file nfs_base.py please add the last line os.stat(path) as below:
def _discover_file_till_timeout(self, path, timeout=45):
"""Checks if file size at path is equal to size."""
# Sometimes nfs takes time to discover file
# Retrying in case any unexpected situation occurs
retry_seconds = timeout
sleep_interval = 2
while True:
if os.path.exists(path):
return True
else:
if retry_seconds <= 0:
LOG.warning(_LW('Discover file retries exhausted.'))
return False
else:
time.sleep(sleep_interval)
retry_seconds -= sleep_interval
os.stat(path)
Re: Create volume from image cache became slow in cinder 2015.1.3