I'm running the latest version of WFA (v18.104.22.168.32) in my environment, and I've come across an issue I thought couldn't happen because I'm using certified commands: WFA does not appear to be aware of changes it has made before the OnCommand cache updates occur. Here is the scenario:
1. I create a new cDOT export policy, export rule, and volume. This flow works without issue and is called ss_cdot_CreateNFSVolumeWithExport.
2. Before WFA and/or OnCommand has had a chance to learn about the changes from step 1 via the scheduled cache updates, I run a workflow that creates a new rule in the policy created in step 1. This fails, and will continue to fail until WFA's cache is updated from OnCommand and it learns about the new policy. This flow is called ss_cdot_CreateExportRule.
All the commands in both flows are certified, so I thought that would avoid this issue. I had originally been using a modified No-Op command in the first create flow (which is why I suspected it at first), but even after removing that single non-certified command the problem remains. The only other thing I can think of is that I created these flows in WFA v2.0 and recently upgraded to v2.1.
I'm either missing something or have encountered a bug in WFA regarding export policies and cache awareness of them, though I'm leaning towards an error I've made somewhere and haven't found yet. I'm attaching both flows in the hope that they will reveal where I've tripped up. Hopefully it's something simple; thanks in advance.
Francois and Yann,
You guys have actually uncovered a limitation in WFA.
I was able to reproduce the issue of the "clone volume workflow" where you remove a volume and clone it again.
I dug deeper, and here is what I found:
1. There are two commands that create reservations here: "Remove volume (if exists)" and "Create clone".
2. An important point here is that the volume that gets deleted and the clone being created have the same name.
3. The workflow runs perfectly fine the first time and the reservations are created. At this point, "Cache Updated" is set to NO for both reservations, which is expected. Now, if you go ahead and run data acquisition on DFM and cache acquisition in WFA, the reservation for "create clone" is cleared ("Cache Updated" is set to YES), but the reservation for "remove volume" is not cleared ("Cache Updated" is still NO). This is what you observed.
4. Why is that happening?
There are two conflicting reservations here. One reservation thinks the "test_clone" volume does not exist (REMOVE VOLUME). The other reservation thinks the "test_clone" volume exists and was newly created (CREATE CLONE).
When cache acquisition happens in WFA, the congruence test for "create clone" checks whether the newly reported data contains the "test_clone" volume. It does, so the reservation is CLEARED. No problem there.
The congruence test for "remove volume" expects the "test_clone" volume NOT to be present in the cache, because it deleted the volume itself. But, surprise, it is present, so WFA concludes that OCUM has not yet reported the deletion, and "Cache Updated" remains NO. It will stay NO until the default 4-hour expiry period passes (because OCUM will never report that the volume does not exist), which is why François was able to come back the next day and execute the workflow successfully.
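To make the conflict concrete, the two congruence checks boil down to roughly the following (a hedged sketch only; the actual scripts inside the certified commands, and the exact cm_storage column names, may differ):

```sql
-- "create clone" congruence (sketch): the reservation is cleared as soon
-- as the newly acquired cache data contains the volume.
SELECT vol.id
FROM cm_storage.volume vol
WHERE vol.name = 'test_clone';
-- row returned => reservation cleared ("Cache Updated" = YES)

-- "remove volume" congruence (sketch): the reservation is cleared only
-- when the volume is ABSENT from the cache. Because "create clone"
-- re-created a volume with the same name, this always returns a row,
-- so the reservation is never cleared until the 4-hour expiry.
SELECT vol.id
FROM cm_storage.volume vol
WHERE vol.name = 'test_clone';
-- row returned => reservation kept ("Cache Updated" stays NO)
```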
5. Coming to the weird filter results:
When you run a filter with "Use reservation" checked, the filter takes the WFA cache, merges it with the reservation data, and returns the result.
Essentially, what is happening here is that the filter takes the "test_clone" volume from the WFA cache and applies any pending reservations (remember, the "remove volume" reservation has not been cleared yet). The net effect is that the volume is removed from the result.
If you uncheck "Use reservation" in the filter, the "test_clone" volume is taken from the WFA cache but the "remove volume" reservation is NOT applied, so you will see the volume in the result.
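Conceptually, a filter with "Use reservation" checked evaluates against a merged view along these lines (the wfa.reservation column names here are my assumptions, not WFA's actual internals):

```sql
-- Cached volumes, minus anything a pending "remove" reservation claims
-- to have deleted. With the stuck "remove volume" reservation in place,
-- 'test_clone' is filtered out even though it exists in the cache.
SELECT vol.name
FROM cm_storage.volume vol
WHERE vol.name NOT IN (
    SELECT r.object_name              -- assumed column name
    FROM wfa.reservation r
    WHERE r.command = 'Remove volume' -- assumed column name
      AND r.cache_updated = 0         -- i.e. "Cache Updated" = NO
);
```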
6. How can we fix this?
This can happen to any object (not just volumes) if a workflow creates conflicting reservations.
To decide on a fix and provide a workaround, I need to understand the use case here:
Why is this workflow being used, exactly?
Are there similar workflows in use?
If not, what other workflows are being run?
How frequently will this workflow need to be run?
It would be helpful if I could get answers to those questions.
Good news! I've created a custom Create Export Policy command that includes the reservations missing from the certified command. You could use it to avoid the problem in a more robust manner; it's attached below.
I tested it by replacing the certified command in the first workflow (CreateNFSVolWithExport), and the second workflow now finds the policy before a polling cycle.
Hope that's useful!
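For anyone curious, the heart of such a reservation script is just a pre-insert into WFA's cache tables, so that later workflow steps can find the policy before the next acquisition. A minimal sketch (parameter names PolicyName, Cluster, and VserverName are illustrative; the attached command is the authoritative version):

```sql
-- Pre-insert the new export policy into the cm_storage cache schema.
INSERT INTO cm_storage.export_policy
SELECT NULL AS id,            -- let WFA assign the id
       PolicyName AS name,    -- command parameter (illustrative)
       vs.id AS vserver_id
FROM cm_storage.vserver vs
JOIN cm_storage.cluster cl
  ON (cl.primary_address = Cluster OR cl.name = Cluster)
WHERE vs.cluster_id = cl.id
  AND vs.name = VserverName;
```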
I'm looking for a way to use reservations in my custom commands, for caching purposes.
I saw this in the xml file:
INSERT INTO cm_storage.export_policy
SELECT NULL as id,
PolicyName as name,
vs.id as vserver_id
FROM cm_storage.vserver vs
JOIN cm_storage.cluster cl
ON (cl.primary_address=Cluster OR cl.name=Cluster)
AND vs.cluster_id = cl.id
AND vs.name = VserverName;
Is that the clue?
My custom command is like Clone Volume, but it clones from the last snapshot with a specific suffix in the name.
I use it for cloning SQL environments snapshotted with SMSQL, with a unique naming convention (-UN).
Reservation is not supported in custom commands.
This feature may be available in a future release.
Michael, you have used a certified command in the workflow
which has a reservation script.
I exported my custom command to a .dar and edited the xml to integrate the <reservationScript> section from the certified command "Clone Volume".
My problem is fixed.
Do you see anything dangerous about using it this way?
Great tip François.
When I did the same thing, it seemed to work, until you run an acquisition in WFA.
I don't know why yet, but looking at the reservations in the WFA web UI, I got "Cache Updated" = YES for an export policy that had not yet been refreshed in OCUM (it was NO before the acquisition). The volume reservation had the correct status of NO (i.e. still waiting for OCUM to report it).
I might have done something wrong; I need to do some research.
Where do you retrieve the <reservationScript> xml portion? As certified commands are not exportable, it was necessary to look directly in the MySQL DB.
For that, I installed a separate MySQL server where I have root access, and restored the full WFA DB there. wfa.command_definition was then accessible.
All the information was in:
SELECT reservation_script FROM wfa.command_definition
WHERE command LIKE '%clone%';
To answer this specific question: reservations are not directly committed to the specific cache tables. In your case, newly created qtrees are not committed directly to the qtree table. Rather, they are stored in the wfa.reservation table and will be committed once acquisition from OCUM confirms the objects' presence. However, from the WFA UI, if you use a filter to find the newly created qtree, it will use this reservation data and make it appear as though it were taken from the qtree table itself.
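If you want to see those pending entries directly, you can query the reservation table on a restored copy of the DB, as described earlier in the thread. A sketch (the exact column names are my assumption; run DESCRIBE wfa.reservation to check them on your copy):

```sql
-- List reservations that have not yet been confirmed by an acquisition,
-- i.e. the ones the web UI shows with "Cache Updated" = NO.
SELECT *
FROM wfa.reservation
WHERE cache_updated = 0;   -- assumed column name
```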
I experience the same behavior, even with the certified commands "Clone Volume" and "Remove Volume".
My simple workflow deletes the cloned volume first, before cloning it again.
In the first round:
I tested for existence with "if volume was not found: disable this command", so the remove volume step was skipped. Good.
The clone was created successfully.
In the second round the cache works fine: "remove volume" is executed, and the clone works fine.
As Yann said, if I force an OCUM acquisition, "Cache Updated" changes to YES. When I then relaunch the workflow, the first step is omitted again, so the delete does not occur and the clone fails.
What am I doing wrong?
This looks a bit strange, because I find that there are reservations and congruence tests in both the Remove Volume and Clone Volume commands.
With the given description I cannot figure out much.
I will be able to help if you can attach both workflows and a backup of your WFA.
I've attached the simple workflow; note that my custom command is inside, but disabled.
This morning, after one night, the situation was back to normal and the workflow ran fine.
root@gdc01093# dfm volume list |grep test
1761 gdc01148:/test_clone Flexible 64_bit No
root@gdc01093# dfm volume list |grep test
1764 gdc01148:/test_clone Flexible 64_bit No
After the acquisition, WFA pulled the old volume definition (1761), because a dfm discover had not been launched.
Could that be a factor?
That is not a one-off; I can reproduce it now just with a datasource acquisition.
It seems to clean the cache for the volume concerned, or to point the reference back to the old cloned volume from the previous round.
Are you using your modified commands, the ones where you have written the reservation/congruence scripts yourself?
It is most likely an issue with the congruence script: it is running late, or it is not cleaning the reservation.
The congruence test clears the reserved entry once the entry is available in the DB.
Hence, when you uncheck "use reservation data", you get the correct result.
This is happening with the certified commands "Clone Volume" and "Remove Volume" as well.
After an acquisition, the volume disappears from the query when the cache is used.
After waiting for a complete cycle (OCUM discovery plus WFA acquisition), the volume comes back again.
Just to be clear: if the "ITS Clone volume on Last Snapshot" command does not have a congruenceTest, you will have a reservation issue.
Now, you said it is happening with certified commands as well? Can you provide a workflow that demonstrates the issue, using only certified commands?