A short while ago we upgraded our NetApps from ONTAP 9.2 to 9.3. The upgrade was done using NetApp's upgrade instructions and was successful, except...
About a week and a bit later users could no longer save files. I managed to fix the symptom and give users back the ability to save files, but the underlying problem remains.
CIFS Auditing Simple Overview
First, create a CIFS/NFS share where the audit logs will be saved. Then create an auditing policy that references that share and specifies which actions you would like to audit.
Auditing is then enabled on the NetApp. This creates a "hidden" staging volume and an Audit Log Consolidation job.
Finally, set the permissions on the files/folders in a CIFS share that need to be audited.
Now when you perform specific actions on a file (actions that you have listed in the audit policy), data is written to the staging volume. The consolidation job then runs and converts this data into a log file.
This had been working very well. We chose to output XML, which eventually ends up in Elastic and is viewed in Kibana. It worked so well that we started rolling out the permissions across the environment.
Then the upgrade to 9.3
We decided to upgrade ONTAP to 9.3 as there are some nice new features that increase storage capacity efficiency.
We followed the required NetApp upgrade process and it was successful: there were no outages and all services were as they should be. A perfect upgrade.
Then about 10 days later I started receiving emails from OpenNMS about traps it was receiving from the NetApp appliances, quite a deluge of stuff I hadn't seen before. A staging volume was 95% full? Not the kind of emails you want to see on a Friday afternoon! Some investigation showed that a system volume was near full and CIFS access would be affected if it wasn't resolved.
The Staging Volume
One staging volume is created per aggregate; the names are "MDV_aud_" followed by the aggregate UUID. They can only be seen on the command line: the System Manager and OCUM web GUIs will not show them, and OCUM will not generate volume full (near full, etc.) alarms for them.
You can only see issues with them in syslog and in traps sent directly from the NetApp heads. You will get no other warning until you have CIFS issues, probably because staging volumes simply shouldn't fill up.
Further investigation showed that the Log Consolidation Job, which creates the actual log files and effectively manages the data on the staging volumes, had somehow been deleted by the upgrade process.
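Because the staging volumes only show up on the command line, it can help to map a volume name back to its aggregate. The sketch below is a minimal example, not anything NetApp provides. It assumes the suffix after "MDV_aud_" is the aggregate UUID with the dashes removed (32 hex characters), which is what the sample volume name in this post looks like; the function name is hypothetical.

```python
import re

# Assumption: the suffix is the aggregate UUID with dashes removed
# (32 lowercase hex characters), as in the sample name used in this post.
STAGING_RE = re.compile(r"^MDV_aud_([0-9a-f]{32})$")


def aggregate_uuid_from_staging_volume(volume_name):
    """Return the dashed aggregate UUID for a staging volume name, or None."""
    match = STAGING_RE.match(volume_name)
    if match is None:
        return None  # not a staging volume name
    h = match.group(1)
    # Re-insert the dashes in the usual 8-4-4-4-12 UUID layout.
    return "-".join((h[0:8], h[8:12], h[12:16], h[16:20], h[20:32]))


if __name__ == "__main__":
    print(aggregate_uuid_from_staging_volume(
        "MDV_aud_76a5ebd8a0a41189181cf40336ea04f4"))
```

Any name that does not match the pattern returns None, so the same helper can be used to pick staging volumes out of a longer volume list.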
Check your staging volumes
To check if your staging volumes are filling up, just run "df -h MDV*" on your NetApp cluster as an account with cluster admin privileges, e.g.
toaster::> df -h MDV*
Filesystem                                               total    used   avail  capacity  Mounted on  Vserver
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/          1945MB  1004MB     0MB       52%  ---         toaster
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/.snapshot  102MB    50MB   102MB       49%  ---         toaster
If your capacity is above 80% used, check whether your consolidation job is still running (as below) and call your support if it's missing.
toaster::> job show -name Cons*
There are no entries matching your query.
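Since OCUM will not alarm on these volumes, the df check above is a good candidate for a small script. The following is a rough sketch only: it assumes you have already captured the text output of "df -h MDV*" by whatever means you use (SSH, a scheduled job, etc.), and that the columns appear in the order shown in this post. The function name and the way the output is obtained are my own assumptions.

```python
# Sample output copied from this post; in practice this text would come
# from running "df -h MDV*" on the cluster.
SAMPLE = """\
toaster::> df -h MDV*
Filesystem total used avail capacity Mounted on Vserver
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/ 1945MB 1004MB 0MB 52% --- toaster
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/.snapshot 102MB 50MB 102MB 49% --- toaster
"""


def staging_volumes_over(df_output, threshold=80):
    """Return (volume, percent) pairs for staging volumes above threshold."""
    hits = []
    for line in df_output.splitlines():
        fields = line.split()
        # Assumed column order: path, total, used, avail, capacity, ...
        if len(fields) < 5 or not fields[0].startswith("/vol/MDV_aud_"):
            continue  # prompt, header, or unrelated line
        path = fields[0]
        if path.endswith("/.snapshot"):
            continue  # ignore the snapshot reserve line
        capacity = fields[4]
        if not (capacity.endswith("%") and capacity[:-1].isdigit()):
            continue  # capacity column not in the expected form
        percent = int(capacity[:-1])
        if percent > threshold:
            volume = path.strip("/").split("/")[-1]
            hits.append((volume, percent))
    return hits


if __name__ == "__main__":
    for volume, percent in staging_volumes_over(SAMPLE):
        print(f"WARNING: {volume} is {percent}% full")
```

With the sample output from this post nothing is reported, because 52% is below the 80% threshold. Lowering the threshold, or feeding in output from a filling volume, produces a warning line per volume.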
In an emergency, there are two workarounds to keep service running: the first is to increase the volume size and the second is to turn off audit guarantee. If you are using auditing for compliance then you only have one option: increase the volume size.
Both options involve diagnostic mode, which should only be used in an emergency; go back to admin mode ASAP.
1) Increase staging volume size
When increasing the volume size, remember the volume lives on the cluster "vserver".
toaster::> set diag
toaster::*> volume size -vserver toaster -volume MDV_aud_76a5ebd8a0a41189181cf40336ea04f4 3g
toaster::*> set admin
2) Disable audit guarantee (Will break compliance)
The below shows the process of checking the audit guarantee of vserver svm01, setting audit guarantee to false, checking audit guarantee again, and finally leaving diag privilege before doing anything else.
toaster::> set diag
toaster::*> vserver audit show -fields audit-guarantee
vserver audit-guarantee
-------------- ---------------
svm01 true
toaster::*> vserver audit modify -vserver svm01 -destination /audit_log -audit-guarantee false
toaster::*> vserver audit show -fields audit-guarantee
vserver audit-guarantee
-------------- ---------------
svm01 false
toaster::*> set admin
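If you do disable audit guarantee as a stopgap, it is easy to forget which vservers were changed. As a hedged sketch, the snippet below parses the text output of "vserver audit show -fields audit-guarantee" (captured however you normally collect CLI output) and lists the vservers where the guarantee is currently false. The function name is hypothetical and the two-column layout is assumed from the output shown in this post.

```python
# Sample output copied from this post, after the modify command was run.
SAMPLE_AFTER = """\
vserver audit-guarantee
-------------- ---------------
svm01 false
"""


def svms_without_audit_guarantee(show_output):
    """Return the vservers whose audit-guarantee column reads 'false'."""
    disabled = []
    for line in show_output.splitlines():
        fields = line.split()
        # Only data rows have exactly two fields ending in true/false;
        # the header and the dashed separator are skipped by this check.
        if len(fields) == 2 and fields[1] in ("true", "false"):
            if fields[1] == "false":
                disabled.append(fields[0])
    return disabled


if __name__ == "__main__":
    for svm in svms_without_audit_guarantee(SAMPLE_AFTER):
        print(f"audit guarantee is disabled on {svm}")
```

Running something like this periodically gives a reminder of which vservers still need the guarantee turned back on once the underlying problem is fixed.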