2018-04-12 08:29 AM
A short while ago we upgraded our NetApps from ONTAP 9.2 to 9.3. The upgrade was done using NetApp's upgrade instructions and was successful, except.....
About a week and a bit later users could no longer save files. I managed to fix the symptom and give users back the ability to save files, but the underlying problem remains.
First, create a CIFS/NFS share where the audit logs will be saved. Then create an auditing policy which specifies what should be logged; the configuration should reference the CIFS/NFS share you created and list the actions you would like to audit.
Auditing is then enabled on the NetApp. This causes the creation of a "hidden" staging volume and an Audit Log Consolidation job.
Finally, set the permissions on the files/folders in a CIFS share which need to be audited.
Now when you perform specific actions on a file (actions that you have listed in the audit policy), data is written to the staging volume. The consolidation job then runs and converts this data into a log file.
This has been working very well ever since. We chose to output XML, which eventually ends up in Elastic and is viewed in Kibana. It worked so well that we started a process of rolling out the permissions across the environment.
We decided to upgrade ONTAP to 9.3 as there are some nice new features, including improved storage capacity efficiency.
We followed the required NetApp upgrade process and it was successful: there were no outages and all services were as they should be. A perfect upgrade.
Then about 10 days later I started receiving emails from OpenNMS about traps it was receiving from the NetApp appliances, quite a deluge of stuff I hadn't seen before. A staging volume was 95% full? Not the kind of emails you want to see on a Friday afternoon! Some investigation showed that a system volume was near full and CIFS access would be affected if it wasn't resolved.
One staging volume is created per aggregate; the names are "MDV_aud_" followed by the aggregate UUID. They can only be seen on the command line. The System Manager and OCUM web GUIs will not show them, and OCUM will not generate volume full (near full, etc.) alarms for them.
You can only see issues with them in syslog and in traps sent directly from the NetApp heads. You will get no other warning until you have CIFS issues, probably because staging volumes simply shouldn't fill up.
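If you collect volume names from the CLI into a script, a small filter can pick out the staging volumes. This is only a sketch: the pattern (the "MDV_aud_" prefix followed by 32 hex characters) is inferred from the single volume name shown in this post, and `is_staging_volume` is a hypothetical helper, not anything NetApp provides.

```python
import re

# Assumption: the suffix is the aggregate UUID written as 32 hex characters,
# as in the one example from this post. Other clusters may differ.
STAGING_NAME = re.compile(r"^MDV_aud_[0-9a-fA-F]{32}$")


def is_staging_volume(name):
    """Return True if the volume name looks like an audit staging volume."""
    return bool(STAGING_NAME.match(name))


if __name__ == "__main__":
    for name in ["MDV_aud_76a5ebd8a0a41189181cf40336ea04f4", "vol_home"]:
        print(name, is_staging_volume(name))
```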
Further investigation showed that the Log Consolidation Job that creates the actual log files and effectively manages the data on the staging volumes had been deleted by the upgrade process somehow.
To check whether your staging volumes are filling up, run "df -h MDV*" on your NetApp cluster as an account with cluster admin privileges. For example:
toaster::> df -h MDV*
Filesystem                                                total    used   avail  capacity  Mounted on  Vserver
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/            1945MB  1004MB    0MB       52%  ---         toaster
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/.snapshot    102MB    50MB  102MB       49%  ---         toaster
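Since OCUM will not alarm on these volumes, one option is to capture the "df -h MDV*" output yourself (over SSH, for instance) and check it in a script. The sketch below only parses text in the layout shown above. How you fetch the output is left out, and the function names and the 80% threshold are assumptions made for illustration.

```python
import re

# Sample text in the layout of the df output from this post.
SAMPLE = """\
Filesystem total used avail capacity Mounted on Vserver
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/ 1945MB 1004MB 0MB 52% --- toaster
/vol/MDV_aud_76a5ebd8a0a41189181cf40336ea04f4/.snapshot 102MB 50MB 102MB 49% --- toaster
"""

# Path, then total/used/avail columns, then the capacity percentage.
ROW = re.compile(r"^(/vol/MDV_aud_\S+?)/?\s+\S+\s+\S+\s+\S+\s+(\d+)%")


def staging_usage(df_text):
    """Map each staging volume path to its percent used, skipping .snapshot rows."""
    usage = {}
    for line in df_text.splitlines():
        match = ROW.match(line.strip())
        if not match:
            continue  # header or unrelated line
        path, percent = match.group(1), int(match.group(2))
        if path.endswith(".snapshot"):
            continue
        usage[path] = percent
    return usage


def nearly_full(df_text, threshold=80):
    """Return staging volumes at or above the threshold (80% is an arbitrary choice)."""
    return sorted(p for p, pct in staging_usage(df_text).items() if pct >= threshold)


if __name__ == "__main__":
    print(staging_usage(SAMPLE))
    print(nearly_full(SAMPLE))
```

Run from cron or your monitoring system, something like this would give a warning well before CIFS access is affected.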
toaster::> job show -name Cons*
There are no entries matching your query.
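The same kind of check can be applied to the "job show -name Cons*" output. This sketch only looks for the "no entries" message quoted above, because that is the only output this thread shows. `consolidation_job_missing` is a hypothetical helper, and it makes no attempt to parse what a healthy job listing looks like.

```python
NO_ENTRIES = "There are no entries matching your query."


def consolidation_job_missing(job_show_text):
    """Return True if the captured job show output reports no matching jobs."""
    return NO_ENTRIES in job_show_text


if __name__ == "__main__":
    captured = "toaster::> job show -name Cons*\nThere are no entries matching your query.\n"
    print(consolidation_job_missing(captured))
```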
toaster::> set diag
toaster::*> volume size -vserver toaster -volume MDV_aud_76a5ebd8a0a41189181cf40336ea04f4 3g
toaster::*> set admin
toaster::> set diag
toaster::*> vserver audit show -fields audit-guarantee
vserver        audit-guarantee
-------------- ---------------
svm01          true
toaster::*> vserver audit modify -vserver svm01 -destination /audit_log -audit-guarantee false
toaster::*> vserver audit show -fields audit-guarantee
vserver        audit-guarantee
-------------- ---------------
svm01          false
toaster::*> set admin
2018-04-12 08:33 AM
There is now a fix for this, but because of the nature of the fix it would be irresponsible to publish it. Please contact NetApp support.
Bug Ref: 1150270